Category Archives: Certificates

CA Misissuance: A Tale of Two Incident Responses

Certificate Authorities on the web are tasked with validating that the websites we visit are truly associated with the domains that serve the content that makes up that website. As a result, we gain confidence that we know which domains we are communicating with and can trust that no one else can view our interactions with them.

CAs are trusted because user agents—typically web browsers—chose to delegate the “Trust On First Use” (TOFU) problem to these authorities. A user agent acts on behalf of the user, aiming to serve and protect the user’s interests while navigating the internet. In this context, ensuring secure communication with the correct site is managed through this delegation of domain control verification to CAs.

Many often forget that a CA’s role does not stop at the time of certificate issuance; they also manage revocations when keys are compromised or replaced, the certificates are issued in error, or when they are not compliant with the industry standards the CA has agreed to meet as a publicly trusted CA—a responsibility pivotal to maintaining the trust billions place in them.

To illustrate this further, not only is this their responsibility, but they also have obligations under the audit regimes they operate under. Depending on the scenario, they may need to revoke the certificates within 24 hours to 5 days.  When you are potentially talking about tens of thousands or millions of certificates, this requires both good engineering and preparation.

“By failing to prepare, you are preparing to fail.” — Benjamin Franklin

In this post, I want to delve into how two different CAs approached issues in handling two misissuances involving failures to meet the requirements for domain control verification. The contrast between these two incidents is striking and highlights the importance of preparation, good engineering, and strategic decision-making in crisis management.

Case Studies

Let’s Encrypt: A Proactive Approach

Let’s Encrypt demonstrated exemplary crisis management during their 2020 domain control verification-related incident. Through compensating controls, they internally discovered a bug in their domain control verification logic. With the right data, tooling, and processes in place, they were able to act decisively—filing the initial incident quickly, analyzing the scope and impact of the incident, and revoking over 1.7 million out of 3 million impacted certificates within hours. They were able to do this because the service and organization was built from day one to make it possible. One of the most telling elements of their response was their ability to quickly determine which certificates had already been replaced and which were still in use. This information guided their communication strategy and helped identify which certificates were safe to revoke immediately. Furthermore, within the 24-hour revocation requirement, they filed a subsequent delayed revocation incident

This captured why the remaining certificates were not immediately revoked and when they would all expire, helping browsers and the community understand not only that they had done their research but also how their handling of the issue could impact the web and what the exposure of the delayed revocation represented.

DigiCert: A Reactive Approach

In contrast, DigiCert’s response to their domain control-related misissuance highlights several areas for improvement. The issue had persisted for an extended period and was identified externally, rather than through internal engineering and compliance processes designed to catch such errors.

DigiCert also struggled with the logistics of mass revocations. Faced with tens of thousands of certificates, they had difficulty determining which could be revoked quickly, which were actually impacted, and as a result initially opted to start a plan to revoke all certificates within a 24-hours. Though they started this clock when they figured out the scope of the incident, it took them a while to do so, suggesting a lack of preparation for incidents. They also appear to have done this all without a clear strategy to mitigate the customer impact and through a sequence of events ultimately rolled out a 5-day delay on top of the initial plan. This all led to the unusual step of a subscriber filing a Temporary Restraining Order (TRO) against the revocation, which set a troubling precedent and complicated the situation further.

A significant challenge was that many DigiCert customers do not utilize automated certificate lifecycle management extensively, which is essential during such incidents. This issue reflects back on DigiCert’s own processes, as manual lifecycle management is the default rather than the exception.

This incident raises questions about DigiCert’s preparedness, including whether they maintain a comprehensive list of all valid certificates, track replacements, and monitor customer use of lifecycle management. It also prompts a reflection on their internal review processes and test automation for compliance with CA/Browser Forum Requirements.

“Learn all you can from the mistakes of others. You won’t have time to make them all yourself.” – Alfred Sheinwold

It also makes me wonder about the type of design review process that was in place, the description in the incident sounds like a fix was applied at the user interface level which no serious engineer would do. Beyond that, it’s also unclear what kind of test automation is incorporated into their engineering processes to ensure compliance with associated requirements. For example, it would be quite reasonable for the process to mandate unit tests for all normative level requirements from the CA/Browser Forum Requirements and associated IETF documents. This would ensure that initial and subsequent releases meet requirements, which is particularly crucial for an organization with many engineers who may not all be experts in every requirement.

Engineering Readiness and Strategic Decision-Making

These case studies highlight critical lessons in engineering readiness and crisis management. Systems designed for scale and availability must be built with proactive planning, automation, and testing in mind. The reality is that systems designed for both scale and availability are also operated with engineering readiness as a core tenant. If you read the O’Reilly Site Reliability Engineering book, you will notice the same undertow of planning, automating, testing, and preparing, that is because the only way to have a reliable system or process is to engineer it to be that way.

Beyond that, if you’ve ever been on call, you understand how crucial it is that both mundane and unexpected issues be prevented and automated so you are able to respond effectively or even better — never experience entire classes of problems as you learn how to operate a CA. This is why, following Let’s Encrypt’s 2020 incident and their second mass revocation in 2021—for certificates that were one second longer than their policy allowed —they began exploring ways to ensure this never happens again. This is what led us to  Automated Renewal and Installation (ARI) which aims to make it possible for these unplanned certificate renewals boring. This proactive approach aligns with the principle:

“An error doesn’t become a mistake until you refuse to correct it.” – Orlando A. Battista

The browsers have long stressed the importance of blameless postmortems and a culture of self-improvement they do this because without that people within an organization do not feel empowered to do the right thing. At the same time how an organization responds to an incident is very telling about its culture, transparency, operational excellence, focus on compliance, and engineering practices and as such serves as a useful indicator of an organization’s competence and suitability to be part of the threat model of the internet. 

At the last CA/Browser Forum Face to Face meeting the Chrome team talked a bit about this.

Lessons Learned

These incidents highlight several key lessons for CAs, including the importance of automated systems, transparent communication, and continuous learning.

1. Preparation is Key: Automated systems, and tooling to support and predefined response strategies are crucial. For effective incident management, CAs should develop and test incident response plans regularly to ensure rapid action during a crisis.

2. Transparency Builds Trust: During crises, maintaining open and regular communication is essential. CAs should ensure that their communication strategies are clear and consistent to build and maintain trust with stakeholders and the community.

3. Learn from Others: Assign teams and individuals to conduct regular reviews of both historical and current incidents. Have them present these findings to the organization and rotate this responsibility across different disciplines to ensure knowledge is shared.

Conclusion

Acknowledging that DigiCert displayed a level of transparency during their incident, it’s clear that the event still exposed significant gaps in their compliance and engineering practices. Transparency, while crucial, is only one component of effective crisis management. The DigiCert case illustrates that even when stakeholders are kept informed, underlying deficiencies in operational procedures and compliance frameworks can compromise both the efficacy of the response and the overall security posture.

“The decisions you make reflect the priorities you have.”

Some recommendations for CAs looking to learn from this incident include:

  • DigiCert’s incident highlighted the need for more robust compliance mechanisms. Regularly updating compliance protocols and ensuring they are in strict alignment with industry standards is essential. This includes periodic reviews and updates to incorporate new regulatory requirements and best practices.
  • The incident revealed that while DigiCert was transparent, there was a lack of sufficient engineering controls to prevent or mitigate the impact of such issues. Implementing rigorous engineering best practices such as continuous integration, automated testing, and detailed, proactive monitoring of systems can help prevent similar incidents.
  • While DigiCert was forthcoming about the issue, their response strategy could benefit from a more structured approach that includes predefined escalation paths, detailed role assignments, and rapid response teams specialized in different aspects of the crisis management process.
  • Beyond immediate crisis management, fostering a culture prioritizing continuous improvement and learning from past mistakes is vital. This involves regular training, simulations of potential issues, and a blame-free environment that encourages reporting and swift resolution of potential threats.

In conclusion, effective crisis management involves not only responding to incidents but also implementing robust preventive measures. Regular updates to designs, implementations, tooling, improved engineering controls, and a culture of continuous improvement are essential for minimizing the likelihood and impact of future incidents.

Exploring Requirements for Timelines Certificate Problem Reports and Revocations

Today, DigiCert’s mass revocation is in the news, so I thought it would be worthwhile to explore the rules for a CA when it comes to revoking a subscriber certificate and the choices and constraints a CA faces in circumstances like this.

First, let’s examine the types of issues that might lead a CA to revoke a certificate before its natural expiration. Broadly speaking, there are two categories of these issues: 

Externally identified Issues, these start from what is called a Certificate Problem Report (CPR), they may originate from: Subscribers, or Relying parties, application software suppliers, and other third parties

Internally identified issues, these may be identified by: Pre-issuance checks, Post-issuance checks, or Internal periodic reviews

Problem reports from external entities almost always require some minimal level of investigation. An exception might be cases where the reporter has access to the private key and requests a certificate be revoked, this could come from the subscriber, or someone who has gained access to the private key. Even in these cases you often need to do multiple communication round trips because requests often do not contain sufficient information to authenticate the request from the outset.

There are other cases too; for example, sometimes a problem report claims unauthorized issuance of a certificate, but it turns out the reporter had authorized it, perhaps by signing up for a CDN or SaaS offering that uses that CA. In these cases you might need to contact the person who requested the certificate, or the service provider for more information or as the reporter to prove control of the domain to verify your authorization to request the revocation — again introducing a few round trips. 

However, most problem reports are spam or requests to take down a website. In this second case, if a CA has a policy about revoking domains that serve such content then it also needs some level of investigation. On that check out this post on the CA’s role in phishing mitigation on why CAs should stay out of this class of issue altogether. 

These are just a few examples of certificate problem reports a CA might see but there are more, one of the most actionable is a claim that there was a violation of the requirements that a CA must meet, such as including incorrect information in a certificate or using a certificate profile with a wrong extension, or using a domain control verification process that does not meet requirements.

The workflow for these requests is straightforward, but it still requires some investigation. The questions that need to be answered before taking action include things like: When was the certificate issued? What rules were in force at that time? What data was used when validating the entitlement of the requester? What version of the software implemented the validation at the time of issuance? Is the reporter’s interpretation of the requirement technically correct?

This is why the Baseline Requirements (BRs) allow CAs some time for an investigation:

CA SHOULD revoke a certificate within 24 hours and MUST revoke a Certificate within 5 days

NOTE: Here is a conversation on the last changes to this text, searching the forums is always a good way to validate what the intent was at the time of authorship.

Let’s look at an example here, in 2020 Let’s Encrypt had a misissuance due to a bug in CAA handling. This issue was found internally through what we would call mitigating controls, a process that exists to double-check you are doing the right thing.

That’s not a lot of time, especially if such a report comes in on New Year’s Eve. CAs maintain on-call rotations to investigate issues, but it’s easy to imagine that from time to time, an investigation might take longer than 24 hours to fully complete. Despite this, the CAs shouldn’t have an infinite amount of time for such investigations, which is why the text limits the time to revocation to a maximum of 5 days.

Let’s explore what this might look like in the case of a real incident. On February 29, 2020, Let’s Encrypt through an internal review discovered an issue in their domain control verification logic and kicked off an investigation, and set off to do a revocation of over 3 million certificates on that same day they discovered the issue they announced their plan to revoke the affected certificates. Revocations began on March 4, 2020, and by the March 5, 2020 deadline, they had revoked 1,706,505 certificates. Network scans revealed that a large portion of the remaining 1,003,596 certificates were still in active use.

Due to concerns over the potential impact revocation would have on these active sites, they filed an incident for delayed revocation just before the deadline. These remaining certificates either expired or were revoked by May 29, 2020, after which the ticket remained open for community discussion and further analysis.

Let’s Encrypt did many things right here, including implementing compensating controls to detect issues internally, preparing to analyze the scope of impact immediately, and revoking the certificates they could without causing disruption as soon as possible. When it became clear that they would not meet the deadline requirements, they filed a bug to document their non-compliance with the obligations before the deadline lapsed. Moreover, they communicated clearly and regularly throughout the entire process until the issue was resolved.

But what about the missed revocation deadline!?! The reality is that delayed revocations happen, It’s not ideal, but they do happen. What is important is how a CA deals with them, and how credibly. Let’s Encrypt executed like a well-oiled machine here; no one complained. (They did this all without getting distrusted, for more on how CAs usually get distrusted see.)

There are other cases where delayed revocations were done as well, for example in 2019 Apple, Google, and GoDaddy were all hit by a bug in EJBCA, all of which had some delayed revocations, and while there may have been some challenges in each of those cases, what you could tell is that like Let’s Encrypt they were being proactive, transparent, and working towards addressing the issue responsibly and when you looked at their broader history there was not a sign of systemic failures.

OK so if you are in the CA industry right now you should be saying, but Ryan, What about the Certificate Problem Reports response requirements?! This Digicert issue was reported externally, those requirements are also relevant if so you are right, these requirements say:

The CA SHALL maintain a continuous 24×7 ability to accept and respond to revocation requests and Certificate Problem Reports.

And …

The period from receipt of the Certificate Problem Report or revocation-related notice to published revocation MUST NOT exceed the time frame set forth in Section 4.9.1.1. The date selected by the CA SHOULD consider the following criteria:

– The nature of the alleged problem (scope, context, severity, magnitude, risk of harm);

– The consequences of revocation (direct and collateral impacts to Subscribers and Relying Parties);

– The number of Certificate Problem Reports received about a particular Certificate or Subscriber;

– The entity making the complaint (for example, a complaint from a law enforcement official that a Web site is engaged in illegal activities should carry more weight than a complaint from a consumer alleging that they didn’t receive the goods they ordered); and

– Relevant legislation.

While In Section 4.9.1.1 it says

With the exception of Short-lived Subscriber Certificates, the CA SHALL revoke a Certificate within 24 hours and use the corresponding CRLReason (see Section 7.2.2) if one or more of the following occurs:

If you’re paying attention, you might be thinking to yourself, “Wait a minute, doesn’t that mean, since the request came in from an external reporter, that the certificate should be revoked within 24 hours?”

Well, yes, it does! This totally makes sense in some cases, for example, if a notice comes in of a key compromise that has all the required detail to confirm it, but as we discussed above, requests received externally often require some level of verification. This begs the question is 24 hours enough time to handle that or not? But one thing is for sure arguing that the rules are wrong or that the rules should be changed during an incident response is a recipe for disaster.

So what would I have done if I were DigiCert? I would have stopped issuance as soon as the issue was identified. I would have then set engineers to work on resolving the issue in the right way. In parallel, I would have prepared a preliminary incident report, noting that I had stopped the issuance. I would have also noted any confusion I may have had about the language in the requirements in that incident report, I would have then committed to start revoking following the requirements, and before the 24-hour clock expired, I would have started replacing and revoking the certificates that could be revoked safely. If it was clear that despite active, good faith, best efforts, the revocation was not going to be completed in time, then, I would have filed a delayed revocation incident before that clock expired—essentially following the same playbook Let’s Encrypt used back in 2020.

After successfully navigating the incident, I would have worked within the CA/Browser Forum to discuss this language and make sure it was it is clearer why the difference exists or change it to align with the other text.

We did not really tackle the topic of ‘Internally Identified Issues’? Though I do mentioned them above but didn’t really explain them in detail, so let’s see if I can capture them without rambling too much.

As I mentioned earlier, there are several classes. In the Let’s Encrypt case, they had some compensating controls around design reviews that identified an issue. When it comes to cases like this, it is pretty straightforward; by the time you have completed the design review, you are sure there is an issue at play, and the clock starts ticking.

Regarding pre-certificate linting, it’s a failure in your linting management if an issue that could or should have been caught by your linting gets issued; this is more akin to a design defect than a CPR-like incident and as such it would certainly trigger the 24-hour clock.

Then there is post-issuance linting, which is used either as a spot check during data gathering for an audit, as a compensating control, for example, or when you are testing new rules for the linter or refining your linter policy configuration. These are by their very nature a lot more like a CPR in that, because you don’t have a fine-tuned policy for the linter, they too need some level of investigation.

To be clear, none of this appears related to the DigiCert incident, but they have not published a detailed incident report yet, which will include an end-to-end timeline and that may change some of this analysis but I wanted to try to cover the topic more holistically in this post before declaring victory tonight. 

Regardless of the above, the key takeaway is that if you are a CA or a customer of a CA, you need to be prepared to respond to an incident. One of the most useful actions you can take to achieve this is to adopt ACME for certificate lifecycle management. In particular, look for ACME implementations and CAs that support ARI, an extension of ACME. This extension enables a CA to signal to its enrollment clients that their certificate needs to be replaced sooner than expected, allowing them to roll certificates without requiring on-call incident response or causing downtime. The key takeaway is if you are a CA or a customer of a CA you need to be ready to respond to an incident. 

If you are interested in this topic, I recommend checking out the Sectigo “Root Causes” podcast which has a great conversation on Chrome’s recent presentation at the CA/Browser Forum on Incident Reporting. 

This presentation can be summarized as : 

– Incident reporting is an opportunity to demonstrate continuous improvement and to make the ecosystem more resilient and secure.

– We rely on the public incident reporting process to demonstrate that continued trust is justified.

– We think incident reporting should be boring (i.e. routine).

I hope this was helpful. Let me know if you think I missed something or if my analysis is wrong in some way. When processing prose as code, there are almost always slightly different takes, and I wouldn’t be surprised to find out others feel differently than me on this issue.

Thanks to all of you who read this and provided feedback — it takes a village.

Understanding Patterns in WebPKI CA Issues

There’s a saying, “where there’s smoke, there’s fire.” This adage holds especially true in the context of WebPKI Certificate Authorities (CAs). Patterns of issues are one of the key tools that root programs use to understand what’s happening inside organizations. While audits are essential, they are often insufficient. Historical cases like Wirecard and Enron illustrate how audits can provide a partial and sometimes incorrect picture. Just as in most interactions in life, understanding who you are dealing with is crucial for successful navigation, especially when a power dynamic differential exists.

The Limitations of Audits

Currently, there are 86 organizations in the Microsoft root program. Most root programs have at most two people involved in monitoring and policing these 86 CAs. Technologies like Certificate Transparency make this possible, and open-source tools like Zlint and others use this data to find technically observable issues. However, these tools, combined with audits, only provide a small slice of the picture. Audits are backward-looking, not forward-looking. To understand where an organization is going, you need to understand how they operate and how focused they are on meeting their obligations.

This is where the nuanced side of root program management, the standards, and norms of the ecosystem, come into play. If we look at signals in isolation, they often appear trivial. However, when we examine them over a long enough period in the context of their neighboring signals, a more complete picture becomes apparent.

For example, consider a CA with minor compliance issues that seem trivial in isolation. A single misissued certificate might not seem alarming. But when you see a pattern of such incidents over time, combined with other issues like poor incident response or associations with controversial entities, the picture becomes clearer. These patterns reveal deeper issues within the organization, indicating potential systemic problems.

Root Program Challenges

Root programs face significant challenges in managing and monitoring numerous CAs. With limited personnel and resources, they rely heavily on technology and community vigilance. Certificate Transparency logs and tools like Zlint help identify and flag issues, but they are only part of the solution. Understanding the intentions and operational integrity of CAs requires a deeper dive into their practices and behaviors.

In the WebPKI ecosystem, context is everything. Root programs must consider the broader picture, evaluating CAs not just on isolated incidents but on their overall track record. This involves looking at how CAs handle their responsibilities, their commitment to security standards, and their transparency with the community. A CA that consistently falls short in these areas, even in seemingly minor ways, can pose a significant risk to the ecosystem.

Conclusion

Understanding the nuances of CA operations and focusing on their adherence to obligations is critical. By examining patterns over time and considering the broader context, root programs can better identify and address potential risks. The combination of audits, technological tools, and a keen understanding of organizational behavior forms a more comprehensive approach to maintaining trust in the WebPKI system.

It’s always important to remember that CAs need to be careful to keep this in mind. After all, it’s not just what you do, but what you think you do. Having your house in order is essential. By learning from past mistakes and focusing on continuous improvement, organizations can navigate public reporting obligations more effectively, ensuring they emerge stronger and more resilient.

Exploring Browser Distrust

Browser distrust events of WebPKI Certificate Authorities occur on average approximately every 1.23 years. These events highlight the critical role the WebPKI plays in maintaining secure communications on the internet and how failures within this system can have far-reaching implications. By examining these incidents, we can identify common patterns and underlying causes that lead to distrust, so as implementors and operators, we don’t end up repeating the same mistakes.

Identifying Common Patterns

As they say, those who don’t know history are destined to repeat it, so it is worthwhile to take a look at the history of CA distrust events to understand what, if any common patterns exist:

  • Security Breaches: Involves unauthorized access to the CA’s infrastructure, leading to potential misuse of certificate issuance capabilities.
  • Compromise of CA Infrastructure: Refers to breaches where the core infrastructure of the CA is compromised, resulting in unauthorized certificate issuance.
  • Fraudulent Certificates: Occurs when certificates are issued without proper authorization, often leading to the impersonation of legitimate websites.
  • Poor Incident Response Handling: Indicates that the CA failed to adequately respond to security incidents, exacerbating the impact of the initial problem.
  • Misissuance of Certificates: Happens when CAs issue certificates incorrectly, either to the wrong entities or without proper validation, undermining trust.
  • Facilitating Man-In-The-Middle Attacks: Refers to situations where misissued or improperly handled certificates enable attackers to intercept and alter communications.
  • Improper Practices: Includes actions by CAs that deviate from accepted standards and best practices, leading to security risks.
  • Deceptive Actions: Involves deliberate misleading actions by CAs, such as backdating certificates or other forms of dishonesty.
  • Insecure Practices: Encompasses practices by CAs that fail to maintain adequate security controls, leading to vulnerabilities.
  • Non-Compliance with Industry Standards: Indicates that the CA has repeatedly failed to adhere to industry standards and guidelines, leading to a loss of trust.
  • Ties to Controversial Entities: Involves associations with entities that raise ethical or security concerns, leading to distrust.
  • Limited Value to Ecosystem: Indicates that the CA does not provide significant value to the security ecosystem, often due to questionable practices or minimal compliance.
  • Operational Vulnerabilities: Refers to weaknesses in the CA’s operational security, such as using default passwords or having exposed administrative tools, making them susceptible to attacks.

Browser Distrust Events

  1. DigiNotar (2011):
    • Event: DigiNotar was hacked, leading to the issuance of fraudulent certificates. This prompted Mozilla, Google, and Microsoft to revoke trust in DigiNotar certificates.
    • Labels: Security Breaches, Compromise of CA Infrastructure, Fraudulent Certificates, Poor Incident Response Handling
    • Details: Mozilla Security Blog, Threatpost
  2. TurkTrust (2013):
    • Event: It was discovered that TurkTrust mistakenly issued two intermediate CA certificates, one of which was used to issue a fraudulent certificate for *.google.com. This led to the distrust of the TurkTrust CA by major browsers.
    • Labels: Misissuance of Certificates, Facilitating Man-In-The-Middle Attacks
    • Details: Krebs on Security
  3. ANSSI (French CA) (2013):
    • Event: It was discovered that ANSSI had issued a certificate to a network appliance company, which used it to inspect encrypted traffic. This led Google to revoke trust in the intermediate certificate.
    • Labels: Misissuance of Certificates, Facilitating Man-In-The-Middle Attacks
    • Details: Google Security Blog, Mozilla Blog
  4. CNNIC (China Internet Network Information Center) (2015):
    • Event: CNNIC was distrusted after it issued an intermediate certificate to MCS Holdings, which misused it to issue unauthorized certificates.
    • Labels: Misissuance of Certificates, Facilitating Man-In-The-Middle Attacks, Improper Practices
    • Details: Tom’s Hardware, Mozilla Security Blog, Mozilla Security Blog
  5. WoSign and StartCom (2016):
    • Event: WoSign (and StartCom) were distrusted after discovering multiple security issues, including backdating certificates, lying, and improper issuance.
    • Labels: Misissuance of Certificates, Deceptive Actions, Insecure Practices
    • Details: Tom’s Hardware, Google Security Blog
  6. Symantec (2017):
    • Event: Google announced a gradual distrust of Symantec certificates due to numerous instances of certificate misissuance, impacting millions of websites.
    • Labels: Misissuance of Certificates, Non-Compliance with Industry Standards, Poor Incident Response Handling
    • Details: Bleeping Computer, Google Security Blog
  7. Certinomis (2019):
    • Event: Mozilla distrusted Certinomis due to numerous incidents of misissuance and poor handling of security concerns.
    • Labels: Misissuance of Certificates, Facilitating Man-In-The-Middle Attacks
    • Details: Venafi Blog
  8. PROCERT (2020):
    • Event: Mozilla distrusted Procert due to numerous incidents of misissuance and poor handling of security concerns.
    • Labels: Non-Compliance with Industry Standards, Poor Incident Response Handling
    • Details: Venafi Blog
  9. TrustCor (2022):
    • Event: TrustCor was distrusted due to concerns about its ties to companies linked to the US intelligence community and its failure to provide satisfactory responses to these concerns.
    • Labels: Ties to Controversial Entities, Limited Value to Ecosystem
    • Details: gHacks, SSLs.com, SSL Shopper
  10. Camerfirma (2021):
    • Event: Mozilla and Google removed trust in Camerfirma due to a series of compliance issues and failure to maintain industry standards.
    • Labels: Non-Compliance with Industry Standards, Poor Incident Response Handling
    • Details: The Register
  11. Visa (2022):
    • Event: Issues include incomplete security audits that are required according to the
    • Labels: Non-Compliance with Industry Standards, Poor Incident Response Handling
    • Details: Feisty Duck
  12. e-Tugra (2023):
    • Event: e-Tugra was distrusted due to security concerns. A researcher found numerous vulnerabilities in e-Tugra’s systems, including default passwords and accessible administrative tools, leading to a loss of trust.
    • Labels: Operational Vulnerabilities, Insecure Practices, Poor Incident Response Handling
    • Details: Sectigo, GitHub, Ian Carroll’s Blog
  13. Ecommerce CA (EU) (2024):
    • Event: The Ecommerce CA in the EU faced browser distrust due to various security and compliance issues, leading to its removal from trusted lists.
    • Labels: Non-Compliance with Industry Standards, Operational Vulnerabilities, Poor Incident Response Handling
    • Details: EuroCommerce, Ecommerce Europe
  14. Entrust (2024):
    • Event: The Chrome Security Team announced the distrust of several Entrust roots due to a pattern of compliance failures and unmet improvement commitments.
    • Labels: Non-Compliance with Industry Standards, Poor Incident Response Handling
    • Details: Google Security Blog

Conclusion

The frequency and patterns of browser distrust events underscore the critical importance of preventive measures, transparency, and effective incident response.

Implementing robust security practices, conducting regular audits, and maintaining compliance with industry standards can significantly reduce the risk of such incidents. Transparency in operations and public disclosure of security issues foster an environment of accountability and trust.

An ounce of prevention is indeed worth more than a pound of cure. By focusing on proactive measures and cultivating a culture of continuous improvement, Certificate Authorities can better navigate the complexities of WebPKI. Effective crisis communication and incident response plans are essential for managing the fallout from security breaches and maintaining the trust of users and the broader web community.

By learning from past incidents and addressing their root causes, we can work towards a more secure and resilient internet, where trust in the WebPKI system is consistently upheld. The collective effort of CAs, browser vendors, and security researchers will play a pivotal role in achieving this goal, ensuring the integrity and reliability of our online ecosystem.

Why We Trust WebPKI Root Certificate Authorities

I’ve always likened the WebPKI governance system to our legal system, where congress sets the laws and the judiciary ensures compliance. Justice Breyer’s recent explanation on “rules” and “standards” in law, as discussed on the Advisory Opinions podcast, resonates well with how WebPKI operates in practice. In WebPKI, “rules” are explicitly defined through audits derived from CA/Browser Forum standards, incorporated into programs like WebTrust for CAs, and enforced through contractual obligations. These rules ensure aspire to consistent security and reliability across the web.

In contrast, “standards” in WebPKI encompass community norms, best practices, and recommendations specific to each root program. These standards are adaptable, evolving with technological advancements, security threats, and collective learning among CAs. They provide a framework that upholds the integrity of the Internet, ensuring that CAs remain transparent and live up to their promises while adhering to ecosystem norms, requirements, and best practices.

Similar to the Supreme Court, the WebPKI governance system consists of multiple ‘justices,’ with each root program acting akin to a Supreme Court justice. Their decisions on adherence or abstention from standards shape the outcomes that dictate the security and reliability of the Internet. Thus, the trust users place in WebPKI and its stewards is earned through a consistent, transparent, and accountable framework that ensures integrity across devices and browsers.

The Dual Role of Root Programs

1. As Trusted Stewards

While there’s no explicit voting process for root program management, users effectively select them through their choice of browsers or operating systems that incorporate these programs. This implicit trust in their ability to deliver on their security promises to users grants root programs the authority to establish and enforce rigorous standards for Root CAs. These standards determine inclusion in trust stores based on compliance assessments and judgments on value a CA would bring the web’s users, ensuring Root CAs uphold a consistent and transparent standard of integrity that users and web operators can rely on.

2. As Judicial Authorities

Root programs also serve a critical judicial function within the WebPKI landscape, akin to a Supreme Court. They interpret rules and standards, resolve ambiguities, settle community disputes, and establish precedents that guide CA operational practices. This role ensures equitable and consistent application of standards across all Root CAs.

Enforcing Compliance and Transparency

1. Maintaining Checks and Balances

Root programs enforce checks and balances through rigorous audits and monitoring, similar to judicial reviews. These processes assess Root CAs’ adherence to these
“rules” and “standards” and ensure accountability,with the goal of  preventing misuse of their authority on the web.

2. Promoting Transparency and Accountability

Root programs need to operate with a high degree of transparency, akin to open judicial proceedings. Decisions on trust or distrust of Root CAs need to be communicated clearly, accompanied by reasons for these decisions. This transparency ensures that all stakeholders, from end-users to website operators, understand and trust the framework protecting their privacy.

Case Study: The Ongoing Entrust Incident

A current discussion in the mozilla.dev.security.policy forum provides a compelling example of how the WebPKI governance framework operates in practice. This incident underscores the nuanced interaction between rules and standards, as well as the critical importance of transparency and accountability in maintaining trust.

The issue at hand involves Entrust’s performance concerns related to certificate misissuance. Such incidents are pivotal in demonstrating how root programs must navigate complex challenges and uphold rigorous standards of integrity within the web. The ongoing dialogue in the forum highlights the deliberative process undertaken by root programs to address such issues, ensuring that decisions are transparently communicated to stakeholders.

Cases like this illustrate the intricate balance that root programs must maintain between enforcing strict rules for security and adapting standards to accommodate technological advancements. The resolution of these incidents ultimately defines why users can trust root CAs, as it showcases the procedural approach, the transparency the process was designed for, and the goal of achieving accountability through this governance framework.

Why This All Matters

Understanding the dual role of root programs as regulatory bodies and judicial authorities underscores their essential role in maintaining trust. It emphasizes the significance of their decisions in shaping how privacy is delivered on the web, focusing on delivering a robust, evolving transparent, and accountable governance framework to guide these decisions.

The trust placed in WebPKI and its stewards are earned through a system that respects user choice and adheres to principles of fairness, ensuring that end-users can trust they are communicating with the correct website.

From Static to Dynamic: Adapting PKI for Cloud-Native Architectures

When it comes to workload and service credential management, a common misconception is that you can simply reuse your existing Certificate Authority (CA) and Certificate Lifecycle Management (CLM) infrastructure to achieve your desired end-state more quickly. Given that nearly every organization has client and server TLS certificates for devices and web servers and many have made significant investments in managing and issuing these certificates, this notion seems logical. However, the design constraints of these use cases differ significantly. The transition from static to dynamic environments, along with the associated credential governance problems, makes it clear that a purpose-built approach is necessary for managing workload and service credentials.

The Static Nature of Traditional CA and CLM Infrastructure

Traditional CA and CLM infrastructure primarily deals with TLS server certificates, where the identities involved are usually domain names or IP addresses. These identifiers are largely static, pre-defined, and managed out-of-band. As a result, the lifecycle of these certificates, which includes issuance, renewal, and revocation, follows a relatively predictable pattern. These certificates are usually issued for validity periods ranging from 90 days to a year, and the processes and infrastructure are designed to handle these constraints. However, use cases surrounding workload and service credentials have a totally different set of constraints, given the highly dynamic environment and various regulatory frameworks they must adhere to.

The Dynamic Nature of Workloads and Services

Workloads and services operate in a significantly different environment. These identities often come and go as services scale up or down or undergo updates. The sheer scale at which workload and service credentials are issued and managed can be hundreds of times greater than that of traditional client and server TLS certificate use cases.

Unlike the static domain names or IP addresses used in these traditional TLS certificates, the identifiers for workloads and services are dynamically assigned as workloads spin up and down or are updated. Workload and service credentials often have a much shorter lifespan compared to server certificates. They might need to be reissued every few minutes or hours, depending on the nature of the workload and the policies of the environment. This changes the expectations of the availability and scalability of the issuing infrastructure, leading to a need for credentials to be issued as close to the workload as possible to ensure issuance doesn’t become a source of downtime.

These workloads and services are also often deployed in clusters across various geographic locations to ensure fault tolerance and scalability. Unlike server certificates that can rely on a centralized CA, workload and service credentials often need a more distributed approach to minimize latency and meet performance requirements.

Another critical difference lies in the scope of trust. For TLS client and server certificates, the security domain is essentially defined by a root certificate or the corresponding root store. However, in the context of workloads and services, the goal is to achieve least privilege. This often necessitates a much more complicated PKI to deliver on this least-privileged design.

Additionally, federation is often needed in these use cases, and it is usually not possible to standardize on just X.509. For example, you might need to interoperate with a service that works based on OAuth. Even in such cases, you want to manage all of this within one common framework to have unified control and visibility regardless of the technology used for the deployment.

Purpose-Built Solutions for Workload and Service Credential Management

Given the unique challenges of managing workload and service credentials in dynamic, cloud-native environments, existing CA and CLM investments often fall short. Purpose-built solutions, such as those based on SPIFFE (Secure Production Identity Framework for Everyone), tend to offer a more effective approach by providing dynamic, attested identities tailored for these environments.

Dynamic Identifier Assignment

Unlike static identifiers managed by traditional CLM solutions, SPIFFE dynamically assigns identifiers as workloads spin up and down, aligning with the nature of modern cloud environments.

Decentralized Issuance

By issuing credentials as close to the workload as possible, SPIFFE-based solutions reduce latency and align issuance with the availability goals of these deployments, ensuring that credentials are issued and managed efficiently without becoming a source of downtime.

Granular Policy Enforcement

SPIFFE-based solutions enable enforcing fine-grained policy through its namespacing and security domain concepts. This enables organizations to define and enforce policies at a more granular level, ensuring that workloads only access the necessary resources.

Identity Governance and Federation

SPIFFE-based solutions tend to support extending existing identity governance programs to workload and service credentials while also facilitating seamless and secure access to external services across different trust domains.

Multifactor Authentication

These SPIFFE-based solutions also provide support for attestation, which can be thought of as multi-factor authentication for workloads. This attestation verifies the workload and services’ running state and environment, tying the credentials to those environments and helping minimize the risks of credential theft.

Integration with Other Systems and Protocols

These environments can seldom rely exclusively on just X.509, which is why SPIFFE supports both X.509 and JWT formats of credentials. This flexibility allows seamless integration with various systems and protocols within cloud deployments.

Conclusion

While it might be tempting to reuse existing CA and CLM infrastructure for workload and service credential management, this approach fails to address the unique challenges posed by dynamic, cloud-native environments. The ephemeral nature of workload identities, the high frequency of credential issuance, and the need for granular trust domains necessitate purpose-built solutions like those based on SPIFFE.

These purpose-built systems are designed to handle the dynamic nature of modern environments, providing strong, attested identities and ensuring that credential management aligns with the evolving needs of cloud-native workloads. Understanding these distinctions is crucial for developing an effective identity management strategy for modern workloads and services without having to go through a reset event down the line.

To learn more about effective machine identity management, check out SPIRL and see how they are leading the way.

Credential Management vs. Secret Management: Choosing the Right Approach

If we examine the contents of most secret management solutions, like HashiCorp Vault, we will find that we primarily store the logical equivalent of user IDs and passwords for services, workloads, and machines. Much like the old-school practice of writing passwords on post-it notes and sharing them, these secrets are then distributed to whatever entity needs access to the protected resource. As a result, it is no surprise that we encounter the same problems as password management: weak secrets, shared secrets, and stolen secrets. This is why we see machine credentials and keys being stolen in events like STORM 0558.

Secret Management: A Machine-Scale Password Management

Secret management can be seen as password management at a machine scale. These systems centralize secrets  (e.g., API keys, tokens, and passwords) to manage their sprawl and then make it easy to pass these shared secrets around various services and applications. However, just as simple passwords alone are seen as insufficient for user authentication—necessitating the adoption of multi-factor authentication (MFA) and migration to non-password authenticators like Passkeys—the same shortcomings apply to this legacy approach to managing these credentials.

The reality is that these secret managers were designed to address the symptoms rather than the root cause of secret sprawl. By storing and distributing these static secrets without treating them as credentials to be managed, you centralize the chaos but do not solve the fundamental problem: the need for robust, scalable machine identity management. Machines, services, and workloads, much like users credentials, require purpose-built approaches. Just as we have applied multi-factor authentication, identity governance and administration (IGA) and solutions like OKTA to user management, we need analogous systems for machines.

Credential Management for Machines

At machine and workload scale, credential management solutions need to be architected differently to handle the dynamic nature of modern cloud environments. This is where purpose-built solutions come into play. These solutions must address the unique challenges of credential management in dynamic, cloud-native environments. Unlike secret management, which often relies on static secrets, credential management provides dynamic, attested identities for workloads. For example, SPIFFE (Secure Production Identity Framework for Everyone) offers a strategy for managing dynamic identities, providing the foundation for standards-based, scalable, and robust workload identity management.

Higher-Level Concepts in Workload Credential Management:

  1. Provisioning and Deprovisioning:
    • Efficiently creating, updating, and removing machine identities as services start, stop, and scale.
  2. Role-Based Access Control (RBAC):
    • Assigning access rights based on predefined roles, ensuring that machines and services have appropriate permissions.
  3. Periodic Reviews:
    • Conducting regular reviews of machine access rights to ensure compliance and appropriateness.
  4. Policy Enforcement:
    • Defining and enforcing access control policies, ensuring that machines and services adhere to security guidelines.
  5. Audit and Reporting:
    • Generating comprehensive reports to demonstrate compliance with regulatory requirements and internal policies.
  6. Risk Analysis:
    • Identifying and mitigating risks associated with machine identities and their entitlements.
  7. Behavioral Analysis:
    • Monitoring machine behavior to detect anomalies and potential security threats.

Just as legacy approaches to secret management fail for workloads, machines, and services, these legacy IGA approaches will not work either. We need built-for-purpose solutions to adapt and not repeat the mistakes of the past. Those solutions will look like credential management systems that enable these functions as a byproduct of how they work, rather than bolting them onto an existing secret management or IGA process. 

This is why I have been an advisor to SPIRL since its founding. The founders, Evan and Eli, are why I knew I had to be involved. As the authors of the standard and having had real production-at-scale experience with SPIFFE, I knew they had what it took, which is why I have been working with them in a more formal capacity as of this last month. They are building the infrastructure that makes, as I like to say, the right thing the easy thing, ensuring their customers don’t repeat the mistakes of the past when it comes to workload and service identity.

Conclusion

Secret management tools were a necessary step in addressing the sprawl of machine secrets, and they will continue to be needed. However, as the complexity and scale of cloud environments grow, so does the need for more sophisticated solutions for workload and service credentials. Credential management systems provide the dynamic, robust framework required for modern machine identity management.

By adopting credential management as the primary approach and using secret management as the exception, organizations can achieve greater security, scalability, and operational efficiency in their cloud-native architectures. To learn more about effective machine identity management, check out SPIRL and see how they are leading the way.

ACME vs. SPIFFE: Choosing the Right One

In the world of certificate lifecycle management for workloads, two approaches often come into focus: ACME (Automated Certificate Management Environment) and SPIFFE (Secure Production Identity Framework for Everyone). While both can deliver certificates to a device or workload, they cater to different use cases. Let’s explore their core differences and why these matter, especially for dynamic workload deployments.

ACME: Proving Control Over Identifiers

ACME is primarily designed to prove control over an identifier, typically a domain name, and to deliver a certificate that demonstrates successful control verification.

  • Control Verification: ACME verifies control through various methods such as HTTP-01, DNS-01, TLS-ALPN, and External Account Binding.
  • Attestation (Optional): Attestation in ACME is secondary and optional, primarily enabling verifying if the key is well protected.
  • Pre-assigned Identifiers: ACME assumes that the identifier (like a domain name) is pre-assigned and focuses on validating control over this identifier.

This approach is particularly useful for scenarios where identifiers are static and pre-assigned, making it ideal for server authenticated TLS and applications that rely on domain names.

SPIFFE: Dynamic Assignment of Identifier Based on Attestation

SPIFFE, conversely, is designed for dynamic workloads, which frequently change as services scale or update. SPIFFE assigns identifiers to workloads dynamically, based on attestation.

  • Identifier Assignment: SPIFFE assigns an identifier to the subject (such as a workload) using attestation about the subject to construct this identifier.
  • Attestation: Unlike ACME, attestation is a core component in SPIFFE, enabling robust multi-factor authentication (MFA) for workloads based on what is running and where it is running.
  • Namespacing and Least Privilege: SPIFFE facilitates the namespacing of identifiers, building a foundation that enables authorization frameworks that promote least privilege, ensuring workloads only access necessary resources.
  • Minimal Security Domains: At the core of SPIFFE is the concept of security domains, which serve as trust boundaries between services, helping to minimize the attack surface.
  • JWT and X.509: SPIFFE supports SVIDs in both X.509 and JWT formats, enabling seamless integration with various systems and protocols within cloud deployments.

Importance in Dynamic Workload Deployments

The differences between ACME and SPIFFE are particularly significant in dynamic workload deployments:

  • Flexibility and Scalability: SPIFFE’s dynamic identifier assignment is highly suitable for environments where workloads are frequently spun up and down, such as in microservices architectures and Kubernetes clusters.
  • Role-Based Authentication: By making attestation a core component and promoting least privilege, SPIFFE ensures that each workload is authenticated and authorized precisely for its intended role.
  • Granular Policy Enforcement: SPIFFE’s namespacing and minimal security domain features enable fine-grained policy enforcement, enabling organizations to define and enforce policies at a more granular level, such as per-workload or per-service.
  • Identity Federation: SPIFFE supports identity federation enabling different trust domains to interoperate securely. This is particularly beneficial in multi-cloud and hybrid environments where different parts of an organization or third-parties.

Conclusion

While both ACME and SPIFFE are used for certificate management, they cater to different scenarios and can complement each other effectively. ACME is ideal for static environments where identifiers are pre-allocated, focusing on certificate lifecycle management for issuing, renewing, and revoking certificates. It excels in managing the lifecycle of certificates for web servers and other relatively static resources. 

On the other hand, SPIFFE excels in dynamic, high-scale environments, emphasizing credential lifecycle management with flexible and robust authentication and authorization through dynamic identifiers and attestation. SPIFFE is designed for modern, cloud-native architectures where workloads are ephemeral, and identifiers need to be dynamically issued and managed. 

By understanding these differences, you can leverage both ACME and SPIFFE to enhance your identity management strategy. Use ACME for managing certificates in static environments to ensure efficient lifecycle management. Deploy SPIFFE for dynamic environments to provide strong, attested identities for your workloads.

Effortless Certificate Lifecycle Management for S/MIME

In September 2023, the SMIME Baseline Requirements (BRs) officially became a requirement for Certificate Authorities (CAs) issuing S/MIME certificates (for more details, visit CA/Browser Forum S/MIME BRs).

The definition of these BRs served two main purposes. Firstly, they established a standard profile for CAs to follow when issuing S/MIME certificates. Secondly, they detailed the necessary steps for validating each certificate, ensuring a consistent level of validation was performed by each CA.

One of the new validation methods introduced permits mail server operators to verify a user’s control over their mailbox. Considering that these services have ownership and control over the email addresses, it seems only logical for them to be able to do domain control verification on behalf of their users since they could bypass any individual domain control challenge anyway. This approach resembles the HTTP-01 validation used in ACME (RFC 8555), where the server effectively ‘stands in’ for the user, just as a website does for its domain.

Another new validation method involves delegating the verification of email addresses through domain control, using any approved TLS domain control methods. Though all domain control methods are allowed for in TLS certificates as supported its easiest to think of the DNS-01 method in ACME here. Again the idea here is straightforward: if someone can modify a domain’s TXT record, they can also change MX records or other DNS settings. So, giving them this authority suggests they should reasonably be able to handle certificate issuance.

Note: If you have concerns about these two realities, it’s a good idea to take a few steps. First, ensure that you trust everyone who administers your DNS and make sure it is securely locked down. 

To control the issuance of S/MIME certificates and prevent unauthorized issuance, the Certification Authority Authorization (CAA) record can be used. Originally developed for TLS, its recently been enhanced to include S/MIME (Read more about CAA and S/MIME).

Here’s how you can create a CAA record for S/MIME: Suppose an organization, let’s call it ‘ExampleCo’, decides to permit only a specific CA, ‘ExampleCA’, to issue S/MIME certificates for its domain ‘example.com’. The CAA record in their DNS would look like this:

example.com. IN CAA 0 smimeemail "ExampleCA.com"

This configuration ensures that only ‘ExampleCA.com’ can issue S/MIME certificates for ‘example.com’, significantly bolstering the organization’s digital security.

If you wanted to stop any CA from issuing a S/MIME certificate you would create a record that looks like this: 

example.com. IN CAA 0 issuemail ";"

Another new concept introduced in this round of changes is a new concept called an account identifier in the latest CAA specification. This feature allows a CA to link the authorization to issue certificates to a specific account within their system. For instance:

example.com. IN CAA 0 issue "ca-example.com; account=12345"

This means that ‘ca-example.com’ can issue certificates for ‘example.com’, but only under the account number 12345.

This opens up interesting possibilities, such as delegating certificate management for S/MIME or CDNs to third parties. Imagine a scenario where a browser plugin, is produced and managed by a SaaS on behalf of the organization deploying S/MIME. This plug-in takes care of the initial enrollment, certificate lifecycle management, and S/MIME implementation acting as a sort of S/MIME CDN.

This new capability, merging third-party delegation with specific account control, was not feasible until now. It represents a new way for organizations to outsource the acquisition and management of S/MIME certificates, simplifying processes for both end-users and the organizations themselves.

To the best of my knowledge, no one is using this approach yet, and although there is no requirement yet to enforce CAA for SMIME it is in the works. Regardless the RFC has been standardized for a few months now but despite that, I bet that CAs that were issuing S/MIME certificates before this new CAA RFC was released are not respecting the CAA record yet even though they should be. If you are a security researcher and have spare time that’s probably a worthwhile area to poke around 😉

The Rise of Key Transparency and Its Potential Future in Email Security

Key Transparency has slowly become a crucial part of providing truly secure end-to-end encrypted messaging. Don’t believe me? The two largest providers of messaging services, Apple and Facebook (along with their WhatsApp service), have openly adopted it, and I am hopeful that Google, one of its early advocates, will follow suit.

At the same time, we are on the precipice of interoperable group messaging as Messaging Layer Security (MLS) was recently standardized. Its contributors included representatives from employees of the mentioned services and more, which suggests they may adopt it eventually. What does this have to do with Key Transparency? It acknowledges the need for secure, privacy-preserving key discovery through its inclusion of Key Transparency in its architecture.

It’s also noteworthy to see that Apple has agreed to support RCS, Android’s messaging protocol. While there is no public hint of this yet, it’s possible that since they have positioned themselves as privacy champions over the last decade frequently toting their end-to-end encryption, we may see them push for MLS to be adopted within RCS, which could net the world its first interoperable cross-network messaging with end-to-end encryption, and that would need a key discovery mechanism.

In that spirit, recently the Internet Engineering Task Force (IETF) has established a Working Group on Key Transparency, and based on the participation in that group, it seems likely we will see some standardization around how to do Key Transparency in the future.

What’s Next for Key Transparency Adoption Then?

I suspect the focus now shifts to S/MIME, a standard for public key encryption and signing of emails. Why? Well, over the last several years, the CA/Browser Forum adopted Baseline Requirements (BRs) for S/MIME to help facilitate uniform and interoperable S/MIME, and those became effective on September 1, 2023 – this means CAs that issue these certificates will need to conform to those new standards.

At the same time, both Google and Microsoft have made strides in their implementations of S/MIME for their webmail offerings.

In short, despite my reservations about S/MIME due to its inability to address certain security challenges (such as metadata confidentiality, etc), it looks like it’s witnessing a resurgence, particularly fueled by government contracts. But does it deliver on the promise today? In some narrow use cases like mail signing or closed ecosystem deployments of encrypted mail where all participants are part of the same deployment, it is probably fair to say yes.

With that said, mail is largely about interoperable communications, and for that to work with encrypted S/MIME, we will need to establish a standard way for organizations and end-users to discover the right keys to use with a recipient outside of their organization. This is where Key Transparency would fit in.

Key Transparency and S/MIME

Today, it is technically possible for two users to exchange certificates via S/MIME, enabling them to communicate through encrypted emails. However, the process is quite awkward and non-intuitive. How does it work? You either provide the certificate out of band to those in the mail exchange, and they add it to their contact, or some user agents automatically use the keys associated with S/MIME signatures from your contacts to populate the recipient’s keys.

This approach is not ideal for several reasons beyond usability. For instance, I regularly read emails across three devices, and the private keys used for signing may not be the same on each device. Since consistent signing across devices isn’t required, if I send you an email from my phone and then you send me an encrypted message that I try to open on my desktop, it won’t open.

Similarly, if I roll over my key to a new one because it was compromised or lost, we would need to go through this certificate distribution workflow again. While Key Transparency doesn’t solve all the S/MIME-related problems, it does provide a way to discover keys without the cumbersome process, and at the same time, it allows recipients to know all of my active and published certificates, not just the last one they saw.

One of the common naive solutions to this problem is to have a public directory of keys like what was used for PGP. However, such an approach often becomes a directory for spammers. Beyond that, you have the problem of discovering which directory to use with which certificate. The above Key Transparency implementations are all inspired by the CONIKS work, which has an answer to this through the use of a Verifiable Random Function (VRF). The use of the VRF in CONIKS keeps users’ email addresses (or other identifiers) private. When a user queries the directory for a key, the VRF is used to generate a unique, deterministic output for each input (i.e., the user’s email). This output is known only to the directory and the user, preserving privacy.

The generic identifier-based approach in Key Transparency means it can neatly address the issue of S/MIME certificate discovery. The question then becomes, how does the sender discover the Key Transparency server?

Key Transparency Service Discovery

The answer to that question probably involves DNS resource records (RRs). We use DNS every day to connect domain names with IP addresses. It also helps us find services linked to a domain. For instance, this is how your email server is located. DNS has a general tool, known as an SRV record, which is designed to find other services. This tool would work well for discovering the services we’re discussing.

_sm._keytransparency._https.example.com. 3600 IN SRV 10 5 443 sm-kt.example.com.

In this example, _sm the identifier is placed before _keytransparency. and _https shows that this SRV record is specifically for a Key Transparency service for Secure Messaging. This allows someone to ask DNS for a S/MIME-specific Key Transparency service. It also means we can add other types of identifiers later, allowing for the discovery of various KT services for the same domain.

Conclusion

While S/MIME faces many challenges, such as key roaming, message re-encrypting on key rollover, and cipher suite discoverability, before it becomes easy to use and widely adopted—not to mention whether major mail services will invest enough in this technology to make it work—there’s potential for a directory based on Key Transparency if they do.

Hopefully, the adoption of Key Transparency will happen if this investment in S/MIME continues, as it’s the only large-scale discovery service for user keys we’ve seen in practice. Unlike other alternatives, it’s both privacy-respecting and transparently verifiable, which are important qualities in today’s world. Only time will tell, though.