Exploring Timeline Requirements for Certificate Problem Reports and Revocations

Today, DigiCert’s mass revocation is in the news, so I thought it would be worthwhile to explore the rules for a CA when it comes to revoking a subscriber certificate and the choices and constraints a CA faces in circumstances like this.

First, let’s examine the types of issues that might lead a CA to revoke a certificate before its natural expiration. Broadly speaking, there are two categories of these issues: 

Externally identified issues. These start from what is called a Certificate Problem Report (CPR), which may originate from subscribers, relying parties, application software suppliers, or other third parties.

Internally identified issues. These may be identified by pre-issuance checks, post-issuance checks, or internal periodic reviews.

Problem reports from external entities almost always require some minimal level of investigation. An exception might be a case where the reporter has access to the private key and requests that a certificate be revoked; this could come from the subscriber or from someone who has gained access to the private key. Even in these cases, you often need multiple communication round trips, because requests often do not contain sufficient information to authenticate them from the outset.

There are other cases too. For example, sometimes a problem report claims unauthorized issuance of a certificate, but it turns out the reporter had authorized it, perhaps by signing up for a CDN or SaaS offering that uses that CA. In these cases you might need to contact the person who requested the certificate or the service provider for more information, or ask the reporter to prove control of the domain to verify their authorization to request the revocation, again introducing a few round trips.

However, most problem reports are spam or requests to take down a website. In the latter case, if a CA has a policy about revoking certificates for domains that serve such content, then it also needs some level of investigation. On that, check out this post on the CA’s role in phishing mitigation, which explains why CAs should stay out of this class of issue altogether.

These are just a few examples of the certificate problem reports a CA might see, but there are more. One of the most actionable is a claim that the CA violated the requirements it must meet, such as including incorrect information in a certificate, using a certificate profile with a wrong extension, or using a domain control verification process that does not meet requirements.

The workflow for these requests is straightforward, but it still requires some investigation. The questions that need to be answered before taking action include things like: When was the certificate issued? What rules were in force at that time? What data was used when validating the entitlement of the requester? What version of the software implemented the validation at the time of issuance? Is the reporter’s interpretation of the requirement technically correct?

This is why the Baseline Requirements (BRs) allow CAs some time for an investigation:

CA SHOULD revoke a certificate within 24 hours and MUST revoke a Certificate within 5 days

NOTE: Here is a conversation on the last changes to this text. Searching the forums is always a good way to validate what the intent was at the time of authorship.

Let’s look at an example here. In 2020, Let’s Encrypt had a misissuance due to a bug in CAA handling. This issue was found internally through what we would call mitigating controls: a process that exists to double-check that you are doing the right thing.

Twenty-four hours is not a lot of time, especially if such a report comes in on New Year’s Eve. CAs maintain on-call rotations to investigate issues, but it’s easy to imagine that, from time to time, an investigation might take longer than 24 hours to fully complete. Even so, CAs shouldn’t have an infinite amount of time for such investigations, which is why the text limits the time to revocation to a maximum of 5 days.
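To make the two clocks concrete, here is a minimal sketch of how the 24-hour and 5-day windows could be computed from the moment a report is received. It assumes the clock starts at receipt of the report; the function names and the status labels are mine, purely for illustration, and are not language from the BRs.

```python
from datetime import datetime, timedelta, timezone

# Windows taken from the text quoted above: SHOULD within 24 hours, MUST within 5 days.
SHOULD_REVOKE_WITHIN = timedelta(hours=24)
MUST_REVOKE_WITHIN = timedelta(days=5)


def revocation_deadlines(received_at: datetime) -> dict:
    """Return both deadlines, assuming the clock starts when the report is received."""
    if received_at.tzinfo is None:
        raise ValueError("received_at must be timezone-aware")
    return {
        "should_revoke_by": received_at + SHOULD_REVOKE_WITHIN,
        "must_revoke_by": received_at + MUST_REVOKE_WITHIN,
    }


def revocation_status(received_at: datetime, now: datetime) -> str:
    """Classify where 'now' falls relative to the two deadlines (labels are illustrative)."""
    deadlines = revocation_deadlines(received_at)
    if now <= deadlines["should_revoke_by"]:
        return "within 24 hours"
    if now <= deadlines["must_revoke_by"]:
        return "past 24 hours, within 5 days"
    return "past 5 days, delayed revocation"


if __name__ == "__main__":
    # A hypothetical report arriving late on New Year's Eve (UTC).
    received = datetime(2024, 12, 31, 23, 0, tzinfo=timezone.utc)
    for name, when in revocation_deadlines(received).items():
        print(name, when.isoformat())
```

Tracking both timestamps from the first minute makes it obvious when an investigation is about to run past a window, which is the point at which a delayed revocation incident needs to be filed.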

Let’s explore what this might look like in the case of a real incident. On February 29, 2020, Let’s Encrypt discovered, through an internal review, an issue in their domain control verification logic and kicked off an investigation. On the same day they discovered the issue, they announced their plan to revoke the more than 3 million affected certificates. Revocations began on March 4, 2020, and by the March 5, 2020 deadline, they had revoked 1,706,505 certificates. Network scans revealed that a large portion of the remaining 1,003,596 certificates were still in active use.

Due to concerns over the potential impact revocation would have on these active sites, they filed an incident for delayed revocation just before the deadline. These remaining certificates either expired or were revoked by May 29, 2020, after which the ticket remained open for community discussion and further analysis.

Let’s Encrypt did many things right here, including implementing compensating controls to detect issues internally, preparing to analyze the scope of impact immediately, and revoking the certificates they could without causing disruption as soon as possible. When it became clear that they would not meet the deadline requirements, they filed a bug to document their non-compliance with the obligations before the deadline lapsed. Moreover, they communicated clearly and regularly throughout the entire process until the issue was resolved.

But what about the missed revocation deadline!?! The reality is that delayed revocations happen. It’s not ideal, but they do happen. What is important is how a CA deals with them, and how credibly. Let’s Encrypt executed like a well-oiled machine here; no one complained. (They did this all without getting distrusted, for more on how CAs usually get distrusted see.)

There are other cases of delayed revocation as well. For example, in 2019 Apple, Google, and GoDaddy were all hit by a bug in EJBCA, and all of them had some delayed revocations. While there may have been some challenges in each of those cases, you could tell that, like Let’s Encrypt, they were being proactive and transparent and were working towards addressing the issue responsibly. When you looked at their broader history, there was no sign of systemic failures.

OK, so if you are in the CA industry right now, you should be saying, “But Ryan, what about the Certificate Problem Report response requirements?! This DigiCert issue was reported externally, so those requirements are also relevant.” If so, you are right. These requirements say:

The CA SHALL maintain a continuous 24×7 ability to accept and respond to revocation requests and Certificate Problem Reports.

And …

The period from receipt of the Certificate Problem Report or revocation-related notice to published revocation MUST NOT exceed the time frame set forth in Section 4.9.1.1. The date selected by the CA SHOULD consider the following criteria:

– The nature of the alleged problem (scope, context, severity, magnitude, risk of harm);

– The consequences of revocation (direct and collateral impacts to Subscribers and Relying Parties);

– The number of Certificate Problem Reports received about a particular Certificate or Subscriber;

– The entity making the complaint (for example, a complaint from a law enforcement official that a Web site is engaged in illegal activities should carry more weight than a complaint from a consumer alleging that they didn’t receive the goods they ordered); and

– Relevant legislation.

Meanwhile, Section 4.9.1.1 says:

With the exception of Short-lived Subscriber Certificates, the CA SHALL revoke a Certificate within 24 hours and use the corresponding CRLReason (see Section 7.2.2) if one or more of the following occurs:

If you’re paying attention, you might be thinking to yourself, “Wait a minute, doesn’t that mean, since the request came in from an external reporter, that the certificate should be revoked within 24 hours?”

Well, yes, it does! This totally makes sense in some cases, for example, if a notice of a key compromise comes in with all the detail required to confirm it. But as we discussed above, requests received externally often require some level of verification. This raises the question: is 24 hours enough time to handle that or not? One thing is for sure, though: arguing that the rules are wrong or that the rules should be changed during an incident response is a recipe for disaster.

So what would I have done if I were DigiCert? I would have stopped issuance as soon as the issue was identified. I would have then set engineers to work on resolving the issue in the right way. In parallel, I would have prepared a preliminary incident report, noting that I had stopped the issuance. I would have also noted in that incident report any confusion I may have had about the language in the requirements. I would have then committed to start revoking following the requirements, and before the 24-hour clock expired, I would have started replacing and revoking the certificates that could be revoked safely. If it was clear that despite active, good faith, best efforts, the revocation was not going to be completed in time, then I would have filed a delayed revocation incident before that clock expired, essentially following the same playbook Let’s Encrypt used back in 2020.

After successfully navigating the incident, I would have worked within the CA/Browser Forum to discuss this language and either make it clearer why the difference exists or change it to align with the other text.

We did not really tackle the topic of “Internally Identified Issues”. I mentioned them above but didn’t really explain them in detail, so let’s see if I can capture them without rambling too much.

As I mentioned earlier, there are several classes. In the Let’s Encrypt case, they had some compensating controls around design reviews that identified an issue. When it comes to cases like this, it is pretty straightforward; by the time you have completed the design review, you are sure there is an issue at play, and the clock starts ticking.

Regarding pre-certificate linting, it’s a failure in your linting management if a certificate with an issue that could or should have been caught by your linting gets issued. This is more akin to a design defect than a CPR-like incident, and as such it would certainly trigger the 24-hour clock.

Then there is post-issuance linting, which is used as a spot check during data gathering for an audit, as a compensating control, or when you are testing new rules for the linter or refining your linter policy configuration. Findings from it are by their very nature a lot more like a CPR: because you don’t yet have a fine-tuned policy for the linter, they too need some level of investigation.

To be clear, none of this appears related to the DigiCert incident. They have not published a detailed incident report yet; when they do, it will include an end-to-end timeline, and that may change some of this analysis. Still, I wanted to try to cover the topic more holistically in this post before declaring victory tonight.

Regardless of the above, the key takeaway is that if you are a CA or a customer of a CA, you need to be prepared to respond to an incident. One of the most useful actions you can take to achieve this is to adopt ACME for certificate lifecycle management. In particular, look for ACME implementations and CAs that support ARI, an extension of ACME. This extension enables a CA to signal to its enrollment clients that their certificate needs to be replaced sooner than expected, allowing them to roll certificates without requiring on-call incident response or causing downtime.
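To give a feel for what ARI looks like from the client side, here is a rough sketch of how a client might act on a renewal-info response. As I understand the ARI specification, the CA returns a JSON object with a suggestedWindow containing start and end timestamps (and optionally an explanationURL), and the client picks a random time inside that window, renewing right away if that time has already passed. The sample payload, the example URL, and the function names below are made up for illustration; a real client would fetch this document from the CA’s renewal-info endpoint and handle retries and errors.

```python
import json
import random
from datetime import datetime, timedelta, timezone

# Hypothetical response body; the values are invented for this example.
SAMPLE_RENEWAL_INFO = """
{
  "suggestedWindow": {
    "start": "2024-07-30T00:00:00Z",
    "end": "2024-07-31T00:00:00Z"
  },
  "explanationURL": "https://ca.example/incident-notice"
}
"""


def parse_timestamp(value: str) -> datetime:
    """Parse an RFC 3339 timestamp; map a trailing 'Z' to an explicit UTC offset."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def pick_renewal_time(renewal_info: dict, rng=random) -> datetime:
    """Pick a uniformly random instant inside the CA's suggested window."""
    window = renewal_info["suggestedWindow"]
    start = parse_timestamp(window["start"])
    end = parse_timestamp(window["end"])
    if end <= start:
        raise ValueError("suggested window must end after it starts")
    offset_seconds = rng.uniform(0, (end - start).total_seconds())
    return start + timedelta(seconds=offset_seconds)


def should_renew_now(renewal_info: dict, now: datetime, rng=random) -> bool:
    """Renew immediately when the selected time is already in the past."""
    return pick_renewal_time(renewal_info, rng) <= now


if __name__ == "__main__":
    info = json.loads(SAMPLE_RENEWAL_INFO)
    now = datetime.now(timezone.utc)
    print("renew now?", should_renew_now(info, now))
```

The random pick inside the window is what lets a CA pull a large population of certificates forward during an incident without every client hitting it at the same instant.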

If you are interested in this topic, I recommend checking out the Sectigo “Root Causes” podcast which has a great conversation on Chrome’s recent presentation at the CA/Browser Forum on Incident Reporting. 

This presentation can be summarized as:

– Incident reporting is an opportunity to demonstrate continuous improvement and to make the ecosystem more resilient and secure.

– We rely on the public incident reporting process to demonstrate that continued trust is justified.

– We think incident reporting should be boring (i.e. routine).

I hope this was helpful. Let me know if you think I missed something or if my analysis is wrong in some way. When processing prose as code, there are almost always slightly different takes, and I wouldn’t be surprised to find out others feel differently than me on this issue.

Thanks to all of you who read this and provided feedback — it takes a village.
