Tag Archives: REVOCATION

My response, to his response, to my response? or short-lived certificates part 3

The conversation on short-lived certificates and their value continues. In the most recent exchange we have started to shift from an either-or position to one where we explore what is needed to make revocation checking a viable technology, a topic I am passionate about.

That said, there still seems to be some confusion on short-lived certificates, specifically the author states:

Of course, the cost of short-lived certs is very high as change the whole computing infrastructure so that certificates are renewed on a daily basis (daily for it to be secure enough vs 90 day certificates in my view) and introduce this new moving part that might cause vulnerability and operational issues.

Let’s explore this “cost” argument for a moment. First, when I issue an end-entity certificate I minimally have to perform two cryptographic signatures, one on the certificate and one on the OCSP response (it’s actually more than this in some cases but to keep the conversation simple I have omitted the others).

If we look at the performance optimization that Firefox has implemented, where revocation checking is not required when the certificate's lifetime is shorter than the freshness window of the revocation information that would cover it, short-lived certificates can, in fact, reduce the cost for a CA. This is because you no longer need to sign two things during the first few days/hours of the life of the certificate, and you do not need to distribute that response.
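
To make that concrete, here is a minimal sketch of this style of check; the 10-day threshold is illustrative and not Mozilla's exact parameter:

```c
#include <stdbool.h>
#include <time.h>

/* Firefox-style optimization (illustrative sketch): skip fetching
 * revocation information when the certificate's entire lifetime is no
 * longer than the period for which a cached OCSP response or CRL could
 * already be stale. The 10-day threshold is an assumption, not
 * Mozilla's exact value. */
bool needs_live_revocation_check(time_t not_before, time_t not_after)
{
    const double short_lived_secs = 10.0 * 24 * 60 * 60;
    return difftime(not_after, not_before) > short_lived_secs;
}
```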

The only way I can see short-lived certificates being more expensive for the CA is if you compare a model where certificates are issued for a period better measured in hours with a model where certificate validity is measured in years and weekly revocation updates are provided. That, however, is a bad model; something in-between is needed.

The author also believes the use of automation represents a security vulnerability, so this deserves a response. It is true that complexity is the enemy of security; it is even true that automation, if poorly implemented, can hurt security. The inverse is also true, however. It is generally accepted that availability is a component of security, and one of the more common problems in the WebPKI is poor manual management practices, resulting in a lack of understanding of what is deployed and in deployed certificates expiring [see 1, 2, 3, and 4] and taking down services.

It is also important to look at the big picture when evaluating the net security benefits of automation. For example, does anyone honestly believe we would ever get to a world where the majority of the web is encrypted if organizations had to staff people to manually generate certificate requests and hand-carry them to CAs? Is the net benefit of automation, in reliability and scope of deployment, worth its secondary effects?

The author also suggests that a certificate must have a validity period of only a day to be “secure enough”. This seems both arbitrary and wrong; as stated previously, the User Agents and WebTrust allow an OCSP response or CRL to be a week old and still be trusted.

One of the largest reasons for this is that clock skew is a big problem in the real world; as a result, you need to keep the validity periods of certificates and revocation messages outside this skew window to prevent skew-related failures.

The decision to define “secure enough” at a day both frames the problem in an intractable way and establishes a double standard that does nothing to address the issue of stale revocation information.

If we were to bring this conversation back to how we improve certificate revocation, I would say there should be one standard for how recent the client's understanding of the certificate's validity needs to be.

On that topic, the author goes on to discuss how 32 bytes is better than 470 (the size of the smallest OCSP response). I could not agree more; in fact, in the '90s when I was at ValiCert we implemented a proposal from Paul Kocher called Certificate Revocation Trees. This approach uses Merkle Trees (the heart of the Bitcoin ledger) to provide a very space-efficient solution to this problem. Unfortunately, we were unable to popularize it at the time.
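
To illustrate the core idea (a from-scratch sketch using OpenSSL's SHA256, not ValiCert's or Kocher's actual construction): the CA hashes each revoked serial number into a leaf, reduces the leaves to a single root, and signs only that root; a relying party can then verify a compact proof of a few sibling hashes instead of downloading the entire list.

```c
#include <openssl/sha.h>
#include <string.h>

/* Combine two child hashes into their parent node's hash. */
static void parent(unsigned char out[SHA256_DIGEST_LENGTH],
                   const unsigned char *left, const unsigned char *right)
{
    unsigned char buf[2 * SHA256_DIGEST_LENGTH];
    memcpy(buf, left, SHA256_DIGEST_LENGTH);
    memcpy(buf + SHA256_DIGEST_LENGTH, right, SHA256_DIGEST_LENGTH);
    SHA256(buf, sizeof(buf), out);
}

/* Reduce n leaf hashes (e.g. hashes of the revoked serial numbers,
 * sorted) to a single Merkle root, in place. An odd node at the end of
 * a level is promoted unchanged. After this returns, leaves[0] holds
 * the root: signing those 32 bytes commits the CA to the whole
 * revocation set, and the proof for any one entry is just the sibling
 * hashes on the path from its leaf up to the root. */
void merkle_root(unsigned char leaves[][SHA256_DIGEST_LENGTH], size_t n)
{
    while (n > 1) {
        size_t next = 0;
        for (size_t i = 0; i + 1 < n; i += 2)
            parent(leaves[next++], leaves[i], leaves[i + 1]);
        if (n % 2 == 1)
            memcpy(leaves[next++], leaves[n - 1], SHA256_DIGEST_LENGTH);
        n = next;
    }
}
```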

Ben Laurie began work on a variation of this approach that leveraged sparse Merkle Trees, which he called Revocation Transparency. I personally like this approach because it builds on the work done to make Certificate Transparency scalable. For example, Trillian, the foundation of Google's next-generation log server, is designed to scale to trillions of certificates.

That said, there are a number of similar approaches that could be equally scalable.

While I do think an approach similar to the above could be made to work today, I also see it as more of a long-term solution: even with the significantly increased rate of technological adoption, and given the state of things, it would take close to ten years for such a solution to be fully deployed if we started right now.

As such, I would start with the problem definition, which would need to involve a more formal analysis of the role of revocation checking today so that the right solution gets built.

In parallel I would want to see the industry adopt a more strategic plan to address the more practical and immediately solvable problems, including:

  • Measuring and improving the revocation infrastructure operated by CAs,
  • Establishing global performance and reliability metrics and reporting that all CAs must meet,
  • Funding improvements to Nginx and Apache’s OCSP Stapling implementations,
  • Working with browsers to adopt the performance optimization Firefox has implemented for revocation checking,
  • Working with TLS stacks, User Agents, Servers and Service Providers to adopt OCSP Must-Staple,
  • Defining an OCSP transport based on DNS that would reduce dependency on CA infrastructure reliability,
  • Evangelizing the adoption of OCSP Stapling with administrators.

Ryan

P.S.

The author has also brought in someone else who has asked some questions, or more correctly seems to question my version of the historical narrative. To provide some context, my narrative comes from my practical experience working with Microsoft, eBay, Amazon, and other large companies from the mid-to-late '90s through the mid-2000s.

I too have worked with the BBN Safekeeper; I have a fun story about how we hired some people to extract the keys from one of these boxes that I would be happy to share over a beer sometime.

However, as cool a device as that was, the first one I remember working with was the KOV-8, in the 1993/4 timeframe.

Anyway, it is true that SSL started its life in 1994/5, at which point only software implementations of crypto were used (they were all BSAFE), but it is also true that mass deployment of SSL (relatively speaking) did not start until the late '90s and early '00s, and that is the period my narrative was based on.

He has also questioned the narrative of what Windows supported in the context of key protection. Since the author knows me personally, he must have simply forgotten that I was the PM for these technologies and was at Microsoft working in this area for about 15 years.

Again, I think there is some confusion here; the author states:

The software based Cryptographic Service Provider for RSA allowed keys to be marked ‘not for export’ from a very early release if not the first.

and:

The CAPI features used to protect private keys were expanded and exposed as a separate API in Windows 2000 as the Data Protection API.

As someone who worked at Microsoft on these technologies for a long time, I can say with absolute confidence that they were not built to provide key isolation, do not provide key isolation properties, and were actually not used by the SSL implementation (SCHANNEL) for the server keys. If you're interested in learning more about the capabilities of Windows in this area, check out this post I did recently.

He has also questioned the role of ValiCert in the definition of the RFC. Thankfully the IETF PKIX archives are still there, and if you care to look you can see Mike was basically checked out, Warwick was not publicly participating, and the work to finalize the protocol was largely done by Ambarish Malpani, the founder of ValiCert.

Why bother with short-lived certificates and keys in TLS?

There seems to be a lot of confusion and misinformation about the idea of short-lived certificates and keys so I thought I would pen some thoughts about the topic in the hope of providing some clarification.

I have seen some argue that the rationale behind short-lived certificates is to address the shortcomings of the CA and browser revocation infrastructure; I would argue this is not the case at all.

In my mind, the main reason for them is to address the issue of key compromise. Long-lived keys have a long period in which they are exposed to theft (see Heartbleed) and are therefore of higher value to an attacker, since a stolen key enables the attacker to impersonate the associated website for that period.

The most important thing to keep in mind is that the nature of key compromise is such that you almost never know it has happened until it is too late (consider DigiNotar as an extreme example).

The importance of protecting the SSL private key is why in the '90s, when SSL deployment started in earnest, large companies tried to deploy SSL in their environments using SSL “accelerators” and “security modules” that protected the keys. This was the right thing to do technically, but not practically, because it significantly reduced the scalability of SSL-protected services and at the same time massively increased the cost of deploying SSL.

By the mid '00s we recognized this was not workable; the use of accelerators basically stopped outside a few edge cases, and software keys were used instead. Some implementations, like Microsoft's SCHANNEL, tried to protect the keys by moving them to a separate process, mitigating the risk of theft to some degree. Others simply loaded the keys into the web server's process and as a result exposed them to Heartbleed-like attacks.

What's important here is that when this shift happened, no one pushed for meaningful changes to the maximum validity period of certificates and, by association, their private keys. This meant we had software keys exposed to theft for five years (the maximum validity period at the time) with no reliable way to detect they had been stolen, and in the unlikely event we did find out about a theft, we would rely on the unreliable CA revocation infrastructure to communicate the issue to User Agents. Clearly this was not an ideal plan.

In short, by significantly reducing the validity of the certificate and keys, we also significantly reduce the value to the attacker.

Another issue that short-lived certificates help mitigate is the evolution of the WebPKI: with a long-lived certificate you get virtually none of the security benefits of policy and technology improvements until the old certificates and keys are expunged from the ecosystem. Today the rollout of new policy is largely accomplished via “natural expiration”, which means you have to wait until the last certificate issued under the old policies expires before absolute enforcement is possible.

So what is the ideal certificate validity period then? I don’t think there is a one-size-fits-all answer to that question. The best I can offer is:

As short as possible but no shorter.

With some systems it is not possible to deploy automation, and until it is, one needs to pick a validity period that is short enough to mitigate the key compromise and policy risks I discussed while long enough to keep management practical.

It is probably easier to answer the question of what the shortest validity period we can reliably use is. The answer is buried in the clock skew of the relying parties. Chrome recently released a new clock-synchronization feature that significantly reduces errors related to certificate validity periods, but until that is fully deployed and other UAs adopt similar solutions, you are probably best off keeping certificate validity periods at 30 days to accommodate skew and potential renewal failures.

In short, long-lived keys for SSL exist because we never revisited the threat model of key compromise when we stopped using hardware-protected keys in our SSL deployments; short-lived keys help deal with the modern reality of SSL deployments.

Ryan

– I have added a few pictures for fun and made some minor text changes for clarity.

Revocation reasons don’t make sense for the WebPKI of today

As the old saying goes, to err is human but to forgive is divine; in the WebPKI you deal with errors through the concept of revocation.

In the event a CA determines it made a mistake in the issuance of a certificate or has been notified by the subscriber they would like to see a certificate marked invalid, they revoke the corresponding certificate.

The two mechanisms they have available to do this are Certificate Revocation Lists (CRLs) and OCSP responses. The first is like a phonebook of all known “revoked” certificates, while the second is more like a lookup service that enables User Agents to ask about the status of a particular certificate.

In both cases they have the option to indicate why the certificate was revoked; the available options include:

  • Key Compromise: We believe the key to have been stolen or can no longer be sure it has not been.
  • CA Compromise: The issuing CA itself has been compromised (think DigiNotar).
  • Affiliation Changed: The certificate contained affiliation information, for example, it may have been an EV certificate and the associated business is no longer owned by the same entity.
  • Privilege Withdrawn: The right to represent the given host was revoked for some reason.

There are a few other reasons but they are not materially relevant to WebPKI use cases so I left them out.
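
For the curious, in a CRL the reason travels as the cRLReasons entry extension (OID 2.5.29.21); a sketch of pulling it out with CryptoAPI, error handling omitted:

```c
#include <windows.h>
#include <wincrypt.h>
#include <stdio.h>
#pragma comment(lib, "crypt32.lib")

/* Given one entry from a parsed CRL, pull out the reasonCode
 * (szOID_CRL_REASON_CODE, 2.5.29.21) if the CA bothered to set one. */
void PrintRevocationReason(const CRL_ENTRY *entry)
{
    PCERT_EXTENSION ext = CertFindExtension(szOID_CRL_REASON_CODE,
                                            entry->cExtension,
                                            entry->rgExtension);
    if (!ext) {
        printf("no reason given (unspecified)\n");
        return;
    }
    int reason = 0;
    DWORD cb = sizeof(reason);
    if (CryptDecodeObject(X509_ASN_ENCODING, X509_ENUMERATED,
                          ext->Value.pbData, ext->Value.cbData,
                          0, &reason, &cb))
        printf("reasonCode = %d\n", reason); /* 1 = keyCompromise, 2 = cACompromise, ... */
}
```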

CA Compromise is clearly a special and exceptional case, so much so that browsers have developed their own mechanism for dealing with it (see OneCRL and CRLSets) so it’s not particularly relevant to the options available to CAs anymore, at least in the WebPKI.

Another thing to consider when looking at revocation reasons is that about 90% of all certificates in use in the WebPKI (for TLS) are considered Domain Validated (DV) certificates; this substantially limits the applicability of the remaining reasons.

Reason                DV      OV    EV
Key Compromise        Yes     Yes   Yes
Affiliation Changed   Maybe   Yes   Yes
Privilege Withdrawn   Maybe   Yes   Yes

As you can see, the available “reasons” for DV certificates are largely reduced to just “Key Compromise”. There is an argument that both “Privilege Withdrawn” and “Affiliation Changed” could be applicable to DV, but in my opinion they are a bit of a stretch. If nothing else, the difference between these two reasons in the DV case is so subtle that you will almost never see either used on a DV certificate.

So what about Key Compromise?  How useful is it? Clearly, it is an important use case, your trust in an SSL certificate is frankly only as good as your confidence in how well the key is protected!

The reality is that most key compromise cases cannot be detected, and worse yet, it is far too easy to compromise an SSL key. Take for example the Heartbleed (http://heartbleed.com/) bug from last year: SSL keys are often kept in the server process, and relatively minor coding mistakes can, as a result, potentially expose that key without users necessarily knowing about it.

Some platforms go out of their way to prevent such attacks at the expense of performance, as we did at Microsoft with SCHANNEL, where we kept the key out of process, but this is the exception and not the rule.

Unfortunately, popping a web server via some third-party code vulnerability and getting a shell in the context of the server is also far too easy. As a result, when an attacker does get a shell on a web server, one of the first things they will do is take that key so they can use it later. But again, how is this detected?

That is not to say “Key Compromise” as a revocation reason is unimportant when there is a large-scale vulnerability like Heartbleed, but this is not exactly a common case.

The other cases for the “Key Compromise” revocation reason are notably dependent on the subscriber:

  • Knowing the compromise happened,
  • Being mindful enough to notify the CA,
  • Being willing to let the world know there was a compromise.

This last point, the public signaling of “compromise”, combined with the fact that user agents treat all revocations the same (if they check at all), means we seldom see subscribers specify a reason at all when they ask for a certificate to be revoked.

As I look at why a certificate gets revoked today I think there are a few cases, these include:

  • CA has decided the certificate should be revoked,
  • The subscriber has decided the certificate should be revoked,
  • A root program has decided a certificate should be revoked.

These are far more meaningful signals to relying parties than the ones listed above, and they are not so specific that there is any negative signaling in their use.

Today it is not possible to specify these things, but maybe we should take a second look at revocation reasons and, if we continue to use them, make them more relevant to the way the WebPKI works today.

Ryan

CAs and SSL and Phishing Oh My!

NOTE: This post reflects my personal beliefs and is not necessarily those of my employer Google, or Let’s Encrypt where I am a member of their Technical Advisory Board.

Introduction

Recently Vincent from The SSL Store published a blog post calling out Let's Encrypt for issuing certificates to domains that contain the word PayPal.

The TL;DR for his post is he believes that Let’s Encrypt is enabling phishers by issuing them SSL certificates that contain the word “PayPal” and then refusing to revoke them when arbitrary third-parties ask them to.

As a result of his post, several news sources have decided to write articles about how “Let’s Encrypt” is acting as an enabler of these Phishers [1] [2].

Unfortunately, Vincent’s post and the associated articles don’t cover this in the most complete and balanced way so over my morning coffee today I decided to write this post to discuss the other side of the argument.

If this is a topic that interests you please also check out the Let’s Encrypt blog post where they talk about why they have taken this position.

Exploration

Let's explore the opportunities CAs have to check for phishing, the tools available to them, the effectiveness of those tools, the consequences of this approach, how complete a solution based on those tools would be, and what the resulting experience would be for users.

Opportunities

The WebPKI CA's role, historically, has been that of a Passport office: you present proof you control a domain, and possibly that you are an authorized member of an organization, and you get a digital certificate that attests to that.

This certificate could be valid for up to 1095 days. Once the certificate is issued the CA, largely speaking, has no natural opportunity to verify this information again. It is worth noting that this month the CABForum voted to shorten this period to 825 days.

Tools

In the event a CA determines it made a mistake in the issuance of a certificate or has been notified by the subscriber they would like to see a certificate marked invalid, the tool they have available to them is called “revocation”.

The two types of revocation under the control of a Certificate Authority are Certificate Revocation Lists (CRLs) and OCSP responses. The first is like a phonebook of all known “revoked” certificates, while the second is more like a lookup service that enables User Agents to ask about the status of a particular certificate.

Effectiveness

Earlier we discussed the lifetime of certificates. This is important to understand because the large majority of phishing sites do not start out as phishing sites; as such, issuance-time checks seldom produce positive results.

After issuance, this leaves you with periodic checks of the site, third-party reports of phishing, and reliance on revocation checking as an enforcement mechanism. This is a recipe for failure; there are a few reasons for this, but one of the more significant is the general ineffectiveness of revocation checking.

Revocation checking is the most taxing thing a CA does, because the revocation mechanisms available to them result in every relying party contacting the CA to download an OCSP response or CRL covering the certificate.

As a result, OCSP has a tendency to be both slow and unreliable. This forced browsers to implement the check as a “soft fail”: if the connection times out or fails for some reason, they assume the certificate is good.

To give that some context, about 8% of all revocation checks done by Firefox fail, and the median response time is over 200ms.

As a result, in 2012 Chrome, which is used by about 50% of all users, more-or-less disabled revocation checking except for exceptional circumstances.

What this means is that revocation checking, even for its intended purpose, is far from an effective tool. Expanding its use to include protecting users from phishers would not improve its effectiveness and arguably it would (due to the infrastructure implications) make it even less reliable.

It is also important to note that every wildcard certificate can be used for a hostname containing “PayPal” without the CA ever being made aware; a good example is https://paypal.github.io/, which is protected by a wildcard certificate issued to GitHub.

Consequences

To understand the consequences of expanding the CA's role to include protecting us from phishing, we first need to understand what a certificate represents, or more importantly, what it does not represent. It does not represent the content; it represents the host that is serving the content, and it is the content that “phishes”.

Today, in the age of cloud services, there is a good chance the host that is serving the content is a service operated by WordPress, or maybe Amazon’s S3. These services allow users to sign up and post arbitrary content for free or very little money.

If we decide that revocation checking is the right tool to get phishing content off the web, we would be saying a CA should revoke WordPress's certificate if one of its users posted something someone reported as phishing content. That would, in the situations where revocation checking takes place and happens to work, take WordPress off the Internet. Is that what we want to happen?

If so, who is it we are asking to perform this task? There are well over 400 CAs in the Microsoft Root Program; do we believe these are the right organizations to be policing the Internet for the appropriateness of content?

If so what criteria should they use to do so and what do we do if they abuse this censorship role?

Completeness

It is easy to say that a CA should not issue a certificate if it contains the word “PayPal”. I could even see an argument that those who would be hurt by such a rule, for example http://www.PayPalSucks.com and (a theoretical) PayPalantir.com, are an acceptable loss.

This would, however, not catch homoglyphs, such as when a Cyrillic “а” is used instead of the Latin “a”. Catching those would very likely require a manual review of the name and content to determine the intent of the domain owner, which is near impossible to do with any level of accuracy or fairness.

Even with that, what about ING? As one of the world's largest banks, they too are commonly phished. Should a CA be able to issue a certificate to https://www.fishing.com? And if they do, and the issuing CA receives a complaint that the site is phishing ING, what should they do?

And what about global markets and languages? In Romania there is a cleaning company called Amazon; should anyone be able to request that their website's certificate be revoked because the name contains the word Amazon?

If we promote the CA to content police, how do we do so in a complete way?

User Experience

With CAs acting as the content police, what would a user see when they encounter a revoked site? While it varies from browser to browser, the experience is almost always a blocking “interstitial”, for example:

[Screenshots: revoked-certificate interstitials in Chrome and Firefox]


If you look closely you will see these are not screens you can bypass; revoked sites are effectively removed from the Internet.

This is in contrast to Safe Browsing and SmartScreen, which were designed for this particular problem set and therefore give the user a chance to visit the site after a contextually relevant warning:

[Screenshots: Safe Browsing and SmartScreen warnings]

Conclusion

I hope you can see from the above that relying on Certificate Authorities as content police as a means of protecting users from phishers is a bad idea. At a minimum, it would be:

  • Ineffective,
  • Incomplete,
  • Unmanageable,
  • and Duplicative.

But more importantly, it would establish a large, loosely managed group as the de-facto content censors on the Internet, and as Steven Spielberg said, there is a fine line between censorship, good taste, and moral responsibility.

So what should CAs do about phishing then? It is my position that they should check the Google Safe Browsing API prior to issuance (which, by the way, Let's Encrypt does), and they should report phishers to the Safe Browsing service if they encounter any.

It is also important to answer the question about what users should do to protect themselves from phishing. I understand the desire to say there is only one indicator they need to be worried about, it’s just not realistic.

When I talk to regular users I tell them to do three things. First, use an up-to-date and modern browser that uses SmartScreen or Safe Browsing. Second, only provide data to sites you know, and only over SSL. And finally, try to only provide sites information when you are the one who initiated the exchange.

P.S.

Thanks to Vincent Lynch and the others who were kind enough to proof this post before publishing.

How did I get involved in PKI?

In the mid '90s I was a security consultant; I principally worked on authentication systems (smart cards, one-time passwords, Kerberos, PKI, etc.).

Back then the only people who cared about these things were organizations concerned with protecting lives or money. This meant most of our contracts were with governments, banks, and fortune 50s. This was an amazing experience that I would not trade for the world — it gave me the chance to work with some amazing people in some of the most paranoid and security conscious environments in the world.

While not my first exposure to PKI the first time “it was all I did” was when I worked for a company called ValiCert. The founders saw a problem:

Who was watching the certificate authorities, and who would make sure the revocation infrastructure would scale to meaningfully work in the event mis-issuances or key compromises happened?

We had developed technologies intended to address these problems. The technology looked very similar to Certificate Transparency, OCSP stapling, and certificate pinning, which are again all-the-rage these days.

Unfortunately the Certificate Authorities did not like the idea of being “watched” by a third party; the largest CA went so far as to threaten lawsuits, and modified their Relying Party Agreements to state that third parties could not redistribute any information about what certificates they had revoked or issued.

Another entity had patents they claimed covered some of our optimizations, and given that the browsers were minimally investing in this area, we did not get adequate traction, so we pivoted into other areas.

For personal reasons I ultimately ended up at Microsoft where I was responsible for a number of security technologies and one of the “little things” I ran was the Microsoft Root Program.

When this was assigned to me I was told it was the least important thing on my plate and that I could measure my success through the number of escalations we got relating to it — basically I was told to invest as little as possible to keep things quiet. The root program was a necessity but shipping software was what we were all about.

The first thing I did for the root program was review its requirements and try to understand who its participants were and what agreements we had with them. I was surprised to see there were in essence no requirements, no authoritative list of contacts at each of the organizations, and no contracts with any of its members. I felt marginally better when I found that Netscape had only one requirement: that your check for $250,000 USD cleared. The upside of that was they probably had contracts with each CA, but there were no technical or audit requirements in their program either.

To remedy this, I began to work with my awesome paralegal and lawyer on defining the first “root program” with both technical and audit requirements. We did not want to approach this as a profit center like Netscape did, but instead to establish a set of requirements that were technically sound, encouraged CAs to spend on improving their infrastructure, and had that infrastructure reviewed by others.

To this end I picked up a project begun by my predecessor to work with the American Institute of Certified Public Accountants (AICPA) to help define and adopt what is WebTrust for CAs today.

We were the first root program to adopt this new audit. I remember being interviewed by the AICPA for a video on their website on how excellent it was to work with them – they must have taken 50 cuts during that session because of my bumbling.

With these new requirements in hand, we set out to get contractual agreements with each of the CAs in which they would commit to meeting the new requirements, with clear conditions under which we could kick them out for not complying. Given this required them to make operational changes to their practices as well as to budget for and manage a third-party audit, it took a complete product release cycle to get all of this in place.

At the end of that operating system release we had an audited set of CAs and contractual agreements with each one of them. Our goal then was to get these CAs into one room so we could encourage them to adopt common issuance practices.

This was important for a number of reasons; one of the most obvious was that each of the CAs used a different taxonomy to describe what they did. The simplest example of this was that one CA's in-person-verified certificate would be called a Class 1 while another's was a Class 3.

To top things off, almost all of the CAs wanted to see the browser “chrome” differentiate between their weakly authenticated certificates and those that were strongly authenticated. This, of course, was not possible without common practices and a means of marking certificates to make clear what practices were used in the vetting of the subscriber.

The internal consensus was that there would be value to users in being able to tell the difference, so we decided to try to make this happen. To do that, we arranged to get these CAs into one room to talk about standardizing practices and certificate formats. I reached out to my contact at the AICPA and asked him to work with me to arrange what was the very first gathering of publicly trusted CAs and trust store providers. We met in Washington, DC, because I felt we could leverage the work done by the US Government to accelerate the standardization of these things.

Unfortunately, one of the newest CAs, who only issued low-assurance certificates, saw adopting common standards for vetting and labeling as a risk to their business, and as a result they threw a wrench in my plan. They filed a claim with the FTC alleging the event was an attempt to create an anti-competitive marketplace, and as a result I was deposed by the DOJ. Ultimately the issue was closed, and I understand the disposition was that the claim was baseless.

At this point I was instructed by management and our legal counsel to stop pushing for this standardization, as it represented too much legal risk for the company.

As an aside, a few months later the largest CA acquired the troublemaker.

About a year and a half later the CAs self-organized and attempted to agree on a smaller piece of standardization: the definition of what is called Extended Validation today. This was effectively a new label for what most CAs were already offering in their “high assurance” certificates. The CABForum was now born.

At this point I had moved on to another team at Microsoft. During my time at Microsoft I worked on a number of very cool projects with some great people. Several of the projects I worked on used PKI, but my involvement with the industry was much more peripheral at that point.

Years later I decided to leave Microsoft; the DigiNotar incident was a big contributor to this decision. I felt that the industry was a mess: CAs were under-investing in their infrastructure, not supporting the open source community they were dependent on, and not actively working to improve the adoption of SSL. I wanted to change this, and I had decided I would start my own Certificate Authority to set an example for the industry on how a CA should approach these things.

This is when GlobalSign approached me and asked me to join as their CTO. I really liked the team; they were principled, hard-working, and looking to change the way things were done. I spent nearly three years in this role and we accomplished a great deal. I still work with them on technical research and direction, but I have since moved on to a startup doing work on Bitcoin-related technologies.

I did not accomplish all of the things I wanted to, but I still have hope that these systemic issues will be resolved, as I do believe trusted third parties are needed on the Internet.

Anyway this is how I got into PKI.

Ryan

Advanced Troubleshooting of Certificate Validation Related Problems on Windows Part 2

One of the most complicated things to troubleshoot in X.509 is failures related to Name Constraints handling. There are a few ways to approach this, but one of the easiest is to use the Extended Error Information in the Certificate Viewer.

Let's walk through an exercise so you can try using this on your own. First, download this script; if you extract its contents and run makepki.bat, you get a multi-level PKI that looks something like this:

clip_image002

Before we begin, it's useful to understand that Name Constraints are applied at the Issuing CA level. In this case the constraints say that the CA is authoritative for any DNS or RFC 822 name within the example.com domain; they also allow a specific base distinguished name (DN).

In our example these constraints are applied by the “Test Partner CA” onto the “Customer CA”; you can see the restrictions in its certificate:

clip_image003

This script creates two certificates, one for email (user.cer) and another for SSL (www.cer). Normally the script makes both of these certificates fall within the example.com domain space, but for the purposes of this post I have modified the openssl.cfg to put the email certificate in the acme.com domain. The idea here is that since this domain is not included in the constraints, an RFC 3280-compliant chain engine will reject it, giving us an error to diagnose.

Normally we don't already know what the failure associated with a certificate chain is; after all, that's why we are debugging it. The process we use here will help us figure out the answer, since applications are notorious for not sharing why a certificate chain was rejected.

Since we can't rely on the applications, one way to figure this out on our own is to look at the chain outside of the application, in the Windows Certificate Viewer, to see if we can tell what the issue is there.

Let’s start by looking at the “good” certificate:

clip_image005

What’s important here is that we see no errors, but how do we know that? Well let’s look at the bad certificate and see what the difference is:

clip_image007

The certificate clearly has a problem, and Windows has done a decent job of telling us what the problem was: there is a name in this certificate that is inconsistent with the Name Constraints associated with this certificate chain. The question is, which name?

To answer that question we need to look at the Name Constraints. Remember that these live in the CA certificates; they can be in any or all of them, which means we first need to figure out where the constraints are.

To do that we start at the “Certification Path”; here you see something like this:

clip_image008

Notice the little yellow flag; that's the certificate viewer saying this is where the problem is. Let's take a look at that certificate:

clip_image009

But wait! There isn't a problem with this one. That's because this certificate is actually fine; it's the certificate it issued that is the problem. Here is the non-obvious part: let's look at the Details tab:

clip_image010

Here we see Windows telling us exactly what the problem was with the certificate this CA issued: it is the email address and its inclusion of the acme.com domain.

Remember that it was restricted, via the Name Constraints extension, to the example.com domain when issuing email certificates. The chain engine would be non-compliant with the associated RFC if it trusted this certificate.

But why did we see this Extended Error Information on this certificate and not the leaf itself? This is actually by design and makes sense when you think about it. Name Constraints can exist in each issuing CA certificate in the chain, and what CryptoAPI is telling us here is which issuer certificate is the offending actor.

This Extended Error Information is available in the Certificate Viewer in a number of cases (for example, Revocation), though most of the interesting cases require the application to launch the certificate viewer so that you get to view the certificate chain (and all of the associated state) it was using when it encountered the problem.
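
If you need the same answer programmatically rather than through the viewer, the chain's per-element trust status exposes it. A minimal sketch (error handling and usage flags omitted):

```c
#include <windows.h>
#include <wincrypt.h>
#include <stdio.h>
#pragma comment(lib, "crypt32.lib")

/* Build a chain for pCert and dump each element's trust status so you
 * can see which certificate tripped the name-constraints processing. */
void DumpChainStatus(PCCERT_CONTEXT pCert)
{
    CERT_CHAIN_PARA para = { sizeof(para) };
    PCCERT_CHAIN_CONTEXT pChain = NULL;
    if (!CertGetCertificateChain(NULL, pCert, NULL, NULL, &para, 0, NULL, &pChain))
        return;

    PCERT_SIMPLE_CHAIN pSimple = pChain->rgpChain[0];
    for (DWORD i = 0; i < pSimple->cElement; i++) {
        DWORD err = pSimple->rgpElement[i]->TrustStatus.dwErrorStatus;
        printf("element %lu: dwErrorStatus=0x%08lx\n", i, err);
        if (err & CERT_TRUST_HAS_NOT_PERMITTED_NAME_CONSTRAINT)
            printf("  -> a name is outside the permitted subtrees\n");
        if (err & CERT_TRUST_HAS_EXCLUDED_NAME_CONSTRAINT)
            printf("  -> a name falls within an excluded subtree\n");
        if (err & CERT_TRUST_INVALID_NAME_CONSTRAINTS)
            printf("  -> the name constraints could not be satisfied\n");
    }
    CertFreeCertificateChain(pChain);
}
```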

Advanced Troubleshooting of Certificate Validation Related Problems on Windows Part 1

The Windows platform for validating X.509 certificates has a feature I don't see many people discuss: its robust logging subsystem.

This allows non-developers (and developers) to get insight into how applications interact with CryptoAPI 2 and, to some degree, what is happening inside those APIs.

To use this feature you must first enable it. The easiest way to do that is via the Event Viewer management console (eventvwr.msc); once in there, you must navigate to the CryptoAPI 2 (CAPI2) node of the viewer:

clip_image001[6] clip_image002[6] clip_image003[6]

 

Once you get there you select Properties on the operational log, which will give you a dialog that looks something like this:

clip_image007

 

Once you “Enable Logging” and “Apply” the changes, all calls to CryptoAPI will immediately be logged. CryptoAPI is used all the time, so by the time you close that dialog you will have some events you can look at:

clip_image008
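
As an aside, if you want to flip this switch from code (say, in a test harness) rather than through the UI, the Windows Event Log configuration API can do it; a sketch, assuming the standard channel name on Vista and later:

```c
#include <windows.h>
#include <winevt.h>
#pragma comment(lib, "wevtapi.lib")

/* Enable the CAPI2 operational log programmatically, the equivalent of
 * clicking "Enable Logging" in the Event Viewer dialog above. */
BOOL EnableCapi2Log(void)
{
    EVT_HANDLE hCfg = EvtOpenChannelConfig(
        NULL, L"Microsoft-Windows-CAPI2/Operational", 0);
    if (!hCfg)
        return FALSE;

    EVT_VARIANT enable = { 0 };
    enable.Type = EvtVarTypeBoolean;
    enable.BooleanVal = TRUE;

    BOOL ok = EvtSetChannelConfigProperty(hCfg, EvtChannelConfigEnabled, 0, &enable)
           && EvtSaveChannelConfig(hCfg, 0);
    EvtClose(hCfg);
    return ok;
}
```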

 

The high-level view of each event doesn't tell you much. For example, in the above picture we really only know that the action being performed was a “Build Chain”; this particular event corresponds to a call to CertGetCertificateChain.

To really understand what is going on you need to look at the Details tab:

clip_image009

 

Here we can see what was passed into that API call: which certificates the calling application passed in, and what settings it chose when making the call:

clip_image011

 

Some other things you can see in this particular API call include what the disposition of the call was and of course what certificate chain was built.

Every major API in CryptoAPI has logging similar to this, you get to see what is passed in and what came out. Additionally major “objects” are sometimes logged as well, for example here is an event showing a certificate that CryptoAPI was operating against:

clip_image012

 

Another very useful API to be able to look at is CertVerifyRevocation; it is logged in this fashion:

clip_image013

 

Some of these events include references to temporary files, for example in this case you see:

clip_image014

This is a CRL for the “GlobalSign PersonalSign 1 CA – G2”; it's stored in the Time Valid Object Cache (TvoCache), which you can look at with the “certutil –urlcache” command.

 

The events will all tell you which application was the caller and if they are part of a sequence of tasks you get enough information to correlate them and put them in order.

clip_image015

 

I won’t go through all of the events but as you can see this is super valuable when trying to figure out – Why did that application do that?!

As you might imagine this logging can slow things down and produces a bunch of data so be sure to turn it off when you are done.

What is the status of revocation checking in browsers?

Today we announced some work we have been doing with CloudFlare to speed up SSL for all of our customers through improvements to our revocation infrastructure.

One of the things that comes up when talking about this is how each of the browsers handles revocation checking, so I thought it might be useful to put together a quick post on the topic to clear up some confusion.

The first thing that's important to understand is that all major browsers do some form of revocation checking; that includes Opera, Safari, Chrome, Firefox, and Internet Explorer.

Let's talk about what that means. The IETF standards for X.509 certificates define three ways for revocation checking to be done: the first is Certificate Revocation Lists (CRLs), next there is the Online Certificate Status Protocol (OCSP), and finally there is the Server-Based Certificate Validation Protocol (SCVP).

In the context of browsers we can ignore SCVP, as no browser implements it; this leaves us with CRLs and OCSP as the standards-compliant ways of doing revocation checking.

All of the above browsers support these mechanisms. In addition to these standard mechanisms, Google has defined a proprietary certificate revocation checking scheme called CRLSets.

If we look at StatCounter for browser market share, that means today at least 64.84% (it's likely more) of the browsers out there are doing revocation checking based on either OCSP or CRLs by default.

This means that when a user visits a website protected with SSL, the browser has to do at least one DNS lookup, open one TCP socket, and complete one HTTP transaction to validate the certificate the web server presents, and more likely several of these.

This is important because of the way revocation checking needs to be done: you need to know if the server you are talking to really is who they say they are before you start to trust them. That is why, when browsers do OCSP and CRL checks, they do this validation before they download the content from the web page.

This means that your content won't be displayed to the user until this check happens, and this can take quite a while.

For example, in the case of IE and Chrome (when it does standards-based revocation checking, on Windows), CryptoAPI is used, and it will time out after 15 seconds of attempting to check the status of a certificate.

The scary part is that calls to this API do actually time out, and when they do, that delay is experienced by the users of your website!
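
Applications that call CryptoAPI directly don't have to live with that default, by the way; CERT_CHAIN_PARA exposes a per-URL retrieval timeout. A sketch (assuming the extra fields are enabled via CERT_CHAIN_PARA_HAS_EXTRA_FIELDS):

```c
#define CERT_CHAIN_PARA_HAS_EXTRA_FIELDS
#include <windows.h>
#include <wincrypt.h>
#pragma comment(lib, "crypt32.lib")

/* Build a chain with online revocation checking, but cap the time
 * CryptoAPI may spend fetching any one CRL or OCSP response at two
 * seconds instead of the default. Caller frees the chain context
 * with CertFreeCertificateChain. */
PCCERT_CHAIN_CONTEXT BuildChainWithTimeout(PCCERT_CONTEXT pCert)
{
    CERT_CHAIN_PARA para = { sizeof(para) };
    para.dwUrlRetrievalTimeout = 2000; /* milliseconds */

    PCCERT_CHAIN_CONTEXT pChain = NULL;
    if (!CertGetCertificateChain(NULL, pCert, NULL, NULL, &para,
            CERT_CHAIN_REVOCATION_CHECK_CHAIN_EXCLUDE_ROOT, NULL, &pChain))
        return NULL;
    return pChain;
}
```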

So what can you do about it? It's simple, really: you have to be mindful of the operational capacity and performance of the certificate authority you get your certificate from.

Check out this monitoring portal I maintain for OCSP and this one I maintain for CRLs; you will see GlobalSign consistently outperforms every other CA in the performance of its revocation infrastructure. In most cases it's nearly 6x as fast, and in others much more than that.

The other thing to understand is that today the default behavior of these browsers when checking the status of a certificate via OCSP or CRLs is to do what is often referred to as a “soft-revocation failure”.

This basically means that if they fail for any reason to check the status of a certificate (usually due to performance or reliability issues) they will treat the certificate as good anyways. This is an artifact of CAs not operating sufficiently performant and reliable infrastructure to allow the browsers to treat network related failures critically.

Each of these browsers has options you can use to enable “hard” or “strict” revocation checking, but until the top CAs operate infrastructure that meets the performance and reliability requirements of the modern web, no browser will make these the default.

Finally, it's also important to understand that even with this “soft-failure” your website experiences the performance cost of doing these checks.

It’s my belief that the changes we have put into place in our own infrastructure meet that bar and I hope the other CAs follow in our lead as it is in the best interest of the Internet.

Ryan

Serving OCSP POST responses on a CDN

The other day I did a blog post on how we are using a CDN to front our OCSP services; the CDN we are using is CloudFlare, who is one of our partners.

In that blog post I mentioned that POST requests from an OCSP client would normally be a cache miss for a CDN, and that this introduces additional latency in serving these responses.

Even with this additional latency the response times we were getting were still acceptable, but we wanted to do better, so a few weeks ago I went to the CloudFlare office and worked with them on making their services OCSP-aware.

Specifically, we made it so that when they get an OCSP POST they can determine which cached response to return (for example, one from a prior GET) instead of going back to our responders to have them do that.

As of today this change has gone live. As you know, I love numbers, which is why I have been publishing these repository performance numbers. What you will see if you look at those is that our worldwide average is right around 100ms; if we take out China and Australia, that figure drops to around 60ms.

The problem with these numbers is they only test the GET variant of the protocol; that is because neither Pingdom nor Monitis lets me simulate binary POSTs (which is what the OCSP POST variant looks like).
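
For context on why GET is the cache-friendly variant: RFC 5019 has the client put the base64 of the DER-encoded request into the URL path, so the URL itself can serve as the CDN's cache key. A sketch of forming such a URL with CryptoAPI's base64 helper (the additional URL-escaping of '+', '/' and '=' required by the RFC is omitted):

```c
#include <windows.h>
#include <wincrypt.h>
#include <stdio.h>
#pragma comment(lib, "crypt32.lib")

/* Form the RFC 5019 GET URL for a DER-encoded OCSP request: the base64
 * of the request is appended to the responder URL. NOTE: production
 * code must also URL-escape the base64 characters '+', '/' and '='. */
void PrintOcspGetUrl(const BYTE *pbDer, DWORD cbDer, const char *responderUrl)
{
    char b64[2048];
    DWORD cch = sizeof(b64);
    if (CryptBinaryToStringA(pbDer, cbDer,
                             CRYPT_STRING_BASE64 | CRYPT_STRING_NOCRLF,
                             b64, &cch))
        printf("%s/%s\n", responderUrl, b64);
}
```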

With that said, we can extrapolate what the numbers look like fairly easily; from my current network location (which is slow), this is what I see:

[Screenshots: response times for 15 POST requests and 15 GET requests against the CDN-fronted responder]

What I want you to notice here is that over 15 requests, when I do a POST to the CDN instance of our responder, I get the same average response time as when I perform the same test with the GET variant.

This tells us that POST is performing exactly the same as GET, from which we can safely say that in our case the performance numbers I have been publishing for GET are also accurate for POSTs.

You will also see that our own responder is slower to serve via POST; this is because it is designed around nonced OCSP requests and as such isn't optimized as much as it could be for caching them.

I should also note that our decision to put a CDN in front of our OCSP responder does not break clients that want to send nonced requests (no browsers do by default, by the way); these are simply treated as a cache miss.

Ryan

How to seed the CryptoAPI URL cache with an OCSP response

It is possible to “staple” an OCSP response into higher-level protocols such as TLS. This concept has been supported in Windows since Windows Vista; shortly after, it was added to OpenSSL/Apache, and soon it will also be in Nginx.

When “stapling” is used, the subscriber (the web server, in the TLS case) requests the status of its own certificate from the OCSP responder its CA operates, to get a time-valid OCSP response for its own certificate.

Since the OCSP response is signed by the CA, it can be relayed by the web server for the duration in which that OCSP response is time-valid, saving the web server's client (the relying party, aka the browser) the need to make an additional socket connection back to the CA.

This has both performance and privacy benefits.

You can apply the same concept to other PKI-related protocols and applications as well, for example in a document signing application like Adobe Acrobat. In such an application the subscriber might sign the document, timestamp it using a timestamp protocol like RFC 3161, and then attach a time-valid OCSP response to it so that the document is verifiable at a later date.

This scenario is important not only for its performance and privacy benefits but also as a practical necessity, because CAs do not typically maintain revocation information (OCSP responses and CRLs) for expired certificates.

By stapling the OCSP response to the signed and time-stamped document the relying party can verify the signature, the certificates and the revocation status of the certificates in the context of the timestamp that was attached to the document.

But how do you do that with CryptoAPI?

It's actually pretty straightforward. As the relying party, when you call CertGetCertificateChain to validate the associated certificate, you need to:

  1. Once you have verified your timestamp is cryptographically valid and trusted, take its time and pass it in as pTime.
  2. Attach the DER-encoded basic signed OCSP response to the CERT_CONTEXT by setting its CERT_OCSP_RESPONSE_PROP_ID property (this is a context property, set via CertSetCertificateContextProperty, rather than a certificate extension).

When CryptoAPI does the chain validation it will try to use the OCSP response you passed in; if it finds a problem with the provided response, it may go online to get a new one that is “OK”.

This online behavior can be controlled by indicating to CertGetCertificateChain you do not want online revocation checking (see CERT_CHAIN_REVOCATION_CHECK_CACHE_ONLY).
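
Putting those pieces together, a minimal sketch might look like the following; it assumes, per the steps above, that the response is attached with CertSetCertificateContextProperty as a CRYPT_DATA_BLOB (error handling omitted):

```c
#include <windows.h>
#include <wincrypt.h>
#pragma comment(lib, "crypt32.lib")

/* Attach a DER-encoded OCSP response to a certificate context and
 * validate the chain as of the (already verified) timestamp time,
 * allowing only cached -- including the just-seeded -- revocation data. */
BOOL ValidateWithStapledOcsp(PCCERT_CONTEXT pCert,
                             BYTE *pbOcsp, DWORD cbOcsp,
                             FILETIME *pftTimestamp)
{
    /* Step 2: seed the context with the stapled response. */
    CRYPT_DATA_BLOB blob = { cbOcsp, pbOcsp };
    if (!CertSetCertificateContextProperty(pCert, CERT_OCSP_RESPONSE_PROP_ID,
                                           0, &blob))
        return FALSE;

    /* Step 1: build the chain as of the timestamp (pTime), and keep
     * CryptoAPI from going online for fresh revocation data. */
    CERT_CHAIN_PARA para = { sizeof(para) };
    PCCERT_CHAIN_CONTEXT pChain = NULL;
    BOOL ok = CertGetCertificateChain(NULL, pCert, pftTimestamp, NULL, &para,
                  CERT_CHAIN_REVOCATION_CHECK_CHAIN |
                  CERT_CHAIN_REVOCATION_CHECK_CACHE_ONLY,
                  NULL, &pChain);
    if (ok)
        ok = (pChain->TrustStatus.dwErrorStatus == CERT_TRUST_NO_ERROR);
    if (pChain)
        CertFreeCertificateChain(pChain);
    return ok;
}
```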

Ryan