Tag Archives: X509

WebPKI, TLS, cross-signs, device compatibility, and TLS record size.

Both the Chrome and Mozilla root program have signaled the intent to substantially shorten the time we rely on roots in the WebPKI. I believe this to be a good objective but I am struggling to get my head around the viability and implications of the change.

In this post, I wanted to capture my inner dialog as I try to do just that 🙂

I figure a good place to start when exploring this direction is why this change is a good idea. The simplest answer to this question is probably that root certificates are really just “key bags” — by that I mean they are just a convenient way to distribute asymmetric keys to clients. 

Asymmetric keys have an effective lifetime, this means the security properties they offer start to decay from the moment the key is created. This is because the longer the key exists the more time an attacker has had to guess the corresponding private key from its public key, or worse compromise the private key in some way. For this reason, there are numerous sources of guidance on usage periods for asymmetric keys, for example, NIST guidance from 2020 says that a 2048 key is usable between 2019 and 2030.

There is also the question of ecosystem agility to consider. Alexander Pope once said To Err is Human, to forgive divine. This may actually be the most important quote when looking at computer security. After all, security practitioners know that all systems will have security breaches, this is why mature organizations focus so much on detection and response. I believe the same thing is needed for ecosystem management, the corollary, in this case, is how quickly you can respond to unexpected changes in the ecosystem.

Basically having short-lived root keys is good for security and helps ensure the ecosystem does not become calcified and overly dependent on very old keys continuing to exist by forcing them to change regularly.

With that, all said, if it were easy we would already be doing it! So what are the challenges that make it hard to get to this ideal? One of the hairiest is the topic of the Internet of Things (IOT). Though I would be the first to admit there is no one size fits all answer here, broadly these devices should not use the WebPKI.

In 2022, the market for the Internet of Things is expected to grow 18% to 14.4 billion active connections. It is expected that by 2025, as supply constraints ease and growth further accelerates, there will be approximately 27 billion connected IoT devices. 

IOT-Analytics.com

A big reason for this is that these devices last a very long time compared to their mobile phone and desktop counterparts. They also, unfortunately, tend not to have managed root stores and if they do, they do it as part of firmware updates and have limited attach rates of these updates. To make things worse these devices take crude simplifying assumptions around what cryptographic algorithms are supported and what root CAs will be used by servers in the future.

But for most IoT applications, like those enabling the smart city, the device life cycle is 10 to 20 years or more. For instance, it doesn’t make sense to replace the wireless module in smart streetlights every few years.

Ingenu

A good topical example of the consequences here might be the payment industry making a decision in the 90s to adopt a single WebPKI CA  (a VeriSign-owned and operated root certificate) in payment terminals without having a strategy for updating the device’s root stores.  These terminals communicated to web servers that were processing payments. In short, the servers had to use certificates from this one WebPKI Root CA or no payment terminals could reach them.

This design decision was made during a time when SHA-1 was considered “secure”. The thing is cryptographic attacks are always evolving, and since 2005 SHA-1 has not been considered secure. As a result, browsers moved to prevent the use of this algorithm. 

This meant that one of two things would happen, either browsers would break payment terminals by preventing WebPKI CAs from using this algorithm, or payment terminals’ lack of an update strategy would put the internet at risk by slowing the migration from SHA-1. Unfortunately, the answer was the latter.

In short, when you have no update story for roots and you rely on the WebPKI you are essentially putting both your product and the internet at risk. In this case, while a root and firmware update story should have existed regardless, the payment industry should have had a dedicated private PKI for these use cases.

OK, so what does this have to do with moving to a shorter root validity period in Browser root programs? Well, when a TLS server chooses what certificate to present to a client it doesn’t know much more than the IP address of the client. This means it can’t selectively choose a TLS certificate based on it being an IOT device, a phone running an old version of a browser, or a desktop browser.

This is where the concept of “root ubiquity” and what those in the industry sometimes called “root baking”. Some root programs process requests for new root inclusion very quickly, maybe within a year your root can become a member of their program but that isn’t enough.  You need the very large majority of the devices that rely on that root program to pick up the new root before the devices will trust it.

In the case of Microsoft Windows, the distribution of a root certificate happens via a feature called AutoRoot Update. The uptake for consumer devices of a new root is very high as this is seldom turned off and the URL that is used to serve this update is seldom blocked. Enterprises and data centers are another matter altogether though. While I do not have any data on what the split is, anecdotally I can say it’s still hit or miss if a root has been distributed in this system as a result of this.

Chrome on the other hand uses a configuration management system for all Chrome that also happens to control what roots are trusted. Again while I have no data I can point out anecdotally it appears this works very well and since chrome updates all browsers automatically they tend to have the latest binaries and settings.

Safari on the other hand manages the roots in the firmware/operating system update. This means if a user does not update their devices’ software the roots will never be present even if you are a member of the root program.

These are just a few examples of some of the root programs CAs have to worry about, and the problems they have to consider.  

What this means is the slowest root program in the WebPKI to accept and distribute roots holds back the adoption of new CAs.  

As an aside, It is worth noting that around 59% of all browser traffic (not including IoT devices) is mobile. This includes a lot of developing country traffic where phones are kept longer and replacement models are often still old models with older software and root stores that are not updated. This is before you get to TVs, printers, medical devices, etc that seldom do forced firmware updates and never manage roots independently from firmware.

As a result of the above, these days I generally tell people that it takes 5-7 years post inclusion to get sufficient ubiquity in a root to be able to rely on it for a popular service as a result.

A CA will typically try to address this problem by acquiring what is known as a cross-sign (see Unobtainium), there are two recent examples of this I can point to. One is the cross-sign for Let’s Encrypt and the other is the cross-sign for Google Trust Services. The problem is most CAs don’t want to help other CAs with cross-signs for obvious reasons. Beyond that increasingly there are fewer and fewer of the “old keys” still in use on the web that is baked into the oldest devices meaning there are not a lot of choices for those cross signs even if you can get one.

Let’s assume for the purpose of this (now long) post that you are able to get a cross sign either from a competitor or because you happen to have custody of older key material that was grandfathered into the root program that has the needed ubiquity.

In the simplest case, you get to a situation where you are asking those that use your certificates to include the two CA certificates in their TLS bundle.  At a minimum, we start to pay a performance penalty for doing this.

As an example consider that each certificate is about 1.5KB in size, by adding this new certificate to the bundle every fresh TLS negotiation will carry this tax.  If we assume a typical chain is normally 3 in length that makes our new total 4.5kb in overhead. It’s not a large figure but if you consider a high-traffic site like Amazon it adds up quickly.

How many purchases are made on Amazon daily? On a daily basis, Amazon ships more than 66,000 orders per hour in the US. About 1.6 million packages are shipped on a daily basis.

capitalcounselor.com

If we assume each of Amazon’s orders equates to a single fresh TLS negotiation (it should be many more than that since many servers power the site) that is a daily increase of 3.6GB of traffic. Now Amazon can deal with that with no problem but it would slow down the experience for users, and potentially cost those users without unlimited internet plans some of their allotted bandwidth. There is also the potential of fragmentation as the TLS packets get larger which increases the latency for the user.

Don’t get me wrong, these are not deal breakers but they are taxes that this cross sign approach represents. 

My fear, and to be clear I’ve not completed the thought process yet, the above story creates a situation where CAs need to provide multiple cross signs to support all the various device combinations that are out there in this new world.

If so this scares me for a few reasons, off the top of my head these include:

  1. It is already hard enough to get a single cross sign doing so multiple times will be that much harder,
  2. It will put new entrants at a disadvantage because they will not have legacy key material to rely on nor the high capital requirements to secure cross signs if commercial terms can even be reached allowing them to get one,
  3. We have spent the last decade making TLS the default and we are close to being able to declare victory on this journey if TLS becomes less reliable we may lose ground on this journey,
  4. It is a sort of regressive taxation on people with older and slower devices that are likely already slow and paying for data on a usage basis.
  5. The additional data required will have a negative impact on Time to First Paint (TTFP) in that this extra data has to be exchanged regardless.

Mom always said don’t complain if you don’t have a proposal to make things better, so I figure I should try to propose some alternatives. Before I do though I want to reiterate that this post isn’t a complaint, instead it’s just a representation of my inner dialog on this topic. 

OK, so what might be a better path for us getting to this world of shorter-lived roots? I guess I see a few problems that lead to the above constraints

  1. Even in the browser ecosystem, we need to see root lists updated dynamically,
  2. Inclusion into a root store should take months for your initial inclusion and later updates should happen in weeks,
  3. Browsers should publish data to help CAs understand how ubiquitous their root distributions are so they can assess if cross signs are even necessary,
  4. There needs to be some kind of public guidance to IoT devices on what they should be doing for the usage of certificates in their devices and this needs to include root management strategies.
  5. IOT devices need to stop using the WebPKI, it is increasingly the Desktop and Browser PKI and anything else will get squashed like a bug in the future if they are not careful 🙂
  6. There should be a Capability and Maturity Model for root programs and associated update mechanisms that can be used to drive change in the existing programs 
  7. CAs should stop using one issuing hierarchy for all cases and divide up the hierarchies into as narrow of slices as possible to reduce the need for one root to be trusted in all scenarios.
  8. Root programs should allow and go so far as to encourage CAs to have more than the 3 (typical) roots they allow now to support CAs in doing that segmentation 
  9. WebPKI root programs should operate two root programs, the legacy one with all of its challenges and the new one focused on agility, automation, and very narrow use cases. Then they should use the existence of that program to drive others to the adoption of those narrower use cases.
  10. And I guess finally we have a chance with the Post Quantium TLS discussions to look at creating an entirely new WebPKI (possibly without X.509!) that is more agile and narrow from the get-go.

Why crawling is not an adequate measurement methodology for the WebPKI

The answer is simple — It’s an incomplete view of the use of the WebPKI.

There are a number of different methodologies a web crawler-based approach might take in measuring the size of the WebPKI. The most naive approach would be to simply scan all IPv4 address space and log all of the certificates you see during this scan.

The problem is that this only shows a small fraction of the certificates that are out there. When you connect to an IP address and the associated web server doesn’t know what host you are trying to connect to it will return its “default” website and use the associated certificate.

That same IP address may literally be responsible for serving millions of sites based on the client’s indicated hostname. With this IP-based enumeration approach at best you would get one certificate from that host, at worse you wouldn’t even get that because some servers are not configured with a default site. This is just one problem with this approach there are many more.

Though most WebPKI market share reports do not document their methodology anecdotally it appears most work on this crawler approach and at least historically some have taken periodic drops from CAs to make their view “more complete”.

Today though the only way to measure CA market share that should be used is by relying on the pre-certificate counts in Certificate Transparency logs.

Why bother with short-lived certificates and keys in TLS?

There seems to be a lot of confusion and misinformation about the idea of short-lived certificates and keys so I thought I would pen some thoughts about the topic in the hope of providing some clarification.

I have seen some argue the rationale behind short-lived certificates is to address the shortcoming in the CA and browser revocation infrastructure, I would argue this is not the case at all.

In my mind, the main reason for them is to address the issue of key compromise. Long-lived keys have a long period in which they are exposed to theft (see Heartbleed) and therefor are a higher value to an attacker since a stolen key enables the attacker to impersonate the associated website for that period.

The most important thing to keep in mind is that the nature of key compromise is such that you almost never know it has happened until it is too late (consider Diginotar as an extreme example).

The importance of protecting the SSL private key is why in the 90s when SSL deployment started in earnest, large companies tried to deploy SSL in their environment using SSL “accelerators” and “security modules” that protected the keys. This was the “right technical” thing to do not not the right “practical thing” because it significantly reduced the scalability of SSL protected services and at the same time massively increased the cost to deploy SSL.

By the mid 00’s we recognized this was not workable and the use of accelerators basically stopped outside a few edge cases and software keys were used instead. Some implementations, like the Microsoft SCHANNEL implementation, tried to protect the keys by moving them to a separate process mitigating the risk of theft to some degree. Others simply loaded the keys into the web servers process and as a result exposed them to Heartbleed like attacks.

What’s important here is that when this shift happened no one pushed meaningful changes to the maximum validity period of certificates and by association their private keys. This meant that we had software keys exposed to theft for five years (max validity period at the time) with no reliable way to detect they were stolen and in the unlikely event we did find out of the theft we would rely on the unreliable CA revocation infrastructure to communicate that issue to User Agents — clearly this was not an ideal plan.

In short, by significantly reducing the validity of the certificate and keys, we also significantly reduce the value to the attacker.

Another issue that short-lived certificates help mitigate is the evolution of the WebPKI, with a long-lived certificate you get virtually no security benefits of policy and technology improvements until the old certificates and keys are expunged from the ecosystem. Today this is rollout of new policy largely accomplished via “natural expiration” which means you have to wait until the last certificate that was issued under the old policies expires before absolute enforcement is possible.

So what is the ideal certificate validity period then? I don’t think there is a one-size-fits-all answer to that question. The best I can offer is:

As short as possible but no shorter.

With some systems it is not possible to deploy automation and until it is one needs to pick a validity period that is short enough to mitigate the key compromise and policy risks I discussed while long enough to make management practical.

It is probably easier to answer the question what is the shortest validity period we can reliably use. The answer to that question is buried in the clock skew of the relying parties. Chrome recently released a new clock synchronization feature that significantly reduces errors related certificate validity periods. But until that is fully deployed and other UAs adopt similar solutions you are probably best to keep certificate validity periods at 30 days to accommodate skew and potential renewal failures.

In-short, long-lived keys for SSL exist because we never re-visited the threat model of key compromise when we stopped using hardware-protected keys in our SSL deployments and short-lived keys help deal with the modern reality of SSL deployments.

Ryan

– I have added a few pictures for fun and made some minor text changes for clarity.

CAs and SSL and Phishing Oh My!

 

 

 

 

 

 

NOTE: This post reflects my personal beliefs and is not necessarily those of my employer Google, or Let’s Encrypt where I am a member of their Technical Advisory Board.

Introduction

Recently Vincent from The SSL Store published a blog post calling out Let’s Encrypt for issuing certificates to domains that contain the world PayPal.

The TL;DR for his post is he believes that Let’s Encrypt is enabling phishers by issuing them SSL certificates that contain the word “PayPal” and then refusing to revoke them when arbitrary third-parties ask them to.

As a result of his post, several news sources have decided to write articles about how “Let’s Encrypt” is acting as an enabler of these Phishers [1] [2].

Unfortunately, Vincent’s post and the associated articles don’t cover this in the most complete and balanced way so over my morning coffee today I decided to write this post to discuss the other side of the argument.

If this is a topic that interests you please also check out the Let’s Encrypt blog post where they talk about why they have taken this position.

Exploration

Let’s explore the opportunities CAs have to check for phishing, the tools they have available to them, the effectiveness of those tools, the consequences of this approach, how complete a solution based on the tools available to them would be and what the resulting experience would be for users.

Opportunities

The WebPKI’s CAs role, historically, has been that of a Passport office, you present proof you control a domain, and possibly that you are an authorized member of an organization and you get a digital certificate that attests to that.

This certificate could be valid for up to 1095 days. Once the certificate is issued the CA, largely speaking, has no natural opportunity to verify this information again. It is worth noting that this month the CABForum voted to shorten this period to 825 days.

Tools

In the event a CA determines it made a mistake in the issuance of a certificate or has been notified by the subscriber they would like to see a certificate marked invalid, the tool they have available to them is called “revocation”.

The two types of revocation that are under the control of a Certificate Authority are called Certificate Revocation Lists and OCSP responses. The first is a like a phonebook of all known “revoked” certificates while the last is more like a lookup that it enables User Agents to ask the status of a particular set of certificates.

Effectiveness

Earlier we discussed the lifetime of certificates, this is important to understand because the large majority of phishing sites do not start out as Phishing sites, as such issuance time checks seldom net positive results.

After issuance, this leaves you with periodic checks of the site,  third-party reports of phishing and relying on revocation checking as an enforcement mechanism. This is a recipie for failure, there are a few reasons for this, but one of the more significant is the general ineffectiveness of revocation checking.

Revocation checking is the most taxing thing a CA does. This is because the revocation mechanisms available to them will result in every relying party contacting them to download a OCSP response or CRL covering that certificate.

As a result, OCSP has a tendency to be both slow and unreliable. This forced browsers to implement this check as a “soft fail”, in other words, if the connection times out or fails for some reason they assume the certificate as good.

To give that some context about 8% of all revocation checks done by Firefox fail and the median response time is over 200ms.

As a result of this in 2012 Chrome, which is used by about 50% of all users, more-or-less disabled revocation checking except for exceptional circumstances.

What this means is that revocation checking, even for its intended purpose, is far from an effective tool. Expanding its use to include protecting users from phishers would not improve its effectiveness and arguably it would (due to the infrastructure implications) make it even less reliable.

It is also important to note that every wildcard certificate can be used for a hostname containing “PayPal” without the CA ever being made aware, a good example is https://paypal.github.io/ which is protected by a wildcard certificate issued to Github.

Consequences

To understand the consequences of expanding the CAs role include protecting us from phishing we first need to understand what a certificate represents, or more importantly what it does not represent. It does not represent the content, it represents the host that is serving content and it is the content that “phishes”.

Today, in the age of cloud services, there is a good chance the host that is serving the content is a service operated by WordPress, or maybe Amazon’s S3. These services allow users to sign up and post arbitrary content for free or very little money.

If we decide that revocation checking is the right tool to get phishing content off the web we would be saying a CA should revoke WordPress’s certificate if one of it’s users posted something someone reported as phishing content. That would, for the situations where revocation checking takes place and happens to work, take WordPress off the Internet. Is that what we want to happen?

If so, who is it we are asking to perform this task? There are well over 400 CAs in the Microsoft Root Program do we believe these are the right organizations to be policing the internet for the appropriateness of content?

If so what criteria should they use to do so and what do we do if they abuse this censorship role?

Completeness

It is easy to say that a CA should not issue a certificate if it contains the word “PayPal”. I could even see an argument that those that would be hurt by such a rule, for example, http://www.PayPalSucks.com and (a theoretical) PayPalantir.com are an acceptable loss.

This would, however not catch homoglyphs like when a Cyrillic “a” is used instead of the latin “a” which would very likley require a manual review of the name and content to determine the intent of the domain owner which is near impossible to do with any level of accuracy or fairness.

Even with that, what about ING, as one of the world’s largest banks, they too are commonly phished, should a CA be able to issue a certificate to https://www.fishing.com. And if they do and the issuing CA receives a complaint that it is Phishing ING what should they do?

And what about global markets and languages? In Romania there is a company called Amazon that is a cleaning company, should anyone be able to request their website be revoked because it contains the word Amazon?

If we promote the CA to content police, how do we do so in a complete way?

User Experience

With CAs acting as the content police, what would a user see when they encounter a revoked site? While it varies browser to browser the experience is almost always a blocking “interstitial”, for example:

 chrome revoked firefox revoked


If you look closely you will see these are not screens that you can bypass, revoked sites are effectively removed from the internet.

This is in contrast to Safe Browsing and Smartscreen which were designed for this particular problem set and therefore provide the user a chance to visit the site after a contextually relevant warning:

 SafeBrowsing smartscreen

Conclusion

I hope you see from the above that relying on Certificate Authorities as content police as a means to protect users from phishers a bad idea, at a minimum, it would be:

  • Ineffective,
  • Incomplete,
  • Unmanageable,
  • and Duplicative.

But more importantly it would be establishing a large loosely managed group as the de-facto content censors on the internet and as Steven Spielberg said, there is a fine line between censorship, good taste, and moral responsibility.

So what should CAs do about phishing then? It is my position they should check the Google Safe Browsing API prior to issuance (which by the way, Let’s Encrypt does), and they should report Phishers to the Safe Browsing service if they encounter any.

It is also important to answer the question about what users should do to protect themselves from phishing. I understand the desire to say there is only one indicator they need to be worried about, it’s just not realistic.

When I talk to regular users I tell them to do three things, the first of which is to use an up-to-date and modern browser that uses Smart Screen or Safe Browsing. Second, you should only provide data to sites you know and only over SSL. And finally, try to only provide sites information when it was you initiated the exchange of information.

P.S.

Thanks To Vincent Lynch and the others who were kind enough to proof this post before publishing.

WebCrypto and PKI

Like it or not here it comes — within the next few months WebCrypto will be supported in various degrees across all mainstream browsers. There are plenty of posts out there talking about the security concerns of performing cryptography in the browser so I wont go into those here.

What I wanted to talk about was now that it’s here (mostly) what can we do with it? There will be those who say you do harm by making any claims about the assurances a web application makes relating to identity and confidentiality when WebCrypto is involved. The reality though is its happening and we should think about how we enable applications to use it responsibly.

This is why I started work on PKI.JS and ASN1.JS with Yuri Strozhevsky. Now that browsers have these basic crypto primitives available to them it is possible to build web applications that are interoperable with the security services used off the web, it is also possible to build new services on the web that simply were not possible before.

Now there have been libraries that that offered ASN.1 and PKI related capabilities (for example jsRSAsign, Forge and Lapo-asn1js) but none of these were complete and none built around WebCrypto as the source of crypto.

What Yuri and I set out to do is create a set of libraries that addressed these gaps and used public test suites (when available) to ensure conformance with the associated standards, including:

  1. X.509 and CRLs– RFC 5280
  2. CMS / PKCS 7 (Signed & EnvelopedData) – RFC 5652
  3. PKCS10 – RFC 2986
  4. PKCS8 – RFC 5208
  5. OCSP – RFC 6960
  6. Time-stamping – RFC 3161

For example for ASN.1 Yuri used his freely available test suite and for path building he tested against the PKITS test suite.

This of course does not mean the libraries are 100% compliant or defect free, in-fact I can promise you they are not but where test suites were clearly available we tried to utilize them so we would end up with a highly stable and standards compliant library.

At this point the libraries work in all modern browsers but only support signing, verifying, encrypting and decrypting in the Chrome dev-channel but in theory should work on Firefox nightlies as well. Unfortunately the profile and version of WebCrypto supported by Internet Explorer is outdated enough at this point these features do not work there at all yet.

These libraries have not yet been published to their public repositories but I expect them to be within the week under an BSD style license, to give some perspective on the size of this project I expect it to be just under 20,000 lines of code when released. It’s my hope that other people take this and build upon them so that the Internet has a browser friendly way to interact with these technologies.

NOTE: While I hate disclaimers like this but these libraries have not undergone any significant review please do not consider them production ready more work is needed before that’s the case.

NOTE: It’s also worth noting that until at least two browsers release their WebCrypto implementations as final products that these libraries may stop working or not work uniformly across browsers, for example at this time the nightly Chrome builds do nor support key exports which prevents implementation of the key storage structures.

P.S I actually miss spoke earlier, we did not end up include PKCS #12 in this version but most of the base structures are supported.

What’s in a certificate chain and why?

Have you ever wondered why your web server certificate has a “chain” of other certificates associated with it?

The main reason is so that browsers can tell if your certificate was issued by a CA that has been verified to meet the security, policy and operational practices that all CAs are mandated to meet. That certificate at the top of the chain is commonly called the “root”. Its signature on a certificate below it indicates that the CA operating the root believes that practices of the CA below it meets that same high bar.

But why not issue directly off of the “root” certificates? There are a few reasons; the main one is to prevent key compromise. To get a better understanding, it’s useful to know that the private keys associated with the “root” are kept in an offline cryptographic appliance located in a safe, which is located in a vault in a physically secured facility.
These keys are only periodically brought out to ensure the associated cryptographic appliance is still functioning, to issue any associated operational certificates (for example an OCSP responder certificate) that may be needed, and to sign fresh Certificate Revocation Lists (CRLs). This means that for an attacker to gain access to these keys, they would need to gain physical access to this cryptographic appliance as well as the cryptographic tokens and corresponding secrets that are used to authenticate the device.

CAs do this because keeping keys offline is a great way to reduce the risk of a compromised key, but it’s a poor way to offer a highly available and performant service, so the concept of an Issuing CA (ICA) was introduced. This concept also enabled the “root” to respond to CA key compromise events by revoking a CA certificate that should no longer be trusted. This also enables delegation of control, limiting those who can influence a given ICA to sign something.

Another way CAs solve the “online CA” problem is to use what is commonly referred to as a Policy Certificate Authority (PCA). This model allows a CA to segment operational practices more granularly. For example, maybe the CA is audited to be in compliance with a specific set of government standards so the ICAs associated with those practices would be signed by the corresponding PCA. This not only allows segmentation of policy and procedures, but it also enables separation of usage scenarios. For example, one PCA may only allow issuance of certificates for secure mail while the other PCA may allow issuance of SSL certificates. These PCAs are also very commonly operated as offline entities and have ICAs right underneath them.

While the above two models represent the most common ways a PKI might be segmented, they are not the only two. For example, the operational practices required to be a publicly trusted CA are far stricter than what a typical data center might employ. For this reason, it’s very common for CAs to manage PKIs for other organizations within their facilities.

CAs may also “roll” ICAs as a means to manage CRL size. For example, if a given CA has had to revoke many certificates during its lifespan, it may decide to manage the size of CRLs – it would be appropriate to create a new ICA and take the previous one out of service so that future CRLs can still be downloaded quickly by clients. When this happens both CA certificates may be valid for an overlapping time, but only the more recent one is actively in use.

Long story short, some counts on the number of Certificate Authorities that exist on the internet can be deceiving. One of the easiest ways to see this is to look at a CA called DFN-Verein. They are an educational PKI that manages all of the CAs in their PKI in the same facilities, using the same practices, but for security reasons they create separate ICAs for each organization in their network.

Simply put, the count of CAs in a PKI is not a good way to assess the number of entities issuing certificates in the PKI ecosystem. What you really want to count is how many facilities manage publicly trusted certificates. The problem is that it is too difficult to count – what you can do, however, is count the number of organizations associated with ownership of each “root”. Thankfully Microsoft makes this fairly easy. In March, I did a post on my blog showing a breakdown of the ownership. Unfortunately, this approach does not give you a count of operational facilities that are used for the subordinate CAs, but it’s quite likely that given the operational requirements and costs associated with maintaining them that these two numbers are relatively close.

So what would I like for you to take away from this post? I suppose there are two key points:

  • A public CA using several Certificate Authorities under their direct control is actually a good thing as it indicates they are managing the risk of operating their services and planning for migrations to new algorithms and keys as appropriate.
  • Counting the number of “roots” and “subordinate CAs” found by crawling the web does not actually represent the number of organizations that can act as publicly trusted certificate authorities.

That is not to say the efforts to crawl the web to understand how PKI is deployed and used is not valuable, it is – quite valuable. These projects are an important way to keep an eye on the practices that are actually used in the management of Public PKI.

Additionally, efforts to support Least Privilege designs in PKI and adopt means to actively monitor certificate issuance, such as Certificate Transparency, all represent positive moves to help us better understand what is actually out there.

The (soon to be) not-so Common Name

If you are reading this post you are probably familiar with the use of digital certificates in SSL even if you are you may not be familiar with their history. Before we go there though we should start with what, at its core a digital certificate actually is.

Fundamentally a digital certificate is a binding of entitlements and constraints to a key, in other words they say things like “The holder of the private key associated with this certificate can rightfully use the name Ryan Hurst when signing emails”.

When originally conceived they were to be used to help bind subjects (people and resources) to their representations in directories. This is why the Subject Name in a certificate is structured as a Distinguished Name (DN) as this is how a directory uniquely identify a subject.

This makes sense when looking up an encryption key for a user in an enterprise directory but not so well on the Internet where there is no global directory of users.

This brings us to SSL, it was introduced in the mid 1990s and at the time nearly every large enterprise was already deploying Directories and Certificate Authorities as part of their identity management frameworks. The technology of X.509 was tested, broadly accepted and fit the bill for the problem the designers of the protocol had so they included it as is.

At the time there was only one way to represent concept of a subject of a certificate and that was the Common Name (CN) so they chose to put the DNS name of the SSL server there. This was technically acceptable but was a re-purposing of a field that was really intended for a users actual name.

After SSL was finalized the IETF released their profile of X.509 for use on the Internet this standard introduced the concept of a Subject Alternative Name (SAN) where you can put names that are not associated with a directory. The problem is that ship had sailed, by the late 90s when this was standardized everyone had already settled on using the Common Name.

This led us down a bad path, first of all many servers (especially today) have multiple DNS names and application that supported only the Common Name field couldn’t work with a single certificate with more than one DNS name in it. This was addressed in the short term by using a single certificate for each DNS name but this came at a high cost, we also needed to use a single IP address for each domain name.

Another problem with this approach is applications never really knew what to expect in the Common Name field. Is the value in that field a person’s name or is it a DNS name? This is a problem because often times there are rules you need to use to validate a piece of data before using it and this is especially true for DNS names.

For these reasons (and more) since at least 1999 (when RFC 2459 was standardized) we have been on a slow path to moving away from the use of Common Names for domain names to using Subject Alternative Names.

Fast forward to 2012 some Stanford researchers publish a paper titled “The most dangerous code in the world: validating SSL certificates in non-browser software” which identifies a bunch of applications who fail to do the most basic certificate validation tasks correctly and as a result are the source of a bunch of security vulnerabilities.

These applications gave their users a false sense of security not out of malice but as a result of a lack of understanding of the technology they used to deliver on that promise. A big part of that is the complexity 18 years of technological evolution carries with it.

To address this a number of things need to change but one of the most immediate changes is what that the definition of what constitute a “valid” SSL certificate is changing to make the rule-set a little simpler for the application developer and to rule out options that are no longer considered good practice.

We see this happening in a few ways. First the CA/Browser Forum has worked with Browsers to define a set of Baseline Practices that all Certificates must meet, we are also seeing Browsers doing sanity checks to ensure these practices are in-fact followed.

These baseline requirements mandate that certificate authorities always include at least one Subject Alternative Name in the SSL certificates they issue, this means that today an application doesn’t need to look in both the Common Name and the Subject Alternative Name they only need to check the latter.

Currently most Certificate Authorities will include the first DNS Name from the Subject Alternative Name in the Common Name field also but this is done primarily for legacy reasons and at some point in the not so distant future will stop.

When it does certificates will be a little smaller and developers lives will be a little easier.

Ryan

Resources

· Baseline Requirements

· Internet X.509 Public Key Infrastructure Certificate and Certificate Revocation List (CRL) Profile

· Microsoft Security Advisory: Update for minimum certificate key length

Average CRL size and download time

The other day I had a great conversation with Robert Duncan over at Netcraft, he showed me some reports they have made public about CRL and OCSP performance and uptime.

One thing that I have been meaning to do is to look at average CRL size across the various CAs in a more formal way I just never got around to doing it; conveniently one of the Netcraft reports though included a column for CRL size. So while I was waiting for a meeting to start I decided to figure out what the average sizes were; I focused my efforts on the same CAs I include in the revocation report, this is what I came up with:

 

CA Average CRL Size(K) CRL Download Time @ 56k (s)
Entrust 512.33 74.95
Verisign 200.04 29.26
GoDaddy 173.79 25.42
Comodo 120.75 17.66
Cybertrust/Verizon 75.00 10.97
DigiCert 21.66 3.17
GlobalSign 21.25 3.11
Certum 20.00 2.93
StartSSL 9.40 1.38
TrendMicro 1.00 0.15

 

From this we can derive two charts one for size and another for download time at 56k (about 6% of internet users as of 2010):

clip_image002 clip_image004

 

I overlaid the red line at 10s because that is the timeout that most clients use to indicate when they will give up trying to download, some clients will continue trying in the background so that the next request would have the CRL already cached for the next call.

This threshold is very generous, after all what user is going to hang around for 10 seconds while a CRL is downloaded? This gets worse though the average chain is greater than 3 certificates per chain, two that need to have their status checked :/.

This is one of the reasons we have soft-fail revocation checking, until the Baseline Requirements were published inclusion of OCSP references was not mandatory and not every CA was managing their CRLs to be downloadable within that 10 second threshold.

There are a few ways CAs can manage their CRL sizes, one of the most common is simply roll new intermediate CAs when the CRL size gets unmanageable.

There is something you should understand about the data in the above charts; just because a CRL is published doesn’t mean it represents active certificates – this is one of the reasons I had put of doing this exercise because I wanted to exclude that case by cross-referencing the signing CA with crawler data to see if active certificates were associated with each CRL.

This would exclude the cases where a CA was taken out of operation and all of the associated certificates were revoked as a precautionary exercise – this can happen.

So why did I bother posting this then? It’s just a nice illustration as to why we cannot generally rely on CRLs as a form of revocation checking. In-fact this is very likely why some browsers do not bother trying to download CRLs.

All posts like this should end with a call to action (I need to do better about doing that), in this case I would say it is for CAs to review their revocation practices and how they make certificate status available to ensure it’s available in a fast and reliable manner.

A look at revocation repository uptime

It is no secret that in the last two months GlobalSign was affected by outages at relating to our use of CloudFlare. I won’t go into the specifics behind those outages because the CloudFlare team does a great job of documenting their outages as well as working to make sure the mistakes of the past do not reoccur. With that said we have been working closely with CloudFlare to ensure that our services are better isolated from their other customers and to optimize their network for the traffic our services generate.

I should add that I have a ton of faith in the CloudFlare team, these guys are knowledgeable, incredibly hard working and very self critical — I consider them great partners.

When looking at these events it is important to look at them holistically; for example one of the outages was a result of mitigating what has been called the largest publically announced DDOS in the history of the Internet.

While no downtime is acceptable and I am embarrassed we have had any downtime it’s also important to look at the positives that come from these events, for one we have had an opportunity to test our mitigations for such events and improve them so that in the future we can withstand even larger such attacks.

Additionally it’s also useful to look the actual uptime these services have had and to give those numbers some context look at them next to one of our peers. Thankfully I have this data as a result of the revocation report which tracks performance and uptime from 21 different network worldwide perspectives every minute.

For 05/2012-12/2012 we see:

Service Uptime(%) Avg(ms)
GlobalSign/AlphaSSL OCSP 100.00 101.29
VeriSign/Symantec/Thawte/GeoTrust/Trustcenter OCSP 99.92 319.40
GlobalSign/AlphaSSL CRL 100 96.86
VeriSign/Symantec/Thawte/GeoTrust/Trustcenter CRL 99.97 311.42

 

For 01/2013 to 04/2013 we see:

Service Uptime(%) Avg(ms)
GlobalSign/AlphaSSL OCSP 99.98 76.44
VeriSign/Symantec/Thawte/GeoTrust/Trustcenter OCSP 99.85 302.88
GlobalSign/AlphaSSL CRL 99.98 76.44
VeriSign/Symantec/Thawte/GeoTrust/Trustcenter CRL 99.22 296.97

NOTE:  Symantec operates several different infrastructures – which one you hit is dependent on which brand you buy from and some cases which product you buy. We operate only two brands which share the same infrastructure. I averaged the results for each of their brands together to create these two tables. If you want to see the independent numbers see the Excel document linked to this post.

 

As you can see no one is perfect; I don’t share this to say our downtime is acceptable because it is not, but instead I want to make it clear this is data we track and use to improve our services and to make it clear what the impact really was.

By the way if you want to see the data I used in the above computation you can download these spreadsheets.

Why we built the Revocation Report

For over a year I have been monitoring the industry’s largest OCSP and CRL repositories for performance and uptime. I started this project for a few reasons but to understand them I think it’s appropriate to start with why I joined GlobalSign.

If you’re reading this post you are likely aware of the last few years of attacks against public Certificate Authorities (CA). Though I am no stranger to this space, like you I was watching it all unfold from the outside as I was working at Microsoft in the Advertising division where I was responsible for Security Engineering for their platform.

I recall looking at the industry and feeling frustrated about how little has changed in the last decade, feeling like the Internet was evolving around the CA ecosystem – at least technologically. The focus seemed almost exclusively on policies, procedures and auditing which are of course extremely important when you’re in this business but by themselves they are not a solution.

When I looked at the CA ecosystem there were a few players who I thought understood this; the one I felt got it the most was GlobalSign. Instead of joining the race to the bottom they were developing solutions to help with key management, certificate lifecycle management, and publishing guides to help customers deploy certificates cost effectively.

As a result when they approached me with the opportunity to join them as their CTO and set the technology direction for the company I was intrigued. Those of you who know me know I love data, I believe above all things successful businesses (if they recognize it or not) leverage the Define, Measure, Analyze, Improve and Control cycle to ensure they are solving the right problems and doing so effectively.

To that end when I joined GlobalSign as their CTO and I wanted market intelligence on what the status quo was for technology, operating practices and standards compliance so that I could use to adjust my own priorities as I planned where GlobalSign was going to focus its investments.

It was more than that though, as many of you know I am not new to PKI and especially not to revocation technologies having developed several products / features in this area as well as contributing to the associated standards over the years. I was always frustrated by many public certificate authorities’ inability or unwillingness to acknowledge the inadequacy of their revocation infrastructure and its contribution to slow TLS adoption and bad user agent behavior when it comes to revocation checking.

More directly the reliability and performance of major CA operational infrastructure was why browsers had to implement what is now called “soft-fail” revocation behaviors; the treating of failures to check the status of a certificate the same as a successful check. Yet it is these same people who point fingers at the browsers when the security implications of this behavior are discussed.

I wanted to see that change.

This is why from the very beginning of this project I shared all the data I had with other CAs, my hope was they would use it to improve their infrastructure but unfortunately short of one or two smaller players no one seemed concerned – I was shouting at the wind.

With the limited feedback I had received for the data I had been collecting I decided to put together what is now the revocation report. As part of this project I switched to a different monitoring provider (Monitis) because it gave me more control of what was being monitored and had a more complete API I could use to get at the data.

In parallel I began to work with CloudFlare to address what I felt was one barrier to optimally using a CDN to distribute OCSP responses (inability to cache POSTs). The whole time chronicling my experiences, thoughts and decisions on my blog so that others could learn from my experience and the industry as a whole could benefit.

When I set up the Monitis account I licensed the ability to monitor the top responders from 21 locations worldwide every minute. At first I just published the graphical reports that Monitis had but they had a few problems:

  1. They did not perform very well (at the time).
  2. It was not laid out in such a way you could see all the data at once (also now fixed).
  3. It did not exclude issues associated with their monitoring sensors.
  4. It gave no context to the data that was being presented.

This is what took me to working with Eli to build the revocation report we have today, the internet now has a public view into approximately eleven months (and growing) of performance data for revocation repositories. Eli and I are also working on mining and quantizing the data so we can do something similar for responder uptime but this has taken longer than expected due to other priorities — we will finish it though.

So the question at this point is “was the effort worth it?” — I think so, both of us put a lot of time into this project but I believe it’s been a success for a few reasons:

  1. It allowed me to figure out how to improve our own revocation infrastructure; we now perform at about the same speed as gstatic.google.com for a similarly sized object which is what the bar should be.
  2. Both StartSSL and Entrust have now followed suit and made similar changes to their infrastructure improving their performance by about 3x (besting our performance by a few ms!).
  3. Symantec has improved their primary revocation repository performance by almost 40% and I understand more improvements are on the way.
  4. We are a closer to having data based argument we can present to browsers about why they can and should re-enable hardfail revocation checking by default.
  5. It gives customers visibility into the invisible performance hit associated with the decision of who you choose as your certificate provider.

What do you think? Do you find this valuable? Are there any other elements you think we should be tracking?