
Baseline requirements are just that, a baseline

I’ll start by saying this post is just a collection of personal thoughts and not a statement from my employer, nor does it reflect anyone’s opinions other than my own.

A common problem in the WebPKI is that CAs forget why they exist. No one cares about their business objectives. They exist to serve the public interest; more explicitly, they exist to help the web solve the TOFU problem for domain control verification.

They exist because user agents (usually browsers) made the decision to delegate the TOFU problem to certificate authorities. To understand the browser’s motivation, you simply need to look at the name “user agent”: they directly serve the user.

These user agents used to meet individually with each and every one of these CAs to discuss the user agent requirements for CAs but ultimately decided to do this in a more collaborative way by participating in the CA/Browser Forum. 

Simply put, this organization would very likely stop existing if the user agents stopped participating and did not rely on the documents that come from this organization.

We know this to be true because of the voting structure of the forum and because the forum does not produce documents that govern the user agents, it produces documents that govern the practices of CAs.

To understand the nature of the CA/Browser Forum you just have to look at the name of the first and most important document it produced: the Server Certificate Baseline Requirements.

base·line

noun

     a minimum or starting point used for comparisons.

That’s right, it sets the minimum bar that a user agent should consider when trusting a certificate authority.

Far too often compliance programs end up driving the roadmap and operational practices of CAs and it stops being about what is right and becomes what is minimally required.

This leads to all kinds of spectacular failures, including basic failures in communication:

Instead of this compliance-focused mindset, CAs must be focused on why they exist, who they serve, and what is at risk if they mess things up.

I would argue that every major failure in the WebPKI has stemmed from forgetting about these simple questions.

TOFU and the Web


For the uninitiated, TOFU is an acronym for Trust On First Use. The basic idea is that the user makes the trust decision associated with a communication channel on the first visit.
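To make the idea concrete, here is a minimal sketch of trust on first use in Python. The `TofuStore` class and its return values are hypothetical names chosen for illustration. It remembers whatever key a host name presents on the first visit and only compares on later visits.

```python
import hashlib


class TofuStore:
    """Remembers the first key fingerprint seen for each host name (illustrative only)."""

    def __init__(self):
        self._pins = {}  # host name -> hex SHA-256 fingerprint of the key

    def check(self, host: str, public_key: bytes) -> str:
        fingerprint = hashlib.sha256(public_key).hexdigest()
        pinned = self._pins.get(host)
        if pinned is None:
            # First use: there is nothing to compare against, so we simply trust.
            self._pins[host] = fingerprint
            return "trusted-on-first-use"
        # Later uses: the only question is whether the key changed.
        return "match" if pinned == fingerprint else "mismatch"


store = TofuStore()
first = store.check("example.com", b"key-A")
second = store.check("example.com", b"key-A")
third = store.check("example.com", b"key-B")
```

Note that the first call has no way of knowing whether `key-A` really belongs to the service, which is exactly the weakness discussed in the rest of this post.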

To understand the concept here, you need to know what we mean by “communication channel” and what we “trust” it for.

Let’s start with what we mean by the “communication channel”. In the context of the web, for the most part, a user agent (often a browser) initiates communication to a service located at a fully qualified domain name. This is usually done using the Transport Layer Security (TLS) protocol. 

Next, let’s look at what “trust decision” is being made. Notice we don’t have an application yet, we have just established a session that an application and a service can use to exchange information. As a result, the trust decision we are making is:

Did my traffic end up at a service associated with the aforementioned fully qualified domain name?

The naive way to answer the question is to ask that question every time a user visits the service. In fact, some browsers support this model explicitly:

This is the moral equivalent of saying “do you feel lucky?” because the user has no way to answer this question.

It is actually worse than that: the few ways you can answer that question require the participation of the service being authenticated. That means any solution has to potentially scale up to the 5 billion users on the internet.

Since each of these users has varying degrees of technical prowess and the service in question has a limited amount of time and resources, doing this interactively for every user would be impractical.

To give this some modern color, Signal solves this problem by having the users in a conversation verify out of band they are talking to the right person. Almost no one does this of course and just begins chatting without going through this dance.

The gray hairs of the internet decided to solve this problem by delegating this trust decision to entities known as Certificate Authorities (CAs). They indirectly act as a middleman for the session establishment by issuing the service a credential it can use to prove that someone did this verification for them.

These CAs do this verification by having the service perform some authenticatable action that only someone that controls the fully qualified domain name in question could complete. This whole dance is basically delegated TOFU of control of the fully qualified domain name.

Over the last 28 years, this delegation has taken us from no HTTPS to near-ubiquitous HTTPS.

This is of course fantastic, but it does raise the question: what has this journey taught us, and what might we change moving forward?

For one, right now, if we believe this table, there are about 89 organizations in the world that are approved Certificate Authorities. But the top 5 of those CAs issue 99% of all certificates in use on the web.

That’s right, 96% of the CAs that can issue certificates for the web could disappear tomorrow and most people would not notice.

What concerns me is that running a CA is hard, expensive, and has materially different operating constraints than other services. This combined with the love nation-state attackers have for these services has led to numerous colorful events over the years 1,2,3,4,5,6.

To be clear I am not making the case that those CAs should be removed, just making it clear that the web is carrying a much larger surface area than it strictly needs to deliver on its original design goal. 

That takes us back to the question of what lessons the last 28 years have taught us and what might we change moving forward. The answer here is likely buried in the adage that the key is to have one joint and keep it well-oiled.

The journey from 0% of the web being protected with HTTPS to nearly 100% has been one of simplification and automation. There is plenty of opportunity for us to do more of that, and doing so will be key to the survivability of encryption on the web. Some problems we might want to see addressed in that spirit include:

  1. Too many single points of failure in the WebPKI. If any of those top 5 CAs were to fail we would experience a massive failure on the web. To address this we need to standardize on a single certificate enrollment protocol, which would make the CAs plug-and-play replacements for each other. We already have a de facto standard, RFC 8555 (ACME), which is responsible for around 70% of all certificate issuance. We should just embrace that and make it official. Once we see sufficient ACME adoption we need the ACME clients to fail over and round-robin across all the associated CAs. This will help ensure clients are able to survive any single CA failure.
  2. We need the WebPKI to be less fragile. There are lots of examples of WebPKI fragility, some are:
  • Clients sometimes get hard coded to expect a single issuing CA. This is a very bad practice, one some CAs have already started to dissuade by randomly selecting issuing CAs.
  • Reliance on very long-lived root certificates instead of very frequent root updates. This is also thankfully starting to change, but solving this problem will take us a very long time thanks to root programs that do not continually manage what roots are trusted in deployed systems.
  • The inability of the web to survive CA mistakes and potential security issues without creating outages. There is also work being done here in an extension to ACME called ARI to help enable the web to seamlessly renew certificates in such cases.
  3. Too many CAs with too broad permissions. Since most CAs are not necessary for the web to operate but may still have a legitimate need to exist, we should leverage usage and name constraints in root trust more. This would go a long way to both reduce the value of a compromise of these long-tail CAs to an attacker and reduce the exposure of the web as a whole to the practices of these CAs. We should also come back and periodically revisit the need for CAs to keep those permissions.
  4. Over-reliance on manual assessments by unqualified auditors and security professionals. This introduces a number of issues, some of which include:
  5. Root programs are too slow to respond to bad or questionable actors. Suspension of a CA is not done, and total distrust can take months or longer. There are a lot of politics involved here, but equally important is the tooling to remove CAs, both procedurally and technically. These things are simply not “well oiled”. They need to be made far more agile so that this long tail can be more effectively managed.
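As a rough illustration of the client-side failover described in the first item, here is a minimal Python sketch. Everything in it is hypothetical: `issue_with_failover`, `IssuanceError`, and the stand-in CA callables are made up for illustration, and a real client would speak ACME to each CA rather than call a local function. It also shuffles the CA order on each request rather than doing a strict round-robin.

```python
import random


class IssuanceError(Exception):
    """Raised by a (hypothetical) CA callable when issuance fails."""


def issue_with_failover(domain, cas, rng=random):
    """Try the configured CAs in random order and return (ca_name, certificate).

    `cas` maps a CA name to a callable that takes a domain and either returns
    a certificate or raises IssuanceError.
    """
    order = list(cas)
    rng.shuffle(order)  # spread load so no single CA becomes a hard dependency
    errors = {}
    for name in order:
        try:
            return name, cas[name](domain)
        except IssuanceError as exc:
            errors[name] = str(exc)  # remember why this CA failed, then move on
    raise IssuanceError(f"all CAs failed for {domain}: {errors}")


def broken_ca(domain):
    raise IssuanceError("outage")


def healthy_ca(domain):
    return f"cert for {domain}"


used, cert = issue_with_failover(
    "example.com", {"ca-a": broken_ca, "ca-b": healthy_ca}
)
```

The point of the sketch is that once every CA sits behind the same enrollment interface, surviving the loss of one CA is a short loop in the client.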

So what can you do about all of this? Well if you are a root program you should be looking at how you help address the above issues through the way you manage that program.

If you are an end user? Well, the answer there is easy — adopt ACME-based certificate lifecycle management, keep an eye on Certificate Transparency logs for certificates issued by that long tail of CAs and consider specifying CAA restrictions that limit issuance to a small number of CAs.
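To show roughly how a CAA restriction limits issuance, here is a simplified Python sketch of the check a CA performs. It is an approximation under stated assumptions: it only looks at the `issue` property, it ignores `issuewild`, the critical flag, and CNAME handling, and the record data is passed in as a plain dictionary rather than fetched from DNS. The function names and the `ca-one.example` style CA identifiers are hypothetical.

```python
def relevant_caa(domain, caa_by_name):
    """Climb from the domain toward the root and return the first non-empty CAA set."""
    labels = domain.rstrip(".").lower().split(".")
    for i in range(len(labels)):
        records = caa_by_name.get(".".join(labels[i:]))
        if records:
            return records
    return []


def may_issue(domain, ca_domain, caa_by_name):
    """Return True if the CA identified by ca_domain is allowed to issue for domain."""
    issue_values = [
        value for tag, value in relevant_caa(domain, caa_by_name) if tag == "issue"
    ]
    if not issue_values:
        return True  # no issue property published, so no restriction applies
    # Parameters after ';' (for example an account binding) are ignored here.
    allowed = {value.split(";")[0].strip().lower() for value in issue_values}
    return ca_domain.lower() in allowed


caa = {
    "example.com": [
        ("issue", "ca-one.example"),
        ("issue", "ca-two.example; account=123"),
    ]
}
```

With this data, `may_issue("www.example.com", "ca-one.example", caa)` is true, while a CA that is not listed is refused for any name under `example.com`.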

User-managed smart cards are an oxymoron

In an earlier post, I talked about what it is like to get an EV Code Signing Certificate in 2022 but what I didn’t talk about in that post was token management.

The first thing is that most smart cards implement a proprietary card edge, for example, the SafeNet 5110 is probably the most popular PKI-based smart card token. Since this token is totally proprietary you need to install what is called the SafeNet Authentication Client (SAC) before you can do much of anything with this token.

Most CAs will mail you an unprovisioned token, and tell you to install this client which contains amongst other things a PKCS#11 library that can work with it. The most important part of this package is actually the administration client though.

Without this, you can not initialize the token (roughly the equivalent of formatting it), set or change the passwords for using it, unblock it, or delete certificates from it (necessary as they have limited space).

What is important to understand is that these smart cards have their own administration model, in essence, there is a root user (the Security Officer or SO) and a regular user (that’s the person doing the code signing).

To make things easier for the recipient of a token these tokens are shipped with default secrets, at a minimum that involves the SO and user password (often all zeros or some other well-known sequence of numbers). For example, the default SO password for the SafeNet 5110 is a sequence of 48 zeros.

While the SAC client will generate a random value for the SO password if you want it to, if you do not know what you are doing, it’s actually pretty natural to let it keep re-using the all zeros secret. On the other hand, if you let it assign a random value and do not secure that value so it is not lost, the token is bricked once you type your password wrong 5 times, because you can no longer unblock it.

This leaves us with a spectrum of choices that starts with a charade of protection and ends with something that would fail to meet the most basic business continuity expectations.

One might argue that anyone who wants to use a smart card would be savvy enough to know the above and make sure to use the appropriate practices. I personally find that hard to believe. If you do a search on the web right now I bet the only thing you will find is a few screenshots of how to get the token to the point it can have a key generated on it.

There are some providers who have built out remote signing capabilities using proprietary REST APIs and tooling, these are arguably a better choice than using a smart card but they also tend not to work very well or be very reliable. Not exactly what you want in a process that can gate your releases.

In summary, today, even when you manage to get a smart card enrollment to work remotely, there is a very good chance that you are not managing the smart card lifecycle and associated secrets effectively.

How to keep bad actors out in open ecosystems

There is a class of problems in information security that are intractable. This is often because they have conflicting non-negotiable requirements. 

A good example of this is what I like to call the “re-entry” problem. For instance, let’s say you operated a source control service, let’s call it “SourceHub”. To increase adoption you need “SourceHub” to be quick and easy for anyone to join and get up and running. 

But some users are malicious and will do nefarious things with “SourceHub” which means that you may need to kick them off and keep them off of your service in a durable way.

These two requirements are in direct conflict, the first essentially requires anonymous self-service registration and the second requires strong, unique identification.

The requirement of strong unique identification might seem straightforward on the surface but that is far from the case. For example, in my small social circle, there are four “Natallia’s” who are often at the same gatherings, so everyone must qualify which one we are talking to at these events. I also used to own ryanhurst.com but gave it up because of the volume of spam I would get from fans of Ryan Hurst the actor; spam filters would fail to categorize this unique mail as spam due to the nature of its origin.

Some might say this problem gets easier when dealing with businesses but unfortunately, that is not the case. Take for example the concept of legal identity in HTTPS certificates — we know that business names are not globally unique, they are not even unique within a country which often makes the use of these business names as an identifier useless. 

We also know that the financial burden to establish a “legal business” is very low. For example in Kentucky, it costs $40 to open a business. The other argument I often hear is that despite the low cost of establishing the business the time involved is just too much for an attacker to consider. The problem with this argument is the registration form takes minutes to fill out and if you toss in an extra $40 the turnaround time goes from under 3 weeks to under 2 days, which is not exactly a big delay when looking at financially incentivized attackers.

To put this barrier in the context of a real-world problem let’s look at Authenticode signing certificates. A basic organizationally validated Authenticode code signing certificate costs around $59. With this certificate and that business registration, you can get whatever business name and application name you want to show in the install prompt in Windows.

This sets the bar to re-enter this ecosystem once evicted to around $140 and a few days of waiting.

But what about Authenticode EV code signing? By using an EV code signing certificate you get to start with some Microsoft Smart Screen reputation from the get-go – this certainly helps grease the skids for getting your application installed so users don’t need to see a warning.


But does this reduce the re-entry problem further? Well, the cost for an EV code signing certificate is around $219 which does take the financial burden for the attacker to about $300. That is true at least for the first time: about $50 of that first-time price goes to a smart card like the SafeNet 5110cc or YubiKey 5. Since the same smart card can be used for multiple certificates that cost goes down to $250 per re-entry. It is fair to say the complication of using a smart card for key management also slows the time it takes to get the first certificate; this gets the timeline from incorporation to having an EV code signing certificate in hand to about 1.5 weeks.
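The re-entry figures above are just sums of the prices quoted in this post. A small Python sketch makes the arithmetic explicit; the constant names are mine, and the prices are the approximate ones given in the text.

```python
# Approximate prices quoted in this post, in US dollars.
BUSINESS_REGISTRATION = 40   # Kentucky business registration
EXPEDITE_FEE = 40            # optional fee for the under-2-days turnaround
OV_CODE_SIGNING_CERT = 59    # organizationally validated Authenticode certificate
EV_CODE_SIGNING_CERT = 219   # EV certificate, first-time price including the token
TOKEN_SHARE = 50             # portion of the EV price that pays for the smart card

# Re-entry with an organizationally validated certificate.
ov_reentry = BUSINESS_REGISTRATION + EXPEDITE_FEE + OV_CODE_SIGNING_CERT

# First EV certificate, then later ones that reuse the same smart card.
ev_first = BUSINESS_REGISTRATION + EXPEDITE_FEE + EV_CODE_SIGNING_CERT
ev_repeat = ev_first - TOKEN_SHARE

print(ov_reentry, ev_first, ev_repeat)
```

This prints 139, 299 and 249, which round to the $140, $300 and $250 used in the text.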

These things do represent a re-entry hurdle, but when you consider that effective Zero Day vulnerabilities can net millions of dollars I would argue not a meaningful one. Especially when you consider the attacker is not going to use their own money anyway.

You can also argue that it offers some rate-limiting value to the acquisition pipeline but since there are so many CAs capable of issuing these certificates you could register many companies, somewhat like Special Purpose Acquisition Companies (SPAC) in the stock market, so that when the right opportunity exists to use these certificates it’s ready and waiting.

This hurdle also comes at the expense of adoption of code signing. This of course raises the question of whether all that hassle was worth it.

Usually the argument made here is that since the company registration took place we can at least find the attacker at a later date right? Actually no, very few (if any) of these company registrations verify the address of the applicant.

As I stated in the beginning, this problem isn’t specific to code signing, but in systems like code signing where the issuance of the credential has been separated from the associated usage of the credential, it becomes much harder to manage this risk.

To keep this code signing example going let’s look at the Apple Store. They do code-signing and use certificates that are quite similar to EV code-signing certificates. What is different is that Apple handles the entire flow of entity verification, binary analysis, manual review of the submission, key management, and signing. They do all of these things while taking into consideration the relationship of each entity and by considering the entire history available to them.  This approach gives them a lot more information than you would have if you did each of these things in isolation from each other. 

While there are no silver bullets when it comes to problems like this getting the abstractions at the right level does give you a lot more to work with when trying to defend from these attacks.

What would it look like to go back to first principles when it comes to root store management in 2022?

In the early 2000s, Microsoft mandated that all CAs in its root program would need to be audited for conformity to WebTrust for CAs (WebTrust). It was the first root program to do so and I was the root program manager that rolled out that change.

For those who are interested, WebTrust for CAs is an audit that looks back at a period of past behavior to assess conformity to a set of criteria specific to certificate authorities. Only approved auditors are allowed to officially perform these assessments.

The decision to require the WebTrust audit was pretty simple, there was literally no other independent standard that attempted to assess the operational readiness and security of certificate authorities, and as far as such standards went it was more good than it was bad.

The other root programs didn’t have objective criteria, some operated on a pay-to-play model with virtually no requirements, while others simply wanted a management assertion of conformity to a small list of requirements and a contract signed. 

The idea of working with an independent group that had a vested interest in creating a baseline standard for the operation of CAs, that would be able to evolve over time, seemed like a no-brainer given how poor and inconsistent CA practices were at the time.

Nearly two decades later the CA/Browser Forum is where the AICPA and others go to discuss how the audit criteria should evolve over time. It’s objectively a success in that today there is quite a bit of uniformity in the operational practices that CAs have and its existence has led to many improvements to the WebPKI.

With that said, if we go back to first principles today I think I would take a different approach. I would design a root program that was heavily based on technical security controls, and behaviors that can be verified externally continuously.

If we’re going first principles then let’s take a step back and look at what responsibility a CA has on the internet. A TLS certificate does one thing fundamentally: it binds a domain name to a public key. It does this so that user agents (browsers) can verify that they are talking to the right server.

This user agent believes this certificate is telling the truth because the browser operators will remove CAs that demonstrate they are not trustworthy. 

Today I would guess 60-70% of the certificates on the web are issued via the ACME (RFC 8555) protocol. This protocol has a handful of methods it allows a requestor to use to prove they control the domain name (for example DNS-01, HTTP-01, and TLS-ALPN-01).

What this means is if ACME is the only way you can get a certificate from a CA, it also means that it is possible to externally test how the CA performs domain control verification. 
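Part of what makes these methods externally testable is that the expected challenge response is a deterministic function of public values. As a sketch, the Python below derives the HTTP-01 and DNS-01 values from a challenge token and the account’s public JWK: the key authorization is the token joined to the JWK thumbprint (RFC 7638), and the DNS-01 TXT value is the base64url SHA-256 digest of that string. The function names and the sample token and key values are made up for illustration.

```python
import base64
import hashlib
import json


def b64url(data: bytes) -> str:
    """Base64url without padding, as used throughout ACME."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwk_thumbprint(jwk: dict) -> str:
    """SHA-256 JWK thumbprint over the required members only (RFC 7638)."""
    required = {"EC": ("crv", "kty", "x", "y"), "RSA": ("e", "kty", "n")}[jwk["kty"]]
    canonical = json.dumps(
        {name: jwk[name] for name in required}, sort_keys=True, separators=(",", ":")
    )
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


def key_authorization(token: str, jwk: dict) -> str:
    """Body served for HTTP-01 at /.well-known/acme-challenge/<token>."""
    return f"{token}.{jwk_thumbprint(jwk)}"


def dns01_txt_value(token: str, jwk: dict) -> str:
    """TXT record value published at _acme-challenge.<domain> for DNS-01."""
    return b64url(hashlib.sha256(key_authorization(token, jwk).encode("ascii")).digest())


# Placeholder values, not a real account key or token.
sample_jwk = {"kty": "EC", "crv": "P-256", "x": "abc", "y": "def", "kid": "ignored"}
sample_token = "example-token"
http_body = key_authorization(sample_token, sample_jwk)
txt_value = dns01_txt_value(sample_token, sample_jwk)
```

Because anyone who knows the token and the account key can compute what the CA should see, a third party can set up test domains and observe whether a CA actually enforces the check.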

There has been another innovation in this space, we now have Certificate Transparency, this is a community of append-only ledgers that in union contain all of the WebPKI certificates. If a certificate does not get logged to the CT ledgers then users with major browsers (Safari, Chrome, Edge, etc) will see an error when the user visits the associated site.

This now gets us a list of all the things. With this list we can programmatically review 100% of the certificate issuance to ensure that it meets the externally visible technical requirements. This already happens today in the WebPKI thanks to the excellent ZLint.
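To give a flavor of what programmatic review looks like, here is a toy Python sketch of two checks applied to already-parsed certificate records. It is not how ZLint works or what it checks. The `CertRecord` type and `lint` function are hypothetical, and the 398-day validity limit is used here as an assumed example of an externally visible requirement.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

MAX_VALIDITY = timedelta(days=398)  # assumed limit, used for illustration


@dataclass
class CertRecord:
    """A simplified, already-parsed view of a logged certificate (hypothetical)."""
    serial: str
    not_before: datetime
    not_after: datetime
    dns_names: list = field(default_factory=list)


def lint(cert: CertRecord) -> list:
    """Return human-readable findings for one certificate."""
    findings = []
    if not cert.dns_names:
        findings.append("no subjectAltName DNS names")
    if cert.not_after - cert.not_before > MAX_VALIDITY:
        findings.append("validity period exceeds limit")
    return findings


good = CertRecord("01", datetime(2022, 1, 1), datetime(2022, 4, 1), ["example.com"])
bad = CertRecord("02", datetime(2022, 1, 1), datetime(2024, 1, 1))
report = {cert.serial: lint(cert) for cert in (good, bad)}
```

Run over every entry in the Certificate Transparency logs, checks of this kind turn conformity from a sampled, after-the-fact audit question into something measured continuously.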

What about how the CA’s key material is protected? I love a good HSM too! Well, some HSMs already support attestations about the keys they are protecting. For example, Google Cloud and Amazon Web Services use Marvell HSMs in their cloud offerings, and those devices offer hardware attestations about how a given key is protected.

What about the software that makes up the service? Well, some languages make naturally reproducible builds, for example, Go. This means that you could conceptually run CA software on hardware capable of making attestations about what it is running. From that attestation, and if that software was open source, you could audit the behavior of that software for security and conformity criteria.

You could even go so far as incorporating these attestations into certificate transparency pre-certificates along with how the domain control verification was performed for the “final” certificate so these things could be audited on a per certificate basis.

You could take the additional step of schematizing all of the audit logs that each of these CAs is required to produce and require them to make them public and verifiable via something like an append-only ledger.
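As a sketch of what “verifiable via an append-only ledger” could mean for audit logs, the Python below chains each log record to the hash of the previous entry so that any later edit is detectable. This is a plain hash chain chosen for brevity, not the Merkle tree structure Certificate Transparency uses, and the function names and record fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def entry_hash(prev_hash: str, record: dict) -> str:
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(ledger: list, record: dict) -> None:
    """Add a record to the end of the ledger."""
    prev = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append({"record": record, "hash": entry_hash(prev, record)})


def verify(ledger: list) -> bool:
    """Recompute the chain and confirm no entry was altered, removed, or reordered."""
    prev = GENESIS
    for entry in ledger:
        if entry["hash"] != entry_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True


ledger = []
append(ledger, {"event": "key-ceremony", "operator": "alice"})
append(ledger, {"event": "certificate-issued", "serial": "01"})
intact = verify(ledger)

ledger[0]["record"]["operator"] = "mallory"  # simulate tampering with history
tampered = verify(ledger)
```

A schematized record format matters here for the same reason it does in the sketch: verification only works if everyone serializes and hashes the records the same way.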

We could even go so far as to eliminate the need for revocation checking by adopting very short-lived certificates as a mandate.

Now don’t get me wrong, this isn’t a complete list, and I doubt there is any set of technical controls that would totally eliminate the need for onsite independent visits, but I would argue that they could be more frequent or technically deeper if as an industry we could assess much of the conformity of CAs in real-time and track the changes to their infrastructure.

This sounds like a big lift; after all, there are a ton of CAs in the WebPKI. The reality is that the large majority of these don’t focus on TLS, as demonstrated by the fact that 5 CAs issue 98.59% of all TLS certificates. This means that you wouldn’t need to get many entities to embrace this new model to cover most of the internet.

If you had enough participants, this new root program could be very agile. After a short period of initial automated testing and a signed letter of conformity from an auditor, you could automate the inclusion processes. Should the continuous monitoring find an issue, you could automate the reporting to the CA and even automate the distrust of the CA in some cases.

This new WebPKI would need to exist in parallel to the old one for a long time, in particular, the existence of all of the embedded devices without reasonable firmware update would hold back its total adoption as websites need certificates that work everywhere but you can see a path to an agile, continually verified web PKI that leveraged more of these technical controls rather than all of the manual and procedural verification that is involved today.

I want to be clear: the above is a thought exercise. I am not aware of anyone planning on doing anything like the above, but I do think after 20 years it is entirely reasonable to rethink what the WebPKI looks like. We don’t need to rest on old patterns “just because that’s how it is done”; we have come a long distance in the last couple of decades and it’s worth looking back at these problems now and again.

WebPKI, TLS, cross-signs, device compatibility, and TLS record size.

Both the Chrome and Mozilla root programs have signaled the intent to substantially shorten the time we rely on roots in the WebPKI. I believe this to be a good objective but I am struggling to get my head around the viability and implications of the change.

In this post, I wanted to capture my inner dialog as I try to do just that 🙂

I figure a good place to start when exploring this direction is why this change is a good idea. The simplest answer to this question is probably that root certificates are really just “key bags” — by that I mean they are just a convenient way to distribute asymmetric keys to clients. 

Asymmetric keys have an effective lifetime, which means the security properties they offer start to decay from the moment the key is created. This is because the longer the key exists the more time an attacker has had to guess the corresponding private key from its public key, or worse, compromise the private key in some way. For this reason, there are numerous sources of guidance on usage periods for asymmetric keys. For example, NIST guidance from 2020 says that a 2048-bit RSA key is usable between 2019 and 2030.

There is also the question of ecosystem agility to consider. Alexander Pope once said, “To err is human, to forgive divine.” This may actually be the most important quote when looking at computer security. After all, security practitioners know that all systems will have security breaches, which is why mature organizations focus so much on detection and response. I believe the same thing is needed for ecosystem management; the corollary, in this case, is how quickly you can respond to unexpected changes in the ecosystem.

Basically having short-lived root keys is good for security and helps ensure the ecosystem does not become calcified and overly dependent on very old keys continuing to exist by forcing them to change regularly.

With all that said, if it were easy we would already be doing it! So what are the challenges that make it hard to get to this ideal? One of the hairiest is the topic of the Internet of Things (IoT). Though I would be the first to admit there is no one size fits all answer here, broadly these devices should not use the WebPKI.

In 2022, the market for the Internet of Things is expected to grow 18% to 14.4 billion active connections. It is expected that by 2025, as supply constraints ease and growth further accelerates, there will be approximately 27 billion connected IoT devices. 

IOT-Analytics.com

A big reason for this is that these devices last a very long time compared to their mobile phone and desktop counterparts. They also, unfortunately, tend not to have managed root stores, and if they do, they manage them as part of firmware updates and see limited attach rates for these updates. To make things worse these devices make crude simplifying assumptions around what cryptographic algorithms are supported and what root CAs will be used by servers in the future.

But for most IoT applications, like those enabling the smart city, the device life cycle is 10 to 20 years or more. For instance, it doesn’t make sense to replace the wireless module in smart streetlights every few years.

Ingenu

A good topical example of the consequences here might be the payment industry making a decision in the 90s to adopt a single WebPKI CA  (a VeriSign-owned and operated root certificate) in payment terminals without having a strategy for updating the device’s root stores.  These terminals communicated to web servers that were processing payments. In short, the servers had to use certificates from this one WebPKI Root CA or no payment terminals could reach them.

This design decision was made during a time when SHA-1 was considered “secure”. The thing is cryptographic attacks are always evolving, and since 2005 SHA-1 has not been considered secure. As a result, browsers moved to prevent the use of this algorithm. 

This meant that one of two things would happen, either browsers would break payment terminals by preventing WebPKI CAs from using this algorithm, or payment terminals’ lack of an update strategy would put the internet at risk by slowing the migration from SHA-1. Unfortunately, the answer was the latter.

In short, when you have no update story for roots and you rely on the WebPKI you are essentially putting both your product and the internet at risk. In this case, while a root and firmware update story should have existed regardless, the payment industry should have had a dedicated private PKI for these use cases.

OK, so what does this have to do with moving to a shorter root validity period in Browser root programs? Well, when a TLS server chooses what certificate to present to a client it doesn’t know much more than the IP address of the client. This means it can’t selectively choose a TLS certificate based on it being an IoT device, a phone running an old version of a browser, or a desktop browser.

This is where the concept of “root ubiquity”, and what those in the industry sometimes call “root baking”, comes in. Some root programs process requests for new root inclusion very quickly; maybe within a year your root can become a member of their program, but that isn’t enough. You need the very large majority of the devices that rely on that root program to pick up the new root before the devices will trust it.

In the case of Microsoft Windows, the distribution of a root certificate happens via a feature called AutoRoot Update. The uptake for consumer devices of a new root is very high as this is seldom turned off and the URL that is used to serve this update is seldom blocked. Enterprises and data centers are another matter altogether though. While I do not have any data on what the split is, anecdotally I can say it’s still hit or miss if a root has been distributed in this system as a result of this.

Chrome on the other hand uses a configuration management system for all Chrome installations that also happens to control what roots are trusted. Again, while I have no data I can point to, anecdotally it appears this works very well, and since Chrome updates all browsers automatically they tend to have the latest binaries and settings.

Safari on the other hand manages the roots in the firmware/operating system update. This means if a user does not update their devices’ software the roots will never be present even if you are a member of the root program.

These are just a few examples of some of the root programs CAs have to worry about, and the problems they have to consider.  

What this means is the slowest root program in the WebPKI to accept and distribute roots holds back the adoption of new CAs.  

As an aside, it is worth noting that around 59% of all browser traffic (not including IoT devices) is mobile. This includes a lot of developing country traffic where phones are kept longer and replacement models are often still old models with older software and root stores that are not updated. This is before you get to TVs, printers, medical devices, etc that seldom do forced firmware updates and never manage roots independently from firmware.

As a result of the above, these days I generally tell people that it takes 5-7 years post inclusion to get sufficient ubiquity in a root to be able to rely on it for a popular service.

A CA will typically try to address this problem by acquiring what is known as a cross-sign (see Unobtainium). There are two recent examples of this I can point to: one is the cross-sign for Let’s Encrypt and the other is the cross-sign for Google Trust Services. The problem is most CAs don’t want to help other CAs with cross-signs for obvious reasons. Beyond that, there are fewer and fewer of the “old keys” baked into the oldest devices still in use on the web, meaning there are not a lot of choices for those cross-signs even if you can get one.

Let’s assume for the purpose of this (now long) post that you are able to get a cross sign either from a competitor or because you happen to have custody of older key material that was grandfathered into the root program that has the needed ubiquity.

In the simplest case, you get to a situation where you are asking those that use your certificates to include the two CA certificates in their TLS bundle.  At a minimum, we start to pay a performance penalty for doing this.

As an example, consider that each certificate is about 1.5KB in size; by adding this new certificate to the bundle, every fresh TLS negotiation will carry this tax. If we assume a typical chain is normally 3 in length, that makes our new total 4.5KB in overhead. It’s not a large figure, but if you consider a high-traffic site like Amazon it adds up quickly.

How many purchases are made on Amazon daily? On a daily basis, Amazon ships more than 66,000 orders per hour in the US. About 1.6 million packages are shipped on a daily basis.

capitalcounselor.com

If we assume each of Amazon’s orders equates to a single fresh TLS negotiation (it should be many more than that since many servers power the site) that is a daily increase of 3.6GB of traffic. Now Amazon can deal with that with no problem but it would slow down the experience for users, and potentially cost those users without unlimited internet plans some of their allotted bandwidth. There is also the potential of fragmentation as the TLS packets get larger which increases the latency for the user.

Don’t get me wrong, these are not deal breakers but they are taxes that this cross sign approach represents. 
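The back-of-envelope arithmetic behind this kind of estimate can be sketched as follows. The inputs are only the rough figures quoted above (about 1.5KB per certificate, roughly 1.6 million fresh negotiations a day); the function name is made up for illustration. Under these assumptions the tax lands in the low single-digit gigabytes per day, and the exact figure depends on how many extra certificates and handshakes you count.

```python
# Rough inputs taken from the figures quoted in the text; both are assumptions.
CERT_SIZE_BYTES = 1_500         # "about 1.5KB" per certificate
HANDSHAKES_PER_DAY = 1_600_000  # one fresh TLS negotiation per shipped package


def daily_overhead_gb(extra_certs: int,
                      handshakes: int = HANDSHAKES_PER_DAY,
                      cert_size: int = CERT_SIZE_BYTES) -> float:
    """Extra bytes sent per day, expressed in decimal gigabytes."""
    return extra_certs * cert_size * handshakes / 1e9


if __name__ == "__main__":
    # Show how the total scales with the number of extra certificates sent.
    for extra in (1, 2, 3):
        print(f"{extra} extra certificate(s): {daily_overhead_gb(extra):.1f} GB/day")
```

Real sites see many more handshakes than orders, so treat this as a floor on the shape of the cost rather than a measurement.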

My fear, and to be clear I’ve not completed the thought process yet, is that the above story creates a situation where CAs need to provide multiple cross signs to support all the various device combinations that are out there in this new world.

If so this scares me for a few reasons, off the top of my head these include:

  1. It is already hard enough to get a single cross sign; doing so multiple times will be that much harder,
  2. It will put new entrants at a disadvantage because they will have neither legacy key material to rely on nor the capital needed to secure cross signs, if commercial terms allowing them to get one can even be reached,
  3. We have spent the last decade making TLS the default and we are close to being able to declare victory; if TLS becomes less reliable we may lose ground on this journey,
  4. It is a sort of regressive taxation on people with older devices that are likely already slow and paying for data on a usage basis,
  5. The additional data required will have a negative impact on Time to First Paint (TTFP) in that this extra data has to be exchanged regardless.

Mom always said don’t complain if you don’t have a proposal to make things better, so I figure I should try to propose some alternatives. Before I do though I want to reiterate that this post isn’t a complaint, instead it’s just a representation of my inner dialog on this topic. 

OK, so what might be a better path for us getting to this world of shorter-lived roots? I guess I see a few problems that lead to the above constraints:

  1. Even in the browser ecosystem, we need to see root lists updated dynamically,
  2. Inclusion into a root store should take months for your initial inclusion and later updates should happen in weeks,
  3. Browsers should publish data to help CAs understand how ubiquitous their root distributions are so they can assess if cross signs are even necessary,
  4. There needs to be some kind of public guidance to IoT devices on what they should be doing for the usage of certificates in their devices and this needs to include root management strategies.
  5. IoT devices need to stop using the WebPKI, it is increasingly the Desktop and Browser PKI and anything else will get squashed like a bug in the future if they are not careful 🙂
  6. There should be a Capability and Maturity Model for root programs and associated update mechanisms that can be used to drive change in the existing programs 
  7. CAs should stop using one issuing hierarchy for all cases and divide up the hierarchies into as narrow of slices as possible to reduce the need for one root to be trusted in all scenarios.
  8. Root programs should allow and go so far as to encourage CAs to have more than the 3 (typical) roots they allow now to support CAs in doing that segmentation 
  9. WebPKI root programs should operate two root programs, the legacy one with all of its challenges and the new one focused on agility, automation, and very narrow use cases. Then they should use the existence of that program to drive others to the adoption of those narrower use cases.
  10. And I guess finally we have a chance with the Post-Quantum TLS discussions to look at creating an entirely new WebPKI (possibly without X.509!) that is more agile and narrow from the get-go.

Why crawling is not an adequate measurement methodology for the WebPKI

The answer is simple — It’s an incomplete view of the use of the WebPKI.

There are a number of different methodologies a web crawler-based approach might take in measuring the size of the WebPKI. The most naive approach would be to simply scan all IPv4 address space and log all of the certificates you see during this scan.

The problem is that this only shows a small fraction of the certificates that are out there. When you connect to an IP address and the associated web server doesn’t know what host you are trying to connect to it will return its “default” website and use the associated certificate.

That same IP address may literally be responsible for serving millions of sites based on the client’s indicated hostname. With this IP-based enumeration approach, at best you would get one certificate from that host; at worst you wouldn’t even get that because some servers are not configured with a default site. This is just one problem with this approach; there are many more.
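To make the limitation concrete, here is a toy model (not a real scanner) of one IP address serving several sites. The class, hostnames, and certificate labels are all invented for illustration; the only behavior it models is the one described above, where a client that does not indicate a hostname sees the default certificate or nothing.

```python
from typing import Dict, Iterable, Optional, Set


class VirtualHostServer:
    """Toy stand-in for one IP address that hosts many TLS sites."""

    def __init__(self, certs_by_hostname: Dict[str, str],
                 default_hostname: Optional[str] = None):
        self.certs_by_hostname = certs_by_hostname
        self.default_hostname = default_hostname

    def handshake(self, sni: Optional[str] = None) -> Optional[str]:
        """Return the certificate presented, or None if the server refuses."""
        if sni in self.certs_by_hostname:
            return self.certs_by_hostname[sni]
        if self.default_hostname is not None:
            return self.certs_by_hostname[self.default_hostname]
        return None


def scan_by_ip(servers: Iterable[VirtualHostServer]) -> Set[str]:
    """Connect to each address without a hostname, as a naive crawler would."""
    seen = set()
    for server in servers:
        cert = server.handshake()
        if cert is not None:
            seen.add(cert)
    return seen


if __name__ == "__main__":
    shared = VirtualHostServer(
        {"a.example": "cert-a", "b.example": "cert-b", "c.example": "cert-c"},
        default_hostname="a.example",
    )
    no_default = VirtualHostServer({"d.example": "cert-d"})
    found = scan_by_ip([shared, no_default])
    print(f"IP scan found {len(found)} of 4 certificates: {sorted(found)}")
```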

Though most WebPKI market share reports do not document their methodology, anecdotally it appears most rely on this crawler approach, and at least historically some have taken periodic drops from CAs to make their view “more complete”.

Today, though, the only method that should be used to measure CA market share is to rely on the pre-certificate counts in Certificate Transparency logs.

How to measure the WebPKI ecosystem

The web is dependent on there being a robust, secure, and scalable set of CAs being able to provide TLS certificates. It is unhealthy for there to be a single provider because if for any reason they have an operational or security issue they could become unavailable leaving the web in a world of hurt.

Beyond that in the name of TLS reliability TLS certificate consumers should be relying on multiple CAs for their certificates. For example, to reduce exposure to outages your certificate lifecycle management solution should support failover from one CA to the next. 

Another example of why you should use multiple CAs is to help ensure relying party agility to changes in CAs. For example, if a CA changes which root key material they use you may lose (or gain) device compatibility, or if an issuing CA changes and someone is pinning you might break them. To ensure device compatibility long term, one should use multiple CAs so that the relying party ecosystem you support is agile to these changes.

For this to work, though, you need to have an ecosystem of CAs you can use interchangeably. ACME (RFC 8555) helps here substantially because it provides a normalized way to interact with CAs to get these certificates. That is only helpful if there are multiple CAs that implement the protocol and if those CAs are able to scale to meet the needs of those who rely on them.
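A minimal sketch of what failover between interchangeable CAs could look like inside a certificate lifecycle tool. The `issue_with_failover` function, the `IssuanceError` type, and the issuer callables are all hypothetical; in practice each callable would wrap an ACME client pointed at a different CA’s directory URL.

```python
from typing import Callable, List, Sequence, Tuple


class IssuanceError(Exception):
    """Raised by an issuer callable when a CA cannot issue (hypothetical type)."""


# (CA name, function that takes a domain and returns a PEM certificate)
Issuer = Tuple[str, Callable[[str], str]]


def issue_with_failover(domain: str, issuers: Sequence[Issuer]) -> Tuple[str, str]:
    """Try each CA in order and return (ca_name, certificate) from the first success."""
    failures: List[str] = []
    for ca_name, issue in issuers:
        try:
            return ca_name, issue(domain)
        except IssuanceError as exc:
            # Remember why this CA failed and move on to the next one.
            failures.append(f"{ca_name}: {exc}")
    raise IssuanceError("all CAs failed: " + "; ".join(failures))
```

The ordering of `issuers` is the policy knob here: a deployment might rotate it periodically so that the secondary CA path is exercised before it is needed in an outage.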

This is particularly important when you look at SaaS-like offerings. The larger ones will often demand millions of certificates that, in some cases, need to be able to be revoked and re-issued in less than 24 hours, so the scalability of the CA becomes particularly important.
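To get a feel for what that demand means for a CA, a rough calculation helps. The function name and the example volumes below are assumptions for illustration only; the point is the sustained issuance rate a replacement window implies.

```python
def required_issuance_rate(cert_count: int, window_hours: float) -> float:
    """Average certificates per second needed to replace cert_count in the window."""
    return cert_count / (window_hours * 3600)


if __name__ == "__main__":
    # Illustrative volumes only; real fleets and windows will differ.
    for count in (1_000_000, 5_000_000):
        rate = required_issuance_rate(count, 24)
        print(f"{count:,} certificates in 24h needs about {rate:.1f} issuances/second")
```

That rate has to be sustained on top of the CA’s normal load, and it covers issuance alone, not validation or revocation publishing.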

Assessing the scalability of a CA is hard but one of the closest proxies you have is their overall market share.

In the US, according to the Google Transparency Report, 97% of all web traffic is protected with TLS. To put that in context there were 366.8 million registered domain names as of 2022.

Certificates can represent more than one domain name, so depending on what you are measuring, certificate count may not be the best metric to assess CA market share. With that said, in the context of scalability it’s probably a good metric.

What are some ways to evaluate the CA impact and market share?

  • How many certificates are issued by the CA and are unexpired.
  • How many domains are contained within the unexpired certificates issued by a CA.
  • What percentage of web traffic would be covered by the certificates issued by a CA.
  • What percentage of certificates issued by the CA are unexpired and actively in use.

Each of these answers different questions, and they progressively get harder to measure as you go down the list. The easiest by far is how many certificates are issued and still unexpired. This is because all CAs log what is called a pre-certificate to the Certificate Transparency ecosystem before issuance.

NOTE: Publication of a pre-certificate is not required by the rules of the ecosystem; however, not doing so would mean that users relying on that certificate would get an error.

While the existence of a pre-certificate doesn’t promise the certificate is in use it does signal that someone who controlled that domain wanted to use a certificate for that domain. They wouldn’t have bothered going to the trouble of doing that if there was not an intent to use the certificate in some way.
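The first two metrics in the list above can be computed directly once you have pre-certificate records in hand. This is a sketch under stated assumptions: the `Precert` record and its fields are hypothetical, standing in for whatever you extract from Certificate Transparency data, and “domains” here means distinct DNS names rather than registered domains.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Precert:
    """Hypothetical, simplified view of one logged pre-certificate."""
    issuer: str
    not_after: datetime
    dns_names: Tuple[str, ...]


def unexpired_stats(precerts: Iterable[Precert],
                    now: datetime) -> Dict[str, Tuple[int, int]]:
    """Map each issuer to (unexpired precert count, distinct DNS names covered)."""
    counts: Dict[str, int] = defaultdict(int)
    names: Dict[str, set] = defaultdict(set)
    for precert in precerts:
        if precert.not_after > now:
            counts[precert.issuer] += 1
            names[precert.issuer].update(precert.dns_names)
    return {issuer: (counts[issuer], len(names[issuer])) for issuer in counts}


if __name__ == "__main__":
    utc = timezone.utc
    sample = [
        Precert("CA A", datetime(2022, 9, 1, tzinfo=utc), ("a.example", "www.a.example")),
        Precert("CA A", datetime(2022, 1, 1, tzinfo=utc), ("old.example",)),
        Precert("CA B", datetime(2022, 8, 15, tzinfo=utc), ("b.example",)),
    ]
    print(unexpired_stats(sample, now=datetime(2022, 7, 29, tzinfo=utc)))
```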

The easiest way to look at this data is to use the excellent https://crt.sh/cert-populations report. While it does go down from time to time it also provides very fresh views into the un-expired pre-certificate count.

NOTE: Since not all CAs publish what is referred to as the “final certificate” you can safely ignore the Certificate count data on this report.

So what does this data look like (As of July 29th, 2022)?

| CA | Certificates (ALL) | Certificates (Unexpired) | Precertificates (ALL) | Precertificates (Unexpired) | % of Unexpired Population |
|---|---|---|---|---|---|
| Internet Security Research Group | 2,834,892,521 | 264,685,335 | 2,553,476,280 | 228,023,480 | 50.18% |
| Sectigo | 109,399,984 | 7,245,014 | 373,669,758 | 106,119,713 | 23.35% |
| DigiCert | 560,740,357 | 44,640,273 | 497,448,389 | 45,475,972 | 10.01% |
| GoDaddy | 6,371,960 | 1,874,812 | 52,669,260 | 31,293,304 | 6.89% |
| Google Trust Services LLC | 17,284 | 178 | 28,112,662 | 15,443,306 | 3.40% |
| Amazon | 13,540,556 | 98,980 | 104,887,859 | 14,757,650 | 3.25% |
| GlobalSign nv-sa | 16,729,663 | 937 | 23,636,778 | 6,893,728 | 1.52% |
| Actalis | 552 | 3 | 3,236,493 | 1,691,744 | 0.37% |
| Asseco Data Systems S.A. (previously Unizeto Certum) | 6,298,472 | 620 | 9,375,742 | 1,571,852 | 0.35% |
| Start Commercial (StartCom) Ltd. | 1,495,580 | 98 | 2,866,004 | 883,022 | 0.19% |
| ? | 1,241,463 | 214 | 3,924,285 | 567,480 | 0.12% |
| Entrust | 739,901 | 524 | 2,304,521 | 554,431 | 0.12% |
| SECOM Trust Systems CO., LTD. | 156,234 | -11 | 2,217,668 | 242,815 | 0.05% |
| WoSign CA Limited | 88,660 | 7 | 250,823 | 110,101 | 0.02% |
| Certainly LLC | 31,361 | 205 | 240,103 | 101,533 | 0.02% |
| Buypass | 186,200 | 270 | 2,127,228 | 98,835 | 0.02% |
| QuoVadis | 53,636 | 432 | 236,063 | 97,454 | 0.02% |
| SecureTrust | 311,226 | 227 | 301,197 | 78,489 | 0.02% |
| Microsoft Corporation Core Services Engineering & Operations ( “Microsoft CSEO”) | 216,448 | 73,560 | 212,905 | 74,697 | 0.02% |
| Deutsche Telekom Security GmbH | 57,570 | 32 | 147,949 | 49,556 | 0.01% |
| JPRS | 15,738 | 34 | 82,642 | 36,512 | 0.01% |
| SwissSign AG | 237,886 | 68,272 | 83,504 | 26,975 | 0.01% |
| Government of Spain, Fábrica Nacional de Moneda y Timbre (FNMT) | 86,872 | 23,866 | 57,206 | 23,772 | 0.01% |

What you will see is that the top 7 CAs out of 233 issue 98.59% of all TLS certificates. While I would like to see this distribution be more even, so that the health of the ecosystem is not overly dependent on any one entity, it does show there are several large providers out there that support the Web and have demonstrated they can scale to meet large certificate consumption needs.
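Concentration figures like this can be recomputed from the “% of Unexpired Population” column of the report. The sketch below hard-codes the seven largest shares from the table above; the helper name is made up. Because it sums already-rounded percentages, the results can differ by a hundredth of a point from figures derived from raw counts.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple

# Largest shares of the unexpired pre-certificate population, from the table.
UNEXPIRED_SHARE: List[Tuple[str, float]] = [
    ("Internet Security Research Group", 50.18),
    ("Sectigo", 23.35),
    ("DigiCert", 10.01),
    ("GoDaddy", 6.89),
    ("Google Trust Services LLC", 3.40),
    ("Amazon", 3.25),
    ("GlobalSign nv-sa", 1.52),
]
TOTAL_CAS = 233  # number of CAs in the report, as stated in the text


def cumulative_share(shares: Sequence[Tuple[str, float]]) -> List[float]:
    """Running total of share, largest CA first, rounded to two decimals."""
    ordered = sorted((pct for _, pct in shares), reverse=True)
    return [round(total, 2) for total in accumulate(ordered)]


if __name__ == "__main__":
    for rank, total in enumerate(cumulative_share(UNEXPIRED_SHARE), start=1):
        fraction_of_cas = 100 * rank / TOTAL_CAS
        print(f"top {rank} CAs ({fraction_of_cas:.2f}% of CAs): {total:.2f}% of unexpired precerts")
```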

One thing you will notice in this data is that the variability in the pre-certificate “ALL” and “Unexpired” count is quite large in some cases. This is because some CAs like Let’s Encrypt and Google Trust Services either predominantly, or exclusively issue shorter-lived certificates. This results in the certificate count in “All” being much higher than the “Unexpired” case.

So what can we take away from this data? I think there are three key takeaways:

  1. Support of certificate issuance via ACME has made shorter-lived certificates viable and they now represent the large majority of certificates on the web.
  2. Support of ACME has helped grow the percentage of the web that is encrypted from about half of the web to nearly 100% of the web.
  3. About 3% of CAs (7 of 233) issue 98.59% of all TLS certificates on the web.

How is ACME different than XCEP/WSTEP anyway?

If you read my blog there is a reasonable chance that you are familiar with RFC 8555, the standard for Automatic Certificate Management Environment (ACME). Even though ACME is a relatively young protocol it is already used by the majority of websites on the internet for certificate lifecycle management.

While I won’t go into a lot of detail, for this post to make sense you have to understand a couple of things about the ACME protocol.

The first is that it works on the concept of dynamic “account” registration. By that I mean requestors can, in real time, request that an “account” be created for them. This account is represented by a key pair whose public key the ACME service will use to persist metadata about the requestor. The ACME service can deny this request for any reason it likes, but commonly in Web PKI as long as the request is well-formed it is accepted.

The next thing you need to understand is that it has the concept of “challenges” that can be used to communicate conditions that must be met before a certificate is issued. For example, an ACME service may request that the account holder demonstrate that they are authorized to get a certificate for a given domain name by placing a specific value in DNS at a well-known location. Since only a DNS administrator could perform that action the ACME service can have confidence the requestor controls the name it has requested a certificate for.
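For the DNS example, the value the requestor publishes is derived from the challenge token and the account key. The sketch below follows my reading of RFC 8555 (dns-01) and RFC 7638 (JWK thumbprint) using only the standard library; the function names are my own, and the JWK and token in the usage example are placeholders, not a real key or challenge.

```python
import base64
import hashlib
import json
from typing import Tuple


def b64url(data: bytes) -> str:
    """Base64url without padding, as used throughout JOSE and ACME."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwk_thumbprint(jwk: dict) -> str:
    """RFC 7638 thumbprint: SHA-256 over the required JWK members only."""
    required = {"EC": ("crv", "kty", "x", "y"), "RSA": ("e", "kty", "n")}
    members = {name: jwk[name] for name in required[jwk["kty"]]}
    # Canonical form: members sorted by name, no insignificant whitespace.
    canonical = json.dumps(members, sort_keys=True, separators=(",", ":"))
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


def dns01_record(domain: str, token: str, account_jwk: dict) -> Tuple[str, str]:
    """Return the (name, value) of the TXT record answering a dns-01 challenge."""
    key_authorization = f"{token}.{jwk_thumbprint(account_jwk)}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return f"_acme-challenge.{domain}", b64url(digest)


if __name__ == "__main__":
    # Placeholder account key and token, for illustration only.
    placeholder_jwk = {"kty": "EC", "crv": "P-256", "x": "placeholder-x", "y": "placeholder-y"}
    name, value = dns01_record("example.com", "placeholder-token", placeholder_jwk)
    print(f'{name} IN TXT "{value}"')
```

Because the value binds the token to the account key, a TXT record prepared for one account does not satisfy a challenge issued to another.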

And finally, there is nothing in the ACME protocol that limits its use to just web server certificates. It is a general framework that can be used to acquire and manage certificates of any type. For example, Cisco is using ACME in their Webex offering to facilitate the acquisition of what are essentially email certificates via OIDC authentication as a way to authenticate chat members.

But there is another very popular protocol, well, a set of protocols, that fewer people know about: XCEP and WSTEP. These protocols are used by Windows machines both to determine what kind of certificates a machine or user should enroll for and to enable the enrollment for those certificates.

Similar to ACME, these protocols also support, although in a more rigid way, the ability for the issuer to challenge the client for additional information necessary to get a certificate of a particular type. For example, you can configure a certificate type (known as a template) to require that the requestor provide a cryptographic attestation backed by a TPM to be used to prove the machine belongs to the organization operating the certificate authority.

There are differences though. The first is the concept of a template, which enables XCEP/WSTEP to have one URL endpoint issue many types of certificates. This is very important within an enterprise, where certificates are used for many different scenarios.

Another difference is that XCEP/WSTEP presume the authorization of the client happened out of band before the client requested the certificate. The dynamic approach to challenges adopted by ACME allows it to tackle this problem in-band or to rely on out-of-band authorization. It supports the out-of-band case through External Account Binding, which allows the requestor to use an API key gathered out of band to prove, on account creation, that the account key is associated with some pre-enrolled user.
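External Account Binding is small enough to sketch with the standard library. As I read RFC 8555, the binding is a JWS whose payload is the account’s public key and whose signature is an HMAC computed with the key handed out of band. The function names are my own, HS256 is assumed as the MAC algorithm, and the key identifier, MAC key, and URL in the test data are placeholders.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url without padding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Decode base64url text that may be missing its padding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def external_account_binding(account_jwk: dict, kid: str,
                             mac_key_b64url: str, new_account_url: str) -> dict:
    """Build the externalAccountBinding object for a newAccount request."""
    header = {"alg": "HS256", "kid": kid, "url": new_account_url}
    protected = b64url(json.dumps(header).encode("utf-8"))
    payload = b64url(json.dumps(account_jwk).encode("utf-8"))
    signing_input = f"{protected}.{payload}".encode("ascii")
    # The MAC key is the secret the CA issued out of band for this kid.
    mac = hmac.new(b64url_decode(mac_key_b64url), signing_input, hashlib.sha256).digest()
    return {"protected": protected, "payload": payload, "signature": b64url(mac)}
```

The resulting object is what would be placed in the `externalAccountBinding` field of the account creation request, which is itself signed by the account key, so the server can see both keys vouching for each other.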

And finally, ACME has a clear model for extensibility built into it. What this means is that one can easily extend it with additional capabilities. The most fundamental part of this is the Directory resource which lists all of the APIs supported by this ACME instance. One could use this, for example, to add a “Templates” API that would allow an ACME client to request specific types of certificates from the ACME endpoint.

Similarly, the concept of the challenge allows the server to demand the client do any number of things before the certificate is to be issued so the idea of adding a TPM challenge, for example, is trivial within this framework.
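The Directory-based extensibility described above can be illustrated with a sample document. The standard field names below are from RFC 8555; the `templates` entry is purely hypothetical (it is the kind of addition the text imagines, not something any specification defines), and the URLs and helper function are placeholders.

```python
from typing import Optional

# Sample directory object. Everything except "templates" uses RFC 8555 field names.
SAMPLE_DIRECTORY = {
    "newNonce": "https://acme.example/acme/new-nonce",
    "newAccount": "https://acme.example/acme/new-account",
    "newOrder": "https://acme.example/acme/new-order",
    "revokeCert": "https://acme.example/acme/revoke-cert",
    "keyChange": "https://acme.example/acme/key-change",
    "meta": {"externalAccountRequired": True},
    # Hypothetical extension endpoint, not part of RFC 8555.
    "templates": "https://acme.example/acme/templates",
}


def endpoint_for(directory: dict, resource: str) -> Optional[str]:
    """Return the URL for a resource, or None if this server does not offer it."""
    url = directory.get(resource)
    return url if isinstance(url, str) else None


if __name__ == "__main__":
    # A client discovers optional capabilities by checking for their entries.
    for resource in ("newOrder", "templates", "newAuthz"):
        print(resource, "->", endpoint_for(SAMPLE_DIRECTORY, resource))
```

A client that does not understand an extra entry simply ignores it, which is what makes this style of extension cheap to deploy.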

In short, ACME, contrary to popular belief, is not a protocol for getting and managing website certificates; it is a framework for getting and managing any certificate. More importantly, it is extensible in such a way that with just a few minor additions it would be a proper superset of all the capabilities within the Windows enrollment protocol suite.

Why is this important? That’s easy! When I talk to anyone who is using certificates at any reasonable scale, their concerns almost instantly come to the complexity of managing the lifecycle of those certificates across the various products and services that use them.

When we look at this complexity most of it arises from the use of a mish-mash of solutions for lifecycle management that when viewed in isolation seemed sufficient but when looked at holistically were actually woefully insufficient. 

If as an industry we move these legacy systems to a single protocol so that certificates regardless of them being for public or private PKI or representing users, machines or workloads use one protocol we will have a reliable substrate that we can use to authenticate and authorize with agility.

The next decade of Public Key Infrastructure…

Background

Before we talk about the future we need to make sure we have a decent understanding of the past. X.509 based Public Key Infrastructure originally was created in the late 80s with a focus on enterprise and government use cases.  These use cases were largely for private systems, it was not until a decade later this technology was applied to the internet at large.

Since the standards for enrollment and lifecycle management at the time were building blocks rather than solutions and were designed for government and enterprise use cases rather than the internet, the Web PKI, as it became known, relied largely on manual certificate lifecycle management and a mix of proprietary automation solutions.

While the use of PKI in the enterprise continued, primarily thanks to Microsoft AD/CS and its automatic certificate lifecycle management (I worked on this project), the Web PKI grew in a far more visible way. This was primarily a result of the fact that these certificates had to be acquired manually which led to the creation of an industry focused on sales and marketing of individual certificates.

The actors in this system had no incentive to push automation as it would accelerate the commoditization of their products. The reality was that these organizations had also lost much of their technical chops as they became sales and marketing organizations and could no longer deliver the technology needed to bring this automation anyways.

This changed in 2016 when the Internet Security Research Group, an organization I am involved in, launched Let’s Encrypt. This was an organization of technologists looking to accelerate the adoption of TLS on the web and as such started with a focus on automation, as it was clear that without automation growth of HTTPS adoption would continue to be anemic. What many don’t know is that when Let’s Encrypt launched, HTTPS adoption was at about 40% and year over year growth was hovering around 2-3%, about the rate of growth of the internet, and it was not accelerating.

Beyond that, TLS-related outages were becoming more frequent in the press, even for large organizations. Post mortems would continuously identify the same root causes: a manual process did not get executed or was executed incorrectly.

The launch of Let’s Encrypt gave the Internet the first CA with a standards-based certificate enrollment protocol (ACME), this combined with the short-lived nature of the certificates they issued meant those that adopted it would have to use automation for their services to reliably offer TLS. This enabled products to make TLS work reliably and by default, a great example of this is the Caddy web server. This quickly took the TLS adoption rate to around 10% year over year and now we are hovering around 90%+ HTTPS on the internet.

While this was going on the concept of microservices merged with containers which led to container-orchestration, which later adopted the concept of mesh networking. This mesh networking was often based on mutual-TLS (mTLS). The most visible manifestation of that being SPIFFE, the solution used by Kubernetes.

At the same time, we saw networks becoming more composable, pushing authentication and authorization decisions out to the edge of the network. While this pattern has had several names over the years we now call it Zero Trust, and a visible example of that today is BeyondCorp from Google. These solutions again are commonly implemented on top of mutual TLS (mTLS).

We now also see the concept of Secure Access Service Edge (SASE) or Zero-Trust Edge gaining speed, which extends this same pattern to lower-level network definition. Again, this is commonly implemented on top of mTLS.

The reality is that the Web PKI CAs were so focused on sales and marketing they missed almost all of these trends. You can see them now paying lip service to this by talking about DevOps in their sales and marketing, but the reality is that the solutions they offer in this area are both too late and too little. This is why cloud technology providers like HashiCorp and cloud providers like Amazon and Google (I am involved in this also) had to step in and provide their offerings.

We now see that Web PKI CAs are starting to more seriously embrace automation for the public PKI use cases, for example, most of the major CAs now offer ACME support to some degree and generally have begun to more seriously invest in the certificate lifecycle management for other use cases.

That being said many of these CAs are making the same mistakes they have made in the past. Instead of working together and ensuring standards and software exist to make lifecycle management work seamlessly across vendors, most are investing in proprietary solutions that only solve portions of the problems at hand.

What’s next?

The usage of certificates and TLS has expanded massively in the last decade and there is no clear alternative to replace its use, so I do not expect the adoption of TLS to wane anytime soon.

What I do think is going to happen is a unification of certificate lifecycle management for private PKI use cases and public PKI use cases. Mesh networking, Zero-Trust, and Zero-Trust edge are going to drive this unification.

This will manifest in the use of ACME for these private PKI use cases; in fact this has already started, just take a look at Cert Manager and Small Step Certificates as small examples of this trend.

This combined with the ease of deploying and managing private CAs via the new generation of Cloud CA offerings will result in more private PKIs being deployed and the availability problems from issues like certificate expiration and scalability will no longer be an issue.

We will also see extensions to the ACME protocol that make it easier to leverage existing trust relationships which will simplify the issuance process for private use cases as well as ways to leverage hardware-backed device identity and key protection to make the use of these certificate-based credentials even more secure.

As is always the case the unification of common protocols will enable interoperability across solutions, improve reliability and as a result accelerate the adoption of these patterns across many products and problems.

It will also mean that over time the legacy certificate enrollment protocols such as SCEP, WSTEP/XCEP, CMC, EST, and others will become less common.

Once this transition happens this will lead us to a world where we can apply policy based on subjects, resources, claims, and context across L3 to L7 which will transform the way we think about access control and security segmentation. It will give both more control and visibility into who has access to what.

What does this mean for the Web PKI?

First I should say that Web PKI is not going anywhere – with that said it is evolving.

Beyond the increase in automation and shorter certificate validities over the next decade we will see several changes, one of the more visible will be the move to using dedicated PKI hierarchies for different use cases. For example, we will ultimately see server authentication, client authentication, and document signing move to their own hierarchies. This move will better reflect the intent of the Web PKI and prevent these use cases from holding the Web PKI’s evolution back.

This change will also minimize the browser influence on those other scenarios. It will do this at the expense of greater ecosystem complexity around root distribution but the net positive will be felt regardless. I do think this shift will give the European CAs an advantage in that they can rely on the EUTL for distribution, and since many non-web user agents simply do not want to manage a root program of their own, the EUTL has the potential to be adopted more. I will add that it is my hope these user agents instead adopt solution-specific root programs vs relying on a generic one not built for purpose.

The Web PKI CAs that have not re-built their engineering chops are going to fall further behind the innovation curve. Their shift from engineering companies to sales and marketing companies resulted in them missing the move to the cloud and those companies that are going through digital transformation via the adoption of SaaS, PaaS, and modern cloud infrastructures are unlikely to start that journey by engaging with a traditional Web PKI CA.

To address this reality the Web PKI CAs will need to re-invent themselves into product companies focusing on solving business problems rather than selling certificates that can be used to solve business problems. This will mean, for example, directly offering identity verification services (not selling certificates that contain assertions of identity), providing complete solutions for document signing rather than certificates one can use to sign a document or turnkey solutions for certificate and key lifecycle management for enterprise wireless and other related use cases.

This will all lead to workloads that were once on the Web PKI by happenstance being moved to dedicated workload/ecosystem-specific private PKIs. The upside of this is that the certificates used by these infrastructures will have the opportunity to aggressively profile X.509 vs being forced to carry the two decades of cruft surrounding it like they are today.

The Web PKI CAs will have an opportunity to outsource the root certificate and key management for these use cases and possibly subcontract out CA management for the issuing CAs but many of these “issuing CA” use cases are likely to go to the cloud providers since that is where the workloads will be anyway.

Due to the ongoing balkanization of the internet that is happening through increased regional regulation, we will see smaller CAs get acquired, mainly for their market presence to let the larger providers play more effectively in those markets.

At the same time, new PKI ecosystems like those used for STIR/SHAKEN and various PKIs to support IoT deployments will pop up and as the patterns used by them are found to be inexpensive, effective, and easily deployable they will become more common.

We will also see that the lifecycle management for both public and private PKI will unify on top of the ACME enrollment protocol, and that through that a new generation of device management platforms will be built around a certificate-based device identity anchored in keys bound to hardware, where the corresponding certificates contain metadata about the device they are bound to.

This will lay the groundwork for improved network authentication within the enterprise using protocols like EAP-TTLS and EAP-TLS, enable Zero-Trust and Zero-Trust Edge deployments to be more easily deployed which will, in turn, blur the lines further between what is on-premise and what is in the cloud.

This normalization of the device identity concepts we use across solutions and the use of common protocols for credential lifecycle will result in better key hygiene for all use cases, and simplify deployment for those use cases.