I’ll start by saying this post is just a collection of personal thoughts and not a statement from my employer, nor does it reflect anyone’s opinions other than my own.
For the uninitiated, TOFU is an acronym for Trust On First Use. The basic idea is that the user makes the trust decision associated with a communication channel on the first visit.
To understand the concept here, you need to know what we mean by “communication channel” and what we “trust” them for.
Let’s start with what we mean by the “communication channel”. In the context of the web, for the most part, a user agent (often a browser) initiates communication to a service located at a fully qualified domain name. This is usually done using the Transport Layer Security (TLS) protocol.
Next, let’s look at what “trust decision” is being made. Notice we don’t have an application yet, we have just established a session that an application and a service can use to exchange information. As a result, the trust decision we are making is:
Did my traffic end up at a service associated with the aforementioned fully qualified domain name?
The naive way to answer the question is to ask that question every time a user visits the service. In fact, some browsers support this model explicitly:
This is the moral equivalent of saying “do you feel lucky?” because the user has no way to answer this question.
It is actually worse than that, the few ways you can answer that question require the participation of the service being authenticated. That means any solution has to potentially scale up to the 5 billion users on the internet.
Since each of these users has varying degrees of technical prowess and the service in question has a limited amount of time and resources; doing this for every user interactivity would be impractical.
To give this some modern color, Signal solves this problem by having the users in a conversation verify out of band they are talking to the right person. Almost no one does this of course and just begins chatting without going through this dance.
The gray hairs of the internet decided to solve this problem by delegating this trust decision to entities known as Certificate Authorities (CAs). They indirectly act as a middleman for the session establishment, they do this by issuing the service a credential it can use to prove that someone did this verification for them.
These CAs do this verification by having the service perform some authenticatable action that only someone that controls the fully qualified domain name in question could complete This whole dance is basically delegated TOFU of control of the fully qualified domain name.
Over the last 28 years, this delegation has taken us from no HTTPS to near-ubiquitous HTTPS.
This is of course fantastic but it does beg the question what has this journey taught us and what might we change moving forward?
For one right now, if we believe this table, there are about 89 organizations in the world that are approved Certificate Authorities. But the top 5 of those CAs issue 99% of all certificates in use on the web.
That’s right 96% of the CAs that can issue certificates for the web could disappear tomorrow and most people would not notice.
What concerns me is that running a CA is hard, expensive, and has materially different operating constraints than other services. This combined with the love nation-state attackers have for these services has led to numerous colorful events over the years 1,2,3,4,5,6.
To be clear I am not making the case that those CAs should be removed, just making it clear that the web is carrying a much larger surface area than it strictly needs to deliver on its original design goal.
That takes us back to the question of what lessons the last 28 years have taught us and what might we change moving forward. The answer here is likely buried in the adage of having one joint and keeping it well-oiled is key.
The journey from 0% of the web being protected with HTTPS to nearly 100% has been one of simplification and automation. There is plenty of opportunity for us to do more of that and doing so will be key to the survivability of encryption on the web, some problems we might want to see in addressed that spirit include:
- Too many single points of failure in the WebPKI. If any of those top 5 CAs were to fail we would experience a massive failure on the web. For us to address this we need to standardize on a single certificate enrollment protocol, this would make the CAs plug-and-play replacements for each other. We already have a de-facto standard RFC 8555 (ACME) which is responsible for around 70% of all certificate issuance. We should just embrace that and make it official. Once we see sufficient ACME adoption we need the ACME clients to fail over and round-robin across all the associated CAs. This will help ensure clients are able to survive any single CA failure.
- We need the WebPKI to be less fragile. There are lots of examples of WebPKI fragility, some are:
- Clients sometimes get hard coded to expect a single issuing CA. This is a very bad practice, one some CAs have already started to dissuade by randomly selecting issuing CAs.
- Reliance on very long-lived root certificates instead of very frequent root updates. This is also thankfully starting to change but solving this problem will take us a very long time thanks to root programs that do not continually manage what roots are trusted in deployed systems.
- The inability of the web to survive CA mistakes and potential security issues without creating outages. There is also work being done here in an extension to ACME called ARI to help enable the web to seamlessly renew certificates in such cases.
- Too many CAs with too broad permissions. Since most CAs are not necessary for the web to operate but may still have a legitimate need to exist we should leverage usage and name constraints in root trust more. This would go a long way to both reduce the value of a compromise of these CA long tail CAs to an attacker as well as reduce the exposure to the web as a whole to the practices of these CAs. We should also come back and revisit the need for CAs to keep those permissions periodically.
- Over-reliance on manual assessments by unqualified auditors and security professionals. This introduces a number of issues, some of which include:
- The fiduciary relationship between an auditor and their client is such that no one wants to fail their client’s audit because they want to do next year’s audit.
- Usually, these audits are performed by individuals with limited experience in PKI, operational systems, and security.
- These audits are backward-looking and look at only a small sample of the issued certificates. Moving to a model that augments these audits with continual technical assessments would make a big difference here.
- Root programs are too slow to respond to bad or questionable actors. Suspension of a CA is not-done and total distrust can take months or longer. There are a lot of politics involved here but equally important is the tooling to remove CAs, both procedurally and technically. In these things are simply not “well oiled”. They need to be made far more agile so that this long tail can be more effectively managed.
So what can you do about all of this? Well if you are a root program you should be looking at how you help address the above issues through the way you manage that program.
If you are an end user? Well, the answer there is easy — adopt ACME-based certificate lifecycle management, keep an eye on Certificate Transparency logs for certificates issued by that long tail of CAs and consider specifying CAA restrictions that limit issuance to a small number of CAs.