The Growing Security Concerns of Modern Firmware and the Need for Change.

Today’s firmware is larger and more complex than ever before. In 1981, the IBM PC BIOS was a mere 8 KB, but now UEFI, even without considering machines with BMCs, can be 32 MB or even larger! To illustrate the magnitude of the problem, Intel will soon release its next-generation SoCs with support for 128 MB of firmware!

Essentially, UEFI has become a real-time OS with over 6 million lines of code and is still growing. This is larger than most modern operating systems. Furthermore, the various boot phases and hardware layers create significant complexity for defenders. Increased surface area leads to more vulnerabilities.

The most impactful and difficult-to-patch vulnerabilities reside in the underbelly of technology. Firmware, file systems, BGP, and other foundational aspects of technology that we often take for granted are becoming more vulnerable to attacks. It’s time to prioritize security for the very foundation of our tech. Benjamin Franklin once said, “A failure to plan is a plan to fail.” This adage often applies to long-term vulnerabilities in technology. Insufficient planning can lead to an inability to detect issues, inadequate data to assess their true severity, and a lack of responsiveness.

Firmware serves as a prime example. Many firmware-level issues remain unpatched because firmware often lacks the measurement and patching middleware we expect from software. Moreover, hardware vendors frequently behave as if their job is complete once they release a patch. Imagine if, in 2023, software vendors merely dropped a patched piece of software into a barely discoverable HTTP-accessible folder and proclaimed, “Thank goodness we’ve done our part.” This scenario largely reflects the current state of firmware.

One reason for this situation is that the problem on the surface appears intractable. A typical PC may house dozens of firmware components, with no inventory of what exists. This firmware often originates from multiple vendors and may include outdated chips that have not been updated.

Another fitting saying is, “You can’t manage what you can’t measure.” Combine this with the exponential growth of firmware surface area and the increasing number of internet-connected devices containing firmware, and you have a massive security issue arising from decades of neglect.

There is no silver bullet here. One aspect to address is the way firmware is built. USB Armory aims to solve this by making firmware memory safe, easy to read, and with minimal dependencies. While this is a positive step, it is not sufficient on its own. Binarly.io has created the best automation available for detecting firmware issues automatically, which is invaluable considering that old approaches will persist for decades.

To drive change, we need better measurement and widespread adoption of automatic update mechanisms for firmware of all sizes. These mechanisms must be safe, reliable, and robust. Misaligned incentives contribute to the problem, often resulting from a lack of accountability and transparency. This is why I dedicated as much time as I could to binary.transparency.dev while at Google.

The momentum around software supply chain security is essential, as it sheds some light on the problem, but alone it is not enough to bring about the necessary change. If you create a chip with firmware that has a vulnerability, your responsibility does not end with shipping a patch. If you ship devices without providing a way to seamlessly patch all firmware, you are failing.

Relying on the next hardware refresh cycle to update firmware on your devices in the field is insufficient. With cloud adoption, refresh cycles might even lengthen. A long-term strategy to support your devices is necessary; not doing so externalizes the consequences of your inaction on society.

If you have devices in the field that are in use, and you don’t have a confident inventory of the dependencies that exist in them, and you’re not monitoring those dependencies and the firmware itself for issues, you are part of the problem, externalizing consequences on society.

We can do better.

To improve firmware security, the industry must collaborate and adopt best practices. This includes embracing transparency, robust patch management systems, and long-term support strategies for devices. By working together to address these challenges, we can build a more secure foundation for the technology that underpins our modern world.

In conclusion, it’s crucial that we prioritize firmware security, as it plays a fundamental role in the safety and reliability of our devices and systems. By implementing more effective measurement, automatic update mechanisms, and long-term support strategies, we can reduce the risks associated with outdated and vulnerable firmware. This will help create a safer digital environment for everyone.

P.S. Thanks to @matrosov and @zaolin for their insights on the problem on Twitter.

Global Trend in Tech Regulation and its Consequences for the EU and the US

The United States has long been a leader in technological innovation, with companies such as Google, Apple, Facebook, and Amazon paving the way. As of October 2021, 62% of global tech unicorns have emerged from the US, with China accounting for 19%, while only 12% have come from the EU. One explanation for this delta is the size of the regulatory regime in the EU, which is known to favor larger companies and make it more expensive and harder for small companies to enter the market.

It’s been only 29 years since the EU transitioned from a trading bloc to a union of 27 member states with coordinating policies. Considering the complications that such a transition represents, it’s not surprising that, relative to the US and China, the EU has more bureaucracy. However, things are changing, as there is now an adult generation that has grown up with the EU as part of their national identity. They have also seen the benefits of that partnership manifest for much of their lives. While the member states of the EU will continue to evolve how they work with each other, they have come a long way in terms of coordination and cooperation and have a solid foundation to build upon.

Another argument that I have heard is that the EU’s focus on creating a stable and cooperative union took away from the focus on technological and market growth. That may be true but over the last decade, they have focused on creating regulations they hope will create a Digital Single Market which they hope will address this problem. During this same period, the US regulatory framework largely stood still, but they also experienced the most rapid growth of technology companies of any nation during this time.

It’s worth noting that the EU’s approach to regulation has been very implementation-specific when compared to the U.S. approach to similar regulation, as seen with the EIdAS, the EU’s digital signature legislation, and the associated supporting efforts which choose which technologies must be used. The first version of which left the topic of interoperability as a footnote and ignored the concept of reciprocity. This essentially created member-state monopolies around the legislation where country-to-country transactions would still be signed on pen and paper. That did change a few years ago, but better technological approaches to solving the associated problems were established and proven since the initial legislation was proposed two decades ago, and their adoption was held back due to this legislation’s technical specificity.

On the other hand, there is a credible argument to be made that the US has failed when it comes to creating meaningful legislation to protect its citizens from the overreach of tech giants and the increasingly online nature of daily life. In fact, many would argue that, at least for the last decade, they have failed when it comes to creating meaningful legislation, period. This failure has created an opportunity for the EU to step up and leave its mark on the global technology industry, which it certainly has been doing.

What is noteworthy here is that many of these regulations are being framed as globally binding. The concept being applied here is called extraterritoriality, which basically means that the jurisdiction of the law extends beyond the physical borders of the country or region that has enacted it. The argument is that by extending the reach of its laws to cover EU citizens wherever they may be, they are protecting the rights of its citizens and promoting a level playing field for all companies, regardless of where they are based.

But what is a regulation without teeth? This is why these regulations usually empower the EU member states to impose fines on companies, regardless of where they exist, if the associated regulations are found not to have been met. The trend to leverage extraterritoriality is sure to continue and likely accelerate. In the case of security, one such upcoming regulation from the EU is NIS2, which is looking to regulate incident response handling, again with potential fines for non-compliance.

It’s not to say that all of this EU legislation is bad, though I would argue that the more explicit it is on technical details, the worse it is. For example, consider the EU Cookie legislation–it has resulted in the online equivalent of graffiti all over the internet with no meaningful improvement for users.

As I think about this, one of the things I struggle with is that the geographic nexus of a business is increasingly not relevant to users. In one respect, this is an argument for exactly what the EU is doing. But there are also 195 sovereign nations, each with its own cultural norms and political interests. Can they all regulate online businesses regardless of where they are located? What will the long-term effects of this global regulatory trend be?

Of course, the answer is that only those countries that have enough financial clout to get away with this, but even then, would the EU want its businesses regulated in this way by the US or China? And what do you do when the regulations conflict? Does the future require proof of citizenship before using any online service?

The Importance of Proper Measurement in Enterprise IT Management: Lessons from Cloud Deployments

Peter Drucker once said, “You cannot manage what you cannot measure.” This quote is applicable in many aspects of technology development and business management. Neglecting measurement often leads to problems. Another relevant saying is “Lies, damned lies, and statistics.” The importance of what we measure and how we measure it cannot be overstated when using data to manage complex systems.

In enterprise IT, endpoint management systems are often touted as the source of truth for the environment, but this is misleading. They measure the intended state of the devices, not their actual state. In other words, they track what they believe they have done, rather than whether the change was correctly and successfully deployed. This might seem similar or sufficient, but it is not. Endpoint systems have many different software and settings that can interact in unexpected ways, making the data they provide nearly useless for risk assessment, according to most security professionals.

As for EDR systems, I would argue that they are simply advanced anti-viruses. They try to protect the operating system which is largely intractable. give the enterprise visibility to attacks and infections, while providing minimal tools for a response. To be clear EDR is valuable, but its overhead is high and it is not designed for device intelligence or observability; its purpose is detection and response.

If enterprises had proper investments in observability, they could discover outages before users report them. They could use the intelligence about the state and usage of their devices to proactively defend and isolate their assets, turning what has been a reactive and inaccurate dataset into a tool to protect their networks and improve user productivity.

There is a lot to learn from Cloud deployments when looking at how to solve these problems. For example, unlike most IT departments, cloud assets are heavily instrumented with logs being centralized, with dashboards reporting real-time uptime and health. There is an entire market of solutions focused on enabling this visibility, just consider how many log aggregation and analytics offerings such as Stackdriver, CloudWatch, and New Relic exist. 

Additionally, these assets typically have cross-platform machine identities that are used to facilitate security domain separation, and interestingly these identities are increasingly using hardware-backed keys to secure those credentials. These credentials are also used to help each endpoint in those deployments achieve some basic level of self-protection, where the credentials used by these assets will capture the posture of the machines and the peers they interact with, particularly when crossing security domains and enforcing policy based on these interactions.

I believe that over the next decade, enterprise IT deployments will increasingly resemble cloud deployments, with a focus on zero-trust design patterns. However, for that to happen there are product gaps that will need to be filled. For example, there is no turnkey solution for desktop state observability with structured data that can be used for real-time policy decisions. The big tech companies have built these solutions for their own deployments but there is nothing turnkey in the market that addresses this need. Then there is the issue of how to authenticate the device and its state. There are lots of pieces in this area but no solutions. If you are going to use the observed device state for the policy you also have to worry about the validity of the associated data, while this is not possible in existing systems to totally solve this problem there are lots of things that can be done to make data more reliable.

Finally, as we see this evolution, there is a need to rethink the way enterprises approach identity. It will become an alternative to Active Directory. No security practitioner would design their source of truth in the enterprise in the same way as Active Directory is today.

The Changing Landscape of Internet Protection

The United States government and big companies like Google have both played a significant role as protectors in their respective spheres. The US government, as the world’s leading military power, has tried to serve as a global peacekeeper and defender of democracy through its high military spending. Similarly, Google, as a success in the technology industry, has leveraged its financial power from advertising to fund various initiatives to protect and grow the internet and the web.

However, in recent years, Google has undergone significant changes. The COVID-19 pandemic accelerated the need for companies to adapt and re-envision themselves, and Google was no exception. The company, in a relatively short period, hired many externally sourced professional managers and underwent a series of re-organizations and corresponding cultural changes that have fundamentally changed the company forever. This, combined with recent fiscal irresponsibility in hiring, and inability to drive a cohesive and integrated product strategy (how many messengers does one company need after all?), led to Google’s first-ever layoffs, which included a significant impact on internet investments, and being transparent my own job.

This raises important questions about the future of internet protection. While Google was clearly not the exclusive protector of the internet, its direct and indirect investments in the internet have played a significant role in protecting it. It seems unlikely that moving forward, they can be counted on in the same way. We need to see other large organizations whose businesses similarly take a critical dependency on a safe and secure internet step up and fill the gap.

This is problematic for a number of reasons, one of the largest being that public companies are, to a fault, focused on quarter-to-quarter growth. That is of course the right thing for their shareholders, at least in the near term, but the sort of infrastructure investments necessary to secure the Internet can take a decade or more of investment. That’s quite hard for an organization to justify when the most valuable resource they have is engineers when those engineers can be spent working on securing and improving their private networks or directly generating revenue.

Many of these organizations already donate to security non-profits such as ISRG and OpenSSF and work through them to make some of these improvements to the Internet, but the funding to these organizations is often piecemeal and usually only aligned with the latest security trends. Furthermore, these investments are often associated with pet projects of the sponsoring companies rather than taking a strategic and proportional investment into different problem areas. A great example of this is how many of the most important open-source projects lack the basic funding necessary to sufficiently secure them long-term.

One approach to mitigating this is to lean on these security non-profits more and give them more autonomy and funding. This would allow them to take on a larger and more independent role via a larger commitment to funding with appropriate multi-year roadmaps, milestones, and success criteria approved by some mix of funding stakeholders invited experts in the specific areas in question and key engineering participants from relevant funding organizations. This would create predictable funding and a space where long-term roadmaps for research, securing, and improving the internet can be established without relying on a small number of companies with business models that support direct investment in those activities.

This approach would have its own challenges, for example, like most non-profits, these organizations will struggle to be impact focused rather than focused on their own pet projects or philosophies. That said, this is largely part of the human condition and something that always needs to be managed. The answer to how to manage this is surely buried in what systematic approach is used for choosing initiatives, measuring their success, and having lots of meaningful milestones along the way to enable them to checkpoint, course correct and the will power kill projects as appropriate.

In summary, the role of a protector comes with great responsibility. The recent changes at Google highlight the need for a diverse set of stakeholders to come together to safeguard the internet and the web and take less dependency on one company carrying too large of the load when it comes to protecting the internet.

The Limitations of Audits: What You Need to Know

In recent years, the SOC2 and ISO 27001 badges have become a staple at the bottom of nearly every SaaS website. This is largely due to the growing number of startups providing solutions for SOC2 and ISO 27001 audits, such as Vanta and Secureframe. These companies greatly reduce the time and money necessary to comply with various audit regimes. The fact large companies are increasingly incorporating the requirements for SOC2 and ISO 27001 audits into their procurement process of course also drives the demand for these audits. With that said it’s important to understand what these badges really mean and what they don’t.

One common misunderstanding is that audits are ongoing assessments of an organization’s security practices. In reality, audits are almost always point-in-time retrospectives, meaning they only reflect what was the case at the time of the audit and not what is currently the case. This is a material point, as many small organizations view the work associated with these audits as an annual tax, rather than integrating the associated practices into how they work.

Another factor to consider is the scope of the audits. The entity being audited gets to choose the scope, and it’s common for startups to exclude their IT environment (such as desktops) from their audits. This means that not all SOC2 audits are created equal, and you may not be getting the full picture.

Beyond that when audits involve sampling data, it is usually the subject of the audit who chooses the sample. This means that either by accident or on purpose, the analysis may be based on the most favorable data.

There is also the concept of accepted and managed risks. An organization can often get by with poor security practices as long as they acknowledge the risk and have a plan to resolve it within a fixed period of time. This period of time can be continuously extended, allowing the organization to avoid addressing the issue indefinitely.

Then you need to remember that the auditor works for the organization being audited. They want to be hired again next year, so they may be willing to accept the organization’s interpretation of the expectations in order to secure future business. This means that the results of the audit may not be completely impartial.

And finally, there is the question of the qualifications of the auditors conducting these assessments. Often, they are exclusively accountants and not technologists, meaning that they are not equipped to evaluate the technical security or correctness of the systems being audited. Instead, they are essentially tasked with assessing if the checklist represented in the audit regime can be reasonably deemed as met. 

In conclusion, while SOC2 and ISO 27001 audits have good intentions, they are far from sufficient in assessing an organization’s security practices. The fact that the audits are point-in-time retrospectives, the scope of the audit is chosen by the entity being audited, the auditor works for the organization, and the results may be based on favorable data, all contribute to the limitations of these audits. As a result, it’s important to be aware of these limitations and not rely solely on a SOC2 badge as a sign of a secure organization.

Using Caddy with Google Trust Services

Caddy is a powerful and easy-to-use web server that can be configured to use a variety of certificate authorities (CA) to issue SSL/TLS certificates. One popular CA is Google Trust Services, which offers an ACME endpoint that is already compatible with Caddy because it implements the industry-standard ACME protocol (RFC 8555). 

This means that Caddy can automatically handle the process of certificate issuance and renewal with Google Trust Services, once the External Account Binding (EAB) credentials required have been configured. 

How do I use it? 

Using global options

To use the Google Trust Services ACME endpoint you will need an API key so you can use a feature in ACME called External Account Binding. This enables us to associate your certificate requests to your Google Cloud account and allows us to impose rate limits on a per-customer basis. You may easily get an API key using the following commands:

$ gcloud config set project <project ID>
$ gcloud projects add-iam-policy-binding project-foo \  –member=user:[email protected] \  –role=roles/publicca.externalAccountKeyCreator
# Request a key:
$ gcloud alpha publicca external-account-keys create

You will need to add this API key and specify the Google Trust Services ACME directory endpoint along with your email address in your Caddyfile:

{    
acme_ca https://dv.acme-v02.api.pki.goog/directory
email  [email protected]
acme_eab {
        key_id  <key_id>
        mac_key <mac_key>
    }}

It is important to remember that when you use this configuration Google Trust Services explicitly is used as the only CA.

If you want to use multiple CAs for redundancy Caddy which is recommended the configuration would look something like this:

{     
cert_issuer acme https://dv.acme-v02.api.pki.goog/directory  {
          eab <key_id>  <key>
     }
     cert_issuer acme
}

In this example Google Trust Services will be tried and if there is a problem it will fall back to Let’s Encrypt.

It is also worth noting that the Google Trust Services EAB key is one time use only. This means that once Caddy has created your ACME account these can be safely removed.

Using the tls directive

If you want to use Google Trust Services for only some of your sites, you can use the tls directive in your Caddyfile like you’re used to:

tls [email protected] {
    ca https://dv.acme-v02.api.pki.goog/directory
   eab <key_id> <mac_key>
}

The email address in this example identifies the ACME account to be used when doing enrollment.

In conclusion, using Caddy with Google Trust Services is a simple and simple and secure way to issue and manage SSL/TLS certificates for your websites. With the easy-to-use Caddyfile configuration, you can quickly and easily configure your server to use Google Trust Services for all of your sites or just a select few. With Google Trust Services, you can trust that your websites will be secure and your visitors’ data will be protected.

To err is human, to forgive is divine

Every service should strive to provide perfect availability, the reality is that it’s not possible to be perfect. Mistakes happen, it’s how you deal with them that is important.

Successfully dealing with availability issues requires planning, and when dealing with a client-server solution, it requires both parties to make improvements.

In the context of ACME today the large majority of clients have no failover or logic. This means if the enrollment fails due to connectivity issues to the specified CA, the CA has an outage, or the CA suspends operations for one reason or another the enrollment will fail.

This is a problem for any service protected with TLS, which is basically every service, that wants to have a highly available service. One of the ways services deal with this is to proactively acquire two certificates for every endpoint, this is viable, but I would argue this is not the right solution.

The right solution is to not rely on a single CA and instead failover between many CAs that are capable of servicing your needs. This way if any single CA fails you are fine, you can just keep chugging along.

That is not sufficient to address this risk though. Not all certificates have the same level of device trust. Sometimes clients make bad assumptions and trust specific CA hierarchies assuming these configurations are static, even when they are not.

To help mitigate this behavior clients should implement a round-robin or random CA selection logic so that subsequent renewals will hit different CAs. This will force clients to make sure they work with any of your chosen CAs. This way you won’t find yourself breaking apps that make those bad assumptions when your CA fails.

Caddy Server already implements both of these strategies as I understand it but every ACME client should be doing the same thing.

A Boy Scout is always prepared

My father always says it’s not a problem to make a mistake, what is important is how you deal with it. 

The same thing is true when it comes to WebPKI CAs, broadly the incident response process used in this ecosystem could be categorized as Blameless Post Mortem. The focus is on what happened, what contributed to it, and what was done to address the issue and not on fault.

A few years ago a number of large CAs had to do millions of revocations, in all of the cases I am thinking the required deadline for those revocations was 5 days. Revoking a large number of certificates that are not directly obvious but if you’re a CA who has done any moderate level of planning it’s something you should be up for.

The thing is that doing so can cause harm, for example, the issue that necessitates the revocation might be incredibly subtle and not security-impacting. Nonetheless, the requirement is what the requirement is — the certificate needs to be revoked.

The question then becomes how can you meet that timeline objective without creating an unnecessary outage for customers? If you defy the rules you risk being distrusted, if you act blindly you could take down your customer’s services.

The question then becomes how do you contact millions of customers and give them enough time to replace their certificates without an outage with these constraints?

Like most scale problems the answer is automation, in the context of certificate lifecycle management that means an extension to the ACME protocol. To that end, there is now a draft for something called “ACME Renewal Information” which when implemented by CAs and ACME clients will enable a CA to signal that there may be a need to replace their certificate earlier than expected.

The basic idea with this proposal is that the CA will make available hints on when it would like certificates to be updated and the client will periodically check this information and use it to guide its renewal behavior.

To be clear, this is just a hint, a CA might be providing this hint just to smooth out the load, but there is no mandate to rely on the hint. With that said I do hope that all major ACME clients implement this standard and respect it by default because it will make the WebPKI a lot less fragile.

Baseline requirements are just that a baseline

I’ll start by saying this post is just a collection of personal thoughts and not a statement from my employer, nor does it reflect anyone’s opinions other than my own.

A common problem in the WebPKI is that CAs forget why they exist. No one cares about their business objectives. They exist to serve the public interest, more explicitly they exist to help the web solve the TOFU problem for domain control verification.

They exist because user agents (usually browsers) made the decision to delegate the TOFU problem to certificate authorities. To understand the browser’s motivation you simply need to look at the name “user agent” to understand that they directly serve the user.

These user agents used to meet individually with each and every one of these CAs to discuss the user agent requirements for CAs but ultimately decided to do this in a more collaborative way by participating in the CA/Browser Forum. 

Simply put, this organization would very likley stop existing if the user agents stopped participating and did not rely on the documents that come from this organization.

We know this to be true because of the voting structure of the forum and because the forum does not produce documents that govern the user agents, it produces documents that govern the practices of CAs.

To understand the nature of the CA/Browser program you just have to look at the name of the first and most important document they produced — the Server Certificate Baseline Requirements.

base·line

noun

     a minimum or starting point used for comparisons.

That’s right, it sets the minimum bar that a user agent should consider when trusting a certificate authority.

Far too often compliance programs end up driving the roadmap and operational practices of CAs and it stops being about what is right and becomes what is minimally required.

This leads to all kinds of spectacular failures, including basic failures in communication:

Instead of this compliance-focused mindset, CAs must be focused on why they exist, who they serve, and what is at risk if they mess things up.

I would argue that every major failure in the WebPKI has stemmed from forgetting about these simple questions.

TOFU and the Web

I’ll start by saying this post is just a collection of personal thoughts and not a statement from my employer, nor does it reflect anyone’s opinions other than my own.

For the uninitiated, TOFU is an acronym for Trust On First Use. The basic idea is that the user makes the trust decision associated with a communication channel on the first visit.

To understand the concept here, you need to know what we mean by “communication channel and what we “trust” them for.

Let’s start with what we mean by the “communication channel”. In the context of the web, for the most part, a user agent (often a browser) initiates communication to a service located at a fully qualified domain name. This is usually done using the Transport Layer Security (TLS) protocol. 

Next, let’s look at what “trust decision” is being made. Notice we don’t have an application yet, we have just established a session that an application and a service can use to exchange information. As a result, the trust decision we are making is:

Did my traffic end up at a service associated with the aforementioned fully qualified domain name?

The naive way to answer the question is to ask that question every time a user visits the service. In fact, some browsers support this model explicitly:

This is the moral equivalent of saying “do you feel lucky?” because the user has no way to answer this question.

It is actually worse than that, the few ways you can answer that question require the participation of the service being authenticated. That means any solution has to potentially scale up to the 5 billion users on the internet

Since each of these users has varying degrees of technical prowess and the service in question has a limited amount of time and resources; doing this for every user interactivity would be impractical.

To give this some modern color, Signal solves this problem by having the users in a conversation verify out of band they are talking to the right person. Almost no one does this of course and just begins chatting without going through this dance.

The gray hairs of the internet decided to solve this problem by delegating this trust decision to entities known as Certificate Authorities (CAs). They indirectly act as a middleman for the session establishment, they do this by issuing the service a credential it can use to prove that someone did this verification for them. 

These CAs do this verification by having the service perform some authenticatable action that only someone that controls the fully qualified domain name in question could complete This whole dance is basically delegated TOFU of control of the fully qualified domain name.

Over the last 28 years, this delegation has taken us from no HTTPS to near-ubiquitous HTTPS.

This is of course fantastic but it does beg the question what has this journey taught us and what might we change moving forward?

For one right now, if we believe this table, there are about 89 organizations in the world that are approved Certificate Authorities. But the top 5 of those CAs issue 99% of all certificates in use on the web.

That’s right 96% of the CAs that can issue certificates for the web could disappear tomorrow and most people would not notice.

What concerns me is that running a CA is hard, expensive, and has materially different operating constraints than other services. This combined with the love nation-state attackers have for these services has led to numerous colorful events over the years 1,2,3,4,5,6.

To be clear I am not making the case that those CAs should be removed, just making it clear that the web is carrying a much larger surface area than it strictly needs to deliver on its original design goal. 

That takes us back to the question of what lessons the last 28 years have taught us and what might we change moving forward. The answer here is likely buried in the adage of having one joint and keeping it well-oiled is key. 

The journey from 0% of the web being protected with HTTPS to nearly 100% has been one of simplification and automation. There is plenty of opportunity for us to do more of that and doing so will be key to the survivability of encryption on the web, some problems we might want to see in addressed that spirit include:

  1. Too many single points of failure in the WebPKI. If any of those top 5 CAs were to fail we would experience a massive failure on the web. For us to address this we need to standardize on a single certificate enrollment protocol, this would make the CAs plug-and-play replacements for each other. We already have a de-facto standard RFC 8555 (ACME) which is responsible for around 70% of all certificate issuance. We should just embrace that and make it official. Once we see sufficient ACME adoption we need the ACME clients to fail over and round-robin across all the associated CAs. This will help ensure clients are able to survive any single CA failure.
  1. We need the WebPKI to be less fragile. There are lots of examples of WebPKI fragility, some are:
  • Clients sometimes get hard coded to expect a single issuing CA. This is a very bad practice, one some CAs have already started to dissuade by randomly selecting issuing CAs.
  • Reliance on very long-lived root certificates instead of very frequent root updates. This is also thankfully starting to change but solving this problem will take us a very long time thanks to root programs that do not continually manage what roots are trusted in deployed systems.
  • The inability of the web to survive CA mistakes and potential security issues without creating outages. There is also work being done here in an extension to ACME called ARI to help enable the web to seamlessly renew certificates in such cases.
  1. Too many CAs with too broad permissions. Since most CAs are not necessary for the web to operate but may still have a legitimate need to exist we should leverage usage and name constraints in root trust more. This would go a long way to both reduce the value of a compromise of these CA long tail CAs to an attacker as well as reduce the exposure to the web as a whole to the practices of these CAs. We should also come back and revisit the need for CAs to keep those permissions periodically.
  1. Over-reliance on manual assessments by unqualified auditors and security professionals. This introduces a number of issues, some of which include:
  1. Root programs are too slow to respond to bad or questionable actors. Suspension of a CA is not-done and total distrust can take months or longer. There are a lot of politics involved here but equally important is the tooling to remove CAs, both procedurally and technically. In these things are simply not “well oiled”. They need to be made far more agile so that this long tail can be more effectively managed.

So what can you do about all of this? Well if you are a root program you should be looking at how you help address the above issues through the way you manage that program.

If you are a website operator? Well, the answer there is easy — adopt ACME-based certificate lifecycle management, keep an eye on Certificate Transparency logs for certificates issued by that long tail of CAs and consider specifying CAA restrictions that limit issuance to a small number of CAs.