Beyond the Facade: Evaluating Long-Term Impacts of Enterprise Software Architectures

Many enterprise products these days have a core architecture that consists of placing a proxy in front of an existing service. While the facade architecture makes sense in some cases, it’s usually a temporary measure because it increases the costs of administration, scaling, security, and debugging. It also adds complexity to general management.

The use cases for these offerings often involve one team in an organization providing incremental value to a service operated by another team. This introduces various organizational politics, which make anything more than a proof of concept not really viable, at least on an ongoing basis.

Essentially, anyone who has ever carried a pager or tried to deploy even the most basic system in a large enterprise should avoid this pattern except as a stopgap solution for a critical business system. It is far better, in the medium and long term, to look at replacing the fronted system with something that provides the needed integration or capability natively.

For example, some solutions aim to enable existing systems to use quantum-secure algorithms. In practice, these solutions often look like a single TLS server sitting in front of another TLS server, or a TLS-based VPN where a TLS client (for example, integrating via SOCKS) interfaces with your existing TLS client, which then communicates with that TLS server sitting in front of another TLS server. You can, of course, implement this, and there are places where it makes sense. However, on a long-term basis, you would be better off if there were native support for quantum-secure algorithms or switching out the legacy system altogether.

Similarly, it’s very common now for these enterprise-focused solutions to split the architecture between on-premise/private cloud and a SaaS component. This approach has several benefits: the on-premise part enables integration, core policy enforcement, and transaction handling, and, if done right, ensures availability. The SaaS component serves as the control plane. This combination gives you the best benefits of both on-premise and SaaS offerings and can be implemented while minimizing the security impact of the service provider.
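To make the split concrete, here is a minimal sketch of the first pattern, assuming a hypothetical `fetch_policy` callable that stands in for whatever call a real product makes to its SaaS control plane. The class name, the policy shape (a mapping of user to allowed actions), and the refresh interval are all illustrative assumptions, not any vendor's actual design. The point it tries to show is that enforcement keeps working from the last known good policy when the control plane is unreachable.

```python
import time
from typing import Callable, Optional


class OnPremEnforcer:
    """Enforces policy locally; the SaaS control plane only supplies policy updates."""

    def __init__(self, fetch_policy: Callable[[], dict], refresh_seconds: float = 60.0):
        self._fetch_policy = fetch_policy        # hypothetical call to the SaaS control plane
        self._refresh_seconds = refresh_seconds
        self._policy: Optional[dict] = None      # last known good policy
        self._fetched_at = 0.0

    def _refresh(self) -> None:
        try:
            self._policy = self._fetch_policy()
            self._fetched_at = time.monotonic()
        except Exception:
            # Control plane unreachable: keep enforcing the cached policy.
            pass

    def allow(self, user: str, action: str) -> bool:
        stale = time.monotonic() - self._fetched_at > self._refresh_seconds
        if self._policy is None or stale:
            self._refresh()
        if self._policy is None:
            return False  # no policy has ever been fetched: default-deny
        return action in self._policy.get(user, [])
```

In this sketch a control plane outage delays policy changes but does not stop transactions, which is the availability property described above. A real implementation would also need to bound how long a stale policy may be used and to back off between failed refresh attempts.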

Another pattern that might be confused with this model is one where transactional policy enforcement or transaction handling happens in the SaaS part of the solution, while the integration element remains on-premise. This is probably the easiest to deploy, so it goes smoothly in a proof of concept. However, it makes the SaaS component mission-critical, a performance bottleneck, and a single point of failure, while also pulling it entirely into the threat model of the component. There are cases where this model can work, but for any system that needs to scale and be highly reliable, it’s often not the best choice.

These architectural decisions in third-party solutions have other large impacts that need to be considered, such as data residency and compliance. These are especially important topics outside the US but are also issues within the US if you handle sensitive data and work in regulated markets. Beyond that, architecture and design choices of third-party products can have vendor lock-in consequences. For example, if the solution is not based on open standards, you may find yourself in an inescapable situation down the road without a forklift upgrade, which is often not viable organizationally if the solution fails to deliver.

So why does all of this matter? When we evaluate the purchase of enterprise security software, we need to be sure to look beyond the surface, beyond the ingredient list, and understand how the system is built and how those design decisions will impact our security, availability, performance, maintainability and total cost of ownership both in the near term and long term. Enterprise architects and decision-makers should carefully consider these factors when evaluating different architectural patterns.

When Words Mislead: Cybersecurity’s Terminology Problem

At Black Hat this year, I did my usual walk around the vendor floor. I talked to lots of companies about their products. One thing that stood out to me is that vendors, either by accident or on purpose, are redefining terms in a way that does harm. One vendor in particular was calling “bearer tokens” “attestations” in both their marketing and product documentation. Let’s use this as an example and break down why this matters.

What’s an attestation?

An attestation is when someone says something is true and puts their reputation behind that statement. It’s like when your friend vouches for you at a new job. A good technology example is a TPM attestation. The TPM in your computer can prove certain things about how your machine started up. When we trust the chip’s design and the company that made it, we can believe what it tells us.

What’s a claim?

A claim is just something someone says. It might be true, but there’s no proof. If I tell you my name is Ryan Hurst, that’s a claim. I haven’t shown you my ID or anything. Claims can also be about other people or things. If I say Cloudflare is safe to use, that’s just my opinion unless I back it up with something.

What’s a bearer token?

A bearer token is like a temporary password. It’s a secret that proves who you are to a service. Anyone who has the token can pretend to be you. We use them because they’re necessary, but we try to limit their use in modern systems.

You usually get a bearer token by trading in something more permanent, like an API key, which is essentially a long-lived password. It’s like swapping a house key for a hotel room key. The hotel key only works for a short time, but anyone who finds it can get into your room.
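A small sketch may help show why possession is all that matters with a bearer token. Everything here is hypothetical: the in-memory `API_KEYS` and `_tokens` stores, the function names, and the five-minute lifetime are assumptions made for illustration, not how any particular product works.

```python
import secrets
import time

API_KEYS = {"long-lived-api-key": "alice"}   # hypothetical long-lived credential store
_tokens = {}                                 # token -> (subject, expiry timestamp)


def exchange_api_key(api_key, ttl_seconds=300, now=None):
    """Trade a long-lived API key for a short-lived bearer token."""
    subject = API_KEYS.get(api_key)
    if subject is None:
        raise PermissionError("unknown API key")
    now = time.time() if now is None else now
    token = secrets.token_urlsafe(32)
    _tokens[token] = (subject, now + ttl_seconds)
    return token


def authenticate(token, now=None):
    """Return the subject the token speaks for, or None. Only possession is checked."""
    now = time.time() if now is None else now
    entry = _tokens.get(token)
    if entry is None:
        return None
    subject, expires_at = entry
    if now >= expires_at:
        return None
    return subject
```

Notice that `authenticate` never asks who is presenting the token. A copy of the token works exactly as well as the original until it expires, which is the hotel room key problem in code form.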

Why does any of this matter?

When companies use the wrong terms to explain what they do, it can lead people to make bad security choices. For example, if you hear a vendor say their system relies on bearer tokens and then you do a search on the term, you’ll find experts talking about their risks and how to manage them. But if you search for attestations, you’ll find different info about how they help prove things are reliable, trustworthy or factual.

If a company selling security products tells you it does one thing, but it does another, it’s a bad sign. They either have some technical debt buried in the design that may have a negative impact, don’t know what they’re talking about, or they’re trying to confuse you. Either way, you might want to look at other options.

So, when you’re buying security products for your company, pay attention to how vendors use words. If they’re changing the meaning of important terms, be careful. It could mean you’re not getting what you think you are.

The True Essence of Secure by Design

When we discuss “secure by design,” we often focus on capabilities, features, and defaults (such as logging and monitoring, default-deny, regular updates, authentication, and minimizing default privileges) rather than on the word “design.” Don’t get me wrong; those things are important. But if we have learned anything from the last several decades of evolving security practices, it is that security features and settings alone do not make for secure products.

For instance, the recent CrowdStrike outages and the Microsoft Storm 0558 incident illustrate that design choices have a significant impact on the security and reliability of a system. So what does it mean to have a secure “design” then?

The whole concept of interconnected systems was bolted onto the way computers work. We didn’t even see a TCP/IP stack included by default in an operating system until BSD did this in 1983. Microsoft Windows didn’t get a TCP/IP stack until 1994, around the same time TISFWTK was released, which was a framework to build a firewall, not a firewall itself. For some, these developments may seem like ancient history; however, the methods we use to build and secure software and systems have evolved from this bolt-on approach to security.

“The beginning is the most important part of the work.” — Plato

In my view, a secure design incorporates an understanding of the entire lifecycle of a system. It recognizes that the question is not if it will be hacked, but when. It then employs questions such as “How do you reduce the chances of that happening?” and “How can you make it easy to recover when it does?” to inform the system’s shape.

The Storm-0558 case is a great example of insecure design. Loading extremely valuable keys into memory on a front-end machine, in the user context of code parsing user-supplied data, enabled an unauthenticated user on the internet to trigger a memory dump that exposed the key to theft via another attack vector. At a minimum, this key should have been kept in a separate user process, one that operated at a different permission level.

Similarly, the update mechanism for CrowdStrike’s ‘signatures’ functioned like a command and control network for a large array of bots installed on mission-critical systems. The fact that they could, in essence, push code at runtime, at will into the kernel privilege space of their customers means that a malicious actor with access to these systems could push malicious code onto all of those endpoints simultaneously. This insecure design choice put customers in a precarious position, making CrowdStrike’s update system a direct and uncontrollable method for compromising their systems.

How Did We Get Here?

Today, much of the software we depend on still carries the legacy of this bolt-on approach to security. This led the industry to adopt compliance and audit standards similar to the programs used in financial services, such as the Sarbanes-Oxley Act, which paved the way for standards like ISO 27001 and PCI DSS, which aspired to manage this risk. However, these measures, like the financial audits that preceded them and missed major frauds such as those at Enron and Wirecard, unsurprisingly also often fail to prevent significant security risks.

This in turn led to a shift towards posture management, where products aimed at risk discovery have proliferated. However, these solutions predominantly focus on identifying potential problems and leave the filtering and remediation to IT security. The result is a flurry of risk mitigation tasks rather than the identification of root causes and the design changes that would fundamentally reduce the risk. These approaches are symptomatic of a broader issue in security: treating the symptoms of poor foundational security rather than addressing the root causes.

Alongside these developments, the last few decades have seen massive growth in the adoption of distributed computing technologies, which first led to MSPs, then SaaS, and now cloud computing. This brought about the concept of shared responsibility, where security is managed as an incremental part of “solving the problem” for the buyer, allowing them to focus on their core business. For many, if not most, business problems, this is the right way to go, but it also often complicates accountability and obscures the visibility of security controls, sometimes exacerbating the very issues it seeks to mitigate.

Is It Really That Bleak?

At first glance, this all might seem pretty dire. After all, we do not get to replace all the technology we depend on at once, and even if we could, these new solutions will always come with their own problems. However, we must accept that you can’t effectively retrofit “secure by design” onto an existing system—or if you do, at best it will take decades to replace poorly designed bits and pieces of those existing systems. Even when improvements do occur, as we saw with Microsoft’s Trustworthy Computing initiative, eventually the organization may lose focus and start to regress, leaving us on a rollercoaster of empty promises.

We have seen some organizations take this to heart; a good example is Google. They built their infrastructure and processes on platforms that address foundational issues, allowing the vast majority of their developers to remain largely unaware of the broad spectrum of security challenges that operating services like theirs are subject to. For instance, most Google developers couldn’t tell you much about the infrastructure that runs their code. They build on top of mandated frameworks that “make the right thing the easy thing,” enabling the central security organization to perform core security functions.

A telling example of the design of these systems is how the entire concepts of authentication, authorization, and encryption are abstracted away from the developer, so they largely just happen automatically. To be clear, this approach is not without its faults, and Google’s systems are neither perfect nor suitable for all types of workloads. However, the base principles applied here are fairly universal: Make investments in infrastructure design upfront to ensure that the systems built on them are secure, well, by design.

We are starting to see the industry come around to this reality with the advent of platforms like Kubernetes and infrastructure such as SPIFFE (the Secure Production Identity Framework For Everyone), which are designed to foster environments where security is integrated at the core rather than being an add-on. But this is just the beginning of the focus on incorporating security infrastructure into the way we build, not the end.

Conclusion

What is necessary is a cultural shift in how we approach design. For this to happen, customers need to start demanding transparency from vendors about how their systems are designed, assessed, operated, and serviced. Part of this responsibility involves conducting their own assessments and not relying solely on blind faith. The supply chain risks and global-scale outages we have recently come to recognize are a direct result of such blind trust.

By demanding that vendors provide a comprehensive narrative on how they’ve built their systems to be secure, rather than merely listing security features, and by engaging in initial and continuous technological due diligence to hold them accountable, we can hasten the transition to a world where security is woven into the fabric of every system, not merely added as an afterthought.

In conclusion, while we are just at the beginning of this journey and will have to continue managing systems that were not designed with transparent “secure by design” principles for decades, the direction is clear. We are moving towards a world where code, as well as users, are authenticated, and automated intelligent systems that utilize expert systems and AI evaluate whether systems are regressing or living up to their promises. If we continue on this path, our children and grandchildren may have the opportunity to live in a world that is inherently more secure, or at the very least, they will be better equipped to manage the security risks associated with the massive expansion of technology that we rely on daily.

CA Misissuance: A Tale of Two Incident Responses

Certificate Authorities on the web are tasked with validating that the websites we visit are truly associated with the domains that serve the content that makes up that website. As a result, we gain confidence that we know which domains we are communicating with and can trust that no one else can view our interactions with them.

CAs are trusted because user agents—typically web browsers—chose to delegate the “Trust On First Use” (TOFU) problem to these authorities. A user agent acts on behalf of the user, aiming to serve and protect the user’s interests while navigating the internet. In this context, ensuring secure communication with the correct site is managed through this delegation of domain control verification to CAs.

Many often forget that a CA’s role does not stop at the time of certificate issuance; they also manage revocations when keys are compromised or replaced, the certificates are issued in error, or when they are not compliant with the industry standards the CA has agreed to meet as a publicly trusted CA—a responsibility pivotal to maintaining the trust billions place in them.

To illustrate this further, not only is this their responsibility, but they also have obligations under the audit regimes they operate under. Depending on the scenario, they may need to revoke the certificates within 24 hours to 5 days. When you are potentially talking about tens of thousands or millions of certificates, this requires both good engineering and preparation.

“By failing to prepare, you are preparing to fail.” — Benjamin Franklin

In this post, I want to delve into how two different CAs approached issues in handling two misissuances involving failures to meet the requirements for domain control verification. The contrast between these two incidents is striking and highlights the importance of preparation, good engineering, and strategic decision-making in crisis management.

Case Studies

Let’s Encrypt: A Proactive Approach

Let’s Encrypt demonstrated exemplary crisis management during their 2020 domain control verification-related incident. Through compensating controls, they internally discovered a bug in their domain control verification logic. With the right data, tooling, and processes in place, they were able to act decisively: filing the initial incident quickly, analyzing the scope and impact of the incident, and revoking over 1.7 million out of 3 million impacted certificates within hours. They were able to do this because the service and organization were built from day one to make it possible. One of the most telling elements of their response was their ability to quickly determine which certificates had already been replaced and which were still in use. This information guided their communication strategy and helped identify which certificates were safe to revoke immediately. Furthermore, within the 24-hour revocation requirement, they filed a subsequent delayed revocation incident.

This captured why the remaining certificates were not immediately revoked and when they would all expire, helping browsers and the community understand not only that they had done their research but also how their handling of the issue could impact the web and what the exposure of the delayed revocation represented.
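The triage described above depends on having the right data on hand before the incident starts. As a rough sketch, assuming a hypothetical inventory record with `affected`, `replaced`, and `seen_in_use` flags (these field names are mine, not Let’s Encrypt’s), the split between "revoke now" and "needs a delayed revocation plan" might look like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IssuedCert:
    serial: str
    affected: bool      # hit by the bug under investigation
    replaced: bool      # a newer certificate for the same names has already been issued
    seen_in_use: bool   # observed on the network by a recent scan


def triage(certs):
    """Split affected certificates into those safe to revoke now and those needing care."""
    revoke_now, needs_outreach = [], []
    for cert in certs:
        if not cert.affected:
            continue
        if cert.replaced or not cert.seen_in_use:
            revoke_now.append(cert.serial)
        else:
            needs_outreach.append(cert.serial)
    return revoke_now, needs_outreach
```

The logic is trivial. The hard part is that a CA only gets to run something like it in hours if it already tracks issuance, replacement, and usage for every certificate it has issued.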

DigiCert: A Reactive Approach

In contrast, DigiCert’s response to their domain control-related misissuance highlights several areas for improvement. The issue had persisted for an extended period and was identified externally, rather than through internal engineering and compliance processes designed to catch such errors.

DigiCert also struggled with the logistics of mass revocations. Faced with tens of thousands of certificates, they had difficulty determining which could be revoked quickly and which were actually impacted, and as a result initially opted to start a plan to revoke all certificates within 24 hours. Though they started this clock when they figured out the scope of the incident, it took them a while to do so, suggesting a lack of preparation for incidents. They also appear to have done this all without a clear strategy to mitigate the customer impact and, through a sequence of events, ultimately rolled out a 5-day delay on top of the initial plan. This all led to the unusual step of a subscriber filing a Temporary Restraining Order (TRO) against the revocation, which set a troubling precedent and complicated the situation further.

A significant challenge was that many DigiCert customers do not utilize automated certificate lifecycle management extensively, which is essential during such incidents. This issue reflects back on DigiCert’s own processes, as manual lifecycle management is the default rather than the exception.

This incident raises questions about DigiCert’s preparedness, including whether they maintain a comprehensive list of all valid certificates, track replacements, and monitor customer use of lifecycle management. It also prompts a reflection on their internal review processes and test automation for compliance with CA/Browser Forum Requirements.

“Learn all you can from the mistakes of others. You won’t have time to make them all yourself.” – Alfred Sheinwold

It also makes me wonder about the type of design review process that was in place. The description in the incident sounds like a fix was applied at the user interface level, which no serious engineer would do. Beyond that, it’s also unclear what kind of test automation is incorporated into their engineering processes to ensure compliance with associated requirements. For example, it would be quite reasonable for the process to mandate unit tests for all normative-level requirements from the CA/Browser Forum Requirements and associated IETF documents. This would ensure that initial and subsequent releases meet requirements, which is particularly crucial for an organization with many engineers who may not all be experts in every requirement.
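To illustrate what a unit test for a normative requirement could look like, here is a small sketch built around one well-known trap: RFC 5280 defines a certificate’s validity period as running from notBefore through notAfter inclusive, so naively adding the policy lifetime to notBefore yields a certificate that is one second too long. The 90-day limit and the function names below are illustrative assumptions, not any CA’s actual code.

```python
import unittest
from datetime import datetime, timedelta, timezone

MAX_VALIDITY = timedelta(days=90)   # hypothetical policy limit


def validity_seconds(not_before, not_after):
    """Validity period in seconds, counting both endpoints (RFC 5280, section 4.1.2.5)."""
    return int((not_after - not_before).total_seconds()) + 1


def within_policy(not_before, not_after, limit=MAX_VALIDITY):
    return validity_seconds(not_before, not_after) <= int(limit.total_seconds())


class ValidityPeriodTest(unittest.TestCase):
    def setUp(self):
        self.start = datetime(2024, 1, 1, tzinfo=timezone.utc)

    def test_exact_limit_is_allowed(self):
        not_after = self.start + MAX_VALIDITY - timedelta(seconds=1)
        self.assertTrue(within_policy(self.start, not_after))

    def test_naive_addition_is_one_second_too_long(self):
        not_after = self.start + MAX_VALIDITY
        self.assertFalse(within_policy(self.start, not_after))
```

Saved to a file, this can be run with `python -m unittest`. A test like the second one encodes the requirement itself rather than the current implementation, so a later refactor that reintroduces the off-by-one fails the build instead of reaching production.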

Engineering Readiness and Strategic Decision-Making

These case studies highlight critical lessons in engineering readiness and crisis management. Systems designed for scale and availability must be built with proactive planning, automation, and testing in mind. The reality is that systems designed for both scale and availability are also operated with engineering readiness as a core tenet. If you read the O’Reilly Site Reliability Engineering book, you will notice the same undertow of planning, automating, testing, and preparing. That is because the only way to have a reliable system or process is to engineer it to be that way.

Beyond that, if you’ve ever been on call, you understand how crucial it is that both mundane and unexpected issues be prevented and automated so you are able to respond effectively or, even better, never experience entire classes of problems as you learn how to operate a CA. This is why, following Let’s Encrypt’s 2020 incident and their second mass revocation in 2021 (for certificates that were one second longer than their policy allowed), they began exploring ways to ensure this never happens again. This is what led us to ACME Renewal Information (ARI), which aims to make these unplanned certificate renewals boring. This proactive approach aligns with the principle:

“An error doesn’t become a mistake until you refuse to correct it.” – Orlando A. Battista
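For readers who have not looked at ARI, the core idea is that the CA publishes a suggested renewal window for each certificate and clients pick a time inside it, which lets the CA pull renewals forward when a revocation is coming. The sketch below assumes a response body with a `suggestedWindow` object containing `start` and `end` timestamps, which is my reading of the ARI specification; treat the field names and the helper functions as assumptions and check the specification before relying on them.

```python
import json
import random
from datetime import datetime, timezone


def parse_time(value):
    # Older Python versions of fromisoformat do not accept a trailing "Z".
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def pick_renewal_time(renewal_info_json, now=None, rng=random):
    """Pick a random renewal time inside the CA's suggested window."""
    now = now or datetime.now(timezone.utc)
    window = json.loads(renewal_info_json)["suggestedWindow"]
    start, end = parse_time(window["start"]), parse_time(window["end"])
    chosen = start + (end - start) * rng.random()
    # If the chosen time has already passed, renew right away.
    return max(chosen, now)
```

Picking a random point in the window spreads client load, and a window that the CA has moved into the past turns into "renew now" without anyone sending an email or opening a ticket.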

The browsers have long stressed the importance of blameless postmortems and a culture of self-improvement. They do this because, without that, people within an organization do not feel empowered to do the right thing. At the same time, how an organization responds to an incident is very telling about its culture, transparency, operational excellence, focus on compliance, and engineering practices, and as such it serves as a useful indicator of an organization’s competence and suitability to be part of the threat model of the internet.

At the last CA/Browser Forum Face to Face meeting the Chrome team talked a bit about this.

Lessons Learned

These incidents highlight several key lessons for CAs, including the importance of automated systems, transparent communication, and continuous learning.

1. Preparation is Key: Automated systems, supporting tooling, and predefined response strategies are crucial. For effective incident management, CAs should develop and test incident response plans regularly to ensure rapid action during a crisis.

2. Transparency Builds Trust: During crises, maintaining open and regular communication is essential. CAs should ensure that their communication strategies are clear and consistent to build and maintain trust with stakeholders and the community.

3. Learn from Others: Assign teams and individuals to conduct regular reviews of both historical and current incidents. Have them present these findings to the organization and rotate this responsibility across different disciplines to ensure knowledge is shared.

Conclusion

Acknowledging that DigiCert displayed a level of transparency during their incident, it’s clear that the event still exposed significant gaps in their compliance and engineering practices. Transparency, while crucial, is only one component of effective crisis management. The DigiCert case illustrates that even when stakeholders are kept informed, underlying deficiencies in operational procedures and compliance frameworks can compromise both the efficacy of the response and the overall security posture.

“The decisions you make reflect the priorities you have.”

Some recommendations for CAs looking to learn from this incident include:

  • DigiCert’s incident highlighted the need for more robust compliance mechanisms. Regularly updating compliance protocols and ensuring they are in strict alignment with industry standards is essential. This includes periodic reviews and updates to incorporate new regulatory requirements and best practices.
  • The incident revealed that while DigiCert was transparent, there was a lack of sufficient engineering controls to prevent or mitigate the impact of such issues. Implementing rigorous engineering best practices such as continuous integration, automated testing, and detailed, proactive monitoring of systems can help prevent similar incidents.
  • While DigiCert was forthcoming about the issue, their response strategy could benefit from a more structured approach that includes predefined escalation paths, detailed role assignments, and rapid response teams specialized in different aspects of the crisis management process.
  • Beyond immediate crisis management, fostering a culture prioritizing continuous improvement and learning from past mistakes is vital. This involves regular training, simulations of potential issues, and a blame-free environment that encourages reporting and swift resolution of potential threats.

In conclusion, effective crisis management involves not only responding to incidents but also implementing robust preventive measures. Regular updates to designs, implementations, tooling, improved engineering controls, and a culture of continuous improvement are essential for minimizing the likelihood and impact of future incidents.

Exploring Requirements for Timely Certificate Problem Reports and Revocations

Today, DigiCert’s mass revocation is in the news, so I thought it would be worthwhile to explore the rules for a CA when it comes to revoking a subscriber certificate and the choices and constraints a CA faces in circumstances like this.

First, let’s examine the types of issues that might lead a CA to revoke a certificate before its natural expiration. Broadly speaking, there are two categories of these issues: 

Externally identified issues: these start from what is called a Certificate Problem Report (CPR), which may originate from subscribers, relying parties, application software suppliers, and other third parties.

Internally identified issues: these may be identified by pre-issuance checks, post-issuance checks, or internal periodic reviews.

Problem reports from external entities almost always require some minimal level of investigation. An exception might be cases where the reporter has access to the private key and requests that a certificate be revoked. This could come from the subscriber or from someone who has gained access to the private key. Even in these cases you often need to do multiple communication round trips because requests often do not contain sufficient information to authenticate the request from the outset.

There are other cases too; for example, sometimes a problem report claims unauthorized issuance of a certificate, but it turns out the reporter had authorized it, perhaps by signing up for a CDN or SaaS offering that uses that CA. In these cases you might need to contact the person who requested the certificate or the service provider for more information, or ask the reporter to prove control of the domain to verify their authorization to request the revocation, again introducing a few round trips.

However, most problem reports are spam or requests to take down a website. In this second case, if a CA has a policy about revoking certificates for domains that serve such content, then it also needs some level of investigation. On that, check out this post on the CA’s role in phishing mitigation for why CAs should stay out of this class of issue altogether.

These are just a few examples of certificate problem reports a CA might see, but there are more. One of the most actionable is a claim that there was a violation of the requirements that a CA must meet, such as including incorrect information in a certificate, using a certificate profile with a wrong extension, or using a domain control verification process that does not meet requirements.

The workflow for these requests is straightforward, but it still requires some investigation. The questions that need to be answered before taking action include things like: When was the certificate issued? What rules were in force at that time? What data was used when validating the entitlement of the requester? What version of the software implemented the validation at the time of issuance? Is the reporter’s interpretation of the requirement technically correct?

This is why the Baseline Requirements (BRs) allow CAs some time for an investigation:

CA SHOULD revoke a certificate within 24 hours and MUST revoke a Certificate within 5 days

NOTE: Here is a conversation on the last changes to this text. Searching the forums is always a good way to validate what the intent was at the time of authorship.

Let’s look at an example here. In 2020, Let’s Encrypt had a misissuance due to a bug in CAA handling. This issue was found internally through what we would call mitigating controls, a process that exists to double-check you are doing the right thing.

That’s not a lot of time, especially if such a report comes in on New Year’s Eve. CAs maintain on-call rotations to investigate issues, but it’s easy to imagine that from time to time, an investigation might take longer than 24 hours to fully complete. Despite this, the CAs shouldn’t have an infinite amount of time for such investigations, which is why the text limits the time to revocation to a maximum of 5 days.
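To put numbers on the 24-hour and 5-day limits quoted above, here is a small worked example. The New Year’s Eve report time is made up for illustration, and the constant and function names are mine.

```python
from datetime import datetime, timedelta, timezone

SHOULD_REVOKE_WITHIN = timedelta(hours=24)   # the SHOULD limit from the quoted text
MUST_REVOKE_WITHIN = timedelta(days=5)       # the MUST limit from the quoted text


def revocation_deadlines(received_at):
    """Return (should_revoke_by, must_revoke_by) for a report received at the given time."""
    return received_at + SHOULD_REVOKE_WITHIN, received_at + MUST_REVOKE_WITHIN


# Hypothetical report arriving on New Year's Eve at 18:00 UTC.
report = datetime(2023, 12, 31, 18, 0, tzinfo=timezone.utc)
should_by, must_by = revocation_deadlines(report)
print(should_by.isoformat())  # 2024-01-01T18:00:00+00:00
print(must_by.isoformat())    # 2024-01-05T18:00:00+00:00
```

The clock does not care about holidays: a report that lands on New Year’s Eve has its first deadline on New Year’s Day, which is why on-call rotations and rehearsed tooling matter so much here.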

Let’s explore what this might look like in the case of a real incident. On February 29, 2020, Let’s Encrypt discovered an issue in their domain control verification logic through an internal review and kicked off an investigation. On the same day they discovered the issue, they announced their plan to revoke the more than 3 million affected certificates. Revocations began on March 4, 2020, and by the March 5, 2020 deadline, they had revoked 1,706,505 certificates. Network scans revealed that a large portion of the remaining 1,003,596 certificates were still in active use.

Due to concerns over the potential impact revocation would have on these active sites, they filed an incident for delayed revocation just before the deadline. These remaining certificates either expired or were revoked by May 29, 2020, after which the ticket remained open for community discussion and further analysis.

Let’s Encrypt did many things right here, including implementing compensating controls to detect issues internally, preparing to analyze the scope of impact immediately, and revoking the certificates they could without causing disruption as soon as possible. When it became clear that they would not meet the deadline requirements, they filed a bug to document their non-compliance with the obligations before the deadline lapsed. Moreover, they communicated clearly and regularly throughout the entire process until the issue was resolved.

But what about the missed revocation deadline!?! The reality is that delayed revocations happen. It’s not ideal, but they do happen. What is important is how a CA deals with them, and how credibly. Let’s Encrypt executed like a well-oiled machine here; no one complained. (They did this all without getting distrusted; for more on how CAs usually get distrusted see.)

There are other cases where revocations were delayed as well. For example, in 2019 Apple, Google, and GoDaddy were all hit by a bug in EJBCA, and all of them had some delayed revocations. While there may have been some challenges in each of those cases, you could tell that, like Let’s Encrypt, they were being proactive, transparent, and working towards addressing the issue responsibly. When you looked at their broader history, there was no sign of systemic failures.

OK, so if you are in the CA industry, right now you should be saying, “But Ryan, what about the Certificate Problem Report response requirements?! This DigiCert issue was reported externally, so those requirements are also relevant.” If so, you are right. These requirements say:

The CA SHALL maintain a continuous 24×7 ability to accept and respond to revocation requests and Certificate Problem Reports.

And …

The period from receipt of the Certificate Problem Report or revocation-related notice to published revocation MUST NOT exceed the time frame set forth in Section 4.9.1.1. The date selected by the CA SHOULD consider the following criteria:

– The nature of the alleged problem (scope, context, severity, magnitude, risk of harm);

– The consequences of revocation (direct and collateral impacts to Subscribers and Relying Parties);

– The number of Certificate Problem Reports received about a particular Certificate or Subscriber;

– The entity making the complaint (for example, a complaint from a law enforcement official that a Web site is engaged in illegal activities should carry more weight than a complaint from a consumer alleging that they didn’t receive the goods they ordered); and

– Relevant legislation.

While Section 4.9.1.1 says:

With the exception of Short-lived Subscriber Certificates, the CA SHALL revoke a Certificate within 24 hours and use the corresponding CRLReason (see Section 7.2.2) if one or more of the following occurs:

If you’re paying attention, you might be thinking to yourself, “Wait a minute, doesn’t that mean, since the request came in from an external reporter, that the certificate should be revoked within 24 hours?”

Well, yes, it does! This totally makes sense in some cases, for example, if a notice comes in of a key compromise that has all the required detail to confirm it. But as we discussed above, requests received externally often require some level of verification. This raises the question: is 24 hours enough time to handle that or not? One thing is for sure, though: arguing that the rules are wrong or that the rules should be changed during an incident response is a recipe for disaster.

So what would I have done if I were DigiCert? I would have stopped issuance as soon as the issue was identified. I would have then set engineers to work on resolving the issue in the right way. In parallel, I would have prepared a preliminary incident report, noting that I had stopped the issuance. I would have also noted in that incident report any confusion I may have had about the language in the requirements. I would have then committed to start revoking following the requirements, and before the 24-hour clock expired, I would have started replacing and revoking the certificates that could be revoked safely. If it was clear that despite active, good faith, best efforts, the revocation was not going to be completed in time, then I would have filed a delayed revocation incident before that clock expired, essentially following the same playbook Let’s Encrypt used back in 2020.

After successfully navigating the incident, I would have worked within the CA/Browser Forum to discuss this language and either make it clearer why the difference exists or change it to align with the other text.

We did not really tackle the topic of ‘Internally Identified Issues’. I mentioned them above but didn’t really explain them in detail, so let’s see if I can capture them without rambling too much.

As I mentioned earlier, there are several classes. In the Let’s Encrypt case, they had some compensating controls around design reviews that identified an issue. When it comes to cases like this, it is pretty straightforward; by the time you have completed the design review, you are sure there is an issue at play, and the clock starts ticking.

Regarding pre-certificate linting, it’s a failure in your linting management if a certificate gets issued with a problem that could or should have been caught by your linting; this is more akin to a design defect than a CPR-like incident, and as such it would certainly trigger the 24-hour clock.

Then there is post-issuance linting, which is used, for example, as a spot check during data gathering for an audit, as a compensating control, or when you are testing new rules for the linter or refining your linter policy configuration. These cases are by their very nature a lot more like a CPR: because you don’t have a fine-tuned policy for the linter, they too need some level of investigation.

To be clear, none of this appears related to the DigiCert incident. However, they have not yet published a detailed incident report, which will include an end-to-end timeline, and that may change some of this analysis. I wanted to try to cover the topic more holistically in this post before declaring victory tonight.

Regardless of the above, the key takeaway is that if you are a CA or a customer of a CA, you need to be prepared to respond to an incident. One of the most useful actions you can take to achieve this is to adopt ACME for certificate lifecycle management. In particular, look for ACME implementations and CAs that support ARI, an extension of ACME. This extension enables a CA to signal to its enrollment clients that their certificate needs to be replaced sooner than expected, allowing them to roll certificates without requiring on-call incident response or causing downtime.
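To give a feel for how a client might act on an ARI signal, here is a rough sketch. As I understand the ARI specification, the CA returns a suggested renewal window with a start and an end, and the client picks a random time inside it; if that time is already in the past, it renews right away. The sample response and function names below are hypothetical, and the network fetch is left out entirely.

```python
import random
from datetime import datetime, timedelta, timezone


def parse_time(value: str) -> datetime:
    """Parse an RFC 3339 timestamp such as 2024-07-29T00:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def pick_renewal_time(renewal_info: dict, rng: random.Random) -> datetime:
    """Pick a uniformly random instant inside the CA's suggested window."""
    window = renewal_info["suggestedWindow"]
    start, end = parse_time(window["start"]), parse_time(window["end"])
    offset = rng.uniform(0, (end - start).total_seconds())
    return start + timedelta(seconds=offset)


def should_renew_now(renewal_info: dict, now: datetime, rng: random.Random) -> bool:
    """Renew immediately if the selected time has already passed."""
    return pick_renewal_time(renewal_info, rng) <= now


# Hypothetical response body; a real client would fetch it from the CA.
sample = {
    "suggestedWindow": {
        "start": "2024-07-29T00:00:00Z",
        "end": "2024-07-30T00:00:00Z",
    }
}
now = datetime(2024, 8, 1, tzinfo=timezone.utc)
print(should_renew_now(sample, now, random.Random()))
```

When a CA needs certificates replaced during an incident, it can move the window into the past, and clients that poll regularly will renew on their next check without anyone being paged.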

If you are interested in this topic, I recommend checking out the Sectigo “Root Causes” podcast which has a great conversation on Chrome’s recent presentation at the CA/Browser Forum on Incident Reporting. 

This presentation can be summarized as:

– Incident reporting is an opportunity to demonstrate continuous improvement and to make the ecosystem more resilient and secure.

– We rely on the public incident reporting process to demonstrate that continued trust is justified.

– We think incident reporting should be boring (i.e. routine).

I hope this was helpful. Let me know if you think I missed something or if my analysis is wrong in some way. When processing prose as code, there are almost always slightly different takes, and I wouldn’t be surprised to find out others feel differently than me on this issue.

Thanks to all of you who read this and provided feedback — it takes a village.

HSMs Largely Protect Keys from Theft Rather Than Abuse

HSMs were designed to protect keys from theft and to move those keys into a different security domain than the code that uses those keys. The workloads using these HSMs authenticate to them with credentials or, worse, shared secrets that are often pushed down to the machines via CI pipelines or at imaging time. These API keys and credentials are often stored in key vaults like HashiCorp Vault, making them no more secure than the key vault itself. Unfortunately, they’re also seldom rotated. If an attacker gains code execution on the box or gains access to the API keys or credentials in some other way, they can sign or encrypt with the keys on the HSM that the associated credentials have access to. The attacker doesn’t even need to stay resident on the box with the key or credential because they are usually just stored in environment variables and files on the box, allowing them to be taken and used later from any network location with a line of sight to the HSM.

In short, beyond a simplistic access control model, HSMs usually do not protect keys in use; they protect them from theft. To make things worse, since they have no concept of the workload, their auditing mechanisms lack adequate detail to even usefully monitor the use of keys.

Challenges Using HSMs

By design, the administrative model of HSMs is quite different from what we’re typically used to. The goal of the HSM design was to prevent regular IT staff or third parties with access to the facilities containing the HSMs or those with code execution on the boxes connected to them from being able to abscond with the keys. These use cases were almost always low-volume systems that were infrequently used relative to other workloads. Their performance often becomes a bottleneck. It’s possible to design deployments that can keep up with and meet the availability requirements of large-scale systems, but this often requires deploying clusters of HSMs in every region and every cluster where your workload exists and having your data center staff manage these devices, which were designed around largely manual physical administration.

To make things worse, the only compartmentalization concept these HSMs have is usually the concept of a “slot,” which you can think of as a virtual HSM within that physical HSM. Each one you use increases your operational overhead, resulting in customers often either sharing one slot with many workloads or simply not using the HSMs at all. This often makes the capital and non-capital costs of HSMs impractical, both in at-scale systems and in less security-critical use cases. In cases where these challenges make HSMs impractical, there are often other approaches that can still help mitigate key compromise risks, so it’s not a question of all or nothing.

When HSMs Make Sense

To be clear, HSMs are useful for securing cryptographic keys from theft and are essential in several high-stakes scenarios. For example, when keys need to live for long periods and be managed independently of individuals in an organization who may come and go, and when physical theft of keys is a concern, HSMs are crucial for a sustainable solution. A great example of this is the key material associated with a root Certificate Authority or cryptocurrency wallet. These keys seldom change, live for many years, and must survive many risks that many other use cases do not face.

Beyond key storage, in some cases, HSMs can be used as part of a larger security system where the consumption of key material is a small part of the security operation. In these cases, code that will be executed within an HSM is written to enable them to be part of how the overall system delivers abuse protection. For example, Apple has developed code that runs on HSMs to help iPhone users recover their accounts with reduced exposure to attacks from Apple staff. Some cryptocurrency companies implement similar measures to protect their wallets. In these use cases, the HSM is used as a trusted execution environment, a stronger confidential computing-like capability, for the TCB of a larger software system. This is achieved by running code on the HSM that exposes a higher-level transactional interface with constraints such as quorums, time-of-day restrictions, rate limiting, or custom workload policies. These solutions often generate the message to be signed or encrypted in the HSM and then use a key protected within the HSM to sign or encrypt that artifact.
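To illustrate what a higher-level transactional interface with constraints might look like, here is a toy sketch. It is not how any particular HSM or vendor does it: the class, its parameters, and the approver names are hypothetical, and HMAC-SHA256 simply stands in for whatever signing operation the protected key would perform.

```python
import hashlib
import hmac
import time
from typing import Optional


class PolicySigner:
    """Toy model of policy code running next to a protected key."""

    def __init__(self, key: bytes, approvers: set, quorum: int, max_per_minute: int):
        self._key = key  # in a real HSM this never leaves the device
        self._approvers = approvers
        self._quorum = quorum
        self._max_per_minute = max_per_minute
        self._recent = []  # timestamps of signatures in the last minute

    def sign(self, message: bytes, approvals: set, now: Optional[float] = None) -> bytes:
        now = time.time() if now is None else now
        # Quorum: count only approvals from known approvers.
        if len(approvals & self._approvers) < self._quorum:
            raise PermissionError("quorum not met")
        # Rate limit: drop entries older than 60 seconds, then check the budget.
        self._recent = [t for t in self._recent if now - t < 60]
        if len(self._recent) >= self._max_per_minute:
            raise RuntimeError("rate limit exceeded")
        self._recent.append(now)
        # HMAC-SHA256 stands in for whatever signing the HSM would do.
        return hmac.new(self._key, message, hashlib.sha256).digest()


signer = PolicySigner(b"demo-key", {"alice", "bob", "carol"}, quorum=2, max_per_minute=1)
signature = signer.sign(b"transfer 10 units", {"alice", "bob"}, now=0.0)
```

The point is that the caller never gets a raw "sign anything" operation. Stolen credentials alone would not satisfy the quorum, and even a valid quorum is throttled.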

HSMs are also often mandated in some environments, largely for historical reasons, but they’re required nonetheless. The impracticality of this requirement has led to modifications in the security model offered in cloud HSMs over their traditional designs. These modifications weaken the original security guarantees that HSMs were expected to deliver so that modern systems can continue to use them. For example, HSMs originally required operators to bring the HSMs back up after a power failure using physically inserted tokens or smart cards and PINs, but now they can be configured to automatically unlock. Additionally, the use of HSMs in the cloud is now often gated by simple API keys rather than smart cards or other asymmetric credentials bound to the subject using the key. With all this said, requirements are requirements, and many industries like finance, healthcare, and government have requirements such as FIPS 140-2 Level 2+ and Common Criteria protection of keys which lead to mandated use of HSMs even when they may not be the most appropriate answer to how to protect keys.

The Answer: Last Mile Key and Credential Management

While HSMs provide essential protection for cryptographic keys from theft, for many use cases they fall short in preventing the misuse of keys and credentials. To address this gap, organizations also need robust last-mile key and credential management to complement HSMs, ensuring the entire lifecycle of a key is secured. Video game companies do it, media companies do it, and so should the software and services we rely on to keep our information safe.

  1. Key Isolation and Protection: Protect keys from the workloads that use them by using cryptographic access controls and leveraging the security capabilities provided by the operating system.
  2. Dynamic Credential Management: Implement systems that automatically rotate credentials and API keys. This limits the value of exfiltrated credentials and keys to an attacker.
  3. Granular Access Controls: Implement strong attested authentication of the workload using the keys, enabling access controls that ensure only authorized entities can access the cryptographic keys.
  4. Enhanced Visibility and Auditing: Integrate solutions that provide detailed visibility into how and where keys and credentials are used. This enables detection of usage anomalies and quick impact assessments during security incidents.
  5. Automated Lifecycle Management: Utilize automated tools to manage the entire lifecycle of keys and credentials, from creation and distribution to rotation. This increases confidence in your ability to roll keys when needed.
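As one illustration of the dynamic credential idea in the list above, here is a minimal sketch of short-lived credentials. The token format, function names, and five-minute lifetime are assumptions I made up for the example; production systems would typically use an established token format and asymmetric keys.

```python
import hashlib
import hmac


def issue_credential(secret: bytes, workload: str, now: int, ttl: int = 300) -> str:
    """Mint a credential that names the workload and expires after ttl seconds."""
    payload = f"{workload}|{now + ttl}"
    tag = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{tag}"


def verify_credential(secret: bytes, token: str, now: int) -> bool:
    """Accept only authentic, unexpired credentials."""
    payload, sep, tag = token.rpartition("|")
    if not sep:
        return False
    expected = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(tag.encode(), expected.encode()):
        return False
    # The payload is authentic at this point, so the expiry is one we wrote.
    expiry = int(payload.rpartition("|")[2])
    return now < expiry


token = issue_credential(b"server-secret", "billing-api", now=1000)
print(verify_credential(b"server-secret", token, now=1100))  # still fresh
print(verify_credential(b"server-secret", token, now=1300))  # expired
```

A credential lifted from an environment variable is only useful until it expires, which turns a long-lived secret into something an attacker has to keep re-stealing.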

This combination of approaches not only protects keys and credentials from theft and reduces their value to attackers but also ensures their proper and secure use, which turns key management into more of a risk management function. A good litmus test for effective key management is whether, in the event of a security incident, you could rotate keys and credentials in a timely manner without causing downtime, or assess with confidence that the keys and credentials were sufficiently protected throughout their lifecycle so that a compromise of an environment that uses cryptography does not translate to a compromised key.

Thinking more holistically about the true key lifecycle and its threat model can help ensure you pass these basic tests.

Turning Catastrophe into Opportunity: Improving Processes and Vendor Accountability

We often hear about customers not wanting to deploy agents, usually citing the rationale that while each one may only consume a bit of memory and CPU, the sum of them slowly but surely grinds systems to a halt. The real impact is management overhead, operational risk, and attack surface. Attack surface is also something that requires nuance to understand. For example, software that runs in a privileged context, such as in the kernel, parses data, communicates on a network, or is loaded at boot, represents a lot more risk than software that runs with least privilege in its own user session or in the runtime context of the user.

As they say, you can’t have your cake and eat it too. There are always trade-offs, and when it comes to software, those trade-offs usually involve choosing between performance, security, and delivering on a value proposition.

The full impact of the CrowdStrike outage won’t be understood for some time, but one thing is for sure: organizations that sell solutions reliant on agents will need to provide much more justification and explanation about how their software works and how they manage updates.

It also means that organizations, at least the mature ones, will be re-evaluating what they have on these endpoints, the value they provide, and the risks they represent from both security and operational perspectives. These organizations will also be revisiting the controls they use to manage the ingestion of the software they rely on, and how they manage that risk over time since software is a living entity and not something static.

In short, like most catastrophes, there is a silver lining. This will be a great opportunity to improve existing processes to help prevent entire classes of failures like this. Hopefully, it will include a more robust investment in holding vendors accountable and thoroughly checking their work.

Reading the Tea Leaves: What Led to the Largest IT Outage in History

Last night, on July 18, 2024, a significant IT outage disrupted businesses worldwide, affecting airlines, financial services, TV broadcasters, and more. Some have described this outage as potentially the “largest IT outage in history.” It was linked to a recent update from cybersecurity giant CrowdStrike. According to CrowdStrike CEO George Kurtz, the disruption was due to a defect in a content update for Windows hosts (source). This incident underscores a critical lesson in software deployment: the importance of thorough testing and staged rollouts, often referred to as canarying.

What is Canarying?

Canarying a release involves deploying updates to a small subset of users or systems before a full-scale rollout. Named after the canaries once used in coal mines to detect toxic gases, canarying in software development serves as an early warning system. By releasing software to a limited audience first, potential issues can be identified in a controlled environment, allowing for fixes before broader deployment.
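One common way to pick the "small subset" is to hash each host into a stable bucket and widen the percentage stage by stage. The sketch below assumes that approach; the stage percentages, host names, and release label are all hypothetical, and it says nothing about how any specific vendor actually rolls out updates.

```python
import hashlib

# Hypothetical rollout stages, as percentages of the fleet.
STAGES = [1, 10, 50, 100]


def rollout_bucket(host_id: str, release: str) -> int:
    """Map a host to a stable bucket in the range 0..99 for this release."""
    digest = hashlib.sha256(f"{release}:{host_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_stage(host_id: str, release: str, percent: int) -> bool:
    """True if the host should receive the release at this stage."""
    return rollout_bucket(host_id, release) < percent


hosts = [f"host-{n}" for n in range(1000)]
for percent in STAGES:
    selected = sum(in_stage(h, "content-update-1", percent) for h in hosts)
    print(f"{percent:>3}% stage -> {selected} of {len(hosts)} hosts")
```

Because the bucket is derived from the host and the release, every host that was in the 1% stage is also in the 10% stage, so each stage strictly widens the previous one and a bad release can be halted before it reaches the rest of the fleet.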

Did Canarying Happen?

The recent outage suggests that these canarying stages were either skipped or inadequately performed. In complex IT environments, the surface area for potential configurations of general-purpose computing endpoints is vast, making it impossible to test all combinations of usage patterns, hardware, software, and configurations. This is where canarying proves invaluable. Issues that are difficult to detect in a general release can be caught by exposing the update to a limited audience first. The failure to do so can lead to significant disruptions, as seen in this case. This isn’t solely CrowdStrike’s problem; it affects customers too. Enterprises often use tools like Microsoft’s Windows Server Update Services (WSUS) to control what and when updates are deployed to their devices, mitigating risks associated with issues just like this.

Looking at the Scale

The broader implications of this incident echo concerns from the 2000s about the risks of a Windows monoculture (source). Back then, critics argued that Microsoft’s dominance created a single point of failure. If a vulnerability or defect affected Windows, it could potentially impact millions of systems worldwide. This monoculture made it easier for malware to propagate, as seen with worms like Code Red (source) and Nimda (source), which exploited Windows systems en masse.

Historical incidents like the WannaCry ransomware attack (source) and the recent CDK Global outage impacting the automotive industry illustrate the widespread impact software vulnerabilities can have. The CDK outage, caused by two cyberattacks, forced the company to shut down its systems for approximately two weeks, resulting in an estimated $1.02 billion loss for dealerships (source). During that time, the impacted dealerships had the option to revert back to pen-and-paper and third-party software workarounds. However, this will likely not be possible in the CrowdStrike outage. As a result, it will likely have a much larger financial impact, especially given the broader scope and scale of the affected industries. Moreover, this incident echoes the CSRB monoculture concerns highlighted in the STORM-0558 incident, where Google’s response to the CSRB report on Microsoft emphasized the risks of relying on a single vendor (source).

Conclusion

This outage reminds us of a key lesson: always thoroughly test software before full deployment. But there’s a bigger lesson here, one that echoes Ken Thompson’s famous words: “You can’t trust code you didn’t totally create yourself.” This outage highlights the risks of relying too heavily on third-party software without proper mitigating controls.

To build more reliable systems, we need to be vigilant. This means not just blindly trusting vendors but verifying their work. It’s also crucial for basic business continuity, not just for avoiding security defects (source) and regressions like we recently saw with regreSSHion (source). And let’s not forget the risks associated with software monocultures; we should diversify our software, to the extent we can, to reduce the risk of widespread failures and, where we can’t, implement appropriate mitigating controls.

In essence, while we may not write every line of code ourselves, we must manage the risks represented by each piece of software we rely on, or we are doomed to repeat outages like today’s.

Content is King in Phishing and the Role of Publicly Trusted CAs

Phishing attacks often begin with a seemingly simple email. These emails appear to be from trusted sources and include links to fake websites or messages that create a false sense of urgency, prompting users to act without considering the consequences. Once on the phishing site, everything looks familiar. Attackers meticulously copy the layout, design, and even the content from real websites. The only noticeable difference in a well-executed phishing site is where the data is submitted.

Although studies show that users typically do not scrutinize the address bar or understand URLs, some more technical users may check the domain name hosting the content as a last resort to verify its legitimacy. This approach, however, is problematic for several reasons:

  1. Domain Confusion: Not everyone knows which domain is appropriate for which content. For example, TD Bank, N.A. issues co-branded credit cards for various companies. When paying your credit card bill online, you might see a domain name owned by TD Bank rather than the brand on your credit card.
  2. Brand Ambiguity: The global uniqueness of business names or brands is not something you can rely on. For instance, when I hear “Stripe,” I think of the San Francisco-based payments company, but there could be other companies with the same name, such as one in the pavement striping industry. To put that into context, it costs about $100 and 24 hours of somebody else’s money to register a company with virtually whatever name you want.
  3. Unrelated Legal and Domain Names: Company brands and their legal names often differ, as do their domain names. This discrepancy makes domain names more useful to attackers than to users for detection.

Furthermore, compromised websites from small businesses, churches, and other less technical organizations are frequently used to host phishing content. The key takeaway here is that content, not the domain name, is what makes a phishing campaign successful. The familiar look and feel of the content on phishing sites play a crucial role in deceiving users, making them less likely to question the site’s authenticity.

The Role of Certificate Authorities

Certificate authorities (CAs) in the web ecosystem exist to solve the trust-on-first-use problem introduced by our desire to protect the content we submit over the web from prying eyes (for more information, see TOFU and the Web). At its core, this problem arises because if you perform an anonymous key exchange or simply download the public key of a website from an unauthenticated source like DNS (while DNSSEC does technically exist, it is poorly adopted outside of TLDs), an attacker could replace the published key with one they control. This allows them to perform a man-in-the-middle (MITM) attack.
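To see why trust-on-first-use is a problem on its own, consider this small sketch of a client that simply remembers the first key it sees for each host. The class name, host name, and key bytes are hypothetical.

```python
import hashlib


class TofuStore:
    """Remembers the first key fingerprint seen for each host."""

    def __init__(self):
        self._pins = {}

    def check(self, host: str, public_key: bytes) -> str:
        fingerprint = hashlib.sha256(public_key).hexdigest()
        pinned = self._pins.get(host)
        if pinned is None:
            # Nothing to compare against, so whatever key arrives is trusted.
            self._pins[host] = fingerprint
            return "first-use"
        return "match" if pinned == fingerprint else "mismatch"


store = TofuStore()
# An attacker who is present on the very first connection gets pinned.
first = store.check("bank.example", b"attacker-key")
# The genuine key now looks like the anomaly.
second = store.check("bank.example", b"genuine-key")
print(first, second)
```

Pinning only protects connections after the first one; it has nothing to say about whether the first key was the right one, which is the gap an authenticated third party is meant to close.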

Certificate authorities are the answer to this problem. Browsers maintain a set of requirements that CAs must meet if they are to be distributed as “trusted” within their browsers (for more information, see Why We Trust WebPKI Root Certificate Authorities). In this ecosystem, browsers delegate solving the trust-on-first-use problem to the CAs. Essentially, they require CAs to prove that the requester of a certificate has administrative control over the domain and issue a certificate that attests to this. This enables browsers, as agents of the user, to assure the user that no one else should be able to intercept the information sent to the website (for some value of “website” — the modern website is made up of many third-party components that can potentially see your content also, but that’s a topic for another post).

Where things start to get tricky is how standards are defined. Anyone who works in a standards body knows that getting to a consensus is hard and often leads to less-than-ideal explanations of things. This is made worse by the fact that not everyone who participates understands the whole problem domain. Unfortunately, this is as true in the CA/Browser Forum as it is in other standards bodies. In the context of phishing, this comes into play in the “Baseline Requirements for the Issuance and Management of Publicly-Trusted TLS Server Certificates”, which has the concept of a “High-Risk Certificate Request” that states:

High-Risk Certificate Request: A Request that the CA flags for additional scrutiny by reference to internal criteria and databases maintained by the CA, which may include names at higher risk for phishing or other fraudulent usage, names contained in previously rejected certificate requests or revoked Certificates, names listed on the Miller Smiles phishing list or the Google Safe Browsing list, or names that the CA identifies using its own risk-mitigation criteria.

(It is worth noting that Google Safe Browsing isn’t a list; it is a web service… sigh…)


And

The CA SHALL provide all personnel performing information verification duties with skills-training that covers basic Public Key Infrastructure knowledge, authentication and vetting policies and procedures (including the CA’s Certificate Policy and/or Certification Practice Statement), common threats to the information verification process (including phishing and other social engineering tactics), and these Requirements.

On its surface, this seems harmless enough. It doesn’t mandate that the CA do anything about phishing beyond training its validation staff on the topic, but it does allow them to do something about it if they can. After all, maybe they see phishing content and can stop the issuance before a certificate is issued, preventing the use of SSL on the phishing site altogether. The problem is this all does more harm than good.

Why CAs Don’t Make Good Content Police

Determining whether content is phishing or legitimate can be highly subjective and context-dependent. What might appear as a phishing attempt to one person could be a legitimate operation to another.

As of 2024, there are approximately 1.13 billion websites on the internet, of which around 201.9 million are actively maintained and regularly updated. Each of these websites often consists of many different subdomains, all of which receive certificates, the large majority via automation with no opportunity for manual content inspection.

Simply put, this is a scale of a problem that does not lend itself to manual review or subjective assessments of phishing. But it gets worse.

There are around 86 WebPKI trusted certificate authorities on the web today, many of which operate with just a handful of people. Beyond that, though they exist to serve the whole web, they are in many different legal jurisdictions around the world, each with its own cultural norms and expectations. However, the web is an international asset, and if we were to rely on CAs to be the content police on the web, we would have hugely inconsistent results, especially given the current rules.

It is also worth noting that if the decision-making power regarding content trustworthiness is applied at domain control verification time, users are disempowered and it begins to resemble censorship rather than protection. Users should have the ability to choose what kind of subjectivity they want when it comes to protecting themselves from malicious content.

So why don’t we just make the rules clearer on what constitutes appropriate content? Most would agree this is a hugely difficult problem, but even if we put that aside, the reality is that CAs visit websites only at issuance time, often before there is any content published, since SSL is a requirement for a website to even launch. Beyond that, websites often serve different content to different users. In my last role at Microsoft, I was responsible for security engineering for the advertising business. Malicious advertisers would often serve Microsoft IP addresses content that met our policies but serve other users content that violated our policies. So even if CAs did check for phishing, all the phishers would need to do is serve clean content to the CA or change the content after the fact.

Beyond that, there is the question of how a CA would respond to finding a website hosting phishing content. They could revoke it, but as I mentioned earlier, often the website hosting content has been compromised in some way, and revoking that certificate would impact that other website. For example, it’s not uncommon to see phishing content served through shared services like drive.google.com or S3 bucket URLs. Revoking that certificate would impact all of those users. That is, if revocation were actually effective, but it is not.

Revocation services like OCSP and CRL repositories are often so slow or unreliable that browsers were never able to deploy what we call hard fail revocation. This means that even when a certificate is revoked, the revocation message might never reach the browser for various reasons, so the CA may feel better that they have revoked the certificate, but in practice, it would at best deliver inconsistent results, making the web less reliable without actually addressing the real problems that enable phishing.
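The difference between soft fail and hard fail comes down to what a client does when it cannot get an answer. Here is a minimal sketch of that decision; the enum and function names are hypothetical and do not reflect any browser’s actual code.

```python
from enum import Enum


class Status(Enum):
    GOOD = "good"
    REVOKED = "revoked"
    UNAVAILABLE = "unavailable"  # timeout, network error, blocked responder


def accept_certificate(status: Status, hard_fail: bool) -> bool:
    """Decide whether to proceed given the revocation check result."""
    if status is Status.REVOKED:
        return False
    if status is Status.UNAVAILABLE:
        # Soft fail treats "no answer" as "fine"; hard fail blocks the page.
        return not hard_fail
    return True


for mode in (False, True):
    outcome = accept_certificate(Status.UNAVAILABLE, hard_fail=mode)
    print(f"hard_fail={mode}: accepted={outcome}")
```

With soft fail, anyone who can block the revocation lookup also blocks the revocation. With hard fail, every slow or unreachable responder becomes an outage for the user, which is why it was never widely deployed.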

For more insights on the challenges and limitations of CAs in fighting phishing and malware, check out Let’s Encrypt’s post on “The CA’s Role in Fighting Phishing and Malware”.

So What Can We Do to Help with the Problem of Phishing?

To effectively combat phishing, we need a robust reputation system that continuously monitors content over time from many different network perspectives. This includes residential IPs, commercial ones, and more. Here are just a few examples of things we could look at with such a service:

  1. Analyze Domain Registration: Look at the age and subject of the domain registration and the registrar of the domain since some are used more commonly than others.
  2. Examine Hosting Providers: Identify patterns in hosting providers serving the content, as certain providers may be more frequently associated with phishing activities.
  3. Inspect Website Scripts: Evaluate the JavaScript on the website to understand its functionality and detect potentially malicious behavior.
  4. Assess Content Similarity: Compare the content served by the website to other known websites to identify similarities that might indicate phishing.
  5. Utilize Machine Learning Models: Feed all of this data into well-trained machine learning models to generate probability scores indicating the likelihood of phishing.
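As a taste of just one of the signals listed above, content similarity, here is a simple sketch that compares two pages by the overlap of their three-word sequences (Jaccard similarity over word shingles). This is a stand-in technique I picked for illustration, with made-up sample text; real reputation services use far richer features and models.

```python
import re


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word sequences in the text, lowercased."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str) -> float:
    """Similarity from 0.0 (nothing shared) to 1.0 (identical shingle sets)."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


real_page = "Sign in to your account to view your statement"
lookalike = "Sign in to your account to view your balance"
unrelated = "Weekly church newsletter and event calendar"

print(jaccard(real_page, lookalike))
print(jaccard(real_page, unrelated))
```

A high score against a well-known login page, served from a domain with no relationship to that brand, is the kind of signal that would feed into a broader model alongside the other items in the list.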

This system should be integrated into user agents, such as browsers, so phishing checks can occur as part of the rendering process, providing real-time protection for users. Technologies like Google Safe Browsing and Microsoft SmartScreen already do some of this, and similar projects are often staffed with hundreds of engineers and other professionals. They deal with recourse for mislabeling, monitor false positives and false negatives, and tweak algorithms to improve solutions, all while designing privacy-preserving methods.

Are these services perfect? Absolutely not! Can they improve? The answer is unquestionably yes, but the key thing is that ad hoc censorship at issuance time by CAs is simply not the answer. It makes the web less reliable and at best offers a false sense of value in exchange for giving CAs a market message that they are fighting phishing when in reality they deliver no material value to the problem.

Global Consistency for AKD: Using Armored Witness to Prevent Split Views

By Al Cutter and Ryan Hurst

When it comes to end-to-end encryption, securing the directory of public keys for each recipient is crucial so relying parties can have confidence they are talking to the right party. An attacker who can manipulate these directories can impersonate any participant in the communication without detection.

Over the years, various efforts have worked to address the problem of undetectable modifications to these central key servers. For example:

  • CONIKS was the first academic attempt to create a key transparency framework that enabled users to verify the authenticity of keys in a privacy-preserving way without relying solely on the key server’s integrity.
  • Google Key Transparency, based on CONIKS, was Google’s effort to prove it was possible to deploy these patterns at an internet scale.
  • Apple Contact Key Verification was the first deployment of these patterns for a large-scale messaging system.

Arguably the most successful and well-understood implementation of these patterns, though, is Facebook’s Auditable Key Directory (AKD), which is intended to be used across its messaging offerings and is already deployed with WhatsApp. Like the other solutions mentioned above, it relies on constant monitoring and verification to ensure global consistency. This mitigates “Split View” attacks against its relying parties and ensures these end-to-end encrypted messaging solutions maintain the promised security properties.

A research paper by Sarah Meiklejohn et al., titled “Think Global, Act Local: Gossip and Client Audits in Verifiable Data Structures” (https://arxiv.org/abs/2011.04551), set out to define a verifiable scheme to perform this verification in a cost-efficient and provable way. The essential idea in the paper is that the problem can be solved by gossiping the state of the underlying verifiable data structure across a network of perspectives. The goal is to ensure that relying parties verify they are seeing the same version of the directory as everyone else who is participating.
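
The core check behind this idea can be sketched in a few lines. The example below is a simplification under stated assumptions: the Observation type and find_split_views function are hypothetical names, and the sketch only compares root hashes reported for the same tree size. A real deployment would also verify signatures and use consistency proofs to relate checkpoints of different sizes.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple


class Observation(NamedTuple):
    perspective: str   # who reported this view (hypothetical identifier)
    tree_size: int     # size of the verifiable data structure
    root_hash: bytes   # root hash seen at that size


def find_split_views(observations: Iterable[Observation]) -> List[int]:
    """Return the tree sizes for which perspectives saw conflicting roots."""
    roots_by_size = defaultdict(set)
    for obs in observations:
        roots_by_size[obs.tree_size].add(obs.root_hash)
    # More than one root for the same size means someone was shown a fork.
    return sorted(size for size, roots in roots_by_size.items() if len(roots) > 1)
```

If every perspective reports the same root for a given size, the list is empty. Any entry in the result is evidence that the operator presented different versions of the directory to different parties.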

Introducing Google’s Armored Witness

Google’s Armored Witness project aims to operationalize this concept by providing a geographically distributed network of secure hardware-based “notaries.” Its design is based on custom hardware, the USB Armory, which was designed for security, verifiability, and maximal openness within the constraints of commercially available components. Armored Witness modifies the USB Armory by removing unused components and adding Power over Ethernet (PoE).

The USB Armory platform uses a Go unikernel design, which has many benefits. For one, Go naturally supports “reproducible builds,” meaning that any attempt to rebuild from an application’s source will yield the same binary. This is important because it enables the hardware to report what firmware it is running and map that back to the source, allowing you to understand what is happening inside. 

The hardware itself also contains a trusted execution environment and a secure element that enables sensitive code to be isolated and the keys to be secured from many classes of logical and physical attacks. These keys are then used to do the “notarization” of their observation of these ledgers.

Check out this presentation from Andrea Barisani at CanSecWest 2024 to learn more about the hardware.

Hardware alone isn’t enough, though. We need a global network based on these devices so that relying parties can get assurance that an adequate number of network perspectives are seeing the same log contents.

If you want to learn more about the broader problem being solved, check out this video of Filippo Valsorda from Real World Crypto 2024. For more on the Witness network idea, Filippo also talked about the work at the CATS conference.

But How Might Armored Witness Apply to AKD?

Here’s how an integration of Google’s Armored Witness into Facebook’s AKD framework could work:

  1. Log Integration and Epoch Consistency: By integrating Armored Witness into the AKD framework, each Armored Witness device can act as a witness to a log of AKD epoch roots. Specifically:
    • Log Integration: Each AKD epoch root is added to the log, such that the Nth epoch root is present at the Nth log index.
    • Witness Network: The witness network, comprising Armored Witness devices, counter-signs checkpoints from these logs, providing an additional layer of verification.
  2. Client-Side Verification: As the AKD is updated, AKD clients are provided with epoch update bundles, which include:
    • The new directory epoch root commitment.
    • Inclusion proof hashes for the epoch root in the log.
    • The countersigned log checkpoint.
  3. Clients verify these bundles by:
    • Checking the signature on the directory epoch root.
    • Verifying the countersignatures on the log checkpoint.
    • Using the inclusion proof to confirm the correct log position of the epoch root, ensuring that the Nth root is in the Nth log position.
  4. Split View Protection: The distributed nature of Armored Witness devices ensures that:
    • Split views of the log are prevented due to the geographically distributed countersigning by the witness network.
    • The log sequence number matches the directory epoch number, maintaining a strict one-to-one correspondence.
  5. Heavy Lift Verifiers: Heavy lift verifiers, which perform more in-depth consistency checks, can source epoch roots from the log for pairwise append-only verification. This process ensures the correct operation of the directory over time, further strengthening the security framework.
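
The client-side checks described above can be sketched as follows. This is an illustration under several assumptions, not the AKD or Armored Witness implementation: it assumes the log uses RFC 6962-style Merkle hashing, that epoch N sits at zero-based log index N, and that signature checks on the epoch root and the checkpoint countersignatures have already been done elsewhere, with the set of witnesses whose countersignatures validated passed in as verified_witnesses. All function names are hypothetical. The merkle_root and inclusion_proof helpers play the role of the log and exist only so the example can be exercised.

```python
import hashlib


def leaf_hash(data: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def _split(n: int) -> int:
    """Largest power of two strictly less than n (n >= 2)."""
    return 1 << ((n - 1).bit_length() - 1)


def merkle_root(leaves):
    """Log side: root over a non-empty list of leaves."""
    if len(leaves) == 1:
        return leaf_hash(leaves[0])
    k = _split(len(leaves))
    return node_hash(merkle_root(leaves[:k]), merkle_root(leaves[k:]))


def inclusion_proof(index, leaves):
    """Log side: sibling hashes from the leaf at index up to the root."""
    if len(leaves) == 1:
        return []
    k = _split(len(leaves))
    if index < k:
        return inclusion_proof(index, leaves[:k]) + [merkle_root(leaves[k:])]
    return inclusion_proof(index - k, leaves[k:]) + [merkle_root(leaves[:k])]


def verify_inclusion(index, tree_size, leaf, proof, root) -> bool:
    """Client side: recompute the root from a leaf hash and its proof."""
    if index >= tree_size:
        return False
    fn, sn, r = index, tree_size - 1, leaf
    for p in proof:
        if sn == 0:
            return False
        if fn & 1 or fn == sn:
            r = node_hash(p, r)
            if not fn & 1:
                # Skip levels where this node has no right sibling.
                while not fn & 1 and fn != 0:
                    fn >>= 1
                    sn >>= 1
        else:
            r = node_hash(r, p)
        fn >>= 1
        sn >>= 1
    return sn == 0 and r == root


def verify_epoch_bundle(epoch, epoch_root, tree_size, log_root, proof,
                        verified_witnesses, quorum) -> bool:
    """Accept an epoch update only if enough witnesses countersigned the
    checkpoint and the epoch root sits at log index == epoch."""
    if len(verified_witnesses) < quorum:
        return False
    return verify_inclusion(epoch, tree_size, leaf_hash(epoch_root), proof, log_root)
```

Because the proof is checked against index == epoch, a root that is present in the log but at the wrong position is rejected, which is what enforces the one-to-one correspondence between log sequence numbers and directory epochs. The quorum size is a policy choice for the relying party and is left as a parameter here.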

Conclusion

Integrating Google’s Armored Witness with Facebook’s AKD framework offers a simple solution to a very hard problem. More importantly, adopting this approach not only addresses the split-view problem for Facebook but also provides a substrate that can protect all verifiable data structures.

By leveraging a generic network of geographically distributed Armored Witness devices, solutions dependent on verifiable data structures to prove integrity, such as AKD deployments, can achieve robust split-view protection, ensuring integrity and authenticity across different regions.