We often hear about customers not wanting to deploy agents, usually citing the rationale that while each one may only consume a bit of memory and CPU the sum of them slowly but surely grind systems to a halt. The real impact is management overhead, operational risk, and attack surface. Attack surface is also something that requires nuance to understand. For example, software that runs in a privileged context, such as in the kernel, parses data, communicates on a network, or is loaded at boot, represents a lot more risk than software that runs with least privilege in its own user session or in the runtime context of the user.
As they say, you can’t have your cake and eat it too. There are always trade-offs, and when it comes to software, those trade-offs usually involve choosing between performance, security, and delivering on a value proposition.
The full impact of the Crowdstrike outage won’t be understood for some time, but one thing is for sure: organizations that sell solutions reliant on agents will need to provide much more justification and explanation about how their software works and how they manage updates.
It also means that organizations, at least the mature ones, will be re-evaluating what they have on these endpoints, the value they provide, and the risks they represent from both security and operational perspectives. These organizations will also be revisiting the controls they use to manage the ingestion of the software they rely on, and how they manage that risk over time since software is a living entity and not something static.
In short, like most catastrophes, there is a silver lining. This will be a great opportunity to improve existing processes to help prevent entire classes of failures like this. Hopefully, it will include a more robust investment in holding vendors accountable and thoroughly checking their work.
Last night, on July 18, 2024, a significant IT outage disrupted businesses worldwide, affecting airlines, financial services, TV broadcasters, and more. Some have described this outage as potentially the “largest IT outage in history.” It was linked to a recent update from cybersecurity giant CrowdStrike. According to CrowdStrike CEO George Kurtz, the disruption was due to a defect in a content update for Windows hosts (source). This incident underscores a critical lesson in software deployment: the importance of thorough testing and staged rollouts, often referred to as canarying.
What is Canarying?
Canarying a release involves deploying updates to a small subset of users or systems before a full-scale rollout. Named after the canaries once used in coal mines to detect toxic gases, canarying in software development serves as an early warning system. By releasing software to a limited audience first, potential issues can be identified in a controlled environment, allowing for fixes before broader deployment.
Did Canarying Happen?
The recent outage suggests that these canarying stages were either skipped or inadequately performed. In complex IT environments, the surface area for potential configurations of general-purpose computing endpoints is vast, making it impossible to test all combinations of usage patterns, hardware, software, and configurations. This is where canarying proves invaluable. Issues that are difficult to detect in a general release can be caught by exposing the update to a limited audience first. The failure to do so can lead to significant disruptions, as seen in this case. This isn’t solely CrowdStrike’s problem; it affects customers too. Enterprises often use tools like Microsoft’s Windows Server Update Services (WSUS) to control what and when updates are deployed to their devices, mitigating risks associated with issues just like this.
Looking at the Scale
The broader implications of this incident echo concerns from the 2000s about the risks of a Windows monoculture (source). Back then, critics argued that Microsoft’s dominance created a single point of failure. If a vulnerability or defect affected Windows, it could potentially impact millions of systems worldwide. This monoculture made it easier for malware to propagate, as seen with worms like Code Red (source) and Nimda (source), which exploited Windows systems en masse.
Historical incidents like the WannaCry ransomware attack (source) and the recent CDK Global outage impacting the automotive industry illustrate the widespread impact software vulnerabilities can have. The CDK outage, caused by two cyberattacks, forced the company to shut down its systems for approximately two weeks, resulting in an estimated $1.02 billion loss for dealerships (source). During that time, the impacted dealerships had the option to revert back to pen-and-paper and third-party software workarounds. However, this will likely not be possible in the CrowdStrike outage, as a result, it will likely have a much larger financial impact, especially given the broader scope and scale of the affected industries. Moreover, this incident echoes the CSRB monoculture concerns highlighted in the STORM-0558 incident, where Google’s response to the CSRB report on Microsoft emphasized the risks of relying on a single vendor (source).
Conclusion
This outage reminds us of a key lesson: always thoroughly test software before full deployment. But there’s a bigger lesson here, one that echoes Ken Thompson’s famous words: “You can’t trust code you didn’t totally create yourself.” This outage highlights the risks of relying too heavily on third-party software without proper mitigating controls.
To build more reliable systems, we need to be vigilant. This means not just blindly trusting vendors but verifying their work. It’s also crucial for basic business continuity, not just for avoiding security defects (source) and regressions like we recently saw with regreSSHion (source). And let’s not forget the risks associated with software monocultures; we should diversify, to the extent we can, our software to reduce the risk of widespread failures and where we can’t implement appropriate mitigating controls.
In essence, while we may not write every line of code ourselves, we must manage the risks represented by each piece of software we rely on, or we are doomed to repeat outages like today.
Phishing attacks often begin with a seemingly simple email. These emails appear to be from trusted sources and include links to fake websites or messages that create a false sense of urgency, prompting users to act without considering the consequences. Once on the phishing site, everything looks familiar. Attackers meticulously copy the layout, design, and even the content from real websites. The only noticeable difference in a well-executed phishing site is where the data is submitted.
Although studies show that users typically do not scrutinize the address bar or understand URLs, some more technical users may check the domain name hosting the content as a last resort to verify its legitimacy. This approach, however, is problematic for several reasons:
Domain Confusion: Not everyone knows which domain is appropriate for which content. For example, TD Bank, N.A. issues co-branded credit cards for various companies. When paying your credit card bill online, you might see a domain name owned by TD Bank rather than the brand on your credit card.
Unrelated Legal and Domain Names: Company brands and their legal names often differ, as do their domain names. This discrepancy makes domain names more useful to attackers than to users for detection.
Furthermore, compromised websites from small businesses, churches, and other less technical organizations are frequently used to host phishing content. The key takeaway here is that content, not the domain name, is what makes a phishing campaign successful. The familiar look and feel of the content on phishing sites play a crucial role in deceiving users, making them less likely to question the site’s authenticity.
The Role of Certificate Authorities
Certificate authorities (CAs) in the web ecosystem exist to solve the trust-on-first-use problem introduced by our desire to protect the content we submit over the web from prying eyes (for more information, see TOFU and the Web). At its core, this problem arises because if you perform an anonymous key exchange or simply download the public key of a website from an unauthenticated source like DNS (while DNSSEC does technically exist, it is poorly adopted outside of TLDs), an attacker could replace the published key with one they control. This allows them to perform a man-in-the-middle (MITM) attack.
Certificate authorities are the answer to this problem. Browsers maintain a set of requirements that CAs must meet if they are to be distributed as “trusted” within their browsers (for more information, see Why We Trust WebPKI Root Certificate Authorities). In this ecosystem, browsers delegate solving the trust-on-first-use problem to the CAs. Essentially, they require CAs to prove that the requester of a certificate has administrative control over the domain and issue a certificate that attests to this. This enables browsers, as agents of the user, to assure the user that no one else should be able to intercept the information sent to the website (for some value of “website” — the modern website is made up of many third-party components that can potentially see your content also, but that’s a topic for another post).
Where things start to get tricky is how standards are defined. Anyone who works in a standards body knows that getting to a consensus is hard and often leads to less-than-ideal explanations of things. This is made worse by not everyone who participates understands the whole problem domain. Unfortunately, this is as true in the CA/Browser Forum as it is in other standards bodies. In the context of phishing, this comes into play in the “Baseline Requirements for the Issuance and Management of Publicly-Trusted TLS Server Certificates”, which has the concept of a “High-Risk Certificate Request” that states:
High-Risk Certificate Request: A Request that the CA flags for additional scrutiny by reference to internal criteria and databases maintained by the CA, which may include names at higher risk for phishing or other fraudulent usage, names contained in previously rejected certificate requests or revoked Certificates, names listed on the Miller Smiles phishing list or the Google Safe Browsing list, or names that the CA identifies using its own risk-mitigation criteria.
(It is worth noting Google Safe Browsing isn’t a list it is a web service…. sigh….)
And
The CA SHALL provide all personnel performing information verification duties with skills-training that covers basic Public Key Infrastructure knowledge, authentication and vetting policies and procedures (including the CA’s Certificate Policy and/or Certification Practice Statement), common threats to the information verification process (including phishing and other social engineering tactics), and these Requirements.
On its surface, this seems harmless enough. It doesn’t mandate that the CA do anything for phishing beyond training its validation staff about the contents but does allow them to do something about it if they can. After all, maybe they see phishing content and can stop the issuance altogether before a certificate is issued, preventing the use of SSL on the phishing site altogether. The problem is this all does more harm than good.
Why CAs Don’t Make Good Content Police
Determining whether content is phishing or legitimate can be highly subjective and context-dependent. What might appear as a phishing attempt to one person could be a legitimate operation to another.
As of 2024, there are approximately 1.13 billion websites on the internet, of which around 201.9 million are actively maintained and regularly updated. Each of these websites often consists of many different subdomains, all of which receive certificates, the large majority via automation with no opportunity for manual content inspection.
Simply put, this is a scale of a problem that does not lend itself to manual review or subjective assessments of phishing. But it gets worse.
There are around 86 WebPKI trusted certificate authorities on the web today, many of which operate with just a handful of people. Beyond that, though they exist to serve the whole web, they are in many different legal jurisdictions around the world, each with its own cultural norms and expectations. However, the web is an international asset, and if we were to rely on CAs to be the content police on the web, we would have hugely inconsistent results, especially given the current rules.
It is also worth noting if the decision-making power regarding content trustworthiness is applied at the domain name control verification, users are disempowered and it begins to resemble censorship rather than protection. Users should have the ability to choose what kind of subjectivity they want when it comes to protecting themselves from malicious content.
So why don’t we just make the rules clearer on what constitutes appropriate content? Most would agree this is a hugely difficult problem, but even if we put that aside, the reality is that CAs visit websites only at issuance time, often before there is any content published, since SSL is a requirement for a website to even launch. Beyond that, websites often serve different content to different users. In my last role at Microsoft, I was responsible for security engineering for the advertising business. Malicious advertisers would often serve Microsoft IP addresses content that met our policies but serve other users content that violated our policies. So even if CAs did check for phishing, all the phishers would need to do is serve clean content to the CA or change the content after the fact.
Beyond that, there is the question of how a CA would respond to finding a website hosting phishing content. They could revoke it, but as I mentioned earlier, often the website hosting content has been compromised in some way, and revoking that certificate would impact that other website. For example, it’s not uncommon to see phishing content served through shared services like drive.google.com or S3 bucket URLs. Revoking that certificate would impact all of those users, that is if revocation was actually effective, but it is not.
Revocation services like OCSP and CRL repositories are often so slow or unreliable that browsers were never able to deploy what we call hard fail revocation. This means that even when a certificate is revoked, the revocation message might never reach the browser for various reasons, so the CA may feel better that they have revoked the certificate, but in practice, it would at best deliver inconsistent results, making the web less reliable without actually addressing the real problems that enable phishing.
So What Can We Do to Help with the Problem of Phishing?
To effectively combat phishing, we need a robust reputation system that continuously monitors content over time from many different network perspectives. This includes residential IPs, commercial ones, and more. Here are just a few examples of things we could look at with such a service:
Analyze Domain Registration: Look at the age and subject of the domain registration and the registrar of the domain since some are used more commonly than others.
Examine Hosting Providers: Identify patterns in hosting providers serving the content, as certain providers may be more frequently associated with phishing activities.
Inspect Website Scripts: Evaluate the JavaScript on the website to understand its functionality and detect potentially malicious behavior.
Assess Content Similarity: Compare the content served by the website to other known websites to identify similarities that might indicate phishing.
Utilize Machine Learning Models: Feed all of this data into well-trained machine learning models to generate probability scores indicating the likelihood of phishing.
This system should be integrated into user agents, such as browsers, so phishing checks can occur as part of the rendering process, providing real-time protection for users. Technologies like Google Safe Browsing and Microsoft SmartScreen already do some of this, and similar projects are often staffed with hundreds of engineers and other professionals. They deal with recourse for mislabeling, monitor false positives and false negatives, and tweak algorithms to improve solutions, all while designing privacy-preserving methods.
Are these services perfect? Absolutely not! Can they improve? The answer is unquestionably yes, but the key thing is that ad hoc censorship at issuance time by CAs is simply not the answer. It makes the web less reliable and at best offers a false sense of value in exchange for giving CAs a market message that they are fighting phishing when in reality they deliver no material value to the problem.
Over the years, various efforts have worked towards addressing this challenge. These efforts aimed to solve the problem of undetectable modifications to these central key servers. For example:
CONIKS was the first academic attempt to create a key transparency framework that enabled users to verify the authenticity of keys in a privacy-preserving way without relying solely on the key server’s integrity.
Google Key Transparency, based on CONIKS, was Google’s effort to prove it was possible to deploy these patterns at an internet scale.
Apple Contact Discovery was the first deployment of these patterns for a large-scale messaging system.
Arguably the most successful and well-understood implementation of these patterns though is Facebook’s Auditable Key Directory (AKD), which is intended to be used across its messaging offerings but is already deployed with WhatsApp. Like the other solutions mentioned above, it is reliant on constant monitoring and verification to ensure global consistency, mitigating “Split View” attacks against its relying parties and ensuring these end-to-end encrypted messaging solutions maintain the promised security properties.
A research paper by Sarah Meiklejohn et al., titled “Think Global, Act Local: Gossip and Client Audits in Verifiable Data Structures” (https://arxiv.org/abs/2011.04551), set out to define a verifiable scheme to perform this verification in a cost-efficient and provable way. The essential idea in the paper is that the problem can be solved by gossiping the state of the underlying verifiable data structure across a network of perspectives. The goal is to ensure that relying parties verify they are seeing the same version of the directory as everyone else who is participating.
Introducing Google’s Armored Witness
Google’s Armored Witness project aims to operationalize this concept by providing a geographically distributed network of secure hardware-based “notaries.” Its design is based on custom hardware, the USB Armory, designed with security, verifiability, and maximal openness given commercially available components and associated constraints. It modifies the USB Armory by removing unused components and adding Power over Ethernet (PoE).
The USB Armory platform uses a Go unikernel design, which has many benefits. For one, Go naturally supports “reproducible builds,” meaning that any attempt to rebuild from an application’s source will yield the same binary. This is important because it enables the hardware to report what firmware it is running and map that back to the source, allowing you to understand what is happening inside.
The hardware itself also contains a trusted execution environment and a secure element that enables sensitive code to be isolated and the keys to be secured from many classes of logical and physical attacks. These keys are then used to do the “notarization” of their observation of these ledgers.
Hardware alone isn’t enough, though. We need a global network based on these devices so that relying parties can get assurance that an adequate number of network perspectives are seeing the same log contents.
Here’s how an integration of Google’s Armored Witness into Facebook’s AKD framework could work:
Log Integration and Epoch Consistency: By integrating Armored Witness into the AKD framework, each Armored Witness device can act as a witness to a log of AKD epoch roots. Specifically::
Log Integration: Each AKD epoch root is added to the log, such that the Nth epoch root is present at the Nth log index.
Witness Network: The witness network, comprising Armored Witness devices, counter-signs checkpoints from these logs, providing an additional layer of verification.
Client-Side Verification: As the AKD is updated, AKD clients are provided with epoch update bundles, which include:
Clients verify these bundles by:
Checking the signature on the directory epoch root.
Verifying the countersignatures on the log checkpoint.
Using the inclusion proof to confirm the correct log position of the epoch root, ensuring that the Nth root is in the Nth log position.
Split View Protection: The distributed nature of Armored Witness devices ensures that:
Split views of the log are prevented due to the geographically distributed countersigning by the witness network.
The log sequence number matches the directory epoch number, maintaining a strict one-to-one correspondence.
Heavy Lift Verifiers: Heavy lift verifiers, which perform more in-depth consistency checks, can source epoch roots from the log for pairwise append-only verification. This process ensures the correct operation of the directory over time, further strengthening the security framework.
The new directory epoch root commitment.
Inclusion proof hashes for the epoch root in the log.
The countersigned log checkpoint.
Conclusion
Integrating Google’s Armored Witness with Facebook’s AKD framework offers a simple solution to a very hard problem. More importantly, adopting this approach not only addresses the split-view problem for Facebook but also provides a substrate that can protect all verifiable data structures.
By leveraging a generic network of geographically distributed Armored Witness devices, solutions dependent on verifiable data structures to prove integrity, such as AKD deployments, can achieve robust split-view protection, ensuring integrity and authenticity across different regions.
My father grew up on a small farm in eastern Washington. They say you can take the boy out of the farm, but you can’t take the farm out of the boy. As a kid, I was always hearing farm life sayings from my grandfather and father. To this day, I think they are some of the best pieces of advice I have ever received. Here are a few:
Don’t wrestle with pigs. You both get filthy and the pig likes it: Avoid pointless arguments with people who thrive on conflict. It only wastes your time and energy.
Don’t count your chickens before they hatch: Don’t assume a deal is closed until the contract is signed.
Make hay while the sun shines: Capitalize on opportunities while they are available.
Don’t put all your eggs in one basket: Diversify your investments and strategies to mitigate risk.
The early bird catches the worm: Being proactive and early can lead to the best opportunities.
Every cloud has a silver lining: Look for opportunities in every challenge or setback.
Don’t cry over spilled milk: Learn from mistakes, but don’t dwell on them; move forward.
When the rooster crows, it’s time to get up: Start your day early to maximize productivity.
You can’t make an omelet without breaking a few eggs: Achieving success often requires taking some risks and making sacrifices.
You reap what you sow: Your results are a reflection of the effort and quality you put in.
Don’t look a gift horse in the mouth: Be grateful for opportunities and advantages you receive.
When in doubt, let your horse do the thinking: Trust the expertise of your team when you’re unsure.
Never approach a bull from the front, a horse from the rear, or a fool from any direction: Avoid unnecessary risks and confrontations in business.
Don’t close the barn door after the horse has bolted: Take preventive measures rather than waiting until you need to react.
A stitch in time saves nine: Address small issues before they become large problems.
The squeaky wheel gets the grease: The most noticeable or pressing issues are the ones that receive the most attention and resources first.
There’s a saying, “where there’s smoke, there’s fire.” This adage holds especially true in the context of WebPKI Certificate Authorities (CAs). Patterns of issues are one of the key tools that root programs use to understand what’s happening inside organizations. While audits are essential, they are often insufficient. Historical cases like Wirecard and Enron illustrate how audits can provide a partial and sometimes incorrect picture. Just as in most interactions in life, understanding who you are dealing with is crucial for successful navigation, especially when a power dynamic differential exists.
The Limitations of Audits
Currently, there are 86 organizations in the Microsoft root program. Most root programs have at most two people involved in monitoring and policing these 86 CAs. Technologies like Certificate Transparency make this possible, and open-source tools like Zlint and others use this data to find technically observable issues. However, these tools, combined with audits, only provide a small slice of the picture. Audits are backward-looking, not forward-looking. To understand where an organization is going, you need to understand how they operate and how focused they are on meeting their obligations.
This is where the nuanced side of root program management, the standards, and norms of the ecosystem, come into play. If we look at signals in isolation, they often appear trivial. However, when we examine them over a long enough period in the context of their neighboring signals, a more complete picture becomes apparent.
For example, consider a CA with minor compliance issues that seem trivial in isolation. A single misissued certificate might not seem alarming. But when you see a pattern of such incidents over time, combined with other issues like poor incident response or associations with controversial entities, the picture becomes clearer. These patterns reveal deeper issues within the organization, indicating potential systemic problems.
Root Program Challenges
Root programs face significant challenges in managing and monitoring numerous CAs. With limited personnel and resources, they rely heavily on technology and community vigilance. Certificate Transparency logs and tools like Zlint help identify and flag issues, but they are only part of the solution. Understanding the intentions and operational integrity of CAs requires a deeper dive into their practices and behaviors.
In the WebPKI ecosystem, context is everything. Root programs must consider the broader picture, evaluating CAs not just on isolated incidents but on their overall track record. This involves looking at how CAs handle their responsibilities, their commitment to security standards, and their transparency with the community. A CA that consistently falls short in these areas, even in seemingly minor ways, can pose a significant risk to the ecosystem.
Conclusion
Understanding the nuances of CA operations and focusing on their adherence to obligations is critical. By examining patterns over time and considering the broader context, root programs can better identify and address potential risks. The combination of audits, technological tools, and a keen understanding of organizational behavior forms a more comprehensive approach to maintaining trust in the WebPKI system.
It’s always important to remember that CAs need to be careful to keep this in mind. After all, it’s not just what you do, but what you think you do. Having your house in order is essential. By learning from past mistakes and focusing on continuous improvement, organizations can navigate public reporting obligations more effectively, ensuring they emerge stronger and more resilient.
Browser distrust events of WebPKI Certificate Authorities occur on average approximately every 1.23 years. These events highlight the critical role the WebPKI plays in maintaining secure communications on the internet and how failures within this system can have far-reaching implications. By examining these incidents, we can identify common patterns and underlying causes that lead to distrust, so as implementors and operators, we don’t end up repeating the same mistakes.
Identifying Common Patterns
As they say, those who don’t know history are destined to repeat it, so it is worthwhile to take a look at the history of CA distrust events to understand what, if any common patterns exist:
Security Breaches: Involves unauthorized access to the CA’s infrastructure, leading to potential misuse of certificate issuance capabilities.
Compromise of CA Infrastructure: Refers to breaches where the core infrastructure of the CA is compromised, resulting in unauthorized certificate issuance.
Fraudulent Certificates: Occurs when certificates are issued without proper authorization, often leading to the impersonation of legitimate websites.
Poor Incident Response Handling: Indicates that the CA failed to adequately respond to security incidents, exacerbating the impact of the initial problem.
Misissuance of Certificates: Happens when CAs issue certificates incorrectly, either to the wrong entities or without proper validation, undermining trust.
Facilitating Man-In-The-Middle Attacks: Refers to situations where misissued or improperly handled certificates enable attackers to intercept and alter communications.
Improper Practices: Includes actions by CAs that deviate from accepted standards and best practices, leading to security risks.
Deceptive Actions: Involves deliberate misleading actions by CAs, such as backdating certificates or other forms of dishonesty.
Insecure Practices: Encompasses practices by CAs that fail to maintain adequate security controls, leading to vulnerabilities.
Non-Compliance with Industry Standards: Indicates that the CA has repeatedly failed to adhere to industry standards and guidelines, leading to a loss of trust.
Ties to Controversial Entities: Involves associations with entities that raise ethical or security concerns, leading to distrust.
Limited Value to Ecosystem: Indicates that the CA does not provide significant value to the security ecosystem, often due to questionable practices or minimal compliance.
Operational Vulnerabilities: Refers to weaknesses in the CA’s operational security, such as using default passwords or having exposed administrative tools, making them susceptible to attacks.
Browser Distrust Events
DigiNotar (2011):
Event: DigiNotar was hacked, leading to the issuance of fraudulent certificates. This prompted Mozilla, Google, and Microsoft to revoke trust in DigiNotar certificates.
Labels: Security Breaches, Compromise of CA Infrastructure, Fraudulent Certificates, Poor Incident Response Handling
Event: It was discovered that TurkTrust mistakenly issued two intermediate CA certificates, one of which was used to issue a fraudulent certificate for *.google.com. This led to the distrust of the TurkTrust CA by major browsers.
Labels: Misissuance of Certificates, Facilitating Man-In-The-Middle Attacks
Event: It was discovered that ANSSI had issued a certificate to a network appliance company, which used it to inspect encrypted traffic. This led Google to revoke trust in the intermediate certificate.
Labels: Misissuance of Certificates, Facilitating Man-In-The-Middle Attacks
Event: WoSign (and StartCom) were distrusted after discovering multiple security issues, including backdating certificates, lying, and improper issuance.
Labels: Misissuance of Certificates, Deceptive Actions, Insecure Practices
Event: Google announced a gradual distrust of Symantec certificates due to numerous instances of certificate misissuance, impacting millions of websites.
Labels: Misissuance of Certificates, Non-Compliance with Industry Standards, Poor Incident Response Handling
Event: TrustCor was distrusted due to concerns about its ties to companies linked to the US intelligence community and its failure to provide satisfactory responses to these concerns.
Labels: Ties to Controversial Entities, Limited Value to Ecosystem
Event: e-Tugra was distrusted due to security concerns. A researcher found numerous vulnerabilities in e-Tugra’s systems, including default passwords and accessible administrative tools, leading to a loss of trust.
Event: The Chrome Security Team announced the distrust of several Entrust roots due to a pattern of compliance failures and unmet improvement commitments.
Labels: Non-Compliance with Industry Standards, Poor Incident Response Handling
The frequency and patterns of browser distrust events underscore the critical importance of preventive measures, transparency, and effective incident response.
Implementing robust security practices, conducting regular audits, and maintaining compliance with industry standards can significantly reduce the risk of such incidents. Transparency in operations and public disclosure of security issues foster an environment of accountability and trust.
An ounce of prevention is indeed worth more than a pound of cure. By focusing on proactive measures and cultivating a culture of continuous improvement, Certificate Authorities can better navigate the complexities of WebPKI. Effective crisis communication and incident response plans are essential for managing the fallout from security breaches and maintaining the trust of users and the broader web community.
By learning from past incidents and addressing their root causes, we can work towards a more secure and resilient internet, where trust in the WebPKI system is consistently upheld. The collective effort of CAs, browser vendors, and security researchers will play a pivotal role in achieving this goal, ensuring the integrity and reliability of our online ecosystem.
Incident response is notoriously challenging, and with the rise in public reporting obligations, the stakes have never been higher. In the WebPKI world, mishandling incidents can severely damage a company’s reputation and revenue, and sometimes even end a business. The Cyber Incident Reporting for Critical Infrastructure Act of 2022 has intensified this pressure, requiring some companies to report significant breaches to CISA within 72 hours. This isn’t just about meeting deadlines. The stakes are high, and the pressure is on. Look at the recent actions of the Cyber Safety Review Board (CSRB), which investigates major cyber incidents much like how plane crashes are scrutinized. The recent case of Entrust’s cascade of incidents in the WebPKI ecosystem, and the scrutiny they have gone under as a result, shows how critical it is to respond professionally, humbly, swiftly, and transparently. The takeaway? If you don’t respond adequately to an incident, someone else might do it for you, and even if not, mishandling can result in things spiraling out of control.
The Complexity of Public Reporting
Public reports attract attention from all sides—customers, investors, regulators, the media, and more. This means your incident response team must be thorough and meticulous, leaving no stone unturned. Balancing transparency with protecting your organization’s image is critical. A well-managed incident can build trust, while a poorly handled one can cause long-term damage.
Public disclosures also potentially come with legal ramifications. Everything must be vetted to ensure compliance and mitigate potential liabilities. With tight timelines like the CISA 72-hour reporting requirement, there’s little room for error. Gathering and verifying information quickly is challenging, especially when the situation is still unfolding. Moreover, public reporting requires seamless coordination between IT, legal, PR, and executive teams. Miscommunication can lead to inconsistencies and errors in the public narrative.
The Role of Blameless Post Mortems
Blameless post-mortems are invaluable. When there’s no fear of blame, team members are more likely to share all relevant details, leading to a clearer understanding of the incident. These post-mortems focus on systemic issues rather than pointing fingers, which helps prevent similar problems in the future. By fostering a learning culture, teams can improve continuously without worrying about punitive actions.
It’s essential to identify the root causes of incidents and ensure they are fixed durably across the entire system. When the same issues happen repeatedly, it indicates that the true root causes were not addressed. Implementing automation and tooling wherever possible is crucial so that you always have the information needed to respond quickly. Incidents that close quickly have minimal impact, whereas those that linger can severely damage a business.
Knowing they won’t be blamed, team members can contribute more calmly and effectively, improving the quality of the response. This approach also encourages thorough documentation, creating valuable resources for future incidents.
Evolving Public Reporting Obligations
New regulations demand greater transparency and accountability, pushing organizations to improve their security practices. With detailed and timely information, organizations can better assess and manage their risks. The added legal and regulatory pressure leads to faster and more comprehensive responses, reducing the time vulnerabilities are left unaddressed. However, these strict timelines and detailed disclosures increase stress on incident response teams, necessitating better support and processes. Additionally, when there are systemic failures in an organization, one incident can lead to others, overwhelming stakeholders and making it challenging to prioritize critical issues.
Importance of a Strong Communication Strategy
Maintaining trust and credibility through transparent and timely communication is essential. Clear messaging prevents misinformation and reduces panic, ensuring stakeholders understand the situation and response efforts. Effective communication can mitigate negative perceptions and protect your brand, even in the face of serious incidents. Proper communication also helps ensure compliance with legal and regulatory requirements, avoiding fines and legal issues. Keeping stakeholders informed supports overall recovery efforts by maintaining engagement and trust.
Implementing Effective Communication Strategies
Preparation is key. Develop a crisis communication plan that outlines roles, responsibilities, and procedures. Scenario planning helps anticipate and prepare for different types of incidents. Speed and accuracy are critical. Provide regular updates as the situation evolves to keep stakeholders informed.
Consistency in messaging is vital. Ensure all communications are aligned across all channels and avoid jargon. Transparency and honesty are crucial—acknowledge the incident and its impact, and explain the steps being taken to address it. Showing empathy for those affected and offering support and resources demonstrates that your organization cares. Keep employees informed about the incident and the organization’s response through regular internal briefings to ensure all teams are aligned and prepared to handle inquiries.
Handling Open Public Dialogues
Involving skilled communicators who understand both the technical and broader implications of incidents is crucial. Coordination between legal and PR teams ensures that messaging is clear and accurate. Implement robust systems to track all public obligations, deadlines, and commitments, with regular audits to ensure compliance and documentation. Prepare for potential delays or issues with contingency plans and pre-drafted communications, and proactively communicate if commitments cannot be met on time.
Communication with Major Customers: It often becomes necessary to keep major customers in the loop, providing them with timely updates and reassurances about the steps being taken. Build plans for how to proactively do this successfully.
Clear Objectives and Measurable Criteria: Define clear and measurable criteria for what good public responses look like and manage to this. This helps ensure that all communications are effective and meet the required standards.
External Expert Review: Retain external experts to review your incidents with a critical eye whenever possible. This helps catch misframing and gaps before you step into a tar pit.
Clarity for External Parties: Remember that external parties won’t understand your organizational structure and team dynamics. It’s your responsibility to provide them with the information needed to interpret the report the way you intended.
Sign-Off Process: Have a sign-off process for stakeholders, including technical, business, and legal teams, to ensure the report provides the right level of information needed by its readers.
Track Commitments and Public Obligations: Track all your commitments and public obligations and respond by any committed dates. If you can’t meet a deadline, let the public know ahead of time.
In the end, humility, transparency, and accountability are what make a successful public report.
Case Study: WoSign’s Non-Recoverable Loss of Trust
Incident: WoSign was caught lying about several aspects of their certificate issuance practices, leading to a total non-recoverable loss of trust from major browsers and ultimately their removal from trusted root stores.
Outcome: The incident led to a complete loss of trust from major browsers.
Impact: This example underscores the importance of transparency and honesty in public reporting, as once trust is lost, it may never be regained.
Case Study: Symantec and the Erosion of Trust
Incident: Symantec, one of the largest Certificate Authorities (CAs), improperly issued numerous certificates, including test certificates for domains not owned by Symantec and certificates for Google domains without proper authorization. Their non-transparent, combative behavior, and unwillingness to identify the true root cause publicly led to their ultimate distrust.
Outcome: This resulted in a significant loss of trust in Symantec’s CA operations. Both Google Chrome and Mozilla Firefox announced plans to distrust Symantec certificates, forcing the company to transition its CA business to DigiCert.
Impact: The incident severely damaged Symantec’s reputation in the WebPKI community and resulted in operational and financial setbacks, leading to the sale of their CA business.
Conclusion
Navigating public reporting obligations in WebPKI and other sectors is undeniably complex and challenging. However, by prioritizing clear, honest communication and involving the right professionals, organizations can effectively manage these complexities. Rigorous tracking of obligations, proactive and transparent communication, and a robust incident response plan are critical. Case studies like those of WoSign and Symantec underscore the importance of transparency and honesty—once trust is lost, it may never be regained.
To maintain trust and protect your brand, develop a crisis communication plan that prioritizes speed, accuracy, and empathy. Consistent, transparent messaging across all channels is vital, and preparing for potential incidents with scenario planning can make all the difference. Remember, how you handle an incident can build or break trust. By learning from past mistakes and focusing on continuous improvement, organizations can navigate public reporting obligations more effectively, ensuring they emerge stronger and more resilient.
I’ve always likened the WebPKI governance system to our legal system, where congress sets the laws and the judiciary ensures compliance. Justice Breyer’s recent explanation on “rules” and “standards” in law, as discussed on the Advisory Opinions podcast, resonates well with how WebPKI operates in practice. In WebPKI, “rules” are explicitly defined through audits derived from CA/Browser Forum standards, incorporated into programs like WebTrust for CAs, and enforced through contractual obligations. These rules ensure aspire to consistent security and reliability across the web.
In contrast, “standards” in WebPKI encompass community norms, best practices, and recommendations specific to each root program. These standards are adaptable, evolving with technological advancements, security threats, and collective learning among CAs. They provide a framework that upholds the integrity of the Internet, ensuring that CAs remain transparent and live up to their promises while adhering to ecosystem norms, requirements, and best practices.
Similar to the Supreme Court, the WebPKI governance system consists of multiple ‘justices,’ with each root program acting akin to a Supreme Court justice. Their decisions on adherence or abstention from standards shape the outcomes that dictate the security and reliability of the Internet. Thus, the trust users place in WebPKI and its stewards is earned through a consistent, transparent, and accountable framework that ensures integrity across devices and browsers.
The Dual Role of Root Programs
1. As Trusted Stewards
While there’s no explicit voting process for root program management, users effectively select them through their choice of browsers or operating systems that incorporate these programs. This implicit trust in their ability to deliver on their security promises to users grants root programs the authority to establish and enforce rigorous standards for Root CAs. These standards determine inclusion in trust stores based on compliance assessments and judgments on value a CA would bring the web’s users, ensuring Root CAs uphold a consistent and transparent standard of integrity that users and web operators can rely on.
2. As Judicial Authorities
Root programs also serve a critical judicial function within the WebPKI landscape, akin to a Supreme Court. They interpret rules and standards, resolve ambiguities, settle community disputes, and establish precedents that guide CA operational practices. This role ensures equitable and consistent application of standards across all Root CAs.
Enforcing Compliance and Transparency
1. Maintaining Checks and Balances
Root programs enforce checks and balances through rigorous audits and monitoring, similar to judicial reviews. These processes assess Root CAs’ adherence to these “rules” and “standards” and ensure accountability,with the goal of preventing misuse of their authority on the web.
2. Promoting Transparency and Accountability
Root programs need to operate with a high degree of transparency, akin to open judicial proceedings. Decisions on trust or distrust of Root CAs need to be communicated clearly, accompanied by reasons for these decisions. This transparency ensures that all stakeholders, from end-users to website operators, understand and trust the framework protecting their privacy.
Case Study: The Ongoing Entrust Incident
A current discussion in the mozilla.dev.security.policy forum provides a compelling example of how the WebPKI governance framework operates in practice. This incident underscores the nuanced interaction between rules and standards, as well as the critical importance of transparency and accountability in maintaining trust.
The issue at hand involves Entrust’s performance concerns related to certificate misissuance. Such incidents are pivotal in demonstrating how root programs must navigate complex challenges and uphold rigorous standards of integrity within the web. The ongoing dialogue in the forum highlights the deliberative process undertaken by root programs to address such issues, ensuring that decisions are transparently communicated to stakeholders.
Cases like this illustrate the intricate balance that root programs must maintain between enforcing strict rules for security and adapting standards to accommodate technological advancements. The resolution of these incidents ultimately defines why users can trust root CAs, as it showcases the procedural approach, the transparency the process was designed for, and the goal of achieving accountability through this governance framework.
Why This All Matters
Understanding the dual role of root programs as regulatory bodies and judicial authorities underscores their essential role in maintaining trust. It emphasizes the significance of their decisions in shaping how privacy is delivered on the web, focusing on delivering a robust, evolving transparent, and accountable governance framework to guide these decisions.
The trust placed in WebPKI and its stewards are earned through a system that respects user choice and adheres to principles of fairness, ensuring that end-users can trust they are communicating with the correct website.
Imagine a world where every conversation, every movement, and every interaction is tracked in real-time by unseen eyes. This isn’t the plot of a dystopian novel—it’s a very real possibility enabled by today’s rapid technological advancements. As we develop and deploy powerful new tools, and maintain the ones that we built the internet on, the line between technology that makes our lives easier and invasive surveillance becomes increasingly blurred.
“In the future everyone will want to be anonymous for 15 minutes”, Banksy, 2011
This growing concern isn’t unfounded. In recent years, the push for censorship and surveillance has become more pronounced, driven by various geopolitical, social, and technological factors. While these measures often are framed around national sovereignty, combating terrorism, and protecting children, the unintended consequences could be far-reaching. Whether this occurs accidentally or purposely, the outcome can be equally invasive and harmful.
We stand at a point in history where the rapid development and deployment of new technologies, such as Microsoft’s Recall and Apple’s Private Cloud Compute, might inadvertently build the infrastructure that enables governments’ ultimate surveillance goals.
The advent of generative AI, which can extract insights at scale from video, audio, and text, opens the door to mass surveillance more than any other technology in history. These AI advancements enable detailed monitoring and analysis of vast amounts of data, making it easier for governments to track individuals and gain unprecedented insights into personal lives. To be clear, these technologies have amazing potential and are already making many people’s lives easier and making businesses more productive. In some cases, they also represent large and positive trends in privacy protection and should also give us faith in our ability to enable these innovations in a privacy-respecting way.
However, their use raises crucial questions about how we, as system designers, should evolve our philosophies on security and privacy.
A Timeline of Surveillance and Encryption Battles
Let’s take a step back and examine the historical context of government attempts to monitor citizens and the corresponding technological responses.
1990s
Introduction of technologies to enable government access, such as the Clipper Chip.
Backlash against government surveillance efforts and exposure of security flaws in Clipper Chip.
Encryption classified as a munition, leading to strict export control, an effort that held back security advancement for years..
2000s
Relaxation of U.S. export controls on encryption products in a tacit acknowledgment of its failure.
Expansion of law enforcement access to personal data through legislation like the USA PATRIOT Act.
2010s
Major leaks reveal extensive government surveillance programs.
Public debates over government backdoors in encryption.
Expansion of government surveillance capabilities through new legislation.
Proposals for key escrow systems and their criticisms from security experts.
2020s
Controversies over commercial companies their data collection and sale practices, and the use of the data by the government against citizens.
Legislative proposals and debates around encryption and privacy, such as the EARN IT Act.
EU pushes to limit CAs ability to remove EU CAs from their root programs and the mandated trust of EU CAs.
Tech companies’ attempts to implement on-device monitoring of your data and the ensuing backlash.
Suppression of legitimate discourse about COVID-19, leading to long-term impacts on public trust.
Ongoing legislative efforts to expand surveillance capabilities in the EU.
New Technologies and Surveillance Risks
Fast forward to today, and we are witnessing the rise of new technologies designed to make our data more useful. However, in our rush to build out these new capabilities, we might be inadvertently paving the way for government surveillance on an unprecedented scale.
To give this some color, let’s look back at another technological innovation: payphones. Payphones became widespread in the 1930s, and by the 1950s, the FBI under J. Edgar Hoover’s COINTELPRO program was using phone tapping to target civil rights organizations, feminist groups, and others advocating for political change. Regardless of your political leanings, we can all hopefully agree that this was an infringement on civil liberties.
Pay Phone in the Haight, San Francisco, 1967
Similarly, the advent of generative AI, which can extract insights at scale from video, audio, and text, opens the door to mass surveillance more than any other technology in history. These AI advancements enable detailed monitoring and analysis of vast amounts of data, making it easier for governments to track individuals and gain unprecedented insights into personal lives. While these technologies are useful they present a double-edged sword: the potential for more usable technology and improved productivity on one hand, and the risk of pervasive surveillance on the other.
Fast forward to the early 2000s, the USA PATRIOT Act expanded the surveillance powers of U.S. law enforcement agencies, laying the groundwork for extensive data collection practices. These powers eventually led to the Snowden revelations in 2013, which exposed the vast scope of the NSA’s surveillance programs, including the bulk collection of phone metadata and internet communications.
“How would you like this wrapped?” John Jonik, early 2000s
Clay Bennett, Christian Science Monitor, 2006
As George Santayana famously said, ‘Those who cannot remember the past are condemned to repeat it.’ Technological advancements intended to be used for good can enable detailed, real-time surveillance on a national or even global scale. This dual-use nature underscores the critical need for robust, transparent frameworks to prevent misuse and ensure these technologies align with our fundamental rights and freedoms.
Striking a Balance: Innovation and Vigilance
We do not want, nor can we afford, to be Luddites; this is a world economy, and time and technology move on. Nor do we want to let fear convince us to give up essential liberty to purchase a little temporary safety. Instead, we must strive to strike a balance between innovation and vigilance. This balance involves advocating for robust, transparent, verifiable, and accountable frameworks that ensure the use of technology aligns with our fundamental rights and freedoms.
For example, historically, many security practitioners have pushed back against the use of client-side encryption claiming it is a distraction from real security work because the software or service provider can bypass these protections. However, as we see now, applying these technologies serves as a useful tool for protecting against accidental leakage and for defense in depth more broadly. This change exemplifies the need to continually reassess and adapt our security strategies so they match the actual threat model and not some ideal.
An essential aspect of this balance is the governance of the foundational elements that ensure our privacy on the web. Root programs that underpin the WebPKI play a critical role in maintaining trust and security online. The WebPKI provides the infrastructure for secure Internet communication, ensuring that users can trust the websites they visit and the transactions they perform. Efforts to weaken this system, such as the eIDAS regulation in the EU, pose significant risks. Certificate Transparency (CT) logs, which provide a transparent and publicly auditable record of all certificates issued, are crucial for detecting and preventing fraudulent certificates, thus ensuring the infrastructure remains robust and reliable. Trusted root certificate authorities (CAs) must demonstrate principled, competent, professional, secure, and transparent practices, with continual review to maintain the WebPKI’s integrity. By protecting and strengthening these foundational elements, we ensure that the infrastructure supporting our digital world remains secure and trustworthy. This vigilance is crucial as we continue to innovate and develop new technologies that rely on these systems.
By protecting and strengthening these foundational elements, we ensure that the infrastructure supporting our digital world remains secure and trustworthy. This vigilance is crucial as we continue to innovate and develop new technologies that rely on these systems.
Here are several guiding principles to consider:
Transparency: Companies developing these technologies must be transparent about their data practices and the potential risks involved. Clear communication about how data is collected, stored, and used can help build trust.
Accountability: Ensuring accountability involves creating systems and processes that allow for oversight and verification of how data is handled. This can include audits, reports, and other mechanisms that demonstrate a company’s commitment to ethical data management.
Privacy by Design: Incorporating privacy features from the ground up is essential. Technologies should embed privacy controls, transparency encryption, as Private Cloud Compute does, as fundamental components, not as afterthoughts.
Legal and Ethical Standards: There must be robust legal frameworks and ethical standards governing the use of surveillance technologies. These frameworks should ensure that surveillance is conducted legally, with proper oversight, and that it respects individuals’ rights and is discoverable when it has been done so it is harder to abuse.
Public Dialogue and Advocacy: Engaging in public dialogue about the implications of these technologies is crucial. Advocacy groups, policymakers, and the tech industry need to work together to educate the public and advocate for policies that protect privacy and civil liberties.
Technological Safeguards: Implementing technological safeguards to prevent misuse is critical. This includes encryption, verifiable designs, and transparent audit-ability so reliance on blind faith isn’t a necessary precondition to ensure that surveillance capabilities are not abused.
The Role of the Tech Community
The tech community plays a pivotal role in shaping the future of these technologies. By fostering a culture of ethical development and responsible innovation, developers, engineers, and tech companies can influence how new tools are designed and deployed. This involves prioritizing user privacy and security at every stage of development and ensuring that these principles are not compromised for convenience or profit.
Historical examples demonstrate the effectiveness of this approach. The Clipper Chip breakage in the 1990s is a notable case where researchers exposed significant security flaws, proving that government-mandated backdoors can undermine overall security. More recently, the advocacy for end-to-end encryption (E2EE) and the supporting technologies showcase the tech community’s positive influence. Companies like WhatsApp, Signal and Apple have implemented E2EE, setting new standards for privacy and security in communications.
Continued innovation in this regard can be seen in developments like key transparency and contact discovery. Key transparency mechanisms allow users to verify that the encryption keys they are using have not been tampered with, minimizing the amount of trust required in service providers and showing leadership for the industry as a whole.
Developers should also advocate for and adopt open standards that enhance verifiability, auditability, security, and privacy. These standards can serve as benchmarks for best practices, helping to establish a baseline of trust and reliability in new technologies. Collaboration within the tech community can lead to the creation of robust, interoperable systems that are harder for malicious actors to exploit.
Additionally, academia plays a crucial role in this endeavor by researching and developing provably secure models for various technologies. Academic work often focuses on creating rigorous, mathematically sound methods for ensuring security and privacy. By integrating these academic findings into practical applications, we can build more secure and trustworthy systems. This partnership between industry and academia fosters innovation and helps establish a foundation for secure technology development.
Moreover, the tech community must actively engage in policy discussions. By providing expert insights and technical perspectives, they can help shape legislation that balances innovation with the protection of civil liberties. This engagement can also help policymakers understand the technical complexities and potential risks associated with surveillance technologies, leading to more informed and balanced decision-making.
And finally, important we help regulators understand that technology alone cannot solve deeply rooted social problems and over-reliance on technology will lead to unintended consequences. A balanced approach that integrates technology with thoughtful policy, ethical considerations, and social engagement is essential if we want to see effective changes.
Education and Awareness
Educating the public about the implications of these technologies is crucial. Many people may not fully understand how their data is collected, used, and potentially monitored. By raising awareness about these issues, we can empower individuals to make informed decisions about their digital privacy.
Educational initiatives can take many forms, from public awareness campaigns to integrating privacy and security topics into school curriculums. At the same time, we must be careful to recognize the concerns and motivations of those advocating for these cases. This is not a situation where preaching to the choir or pontificating about abstract possibilities will be effective. We must anchor this education in recent history and concrete examples to avoid sounding like zealots.
Conclusion
The increasing push for censorship and the risk of enabling government surveillance underscore the delicate balance between technological advancement and civil liberties. Emerging technologies bring both promise and peril, potentially enabling unprecedented surveillance.
To address these risks, we must advocate for transparency, accountability, and privacy by design. Establishing robust legal and ethical standards, engaging in public dialogue, and raising awareness about digital privacy are essential steps. The tech community must prioritize ethical development, adopt open standards, and participate in policy discussions to ensure that innovation does not come at the cost of individual rights.
In conclusion, while new technologies offer significant benefits, they also pose risks that demand careful management. By balancing innovation with vigilance, we can create a future where technology enhances usability without compromising our fundamental freedoms.