Attestation has become one of the most important yet misunderstood concepts in modern security. It now shows up in hardware tokens, mobile devices, cloud HSMs, TPMs, confidential computing platforms, and operating systems. Regulations and trust frameworks are beginning to depend on it. At the same time people talk about attestation as if it has a single, universally understood meaning. It does not.
Attestation is not a guarantee. It is a signed assertion that provides evidence about something. What that evidence means depends entirely on the system that produced it, the protection boundary of the key that signed it, the verifier’s understanding of what the attestation asserts, and the verifier’s confidence in the guarantees provided by the attestation mechanism itself.
To understand where security is heading, you need to understand what attestation can prove, what it cannot prove, and why it is becoming essential in a world where the machines running our code are no longer under our control.
Claims, Attestations, and the Strength of Belief
A claim is something a system says about itself. There is no protection behind it and no expectation of truth. A user agent string is a perfect example. It might say it is an iPhone, an Android device, or Windows. Anyone can forge it. It is just metadata. At best it lets you guess what security properties the device might have, but a guess is not evidence.
Here is a typical user agent string:
Mozilla/5.0 (iPhone; CPU iPhone OS 15_2 like Mac OS X)
AppleWebKit/605.1.15
Mobile/15E148
Safari/605.1.15
If you break it apart it claims to be an iPhone, running iOS, using Safari, and supporting specific web engines. None of this is verified. It is only a claim.
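The point that a claim carries no proof can be shown in a few lines. This is a minimal sketch (the function name and header set are hypothetical): any client can emit headers asserting it is an iPhone, and the server has nothing to check the assertion against.

```python
# A user agent is self-asserted metadata: any client can claim to be
# any device. Nothing in the protocol verifies the claim.
IPHONE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 15_2 like Mac OS X) "
    "AppleWebKit/605.1.15 Mobile/15E148 Safari/605.1.15"
)

def forged_headers(user_agent: str = IPHONE_UA) -> dict:
    """Return HTTP headers that 'claim' to come from an iPhone.

    A server receiving these headers cannot distinguish this forged
    claim from a genuine device; it is a claim, not evidence."""
    return {"User-Agent": user_agent, "Accept": "text/html"}

headers = forged_headers()
```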
Attestation is different. Attestation is a signed statement produced by a system with a defined protection boundary. That boundary might be hardware, a secure element, a trusted execution environment, a Secure Enclave, a hypervisor-isolated domain, or even an operating system component rooted in hardware measurements but not itself an isolated security boundary. Attestation does not make a statement true, but it provides a basis to believe it because the signing key is protected in a way the verifier can reason about.
Attestation is evidence. The strength of that evidence depends on the strength of the protection boundary and on the verifier’s understanding of what the attestation actually asserts.
Why Attestation Became Necessary
When I worked at Microsoft we used to repeat a simple rule about computer security. If an attacker has access to your computer it is no longer your computer. That rule made sense when software ran on machines we owned and controlled. You knew who had access. You knew who set the policies. You could walk over and inspect the hardware yourself.
That world disappeared.
A classic illustration of this problem is the evil maid attack on laptops. If a device is left unattended, an attacker with physical access can modify the boot process, install malicious firmware, or capture secrets without leaving obvious traces. Once that happens, the laptop may look like your computer, but it is no longer your computer.
This loss of control is not limited to physical attacks. It foreshadowed what came next in computing. First workloads moved into shared data centers. Virtualization blurred the idea of a single physical machine. Cloud computing erased it entirely. Today your software runs on globally distributed infrastructure owned by vendors you do not know, in data centers you will never see, under policies you cannot dictate.
The old trust model depended on physical and administrative control. Those assumptions no longer hold. The modern corollary is clear. If your code is running on someone else’s computer you need evidence that it is behaving the way you expect.
Vendor promises are claims. Documentation is a claim. Marketing is a claim. None of these are evidence. To make correct security decisions in this environment you need verifiable information produced by the platform itself. That is the role attestation plays. The standards community recognized this need and began defining shared models for describing and evaluating attestation evidence, most notably through the IETF RATS architecture.
The IETF RATS View of Attestation
The IETF formalized the attestation landscape through the RATS architecture. It defines three roles. The attester produces signed evidence about itself or about the keys it generates. The verifier checks the evidence and interprets its meaning. The relying party makes a decision based on the verifier’s result.
This separation matters because it reinforces that attestation is not the decision itself. It is the input to the decision, and different attesters produce different types of evidence.
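The separation of roles can be sketched as three small functions. This is a toy model only — real attestation uses asymmetric keys held inside a protection boundary, and the key, claims, and policy here are hypothetical — but it shows how evidence, appraisal, and decision remain distinct steps.

```python
import hmac, hashlib, json

# Toy model of the IETF RATS roles. Illustrative only: real systems
# use asymmetric keys protected by hardware, not a shared secret.
DEVICE_KEY = b"key-inside-protection-boundary"  # hypothetical

def attester_produce_evidence(claims: dict) -> dict:
    """Attester: signs claims about itself, producing evidence."""
    payload = json.dumps(claims, sort_keys=True).encode()
    sig = hmac.new(DEVICE_KEY, payload, hashlib.sha256).hexdigest()
    return {"claims": claims, "signature": sig}

def verifier_appraise(evidence: dict) -> bool:
    """Verifier: checks the signature and interprets the evidence."""
    payload = json.dumps(evidence["claims"], sort_keys=True).encode()
    expected = hmac.new(DEVICE_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, evidence["signature"])

def relying_party_decide(appraisal: bool) -> str:
    """Relying party: makes a decision from the verifier's result."""
    return "allow" if appraisal else "deny"

evidence = attester_produce_evidence({"boot_state": "verified"})
decision = relying_party_decide(verifier_appraise(evidence))
```

Tampering with the claims after signing causes appraisal to fail, which is exactly the property a bare claim (like a user agent string) lacks.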
Two Families of Attestation
Attestation appears in many forms, but in practice it falls into two broad families.
One family answers where a key came from and whether it is protected by an appropriate security boundary. The other answers what code is running and whether it is running in an environment that matches expected security policies. They both produce signed evidence but they measure and assert different properties.
Key Management Attestation: Provenance and Protection
YubiKey PIV Attestation
YubiKeys provide a clear example of key management attestation. When you create a key in a PIV slot the device generates an attestation certificate describing that key. The trust structure behind this is simple. Each YubiKey contains a root attestation certificate that serves as the trust anchor. Beneath that root is a device-specific issuing CA certificate whose private key lives inside the secure element and cannot be extracted. When a verifier asks the device to attest a slot, the issuing CA signs a brand-new attestation certificate for that session. The public key in the certificate is always the same if the underlying slot key has not changed, but the certificate itself is newly generated each time, with a different serial number and signature. This design allows verifiers to confirm that the key was generated on the device while keeping the blast radius small. If one token is compromised, only that device is affected.
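The per-session structure described above can be modeled in a few lines. This is a deliberately simplified sketch — real PIV attestation produces X.509 certificates signed inside the secure element, and every name below is hypothetical — but it captures the invariant a verifier relies on: a stable slot public key wrapped in a freshly issued certificate each time.

```python
import hashlib, os

# Toy model of YubiKey PIV attestation. The slot key is fixed once
# generated; each attestation request yields a fresh "certificate"
# with a new serial number and signature over the same public key.
SLOT_PUBLIC_KEY = "slot-9a-public-key-bytes"  # hypothetical stand-in

def attest_slot() -> dict:
    """Issue a fresh per-session attestation for the slot key."""
    serial = os.urandom(8).hex()  # new serial every session
    body = f"{serial}:{SLOT_PUBLIC_KEY}"
    # Stand-in for the device issuing CA's signature.
    signature = hashlib.sha256(body.encode()).hexdigest()
    return {"serial": serial, "public_key": SLOT_PUBLIC_KEY,
            "signature": signature}

first, second = attest_slot(), attest_slot()
```

Two attestations of the same slot share a public key but differ in serial and signature, which is the behavior described above.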
Cloud HSMs and the Marvell Ecosystem
Cloud HSMs scale this idea to entire services. They produce signed statements asserting that keys were generated inside an HSM, protected under specific roots, bound to non-exportability rules, and conforming to certification regimes. Many cloud HSMs use Marvell hardware, and other commercial and open HSMs implement attestation as well. The Marvell-based examples are used here simply because their inconsistencies are illustrative, not because they are the only devices that support attestation. Many vendors provide their own attestation formats and trust chains. AWS CloudHSM and Google Cloud HSM share that silicon base, but their attestation formats differ because they use different firmware and integration layers.
This inconsistency creates a real challenge for anyone who needs to interpret attestation evidence reliably. Even when the underlying hardware is the same, the attestation structures are not. To make this practical to work with we maintain an open source library that currently decodes, validates, and normalizes attestation evidence from YubiKeys and Marvell-based HSMs, and is designed to support additional attestation mechanisms over time. Normalization matters because if we want attestation to be widely adopted we cannot expect every verifier or relying party to understand every attestation format. Real systems often encounter many different kinds of attestation evidence from many sources, and a common normalization layer is essential to make verification scalable.
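The shape of such a normalization layer can be sketched as follows. The vendor field names and the common schema here are invented for illustration — they are not the library's actual API — but the idea is the same: map each vendor-specific format into one schema a verifier can reason about uniformly.

```python
# Sketch of a normalization layer: vendor-specific attestation
# evidence mapped into one common schema. All field names here are
# hypothetical, chosen only to illustrate the pattern.
def normalize(vendor: str, raw: dict) -> dict:
    if vendor == "yubikey":
        return {"key_in_hardware": True,  # PIV keys are device-generated
                "non_exportable": raw.get("pin_policy") is not None,
                "source": "yubikey-piv"}
    if vendor == "marvell":
        flags = raw.get("flags", {})
        return {"key_in_hardware": flags.get("hw", False),
                "non_exportable": flags.get("no_export", False),
                "source": "marvell-hsm"}
    raise ValueError(f"unsupported attestation format: {vendor}")

n1 = normalize("yubikey", {"pin_policy": "once"})
n2 = normalize("marvell", {"flags": {"hw": True, "no_export": True}})
```

Because both outputs share a schema, a relying party can write one policy check instead of one per vendor format.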
Hardware alone does not define the attestation model. The actual evidence produced by the device does.
Mobile Key Attestation: Android and iOS
Mobile devices are the largest deployment of secure hardware anywhere. Their attestation mechanisms reflect years of lessons about device identity, OS integrity, and tamper resistance.
Android Keymaster and StrongBox
Android attestation provides information about the secure element or TEE, OS version, patch level, verified boot state, device identity, downgrade protection, and key properties. It anchors keys to both hardware and system state. This attestation is used for payments, enterprise identity, FIDO authentication, and fraud reduction.
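A verifier consuming those fields typically appraises them against a relying-party policy. The sketch below uses invented field names (the real record is an ASN.1 extension in the attestation certificate) to show what such an appraisal looks like: each property the attestation asserts is checked, and failures are collected rather than silently ignored.

```python
# Hedged sketch: appraising fields a verifier might extract from an
# Android key attestation record. Field names are illustrative, not
# the actual ASN.1 schema.
def appraise_android(att: dict, min_patch: str = "2024-06") -> list:
    """Return a list of policy failures; empty means acceptable."""
    failures = []
    if att.get("verified_boot_state") != "Verified":
        failures.append("verified boot not in Verified state")
    if att.get("security_level") not in ("TrustedEnvironment", "StrongBox"):
        failures.append("key not protected by TEE or StrongBox")
    # Patch levels in YYYY-MM form compare correctly as strings.
    if att.get("os_patch_level", "") < min_patch:
        failures.append("OS patch level too old")
    return failures

ok = appraise_android({"verified_boot_state": "Verified",
                       "security_level": "StrongBox",
                       "os_patch_level": "2024-09"})
bad = appraise_android({"verified_boot_state": "Unverified",
                        "security_level": "Software",
                        "os_patch_level": "2023-01"})
```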
Apple Secure Enclave Attestation
Apple takes a similar approach using a different chain. Secure Enclave attestation asserts device identity, OS trust chain, enclave identity, and key provenance. It supports Apple Pay, iCloud Keychain, MDM enrollment, and per-app cryptographic isolation.
Confidential Computing Attestation: Code and Environment
Confidential computing attestation solves a different problem. Instead of proving where a key came from, it proves what code is running and whether it is running in an environment that meets expected security constraints.
Intel SGX provides enclave reports that describe enclave measurements. AMD SEV-SNP provides VM measurement reports. AWS Nitro Enclaves use signed Nitro documents. Google Confidential VMs combine SEV-SNP with Google’s verification policies.
This evidence asserts which measurements the hardware recorded, whether memory is isolated, and whether the platform is genuine.
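The core verifier-side check across all of these platforms reduces to comparing a hardware-reported measurement against the expected measurement of the code the relying party intended to run. This sketch shows that comparison in isolation (the hash choice and payloads are illustrative; real reports also carry signatures and platform certificates):

```python
import hashlib

# Sketch of the central check in confidential-computing attestation:
# compare the measurement the hardware reports against the expected
# hash of the code we intended to run. Payloads are illustrative.
expected_code = b"enclave-binary-v1"
EXPECTED_MEASUREMENT = hashlib.sha384(expected_code).hexdigest()

def verify_measurement(reported: str) -> bool:
    """Accept only if the platform measured exactly the expected code."""
    return reported == EXPECTED_MEASUREMENT

genuine = verify_measurement(
    hashlib.sha384(b"enclave-binary-v1").hexdigest())
tampered = verify_measurement(
    hashlib.sha384(b"enclave-binary-evil").hexdigest())
```

In a real deployment the reported measurement arrives inside a signed report, so the verifier must also validate the platform's signature and certificate chain before trusting the value.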
Why the Distinction Matters
Key management attestation cannot answer questions about code execution. Confidential computing attestation cannot answer questions about where keys were created. The evidence is different, the claims are different, and the trust chains are different.
If you do not understand which form of attestation you are dealing with you cannot interpret its meaning correctly.
Regulatory and Policy Pressure
Attestation is becoming important because the bar for trust has been raised. The clearest example is the CA/Browser Forum Code Signing Baseline Requirements, which mandate hardware-protected private keys and increasingly rely on attestation as the evidence of compliance.
Secure development frameworks including the EU Cyber Resilience Act push vendors toward demonstrating that firmware and update signing keys were generated and protected in secure environments. Enterprise procurement policies frequently require the same assurances. These rules do not always use the word attestation, but the outcomes they demand can only be met with attestation evidence.
The Lesson
Attestation is evidence. It is not truth. It is stronger than a claim because it is anchored in a protection boundary, but the strength of that boundary varies across systems and architectures. The meaning of the evidence depends on the attester, the verifier, and the assumptions of the relying party.
There are two major forms of attestation. Key management attestation tells you where a key came from and how it is protected. Confidential computing attestation tells you what code is running and where it is running.
As computing continues to move onto systems we do not control and becomes more and more distributed, attestation will become the foundation of trust. Secure systems will rely on verifiable evidence instead of assumptions, and attestation will be the language used to express that evidence.
Code signing was supposed to tell you who published a piece of software so you could decide whether to trust and install it. For nearly three decades, cryptographic signatures have bound a binary to a publisher’s identity, guaranteeing it hasn’t been tampered with since signing. But on Windows, that system is now broken in ways that would make its original designers cringe.
Attackers have found ways to completely subvert this promise without breaking a single cryptographic primitive. They can now create an unlimited number of different malicious binaries that all carry the exact same “trusted” signature, or exploit careless publishers operating signing oracles that let others turn legitimate software into a bootloader for malware. The result is a system where valid signatures from trusted companies can no longer tell you anything meaningful about what the software will actually do.
Attackers don’t need to steal keys or compromise Certificate Authorities. They use the legitimate vendor software and publicly trusted code signing certificates, perverting the entire purpose of publisher-identity-based code signing.
Microsoft’s Long-Standing Awareness
Microsoft has known about the issue of malleability for at least a decade. In 2013, they patched CVE-2013-3900, where attackers could modify signed Windows executables, adding malicious code in “unverified portions” without invalidating the Authenticode signature. WinVerifyTrust improperly validated these files, allowing one “trusted” signature to represent completely different, malicious behavior.
This revealed a deeper architectural flaw: signed binaries could be altered by unsigned data. Microsoft faced a classic platform dilemma – the kind that every major platform holder eventually confronts. Fixing this comprehensively risked breaking legacy software critical to their vast ecosystem, potentially disrupting thousands of applications that businesses depended on daily. The engineering tradeoffs were genuinely difficult: comprehensive security improvements versus maintaining compatibility for millions of users and enterprise customers who couldn’t easily update or replace critical software.
They made the fix optional, prioritizing ecosystem compatibility over security hardening. This choice might have been understandable from a platform perspective in 2013, when the threat landscape was simpler and the scale of potential abuse wasn’t yet clear. But it becomes increasingly indefensible as attacks evolved and the architectural weaknesses became a systematic attack vector rather than an isolated vulnerability.
In 2022, Microsoft republished the advisory, confirming they still won’t enforce stricter verification by default. While today’s issues differ, they are part of the same class of vulnerabilities attackers now exploit systematically. The “trusted-but-mutable” flaw is now starting to permeate the Windows code signing ecosystem. Attackers use legitimate, signed applications as rootkit-like trust proxies, inheriting vendors’ reputation and bypass capabilities to deliver arbitrary malicious payloads.
Two incidents show we’re not dealing with isolated bugs but with systematic assaults on the core assumptions of Microsoft’s code signing.
ConnectWise: When Legitimate Software Adopts Malware Design Patterns
ConnectWise didn’t stumble into a vulnerability. They deliberately engineered their software using design patterns from the malware playbook. Their “attribute stuffing” technique embeds unsigned configuration data in the unauthenticated_attributes field of the PKCS#7 (CMS) envelope, a tactic malware authors use to conceal payloads in signed binaries.
In PKCS#7, the SignedData structure includes a signed digest (covering the binary and metadata) and optional unauthenticated_attributes, which lie outside the digest and can be modified post-signing without invalidating the signature. ConnectWise’s ScreenConnect installer misuses the Microsoft-reserved OID for Individual Code Signing (1.3.6.1.4.1.311.4.1.1) in this field to store unsigned configuration data, such as the server endpoints that act as the command-and-control server for their client. This OID, meant for specific code signing purposes, is exploited to embed attacker-controlled configs, allowing the same signed binary to point to different servers without altering the trusted signature.
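The structural weakness can be demonstrated with a toy model. This sketch deliberately simplifies — the "signature" is a keyed MAC standing in for the publisher's real Authenticode signature, and the appended blobs stand in for unauthenticated_attributes — but the property is the same: bytes outside the signed region can change freely while the signature stays valid.

```python
import hmac, hashlib

# Toy demonstration of the unauthenticated_attributes problem:
# the signature covers only the signed region, so data appended
# outside it can vary without invalidating the signature.
SIGNING_KEY = b"publisher-key"  # stand-in for the publisher's key

def sign(signed_region: bytes) -> bytes:
    return hmac.new(SIGNING_KEY, signed_region, hashlib.sha256).digest()

def verify(signed_region: bytes, signature: bytes) -> bool:
    return hmac.compare_digest(sign(signed_region), signature)

binary = b"installer-code"
signature = sign(binary)

# Two "variants" of the same signed installer: identical signed
# region and signature, different unauthenticated configuration.
variant_a = binary + signature + b'{"server": "legit.example"}'
variant_b = binary + signature + b'{"server": "attacker.example"}'
```

Both variants carry the same valid signature over the same signed region, yet they hash to different files and point the client at different servers.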
The ConnectWise ScreenConnect incident emerged when River Financial’s security team found attackers creating a fake website, distributing malware as a “River desktop app.” It was a trust inheritance fraud, a legitimately signed ScreenConnect client auto-connecting to an attacker-controlled server.
Windows trusts this as legitimate ConnectWise software: no SmartScreen warnings, no UAC prompts, silent installation, and immediate remote control. Attackers can generate a fresh installer via a ConnectWise trial account, or simply take an existing package and manually edit the unauthenticated_attributes: extracting a benign signature, grafting in a malicious configuration blob (e.g., an attacker C2 server), reinserting the modified signature, and producing a “trusted” binary. Every variant shares the certificate’s reputation, bypassing Windows security.
Why does Windows trust binaries with oversized, unusual unauthenticated_attributes? Legitimate signatures need minimal metadata, yet Windows ignores red flags like large attribute sections, treating them as fully trusted. ConnectWise’s choice to embed mutable configs mirrors malware techniques, creating an infinite malware factory where one signed object spawns unlimited trusted variants.
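One defensive heuristic implied by this observation is simple to state: treat abnormally large signature blocks as suspicious, since legitimate signatures need little metadata. The sketch below shows the check in isolation; the threshold value is hypothetical and a real implementation would parse the PE security directory rather than accept raw bytes.

```python
# Hedged sketch of the heuristic argued for above: flag binaries
# whose signature block is far larger than legitimate metadata
# requires. The threshold is illustrative, not a vetted policy.
MAX_REASONABLE_SIG_BLOCK = 16 * 1024  # bytes; hypothetical value

def suspicious_signature_block(sig_block: bytes) -> bool:
    """Flag signature blocks large enough to smuggle configuration
    or payloads inside unauthenticated attributes."""
    return len(sig_block) > MAX_REASONABLE_SIG_BLOCK

normal = suspicious_signature_block(b"\x30" * 4096)
stuffed = suspicious_signature_block(b"\x30" * 4096 + b"\x00" * 64 * 1024)
```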
Similarly, ConnectWise’s deliberate use of PKCS#7 unauthenticated attributes for ScreenConnect configurations, like server endpoints, bypasses code signing’s security, allowing post-signing changes that mirror malware tactics for hiding payloads in signed binaries. Likely prioritizing cost-saving over security, this choice externalizes the cost of abuse to users, enabling phishing campaigns. It is infuriating because it weaponizes signature flexibility that the industry has warned about for decades, normalizing flaws that demand urgent security responses. Solutions exist to fix this.
The Defense Dilemma
Trust inheritance attacks leave security teams in genuinely impossible positions – positions that highlight the fundamental flaws in our current trust model. Defenders face a no-win scenario where every countermeasure either fails technically or creates operational chaos.
Blocking file hashes fails because attackers generate infinite variants with different hashes but the same trusted signature – each new configuration changes the binary’s hash while preserving the signature’s validity. This isn’t a limitation of security tools; it’s the intended behavior of code signing, where the same certificate can sign multiple different binaries.
Blocking the certificate seems like the obvious solution until you realize it disrupts legitimate software, causing operational chaos for organizations relying on the vendor’s products. For example, how are defenders to know what else was signed by that certificate? Doing so is effectively a self-inflicted denial of service that can shut down critical business operations. Security teams face the impossible choice between allowing potential malware and breaking their own infrastructure.
Behavioral detection comes too late in the attack chain. By the time suspicious behavior triggers alerts, attackers have already gained remote access, potentially disabled monitoring, installed additional malware, or begun data exfiltration. The initial trust inheritance gives attackers a crucial window of legitimacy.
These attacks operate entirely within the bounds of “legitimate” signed software, invisible to signature-based controls that defenders have spent years tuning and deploying. Traditional security controls assume that valid signatures from trusted publishers indicate safe software – an assumption these attacks systematically exploit. Cem Paya’s detailed analysis, part of River Financial’s investigation, provides a proof-of-concept for attribute grafting, showing how trivial it is to create trusted malicious binaries.
ConnectWise and Atera resemble modern Back Orifice, which debuted at DEF CON in August 1998 to demonstrate security flaws in Windows 9x. The evolution is striking: Back Orifice emerged two years after Authenticode’s 1996 introduction, specifically to expose Windows security weaknesses, requiring stealth and evasion to avoid detection. Unlike Back Orifice, which had to hide from the code signing protections Microsoft had established, these modern tools don’t evade those protections – they weaponize them, inheriting trust from valid signatures while delivering the same remote control capabilities without warnings.
Atera: A Trusted Malware Factory
Atera provides a legitimate remote monitoring and management (RMM) platform similar to ConnectWise ScreenConnect, giving IT administrators remote access capabilities for managing client systems. Like other RMM solutions, Atera distributes signed client installers that establish persistent connections to their management servers.
They also operate what effectively amounts to a public malware signing service. Anyone with an email can register for a free trial and receive customized, signed, timestamped installers. Atera’s infrastructure embeds attacker-supplied identifiers into the MSI’s Property table, then signs the package with their legitimate certificate.
This breaks code signing’s promise of publisher accountability. Windows sees “Atera Networks Ltd,” assigns the binary the reputation of the authentic package, but can’t distinguish whether the binary came from Atera’s legitimate operations or from an anonymous attacker who signed up minutes ago. The signature’s identity becomes meaningless when it could represent anyone.
In a phishing campaign targeting River Financial’s customers, Atera’s software posed as a “River desktop app,” with attacker configs embedded in a signed binary.
The binary carried a valid signature from “Atera Networks Ltd.”
Atera provides a cloud-based remote monitoring and management (RMM) platform, unlike ScreenConnect, which supports both on-premises and cloud deployments with custom server endpoints. Atera’s agents connect only to Atera’s servers, but attackers abuse its free trial to generate signed installers tied to their accounts via embedded identifiers (like email or account ID) in the MSI Property table. This allows remote control through Atera’s dashboard, turning it into a proxy for malicious payloads. Windows trusts the “Atera Networks Ltd.” signature but cannot distinguish legitimate from attacker-generated binaries. Atera’s lack of transparency, with no public list of signed binaries or auditable repository, hides abuse, leaving defenders fighting individual attacks while systemic issues persist.
A Personal Reckoning
I’ve been fighting this fight for over two decades. Around 2001, as a Product Manager at Microsoft overseeing a wide range of security and platform features, I inherited Authenticode among many responsibilities. Its flaws were glaring: malleable PE formats, weak ASN.1 parsing, and signature formats vulnerable to manipulation.
We fixed some issues – hardened parsing, patched PE malleability – but deeper architectural changes faced enormous resistance. Proposals for stricter signature validation or new formats to eliminate mutable fields were blocked by the engineering realities of platform management. The tension between security ideals and practical platform constraints was constant and genuinely difficult to navigate.
The mantra was “good enough,” but this wasn’t just engineering laziness. Authenticode worked for 2001’s simpler threat landscape, where attacks were primarily about bypassing security rather than subverting trust itself. The flexibility we preserved was seen as a necessary feature for ecosystem compatibility – allowing for signature formats that could accommodate different types of metadata and varying implementation approaches across the industry.
The engineering tradeoffs were real, every architectural improvement risked breaking existing software, disrupting the development tools and processes that thousands of ISVs depended on, and potentially fragmenting the ecosystem. The business pressures were equally real: maintaining compatibility was essential for Windows’ continued dominance and Microsoft’s relationships with enterprise customers who couldn’t easily migrate critical applications.
It was never good enough for the long term. We knew it then, and we certainly know it now. The flexibility we preserved, designed for a simpler era, became systematic vulnerabilities as threats evolved from individual attackers to sophisticated operations exploiting trust infrastructure itself. Every time we proposed fundamental fixes, legitimate compatibility concerns and resource constraints won out over theoretical future risks that seemed manageable at the time.
This is why I dove into Sigstore, Binary Transparency, and various other software supply chain security efforts. These projects embody what we couldn’t fund in 2001, transparent, verifiable signing infrastructure that doesn’t rely on fragile trust-based compromises. As I wrote in How to keep bad actors out in open ecosystems, our digital identity models fail to provide persistent, verifiable trust that can scale with modern threat landscapes.
The Common Thread
ConnectWise and Atera expose a core flaw: code signing relies on trust and promises, not verifiable proof. The CA/Browser Forum’s 2023 mandate requires FIPS 140-2 Level 2 hardware key storage, raising the bar against key theft and casual compromise. But it’s irrelevant for addressing the fundamental problem: binaries designed for mutable, unsigned input or vendors running public signing oracles.
Figure 1: Evolution of Code Signing Hardware Requirements (2016-2024)
The mandate addresses yesterday’s threat model – key compromise – while today’s attacks work entirely within the intended system design. Compliance often depends on weak procedural attestations where subscriber employees sign letters swearing keys are on HSMs, rather than cryptographic proof of hardware protection. The requirement doesn’t address software engineered to bypass code signing’s guarantees, leaving systematic trust subversion untouched.
True cryptographic attestation, where hardware mathematically proves key protection, is viable today. Our work on Peculiar Ventures’ attestation library supports multiple formats, enabling programmatic verification without relying on trust or procedural checks. The challenge isn’t technical – it’s accessing diverse hardware for testing and building industry adoption, but the foundational technology exists and works.
The Path Forward
We know how to address this. A supply chain security renaissance is underway, tackling decades of accumulated technical debt and architectural compromise. Cryptographic attestation, which I’ve spent years developing, provides mathematical proof of key protection that can be verified programmatically by any party. For immediate risk reduction, the industry should move toward dynamic, short-lived credentials that aren’t reused across projects, limiting the blast radius when compromise or abuse occurs.
The industry must implement these fundamental changes:
Hardware-rooted key protection with verifiable attestation. The CA/Browser Forum mandates hardware key storage, but enforcement relies heavily on subscriber self-attestation rather than cryptographic proof. Requirements should be strengthened to mandate cryptographic attestations proving keys reside in FIPS 140-2/3 or Common Criteria certified modules. When hardware attestation isn’t available, key generation should be observed and confirmed by trusted third parties (such as CA partners with fiduciary relationships) rather than relying on subscriber claims.
Explicit prohibition of mutable shells and misaligned publisher identity. Signing generic stubs whose runtime behavior is dictated by unsigned configuration already violates Baseline Requirements §9.6.3 and §1.6.1, but this isn’t consistently recognized as willful signing of malware because the stub itself appears benign. The BRs should explicitly forbid mutable-shell installers and signing oracles that allow subscribers to bypass code signing’s security guarantees. A signed binary must faithfully represent its actual runtime behavior. Customized or reseller-specific builds should be signed by the entity that controls that behavior, not by a vendor signing a generic stub.
Subscriber accountability and disclosure of abusive practices. When a CA becomes aware that a subscriber is distributing binaries where the trusted signature is decoupled from actual behavior, this should be treated as a BR violation requiring immediate action. CAs should publish incident disclosures, suspend or revoke certificates per §9.6.3, and share subscriber histories to prevent CA shopping after revocation. This transparency is essential for ecosystem-wide learning and deterrence.
Code Signing Certificate Transparency. All CAs issuing code signing certificates should be required to publish both newly issued and historical certificates to dedicated CT logs. Initially, these could be operated by the issuing CAs themselves, since ecosystem building takes time and coordination. Combined with the existing list of code signing CAs and log lookup systems (like CCADB.org), this would provide ecosystem-wide visibility into certificate issuance, enable faster incident response, and support independent monitoring for misissuance and abuse patterns.
Explicit Subscriber Agreement obligations and blast radius management. Subscriber Agreements should clearly prohibit operating public signing services or designing software that bypasses code signing security properties such as mutable shells or unsigned configuration. Certificate issuance flows should require subscribers to explicitly acknowledge these obligations at the time of certificate request. To reduce the blast radius of revocation, subscribers should be encouraged or required to use unique keys or certificates per product or product family, ensuring that a single compromised or misused certificate doesn’t invalidate unrelated software.
Controls for automated or cloud signing systems. Subscribers using automated or cloud-based signing services should implement comprehensive use-authorization controls, including policy checks on what enters the signing pipeline, approval workflows for signing requests, and auditable logs of all signing activity. Without these controls, automated signing pipelines become essentially malware factories with legitimate certificates. Implementation requires careful balance between automation efficiency and security oversight, but this is a solved problem in other high-security domains.
Audit logging and evidence retention. Subscribers using automated and cloud signing services should maintain detailed logs of approval records for each signing request, cryptographic hashes of submitted inputs and signed outputs, and approval decision trails. These logs must be retained for a defined period (such as two years or more) and made available to the CA or authorized auditors upon request. This ensures complete traceability and accountability, preventing opaque signing systems from being abused as anonymous malware distribution platforms.
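The tamper-evidence such logs need can be achieved by hash-chaining entries, so that editing any past record breaks every subsequent link. This is a minimal sketch under stated assumptions — field names and retention mechanics are invented for illustration, and a production system would also sign and replicate the log:

```python
import hashlib, json

# Sketch of a tamper-evident signing-pipeline audit log: each
# entry's hash covers the previous entry's hash, so after-the-fact
# edits are detectable. Field names are illustrative.
def append_entry(log: list, request: dict) -> None:
    prev = log[-1]["entry_hash"] if log else "0" * 64
    body = json.dumps(request, sort_keys=True)
    entry_hash = hashlib.sha256((prev + body).encode()).hexdigest()
    log.append({"request": request, "prev": prev, "entry_hash": entry_hash})

def verify_log(log: list) -> bool:
    """Recompute the chain; any edited entry breaks verification."""
    prev = "0" * 64
    for e in log:
        body = json.dumps(e["request"], sort_keys=True)
        if e["prev"] != prev:
            return False
        if e["entry_hash"] != hashlib.sha256((prev + body).encode()).hexdigest():
            return False
        prev = e["entry_hash"]
    return True

log = []
append_entry(log, {"input_sha256": "aa" * 32, "approver": "release-team"})
append_entry(log, {"input_sha256": "bb" * 32, "approver": "release-team"})
intact = verify_log(log)
log[0]["request"]["approver"] = "attacker"  # simulate tampering
tampered_detected = not verify_log(log)
```

The same construction underlies the transparency logs discussed below: an auditor who retains only the latest entry hash can detect rewriting of any earlier record.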
Microsoft must take immediate action on multiple fronts. In addition to championing the above industry changes, they should automatically distrust executables if their Authenticode signature exceeds rational size thresholds, reducing the attack surface of oversized signature blocks as mutation vectors. They should also invest seriously in Binary Transparency adoption, publishing Authenticode signed binaries to tamper-evident transparency logs as is done in Sigstore, Golang module transparency, and Android Firmware Transparency. Their SCITT-based work for confidential computing would be a reasonable approach for them to extend to the rest of their code signing infrastructure. This would provide a tamper-evident ledger of every executable Windows trusts, enabling defenders to trace and block malicious payloads quickly and systematically.
Until these controls become standard practice, Authenticode cannot reliably distinguish benign signed software from weaponized installers designed for trust subversion.
Breaking the Trust Contamination Infrastructure
These code-signing attacks mirror traditional rootkits in their fundamental approach: both subvert trust mechanisms rather than bypassing them entirely. A kernel rootkit doesn’t break the OS security model – it convinces the OS that malicious code is legitimate system software. Similarly, these “trusted wrapper” and “signing oracle” attacks don’t break code signing cryptography – they convince Windows that malware is legitimate software from trusted publishers.
The crucial difference is that while rootkits require sophisticated exploitation techniques and deep system knowledge, these trust inheritance attacks exploit the system’s intended design patterns, making them accessible to a much broader range of attackers and much harder to defend against using traditional security controls.
ConnectWise normalized malware architecture in legitimate enterprise software. Atera built an industrial-scale malware factory that operates in plain sight. Microsoft’s platform dutifully executes the result with full system trust, treating sophisticated trust subversion attacks as routine software installations.
This isn’t about isolated vulnerabilities that can be patched with point fixes. We’re facing a systematic trust contamination infrastructure that transforms the code signing ecosystem into an adversarial platform where legitimate trust mechanisms become attack vectors. Until we address the architectural flaws that enable this pattern systematically, defenders will remain stuck playing an unwinnable game of certificate whack-a-mole against an endless assembly line of trusted malware.
The technology to fix this exists today. Modern supply chain security projects demonstrate that transparent, verifiable trust infrastructure is not only possible but practical and deployable.
The only missing ingredient is the industry-wide will to apply these solutions and the recognition that “good enough” security infrastructure never was – and in today’s threat landscape, the costs of inaction far exceed the disruption of fundamental architectural improvements.
P.S. Thanks to Cem Paya, and Matt Ludwig from River Financial for the great research work they did on both of these incidents.
In the past, I’ve written about how to measure the WebPKI, and from time to time I post brief updates on how the market is evolving.
The other day, Matthew McPherrin posted a script showing how to use Mozilla telemetry data to analyze which Certificate Authorities are most critical to the web: specifically, what percentage of browsing relies on each CA. Mozilla publishes data from Firefox’s telemetry on how many times each CA is used to successfully validate certificates, which is a good measure of how “big” a CA actually is. The data is hard to view in Mozilla’s public systems, though, so he wrote a script to combine a few data sources and graph it.
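His script pulls the real telemetry; as an illustration of the core computation only, here is a pure-Python sketch that turns per-CA successful-validation counts into percentage-of-browsing shares. The counts below are invented placeholders, not Mozilla’s actual numbers:

```python
from collections import Counter

# Hypothetical per-CA successful-validation counts, standing in for the
# Firefox telemetry buckets the real script downloads from Mozilla.
validations = Counter({
    "Google Trust Services": 4_100_000,
    "DigiCert": 2_600_000,
    "Let's Encrypt": 2_400_000,
    "Others": 1_900_000,
})

def usage_share(counts: Counter) -> dict:
    """Convert raw validation counts into percentage-of-browsing shares."""
    total = sum(counts.values())
    # most_common() sorts descending, so the resulting dict is ranked.
    return {ca: round(100 * n / total, 1) for ca, n in counts.most_common()}

shares = usage_share(validations)
```

The interesting analysis then comes from comparing this ranking against the issuance-volume ranking derived from CT logs.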
I normally focus on total issuance numbers since they’re easier to obtain. That data comes from Certificate Transparency logs, which contain all publicly trusted certificates that you might encounter without seeing an interstitial warning about the certificate not being logged (like this example).
What the Data Reveals
Both datasets feature many of the same major players. But there are some striking differences that reveal important insights about the WebPKI ecosystem.
Let’s Encrypt dominates certificate issuance at 46.1% of all certificates. But it ranks third in Firefox’s actual usage telemetry. This suggests Let’s Encrypt serves many lower-traffic sites. Meanwhile, Google Trust Services leads in Firefox usage while ranking second in certificate issuance volume. This shows how high-traffic sites can amplify a CA’s real-world impact.
DigiCert ranks second in Firefox usage while placing fourth in certificate issuance volume at 8.3%. This reflects their focus on major enterprise customers. With clients like Meta (Facebook, Instagram, WhatsApp), they secure some of the world’s highest-traffic websites. This “fewer certificates, massive impact” approach drives them up the usage charts despite not competing on volume with Let’s Encrypt.
Google’s dominance reflects more than just their own properties like Google.com, YouTube, and Gmail. Google Cloud offers arguably the best load balancer solution in the market (full disclosure: I worked on this project). You get TLS by default for most configurations. Combined with their global network that delivers CDN-like benefits out of the gate, this attracts major platforms like Wix and many others to build on Google Cloud. When these platforms choose Google’s infrastructure, they automatically inherit Google Trust Services certificates.
Looking at the usage data reveals other interesting patterns. Deutsche Telekom Security, Government of Turkey, (UPDATE: turns out the Turkey entry is a Firefox bug: they’re using bucket #1 for both locally installed roots and Kamu SM, apparently by accident) and SECOM Trust Systems all appear prominently in Firefox telemetry but barely register in issuance numbers. In some respects, it’s no surprise that government-issued certificates see disproportionate usage. Government websites are often mandated for use. Citizens have to visit them for taxes, permits, benefits, and other essential services.
Microsoft Corporation appears significantly in issuance data (6.5%) but doesn’t register in the Firefox telemetry. This reflects their focus on enterprise and Windows-integrated scenarios rather than public web traffic.
GoDaddy shows strong issuance numbers (10.5%) but more modest representation in browsing telemetry. This reflects their massive domain parking operations. They issue certificates for countless parked domains that receive minimal actual user traffic.
Why This Matters
Mozilla Firefox represents under 3% of global browser market share. This telemetry reflects a smaller segment of internet users. While this data provides valuable insights into actual CA usage patterns, it would be ideal if Chrome released similar telemetry data. Given Chrome’s dominant 66.85% market share, their usage data would dramatically improve our understanding of what real WebPKI usage actually looks like across the broader internet population.
The contrast between certificate issuance volume and actual browsing impact reveals important truths about internet infrastructure. CT logs currently show over 450,000 certificates being issued per hour across all CAs. Yet as this Firefox telemetry data shows, much of that volume serves lower-traffic sites while a smaller number of high-traffic certificates drive the actual user experience. Some CAs focus on high-volume, automated issuance for parked domains and smaller sites. Others prioritize fewer certificates for high-traffic, essential destinations. Understanding both metrics helps us better assess the real-world criticality of different CAs for internet security and availability.
Raw certificate counts don’t tell the whole story. The websites people actually visit, and sometimes must visit, matter just as much as the sheer number of certificates issued. Some certificates protect websites with “captive audiences” or essential services, while others protect optional destinations. A government tax portal or YouTube will always see more traffic than the average small business website, regardless of how many certificates each CA issues.
Regardless of how you count, I’ve had the pleasure of working closely with at least 7 of the CAs in the top 10 in their journeys to become publicly trusted CAs. Each of these CAs has had varying goals for their businesses and operations, and that’s exactly why you see different manifestations in the outcomes. Let’s Encrypt focused on automation and volume. DigiCert targeted enterprise customers. Google leveraged their cloud infrastructure. GoDaddy built around domain services.
Either way, it’s valuable to compare and contrast these measurement approaches to see what the WebPKI really looks like beyond just raw certificate counts.
When we discuss the WebPKI, we naturally focus on Certificate Authorities (CAs), browser root programs, and the standards established by the CA/Browser Forum. Yet for these standards to carry real weight, they must be translated into formal, auditable compliance regimes. This is where assurance frameworks enter the picture, typically building upon the foundational work of the CA/Browser Forum.
The WebTrust framework, overseen by professional accounting bodies, is only one way to translate CA/Browser Forum requirements into auditable criteria. In Europe, a parallel scheme relies on the European Telecommunications Standards Institute (ETSI) for the technical rules, with audits carried out by each country’s ISO/IEC 17065-accredited Conformity Assessment Bodies. Both frameworks follow the same pattern: they take the CA/Browser Forum standards and repackage them into structured compliance audit programs.
Understanding the power dynamics here is crucial. While these audits scrutinize CAs, they exercise no direct control over browser root programs. The root programs at Google, Apple, Microsoft, and Mozilla remain the ultimate arbiters. They maintain their own policies, standards, and processes that extend beyond what these audit regimes cover. No one compels the browsers to require WebTrust or ETSI audits; they volunteer because obtaining clean reports from auditors who have seen things in person helps them understand if the CA is competent and living up to their promises.
How WebTrust Actually Works
With this context established, let’s examine the WebTrust model prevalent across North America and other international jurisdictions. In North America, administration operates as a partnership between the AICPA (for the U.S.) and CPA Canada. For most other countries, CPA Canada directly manages international enrollment, collaborating with local accounting bodies like the HKICPA for professional oversight.
These organizations function through a defined sequence of procedural steps: First, they participate in the CA/Browser Forum to provide auditability perspectives. Second, they fork the core technical requirements and rebundle them as the WebTrust Principles and Criteria. Third, they license accounting firms to conduct audits based on these principles and criteria. Fourth, they oversee licensed practitioners through inspection and disciplinary processes.
The audit process follows a mechanical flow. CA management produces an Assertion Letter claiming compliance. The auditor then tests that assertion and produces an Attestation Report, a key data point for browser root programs. Upon successful completion, the CA can display the WebTrust seal.
This process creates a critical misconception about what the WebTrust seal actually signifies. Some marketing approaches position successful audits as a “gold seal” of approval, suggesting they represent the pinnacle of security and best practices. They do not. A clean WebTrust report simply confirms that a CA has met the bare minimum requirements for WebPKI participation; it represents the floor, not the ceiling. The danger emerges when CAs treat this floor as their target; these are often the same CAs responsible for significant mis-issuances and ultimate distrust by browser root programs.
Where Incentives Break Down
Does this system guarantee consistent, high-quality CA operations? The reality is that the system’s incentives and structure actively work against that goal. This isn’t a matter of malicious auditors; we’re dealing with human nature interacting with a flawed system, compounded by a critical gap between general audit principles and deep technical expertise.
Security professionals approach assessments expecting auditors to actively seek problems. That incentive doesn’t exist here. CPA audits are fundamentally designed for financial compliance verification, ensuring documented procedures match stated policies. Security assessments, by contrast, actively hunt for vulnerabilities and weaknesses. These represent entirely different audit philosophies: one seeks to confirm documented compliance, the other seeks to discover hidden risks.
This philosophical gap becomes critical when deep technical expertise meets general accounting principles. Even with impeccably ethical and principled auditors, you can’t catch what you don’t understand. A financial auditor trained to verify that procedures are documented and followed may completely miss that a technically sound procedure creates serious security vulnerabilities.
This creates a two-layer problem. First, subtle but critical ambiguities or absent content in a CA’s Certification Practice Statement (CPS) and practices might not register as problems to non-specialists. Second, even when auditors do spot vague language, commercial pressures create an impossible dilemma: push the customer toward greater specificity (risking the engagement and future revenue), or let it slide due to the absence of explicit requirements.
This dynamic creates a classic moral hazard, similar to the one we explored in a recent post: auditors are paid by the very entities they’re supposed to scrutinize critically, creating incentives to overlook issues in order to maintain business relationships. Meanwhile, the consequences of missed problems (security failures, compromised trust, and operational disruptions) fall on the broader WebPKI ecosystem and billions of relying parties who had no voice in the audit process. This dynamic drives the inconsistencies we observe today and reflects a broader moral hazard problem plaguing the entire WebPKI ecosystem, where those making critical security decisions rarely bear the full consequences of poor choices.
This reality presents a prime opportunity for disruption through intelligent automation. The core problem is expertise “illiquidity”: deep compliance knowledge remains locked in specialists’ minds, trapped in manual processes, and prohibitively expensive to scale.
Current compliance automation has only created “automation asymmetry,” empowering auditees to generate voluminous, polished artifacts that overwhelm manual auditors. This transforms audits from operational fact-finding into reviews of well-presented fiction.
The solution requires creating true “skill liquidity” through AI: not just another LLM, but an intelligent compliance platform embedding structured knowledge from seasoned experts. Such a system would feature an ontology of controls, evidence requirements, and policy interdependencies, and it would perform the brutally time-consuming rote work that consumes up to 30% of manual audits (policy mapping, change-log scrutiny) with superior speed and consistency.
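As a toy illustration of the policy-mapping piece (the control names and expected phrases below are invented for this sketch, not an actual baseline), an automated reviewer might flag required controls that a CPS never addresses at all:

```python
# Toy ontology: each baseline control maps to phrases an automated
# reviewer would expect to find somewhere in the CPS text.
# Control names and phrases are illustrative, not a real requirement set.
REQUIRED_CONTROLS = {
    "domain-validation": ["dns-01", "http-01", "validation method"],
    "key-protection": ["hsm", "hardware security module"],
    "revocation": ["crl", "ocsp"],
}

def flag_gaps(cps_text: str) -> list[str]:
    """Return controls with no supporting language anywhere in the CPS."""
    text = cps_text.lower()
    return [control for control, phrases in REQUIRED_CONTROLS.items()
            if not any(p in text for p in phrases)]

cps = "Keys are stored in an HSM. Revocation is published via CRL and OCSP."
gaps = flag_gaps(cps)  # the CPS never describes domain validation
```

A production system would of course reason over structure and semantics rather than keyword presence, but even this crude pass turns “did the document address X?” into an objective, repeatable check.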
When auditors and program administrators gain access to this capability, the incentive model fundamentally transforms. AI can objectively flag ambiguities and baseline deviations that humans might feel pressured to overlook or lack the skill to notice, directly addressing the moral hazard inherent in the current system. When compliance findings become objective data points generated by intelligent systems rather than subjective judgments influenced by commercial relationships, they become much harder to ignore or rationalize away.
This transformation liquefies rote work, liberating human experts to focus on what truly matters: making high-stakes judgment calls, investigating system-flagged anomalies, and assessing control effectiveness rather than mere documented existence. This elevation transforms auditors from box-checkers into genuine strategic advisors, addressing the system’s core ethical challenges.
This new transparency and accountability shifts the entire dynamic. Audited entities can evolve from reactive fire drills to proactive, continuous self-assurance. Auditors, with amplified expertise and judgment focused on true anomalies rather than ambiguous documentation, can deliver exponentially greater value.
Moving Past the Performance
This brings us back to the fundamental issue: the biggest problem in communication is the illusion that it has occurred. Today’s use of the word “audit” creates a dangerous illusion of deep security assessment.
By leveraging AI to create skill liquidity, we can finally move past this illusion: automating the more mundane audit elements creates space for the security and correctness assessments people already assume are happening. We can forge a future where compliance transcends audit performance theater, becoming instead a foundation of verifiable, continuous operational integrity, built on truly accessible expertise rather than scarce, locked-away knowledge.
The WebPKI ecosystem deserves better than the bare minimum. With the right tools and transformed incentives, we can finally deliver it.
TL;DR: Root programs, which lose users when security fails, prioritize safety, and together with the major CAs they shape WebPKI rules. Most CAs, weighing the risk of distrust against the risk of losing customers, seek leniency, shifting risk onto billions of voiceless relying parties. Subscribers’ push for convenience fuels that CA resistance. The system demands reform.
The recent Mozilla CA Program roundtable discussion draws attention to a fundamental flaw in how we govern the WebPKI, one that threatens the security of billions of internet users. It’s a classic case of moral hazard: those making critical security decisions face minimal personal or professional consequences for poor choices, while those most affected have virtually no say in how the system operates.
The Moral Hazard Matrix
The numbers reveal a dangerous imbalance in who controls WebPKI policy versus who bears the consequences. Browsers, as root programs, face direct accountability; if security fails, users abandon them. CAs, on the other hand, are incentivized to reduce customer effort and boost margins, externalizing risks and leaving billions of relying parties to absorb the fallout:
A classic moral hazard structure, with a key distinction: browser vendors, as root programs, face direct consequences (lose security, lose users), which aligns their incentives with safety. CAs, while risking distrust or customer loss, often externalize greater risks to relying parties, betting that they won’t be held accountable for these decisions.
Mapping the Accountability Breakdown
The roundtable revealed a systematic divide in how stakeholders approach CPS compliance issues. CAs, driven by incentives to minimize customer effort for easy sales and reduce operational costs for higher margins, consistently seek to weaken accountability, while root programs and the security community demand reliable commitments:
| Position | Supported By | Core Argument | What It Really Reveals |
|---|---|---|---|
| “Revocation too harsh for minor CPS errors” | CA Owners | Policy mismatches shouldn’t trigger mass revocation | Want consequences-free policy violations |
| “Strict enforcement discourages transparency” | CA Owners | Fear of accountability leads to vague CPSs | Treating governance documents as optional “documentation” |
| “SLA-backed remedies for enhanced controls” | CA Owners | Credits instead of revocation for optional practices | Attempt to privatize trust governance |
| “Split CPS into binding/non-binding sections” | CA Owners | Reduce revocation triggers through document structure | Avoid accountability while claiming transparency |
| “Human error is inevitable” | CA Owners | Manual processes will always have mistakes | Excuse for not investing in automation |
| “Retroactive CPS fixes should be allowed” | CA Owners | Patch documents after problems surface | Gut the very purpose of binding commitments |
| “CPS must be enforceable promises” | Root Programs, Security Community | Documents should reflect actual CA behavior | Public trust requires verifiability |
| “Automation makes compliance violations preventable” | Technical Community | 65+% ACME adoption proves feasibility | Engineering solutions exist today |
The pattern is unmistakable: CAs consistently seek reduced accountability, while those bearing security consequences demand reliable commitments. The Microsoft incident perfectly illustrates this: rather than addressing the absence of systems that would automatically catch discrepancies before millions of certificates were issued incorrectly, industry discussion focused on making violations easier to excuse retroactively.
The Fundamental Mischaracterization
Much of the roundtable suffered from a critical misconception: treating the CPS as mere “documentation” rather than what it is, the foundational governance document that defines how a CA operates.
A CPS looks like a contract because it is a contract, a contract with the world. It’s the binding agreement that governs CA operations, builds trust by showing relying parties how the CA actually works, guides subscribers through certification requirements, and enables oversight by giving auditors a baseline against real-world issuance. When we minimize it as “documentation,” we’re arguing that CAs should violate their core operational commitments with minimal consequences.
CPS documents are the public guarantee that a CA knows what it’s doing and will stand behind it, in advance, in writing, in full view of the world. The moment we treat them as optional “documentation” subject to retroactive fixes, we’ve abandoned any pretense that trustworthiness can be verified rather than simply taken on blind faith.
Strategic Choices Masquerading as Constraints
Much CA pushback treats organizational and engineering design decisions as inevitable operational constraints. When CAs complain about “compliance staff being distant from engineering” or “inevitable human errors in 100+ page documents,” they’re presenting strategic choices as unchangeable facts.
CAs choose to separate compliance from operations rather than integrate them. They choose to treat CPS creation as documentation rather than operational specification. They choose to bolt compliance on after the fact rather than build it into core systems. When you choose to join root programs to be trusted by billions of people, you choose those responsibilities.
The CAs that consistently avoid compliance problems made different choices from the beginning, they integrated policy into operations, invested in automation, and designed systems where compliance violations are structurally difficult. These aren’t companies with magical resources; they’re companies that prioritized operational integrity.
The Technology-Governance Gap
The “automation is too hard” argument collapses against actual WebPKI achievements:
| Challenge | Current State | Feasibility Evidence | CA Resistance |
|---|---|---|---|
| Domain Validation | Fully automated via ACME | 65+% of web certificates | ✅ Widely adopted |
| Certificate Linting | Real-time validation at issuance | CT logs, zlint tooling | ✅ Industry standard |
| Transparency Logging | All certificates publicly logged | Certificate Transparency | ✅ Mandatory compliance |
| Renewal Management | Automated with ARI | Let’s Encrypt, others | ✅ Proven at scale |
| CPS-to-Issuance Alignment | Manual, error-prone | Machine-readable policies possible | ❌ “Too complex” |
| Policy Compliance Checking | After-the-fact incident reports | Automated validation possible | ❌ “Inevitable human error” |
The pattern is unmistakable: automation succeeds when mandated, fails when optional. With Certificate Transparency providing complete visibility, automated validation systems proven at scale, and AI poised to transform compliance verification across industries, operational CPSs represent evolution, not revolution.
The argument is that these “minor” incidents aren’t smoke signaling fire. But past distrust events show otherwise: it is almost always a pattern of mistakes that snowballs, while the most mature CA programs have only occasional issues, and when they do, they handle them well.
Trust Is Not an Entitlement
The question “why would CAs voluntarily adopt expensive automation?” reveals a fundamental misunderstanding. CAs are not entitled to being trusted by the world.
Trust store inclusion is a privilege that comes with responsibilities. If a CA cannot or will not invest in operational practices necessary to serve billions of relying parties reliably, they should not hold that privilege.
The economic argument is backwards:
Current framing: “Automation is expensive, so CAs shouldn’t be required to implement it”
Correct framing: “If you can’t afford to operate securely, accurately, and reliably, you can’t afford to be a public CA”
Consider the alternatives: public utilities must maintain infrastructure standards regardless of cost, financial institutions must invest in security regardless of expense, aviation companies must meet safety standards regardless of operational burden. The WebPKI serves more people than any of these industries, yet we’re supposed to accept that operational excellence is optional because it’s “expensive”?
CAs with consistent compliance problems impose costs on everyone else: subscribers face revocation disruption, relying parties face security risks, and root programs waste resources on incident management. The “expensive automation” saves the ecosystem far more than it costs individual CAs.
When Accountability Actually Works
The example of Let’s Encrypt changing their CPS from “90 days” to “less than 100 days” after a compliance issue is often cited as evidence that strict enforcement creates problems. This completely misses the point.
The “system” found a real compliance issue: inadequate testing between policy and implementation. That’s exactly what publishing specific commitments accomplishes: making gaps visible so they can be fixed. The accountability mechanism worked perfectly; Let’s Encrypt learned they needed better testing to ensure policy-implementation alignment.
This incident also revealed that we need infrastructure like ACME Renewal Information (ARI) so the ecosystem can manage obligations without fire drills. The right response isn’t vaguer CPSs to hide discrepancies, but better testing and ecosystem coordination so you can reliably commit to 90 days and revocations when mistakes happen.
The Solution: Operational CPSs
Instead of weakening accountability, we need CPSs as the living center of CA operations, machine-readable on one side to directly govern issuance systems, human-readable on the other for auditors and relying parties. In the age of AI, tools like large language models and automated validation can make this dual-purpose CPS tractable, aligning policy with execution.
This means CPSs written by people who understand actual issuance flows, updated in lock-step with operational changes, tied directly to automated linting, maintained in public version control, and tested continuously to verify documentation matches reality.
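A minimal sketch of what one machine-readable CPS commitment and its continuous test could look like (the policy schema here is hypothetical, invented for illustration): the same 90-day validity promise from the Let’s Encrypt example, expressed as data and checked against every certificate before it leaves the pipeline.

```python
from datetime import date

# Hypothetical machine-readable CPS fragment; the human-readable CPS
# would carry the same commitment in prose ("certificates are valid
# for at most 90 days").
MACHINE_CPS = {"max_validity_days": 90}

def lint_validity(not_before: date, not_after: date, policy: dict) -> list[str]:
    """Flag certificates whose lifetime exceeds the CPS commitment."""
    lifetime = (not_after - not_before).days
    if lifetime > policy["max_validity_days"]:
        return [f"validity {lifetime}d exceeds committed {policy['max_validity_days']}d"]
    return []

# A 100-day certificate against a 90-day commitment: exactly the kind of
# policy/implementation gap continuous testing is meant to surface
# before issuance rather than in an incident report.
problems = lint_validity(date(2025, 1, 1), date(2025, 4, 11), MACHINE_CPS)
```

Running checks like this in CI against both the policy repository and the issuance system is what “tested continuously to verify documentation matches reality” means in practice.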
Success criteria are straightforward:
Scope clarity: Which root certificates does this cover?
Profile fidelity: Could someone recreate certificates matching actual issuance?
Validation transparency: Can procedures be understood without insider knowledge?
Most CPSs fail these basic tests. The few that pass prove it’s entirely achievable when CAs prioritize operational integrity over administrative convenience.
Systemic Reform Requirements
Fixing moral hazard requires accountability mechanisms aligned with actual capabilities. Root programs typically operate with 1-2 people overseeing ~60 organizations issuing 450,000+ certificates per hour, structural challenges that automation must address.
| Stakeholder | Current State | Needed Reform | Supporting Change |
|---|---|---|---|
| Root Programs | 1-2 people overseeing ~60 organizations | Clearer requirements for CPS documents, automated evaluation tools, clear standards | Scalable infrastructure requiring scope clarity, profile fidelity, and validation transparency |
| Standards Bodies | Voluntary guidelines, weak enforcement | Mandatory automation requirements | Updated requirements to ensure adoption of automation that helps ensure commitments are met |
| Audit System | Annual snapshots, limited scope | Continuous monitoring, real-time validation | Integration with operational systems |
Root programs that tolerate retroactive CPS fixes inadvertently encourage corner-cutting on prevention systems. Given resource constraints, automated evaluation tools and clear standards become essential for consistent enforcement.
The Stakes Demand Action
Eight billion people depend on this system. We cannot allow fewer than 60 CA-owning organizations to keep treating public commitments as optional paperwork instead of operational specifications.
When certificate failures occur, people lose life savings, have private communications exposed, lose jobs when business systems fail, or face physical danger when critical infrastructure is compromised. DigiNotar’s 2011 collapse showed how single CA failures can compromise national digital infrastructure. CAs make decisions that enable these risks; relying parties bear the consequences.
The choice is stark:
Continue excuse-making and accountability avoidance while billions absorb security consequences
Or demand that CAs and root programs invest in systems making trust verifiable
The WebPKI’s moral hazard problem won’t solve itself. Those with power to fix it have too little incentive to act; those who suffer consequences have too little voice to demand change.
The WebPKI stands at a turning point. Root programs, the guardians of web privacy, are under strain from the EU’s eIDAS 2.0 pushing questionable CAs, tech layoffs thinning their teams, and the U.S. DOJ’s plan to break up Chrome, a cornerstone of web security. With eight billion people depending on this system, weak CAs could fuel phishing scams, data breaches, or outages that upend lives, as DigiNotar’s 2011 downfall showed. That failure taught us trust must be earned through action. Automation, agility, and transparency can deliver a WebPKI where accountability is built-in. Let’s urge CAs, root programs, and the security community to adopt machine-readable CPSs by 2026, ensuring trust is ironclad. The time to act is now; together, we can secure the web for our children and our grandchildren.
I’ve been in the PKI space for a long time, and I’ll be honest, digging through Certificate Policies (CPs) and Certification Practice Statements (CPSs) is far from my favorite task. But as tedious as they can be, these documents serve real, high-value purposes. When you approach them thoughtfully, the time you invest is anything but wasted.
What a CPS Is For
Beyond satisfying checkbox compliance, a solid CPS should:
Build trust by showing relying parties how the CA actually operates.
Guide subscribers by spelling out exactly what is required to obtain a certificate.
Clarify formats by describing certificate profiles, CRLs, and OCSP responses so relying parties know what to expect.
Enable oversight by giving auditors, root store programs, and researchers a baseline to compare against real-world issuance.
If a CPS fails at any of these, it fails in its primary mission.
Know Your Audience
A CPS is not just for auditors. It must serve subscribers who need to understand their obligations, relying parties weighing whether to trust a certificate, and developers, security researchers, and root store operators evaluating compliance and interoperability.
The best documents speak to all of these readers in clear, plain language without burying key points under mountains of boilerplate.
A useful parallel is privacy policies or terms of service documents. Some are written like dense legal contracts, full of cross-references and jargon. Others aim for informed consent and use plain language to help readers understand what they are agreeing to. CPs and CPSs should follow that second model.
Good Examples Do Exist
If you’re looking for CPS documents that get the basics right, Google Trust Services and Fastly are two strong models.
There are many ways to evaluate a CPS, but given the goals of these documents, fundamental tests of “good” would certainly include:
Scope clarity: Is it obvious which root certificates the CPS covers?
Profile fidelity: Could a reader recreate reference certificates that match what the CA actually issues?
Most CPSs fail even these basic checks. Google and Fastly pass, and their structure makes independent validation relatively straightforward. Their documentation is not just accurate, it is structured to support validation, monitoring, and trust.
Where Reality Falls Short
Unfortunately, most CPSs today don’t meet even baseline expectations. Many lack clear scope. Many don’t describe what the issued certificates will look like in a way that can be independently verified. Some fail to align with basics like RFC 3647, the framework they are supposed to follow.
Worse still, many CPS documents fail to discuss how, or even whether, they meet the requirements they claim compliance with. That includes not just root program expectations, but also standards like:
Server Certificate Baseline Requirements
S/MIME Baseline Requirements
Network and Certificate System Security Requirements
These documents may not need to replicate every technical detail, but they should objectively demonstrate awareness of and alignment with these core expectations. Without that, it’s difficult to expect trust from relying parties, browsers, or anyone else depending on the CA’s integrity.
Even more concerning, many CPS documents don’t fully reflect the requirements of the root programs that grant them inclusion.
These failures are not theoretical. They have led to real-world consequences.
Take Bug 1962829, for example, a recent incident involving Microsoft PKI Services. “A typo” introduced during a CPS revision misstated the presence of the keyEncipherment bit in some certificates. The error made it through publication and multiple reviews, even as millions of certificates were issued under a document that contradicted actual practice.
The result? Distrust risks, revocation discussions, and a prolonged, public investigation.
The Microsoft incident reveals a deeper problem: CAs that lack proper automation between their documented policies and actual certificate issuance. This wasn’t just a documentation error; it exposed the absence of systems that would automatically catch such discrepancies before millions of certificates were issued under incorrect policies.
This isn’t an isolated case. CP and CPS “drift” from actual practices has played a role in many other compliance failures and trust decisions; a CP or CPS that fails to match observable reality is a common factor in CA distrust and misissuance findings.
Accuracy Is Non-Negotiable
Some voices in the ecosystem now suggest that when a CPS is discovered to be wrong, the answer is simply to patch the document retroactively and move on. This confirms what I have said for ages: too many CAs want the easy way out, patching documents after problems surface rather than investing in the automation and processes needed to prevent mismatches in the first place.
That approach guts the very purpose of a CPS. Making it easier for CAs to violate their commitments creates perverse incentives to avoid investing in proper compliance infrastructure.
Accountability disappears if a CA can quietly “fix” its promises after issuance. Audits lose meaning because the baseline keeps shifting. Relying-party trust erodes the moment documentation no longer reflects observable reality.
A CPS must be written by people who understand the CA’s actual issuance flow. It must be updated in lock-step with code and operational changes. And it must be amended before new types of certificates are issued. Anything less turns it into useless marketing fluff.
Make the Document Earn Its Keep
Treat the CPS as a living contract:
Write it in plain language that every audience can parse.
Tie it directly to automated linting so profile deviations are caught before issuance. Good automation makes policy violations nearly impossible; without it, even simple typos can lead to massive compliance failures.
Publish all historical versions so the version details in the document are obvious and auditable. Better yet, maintain CPS documents in a public git repository with markdown versions that make change history transparent and machine-readable.
Run every operational change through a policy-impact checklist before it reaches production.
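To make the linting point concrete, here is a minimal, hypothetical sketch of pre-issuance profile linting. Plain dicts stand in for real certificate structures; a production CA would lint actual tbsCertificate bytes with tools like zlint or pkilint. All names here are illustrative.

```python
# The profile as documented in the CPS (hypothetical representation).
DOCUMENTED_PROFILE = {
    "key_usage": {"digitalSignature"},        # what the CPS says is issued
    "extended_key_usage": {"serverAuth"},
    "max_validity_days": 398,
}

def lint_before_issuance(tbs_cert):
    """Return a list of deviations from the documented profile.

    An empty list means the certificate matches the CPS; anything else
    should block issuance until the code or the document is corrected."""
    problems = []
    extra_ku = tbs_cert["key_usage"] - DOCUMENTED_PROFILE["key_usage"]
    if extra_ku:
        problems.append("undocumented key usage bits: " + ", ".join(sorted(extra_ku)))
    if tbs_cert["extended_key_usage"] != DOCUMENTED_PROFILE["extended_key_usage"]:
        problems.append("EKU does not match documented profile")
    if tbs_cert["validity_days"] > DOCUMENTED_PROFILE["max_validity_days"]:
        problems.append("validity exceeds documented maximum")
    return problems

# A keyEncipherment bit the CPS never mentions -- the Bug 1962829 scenario
# in miniature -- is caught before a single certificate is signed:
candidate = {"key_usage": {"digitalSignature", "keyEncipherment"},
             "extended_key_usage": {"serverAuth"},
             "validity_days": 398}
violations = lint_before_issuance(candidate)
```

The design point is that the check runs against the document’s machine-readable profile, so a CPS revision and an issuance-pipeline change cannot silently diverge.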
If you expect others to trust your certificates, your public documentation must prove you deserve that trust. Done right, a CPS is one of the strongest signals of a CA’s competence and professionalism. Done wrong, or patched after the fact, it is worse than useless.
Root programs need to spend time documenting the minimum criteria that these documents must meet. Clear, measurable standards would give CAs concrete targets and make enforcement consistent across the ecosystem. Root programs that tolerate retroactive fixes inadvertently encourage CAs to cut corners on the systems and processes that would prevent these problems entirely.
CAs, meanwhile, need to ask themselves hard questions: Can someone unfamiliar with internal operations use your CPS to accomplish the goals outlined in this post? Can they understand your certificate profiles, validation procedures, and operational commitments without insider knowledge?
More importantly, CAs must design their processes around ensuring these documents are always accurate and up to date. This means implementing testing to verify that documentation actually matches reality, not just hoping it does.
The Bottom Line
CPS documents matter far more than most people think. They are not busywork. They are the public guarantee that a CA knows what it is doing and is willing to stand behind it, in advance, in writing, and in full view of the ecosystem.
This morning, the Chrome Root Program dropped another announcement about Certificate Authority (CA) performance. Starting with Chrome 139, new TLS server certificates from specific Chunghwa Telecom [TAIWAN] and NetLock Kft. [HUNGARY] roots issued after July 31, 2025 will face default distrust. Why? “Patterns of concerning behavior observed over the past year” that have “diminished” Chrome’s confidence, signaling a “loss of integrity.”
For those of us in the WebPKI ecosystem, this news feels less like a shock and more like a weary nod of recognition. It’s another chapter in the ongoing saga of trust, accountability, and the recurring failure of some CAs to internalize a fundamental principle: “If you’re doing it right, you make the web safer and provide more value than the risk you represent.” Chrome clearly believes these CAs are falling short on that value proposition.
Browsers don’t take these actions lightly; their role as guardians of user trust necessitates them. They delegate significant trust to CAs, and when that trust gets undermined, the browser’s own credibility suffers. As Chrome’s policy states, and today’s announcement reinforces, CAs must “provide value to Chrome end users that exceeds the risk of their continued inclusion.” This isn’t just boilerplate; it’s the yardstick.
Incident reports and ongoing monitoring provide what little visibility exists into the operational realities of the numerous CAs our ecosystem relies upon. When that visibility reveals “patterns of concerning behavior,” the calculus of trust shifts. Root program managers scrutinize incident reports to assess CAs’ compliance, security practices, and, crucially, their commitment to actual improvement.
“Patterns of Concerning Behavior” Means Systemic Failure
The phrase “patterns of concerning behavior” is diplomatic speak. What it actually means is a CA’s repeated demonstration of inability, or unwillingness, to adhere to established, non-negotiable operational and security standards. It’s rarely a single isolated incident that triggers such action. More often, it’s the drip-drip-drip of failures, suggesting deeper systemic issues.
These patterns typically emerge from three critical failures:
Failing to identify true root causes. Many CAs identify superficial causes like “we missed this in our review,” “compliance failed to detect,” or “we had a bug” without rigorously asking why these occurred and what foundational changes are necessary. This inevitably leads to repeat offenses.
Failure to learn from past incidents. The WebPKI has a long memory, and public incident reports are meant to be learning opportunities for the entire ecosystem. When a CA repeats its own mistakes, or those of others, it signals a fundamental breakdown in their improvement processes.
Failure to deliver on commitments. Perhaps the most egregious signal is when a CA makes commitments to address issues (engineering changes, operational improvements) and then simply fails to deliver. This reflects disrespect for root programs and the trust placed in CAs, while signaling weak compliance and engineering practices.
Chrome’s expectation for “meaningful and demonstrable change resulting in evidenced continuous improvement” wasn’t met. This isn’t about perfection; it’s about demonstrable commitment to improvement and proving it works. A “loss of integrity,” as Chrome puts it, is what happens when that commitment is found wanting.
The Problem with “Good Enough” Incident Response
Effective incident reporting should be boring, routine, and a clear demonstration that continued trust is justified. But for CAs exhibiting these negative patterns, their incident responses are anything but. They become exercises in damage control, often revealing unpreparedness, insufficient communication, or reluctance to fully acknowledge the scope and true cause of their failings.
The dangerous misconception that incident reporting is merely a “compliance function” undermines the entire process. Effective incident response requires concerted effort from compliance, engineering, operations, product teams, and leadership. When this holistic approach is missing, problematic “patterns” are inevitable.
Root programs consistently see through common deflections and mistakes that CAs make when under scrutiny:
Arguing that rules should change during an incident, even though CAs agreed to the requirements when they joined the ecosystem
Claiming an issue is “non-security relevant” as an excuse, even though requirements are requirements. There’s no “unless it isn’t a security issue” exception
Asking root programs for permission to fail despite the fact that lowering standards for one CA jeopardizes the entire WebPKI
Not following standard reporting templates signals that you don’t know the requirements and externalizes the cost onto others by making analysis unnecessarily difficult
Accountability Isn’t Optional
Chrome’s recent actions represent accountability in practice. While some might view this as punitive, it’s a necessary mechanism to protect WebPKI integrity. For the CAs in question, and all others, the message is clear:
Rely on tools and data, not just people. Use automated systems and data-driven strategies to ensure standardized, reliable incident responses.
Preparation isn’t optional. Predefined response strategies, validated through tabletop exercises, are crucial infrastructure.
Transparency isn’t a buzzword. It’s a foundational requirement for building and maintaining trust, especially when things go wrong.
This isn’t about achieving impossible perfection. It’s about establishing and maintaining robust, auditable, and consistently improving systems and processes. It’s about fostering organizational culture where “the greatest enemy of knowledge is not ignorance, but the illusion of knowledge,” and where commitment to “sweat in practice to bleed less in battle” shows up in every action.
Trust Is Earned, Not Given
The WebPKI is built on a chain of trust. When links in that chain demonstrate repeated weakness and failure to strengthen themselves despite guidance and opportunity, the only responsible action is to isolate that risk.
Today’s announcement is simply that principle in action, a reminder that in the WebPKI, trust is earned through consistent excellence and lost through patterns of failure. The choice, as always, remains with each CA: demonstrate the value that exceeds your risk, or face the consequences of falling short.
Imagine yourself as a pilot at 30,000 feet when an engine begins to sputter. You don’t panic—your training activates, you follow your checklist, and take control. For Certificate Authorities (CAs), incidents like misissued certificates or security breaches create similar high-stakes scenarios. They’re unexpected, critical, and unforgiving. Preparation isn’t just advisable—it’s essential. In the Web PKI world, where trust is paramount, improvisation isn’t an option.
These high-stakes scenarios aren’t rare exceptions—browser distrust events occur approximately every 1.23 years. Since 2011, over a dozen CAs have been distrusted, with poor incident response handling featuring prominently among the causes. These aren’t just statistics; they represent existential threats to CAs and the trust system underpinning secure internet communication.
Mozilla’s new CA Incident Response Requirements policy addresses a history of delayed responses, insufficient analyses, and unclear communication that has plagued the ecosystem. By incorporating Site Reliability Engineering (SRE) concepts, CAs can transform incidents into opportunities to strengthen resilience. Let’s examine the new policy, take a quick look at SRE concepts and how they enhance it, and analyze real-world examples from Let’s Encrypt and DigiCert to illustrate best practices—and pitfalls to avoid.
Why the Mozilla Policy Matters: Trust at Stake
Incidents are inevitable. Whether a certificate misissuance, system failure, or security exploit, these events represent critical moments for CAs. Losing browser trust, as DigiNotar did in 2011 or Symantec by 2017, is catastrophic. One moment, you’re essential to Web PKI; the next, you’re a cautionary tale.
The evidence is clear: since 2011, CAs have experienced over 10 major incidents—averaging one every 14 months. More than half—over 57%—of these distrusts stem at least in part from delayed or mishandled responses, not just the incidents themselves. Each costs trust, revenue, or both (as DigiNotar’s bankruptcy demonstrated). The pattern reveals that your response defines you more than the incident itself. A prepared CA can recover and even strengthen its reputation. An unprepared one faces severe consequences.
Mozilla’s policy addresses the cycle of late notifications and superficial fixes that have damaged CAs previously. Structured timelines ensure transparency and accountability—essential elements for maintaining trust.
2025 Policy: Your Incident Response Framework
The new Common Incident Reporting Guidelines (effective March 2025) establish the following framework for incident handling:
72-Hour Initial Disclosure: Three days to publicly acknowledge the issue, outline initial actions, and assess scope of impact.
14-Day Full Report: Two weeks to deliver a standardized, comprehensive Root Cause Analysis (RCA), detailed timeline, and prevention plan.
These aren’t just arbitrary deadlines—they’re designed to break the pattern of delays and ambiguity that has undermined trust in the WebPKI ecosystem. The policy establishes specific templates, report formats, and update requirements that formalize the approaches already taken by the most resilient CAs.
The requirements emphasize “candid, timely, and transparent” reporting—values that separate successful incident responses from catastrophic ones. What’s more, reports must demonstrate “a detailed understanding of root causes” and “clear, measurable explanations” of remediation actions.
The incident lifecycle follows a consistent structure: detect and triage, disclose publicly within 72 hours, deliver the full report within 14 days, and post updates until remediation is verified.
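The reporting deadlines themselves are simple arithmetic, which is exactly why they belong in code rather than on a calendar reminder. A minimal sketch, assuming an aware UTC detection timestamp:

```python
from datetime import datetime, timedelta, timezone

# The two CCADB deadlines described above.
INITIAL_DISCLOSURE = timedelta(hours=72)
FULL_REPORT = timedelta(days=14)

def incident_deadlines(detected_at):
    """Compute the public-reporting deadlines for an incident
    detected at `detected_at` (an aware UTC datetime)."""
    return {
        "initial_disclosure_due": detected_at + INITIAL_DISCLOSURE,
        "full_report_due": detected_at + FULL_REPORT,
    }

detected = datetime(2025, 3, 1, 9, 0, tzinfo=timezone.utc)
deadlines = incident_deadlines(detected)
```

Wiring this into ticketing and paging systems means the 72-hour clock starts the moment detection fires, not the moment someone remembers the policy.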
SRE: The Enhancement for Resilience
Mozilla provides structure, but Site Reliability Engineering (SRE)—pioneered by Google—offers tools that elevate your response. Two SRE concepts align perfectly with Mozilla’s requirements:
Automation: SRE emphasizes automating repetitive tasks. For the 72-hour disclosure, automated monitoring can identify issues immediately, while scripts—such as certificate revocation tools—activate without delay. Speed becomes your advantage.
Blameless Postmortems: The 14-day RCA isn’t about assigning blame—it’s about learning. SRE’s blameless approach investigates what failed and how to improve, converting every incident into a growth opportunity.
Automation in this case spans the whole pipeline, from detecting an issue to executing remediation.
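As a concrete illustration, here is a minimal, stdlib-only sketch (all names hypothetical) of one monitoring cycle that detects non-compliant certificates, opens an incident record to start the disclosure clock, and queues revocations without waiting on a human:

```python
from datetime import datetime, timezone

def run_monitoring_cycle(issued_certs, is_compliant, revocation_queue, incidents):
    """One pass of continuous monitoring: find certificates that violate
    policy, open an incident record (starting the disclosure clock), and
    queue every affected serial for automated revocation."""
    violations = [c for c in issued_certs if not is_compliant(c)]
    if violations:
        incidents.append({
            "detected_at": datetime.now(timezone.utc),
            "serials": [c["serial"] for c in violations],
            "status": "open",
        })
        revocation_queue.extend(c["serial"] for c in violations)
    return violations

certs = [{"serial": "01", "validity_days": 398},
         {"serial": "02", "validity_days": 825}]   # violates a 398-day limit
queue, incidents = [], []
run_monitoring_cycle(certs, lambda c: c["validity_days"] <= 398, queue, incidents)
```

The compliance predicate is pluggable on purpose: the same cycle can check validity periods, key usage, or any other requirement the CA has committed to.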
Together, Mozilla’s timelines and SRE’s methodologies establish a framework that’s proactive rather than reactive.
Case Studies: Preparation Demonstrated
Let’s Encrypt: Prepared When It Mattered
In 2020, Let’s Encrypt encountered a bug in their domain validation logic. Their response exemplified best practices:
Early Detection: Proactive monitoring and periodic reviews identified the issue quickly, before external parties did.
Automation in Action: They revoked 1.7 million certificates within hours due to their readiness.
Data-Driven Decisions: They were able to immediately identify which certificates had been replaced versus which were still in active use.
Transparent Communication: Regular updates and a thorough postmortem kept stakeholders informed.
Strategic Delayed Revocation: For certificates that couldn’t be immediately revoked without major disruption, they filed a separate delayed revocation incident with clear timelines.
They met CCADB’s deadlines with margin to spare and emerged stronger. Their preparation proved invaluable—and it was no coincidence. Their systems were designed from day one to handle such scenarios, with automation built into their core operations.
DigiCert: Caught Unprepared
DigiCert’s misissuance incident tells a contrasting story. An external party discovered the problem, and their response faltered:
Manual Processes: Without automation, revocations progressed slowly and required customer intervention.
Insufficient Planning: They struggled, facing subscriber resistance and legal complications, including a Temporary Restraining Order (TRO) from a customer.
Reactive Decision-Making: They initially announced a 24-hour revocation window, then extended it to 5 days as complications mounted.
Customer Impact: They did not know how many of their customers were ready to rotate certificates, so they had to treat everyone the same, amplifying disruption.
Design Issues: The initial fix appeared to be applied at the user interface level rather than addressing the core validation system—suggesting insufficient engineering practices.
Commercial CAs might argue their enterprise model makes automation harder than Let’s Encrypt’s, but complex customer relationships actually make preparation more critical, not less. The TRO demonstrates how business constraints amplify—rather than excuse—the need for rigorous incident readiness.
The contrast is instructive. Let’s Encrypt’s readiness maintained stability; DigiCert’s lack of preparation created vulnerability and legal complications that set a concerning precedent for the industry.
Implementing the New CCADB Requirements
To meet the new CCADB incident reporting requirements effectively, CAs should implement these eight critical capabilities:
Create Templated Response Plans: Develop standardized report templates aligned with CCADB’s new formats, with designated owners for each section.
Establish Monitoring Triggers: Implement automated monitoring that can identify potential incidents early and trigger response workflows.
Build Certificate Inventory Systems: Maintain comprehensive real-time data about certificate status, usage, and replacement to enable rapid impact assessment.
Create Tiered Revocation Capabilities: Implement automation for certificates under lifecycle management while maintaining defined processes for customers that still require manual handling.
Prepare Customers and Technology: Implement back-end changes and work with customers to adopt systems designed to meet these requirements.
Develop Blameless Postmortem Processes: Create structured processes for conducting Root Cause Analysis using methodologies like “5 Whys” and Fishbone Diagrams.
Create Revocation Automation: Implement systems to quickly revoke certificates in bulk with minimal manual intervention.
Align Legal Agreements: Ensure contracts include provisions for certificate revocations and incident response cooperation.
Test Incident Response Regularly: Conduct simulations of different incident types to ensure teams can meet the required reporting deadlines.
These systems shouldn’t be created during a crisis—they must be built, tested, and refined during normal operations to ensure they function when needed most.
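Several of these capabilities converge in one operation: impact assessment. Here is a hypothetical, stdlib-only sketch of how an inventory query might split affected certificates into cohorts, the distinction that let Let’s Encrypt avoid treating every subscriber the same:

```python
def assess_impact(inventory, affected_serials):
    """Split affected certificates into cohorts by replacement readiness,
    so automated subscribers are revoked immediately and only the manual
    cohort needs outreach -- instead of blanket treatment for everyone."""
    cohorts = {"already_replaced": [], "automated": [], "manual": []}
    for serial in affected_serials:
        record = inventory[serial]
        if record["replaced"]:
            cohorts["already_replaced"].append(serial)
        elif record["lifecycle_automated"]:
            cohorts["automated"].append(serial)
        else:
            cohorts["manual"].append(serial)
    return cohorts

inventory = {
    "a1": {"replaced": True,  "lifecycle_automated": True},
    "b2": {"replaced": False, "lifecycle_automated": True},
    "c3": {"replaced": False, "lifecycle_automated": False},
}
cohorts = assess_impact(inventory, ["a1", "b2", "c3"])
```

A CA that can produce these cohorts within hours of detection can commit to credible revocation timelines; one that cannot is forced into the reactive extensions that got DigiCert into trouble.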
Your Next Steps: Building Resilience
Ready to implement these principles? Follow this straightforward approach:
Create a Response Plan: Define roles, responsibilities, and timelines—your organization’s crisis protocol.
Automate Critical Functions: Implement detection and remediation tools—prioritize speed.
Develop Learning Processes: Conduct blameless postmortems to strengthen systems.
Prioritize Clear Communication: Share updates with stakeholders consistently, even during difficulties.
This isn’t complex—it’s disciplined. And for CAs, discipline is fundamental.
Preparation Is Essential
Incidents occur without warning. With a robust plan, automation, a learning orientation, and transparent communication, you can manage them effectively. Mozilla’s policy and Google’s SRE practices provide both structure and methodology to succeed. Let’s Encrypt demonstrated its effectiveness; DigiCert illustrated its necessity.
Don’t wait for an incident to expose weaknesses in your process. Preparation isn’t optional—it’s how you transform a crisis into an opportunity to demonstrate excellence. Plan systematically, automate intelligently, learn continuously, and you’ll build a CA that doesn’t merely survive but thrives.
Want to learn more? The Mastering Incident Reporting in the WebPKI class covers these mistakes and misconceptions in more depth; the slides are publicly available.
Account recovery is where authentication systems go to die. We build sophisticated authentication using FIDO2, WebAuthn, and passkeys, then use “click this email link to reset” when something goes wrong. Or if we are an enterprise, we spend millions staffing help desks to verify identity through caller ID and security questions that barely worked in 2005.
This contradiction runs deep in digital identity. Organizations that require hardware tokens and biometrics for login will happily reset accounts based on a hope and a prayer. The same companies that spend fortunes on authentication rely on “mother’s maiden name” or a texted “magic number” for recovery. Increasingly, we’ve got bank-vault front doors with screen-door back entrances.
The Government Solution
But there’s an interesting solution emerging from an unexpected place: government identity standards. Not because governments are suddenly great at technology, but because they’ve been quietly solving something harder than technology: how to agree on verifying identity across borders and jurisdictions.
The European Union is pushing ahead with cross-border digital identity wallets based on their own standards. At the same time, a growing number of U.S. states—early adopters like California, Arizona, Colorado, and Utah—are piloting and implementing mobile driver’s licenses (mDLs). These mDLs aren’t just apps showing a photo ID; they’re essentially virtual smart cards, containing a “certificate” of sorts that is used to attest to certain information about you, similar to what happens with electronic reading of passports and federal CAC cards. Each of these mDL “certificates” is cryptographically traceable back to the issuing authority’s root of trust, creating verifiable chains of who is attesting to these attributes.
One of the companies helping make this happen is SpruceID, a company I advise. They have been doing the heavy lifting to enable governments and commercial agencies to accomplish these scenarios, paving the way for a more robust and secure digital identity ecosystem.
Modern Threats and Solutions
What makes this particularly relevant in 2024 is how it addresses emerging threats. Traditional remote identity verification relies heavily on liveness detection: systems watch blink patterns and reflections, or ask users to turn their heads or perform some other directed motion. But with generative AI advancing rapidly, these methods are becoming increasingly unreliable. Bad actors can now use AI to generate convincing video responses that fool traditional liveness checks. We’re seeing sophisticated attacks that can mimic the patterns existing systems look for, even the nuanced, subtle facial expressions that once served as reliable markers of human presence.
mDL verification takes a fundamentally different approach. Instead of just checking if a face moves correctly, it verifies cryptographic proofs that link back to government identity infrastructure. Even if an attacker can generate a perfect deepfake video, they can’t forge the cryptographic attestations that come with a legitimate mDL. It’s the difference between checking if someone looks real and verifying they possess cryptographic proof of their identity.
Applications and Implementation
This matters for authentication because it gives us something we’ve never had: a way to reliably verify legal identity during account authentication or recovery that’s backed by the same processes used for official documents. This means that in the future when someone needs to recover account access, they can prove their identity using government-issued credentials that can be cryptographically verified, even in a world where deepfakes are becoming indistinguishable from reality.
The financial sector is already moving on this. Banks are starting to look at how they can integrate mDL verification into their KYC and AML compliance processes. Instead of manual document checks or easily-spoofed video verification, they will be able to use these to verify customer identity against government infrastructure. The same approaches that let customs agents verify passports electronically will now also be used to enable banks to verify customers.
For high-value transactions, this creates new possibilities. When someone signs a major contract, their mDL can be used to create a derived credential based on the mDL’s attestations about their name, age, and other attributes. This derived credential could be an X.509 certificate binding their legal identity to the signature. This creates a provable link between the signer’s government-verified identity and the document, something that’s been remarkably hard to achieve digitally.
Technical Framework
The exciting thing isn’t the digital ID – they have been around a while – it’s the support for an online presentment protocol. ISO/IEC TS 18013-7 doesn’t just specify how to make digital IDs; it defines how these credentials can be reliably presented and verified online. This is crucial because remote verification has always been the Achilles’ heel of identity systems. How do you know someone isn’t just showing you a video or a photo of a fake ID? The standard addresses these challenges through a combination of cryptographic proofs and real-time challenge-response protocols that are resistant to replay attacks and deep fakes.
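To show why a fresh challenge defeats replayed video or captured presentations, here is a deliberately simplified sketch. A shared HMAC key stands in for the mDL’s device-bound signing key (the real ISO/IEC TS 18013-7 flow uses asymmetric device keys and a richer session transcript); all names are illustrative.

```python
import hashlib
import hmac
import secrets

# Illustration only: a shared HMAC key stands in for the mDL's
# device-bound signing key; the real protocol uses asymmetric keys.
DEVICE_KEY = secrets.token_bytes(32)

def _bind(nonce, attributes):
    # Bind the disclosed attributes to this session's nonce.
    return nonce + repr(sorted(attributes.items())).encode()

def wallet_respond(nonce, attributes):
    """Wallet signs the nonce-bound attribute disclosure."""
    return hmac.new(DEVICE_KEY, _bind(nonce, attributes), hashlib.sha256).digest()

def verifier_check(nonce, attributes, tag, seen_nonces):
    """Reject replays (a reused nonce) and forgeries (a bad tag)."""
    if nonce in seen_nonces:
        return False
    seen_nonces.add(nonce)
    expected = hmac.new(DEVICE_KEY, _bind(nonce, attributes), hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)

seen = set()
nonce = secrets.token_bytes(16)          # fresh challenge from the verifier
tag = wallet_respond(nonce, {"age_over_21": True})
first = verifier_check(nonce, {"age_over_21": True}, tag, seen)    # accepted
replay = verifier_check(nonce, {"age_over_21": True}, tag, seen)   # rejected
```

Because the response is bound to a single-use nonce, a recorded presentation (or a deepfake re-enactment of one) is useless against any future challenge.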
Government benefits show another critical use case. Benefits systems face a dual challenge: preventing fraud while ensuring legitimate access. mDL verification lets agencies validate both identity and residency through cryptographically signed government credentials. The same approach that proves your identity for a passport electronically at the TSA can prove your eligibility for benefits online. But unlike physical ID checks or basic document uploads, these verifications are resistant to the kind of sophisticated fraud we’re seeing with AI-generated documents and deepfake videos.
What’s more, major browsers are beginning to implement these standards as first-class citizens. This means that verification of these digital equivalents of our physical identities will be natively supported by the web, ensuring that online interactions, from logging in to account recovery, are easier and more secure than ever before.
Privacy and Future Applications
These mDLs have interesting privacy properties too. The standards support selective disclosure – proving you’re over 21 without showing your birth date, or verifying residency without exposing your address. You can’t do that with a physical ID card. More importantly, these privacy features work remotely – you can prove specific attributes about yourself online without exposing unnecessary personal information or risking your entire identity being captured and replayed by attackers.
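Selective disclosure is commonly built from salted digests: the issuer signs digests of salted attributes, and the holder reveals only the salts and values they choose. A minimal sketch under that assumption (in a real mdoc the issuer signs the digest list; here the digest dict itself stands in for that signed structure):

```python
import hashlib
import secrets

def issue_credential(attributes):
    """Issuer: salt each attribute and compute its digest. The digest
    dict stands in for the issuer-signed structure of a real mdoc."""
    salted = {k: (secrets.token_bytes(16), v) for k, v in attributes.items()}
    signed_digests = {k: hashlib.sha256(salt + repr(v).encode()).hexdigest()
                      for k, (salt, v) in salted.items()}
    return salted, signed_digests    # salted values go to the holder

def disclose(salted, keys):
    """Holder reveals only the chosen attributes, with their salts."""
    return {k: salted[k] for k in keys}

def verify_disclosure(disclosed, signed_digests):
    """Verifier recomputes digests for the disclosed attributes only;
    undisclosed attributes stay hidden behind their salted digests."""
    return all(hashlib.sha256(salt + repr(v).encode()).hexdigest() == signed_digests[k]
               for k, (salt, v) in disclosed.items())

salted, digests = issue_credential(
    {"name": "Alex", "birth_date": "1990-01-01", "age_over_21": True})
shown = disclose(salted, ["age_over_21"])     # birth date never leaves the wallet
ok = verify_disclosure(shown, digests)
tampered = verify_disclosure({"age_over_21": (shown["age_over_21"][0], False)}, digests)
```

The salts matter: without them, a verifier could brute-force low-entropy attributes like birth dates from the digests alone.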
We’re going to see this play out in sensitive scenarios like estate access. When someone needs to access a deceased partner’s accounts, they will be able to prove their identity and, combined with documents like marriage and death certificates, prove their entitlement to that bank account without the overhead and complexity required today. Someday those supporting documents may live in these wallets too, making the process even easier.
The Path Forward
While the path from here to there is long, with plenty of hurdles to clear, we are clearly headed toward a world where this happens. We will have standardized, government-backed identity verification that works across borders and jurisdictions. Not by replacing existing authentication systems, but by providing them with a stronger foundation for identity verification, recovery, and remote identity verification, one that works even as AI makes traditional verification methods increasingly unreliable.
We’re moving from a world of islands of identity systems to one with standardized, federated identity infrastructure, built on the same trust frameworks that back our most important physical credentials. And ironically, at least in the US it started with making driver’s licenses digital.
This weekend, I came across a LinkedIn article by Priscilla Russo about OpenAI agents and digital wallets that touched on something I’ve been thinking about – liability and AI agents and how they change system designs. As autonomous AI systems become more prevalent, we face a critical challenge: how do we secure systems that actively optimize for success in ways that can break traditional security models? The article’s discussion of Knight Capital’s $440M trading glitch perfectly illustrates what’s at stake. When automated systems make catastrophic decisions, there’s no undo button – and with AI agents, the potential for unintended consequences scales dramatically with their capability to find novel paths to their objectives.
What we’re seeing isn’t just new—it’s a fundamental shift in how organizations approach security. Traditional software might accidentally misuse resources or escalate privileges, but AI agents actively seek out new ways to achieve their goals, often in ways developers never anticipated. This isn’t just about preventing external attacks; it’s about containing AI itself—ensuring it can’t accumulate unintended capabilities, bypass safeguards, or operate beyond its intended scope. Without containment, AI-driven optimization doesn’t just break security models—it reshapes them in ways that make traditional defenses obsolete.
“First, in 2024, O1 broke out of its container by exploiting a vuln. Then, in 2025, it hacked a chess game to win. Relying on AI alignment for security is like abstinence-only sex ed—you think it’s working, right up until it isn’t,” said the former 19-year-old father.
The Accountability Gap
Most security discussions around AI focus on protecting models from adversarial attacks or preventing prompt injection. These are important challenges, but they don’t get to the core problem of accountability. As Russo suggests, AI developers are inevitably going to be held responsible for the actions of their agents, just as financial firms, car manufacturers, and payment processors have been held accountable for unintended consequences in their respective industries.
The parallel to Knight Capital is particularly telling. When their software malfunction led to catastrophic trades, there was no ambiguity about liability. That same principle will apply to AI-driven decision-making – whether in finance, healthcare, or legal automation. If an AI agent executes an action, who bears responsibility? The user? The AI developer? The organization that allowed the AI to interact with its systems? These aren’t hypothetical questions anymore – regulators, courts, and companies need clear answers sooner rather than later.
Building Secure AI Architecture
Fail to plan, and you plan to fail. When legal liability is assigned, the difference between a company that anticipated risks, built mitigations, implemented controls, and ensured auditability and one that did not will likely be significant. Organizations that ignore these challenges will find themselves scrambling after a crisis, while those that proactively integrate identity controls, permissioning models, and AI-specific security frameworks will be in a far better position to defend their decisions.
While security vulnerabilities are a major concern, they are just one part of a broader set of AI risks. AI systems can introduce alignment challenges, emergent behaviors, and deployment risks that reshape system design. But at the core of these challenges is the need for robust identity models, dynamic security controls, and real-time monitoring to prevent AI from optimizing in ways that bypass traditional safeguards.
Containment and isolation are just as critical as resilience. It’s one thing to make an AI model more robust – it’s another to ensure that if it misbehaves, it doesn’t take down everything around it. A properly designed system should ensure that an AI agent can’t escalate its access, operate outside of predefined scopes, or create secondary effects that developers never intended. AI isn’t just another software component – it’s an active participant in decision-making processes, and that means limiting what it can influence, what it can modify, and how far its reach extends.
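A deny-by-default scope check is the simplest form of this containment. The sketch below (Python, with hypothetical names like `AgentScope` and `authorize`) shows the shape of the idea: an agent can act only on what was explicitly granted, so an escalation attempt simply fails authorization rather than relying on the agent to behave.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentScope:
    """Explicit allowlist of what one agent may touch (illustrative model)."""
    agent_id: str
    allowed_actions: frozenset
    allowed_resources: frozenset

def authorize(scope: AgentScope, action: str, resource: str) -> bool:
    # Deny by default: anything not explicitly granted is out of scope.
    return action in scope.allowed_actions and resource in scope.allowed_resources

scope = AgentScope("agent-7", frozenset({"read"}), frozenset({"orders-db"}))
assert authorize(scope, "read", "orders-db")
assert not authorize(scope, "write", "orders-db")  # escalation attempt blocked
assert not authorize(scope, "read", "hr-db")       # lateral movement blocked
```

The point of the frozen dataclass is that the scope itself is immutable once issued; widening an agent's reach requires going back to the issuer, not mutating state the agent can touch.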
I’m seeing organizations take radically different approaches to this challenge. As Russo points out in her analysis, some organizations like Uber and Instacart are partnering directly with AI providers, integrating AI-driven interactions into their platforms. Others are taking a defensive stance, implementing stricter authentication and liveness tests to block AI agents outright. The most forward-thinking organizations are charting a middle path: treating AI agents as distinct entities with their own credentials and explicitly managed access. They recognize that pretending AI agents don’t exist or trying to force them into traditional security models is a recipe for disaster.
Identity and Authentication for AI Agents
One of the most immediate problems I’m grappling with is how AI agents authenticate and operate in online environments. Most AI agents today rely on borrowed user credentials, screen scraping, and brittle authentication models that were never meant to support autonomous systems. Worse, when organizations try to solve this through traditional secret sharing or credential delegation, they end up spraying secrets across their infrastructure – creating exactly the kind of standing permissions and expanded attack surface we need to avoid. This might work in the short term, but it’s completely unsustainable.
The future needs to look more like SPIFFE for AI agents – where each agent has its own verifiable identity, scoped permissions, and limited access that can be revoked or monitored. But identity alone isn’t enough. Having spent years building secure systems, I’ve learned that identity must be coupled with attenuated permissions, just-in-time authorization, and zero-standing privileges. The challenge is enabling delegation without compromising containment – we need AI agents to be able to delegate specific, limited capabilities to other agents without sharing their full credentials or creating long-lived access tokens that could be compromised.
Systems like Biscuits and Macaroons show us how this could work: they allow for fine-grained scoping and automatic expiration of permissions in a way that aligns perfectly with how AI agents operate. Instead of sharing secrets, agents can create capability tokens that are cryptographically bound to specific actions, contexts, and time windows. This would mean an agent can delegate exactly what’s needed for a specific task without expanding the blast radius if something goes wrong.
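To make the Macaroons idea concrete, here is a minimal stdlib-only Python sketch of caveat chaining. It follows the HMAC-chaining construction from the Macaroons design: each caveat is folded in by re-keying the MAC with the previous signature, so a holder can always narrow a token but can never remove a caveat or recover a broader one. All names are illustrative, and a real deployment would use an established library such as Biscuit rather than hand-rolled tokens.

```python
import hmac
import hashlib

def mint(root_key: bytes, token_id: bytes) -> bytes:
    """Mint a token: the signature is HMAC(root_key, token_id)."""
    return hmac.new(root_key, token_id, hashlib.sha256).digest()

def attenuate(sig: bytes, caveat: bytes) -> bytes:
    """Add a caveat by chaining: new_sig = HMAC(old_sig, caveat).
    A holder can only narrow the token, never widen it."""
    return hmac.new(sig, caveat, hashlib.sha256).digest()

def verify(root_key: bytes, token_id: bytes, caveats: list, sig: bytes) -> bool:
    """Recompute the chain from the root key and compare in constant time."""
    expected = mint(root_key, token_id)
    for c in caveats:
        expected = attenuate(expected, c)
    return hmac.compare_digest(expected, sig)

root = b"service-root-key"  # held only by the issuing service
t = mint(root, b"agent-42")
t = attenuate(t, b"action = read")
t = attenuate(t, b"expires < 2025-06-01T00:00Z")

caveats = [b"action = read", b"expires < 2025-06-01T00:00Z"]
assert verify(root, b"agent-42", caveats, t)
assert not verify(root, b"agent-42", [b"action = read"], t)  # can't drop a caveat
```

Delegation falls out naturally: an agent hands a downstream agent an already-attenuated token, and the downstream agent can attenuate it further but never broaden it.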
Agent Interactions and Chain of Responsibility
What keeps me up at night isn’t just individual AI agents – it’s the interaction between them. When a single AI agent calls another to complete a task, and that agent calls yet another, you end up with a chain of decision-making where no one knows who (or what) actually made the call. Without full pipeline auditing and attenuated permissions, this becomes a black-box decision-making system with no clear accountability or verifiability. That’s a major liability problem – one that organizations will have to solve before AI-driven processes become deeply embedded in financial services, healthcare, and other regulated industries.
This is particularly critical as AI systems begin to interact with each other autonomously. Each step in an AI agent’s decision-making chain must be traced and logged, with clear accountability at each transition point. We’re not just building technical systems—we’re building forensic evidence chains that will need to stand up in court.
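One building block for such an evidence chain is a hash-chained audit log, where each entry commits to the hash of the one before it, so any after-the-fact edit breaks the chain. A minimal stdlib-only sketch, with all names hypothetical:

```python
import hashlib
import json

def append_entry(log: list, agent_id: str, action: str, detail: str) -> None:
    """Append a tamper-evident entry; each entry commits to the previous hash."""
    prev = log[-1]["hash"] if log else "0" * 64
    body = {"agent": agent_id, "action": action, "detail": detail, "prev": prev}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append({**body, "hash": digest})

def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or reordered entry fails verification."""
    prev = "0" * 64
    for e in log:
        body = {k: e[k] for k in ("agent", "action", "detail", "prev")}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if e["prev"] != prev or e["hash"] != digest:
            return False
        prev = e["hash"]
    return True

log = []
append_entry(log, "agent-a", "delegate", "granted read on orders-db to agent-b")
append_entry(log, "agent-b", "read", "orders-db")
assert verify_chain(log)
log[0]["detail"] = "granted admin to agent-b"  # after-the-fact edit
assert not verify_chain(log)
```

A real forensic chain would also sign entries and anchor them externally, but even this shape makes "who delegated what to whom" independently checkable rather than a matter of trusting the agents' own telemetry.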
Runtime Security and Adaptive Controls
Traditional role-based access control models fundamentally break down with AI systems because they assume permissions can be neatly assigned based on predefined roles. But AI doesn’t work that way. Through reinforcement learning, AI agents optimize for success rather than security, finding novel ways to achieve their goals – sometimes exploiting system flaws in ways developers never anticipated. We have already seen cases where AI models learned to game reward systems in completely unexpected ways.
This requires a fundamental shift in our security architecture. We need adaptive access controls that respond to behavior patterns, runtime security monitoring for unexpected decisions, and real-time intervention capabilities. Most importantly, we need continuous behavioral analysis and anomaly detection that can identify when an AI system is making decisions that fall outside its intended patterns. The monitoring systems themselves must evolve as AI agents find new ways to achieve their objectives.
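As a toy illustration of behavioral baselining, the sketch below flags any action that falls outside an agent's recent history. A production system would use much richer features (arguments, targets, timing, sequence models); the names, window size, and threshold here are all illustrative, but the flag-on-novelty shape is the same.

```python
from collections import Counter, deque

class BehaviorMonitor:
    """Flag actions that fall outside an agent's observed behavior (toy baseline)."""

    def __init__(self, window: int = 100, min_freq: float = 0.01, warmup: int = 20):
        self.history = deque(maxlen=window)  # sliding window of recent actions
        self.min_freq = min_freq             # below this frequency, flag as anomalous
        self.warmup = warmup                 # don't flag until we have a baseline

    def observe(self, action: str) -> bool:
        """Record an action; return True if it looks anomalous given recent history."""
        counts = Counter(self.history)
        total = len(self.history)
        anomalous = total >= self.warmup and counts[action] / total < self.min_freq
        self.history.append(action)
        return anomalous

mon = BehaviorMonitor()
for _ in range(50):
    mon.observe("read")                      # establish a normal pattern
assert mon.observe("read") is False          # consistent with baseline
assert mon.observe("delete_all") is True     # never seen before: flag it
```

The flag would feed the real-time intervention layer described above: pause the agent, require step-up authorization, or route the decision to a human.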
Compliance by Design
Drawing from my years building CAs, I’ve learned that continual compliance can’t just be a procedural afterthought – it has to be designed into the system itself. The most effective compliance models don’t just meet regulatory requirements at deployment; they generate the artifacts needed to prove compliance as natural byproducts of how they function.
The ephemeral nature of AI agents actually presents an opportunity here. Their transient access patterns align well with modern encryption strategies: access should be temporary, data should be encrypted at rest and in motion, and only the agent authorized for a specific action should be able to decrypt the information that action requires.
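The "access should be temporary" idea can be sketched as a short-lived, MAC-protected grant that expires on its own, so no revocation step is needed because nothing is standing. Everything here (the key, `grant`, `check`) is hypothetical, and a real system would use an established token format and a managed key rather than this hand-rolled one.

```python
import hmac
import hashlib
import time

SECRET = b"authz-service-signing-key"  # hypothetical: held only by the authorization service

def grant(agent_id: str, resource: str, ttl_s: int, now: float = None) -> dict:
    """Issue a short-lived access grant bound to one agent and one resource."""
    now = time.time() if now is None else now
    exp = int(now + ttl_s)
    msg = f"{agent_id}|{resource}|{exp}".encode()
    return {"agent": agent_id, "resource": resource, "exp": exp,
            "mac": hmac.new(SECRET, msg, hashlib.sha256).hexdigest()}

def check(g: dict, now: float = None) -> bool:
    """Valid only if untampered and still inside its time window."""
    now = time.time() if now is None else now
    msg = f"{g['agent']}|{g['resource']}|{g['exp']}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(g["mac"], expected) and now < g["exp"]

g = grant("agent-9", "patient-records", ttl_s=60, now=1_000_000)
assert check(g, now=1_000_030)      # within the window
assert not check(g, now=1_000_120)  # expired: access lapses on its own
```

Pairing grants like this with per-task encryption keys gives the property described above: once the grant lapses, the agent holds nothing it can still use.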
The Path Forward
If we don’t rethink these systems now, we’ll end up in a situation where AI-driven decision-making operates in a gray area where no one is quite sure who’s responsible for what. And if history tells us anything, regulators, courts, and companies will eventually demand a clear chain of responsibility – likely after a catastrophic incident forces the issue.
The solution isn’t just about securing AI – it’s about building an ecosystem where AI roles are well-defined and constrained, where actions are traceable and attributable, and where liability is clear and manageable. Security controls must be adaptive and dynamic, while compliance remains continuous and verifiable.
Organizations that ignore these challenges will find themselves scrambling after a crisis. Those that proactively integrate identity controls, permissioning models, and AI-specific security frameworks will be far better positioned to defend their decisions and maintain control over their AI systems. The future of AI security lies not in building impenetrable walls, but in creating transparent, accountable systems that can adapt to the unique challenges posed by autonomous agents.
This post lays out the challenges, but securing AI systems requires a structured, scalable approach. In “Containing the Optimizer: A Practical Framework for Securing AI Agent Systems,” I outline a five-pillar framework that integrates containment, identity, adaptive monitoring, and real-time compliance to mitigate these risks.