From Years to Seconds: Rethinking Public Key Infrastructure

Public Key Infrastructure was designed for a world where identities persisted for years—employees joining a company, servers running in data centers, devices connecting to networks. In this world, the deliberate pace of certificate issuance and revocation aligned perfectly with the natural lifecycle of these long-lived identities. But today’s cloud-native workloads—containers, serverless functions, and microservices—live and die in seconds, challenging these fundamental assumptions.

Though these ephemeral workloads still rely on public key cryptography for authentication, their deployment and management patterns break the traditional model. A container that exists for mere seconds to process a single request can’t wait minutes for certificate issuance. A serverless function that scales from zero to thousands of instances in moments can’t depend on manual certificate management. The fundamental mismatch isn’t about the cryptography—it’s about the infrastructure and processes built around it.

This isn’t a problem of public key infrastructure being inadequate but rather of applying it in a way that doesn’t align with modern workload realities. These new patterns challenge us to rethink how authentication and identity management systems should work—not just to ensure security, but to support the flexibility, performance, and speed that cloud-native infrastructure demands.

Why Workloads Are Different

Unlike human or machine identities, workloads are ephemeral by design. While a human identity might persist for years with occasional role changes, and a machine identity might remain tied to a server or device for months, workloads are created and destroyed on-demand. In many cases, they live just long enough to process a task before disappearing.

Unlike human and machine identities where identifiers are pre-assigned, workload identifiers must be dynamically assigned at runtime based on what is running and where. This transient nature makes revocation—a cornerstone of traditional PKI—irrelevant. There’s no need to revoke a workload’s credentials because they’ve already expired. In fact, much like Kerberos tickets, workload credentials are short-lived by design, issued for just long enough to meet deployment SLAs.

The Identity Lifecycle Dynamics graphic below illustrates these differences clearly:

  • Human identities are persistent, often spanning years, with sequential changes governed by compliance and auditing processes.
  • Machine identities are semi-persistent, lasting weeks or months, with planned updates and automated renewals sometimes tied to devices or hardware lifetimes.
  • Workload identities, by contrast, are ephemeral. They join and leave almost instantly, with lifespans measured in minutes and operations occurring at massive scale.

Compounding this difference is the scale and speed at which workloads operate. It’s not unusual for thousands of workloads to be created simultaneously, each requiring immediate authentication. Traditional PKI processes, designed for slower-moving environments, simply can’t keep up. And workloads don’t just operate in isolation—they’re often distributed across multiple regions to minimize latency and avoid unnecessary points of failure. This means the supporting credentialing infrastructure must also be distributed, capable of issuing and verifying credentials locally without introducing bottlenecks or dependency risks.

Governance adds another layer of complexity. While human and machine identities are often subject to compliance-driven processes focused on auditability and security, workloads are governed by operational priorities:

  • Zero downtime: Workloads must scale rapidly and without disruption.
  • Regional performance: Authentication systems must match the workloads’ regional deployments to avoid latency.
  • Developer flexibility: Identity systems must integrate with whatever technology stacks developers are already using.

The lifecycle of a workload identity reflects the immediacy of software deployment cycles, rather than the structured schedules of hardware or personnel management.

Rethinking Identity Infrastructure for Workloads

The traditional PKI model isn’t going away—it remains essential for the stable, predictable environments it was designed to support. But workloads require a shift in approach. They demand systems capable of:

  • Dynamic credential issuance: Credentials must be created on-demand to support rapid scaling, with automated identifier assignment based on runtime context and workload characteristics (a minimal sketch of this pattern follows this list).
  • Ephemeral lifecycles: Workload credentials should expire automatically, eliminating the need for revocation, with lifetimes matched to actual workload runtime requirements.
  • Multi-factor workload authentication: Something the workload has (hardware roots of trust, cryptographic keys), something the workload knows (runtime configuration), and something the workload is (attestation data, container hashes, process metadata).
  • Distributed infrastructure: Regional authentication systems ensure low latency and high availability, with local credential issuance capabilities.
  • Massive scalability: Systems must support thousands of identity events per minute, operating across clouds or hybrid environments, with automated identifier management at scale.
  • Runtime identifier assignment based on: what is running (container hashes, process information), where it’s running (environment context, runtime attestation), and how it’s running (execution environment verification).
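To make these requirements concrete, here is a minimal sketch, in Python, of what dynamic issuance with runtime-derived identifiers and automatic expiry might look like. The identifier scheme, field names, and five-minute TTL are illustrative assumptions, not a real issuer.

```python
# Illustrative sketch only: a toy issuer that derives a workload identifier
# from runtime context and attaches a short expiry, so credentials age out
# on their own instead of needing revocation.
import datetime
import hashlib

def derive_identifier(trust_domain: str, image_digest: str, region: str) -> str:
    # The identifier is computed from what is running and where it runs,
    # not pre-assigned the way a human or machine identifier would be.
    return f"spiffe://{trust_domain}/region/{region}/image/{image_digest[:12]}"

def issue_credential(identifier: str, ttl_seconds: int = 300) -> dict:
    now = datetime.datetime.now(datetime.timezone.utc)
    return {
        "sub": identifier,
        "iat": now.isoformat(),
        "exp": (now + datetime.timedelta(seconds=ttl_seconds)).isoformat(),
    }

image_digest = hashlib.sha256(b"container-image-bytes").hexdigest()
cred = issue_credential(derive_identifier("example.org", image_digest, "us-west-2"))
print(cred)  # expires on its own minutes later; there is nothing to revoke
```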

As highlighted in the lifecycle comparison, workload identities aren’t simply a smaller, faster version of machine identities. Their governance models reflect their role in delivering business-critical objectives like seamless scaling and developer empowerment.

Perhaps the most significant difference is the role of developers. Workload identity systems can’t impose rigid, one-size-fits-all requirements. Developers need the freedom to:

  • Work with existing technology stacks.
  • Seamlessly integrate identity management into their workflows.
  • Build and deploy at the speed demanded by modern infrastructure.

In this way, workload identity management becomes not just a security task but a foundational enabler of innovation and efficiency.

Taking the First Step with SPIFFE

SPIFFE (Secure Production Identity Framework For Everyone) is an open standard designed to enable workloads to automatically acquire identities, certificates, and OIDC tokens for secure zero-trust communication between services. Rather than retrofitting existing systems, look for upcoming greenfield deployments where you can engage early in the design phase. This allows you to build SPIFFE’s workload identity patterns in from the start—solving different problems than traditional PKI, not competing with it. Use that greenfield project to demonstrate how PKI, delivered through SPIFFE, can solve additional problems in production environments.
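At the heart of SPIFFE is a simple, uniform naming scheme: every workload gets an identifier of the form spiffe://&lt;trust-domain&gt;/&lt;path&gt;. The sketch below, using only Python’s standard library, shows roughly how such an identifier breaks down; it simplifies the specification’s full validation rules.

```python
# A minimal parser for the SPIFFE ID format (spiffe://<trust-domain>/<path>).
# This is a simplified illustration of the spec, not a conformant validator.
from urllib.parse import urlparse

def parse_spiffe_id(uri: str) -> tuple[str, str]:
    parsed = urlparse(uri)
    if parsed.scheme != "spiffe":
        raise ValueError("SPIFFE IDs must use the spiffe:// scheme")
    if not parsed.netloc:
        raise ValueError("SPIFFE IDs must include a trust domain")
    if parsed.query or parsed.fragment or "@" in parsed.netloc or ":" in parsed.netloc:
        raise ValueError("no userinfo, ports, queries, or fragments allowed")
    return parsed.netloc, parsed.path  # (trust domain, workload path)

print(parse_spiffe_id("spiffe://prod.example.com/payments/api"))
# -> ('prod.example.com', '/payments/api')
```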

Final Thoughts

Workloads have redefined how we think about identity. They operate at a speed and scale that traditional PKI never anticipated, governed by different priorities and lifecycles that reflect the realities of modern software. While PKI will continue to serve as a critical component of authentication, it must evolve to meet the unique demands of ephemeral, distributed workloads.

This isn’t about abandoning the tools we rely on but about adapting them for a world where zero downtime, developer flexibility, and seamless scalability are non-negotiable. The future of identity isn’t static or centralized—it’s dynamic, distributed, and built to support workloads that define modern infrastructure.

For organizations looking to implement these patterns quickly and efficiently, SPIRL (a company I advise) provides tools to make workload identity management straightforward and accessible.


Bundling and Unbundling in the NHI Market: Opportunities in Identity, Governance, and Cryptography

Jim Barksdale famously said “All money is made through bundling and unbundling,” and this dynamic is evident in the Non-Human Identity (NHI) market. Cryptography management, privileged access management, and certificate lifecycle solutions are being redefined under a higher-level taxonomy. These functions, once viewed as isolated, are increasingly integrated into broader frameworks addressing identity, governance, and security holistically, reflecting the market’s shift toward unified and specialized solutions.

Cloud providers dominate in offering integrated solutions across categories, but these are often limited and focus on cost-recovery pricing to encourage adoption of their real money-makers like compute, storage, network, databases, and these days AI. They frequently provide just enough to facilitate a single project’s adoption, leaving opportunities for other vendors. For instance, Microsoft’s push to migrate enterprises from on-premises Active Directory to its cloud offering presents an opportunity to unbundle within the NHI IAM space. By focusing narrowly on migrating existing infrastructures rather than reimagining solutions from first principles to meet modern usage patterns, Microsoft has created gaps that smaller, more agile providers can exploit. Similarly, regulatory pressures and the rise of AI-driven, agentic workloads are driving demand for advanced workload authentication, creating further opportunities for specialized providers to deliver tailored solutions. Meanwhile, established players like CyberArk and Keyfactor have pursued acquisitions, such as Keyfactor’s merger with PrimeKey, to bundle new capabilities and remain competitive. However, the integration complexity of these acquisitions often leaves room for focused providers to address modern, cloud-native demands more effectively.

At the same time, traditional cryptography management companies have been so focused on their existing Key Management System (KMS) and Hardware Security Module (HSM) offerings that they have largely ignored broader unmet needs in the market, prioritizing feature expansion and acquisitions aimed at chasing smaller competitors. This narrow focus has left significant gaps in visibility, particularly around cryptographic assets and risks, creating fertile ground for new solutions focused on cryptography discovery, automated inventory management, and preparation for post-quantum cryptography.

Capital allocation, on the other hand, highlights category focus and growth potential. Seed and Series A investments underscore the dynamic opportunities created by unbundling, as well as the constraints faced by larger vendors burdened with legacy products that make it harder to truly innovate due to existing commercial obligations in the same space. In contrast, private equity activity targets larger bundling opportunities, enabling less agile and more mature market leaders to remain relevant by scaling established solutions or consolidating fragmented players. These stages illustrate the market’s balance between early-stage innovation and late-stage consolidation, driven by the growing demand for unified, cloud-native identity and governance solutions.

These patterns of bundling and unbundling are organic and continual, offering just one lens on the evolving dynamics of this market. While the NHI market appears new, it is, in fact, a natural evolution of existing identity governance patterns. As new entrants and established players alike navigate the opportunities created by shifting market dynamics, the same balance between early-stage innovation and late-stage consolidation plays out again.

Rethinking Authentication: “Something You Have,” “Something You Know,” and “Something You Are” for Workloads and Machines

Passwords have existed for millennia, and their weaknesses have persisted just as long. Their simplicity led to widespread adoption, but as their use expanded, so did the frequency of their abuse. To address this, we implemented stricter password policies—longer lengths, special characters, regular changes—much like hiding vulnerable software behind firewalls. When these efforts fell short, we evolved to multi-factor authentication (MFA), introducing the principles of “Something You Have,” “Something You Know,” and “Something You Are.”

MFA brought its own challenges—deployment complexity and user friction. Password managers helped bridge some gaps by generating and storing random passwords for us. While each of these steps enhanced security, none addressed the core problem: passwords and shared secrets are fundamentally flawed authenticators. This realization spurred efforts like WebAuthn, FIDO, and Passkeys, which replaced passwords with cryptographic keys and secure protocols, eliminating shared secrets entirely.

However, while user authentication evolved, workload and machine authentication lagged behind. Instead of passwords, workloads relied on API keys—essentially shared passwords—managed through “password managers” rebranded as secret vaults. These shared secrets are just as fragile and inadequate for today’s complex, scaled environments as passwords were for users.

The path forward is clear: workloads and machines need their own authentication revolution. We must replace shared secrets with cryptographic keys and implement MFA for workloads. But what does machine-focused MFA look like? Let’s explore how the three fundamental authentication factors apply to workloads and machines.


Applying Authentication Factors to Workloads and Machines

1. Something the Workload Has

This encompasses physical or cryptographic elements unique to the workload:

  • Hardware Roots of Trust: Security processors like TPM, Microsoft Pluton, Google Titan, and Apple’s Secure Enclave provide tamper-resistant foundations for device identity and posture assessment.
  • Cryptographic Keys: Private keys secured within hardware security processors serve as a robust “something you have,” enabling strong authentication.
  • Credentials: OIDC tokens and X.509 certificates uniquely identify machines and workloads within trusted environments.

These mechanisms form the backbone of secure workload authentication, much like physical security tokens do for human users.
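Here is a hedged sketch of how “something the workload has” is typically proven: the workload signs a verifier-supplied challenge with a private key, ideally one that never leaves a hardware security processor. In this illustration the key is software-generated purely for simplicity.

```python
# A sketch of proof of possession: only the holder of the private key can
# produce a valid signature over the verifier's fresh challenge.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

private_key = Ed25519PrivateKey.generate()   # ideally generated inside hardware
public_key = private_key.public_key()        # registered with the verifier

challenge = b"nonce-from-verifier"           # fresh per authentication attempt
signature = private_key.sign(challenge)      # only the key holder can do this

try:
    public_key.verify(signature, challenge)
    print("workload proved possession of the key")
except InvalidSignature:
    print("authentication failed")
```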


2. Something the Workload Knows

This parallels knowledge-based authentication but focuses on workload-specific secrets:

  • Shared Secrets: API keys, symmetric keys, and asymmetric credentials used for authentication.
  • Configuration Data: Runtime-specific information like environment configuration.

Although often necessary for a service’s functionality, these weak attributes are highly susceptible to exposure, reuse, and theft. Implementing credentialing systems like SPIFFE can significantly mitigate these risks by replacing shared secrets with cryptographically secure, short-lived credentials uniquely tailored to each workload.


3. Something the Workload Is

This represents inherent characteristics of the workload, similar to human biometrics:

  • Trusted Execution Environments (TEEs): Secure enclaves like Intel SGX or AWS Nitro verify and attest to the integrity of the workload’s execution environment.
  • Immutable Code or Container Hashes: Binary or container image hashes verify workload integrity.
  • Runtime Attestation: Environmental and configuration validation ensures compliance with security policy.
  • POSIX Process Names and Metadata: Process information and runtime metadata provide operational context.

By combining these attributes, workloads can demonstrate their role and environment to enable more contextual identification and authorization.
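As a rough illustration of the “something the workload is” factor, the sketch below hashes an artifact and compares it to an expected digest. The allowlist, names, and digest are hypothetical, and real systems rely on TPM quotes or TEE attestation rather than a simple file hash.

```python
# Illustrative only: verifying workload integrity by hashing the running
# binary or image and comparing it against an expected digest.
import hashlib

EXPECTED_DIGESTS = {  # hypothetical allowlist of known-good artifacts
    "payments-api": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

def digest_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def attest(workload_name: str, artifact_path: str) -> bool:
    return EXPECTED_DIGESTS.get(workload_name) == digest_of(artifact_path)
```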


The Future of Workload Identity

Authentication factors vary in strength—cryptographic keys and runtime attestations can provide strong confidence in what you are talking to, while process names and secrets offer weaker assurance. Combining these elements creates a more comprehensive picture of workload authenticity. Standards like SPIFFE leverage this combination, creating strong workload identities by incorporating hardware roots of trust, runtime attestations, and other security properties. Over time, these attestations can be enriched with supply chain provenance, vulnerability assessments, and other compliance data.
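To show what combining factors of differing strength might mean in practice, here is a toy policy that requires at least one strong factor plus corroborating context before trusting a workload. The factor names, weights, and threshold are invented for illustration.

```python
# A toy policy: require one strong factor plus enough total evidence.
FACTOR_STRENGTH = {
    "hardware_key": 3,         # something it has, hardware-backed
    "runtime_attestation": 3,  # something it is
    "container_hash": 2,
    "config_secret": 1,        # something it knows; weakest
    "process_metadata": 1,
}

def authentic_enough(presented: set[str], threshold: int = 4) -> bool:
    strong = any(FACTOR_STRENGTH.get(f, 0) >= 3 for f in presented)
    score = sum(FACTOR_STRENGTH.get(f, 0) for f in presented)
    return strong and score >= threshold

print(authentic_enough({"hardware_key", "container_hash"}))    # True
print(authentic_enough({"config_secret", "process_metadata"})) # False
```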

As we look further ahead to agentic workloads and AI systems, we will need to develop protocols and policies that enable us to consider both workload identity and the entities they represent. For example, an AI model handling financial transactions needs both verified workload identity and specific policies for each user it serves so that the agent does not become a way to escalate privileges and access data that would otherwise have been unreachable.

This evolving, layered approach ensures that workload authentication scales with increasingly complex machine ecosystems while maintaining security and accountability. By unifying identity, actions, and policies, we prepare for a future of autonomous yet accountable workloads.

The Myth of Non-Technical Product Management

A common theme in conversations about product managers is the notion that they don’t need to be technical; they just need to bridge the gap between technical and non-technical teams. In my experience, particularly with enterprise and security products, this is a complete fallacy. Part of why this argument persists is the misconception that all product management is the same.

If you’re working on a 10-year-old product based on 20-year-old deployment patterns—and this might be hard to hear—chances are you’re not innovating. Instead, you’re managing customer requests and operating within the constraints of the bureaucracy you’re part of. Your roadmap likely consists of a mix of customer demands and features cloned from smaller competitors.

Another reason this perspective persists is that many organizations divide product managers into two categories: inbound and outbound. Outbound product managers are this decade’s version of product MBAs. They often have a limited understanding of their customers and their needs, instead focusing on systematizing a go-to-market strategy based on abstractions.

In the problem domain of enterprise and security—especially in small to medium-sized companies, where innovation tends to happen—there is no substitute for being an expert in what you’re building and selling. One of the most important things to understand is your customer: their pains, their constraints, and the schedules they operate within. The thing is, your customer isn’t just one person in an enterprise sale. As I’ve written before, at a minimum, you’re dealing with an economic buyer and a champion in any sale. If you’re lucky, you have many champions. And if you think strategically, you can even identify your champions’ champions within the sale.

This requires you to understand everyone’s job and perspective. If you don’t understand the technology or problem domain natively, you will always struggle—and likely fail—especially in smaller, early-stage companies.

Don’t get me wrong: once a company finds product-market fit and has a reproducible recipe for selling into organizations—or as the market evolves and expectations for a product in a given segment become standardized—it becomes less necessary. But even then, bringing that expertise to the table remains a powerful force multiplier that enables organizations lucky enough to have these resources to vastly outperform much larger and better-funded competitors.

Since I spend most of my time these days with smaller companies or very large companies looking to become more competitive again, all I can say is this: without the right product leaders, the best you can hope for is growing at the pace of your overall market and maintaining the status quo.

Winning Over the Organization: The Subtle Art of Getting Your Product Deployed

As someone who has spent over 30 years building security and infrastructure products both in large companies and small, I’ve seen one pattern repeat itself far too often: a great product gets sold to an enterprise, only to end up sitting on a shelf, untouched and unloved. For every company that successfully deploys their product and becomes a cornerstone of their customer’s operations, there are countless others that fall victim to this fate.

The difference? It’s rarely about the technology itself. Instead, it’s about understanding how to navigate the human dynamics of enterprise sales and deployment—helping your champions and economic buyers not just buy your product, but deploy it, show value, and win over the organization. The startups that succeed here often share a surprising trait: they get people promoted.

Here’s how I think about this challenge and the advice I give to the companies I advise.

Know Your Allies: The Champion and the Economic Buyer

In any enterprise sale, there are two critical players: the champion and the economic buyer. Your champion is the person who feels the pain your product solves most acutely. They’re your advocate inside the organization, the one who wants you to succeed because it solves their problem.

The economic buyer, on the other hand, is the one with the budget and the organizational perspective. They’re not as close to the day-to-day pain, but they’re thinking about alignment with company priorities, ROI, and risk.

If you want your product to avoid becoming shelfware, you need to understand what it takes for both of these people to succeed—not just in deploying your product, but in navigating the bureaucracy of their organization.

Empowering Your Champion: The Keys to Advocacy

Your champion is on your side, but they’re likely not equipped to sell your product internally without help. They need:

  • Clear, tangible wins: How can they deploy your product incrementally, showing immediate value without asking the organization to take on too much risk or disruption upfront?
  • Compelling talking points: You know your product better than anyone. Equip your champion with narratives that resonate with their stakeholders. For example:
    • “This solution aligns with our zero-trust initiative and reduces risks highlighted in the Verizon DBIR report.”
  • Materials for buy-in: Provide them with decks, case studies, and ROI calculators tailored to their audience—whether it’s IT, security, or the C-suite.

Startups that succeed make it easy for champions to tell a compelling story, removing the burden of figuring it out themselves.

Winning Over the Economic Buyer: Speak Their Language

The economic buyer is focused on the bigger picture: strategic alignment, ROI, and risk management. They’ll ask questions like:

  • How does this product support our organizational goals?
  • What’s the ROI? How does this reduce costs or avoid risks?
  • Will this disrupt our existing systems or processes?

To win them over:

  • Frame the product as part of their strategy: Don’t sell a feature—sell a solution to their larger problems.
  • Provide financial justification: Show them how your product saves money, reduces risks, or increases efficiency.
  • Mitigate risk: Give them confidence that deploying your product won’t be a gamble.

This isn’t just about convincing them to buy—it’s about giving them the confidence to champion your product as it rolls out.

Navigating Bureaucracy: Guiding the Path Forward

Here’s the uncomfortable truth: in large organizations, success often depends more on navigating bureaucracy than on the quality of the technology. Startups that win deployment understand this and partner with their buyers to:

  • Break down deployments into milestones: Start small, deliver quick wins, and build momentum over time.
  • Anticipate bottlenecks: Security reviews, procurement delays, and committee approvals are inevitable. Help your buyer prepare for and address these hurdles.
  • Guide advocacy efforts: Provide step-by-step playbooks to help champions and economic buyers build internal support and overcome resistance.

Think of it as being not just a vendor, but a partner in internal change management.

Selling More Than Software: The Roadmap as a Vision

One of the most overlooked strategies in enterprise sales is this: sell your roadmap.

A roadmap isn’t just a future wish list; it’s a way to help champions and buyers plot their own narratives for how your product will grow with their organization. By aligning your roadmap with their goals, you’re not just selling what your product does today—you’re selling the promise of what it can enable tomorrow.

Successful startups make buyers feel like they’re investing in something bigger than a single tool. They’re investing in a vision.

Helping Customers Win—and Get Promoted

Here’s the heart of it: successful startups help their customers succeed personally.

  • For champions, this might mean solving a thorny problem and becoming the hero of their team.
  • For economic buyers, it might mean delivering measurable results that align with company priorities, demonstrating strategic leadership.

Startups that win understand that their product is a means to an end. The real goal is to make the people buying and deploying your product look good—and in some cases, get promoted. This is a mindset shift, but it’s critical. If your customers succeed, your product succeeds.

Building Partnerships, Not Just Products

The startups I see succeed don’t try to bulldoze their way into organizations. They’re humble, practical, and focused on helping their customers navigate the messy, human reality of enterprise deployment.

They make it easy for champions to win arguments. They help economic buyers frame deployments as strategic wins. And they sell not just their product, but a roadmap that makes their customers look like visionaries.

In the end, that’s the secret: make your customer’s success the core of your strategy. If you do that, you’re not just selling a product—you’re building a partnership that drives real results. And that’s how you avoid the shelfware graveyard.

From Fairways to the Cloud: Estimating Golf Balls in Flight and Tackling Cloud Workload Scale

Early in my career, I worked in quality assurance at Microsoft, where analytical skills were a core trait we tried to hire for. At the time, “brain teasers” were often used in interviews to assess these skills. One memorable interview question went like this:

“How would you figure out how many golf balls are in flight at any given moment?”

This question wasn’t about pinpointing the exact number; it was a window into the candidate’s analytical thinking, problem-solving approach, and ability to break down complex problems into manageable parts. It was always interesting to see how different minds approached this seemingly simple yet deceptively complex problem. If a candidate wasn’t sure how to begin, we would encourage them to ask questions or to simply document their assumptions, stressing that it was the deconstruction of the problem—not the answer—that we were looking for.

In engineering, we often need to take big, abstract problems and break them down. For those who aren’t engineers, this golf ball question makes that process a bit more approachable. Let me walk you through how we might tackle the golf ball question:

  1. Number of Golf Courses Worldwide
    • There are approximately 38,000 golf courses globally.
  2. Players and Tee Times
    • On average, each course hosts about 50 groups per day.
    • With an average group size of 4 players, that’s 200 players per course daily.
  3. Shots Per Player
    • An average golfer takes around 90 shots in a full round.
  4. Total Golf Balls in Play
    • 200 players × 90 shots = 18,000 shots per course per day.
  5. Time a Golf Ball Is in the Air
    • Let’s estimate each shot keeps the ball airborne for about 5 seconds.
  6. Calculating Balls in Flight
    • Over 12 hours of playtime, there are 43,200 seconds in a golfing day.
    • Total airborne time per course: 18,000 shots × 5 seconds = 90,000 seconds.
    • Average balls in flight per course: 90,000 seconds ÷ 43,200 seconds ≈ 2 golf balls.
  7. Global Estimate
    • 2 balls per course × 38,000 courses = 76,000 golf balls in flight at any given moment worldwide (the short script after this list reproduces the arithmetic).
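For those who prefer code to napkins, a few lines of Python reproduce the estimate; every input is an assumption, which is rather the point of the exercise.

```python
# Back-of-the-envelope estimate of golf balls in flight worldwide.
courses = 38_000
players_per_course = 50 * 4          # 50 groups of 4 players per day
shots_per_player = 90
airborne_seconds_per_shot = 5
golfing_day_seconds = 12 * 3600      # a 12-hour golfing day = 43,200 s

shots_per_course = players_per_course * shots_per_player            # 18,000
airborne_per_course = shots_per_course * airborne_seconds_per_shot  # 90,000 s
per_course = airborne_per_course / golfing_day_seconds              # ≈ 2.08

print(round(per_course) * courses)   # 2 × 38,000 = 76,000 balls in flight
```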

This exercise isn’t about precision; it’s about methodically breaking down a complex question into digestible parts to arrive at a reasonable estimate. As the saying goes, all models are wrong, but some are useful. Our goal here is to find a stable place to stand as we tackle the problem, and this question does a decent job of that; if nothing else, it lets us see how a candidate might approach unfamiliar topics.

Transitioning from the Green to the Cloud

Today, the biggest challenges in cloud workload identity management remind me of these kinds of problems—except far more complex. Unlike in a round of golf, most workloads aren’t individually authenticated today; instead, they rely on shared credentials, essentially passwords, stored and distributed by secret managers, and anything needing access to a resource must have access to that secret. 

But with the push for zero trust, rising cloud adoption, infrastructure as code, and the reality that credential breaches represent one of the largest attack vectors, it’s clear we need a shift. The future should focus on a model where every workload is independently authenticated and authorized.

So, let’s put the “golf balls soaring through the air” approach to work here, using the same framework to break down the cloud workload scale:

  1. Global Cloud Infrastructure
    • Major cloud providers operate data centers with an estimated 10 million servers worldwide.
  2. Workloads Per Server
    • Each server might run an average of 100 workloads (virtual machines or containers).
    • 10 million servers × 100 workloads = 1,000 million (1 billion) workloads running at any given time.
  3. Ephemeral Nature of Workloads
    • Let’s assume 50% of these are ephemeral, spinning up and down as needed.
    • 1 billion workloads × 50% = 500 million ephemeral workloads.
  4. Workload Lifespan and Credential Lifecycle
    • If each ephemeral workload runs for about 1 hour, there are 24 cycles in a 24-hour day.
    • 500 million workloads × 24 cycles = 12 billion ephemeral workloads initiated daily.
  5. Credentials Issued
    • Each new workload requires secure credentials or identities to access resources.
    • This results in 12 billion credentials needing issuance and management every day.
  6. Credentials Issued Per Second
    • With 86,400 seconds in a day:
    • 12 billion credentials ÷ 86,400 seconds ≈ 138,889 credentials issued per second globally (the script after this list reruns the math).
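Running the same style of script over these assumptions makes the scale easy to sanity-check and to re-run with your own numbers:

```python
# Back-of-the-envelope estimate of credential issuance rate for cloud workloads.
servers = 10_000_000
workloads_per_server = 100
ephemeral_share = 0.5
cycles_per_day = 24          # average 1-hour lifespan per ephemeral workload
seconds_per_day = 86_400

workloads = servers * workloads_per_server            # 1 billion running
ephemeral = int(workloads * ephemeral_share)          # 500 million ephemeral
credentials_per_day = ephemeral * cycles_per_day      # 12 billion per day

print(f"{credentials_per_day / seconds_per_day:,.0f} credentials/second")
# -> 138,889 credentials/second
```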

In this updated example, just as with the golf balls in flight question, we deconstruct a complex system to better understand its core challenges:

  • Scale: The number of workloads and credentials needed to achieve this zero-trust ideal is much higher than we would need to simply pass around shared secrets.
  • Dynamics: These credentialing systems must have much higher availability than static systems to support the dynamism involved.
  • Complexity: Managing identities and credentials at this scale is a monumental task, emphasizing the need for scalable and automated solutions.

Note: These calculations are estimates meant to illustrate the concept. In real-world cloud environments, actual numbers can vary widely depending on factors like workload type distribution, number of replicas, ephemerality of workloads, and, of course, individual workload needs.

Conclusion

This exercise demonstrates a fundamental point: analytical thinking and problem-solving are timeless skills, applicable across domains.

You don’t need to be an expert in any given system to get started; you simply need to understand how to break down a problem and apply some basic algebra.

It also serves as a way to understand the scope and scale of authenticating every workload to enable zero trust end-to-end. Simply put, this is a vastly different problem than user and machine authentication, presenting a unique challenge in managing identities at scale.

Beyond the Facade: Evaluating Long-Term Impacts of Enterprise Software Architectures

Many enterprise products these days have a core architecture that consists of placing a proxy in front of an existing service. While the facade architecture makes sense in some cases, it’s usually a temporary measure because it increases the costs of administration, scaling, security, and debugging. It also adds complexity to general management.

The use cases for these offerings often involve one team in an organization providing incremental value to a service operated by another team. This introduces various organizational politics, which make anything more than a proof of concept not really viable, at least on an ongoing basis.

Essentially, anyone who has ever carried a pager or tried to deploy even the most basic system in a large enterprise should avoid this pattern except as a stopgap solution for a critical business system. It is far better, in the medium and long term, to look at replacing the fronted system with something that provides the needed integration or capability natively.

For example, some solutions aim to enable existing systems to use quantum-secure algorithms. In practice, these solutions often look like a single TLS server sitting in front of another TLS server, or a TLS-based VPN where a TLS client (for example, integrating via SOCKS) interfaces with your existing TLS client, which then communicates with that TLS server sitting in front of another TLS server. You can, of course, implement this, and there are places where it makes sense. However, on a long-term basis, you would be better off if there were native support for quantum-secure algorithms or switching out the legacy system altogether.

Similarly, it’s very common now for these enterprise-focused solutions to split the architecture between on-premise/private cloud and a SaaS component. This approach has several benefits: the on-premise part enables integration, core policy enforcement, and transaction handling, and, if done right, ensures availability. The SaaS component serves as the control plane. This combination gives you the best benefits of both on-premise and SaaS offerings and can be implemented while minimizing the security impact of the service provider.

Another pattern that might be confused with this model is one where transactional policy enforcement or transaction handling happens in the SaaS part of the solution, while the integration element remains on-premise. This is probably the easiest to deploy, so it goes smoothly in a proof of concept. However, it makes the SaaS component mission-critical, a performance bottleneck, and a single point of failure, while also pulling it entirely into the threat model of the component. There are cases where this model can work, but for any system that needs to scale and be highly reliable, it’s often not the best choice.

These architectural decisions in third-party solutions have other large impacts that need to be considered, such as data residency and compliance. These are especially important topics outside the US but are also issues within the US if you handle sensitive data and work in regulated markets. Beyond that, architecture and design choices of third-party products can have vendor lock-in consequences. For example, if the solution is not based on open standards, you may find yourself in an inescapable situation down the road without a forklift upgrade, which is often not viable organizationally if the solution fails to deliver.

So why does all of this matter? When we evaluate the purchase of enterprise security software, we need to be sure to look beyond the surface, beyond the ingredient list, and understand how the system is built and how those design decisions will impact our security, availability, performance, maintainability and total cost of ownership both in the near term and long term. Enterprise architects and decision-makers should carefully consider these factors when evaluating different architectural patterns.

When Words Mislead: Cybersecurity’s Terminology Problem

At Black Hat this year, I did my usual walk around the vendor floor. I talked to lots of companies about their products. One thing that stood out to me is that vendors, whether by accident or on purpose, are redefining terms in ways that do harm. One vendor in particular was calling “bearer tokens” “attestations” in both their marketing and product documentation. Let’s use this as an example and break down why this matters.

What’s an attestation?

An attestation is when someone says something is true and puts their reputation behind that statement. It’s like when your friend vouches for you at a new job. A good technology example is a TPM attestation. The TPM in your computer can prove certain things about how your machine started up. When we trust the chip’s design and the company that made it, we can believe what it tells us.

What’s a claim?

A claim is just something someone says. It might be true, but there’s no proof. If I tell you my name is Ryan Hurst, that’s a claim. I haven’t shown you my ID or anything. Claims can also be about other people or things. If I say Cloudflare is safe to use, that’s just my opinion unless I back it up with something.

What’s a bearer token?

A bearer token is like a temporary password. It’s a secret that proves who you are to a service. Anyone who has the token can pretend to be you. We use them because they’re necessary, but we try to limit their use in modern systems.

You usually get a bearer token by trading in something more permanent, like an API key, which is essentially a long-lived password. It’s like swapping a house key for a hotel room key. The hotel key only works for a short time, but anyone who finds it can get into your room.
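A toy sketch makes the weakness concrete: verification checks only the token’s value and expiry, never who presents it. The exchange function, TTL, and in-memory store here are hypothetical stand-ins, not any real service’s API.

```python
# Illustrative only: why bearer tokens are "anyone who holds it wins".
import secrets
import time

_issued: dict[str, float] = {}  # token -> expiry timestamp

def exchange_api_key_for_token(api_key: str, ttl: int = 900) -> str:
    # In a real system the long-lived API key would be checked against a store.
    token = secrets.token_urlsafe(32)
    _issued[token] = time.time() + ttl
    return token

def verify(token: str) -> bool:
    # Note what is missing: nothing ties the token to a particular caller.
    return _issued.get(token, 0) > time.time()

token = exchange_api_key_for_token("long-lived-api-key")
print(verify(token))  # True for the next 15 minutes, for *any* presenter
```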

Why does any of this matter?

When companies use the wrong terms to explain what they do, it can lead people to make bad security choices. For example, if you hear a vendor say their system relies on bearer tokens and then you search the term, you’ll find experts talking about their risks and how to manage them. But if you search for attestations, you’ll find different information about how they help prove things are reliable, trustworthy, or factual.

If a company selling security products tells you it does one thing but it does another, it’s a bad sign. They either have technical debt buried in the design that may have a negative impact, don’t know what they’re talking about, or are trying to confuse you. In any case, you might want to look at other options.

So, when you’re buying security products for your company, pay attention to how vendors use words. If they’re changing the meaning of important terms, be careful. It could mean you’re not getting what you think you are.

The True Essence of Secure by Design

When we discuss “secure by design,” we often focus on capabilities, features, and defaults—such as logging and monitoring, default-deny, regular updates, authentication, and minimizing privileges by default—rather than on the word “design.” Don’t get me wrong; those things are important. But if we have learned anything from the last several decades of evolving security practices, it is that security features and settings alone do not make for secure products.

For instance, the recent CrowdStrike outages and the Microsoft Storm 0558 incident illustrate that design choices have a significant impact on the security and reliability of a system. So what does it mean to have a secure “design” then?

The whole concept of interconnected systems was bolted onto the way computers work. We didn’t even see a TCP/IP stack included by default in an operating system until BSD did this in 1983. Microsoft Windows didn’t get a TCP/IP stack until 1994, around the same time the TIS Firewall Toolkit (FWTK) was released, which was a framework for building a firewall, not a firewall itself. For some, these developments may seem like ancient history; however, the methods we use to build and secure software and systems have evolved from this bolt-on approach to security.

“The beginning is the most important part of the work.” — Plato

In my view, a secure design incorporates an understanding of the entire lifecycle of a system. It recognizes that the question is not if it will be hacked, but when. It then employs questions such as “How do you reduce the chances of that happening?” and “How can you make it easy to recover when it does?” to inform the system’s shape.

The Storm case is a great example of insecure design. Loading extremely valuable keys into memory on a front-end machine, in the user context of code parsing user-supplied data, enabled an unauthenticated user on the internet to trigger a memory dump that exposed the key to theft via another attack vector. At a minimum, this key should have been kept in a separate user process, one that operated at a different permission level.
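To make the mitigation concrete, here is a hedged sketch of process-level key isolation: the signing key lives in a separate process, and the front end only ever sees signatures, so a memory dump of the request-parsing process cannot expose the key. A real deployment would use an HSM or a hardened signing service rather than multiprocessing, and the key would not be freshly generated.

```python
# Sketch: keep key material out of the process that parses untrusted input.
import multiprocessing as mp
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def signer(conn) -> None:
    key = Ed25519PrivateKey.generate()  # key material never leaves this process
    while True:
        data = conn.recv()
        if data is None:
            break
        conn.send(key.sign(data))

if __name__ == "__main__":
    parent, child = mp.Pipe()
    proc = mp.Process(target=signer, args=(child,))
    proc.start()
    parent.send(b"user-supplied data parsed in the front end")
    print(parent.recv().hex())  # front end holds a signature, never the key
    parent.send(None)
    proc.join()
```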

Similarly, the update mechanism for CrowdStrike’s ‘signatures’ functioned like a command and control network for a large array of bots installed on mission-critical systems. The fact that they could, in essence, push code at runtime, at will into the kernel privilege space of their customers means that a malicious actor with access to these systems could push malicious code onto all of those endpoints simultaneously. This insecure design choice put customers in a precarious position, making CrowdStrike’s update system a direct and uncontrollable method for compromising their systems.

How Did We Get Here?

Today, much of the software we depend on still carries the legacy of this bolt-on approach to security. This led the industry to adopt compliance and audit standards similar to the programs used in financial services, such as the Sarbanes-Oxley Act, which paved the way for standards like ISO 27001 and PCI DSS that aspired to manage this risk. However, these measures, like the financial audits that preceded them and missed major frauds such as those at Enron and Wirecard, unsurprisingly also often fail to prevent significant security risks.

This in turn led to a shift towards posture management, where products aimed at risk discovery have proliferated. However, these solutions predominantly focus on identifying potential problems, leaving the filtering and remediation to IT security; the result is a flurry of risk-mitigation tasks rather than identification of root causes and design changes that would fundamentally reduce the risk. These approaches are symptomatic of a broader issue in security—treating the symptoms of poor foundational security rather than addressing the root causes.

Alongside these developments, the last few decades have seen massive growth in the adoption of distributed computing technologies, which first led to MSPs, then SaaS, and now cloud computing. This brought about the concept of shared responsibility, where security is managed as an incremental part of “solving the problem” for the buyer, allowing them to focus on their core business. For many, if not most, business problems this is the right way to go, but it also often complicates accountability and obscures the visibility of security controls, sometimes exacerbating the very issues it seeks to mitigate.

Is It Really That Bleak?

At first glance, this all might seem pretty dire. After all, we do not get to replace all the technology we depend on at once, and even if we could, these new solutions will always come with their own problems. However, we must accept that you can’t effectively retrofit “secure by design” onto an existing system—or if you do, at best it will take decades to replace poorly designed bits and pieces of those existing systems. Even when improvements do occur, as we saw with Microsoft’s Trustworthy Computing initiative, eventually the organization may lose focus and start to regress, leaving us on a rollercoaster of empty promises.

We have seen some organizations take this to heart; a good example is Google. They built their infrastructure and processes on platforms that address foundational issues, allowing the vast majority of their developers to remain largely unaware of the broad spectrum of security challenges that operating services like theirs are subject to. For instance, most Google developers couldn’t tell you much about the infrastructure that runs their code. They build on top of mandated frameworks that “make the right thing the easy thing,” enabling the central security organization to perform core security functions.

A telling example of the design of these systems is how the entire concepts of authentication, authorization, and encryption are abstracted away from the developer, so they largely just happen automatically. To be clear, this approach is not without its faults, and Google’s systems are neither perfect nor suitable for all types of workloads. However, the base principles applied here are fairly universal: Make investments in infrastructure design upfront to ensure that the systems built on them are secure, well, by design.

We are starting to see the industry come around to this reality with the advent of platforms like Kubernetes and infrastructure such as SPIFFE (the Secure Production Identity Framework For Everyone), which are designed to foster environments where security is integrated at the core rather than being an add-on. But this is just the beginning of the focus on incorporating security infrastructure into the way we build, not the end.

Conclusion

What is necessary is a cultural shift in how we approach design. For this to happen, customers need to start demanding transparency from vendors about how their systems are designed, assessed, operated, and serviced. Part of this responsibility involves conducting their own assessments and not relying solely on blind faith. The recent reckoning with supply chain risks and global-scale outages is a direct result of such blind trust.

By demanding that vendors provide a comprehensive narrative of how they’ve built their systems to be secure, rather than merely listing security features, and by engaging in initial and continuous technological due diligence to hold them accountable, we can hasten the transition to a world where security is woven into the fabric of every system, not merely added as an afterthought.

In conclusion, while we are just at the beginning of this journey and will have to continue managing systems that were not designed with transparent “secure by design” principles for decades, the direction is clear. We are moving towards a world where both code and users are authenticated, and where automated, intelligent systems that utilize expert systems and AI evaluate whether systems are regressing or living up to their promises. If we continue on this path, our children and grandchildren may have the opportunity to live in a world that is inherently more secure, or at the very least, they will be better equipped to manage the security risks associated with the massive expansion of technology that we rely on daily.

CA Misissuance: A Tale of Two Incident Responses

Certificate Authorities on the web are tasked with validating that the websites we visit are truly associated with the domains that serve the content that makes up that website. As a result, we gain confidence that we know which domains we are communicating with and can trust that no one else can view our interactions with them.

CAs are trusted because user agents—typically web browsers—chose to delegate the “Trust On First Use” (TOFU) problem to these authorities. A user agent acts on behalf of the user, aiming to serve and protect the user’s interests while navigating the internet. In this context, ensuring secure communication with the correct site is managed through this delegation of domain control verification to CAs.

Many often forget that a CA’s role does not stop at the time of certificate issuance; they also manage revocations when keys are compromised or replaced, the certificates are issued in error, or when they are not compliant with the industry standards the CA has agreed to meet as a publicly trusted CA—a responsibility pivotal to maintaining the trust billions place in them.

To illustrate this further, not only is this their responsibility, but they also have obligations under the audit regimes they operate under. Depending on the scenario, they may need to revoke the certificates within 24 hours to 5 days.  When you are potentially talking about tens of thousands or millions of certificates, this requires both good engineering and preparation.

“By failing to prepare, you are preparing to fail.” — Benjamin Franklin

In this post, I want to delve into how two different CAs approached issues in handling two misissuances involving failures to meet the requirements for domain control verification. The contrast between these two incidents is striking and highlights the importance of preparation, good engineering, and strategic decision-making in crisis management.

Case Studies

Let’s Encrypt: A Proactive Approach

Let’s Encrypt demonstrated exemplary crisis management during their 2020 domain control verification-related incident. Through compensating controls, they internally discovered a bug in their domain control verification logic. With the right data, tooling, and processes in place, they were able to act decisively—filing the initial incident quickly, analyzing the scope and impact, and revoking over 1.7 million of the 3 million impacted certificates within hours. They could do this because the service and organization were built from day one to make it possible. One of the most telling elements of their response was their ability to quickly determine which certificates had already been replaced and which were still in use. This information guided their communication strategy and helped identify which certificates were safe to revoke immediately. Furthermore, within the 24-hour revocation requirement, they filed a subsequent delayed-revocation incident.

This captured why the remaining certificates were not immediately revoked and when they would all expire, helping browsers and the community understand not only that they had done their research but also how their handling of the issue could impact the web and what the exposure of the delayed revocation represented.

DigiCert: A Reactive Approach

In contrast, DigiCert’s response to their domain control-related misissuance highlights several areas for improvement. The issue had persisted for an extended period and was identified externally, rather than through internal engineering and compliance processes designed to catch such errors.

DigiCert also struggled with the logistics of mass revocation. Faced with tens of thousands of certificates, they had difficulty determining which could be revoked quickly and which were actually impacted, and as a result they initially planned to revoke all certificates within 24 hours. Though they started this clock once they had figured out the scope of the incident, it took them a while to do so, suggesting a lack of incident preparation. They also appear to have done all of this without a clear strategy to mitigate customer impact, and through a sequence of events ultimately rolled out a 5-day delay on top of the initial plan. This all led to the unusual step of a subscriber filing a Temporary Restraining Order (TRO) against the revocation, which set a troubling precedent and complicated the situation further.

A significant challenge was that many DigiCert customers do not utilize automated certificate lifecycle management extensively, which is essential during such incidents. This issue reflects back on DigiCert’s own processes, as manual lifecycle management is the default rather than the exception.

This incident raises questions about DigiCert’s preparedness, including whether they maintain a comprehensive list of all valid certificates, track replacements, and monitor customer use of lifecycle management. It also prompts a reflection on their internal review processes and test automation for compliance with CA/Browser Forum Requirements.

“Learn all you can from the mistakes of others. You won’t have time to make them all yourself.” – Alfred Sheinwold

It also makes me wonder what kind of design review process was in place; the description in the incident report suggests a fix was applied at the user-interface level, which no serious engineer would do. Beyond that, it’s unclear what kind of test automation is incorporated into their engineering processes to ensure compliance with the associated requirements. For example, it would be quite reasonable for the process to mandate unit tests for all normative-level requirements from the CA/Browser Forum Requirements and associated IETF documents. This would ensure that initial and subsequent releases meet requirements, which is particularly crucial for an organization with many engineers who may not all be experts in every requirement.
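As a hedged illustration of what such a test might look like, the sketch below encodes one normative requirement—the 398-day maximum TLS certificate lifetime from the CA/Browser Forum’s Baseline Requirements—as pytest-style unit tests. The certificate builder here is a self-signed stand-in for a CA’s real issuance pipeline.

```python
# Sketch: unit-testing a normative lifetime requirement against issued certs.
import datetime
from cryptography import x509
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.x509.oid import NameOID

MAX_LIFETIME = datetime.timedelta(days=398)  # BR limit for TLS leaf certs

def make_cert(days: int, extra_seconds: int = 0) -> x509.Certificate:
    # Stand-in for the issuance pipeline: builds a self-signed certificate.
    key = ec.generate_private_key(ec.SECP256R1())
    name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "example.com")])
    start = datetime.datetime(2024, 1, 1)
    end = start + datetime.timedelta(days=days, seconds=extra_seconds)
    return (
        x509.CertificateBuilder()
        .subject_name(name).issuer_name(name)
        .public_key(key.public_key())
        .serial_number(x509.random_serial_number())
        .not_valid_before(start).not_valid_after(end)
        .sign(key, hashes.SHA256())
    )

def test_lifetime_within_requirements():
    cert = make_cert(days=90)
    assert cert.not_valid_after - cert.not_valid_before <= MAX_LIFETIME

def test_off_by_one_second_is_caught():
    # The class of bug behind Let's Encrypt's 2021 mass revocation.
    cert = make_cert(days=398, extra_seconds=1)
    assert cert.not_valid_after - cert.not_valid_before > MAX_LIFETIME
```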

Engineering Readiness and Strategic Decision-Making

These case studies highlight critical lessons in engineering readiness and crisis management. Systems designed for scale and availability must be built with proactive planning, automation, and testing in mind. The reality is that systems designed for both scale and availability are also operated with engineering readiness as a core tenet. If you read the O’Reilly Site Reliability Engineering book, you will notice the same undertow of planning, automating, testing, and preparing; that is because the only way to have a reliable system or process is to engineer it to be that way.

Beyond that, if you’ve ever been on call, you understand how crucial it is to prevent and automate away both mundane and unexpected issues so you can respond effectively—or better yet, never experience entire classes of problems as you learn how to operate a CA. This is why, following Let’s Encrypt’s 2020 incident and their second mass revocation in 2021—for certificates that were valid one second longer than their policy allowed—they began exploring ways to ensure this never happens again. This is what led us to ACME Renewal Information (ARI), which aims to make these unplanned certificate renewals boring. This proactive approach aligns with the principle:

“An error doesn’t become a mistake until you refuse to correct it.” – Orlando A. Battista

The browsers have long stressed the importance of blameless postmortems and a culture of self-improvement; without these, people within an organization do not feel empowered to do the right thing. At the same time, how an organization responds to an incident says a great deal about its culture, transparency, operational excellence, focus on compliance, and engineering practices, and as such serves as a useful indicator of an organization’s competence and suitability to be part of the threat model of the internet.

At the last CA/Browser Forum face-to-face meeting, the Chrome team talked a bit about this.

Lessons Learned

These incidents highlight several key lessons for CAs, including the importance of automated systems, transparent communication, and continuous learning.

1. Preparation is Key: Automated systems, supporting tooling, and predefined response strategies are crucial for effective incident management. CAs should develop and test incident response plans regularly to ensure rapid action during a crisis.

2. Transparency Builds Trust: During crises, maintaining open and regular communication is essential. CAs should ensure that their communication strategies are clear and consistent to build and maintain trust with stakeholders and the community.

3. Learn from Others: Assign teams and individuals to conduct regular reviews of both historical and current incidents. Have them present these findings to the organization and rotate this responsibility across different disciplines to ensure knowledge is shared.

Conclusion

Acknowledging that DigiCert displayed a level of transparency during their incident, it’s clear that the event still exposed significant gaps in their compliance and engineering practices. Transparency, while crucial, is only one component of effective crisis management. The DigiCert case illustrates that even when stakeholders are kept informed, underlying deficiencies in operational procedures and compliance frameworks can compromise both the efficacy of the response and the overall security posture.

“The decisions you make reflect the priorities you have.”

Some recommendations for CAs looking to learn from this incident include:

  • DigiCert’s incident highlighted the need for more robust compliance mechanisms. Regularly updating compliance protocols and ensuring they are in strict alignment with industry standards is essential. This includes periodic reviews and updates to incorporate new regulatory requirements and best practices.
  • The incident revealed that while DigiCert was transparent, there was a lack of sufficient engineering controls to prevent or mitigate the impact of such issues. Implementing rigorous engineering best practices such as continuous integration, automated testing, and detailed, proactive monitoring of systems can help prevent similar incidents.
  • While DigiCert was forthcoming about the issue, their response strategy could benefit from a more structured approach that includes predefined escalation paths, detailed role assignments, and rapid response teams specialized in different aspects of the crisis management process.
  • Beyond immediate crisis management, fostering a culture prioritizing continuous improvement and learning from past mistakes is vital. This involves regular training, simulations of potential issues, and a blame-free environment that encourages reporting and swift resolution of potential threats.

In conclusion, effective crisis management involves not only responding to incidents but also implementing robust preventive measures. Regular updates to designs, implementations, tooling, improved engineering controls, and a culture of continuous improvement are essential for minimizing the likelihood and impact of future incidents.