From Fairways to the Cloud: Estimating Golf Balls in Flight to Tackling Cloud Workload Scale

Early in my career, I worked in quality assurance at Microsoft. Analytical skills were a core trait we tried to hire for, and at the time “brain teasers” were often used in interviews to assess them. One memorable interview question went like this:

“How would you figure out how many golf balls are in flight at any given moment?”

This question wasn’t about pinpointing the exact number; it was a window into the candidate’s analytical thinking, problem-solving approach, and ability to break down complex problems into manageable parts. It was always interesting to see how different minds approached this seemingly simple yet deceptively complex problem. If a candidate wasn’t sure how to begin, we would encourage them to ask questions or to simply document their assumptions, stressing that it was the deconstruction of the problem—not the answer—that we were looking for.

In engineering, we often need to take big, abstract problems and break them down. For those who aren’t engineers, this golf ball question makes that process a bit more approachable. Let me walk you through how we might tackle the golf ball question:

  1. Number of Golf Courses Worldwide
    • There are approximately 38,000 golf courses globally.
  2. Players and Tee Times
    • On average, each course hosts about 50 groups per day.
    • With an average group size of 4 players, that’s 200 players per course daily.
  3. Shots Per Player
    • An average golfer takes around 90 shots in a full round.
  4. Total Golf Balls in Play
    • 200 players × 90 shots = 18,000 shots per course per day.
  5. Time a Golf Ball Is in the Air
    • Let’s estimate each shot keeps the ball airborne for about 5 seconds.
  6. Calculating Balls in Flight
    • Over 12 hours of playtime, there are 43,200 seconds in a golfing day.
    • Total airborne time per course: 18,000 shots × 5 seconds = 90,000 seconds.
    • Average balls in flight per course: 90,000 seconds ÷ 43,200 seconds ≈ 2 golf balls.
  7. Global Estimate
    • 2 balls per course × 38,000 courses = 76,000 golf balls in flight at any given moment worldwide.
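
The steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, using the same assumed inputs as the list; the constant and function names are mine. It also shows how much the choice to round to 2 balls per course before multiplying affects the global figure.

```python
# Fermi estimate of golf balls in flight, using the assumptions listed above.
COURSES = 38_000                      # golf courses worldwide
GROUPS_PER_DAY = 50                   # groups per course per day
PLAYERS_PER_GROUP = 4
SHOTS_PER_ROUND = 90
SECONDS_AIRBORNE = 5                  # airborne time per shot
PLAY_SECONDS_PER_DAY = 12 * 60 * 60   # 12 hours of play = 43,200 seconds


def balls_in_flight_per_course() -> float:
    """Average number of balls airborne over one course at any moment."""
    shots = GROUPS_PER_DAY * PLAYERS_PER_GROUP * SHOTS_PER_ROUND  # 18,000
    airborne_seconds = shots * SECONDS_AIRBORNE                   # 90,000
    return airborne_seconds / PLAY_SECONDS_PER_DAY


per_course = balls_in_flight_per_course()
rounded_total = round(per_course) * COURSES   # rounds to 2 per course first
unrounded_total = per_course * COURSES        # keeps the fraction

print(f"Per course: {per_course:.2f}")
print(f"Global (rounded per course): {rounded_total:,}")
print(f"Global (unrounded): {unrounded_total:,.0f}")
```

Rounding early gives 76,000; carrying the fraction through gives roughly 79,000. Either way the answer is in the same ballpark, which is the point of the exercise.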

This exercise isn’t about precision; it’s about methodically breaking down a complex question into digestible parts to arrive at a reasonable estimate. As the saying goes, all models are wrong, but some are useful. Our goal here is to find a stable place to stand as we tackle the problem, and this question does a decent job of that. If nothing else, it lets us see how a candidate might approach unknown topics.

Transitioning from the Green to the Cloud

Today, the biggest challenges in cloud workload identity management remind me of these kinds of problems—except far more complex. Unlike in a round of golf, most workloads aren’t individually authenticated today; instead, they rely on shared credentials, essentially passwords, stored and distributed by secret managers, and anything needing access to a resource must have access to that secret. 

But with the push for zero trust, rising cloud adoption, infrastructure as code, and the reality that credential breaches represent one of the largest attack vectors, it’s clear we need a shift. The future should focus on a model where every workload is independently authenticated and authorized.

So, let’s put the “golf balls soaring through the air” approach to work here, using the same framework to break down the cloud workload scale:

  1. Global Cloud Infrastructure
    • Major cloud providers operate data centers with an estimated 10 million servers worldwide.
  2. Workloads Per Server
    • Each server might run an average of 100 workloads (virtual machines or containers).
    • 10 million servers × 100 workloads = 1,000 million (1 billion) workloads running at any given time.
  3. Ephemeral Nature of Workloads
    • Let’s assume 50% of these are ephemeral, spinning up and down as needed.
    • 1 billion workloads × 50% = 500 million ephemeral workloads.
  4. Workload Lifespan and Credential Lifecycle
    • If each ephemeral workload runs for about 1 hour, there are 24 cycles in a 24-hour day.
    • 500 million workloads × 24 cycles = 12 billion ephemeral workloads initiated daily.
  5. Credentials Issued
    • Each new workload requires secure credentials or identities to access resources.
    • This results in 12 billion credentials needing issuance and management every day.
  6. Credentials Issued Per Second
    • With 86,400 seconds in a day:
    • 12 billion credentials ÷ 86,400 seconds ≈ 138,889 credentials issued per second globally.
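
The same chain of assumptions can be written as a small Python function, which makes it easy to see how sensitive the result is to any one input. This is a sketch of the arithmetic only; the function and parameter names are mine, and the defaults are the assumed values from the list above.

```python
SECONDS_PER_DAY = 86_400


def credentials_per_second(
    servers: int = 10_000_000,        # assumed servers worldwide
    workloads_per_server: int = 100,  # assumed VMs or containers per server
    ephemeral_fraction: float = 0.5,  # assumed share of short-lived workloads
    lifespan_hours: float = 1.0,      # assumed lifetime of an ephemeral workload
) -> float:
    """Estimate credentials issued per second for ephemeral workloads."""
    running = servers * workloads_per_server
    ephemeral = running * ephemeral_fraction
    cycles_per_day = 24 / lifespan_hours
    issued_per_day = ephemeral * cycles_per_day
    return issued_per_day / SECONDS_PER_DAY


baseline = credentials_per_second()
shorter_lived = credentials_per_second(lifespan_hours=0.25)

print(f"Baseline: {baseline:,.0f} credentials per second")
print(f"15-minute workloads: {shorter_lived:,.0f} credentials per second")
```

With the defaults this reproduces the figure of about 138,889 per second. Cutting the assumed lifespan to 15 minutes quadruples it, which shows how much the estimate depends on how ephemeral the workloads really are.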

In this updated example, just as with the golf balls in flight question, we deconstruct a complex system to better understand its core challenges:

  • Scale: The number of workloads and credentials needed to achieve this zero-trust ideal is much higher than we would need to simply pass around shared secrets.
  • Dynamics: These credentialing systems must have much higher availability than static systems to support the dynamism involved.
  • Complexity: Managing identities and credentials at this scale is a monumental task, emphasizing the need for scalable and automated solutions.

Note: These calculations are estimates meant to illustrate the concept. In real-world cloud environments, actual numbers can vary widely depending on factors like workload type distribution, number of replicas, ephemerality of workloads, and, of course, individual workload needs.

Conclusion

This exercise demonstrates a fundamental point: analytical thinking and problem-solving are timeless skills, applicable across domains.

You don’t need to be an expert in any given system to get started; you simply need to understand how to break down a problem and apply some basic algebra.

It also serves as a way to understand the scope and scale of authenticating every workload to enable zero trust end-to-end. Simply put, this is a vastly different problem than user and machine authentication, presenting a unique challenge in managing identities at scale.

Automating Non-Human Identities: The Future of Production Key Management 

Historically, key management was seen as activities involving hardware security modules (HSMs), manual tasks, and audits. This approach was part of what we termed ‘responsible key management.’ However, HSMs were impractical for many use cases, and these manual tasks, typical of IT processes, were often poorly executed or never completed, frequently causing security incidents, outages, and unexpected work.

Simultaneously, as an industry, we began applying cryptography to nearly all communications and as a way to protect data at rest. This led to the adoption of cryptography as the method for authenticating hardware, machines, and workloads to apply access control to their activities. As a result, today, cryptography has become a fundamental component of every enterprise solution we depend on. This shift led us to attempt to apply legacy key management approaches at the enterprise scale. The increased use of cryptography within enterprises made it clear these legacy approaches ignored the majority of keys we relied on, so we took a tactical approach and created repositories to manage the sprawl of secrets. While a step forward, this approach also papered over the real problems with how we use, track, and manage keys.

It is time for us as an industry to start viewing cryptography and key management not just as a tax we must pay but as an investment. We need to manage these keys in an automated and scalable way that helps us manage risk in our businesses.

To do this, we need to start with a question: What are these keys, anyway? Broadly, I think of three categories of keys: long-lived asymmetric secrets like those associated with certificate authorities, long-lived shared secrets used for encryption and authentication, and modern-day asymmetric key credentials for users, devices, and workloads. The greatest growth in keys has been in the last category, so let’s focus on that for the purpose of this discussion.

Modern Credentials and Their Management

Modern-day asymmetric key-based credentials are not always “certificates,” but they generally bind some claim(s) to an asymmetric key pair. These credentials can be encoded as JSON, ASN.1, CBOR, TLVs, X.509, JWT, or some other format. They serve various purposes:

  1. User Certificates: Issued to individual users to authenticate their identity within an organization, these certificates provide secure access to corporate resources, such as an SSH certificate used by developers to access production. They bind a user’s identity to a cryptographic key pair, ensuring only authorized individuals access sensitive information and systems.
  2. Hardware Certificates: Assigned by manufacturers during production, these certificates uniquely identify hardware devices. They are often used to bootstrap the identity of machines or workloads, ensuring only authorized devices can access resources on your network.
  3. Machine Certificates: Common in operational IT environments, these certificates authenticate servers associated with domains, IP addresses, or device identifiers. They are typically used with TLS and for network access use cases like 802.1x, IKE, and various VPNs.
  4. Workload Certificates: In cloud and serverless environments, workload certificates perform access control close to the business logic to minimize security exposure and deliver on zero trust goals. These dynamic certificates often reflect both the underlying hardware and the workload running on it, acting like multi-factor authentication for devices. The frequent need to re-credential workloads makes issuing credentials mission-critical, as failure to do so can cause outages. This necessitates issuers in each failure domain (think of this as a cluster of clusters) hosting these workloads to ensure continuous operation.

What we can take from this is that we have been approaching credentials incorrectly by treating them as a key management problem. This approach is akin to using password managers for hardware, machines, and workloads, whereas, for users, we have moved toward multi-factor authentication and non-password-based authenticators.

Towards Automated and Scalable Key Management

If password managers or key vaults are not the right solution for machine authentication, what is? The answer is simpler than it might seem. Just as with users, these cases require built-for-purpose Identity Providers (IDPs). This is especially true for workloads, which dynamically spin up and down, making durable identifiers impractical. An IDP becomes a security domain for a given deployment, ensuring that workloads are accessible only by appropriate resources. This setup limits attackers’ lateral movement, allows for nearly instant granting and removal of access, minimizes the impact of compromises, and enables easy federation between deployments—all while providing a central point for identity governance and ensuring the cryptographic keys associated with credentials are well-managed and protected.

Getting Started

Modernizing key management starts with measurement. Identify the most common types of keys in your secret vaults, typically workload-related credentials. Deploy a workload-specific IDP, such as those enabled via SPIFFE, to transition these credentials out of the secret manager. Over time, the secret manager will store static secrets like API keys for legacy systems, while dynamic credentials are managed appropriately.
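
As a rough illustration of that measurement step, the sketch below tallies an inventory of secrets by kind so the most common category stands out. The inventory entries and the kind labels are hypothetical; in practice you would build this list from whatever export or listing your own secret manager provides.

```python
from collections import Counter

# Hypothetical inventory: (secret name, kind) pairs exported from a vault.
inventory = [
    ("payments-api/db-password", "shared_password"),
    ("payments-api/tls-key", "workload_credential"),
    ("billing-worker/tls-key", "workload_credential"),
    ("legacy-crm/api-key", "api_key"),
    ("search-indexer/tls-key", "workload_credential"),
]


def rank_secret_kinds(entries):
    """Return (kind, count) pairs, most common kind first."""
    counts = Counter(kind for _name, kind in entries)
    return counts.most_common()


ranking = rank_secret_kinds(inventory)
for kind, count in ranking:
    print(f"{kind}: {count}")
```

The kind at the top of the ranking is the first candidate to move out of the secret manager and into a workload-specific IDP.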

From the start, avoid using your secret manager as an IDP, especially for new systems. Teams responsible for the operational burden of these systems usually support this change, as automated end-to-end credentialing of workloads is more agile, scalable, and secure. This results in fewer outages and security incidents related to secret managers and non-production-quality dependencies.

From this point, the process becomes a cycle of identifying where static secrets or long-lived credentials are funneled through your secret manager and moving them to built-for-purpose credential lifecycle management solutions.

Multi-factor authentication for workloads

Adopting a purpose-built workload IDP is a good start, but keys can still be stolen or leaked. For machines and workloads, use hardware attestations. Built-in hardware authenticators, such as Trusted Platform Modules (TPMs), create and secure keys within the device, ensuring they never leave it. TPMs also verify device integrity during boot-up, adding an extra layer of security. This combination provides stronger multi-factor authentication without the usability issues associated with similar patterns for user authentication.

Avoiding Common Mistakes

The most common mistake organizations make is applying existing systems to workload credential management problems without fully analyzing operational, scale, uptime, and security needs. For example, enterprise PKI teams might mandate using their existing CA infrastructure for managing workload credentials, assuming a certificate suffices. However, this often violates the principle of least privilege and struggles with the dynamic identification that workloads require.

Existing credential issuance systems are designed for static, long-lived subjects, such as directory names, DNS names, or IP addresses, which don’t change frequently. In contrast, workload credentials may change every few minutes. Provisioning devices like network appliances before assigning durable identifiers adds to this challenge. New workload-based systems, like SPIFFE, assign identifiers based on runtime elements, preventing the same bad practices that led to secret sprawl and mismanaged key problems.
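
To make "identifiers based on runtime elements" concrete, here is a minimal sketch that derives a SPIFFE-style identifier from attributes observed when a workload starts. A SPIFFE ID has the form spiffe://trust-domain/path. The ns/.../sa/... path layout is one common convention rather than a requirement, the function name is hypothetical, and the validation is deliberately simplified rather than a full implementation of the SPIFFE ID rules.

```python
import re

# Simplified character checks; the real specification has more rules.
_TRUST_DOMAIN = re.compile(r"[a-z0-9._-]+")
_SEGMENT = re.compile(r"[A-Za-z0-9._-]+")


def workload_spiffe_id(trust_domain: str, namespace: str, service_account: str) -> str:
    """Build an identifier from runtime attributes (hypothetical helper)."""
    if not _TRUST_DOMAIN.fullmatch(trust_domain):
        raise ValueError(f"invalid trust domain: {trust_domain!r}")
    for segment in (namespace, service_account):
        # Reject empty, malformed, or dot-only path segments.
        if not _SEGMENT.fullmatch(segment) or segment in (".", ".."):
            raise ValueError(f"invalid path segment: {segment!r}")
    return f"spiffe://{trust_domain}/ns/{namespace}/sa/{service_account}"


print(workload_spiffe_id("example.org", "payments", "api"))
```

The identifier names what the workload is, not which host or IP address it happens to land on, so it stays meaningful even when instances are replaced every few minutes.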

Reducing Reliance on Shared Secrets 

Moving away from shared secrets won’t eliminate the need for secret vaults but will significantly reduce the problem’s scope. As systems modernize, password-based authenticators will be updated or deprecated. Over time, we will see fewer shared, long-lived secrets used for workload identity, driven by zero trust and least privilege principles.

At the same time, we can do much to improve overall key management practices in production systems. However, that’s a topic for another post.

Closing Note

The challenges and opportunities in modern key management are significant, but by leveraging innovative solutions and focusing on automation and scalability, we can make substantial progress. As we adopt built-for-purpose Identity Providers (IDPs) and hardware attestations, it’s important to have the right tools and frameworks in place to succeed.

I have been working with SPIRL, a company focused on making the right thing the easy thing for developers, operations, and compliance. I see firsthand how the right platform investments can simplify the creation and operation of production systems. SPIRL achieves this by authoring and adopting open standards for identity and building reliable, scalable infrastructure that provides greater visibility and control.

Even if you don’t use SPIRL, by focusing on these principles, organizations can better manage the complexities of modern workload-related key and credential management, ensuring greater productivity and security.