Category Archives: Thoughts

From Static to Dynamic: Adapting PKI for Cloud-Native Architectures

When it comes to workload and service credential management, a common misconception is that you can simply reuse your existing Certificate Authority (CA) and Certificate Lifecycle Management (CLM) infrastructure to achieve your desired end-state more quickly. Given that nearly every organization has client and server TLS certificates for devices and web servers and many have made significant investments in managing and issuing these certificates, this notion seems logical. However, the design constraints of these use cases differ significantly. The transition from static to dynamic environments, along with the associated credential governance problems, makes it clear that a purpose-built approach is necessary for managing workload and service credentials.

The Static Nature of Traditional CA and CLM Infrastructure

Traditional CA and CLM infrastructure primarily deals with TLS server certificates, where the identities involved are usually domain names or IP addresses. These identifiers are largely static, pre-defined, and managed out-of-band. As a result, the lifecycle of these certificates, which includes issuance, renewal, and revocation, follows a relatively predictable pattern. These certificates are usually issued for validity periods ranging from 90 days to a year, and the processes and infrastructure are designed to handle these constraints. However, use cases surrounding workload and service credentials have a totally different set of constraints, given the highly dynamic environment and various regulatory frameworks they must adhere to.

The Dynamic Nature of Workloads and Services

Workloads and services operate in a significantly different environment. These identities often come and go as services scale up or down or undergo updates. The sheer scale at which workload and service credentials are issued and managed can be hundreds of times greater than that of traditional client and server TLS certificate use cases.

Unlike the static domain names or IP addresses used in these traditional TLS certificates, the identifiers for workloads and services are dynamically assigned as workloads spin up and down or are updated. Workload and service credentials often have a much shorter lifespan compared to server certificates. They might need to be reissued every few minutes or hours, depending on the nature of the workload and the policies of the environment. This changes the expectations of the availability and scalability of the issuing infrastructure, leading to a need for credentials to be issued as close to the workload as possible to ensure issuance doesn’t become a source of downtime.

These workloads and services are also often deployed in clusters across various geographic locations to ensure fault tolerance and scalability. Unlike server certificates that can rely on a centralized CA, workload and service credentials often need a more distributed approach to minimize latency and meet performance requirements.

Another critical difference lies in the scope of trust. For TLS client and server certificates, the security domain is essentially defined by a root certificate or the corresponding root store. However, in the context of workloads and services, the goal is to achieve least privilege. This often necessitates a much more complicated PKI to deliver on this least-privileged design.

Additionally, federation is often needed in these use cases, and it is usually not possible to standardize on just X.509. For example, you might need to interoperate with a service that works based on OAuth. Even in such cases, you want to manage all of this within one common framework to have unified control and visibility regardless of the technology used for the deployment.

Purpose-Built Solutions for Workload and Service Credential Management

Given the unique challenges of managing workload and service credentials in dynamic, cloud-native environments, existing CA and CLM investments often fall short. Purpose-built solutions, such as those based on SPIFFE (Secure Production Identity Framework for Everyone), tend to offer a more effective approach by providing dynamic, attested identities tailored for these environments.

Dynamic Identifier Assignment

Unlike static identifiers managed by traditional CLM solutions, SPIFFE dynamically assigns identifiers as workloads spin up and down, aligning with the nature of modern cloud environments.

Decentralized Issuance

By issuing credentials as close to the workload as possible, SPIFFE-based solutions reduce latency and align issuance with the availability goals of these deployments, ensuring that credentials are issued and managed efficiently without becoming a source of downtime.

Granular Policy Enforcement

SPIFFE-based solutions enable enforcing fine-grained policy through its namespacing and security domain concepts. This enables organizations to define and enforce policies at a more granular level, ensuring that workloads only access the necessary resources.

Identity Governance and Federation

SPIFFE-based solutions tend to support extending existing identity governance programs to workload and service credentials while also facilitating seamless and secure access to external services across different trust domains.

Multifactor Authentication

These SPIFFE-based solutions also provide support for attestation, which can be thought of as multi-factor authentication for workloads. This attestation verifies the workload and services’ running state and environment, tying the credentials to those environments and helping minimize the risks of credential theft.

Integration with Other Systems and Protocols

These environments can seldom rely exclusively on just X.509, which is why SPIFFE supports both X.509 and JWT formats of credentials. This flexibility allows seamless integration with various systems and protocols within cloud deployments.

Conclusion

While it might be tempting to reuse existing CA and CLM infrastructure for workload and service credential management, this approach fails to address the unique challenges posed by dynamic, cloud-native environments. The ephemeral nature of workload identities, the high frequency of credential issuance, and the need for granular trust domains necessitate purpose-built solutions like those based on SPIFFE.

These purpose-built systems are designed to handle the dynamic nature of modern environments, providing strong, attested identities and ensuring that credential management aligns with the evolving needs of cloud-native workloads. Understanding these distinctions is crucial for developing an effective identity management strategy for modern workloads and services without having to go through a reset event down the line.

To learn more about effective machine identity management, check out SPIRL and see how they are leading the way.

Credential Management vs. Secret Management: Choosing the Right Approach

If we examine the contents of most secret management solutions, like HashiCorp Vault, we will find that we primarily store the logical equivalent of user IDs and passwords for services, workloads, and machines. Much like the old-school practice of writing passwords on post-it notes and sharing them, these secrets are then distributed to whatever entity needs access to the protected resource. As a result, it is no surprise that we encounter the same problems as password management: weak secrets, shared secrets, and stolen secrets. This is why we see machine credentials and keys being stolen in events like STORM 0558.

Secret Management: A Machine-Scale Password Management

Secret management can be seen as password management at a machine scale. These systems centralize secrets  (e.g., API keys, tokens, and passwords) to manage their sprawl and then make it easy to pass these shared secrets around various services and applications. However, just as simple passwords alone are seen as insufficient for user authentication—necessitating the adoption of multi-factor authentication (MFA) and migration to non-password authenticators like Passkeys—the same shortcomings apply to this legacy approach to managing these credentials.

The reality is that these secret managers were designed to address the symptoms rather than the root cause of secret sprawl. By storing and distributing these static secrets without treating them as credentials to be managed, you centralize the chaos but do not solve the fundamental problem: the need for robust, scalable machine identity management. Machines, services, and workloads, much like users credentials, require purpose-built approaches. Just as we have applied multi-factor authentication, identity governance and administration (IGA) and solutions like OKTA to user management, we need analogous systems for machines.

Credential Management for Machines

At machine and workload scale, credential management solutions need to be architected differently to handle the dynamic nature of modern cloud environments. This is where purpose-built solutions come into play. These solutions must address the unique challenges of credential management in dynamic, cloud-native environments. Unlike secret management, which often relies on static secrets, credential management provides dynamic, attested identities for workloads. For example, SPIFFE (Secure Production Identity Framework for Everyone) offers a strategy for managing dynamic identities, providing the foundation for standards-based, scalable, and robust workload identity management.

Higher-Level Concepts in Workload Credential Management:

  1. Provisioning and Deprovisioning:
    • Efficiently creating, updating, and removing machine identities as services start, stop, and scale.
  2. Role-Based Access Control (RBAC):
    • Assigning access rights based on predefined roles, ensuring that machines and services have appropriate permissions.
  3. Periodic Reviews:
    • Conducting regular reviews of machine access rights to ensure compliance and appropriateness.
  4. Policy Enforcement:
    • Defining and enforcing access control policies, ensuring that machines and services adhere to security guidelines.
  5. Audit and Reporting:
    • Generating comprehensive reports to demonstrate compliance with regulatory requirements and internal policies.
  6. Risk Analysis:
    • Identifying and mitigating risks associated with machine identities and their entitlements.
  7. Behavioral Analysis:
    • Monitoring machine behavior to detect anomalies and potential security threats.

Just as legacy approaches to secret management fail for workloads, machines, and services, these legacy IGA approaches will not work either. We need built-for-purpose solutions to adapt and not repeat the mistakes of the past. Those solutions will look like credential management systems that enable these functions as a byproduct of how they work, rather than bolting them onto an existing secret management or IGA process. 

This is why I have been an advisor to SPIRL since its founding. The founders, Evan and Eli, are why I knew I had to be involved. As the authors of the standard and having had real production-at-scale experience with SPIFFE, I knew they had what it took, which is why I have been working with them in a more formal capacity as of this last month. They are building the infrastructure that makes, as I like to say, the right thing the easy thing, ensuring their customers don’t repeat the mistakes of the past when it comes to workload and service identity.

Conclusion

Secret management tools were a necessary step in addressing the sprawl of machine secrets, and they will continue to be needed. However, as the complexity and scale of cloud environments grow, so does the need for more sophisticated solutions for workload and service credentials. Credential management systems provide the dynamic, robust framework required for modern machine identity management.

By adopting credential management as the primary approach and using secret management as the exception, organizations can achieve greater security, scalability, and operational efficiency in their cloud-native architectures. To learn more about effective machine identity management, check out SPIRL and see how they are leading the way.

ACME vs. SPIFFE: Choosing the Right One

In the world of certificate lifecycle management for workloads, two approaches often come into focus: ACME (Automated Certificate Management Environment) and SPIFFE (Secure Production Identity Framework for Everyone). While both can deliver certificates to a device or workload, they cater to different use cases. Let’s explore their core differences and why these matter, especially for dynamic workload deployments.

ACME: Proving Control Over Identifiers

ACME is primarily designed to prove control over an identifier, typically a domain name, and to deliver a certificate that demonstrates successful control verification.

  • Control Verification: ACME verifies control through various methods such as HTTP-01, DNS-01, TLS-ALPN, and External Account Binding.
  • Attestation (Optional): Attestation in ACME is secondary and optional, primarily enabling verifying if the key is well protected.
  • Pre-assigned Identifiers: ACME assumes that the identifier (like a domain name) is pre-assigned and focuses on validating control over this identifier.

This approach is particularly useful for scenarios where identifiers are static and pre-assigned, making it ideal for server authenticated TLS and applications that rely on domain names.

SPIFFE: Dynamic Assignment of Identifier Based on Attestation

SPIFFE, conversely, is designed for dynamic workloads, which frequently change as services scale or update. SPIFFE assigns identifiers to workloads dynamically, based on attestation.

  • Identifier Assignment: SPIFFE assigns an identifier to the subject (such as a workload) using attestation about the subject to construct this identifier.
  • Attestation: Unlike ACME, attestation is a core component in SPIFFE, enabling robust multi-factor authentication (MFA) for workloads based on what is running and where it is running.
  • Namespacing and Least Privilege: SPIFFE facilitates the namespacing of identifiers, building a foundation that enables authorization frameworks that promote least privilege, ensuring workloads only access necessary resources.
  • Minimal Security Domains: At the core of SPIFFE is the concept of security domains, which serve as trust boundaries between services, helping to minimize the attack surface.
  • JWT and X.509: SPIFFE supports SVIDs in both X.509 and JWT formats, enabling seamless integration with various systems and protocols within cloud deployments.

Importance in Dynamic Workload Deployments

The differences between ACME and SPIFFE are particularly significant in dynamic workload deployments:

  • Flexibility and Scalability: SPIFFE’s dynamic identifier assignment is highly suitable for environments where workloads are frequently spun up and down, such as in microservices architectures and Kubernetes clusters.
  • Role-Based Authentication: By making attestation a core component and promoting least privilege, SPIFFE ensures that each workload is authenticated and authorized precisely for its intended role.
  • Granular Policy Enforcement: SPIFFE’s namespacing and minimal security domain features enable fine-grained policy enforcement, enabling organizations to define and enforce policies at a more granular level, such as per-workload or per-service.
  • Identity Federation: SPIFFE supports identity federation enabling different trust domains to interoperate securely. This is particularly beneficial in multi-cloud and hybrid environments where different parts of an organization or third-parties.

Conclusion

While both ACME and SPIFFE are used for certificate management, they cater to different scenarios and can complement each other effectively. ACME is ideal for static environments where identifiers are pre-allocated, focusing on certificate lifecycle management for issuing, renewing, and revoking certificates. It excels in managing the lifecycle of certificates for web servers and other relatively static resources. 

On the other hand, SPIFFE excels in dynamic, high-scale environments, emphasizing credential lifecycle management with flexible and robust authentication and authorization through dynamic identifiers and attestation. SPIFFE is designed for modern, cloud-native architectures where workloads are ephemeral, and identifiers need to be dynamically issued and managed. 

By understanding these differences, you can leverage both ACME and SPIFFE to enhance your identity management strategy. Use ACME for managing certificates in static environments to ensure efficient lifecycle management. Deploy SPIFFE for dynamic environments to provide strong, attested identities for your workloads.

Building Effective Roadmaps through Mission, Vision, and Strategy Alignment

Creating a cohesive and effective roadmap requires more than just a list of tasks and deadlines; it requires a clear mission, vision, and strategy that aligns the entire organization. This alignment ensures that every team member understands the overarching goals and the path to achieving them, helping facilitate a unified and focused approach to succeeding.

Mission: The Foundation of Purpose

The mission statement serves as the foundation, articulating the core purpose of the organization. It answers why the organization exists and what it aims to achieve. For example:

Google Cloud’s mission to enable global digital transformation through superior infrastructure and industry-specific solutions guides all efforts.

A strong mission focuses on the problem, providing clarity on the organization’s purpose.

Vision: The Aspirational Future

The vision statement paints an aspirational picture of the future the organization seeks to create within the next 2-5 years. It describes desired outcomes and the impact of the organization’s efforts. For example, envisioning customers leveraging advanced features to transform operations provides a clear goal. For example:

Google Cloud’s vision is to empower businesses of all sizes to innovate faster and achieve operational excellence through cutting-edge technology and seamless integration with existing technology investments.

While the mission addresses the ‘why,’ the vision focuses on the ‘what’—what success looks like in practical terms.

Strategy: The Path to Achievement

Strategy bridges the gap between the current state and the envisioned future. It outlines ‘how’ the organization will achieve its mission and vision, including:

Assessing the current environment, including market conditions and internal capabilities.

Google Cloud operates in a highly competitive market with rapid technological advancements. Understanding customer needs for scalable and secure cloud solutions is crucial.

Using data and trends to inform strategic decisions.

Analyzing trends in cloud adoption, security requirements, and industry-specific needs helps Google Cloud identify opportunities for growth and areas where they can provide unique value.

Setting clear, measurable business outcomes.

Google Cloud aims to achieve a 25% increase in enterprise adoption of its cloud services within the next two years by enhancing its product offerings and expanding its market reach.

Deciding where to focus resources and explaining why certain paths are chosen.

Prioritizing the development of AI and machine learning capabilities tailored to industry-specific solutions to address top enterprise blockers. This decision is based on market demand and internal strengths in these technologies.

Roadmap: The Plan

The roadmap translates the strategic vision into a plan, providing an 18-24 month overview of key milestones. It breaks down big problems into manageable tasks, ensuring alignment with overall goals. A good roadmap is not just a schedule; it’s a strategic tool that shows how initiatives drive business success, allowing for flexibility and adaptation while staying true to the mission and vision.

For example:

  • Q1 2024: Launch Enhanced AI and Machine Learning Tools
    • Initiative: Develop and release new AI and machine learning tools designed to address specific industry needs, such as healthcare analytics and financial risk modeling.
    • Milestone: Achieve a 10% increase in customer adoption of AI services within three months of launch.

  • Q2 2024: Expand Data Security Solutions
    • Initiative: Introduce advanced data security features, including encryption and threat detection, to meet growing cybersecurity demands.
    • Milestone: Secure certifications for new security standards and onboard 50 new enterprise customers by the end of the quarter.

  • Q3 2024: Integrate Cloud Services with Major ERP Systems
    • Initiative: Develop seamless integrations with leading ERP systems to streamline operations for enterprise customers.
    • Milestone: Complete integration with three major ERP platforms and pilot with 10 key clients.

  • Q4 2024: Launch Global Data Centers Expansion
    • Initiative: Open new data centers in strategic locations worldwide to enhance global reach and reliability.
    • Milestone: Increase data center footprint by 20% and reduce latency by 15% for international clients.

  • Q1 2025: Roll Out Customer Success Programs
    • Initiative: Implement customer success programs aimed at improving user experience and satisfaction.
    • Milestone: Achieve a 15% increase in customer satisfaction scores and reduce churn rate by 10%.

Achieving Alignment and Focus

Aligning the mission, vision, and strategy, and clearly communicating them, helps maintain organizational focus. This alignment prevents reactive firefighting and directional resets that consume resources. By fostering proactive planning and execution, organizations can drive meaningful progress toward business goals.

Conclusion

Effective roadmaps and work plans stem from a well-aligned mission, vision, and strategy. This alignment provides a stable foundation, clear goals, and a practical path to achieve them. By maintaining this alignment, organizations can move from reactive to proactive operations, regain stakeholder trust, and achieve sustainable success.

Thanks to Michale Windsor for his advice on this and other topics over the years

Transitioning from Reactive to Proactive

Building Effective, Evidence-Based Roadmaps for Business Success

In defining roadmaps, the primary goal is to deliver on product objectives that will generate revenue. These roadmaps are prioritized to achieve business goals as swiftly as possible. Typically, the prioritized items listed on roadmaps are broadly visible to customers but often depend on behind-the-scenes work that is not customer-visible, determined when building out work plans.

These work plans allocate time for roadmap items, customer obligations, and technical debt. For instance, a typical allocation might be 70% for roadmap items and their dependencies, 20% for customer obligations, and 10% for technical debt. In healthy organizations, these target allocations will ebb and flow but are largely adhered to, with the roadmap representing the largest commitment in the allocation.

When this is not the case, work plans are treated as roadmaps. In these situations, organizations are in a reactive position, where the only work that gets done is what has been committed to customers, and addressing technical debt has gotten so bad it is already causing harm to the business. These organizations are reactive and are always in a fight-or-flight mode.

Such reactive modes become a byproduct of the lack of structure in building the roadmap, the disconnect in the work supporting that roadmap, and a lack of recognition of the current situation regarding customer commitments. This, in turn, creates a situation where leadership loses faith in the organization’s ability to produce a compelling product roadmap.

This can be the beginning of a death spiral, where the expectation is that every item in the work plan is mapped back to customer revenue. This results in the items necessary to achieve forward growth business objectives being de-prioritized, with the committed sales funnel driving all engineering spend. The work plan and the management of it replace the accountability to build a business and associated engineering investment with a list of customer asks. As a consequence, the organization can never address technical debt or implement the kind of product design pivots needed to tackle issues like improving the size of deals through expanding its footprint within customers, reducing the cost of goods sold, time to close deals, or creating new parallel go-to-market strategies.

While there is no one-size-fits-all approach to product planning, the principles outlined here offer valuable lessons that can be adapted to various organizational contexts.To transition to a more proactive and accountable roadmap-driven model, organizations must commit to a disciplined approach where a significant portion of the effort is dedicated to developing a product that can deliver business outcomes, supported by a robust body of customer evidence. 

This shift also involves ensuring alignment with internal stakeholders, such as product, engineering, and executive leadership. Achieving this alignment is key to prevent the cycle of reactive firefighting and directional resets that consume resources and hinder progress on business objectives.

By returning to first principles and building a compelling, evidence-based roadmap, organizations can over time move the organizations culture from a reactive to a proactive operational mode, regain trust of all participants, and as a result drive meaningful progress toward their business goals.

Automating Non-Human Identities: The Future of Production Key Management 

Historically, key management was seen as activities involving hardware security modules (HSMs), manual tasks, and audits. This approach was part of what we termed ‘responsible key management.’ However, HSMs were impractical for many use cases, and these manual tasks, typical of IT processes, were often poorly executed or never completed, frequently causing security incidents, outages, and unexpected work.

Simultaneously, as an industry, we began applying cryptography to nearly all communications and as a way to protect data at rest. This led to the adoption of cryptography as the method for authenticating hardware, machines, and workloads to apply access control to their activities. As a result, today, cryptography has become a fundamental component of every enterprise solution we depend on. This shift led us to attempt to apply legacy key management approaches at the enterprise scale. The increased use of cryptography within enterprises made it clear these legacy approaches ignored the majority of keys we relied on, so we took a tactical approach and created repositories to manage the sprawl of secrets. While a step forward, this approach also papered over the real problems with how we use, track, and manage keys.

It is time for us as an industry to start viewing cryptography and key management not just as a tax we must pay but as an investment. We need to manage these keys in an automated and scalable way that helps us manage risk in our businesses.

To do this, we need to start with a question: What are these keys, anyway? Broadly, I think of three categories of keys: long-lived asymmetric secrets like those associated with certificate authorities, long-lived shared secrets used for encryption and authentication, and modern-day asymmetric key credentials for users, devices, and workloads. The greatest growth in keys has been in the last category, so let’s focus on that for the purpose of this discussion.

Modern Credentials and Their Management

Modern-day asymmetric key-based credentials are not always “certificates,” but they generally bind some claim(s) to an asymmetric key pair. These certificates can be formatted as JSON, ASN.1, CBOR, TLVs, X.509, JWT, or some other encoding. They serve various purposes:

  1. User Certificates:  Issued to individual users to authenticate their identity within an organization, these certificates provide secure access to corporate resources, such as an SSH certificate used by developers to access production. They bind a user’s identity to a cryptographic key pair, ensuring only authorized individuals access sensitive information and systems.
  2. Hardware Certificates: Assigned by manufacturers during production, these certificates uniquely identify hardware devices. They are often used to bootstrap the identity of machines or workloads, ensuring only authorized devices can access resources on your network.
  3. Machine Certificates: Common in operational IT environments, these certificates authenticate servers associated with domains, IP addresses, or device identifiers. They are typically used with TLS and for network access use cases like 802.1x, IKE, and various VPNs.
  4. Workload Certificates: In cloud and serverless environments, workload certificates perform access control close to the business logic to minimize security exposure and deliver on zero trust goals. These dynamic certificates often reflect both the underlying hardware and the workload running on it, acting like multi-factor authentication for devices. The frequent need to re-credential workloads makes issuing credentials mission-critical, as failure to do so can cause outages. This necessitates issuers in each failure domain (think of this as a cluster of clusters) hosting these workloads to ensure continuous operation.

What we can take from this is that we have been approaching credentials incorrectly by treating them as a key management problem. This approach is akin to using password managers for hardware, machines, and workloads, whereas, for users, we have moved toward multi-factor authentication and non-password-based authenticators.

Towards Automated and Scalable Key Management

If password managers or key vaults are not the right solution for machine authentication, what is? The answer is simpler than it might seem. Just as with users, these cases require built-for-purpose Identity Providers (IDPs). This is especially true for workloads, which dynamically spin up and down, making durable identifiers impractical. An IDP becomes a security domain for a given deployment, ensuring that workloads are accessible only by appropriate resources. This setup limits attackers’ lateral movement, allows for nearly instant granting and removal of access, minimizes the impact of compromises, and enables easy federation between deployments—all while providing a central point for identity governance and ensuring the cryptographic keys associated with credentials are well-managed and protected.

Getting Started

Modernizing key management starts with measurement. Identify the most common types of keys in your secret vaults, typically workload-related credentials. Deploy a workload-specific IDP, such as those enabled via SPIFFE, to transition these credentials out of the secret manager. Over time, the secret manager will store static secrets like API keys for legacy systems, while dynamic credentials are managed appropriately.

Prevent using your secret manager as an IDP from the start, especially for new systems. Teams responsible for the operational burden of these systems usually support this change, as automated end-to-end credentialing of workloads is more agile, scalable, and secure. This results in fewer outages and security incidents related to secret managers and non-production quality dependencies.

From this point, the process becomes a cycle of identifying where static secrets or long-lived credentials are funneled through your secret manager and moving them to built-for-purpose credential lifecycle management solutions.

Multi-factor authentication for workloads

Adopting a purpose-built IDP workload solution is a good start, but keys can still be stolen or leaked. For machines and workloads, use hardware attestations. Built-in hardware authenticators, such as Trusted Platform Modules (TPMs), create and secure keys within the device, ensuring they never leave. TPMs also verify device integrity during boot-up, adding an extra layer of security. This combination provides stronger multi-factor authentication without the usability issues associated with similar patterns for user authentication.

Avoiding Common Mistakes

The most common mistake organizations make is applying existing systems to workload credential management problems without fully analyzing operational, scale, uptime, and security needs. For example, enterprise PKI teams might mandate using their existing CA infrastructure for managing workload credentials, assuming a certificate suffices. However, this often violates the principle of least privilege and struggles with dynamic identification required for workloads.

Existing credential issuance systems are designed for static, long-lived subjects, such as directory names, DNS names, or IP addresses, which don’t change frequently. In contrast, workload credentials may change every few minutes. Provisioning devices like network appliances before assigning durable identifiers adds to this challenge. New workload-based systems, like SPIFFE, assign identifiers based on runtime elements, preventing the same bad practices that led to secret sprawl and mismanaged key problems.

Reducing Reliance on Shared Secrets 

Moving away from shared secrets won’t eliminate the need for secret vaults but will significantly reduce the problem’s scope. As systems modernize, password-based authenticators will be updated or deprecated. Over time, we will see fewer shared, long-lived secrets used for workload identity, driven by zero trust and least privilege principles.

At the same time, we can do much to improve overall key management practices in production systems. However, that’s a topic for another post.

Closing Note

The challenges and opportunities in modern key management are significant, but by leveraging innovative solutions and focusing on automation and scalability, we can make substantial progress. As we adopt built-for-purpose Identity Providers (IDPs) and hardware attestations, it’s important to have the right tools and frameworks in place to succeed.

I  have been working with SPIRL, a company focused on making the right thing the easy thing for developers, operations, and compliance. I see firsthand how the right platform investments can simplify the creation and operation of production systems. SPIRL achieves this by authoring and adopting open standards for identity and building reliable, scalable infrastructure that provides greater visibility and control.

Even if you don’t use SPIRL, by focusing on these principles, organizations can better manage the complexities of modern workload-related key and credential management, ensuring greater productivity and security.

Rethinking Security in Complex Systems

Over the last few decades, we seem to have gotten better at the micro aspects of security, such as formally verifying protocols and designing cryptographic algorithms, but have worsened, or at least failed to keep up at the macro aspects, such as building and managing effective, reproducible, risk-based security programs.

This can probably be attributed to both the scale of the systems we now depend on and perhaps even more to human factors. The quote, “Bureaucracy defends the status quo long past the time when the quo has lost its status,” is relevant here. Security organizations grow organically, not strategically, usually in response to a past failure or a newly recognized risk, which ultimately results in new teams that exist forever increasing the overall load of the security organization on the business. The role of these organizations typically expands over time to justify this load transforming them into data-gathering organizations. This explains why enterprises have so many dashboards of data that fail to drive action. Organizations are overwhelmed and struggle to understand where their risks lie so they can effectively allocate their limited resources toward achieving security-positive outcomes.

Building on this foundation of understanding risks, then there is the question of how we manage the success of security organizations in managing risk, as the saying goes, “If you can’t measure it, you can’t manage it.” The issue is that organizations often turn these measurements into metrics of success, which seems rational on the surface. However, in practice, we encounter another adage: “When a measure becomes a target, it ceases to be a good measure,” something that is especially true in security. For example, we often measure a security program for responding to an incident well, shipping new capabilities or passing audits, but this is an incomplete picture, using audits as an example, they are designed to fit every organization, which takes us to the saying, “If you try to please all, you please none.” In the case of security, you are guaranteed to be missing major issues or, at an absolute minimum, prioritizing activity over effectiveness with this approach. To make this more concrete this incentivizes successful audits (despite the incomplete picture this represents), and seeing the number of CVEs identified go down (despite these figures almost always being false positives) which in turn mislead organizations into believing they are more secure than they actually are.

Transitioning from metrics a combination of metrics, technology, and continual improvement, can help with some of the scale problems above, for example, AI shows promise for triaging issues and accelerating the review of low-level issues that fit neatly into a small context window but at the same time give a false sense of security. The human problems, on the other hand, are something we cannot simply automate; the best we can do is to rethink the way we build organizations so that empowerment and accountability are woven into how the organization operates. This will require ensuring those who take on that accountability truly understand how systems work and build their teams with a culture of critical, continual improvement. The second-order effects of these scale tools is de-skilling of the creators and operators of the tools — for example, I am always blown away how little modern computer science graduates understand of the way the systems they write code on operate — even from the most prestigious schools. We must also make continual education a key part of how we build our organizations.

Finally, focusing on system design, to support this from a design standpoint, we also need to consider how we design systems. The simplest approach is to design these systems in the most straightforward way possible, but even then, we need to consider the total operational nature of the system as part of the design. In user experience, we design for key users; early on, we talk about the concept of user stories. In system design, we often design and then figure out how that system will be operated and managed. We need to incorporate operational security into our designs. Do they emit the information (metrics and contextualized logs) required to monitor, do we provide the tools to use them to do this detection? If not how are their users to know they are operating securely? For example, do they enable monitoring for threats like token forgery over time? We must make the systems we ship less dependent on the human being involved in their operation, recognizing that they are generalists, not specialists, and give them simple answers to these operational issues if we want to help them achieve secure outcomes.

In conclusion, organizations need to look at the technological, and human aspects of their security program as well as their technology choices continuously and critically. This will almost certainly involve rethinking the metrics they use to drive security efforts, while also building a workplace culture centered on empowerment, accountability, continuous improvement and fundamentally integrating the security lifecycle of systems into the design process from day one. By adopting these strategic pillars, organizations can build more resilient, effective, and adaptable security programs that are equipped to meet the challenges of today’s dynamic environment.

Navigating Security and Innovation

I started my career at Microsoft in the 90s, initially working on obscure third-party networking issues, printing, and later Internet Explorer. Back then, even though I had gotten into computers through what today would probably be categorized as security research, it would have been nearly impossible in those days to find someone who wanted to hire me for those skills. I left the company a few years later and found my first job in computer security and never looked back.

I came back to Microsoft in 2000, but this time I was working on authentication and cryptography. This was just a few years before the infamous security standdown that was kicked off by the Bill Gates’ Trustworthy Computing Memo. This gave me a firsthand view into what led to that pivotal moment and how it evolved afterward. The work that was done during the subsequent years changed the way the industry looked at building secure software.

The thing is, at the same time, the concepts of third-party operated applications (SaaS) and shared computing platforms (Cloud Computing) were gaining traction. The adoption of these concepts required us to rethink how we build secure software for these new use cases and environments. For example, this shift introduced the concepts of massive multi-tenancy and operational shared fate between customers and their providers and made shipping updates much easier on a large scale. This accelerated rate of change also drove the need to rethink how we manage a security program, as the approaches used by the traditional software business often did not apply in this fast-paced world. My initial exposure to this problem came from my last role at Microsoft, where I was responsible for security engineering for the Advertising business.

The company had not defined mature approaches to how to secure online services yet which created the opportunity for us to find ways to use similar but different models that could fit into the realities of these new environments, which had both positive impacts (agility to remediate) and negative impacts (scale and speed) and through that, try to build a security program that could work in this new reality.

I share that context to give a bit of color to the bias and background I bring to the current situation Microsoft finds itself in. Having lived through what was surely the world’s single largest investment in making software and services secure to that point, and having spent decades working in security, I have had the chance to see several cycles in the way we look at building systems.

A New Chapter Unfolds

All things in life have natural cycles, and the same is true for how the industry views security. Organizations ebb and flow as a result of market changes, leadership changes, and as customer demands evolve. In the case of security, there is also the false idea that it is a destination or a barrier to delivering on business objectives that factors into these cycles.

As a result, it’s no surprise that over the following decade and a half, we saw Microsoft lessen its commitment to security – especially in the fast-moving and growing opportunity for cloud services. As an outsider looking in, it felt like they lost their commitment around the time they began viewing security as a business rather than the way you keep your promises to customers. At some point it felt like every month, you would see outages related to mishandling basics, with increased frequency of the same type of issues, for example, multi-tenancy violations, one right after another.

This increase in basic security issues was paired with poor handling of incidents which is why it was no surprise to see the incident known as STORM-0558 come about. As soon as this incident became public it was clear what had happened, the organization adopted the most convenient practices to ship and operate and under-invested in the most basic lessons of the preceding two decades in a trade-off that externalized the consequences of those decisions to their customers.

Microsoft had no choice but to respond in some way, so three months after the issue became public they announced the Secure Future Initiative which can be summarized as:

  1. Apply AI to Scale Security
  2. Using More Secure Defaults
  3. Rolling Out Zero Trust Principles 
  4. Adopting Better Key Management
  5. Consistency in Incident Response
  6. Advocating for Broader Security Investments

This was lauded by some as the next Trustworthy Computing initiative, but on the surface, that’s a far cry from the kind of investment made during those days. To me it sounds more like a mix of how Microsoft intends to meet the CISA Secure By Design initiative and how they think they need to respond to the STORM-0558 incident. There is always a question of messaging versus reality, so I personally held, and still do, hope that this was the first organizational sign of an awakening that could lead to a similar level of investment.

Shortly after the Storm-0558 incident, I appeared on “Security Conversations” with Ryan Naraine. We discussed how the situation might have unfolded and identified the root causes—my answer was lack of security leadership. Therefore, it was no surprise to me to see that when the CSRB came out, the reviewers reached the same conclusion.

“Microsoft’s security culture was inadequate and requires an overhaul, particularly in light of the company’s centrality in the technology ecosystem and the level of trust customers place in the company to protect their data and operations.” 

Despite these challenges, it’s important to recognize that not all teams within Microsoft have been equally impacted by these systemic issues. As William Gibson famously stated, ‘The future is already here — it’s just not very evenly distributed.’ This is evident within Microsoft, where the Windows team, for example, appears to have continued to do well relative to its peers.

Beyond Metrics to Meaningful Reform

The other day Satya, someone who did what many thought was impossible by turning Microsoft around from the company it became during the Balmer years, wrote an internal memo, that amongst other things stated: 

“If you’re faced with the tradeoff between security and another priority, your answer is clear: Do security.”

In this memo, he also stated:

“In addition, we will instill accountability by basing part of the compensation of the senior leadership team on our progress towards meeting our security plans and milestones.”

At the same time, Charlie Bell released more information on the intended implementation of the Microsoft Secure Future Initiative which puts some more meat on what the initial announcement promised, and how they are expanding it as a result of the CSRB findings, the most impactful organizational change probably being the decision to add deputy CISOs in product teams.

So the good news is that this does signal that Microsoft has heard the message from the CSRB. These are good first steps in addressing the cultural issues that have contributed to Microsoft’s broad decline as a leader in security over the last several decades.

The question then becomes: does the executive leadership under Satya understand how their personal choices, organizational structure, approach to culture, approach to staffing, and overall business management decisions have contributed to the current situation? In my experience, the types of changes needed to achieve the transformational shifts required to address security neglect often necessitate leadership changes. Merely issuing a strong directive from the CEO and allocating additional budget is seldom enough to materially create the needed changes to a company’s approach to security.

What concerns me is the wording of the statement about tying compensation to security. What it actually does is link compensation to progress in meeting their “security plans and milestones”.

So the question becomes, do those security plans and milestones manifest into the technical changes and cultural changes needed to address the problem and the hole they have dug for themselves?

If we look at the value they will realize by executing on work items called out in the Security First Initiative as the definition of what they believe their problems are I have some doubts.

Bridging Visions and Realities

If we look at the CSRB report and classify the issues identified into 5 categories we see that the majority of the identified issues were related to design decisions. 

CategoryExampleProportion
Security Design IssuesInadequate cryptographic key management, failure to detect forged tokens40%
Incident ResponseDelays in updating the public on the true nature of incidents, slow response to key compromise20%
Operational IssuesFailure to rotate keys automatically, using aging keys, allowing consumer keys to access enterprise data20%
Vulnerability ManagementLack of controls to alert for aging keys, not detecting unauthorized token use10%
Risk ManagementInadequate security practices compared to other CSPs, not having a detection system for forged tokens10%

Key:

  • Security Design Issues: This includes fundamental flaws in how security measures were architected.
  • Incident Response: Refers to the overall handling and transparency of the incident, including the timeliness and accuracy of public communications.
  • Operational Issues: These are failures in the operational handling of security mechanisms.
  • Vulnerability Management: Concerns the lack of proactive measures to detect and mitigate vulnerabilities.
  • Risk Management: Describes the overall approach to assessing and managing risks, highlighting a lack of comparable security controls relative to industry standards.

If we compare those issues with the areas of investment outlined above in the Secure Futures Investment announcement it’s not clear to me that they would have made a meaningful dent to the root cause of the identified issues, and more generally they don’t seem to look at the larger systemic issues Microsoft is experiencing at all.

Let’s just take the plan to use AI to help scale out their security program as an example, certainly a worthwhile initiative but the root cause here was a design issue. Today’s AI systems are good at automating the tasks we understand how to do well and even then they struggle at that, and that’s not even touching on the more nuanced issue of “design”.

For example, there is a great paper from Dan Boneh and his students that shows that code from solutions like  OpenAI’s Codex, may contribute to the creation of less secure code. Another research effort focused on GitHub CoPilot reported similar findings.

This doesn’t mean that this technology isn’t promising or that it can’t help manage security issues in the massive software systems we rely on today. However, it’s unlikely to significantly impact the types of issues currently seen at Microsoft. That’s why the CSRB has emphasized the need for a cultural overhaul in how Microsoft approaches security organizationally. Satya Nadella’s message about prioritizing security is a step in the right direction, and the Charlie blog post does outline a systemization of how they will go about that but meaningful cultural changes and making Microsoft a leader in security again will require much more than a blog post and incentives to execute in a timely manner.

Conclusion

Microsoft was once known for its poor track record in building secure software and services, they made huge investments and became a leader, over time, they lost their edge. The Secure Future Initiative marks a step forward, as does the recent memo from Satya Nadella prioritizing security above all else. However, true progress will depend on Microsoft’s ability to roll out organizational changes and rebuild a culture that prioritizes security not just meeting milestones.

The good news is that they have the talent, the resources, and still some of the muscle memory on how to get this done at scale. If Satya can turn around the company from those ailing Ballmer years, I have faith he can address this issue too.

Restoring Memories

As the old saying goes, “You can take the boy out of the farm, but you can’t take the farm out of the boy.” Although I was raised in metro Seattle, my father grew up on a farm in Eastern Washington, in the city of Walla Walla. We made regular trips there during my childhood, especially when my great-grandmother lived there by herself. These visits were more than just familial obligations; they were my introduction to values like hard work, family, and the joy of being close to the earth—values that have profoundly influenced who I am today.

I also fondly recall visits and weekend trips to my uncle’s, where my cousins and I would ride in the bed of his Chevy 3100, sliding around as we drove down the road, laughing and jostling around — back when the world was less concerned about safety regulations. Those moments of freedom are treasures I still carry.

This sense of nostalgia may explain why, after a career in information security, I felt compelled to restore several late 19th and early 20th-century safes. A few years ago, I embarked on a project to restore a Dodge Power Wagon, which encapsulates the strength, reliability, and spirit of those farmstead adventures.

Power Wagons Origin Story

The Dodge Power Wagon earned its legendary status on American farmlands shortly after World War II. Returning servicemen recognized the potential of the Dodge WCs they had used in the war. These vehicles could navigate the rugged farm terrain much like the battlefields they’d left behind. Equipped with a Power Take Off (PTO) and winch, the Dodge WC was not just a means of transport; it transformed into a tool that could till the fields or haul away a fallen tree. Recognizing its demand, Dodge released a civilian version—the Dodge Power Wagon.

The Power Wagon was the first mass-produced civilian 4×4 vehicle, ultimately symbolizing an era when durability and utility were paramount in vehicle design. Its introduction led to the widespread adoption of 4×4 capabilities by nearly every truck manufacturer.

Power Wagons also played a vital role in developing early infrastructure, aiding transportation and communication networks for rail and telephone companies. Coachbuilders would modify these trucks by combining two Power Wagons, to create multi-door vehicles that could transport crews to remote or difficult-to-access sites. It wasn’t until International Harvester introduced the Travelette in 1957 that a production truck with three or more doors became available.

Anyone who has ever done a high-end restoration of a vehicle will tell you it takes way longer than you expect, my project is no different. While we are getting close to the end of the project, after all it runs, drives, stops, has been put back together, and has been painted, and is now getting its interior done I would surely be wrong with whatever guess I gave.

​​
My Power Wagon Restoration

Restoring this piece of history isn’t just about reviving a classic vehicle; in a way, it’s a tribute to my father, my family’s legacy. It’s a pilgrimage back to my roots, a way to share my family’s story with my children and, eventually, my grandchildren.

Navigating Content Authentication In the Age of Generative AI

In 1995, SSL was introduced, and it took 21 years for 40% of web traffic to become encrypted. This rate changed dramatically in 2016 with Let’s Encrypt and the adoption of ACME, leading to an exponential increase in TLS usage. In the next 8 years, adoption nearly reached 100% of web traffic. Two main factors contributed to this shift: first, a heightened awareness of security risks due to high-profile data breaches and government surveillance, creating a demand for better security. Second, ACME made obtaining and maintaining TLS certificates much easier.

Similarly, around 2020, the SolarWinds incident highlighted the issue of software supply chain security. This, among other factors, led to an increase in the adoption of code signing technologies, an approach that has been in use at least since 1995 when Microsoft used this approach to help deal with the problem of authenticity as we shifted away from CDs and floppy disks to network-based distributions of software. However, the complexity and cost of using code signing severely limited its widespread use, and where it was used, thanks to poor tooling, key compromises often led to a failure for most deployments to achieve the promised security properties. Decades later, projects like Binary Transparency started popping up and, thanks to the SolarWinds incident, projects that spun out of that like Go ChecksumDB, SigStore, and SigSum projects led to more usage of code signing.

Though the EU’s digital signature laws in 1999 specified a strong preference for cryptographic-based document signing technologies, their adoption was very limited, in part due to the difficulty of using the associated solutions. In the US, the lack of a mandate for cryptographic signatures also resulted in an even more limited adoption of this more secure approach to signing documents and instead relied on font-based signatures. However, during the COVID-19 pandemic, things started changing; in particular, most states adopted remote online notary laws, mandating the use of cryptographic signatures which quickly accelerated the adoption of this capability.

The next shift in this story started around 2022 when generative AI began to take off like no other technology in my lifetime. This resulted in a rush to create tools to detect this generated content but, as I mentioned in previous posts [1,2], this is at best an arms race and more practically intractable on a moderate to long-term timeline.

So, where does this take us? If we take a step back, what we see is that societally we are now seeing an increased awareness of the need to authenticate digital artifacts’ integrity and origin, just like we saw with the need for encryption a decade ago. In part, this is why we already see content authentication initiatives and discussions, geared for different artifact types like documents, pictures, videos, code, web applications, and others. What is not talked about much is that each of these use cases often involves solving the same core problems, such as:

  • Verifying entitlement to acquire the keys and credentials to be used to prove integrity and origin.
  • Managing the logical and physical security of the keys and associated credentials.
  • Managing the lifecycle of the keys and credentials.
  • Enabling the sharing of credentials and keys across the teams that are responsible for the objects in question.
  • Making the usage of these keys and credentials usable by machines and integrating naturally into existing workflows.

This problem domain is particularly timely in that the rapid growth of generative AI has raised the question for the common technology user — How can I tell if this is real or not? The answer, unfortunately, will not be in detecting the fakes, because of generative AIs ability to create content that is indistinguishable from human-generated work, rather, it will become evident that organizations will need to adopt practices, across all modalities of content, to not only sign these objects but also make verifying them easy so these questions can be answered by everyday users.

This is likely to be accelerated once the ongoing shifts take place in the context of software and service liability for meeting security basics. All of this seems to suggest we will see broader adoption of these content authentication techniques over the next decade if the right tools and services are developed to make adoption, usage, and management easy.

While no crystal ball can tell us for sure what the progression will look like, it seems not only plausible but necessary in this increasingly digital world where the lines between real and synthetic content continue to blur that this will be the case.

Update: Just saw this while checking out my feed on X and it seems quite timely 🙂