Speeding Up Development and Navigating Security Risks

In software development, time is often of the essence. Developers are constantly pushed to deliver faster and more efficiently. Tools like GitHub Copilot have emerged, promising to accelerate coding tasks significantly. A study by GitHub found that developers using Copilot completed tasks up to 55% faster compared to those who didn’t use it [1]. However, as with speeding on the road, increased velocity in coding can lead to more significant risks and potential accidents.

The Need for Speed and Its Consequences

When driving at high speeds, the time available to react to unexpected events diminishes. Similarly, when coding tasks are completed rapidly with the help of AI copilots, developers have less time to thoroughly review and understand the code. This accelerated pace can lead to over-reliance on AI suggestions and decreased familiarity with the codebase. This lack of deep understanding might obscure potential issues that could be caught with a slower, more deliberate approach.

Narrowing the Focus

At higher speeds, drivers experience “tunnel vision,” where their field of view narrows, making it harder to perceive hazards. In coding, this translates to a reduced ability to catch subtle issues or security vulnerabilities introduced by AI-generated suggestions. A study by New York University highlighted this risk, finding that nearly 40% of code suggestions by AI tools like Copilot contained security vulnerabilities [2].

For a basic example of how this could play out, consider an authentication module built from Copilot’s suggestions that skips input sanitization, opening the door to vulnerabilities like SQL injection or cross-site scripting (XSS). This omission might go unnoticed until the feature is in production, where it could lead to a security breach.
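To make the risk concrete, here is a minimal sketch in Python using an in-memory SQLite database (the table and inputs are invented for illustration) showing how an unsanitized query differs from a parameterized one:

```python
import sqlite3

def find_user_unsafe(conn, username):
    # VULNERABLE: user input is interpolated directly into the SQL string.
    # An input like "' OR '1'='1" changes the meaning of the query.
    query = f"SELECT id FROM users WHERE username = '{username}'"
    return conn.execute(query).fetchall()

def find_user_safe(conn, username):
    # SAFE: parameterized query; the driver treats the input strictly as data.
    return conn.execute(
        "SELECT id FROM users WHERE username = ?", (username,)
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, username TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")

malicious = "' OR '1'='1"
print(len(find_user_unsafe(conn, malicious)))  # 2 -- every row leaks
print(len(find_user_safe(conn, malicious)))    # 0 -- no such user
```

Both functions are syntactically valid and "work" in a demo, which is exactly why a rushed review can wave the first one through.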

This reduction in situational awareness can be attributed to the lack of contextual knowledge AI copilots have. While they can provide syntactically correct code, they lack a nuanced understanding of the specific application or environment, leading to contextually insecure or inappropriate suggestions.

The Impact of High-Speed Errors

Higher driving speeds result in more severe accidents due to the increased force of impact. In coding, the severity of errors introduced by AI suggestions can be significant, especially if they compromise security. The GitHub study noted improvements in general code quality, as Copilot often follows best practices and common coding patterns [1]. However, the rapid pace and reliance on AI can mean fewer opportunities for developers to learn from their work and catch potential issues, leading to severe consequences in the long run.

Balancing Speed and Safety

To harness the benefits of tools like Copilot while mitigating the risks, developers and teams can adopt several strategies:

  1. Enhanced Code Reviews: As the volume of code increases with the use of AI copilots, rigorous code reviews become even more crucial. Teams should ensure that every piece of code is thoroughly reviewed by experienced developers who can catch issues that the AI might miss.
  2. Integrating Security Tools: Using AI-based code reviewers and linters can help identify common issues and security vulnerabilities. These tools can act as a first line of defense, catching problems early in the development process.
  3. Continuous Learning: Establishing a culture of continuous learning ensures that developers stay updated with the latest security practices and coding standards. This ongoing education helps them adapt to new challenges and integrate best practices into their work.
  4. Practical Update Mechanisms: Implementing reliable and practical update mechanisms ensures that when issues are found, they can be addressed quickly and effectively. This proactive approach minimizes the impact of deployed vulnerabilities.
  5. Balanced Use of AI: Developers should use AI copilots as assistants rather than crutches. By balancing the use of AI with hands-on coding and problem-solving, developers can maintain a high level of familiarity with their codebases.
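As a concrete starting point for the "Integrating Security Tools" strategy above, even a naive secret-scanning check is only a few lines. The patterns and sample below are illustrative stand-ins for what dedicated scanners like gitleaks or truffleHog do far more thoroughly:

```python
import re

# Illustrative patterns only; real scanners ship hundreds of tuned rules.
SECRET_PATTERNS = [
    ("AWS access key ID", re.compile(r"AKIA[0-9A-Z]{16}")),
    ("Private key header", re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----")),
    ("Hardcoded password assignment",
     re.compile(r"password\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE)),
]

def scan(text):
    """Return (rule name, line number) findings for a blob of code."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS:
            if pattern.search(line):
                findings.append((name, lineno))
    return findings

sample = 'db_password = "hunter2"\nkey = "AKIAABCDEFGHIJKLMNOP"\n'
for name, lineno in scan(sample):
    print(f"line {lineno}: {name}")
```

Wired into a pre-commit hook or CI step, a check like this catches the easy mistakes before a reviewer ever sees the diff.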

Conclusion

The advent of AI coding tools like GitHub Copilot offers a significant boost in productivity, enabling developers to code faster than ever before. However, this speed comes with inherent risks, particularly concerning security vulnerabilities and reduced familiarity with the code.

At the same time, we must recognize that attackers can also leverage these sorts of tools to accelerate their malicious activities, making it crucial to integrate these tools ourselves. By implementing robust code review processes, integrating security analysis tools into the build pipeline, and fostering a culture of continuous learning, teams can effectively manage these risks and use AI tools responsibly.

The game is getting real, and now is the time to put the scaffolding in place to manage both the benefits and the risks of this new reality.

Groundhog Day: Learning from Past Key and Credential Compromises

As the saying goes, "Those who cannot remember the past are condemned to repeat it." Looking back at the last decade, it seems we are caught in our own little Groundhog Day, reexperiencing the consequences of weak authentication and poor key management over and over.

It is not that we don’t know how to mitigate these issues; it’s more that, as organizations, we tend to avoid making uncomfortable changes. For example, the recent spate of incidents involving Snowflake customers being compromised appears to be related to Snowflake not mandating multi-factor authentication or strong hardware-backed authenticators like Passkeys for its customers.

At the same time, not all of these key and credential thefts are related to users; many involve API keys, signing keys used by services, or code signing keys. These are keys we have failed to manage in a way that matches the risks they represent to our systems and customers.

Timeline of Notable Incidents

Table of Incidents

| Incident | Type of Compromise | Description |
| --- | --- | --- |
| Stuxnet (2010) | Code Signing Keys | Utilized stolen digital certificates from Realtek and JMicron to authenticate its code, making it appear legitimate. |
| Adobe (2012) | Code Signing Keys | Attackers compromised a build server and signed malware with valid Adobe certificates. |
| Target (2013) | Developer Credentials | Network credentials stolen from a third-party HVAC contractor via malware on an employee’s home computer. |
| Neiman Marcus (2013) | Developer Credentials | Malware installed on systems, possibly through compromised credentials or devices from third-party vendors. |
| Home Depot (2014) | Developer Credentials | Malware infected point-of-sale systems, initiated by stolen vendor credentials, potentially compromised via phishing. |
| Equifax (2017) | API Keys | Exploitation of a vulnerability in Equifax’s website application exposed API keys used for authentication. |
| CCleaner (2017) | Code Signing Keys | Attackers inserted malware into CCleaner’s build process, distributing malicious versions of the software. |
| Ticketmaster (2018) | Developer Credentials | Malware from a third-party customer support product led to the compromise of payment information. |
| ASUS Live Update (2018) | Code Signing Keys | Attackers gained access to ASUS’s code signing keys and distributed malware through the update mechanism. |
| Google Cloud Key Leak (2019) | API Keys | Internal tool inadvertently exposed customer API keys on the internet, leading to potential data exposure. |
| Facebook Instagram API Breach (2019) | API Keys | Plaintext passwords were accessible to employees due to misuse of an internal API. |
| DoorDash Data Breach (2019) | API Keys | Unauthorized access to systems included the compromise of API keys, exposing sensitive customer data. |
| Mission Lane (2020) | Developer Credentials | Malware introduced into the development environment from a developer’s compromised personal device. |
| BigBasket (2020) | Developer Credentials | Data breach with over 20 million customer records exposed, suspected initial access through compromised developer credentials. |
| Twitter API Key Compromise (2020) | API Keys | Attackers gained access to internal systems, possibly compromising API keys or administrative credentials. |
| Amazon AWS Key Exposure (2021) | API Keys | Misconfigured AWS S3 bucket led to the exposure of API keys and sensitive customer data. |
| Nvidia (2022) | Code Signing Keys | Stolen code signing certificates were used to sign malware, making it appear legitimate. |
| Microsoft Storm-0558 (2023) | Signing Keys | Attackers gained access to email accounts of government agencies and other organizations by forging authentication tokens using a stolen Microsoft signing key. |

When we look at these incidents, we see a few common themes, including:

  1. Repetitive Failures in Security Practices:
    • Despite awareness of the issues, organizations continue to face the same security breaches repeatedly. This suggests a reluctance or failure to implement effective long-term solutions.
  2. Resistance to Necessary Changes:
    • Organizations often avoid making uncomfortable but necessary changes, such as enforcing multi-factor authentication (MFA) or adopting stronger authentication methods like Passkeys. This resistance contributes to ongoing vulnerabilities and compromises.
  3. Diverse Sources of Compromise:
    • Key and credential thefts are not solely due to user actions but also involve API keys, service signing keys, and code signing keys. This highlights the need for comprehensive key management and protection strategies that cover all types of keys.
  4. Broader Key Management Practices:
    • Moving from shared secrets to asymmetric credentials for workloads and machines can significantly enhance security. Asymmetric credentials, especially those backed by hardware, are harder to steal and misuse compared to shared secrets. Techniques like attestation, which is the device equivalent of MFA, can provide additional security by verifying the authenticity of devices and systems.
  5. Third-Party Risks:
    • Even if we deploy the right technologies in our own environment, we often ignore the security practices of our upstream providers. Several incidents involved third-party vendors or services (e.g., Target, Home Depot), highlighting the critical need for comprehensive third-party risk management. Organizations must ensure that their vendors adhere to strong security practices and protocols to mitigate potential risks.
  6. Mismanagement of API and Signing Keys:
    • Incidents such as the Google Cloud Key Leak and Amazon AWS Key Exposure point to misconfigurations and accidental exposures of API keys. Proper configuration management, continuous monitoring, and strict access controls are essential to prevent these exposures.
  7. Importance of Multi-Factor Authentication (MFA):
    • The absence of MFA has been a contributing factor in several breaches. Organizations need to mandate MFA to enhance security and reduce the risk of credential theft.
  8. Need for Secure Code Signing Practices:
    • The use of stolen code-signing certificates in attacks like Stuxnet and Nvidia underscores the importance of securing code-signing keys. Implementing hardware security modules (HSMs) for key storage and signing can help mitigate these risks.

Conclusion

Looking back at the past decade of key and credential compromises, it’s clear that we’re stuck in a loop, facing the same issues repeatedly. It’s not about knowing what to do—we know the solutions—but about taking action and making sometimes uncomfortable changes.

Organizations need to step up and embrace multi-factor authentication, adopt strong hardware-backed authenticators like Passkeys, and move workloads from shared secrets to hardware-backed asymmetric credentials for better security. We also can’t overlook the importance of managing third-party risks and ensuring proper configuration and monitoring to protect our API and signing keys.

Breaking free from this cycle means committing to these changes. By learning from past mistakes and taking a proactive approach to key and credential management, we can better protect our systems and customers – It’s time to move forward.

Integrating Security: Making Safe Software Development Seamless and Productive

As software progresses from the developer’s machine to staging and finally to production, it undergoes significant changes. Each environment presents unique challenges, and transitions between these stages often introduce security weaknesses. By integrating security practices early in the development process, we bridge these gaps and help ensure we deliver a secure product. If done right, we can also improve developer productivity.

The Journey of Software Through Different Stages

In the development stage, the focus is primarily on writing code and quickly iterating on features. Security is often an afterthought, and practices like deployments, hardening, key management, auditing, and authentication between components are usually minimal or nonexistent. As the software moves to staging, it encounters a more production-like environment, revealing a new set of problems. Dependencies, network configurations, and integrations that worked on the developer’s machine may fail or expose vulnerabilities in staging.

When the software finally reaches production, the stakes are higher. It must handle real-world traffic, maintain uptime, and protect sensitive data. Here, the absence of strong security measures overlooked in earlier stages can lead to significant issues. Weak authentication, lack of auditing, and insecure configurations become glaring problems that can result in breaches and compromises.

The Role of Security in the Software Lifecycle

Security compromises often occur in the gaps between development, staging, and production. These gaps exist because each stage is treated as an isolated entity, leading to inconsistencies in security practices. To secure the software supply chain, we must address these gaps from the very beginning.

Integrating security practices early in the development process ensures that security is not an afterthought but a core component of software development. Here’s how we can achieve this, in no particular order:

  1. Start from the beginning with a Threat Model
    The right place to start addressing this problem is with a threat model, involving everyone in its development so the lessons from its creation are factored into the product’s manifestation. This model also needs to continuously evolve and adapt as the project progresses. A well-defined threat model helps identify potential security risks and guides the implementation of appropriate security controls and design elements throughout software development.
  2. Early Implementation of Authentication and Authorization
    From the first line of code, understand how you will implement authentication and authorization mechanisms for both users and workloads. A good example of how to do this for workloads is SPIFFE, which provides a robust framework for securely identifying and authenticating workloads. SPIRL makes deploying and adopting SPIFFE easy, laying the groundwork for unifying authentication and authorization across various stages of development. Ensure that every component, service, and user interaction is authenticated and authorized appropriately. This practice prevents unauthorized access and reduces the risk of security breaches.
  3. Continuous Auditing and Monitoring
    Integrate auditing and monitoring tools into the development pipeline and runtime environments. By continuously monitoring code changes, dependencies, configurations, and runtime behavior, we can detect and address vulnerabilities early. Additionally, automated analysis tools that integrate into CI/CD pipelines, like those from Binarly, can perform security checks to ensure that vulnerabilities and back doors didn’t sneak their way in during the build. Synthetic load testing modeled after real use cases can help detect outages and abnormal behavior early, ensuring that issues are identified before reaching production, and can serve as useful production monitoring automation. 
  4. Automate Security Infrastructure
    Incorporate automated security infrastructure to offload security tasks from developers. Middleware frameworks, such as service meshes and security orchestration platforms, provide standard interfaces for security operations. For example, using a service mesh like Istio can automatically handle authentication, authorization, and encryption between microservices without requiring developers to implement these security features manually. This allows developers to focus on shipping code while ensuring consistent security practices are maintained. These frameworks enable faster deployment into production, ensuring that security checks are embedded and consistent across all stages.
  5. Implement Infrastructure as Code (IaC)
    Implement Infrastructure as Code (IaC) to ensure consistent configuration and reduce human error. By automating infrastructure management, IaC not only enhances the reliability and security of your systems but also significantly improves your ability to recover from incidents and outages when they occur. With IaC, you can quickly replicate environments, roll back changes, and restore services, minimizing downtime and reducing the impact of any disruptions.
  6. Build for Continuous Compliance
    Pair IaC with continuous compliance automation to create artifacts that demonstrate adherence to security policies, reducing the compliance burden over time. Do this by using verifiable data structures, similar to how Google and others use Trillian, to log all server actions verifiably. These immutable logs create a transparent and auditable trail of activities, improving traceability and compliance.
  7. Minimize Human Access to Production
    Minimize human access to production environments to reduce the risk of errors and unauthorized actions. For example, use a privileged access management system to implement access on demand, where temporary access is granted to production systems or data only when needed and with proper approvals. This ensures that developers can debug and maintain high availability without permanent access, enhancing security while allowing necessary interventions.
  8. Implement Code Signing and Supply Chain Monitoring
    Implement code signing to enable the verification of the authenticity and integrity of the software at every stage. Regularly audit your dependencies to identify and mitigate the risks that they come with. As Ken Thompson famously said, “You can’t trust code that you did not totally create yourself.” This highlights the importance of not relying on blind faith when it comes to the security of external code and components. 
  9. Proper Secret Management from Day One
    It’s common for organizations to overlook proper secret management practices from the outset and have to graft them on later. Ensure developers have the tools to adopt proper secret management from day one, emphasizing the use of fully managed credentials rather than treating credentials as secrets. Solutions like SandboxAQ’s Aqtive Guard can help you understand where proper secret management isn’t happening and solve the problem of last-mile key management. Tools like this turn secret management into a tool for risk management, not just sprawl management.
  10. Effective Vulnerability Management
    Vulnerability management is a critical aspect of maintaining a secure software environment. A key component of this is monitoring third-party dependencies, as vulnerabilities in external libraries and frameworks can pose significant risks. Implement tools that continuously scan for vulnerabilities in these dependencies and provide actionable insights without overwhelming your team with noise. By leveraging automated vulnerability management solutions, you can prioritize critical issues and reduce alert fatigue. These tools should integrate seamlessly into your development pipeline, allowing for real-time vulnerability assessments and ensuring that only relevant, high-priority alerts are raised. This approach not only enhances your security posture but also allows your team to focus on meaningful security tasks, rather than being bogged down by an excessive number of alerts.
  11. Shift Left with Developer-Friendly Security Tools
    Equip developers with tools and platforms that seamlessly integrate into their workflows, offering security features without adding friction. User-friendly and non-intrusive security tools increase the likelihood of early and correct adoption by developers. For instance, tools like GitHub’s Dependabot and Snyk help identify and fix vulnerabilities in dependencies, while CodeQL allows for deep code analysis to uncover security issues before they reach production. These tools make security a natural part of the development process, enabling developers to focus on writing code while ensuring robust security practices are maintained.
  12. Consider the Entire Lifecycle of The Offering
    Security is an ongoing process that extends beyond development and deployment. Regularly assess and enhance the security posture of your production environment by implementing frequent security reviews, patch management, and incident response plans. Ensure that security practices evolve alongside the software by utilizing comprehensive security playbooks. These playbooks should offer clear, repeatable steps for handling common security tasks and incidents and be updated regularly to address new threats and best practices. Crucially, feedback loops from incident response and post-mortems are essential for continuous improvement. These loops provide valuable insights into past issues, helping to prevent similar problems in the future and fostering a culture of ongoing enhancement.
  13. Profile Your System and Understand Bottlenecks
    Profile your system and understand where the bottlenecks are. Have plans to both prevent them and to respond when they are encountered. This proactive approach ensures that performance issues are addressed before they impact the user experience and that you have a strategy in place for quick remediation when problems do arise.
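To ground item 2 above: SPIFFE names workloads with URIs of the form spiffe://&lt;trust-domain&gt;/&lt;path&gt;. A minimal parser, deliberately simplified relative to the actual SPIFFE specification (which also restricts the allowed character set and forbids things like percent-encoding), might look like:

```python
from urllib.parse import urlparse

def parse_spiffe_id(spiffe_id):
    """Split a SPIFFE ID into (trust_domain, workload_path).

    Simplified relative to the SPIFFE specification; for illustration only.
    """
    parsed = urlparse(spiffe_id)
    if parsed.scheme != "spiffe":
        raise ValueError("not a spiffe:// URI")
    if not parsed.netloc:
        raise ValueError("missing trust domain")
    if parsed.query or parsed.fragment or parsed.port:
        raise ValueError("query, fragment, and port are not allowed")
    return parsed.netloc, parsed.path

domain, path = parse_spiffe_id("spiffe://prod.example.org/payments/api")
print(domain, path)  # prod.example.org /payments/api
```

The trust domain and path here are invented; in practice the SPIFFE Workload API hands workloads their identity documents, so application code rarely constructs these by hand.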

Conclusion

Early on, normalizing security practices across all stages of software development may be challenging, but it’s crucial. By consistently applying security measures from development to production, we can bridge the gaps that lead to vulnerabilities. This not only improves security but also enhances developer productivity.

Platforms and middleware that automate security checks, enforce policies, and provide clear visibility into security issues help developers stay productive while maintaining a robust security posture, reducing outages and security incidents. By removing these security concerns from the developer’s workflow, we allow them to concentrate on what they do best: writing code.

Making security the easy and natural choice for developers enhances both security and productivity, leading to more secure and reliable software. While it may be difficult to fully implement these practices initially, developing a model that includes continuous threat analysis and plans to address any gaps will ensure long-term success. In doing so, we make the right thing the easy thing.

Thanks to Amir Omidi for his feedback on this post

From Static to Dynamic: Adapting PKI for Cloud-Native Architectures

When it comes to workload and service credential management, a common misconception is that you can simply reuse your existing Certificate Authority (CA) and Certificate Lifecycle Management (CLM) infrastructure to achieve your desired end-state more quickly. Given that nearly every organization has client and server TLS certificates for devices and web servers and many have made significant investments in managing and issuing these certificates, this notion seems logical. However, the design constraints of these use cases differ significantly. The transition from static to dynamic environments, along with the associated credential governance problems, makes it clear that a purpose-built approach is necessary for managing workload and service credentials.

The Static Nature of Traditional CA and CLM Infrastructure

Traditional CA and CLM infrastructure primarily deals with TLS server certificates, where the identities involved are usually domain names or IP addresses. These identifiers are largely static, pre-defined, and managed out-of-band. As a result, the lifecycle of these certificates, which includes issuance, renewal, and revocation, follows a relatively predictable pattern. These certificates are usually issued for validity periods ranging from 90 days to a year, and the processes and infrastructure are designed to handle these constraints. However, use cases surrounding workload and service credentials have a totally different set of constraints, given the highly dynamic environment and various regulatory frameworks they must adhere to.

The Dynamic Nature of Workloads and Services

Workloads and services operate in a significantly different environment. These identities often come and go as services scale up or down or undergo updates. The sheer scale at which workload and service credentials are issued and managed can be hundreds of times greater than that of traditional client and server TLS certificate use cases.

Unlike the static domain names or IP addresses used in these traditional TLS certificates, the identifiers for workloads and services are dynamically assigned as workloads spin up and down or are updated. Workload and service credentials often have a much shorter lifespan compared to server certificates. They might need to be reissued every few minutes or hours, depending on the nature of the workload and the policies of the environment. This changes the expectations of the availability and scalability of the issuing infrastructure, leading to a need for credentials to be issued as close to the workload as possible to ensure issuance doesn’t become a source of downtime.
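One common way to meet these short lifespans is renewing a credential once it has consumed a set fraction of its lifetime rather than waiting for expiry; the half-life threshold below is a widely used rule of thumb, not a standard:

```python
from datetime import datetime, timedelta, timezone

def should_renew(not_before, not_after, now=None, threshold=0.5):
    """Renew once a credential has consumed `threshold` of its lifetime.

    Renewing early, rather than at expiry, leaves headroom for the
    issuing infrastructure to be briefly unavailable without that
    unavailability turning into an outage for the workload.
    """
    now = now or datetime.now(timezone.utc)
    lifetime = not_after - not_before
    return now >= not_before + lifetime * threshold

issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
expires = issued + timedelta(hours=1)
print(should_renew(issued, expires, now=issued + timedelta(minutes=20)))  # False
print(should_renew(issued, expires, now=issued + timedelta(minutes=40)))  # True
```

With one-hour credentials this renews around the 30-minute mark, which is why issuance capacity and proximity matter so much more here than with annual server certificates.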

These workloads and services are also often deployed in clusters across various geographic locations to ensure fault tolerance and scalability. Unlike server certificates that can rely on a centralized CA, workload and service credentials often need a more distributed approach to minimize latency and meet performance requirements.

Another critical difference lies in the scope of trust. For TLS client and server certificates, the security domain is essentially defined by a root certificate or the corresponding root store. However, in the context of workloads and services, the goal is to achieve least privilege. This often necessitates a much more complicated PKI to deliver on this least-privileged design.

Additionally, federation is often needed in these use cases, and it is usually not possible to standardize on just X.509. For example, you might need to interoperate with a service that works based on OAuth. Even in such cases, you want to manage all of this within one common framework to have unified control and visibility regardless of the technology used for the deployment.

Purpose-Built Solutions for Workload and Service Credential Management

Given the unique challenges of managing workload and service credentials in dynamic, cloud-native environments, existing CA and CLM investments often fall short. Purpose-built solutions, such as those based on SPIFFE (Secure Production Identity Framework for Everyone), tend to offer a more effective approach by providing dynamic, attested identities tailored for these environments.

Dynamic Identifier Assignment

Unlike static identifiers managed by traditional CLM solutions, SPIFFE dynamically assigns identifiers as workloads spin up and down, aligning with the nature of modern cloud environments.

Decentralized Issuance

By issuing credentials as close to the workload as possible, SPIFFE-based solutions reduce latency and align issuance with the availability goals of these deployments, ensuring that credentials are issued and managed efficiently without becoming a source of downtime.

Granular Policy Enforcement

SPIFFE-based solutions enable fine-grained policy enforcement through their namespacing and trust domain concepts. This allows organizations to define and enforce policies at a more granular level, ensuring that workloads only access the resources they need.

Identity Governance and Federation

SPIFFE-based solutions tend to support extending existing identity governance programs to workload and service credentials while also facilitating seamless and secure access to external services across different trust domains.

Multifactor Authentication

These SPIFFE-based solutions also support attestation, which can be thought of as multi-factor authentication for workloads. Attestation verifies a workload or service’s running state and environment, tying credentials to those environments and helping minimize the risk of credential theft.
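A toy sketch of this idea reduces attestation to comparing a workload’s measured image digest against an allow-list before issuing a credential. Real attestation relies on evidence rooted in hardware or the platform, and all names and values here are invented for illustration:

```python
import hashlib
import secrets

# Hypothetical allow-list: expected measurement (e.g., an image digest)
# per workload. Real attestation verifies evidence signed by a hardware
# root of trust or the node's platform, not a bare hash lookup.
EXPECTED_MEASUREMENTS = {
    "payments-api": hashlib.sha256(b"payments-api-image-v42").hexdigest(),
}

def issue_credential(workload_name, observed_image_bytes):
    """Issue a short-lived token only if the workload's measurement matches."""
    measurement = hashlib.sha256(observed_image_bytes).hexdigest()
    if EXPECTED_MEASUREMENTS.get(workload_name) != measurement:
        raise PermissionError("attestation failed: unexpected measurement")
    return f"{workload_name}:{secrets.token_hex(16)}"

token = issue_credential("payments-api", b"payments-api-image-v42")
print(token.startswith("payments-api:"))  # True
```

The key property is that a stolen credential is useless for impersonating a workload that cannot reproduce the expected measurement.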

Integration with Other Systems and Protocols

These environments can seldom rely exclusively on just X.509, which is why SPIFFE supports both X.509 and JWT formats of credentials. This flexibility allows seamless integration with various systems and protocols within cloud deployments.
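For illustration, a JWT-format credential carries its identity in the token’s `sub` claim. The sketch below builds and decodes a toy, unsigned token; a real consumer must verify the signature against the issuer’s keys before trusting any claim:

```python
import base64
import json

def decode_jwt_payload(token):
    """Decode the payload of a JWT without verifying it.

    For illustration only: a real consumer MUST verify the signature
    before trusting any claim in the payload.
    """
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Build a toy (unsigned) token whose payload mimics a JWT-SVID.
claims = {
    "sub": "spiffe://prod.example.org/payments/api",
    "aud": ["billing"],
    "exp": 1700000000,
}
payload = base64.urlsafe_b64encode(json.dumps(claims).encode()).rstrip(b"=").decode()
token = f"eyJhbGciOiJFUzI1NiJ9.{payload}.signature"

print(decode_jwt_payload(token)["sub"])  # spiffe://prod.example.org/payments/api
```

Because the same SPIFFE ID can appear in either an X.509 SAN or a JWT `sub` claim, policy can be written once against the identity rather than per credential format.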

Conclusion

While it might be tempting to reuse existing CA and CLM infrastructure for workload and service credential management, this approach fails to address the unique challenges posed by dynamic, cloud-native environments. The ephemeral nature of workload identities, the high frequency of credential issuance, and the need for granular trust domains necessitate purpose-built solutions like those based on SPIFFE.

These purpose-built systems are designed to handle the dynamic nature of modern environments, providing strong, attested identities and ensuring that credential management aligns with the evolving needs of cloud-native workloads. Understanding these distinctions is crucial for developing an effective identity management strategy for modern workloads and services without having to go through a reset event down the line.

To learn more about effective machine identity management, check out SPIRL and see how they are leading the way.

Credential Management vs. Secret Management: Choosing the Right Approach

If we examine the contents of most secret management solutions, like HashiCorp Vault, we will find that we primarily store the logical equivalent of user IDs and passwords for services, workloads, and machines. Much like the old-school practice of writing passwords on post-it notes and sharing them, these secrets are then distributed to whatever entity needs access to the protected resource. As a result, it is no surprise that we encounter the same problems as password management: weak secrets, shared secrets, and stolen secrets. This is why we see machine credentials and keys being stolen in events like Storm-0558.

Secret Management: Password Management at Machine Scale

Secret management can be seen as password management at machine scale. These systems centralize secrets (e.g., API keys, tokens, and passwords) to manage their sprawl and then make it easy to pass these shared secrets around various services and applications. However, just as passwords alone are now seen as insufficient for user authentication, necessitating the adoption of multi-factor authentication (MFA) and migration to non-password authenticators like Passkeys, the same shortcomings apply to this legacy approach to managing machine credentials.

The reality is that these secret managers were designed to address the symptoms rather than the root cause of secret sprawl. By storing and distributing these static secrets without treating them as credentials to be managed, you centralize the chaos but do not solve the fundamental problem: the need for robust, scalable machine identity management. Machine, service, and workload credentials, much like user credentials, require purpose-built approaches. Just as we have applied multi-factor authentication, identity governance and administration (IGA), and solutions like Okta to user management, we need analogous systems for machines.
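The contrast can be made concrete with a small sketch (the names and TTL are illustrative): a static secret is one long-lived value shared by every consumer, while a managed credential is freshly issued, bound to a subject, and expires on its own:

```python
import secrets
import time

# Legacy pattern: a long-lived shared secret handed to every consumer.
STATIC_SECRETS = {"billing-service": "p@ssw0rd-set-in-2019"}

def get_static_secret(name):
    return STATIC_SECRETS[name]  # same value forever, for everyone

# Managed-credential pattern: each request yields a fresh, expiring
# credential bound to the caller, so theft has a short shelf life.
def issue_managed_credential(name, ttl_seconds=300):
    return {
        "subject": name,
        "token": secrets.token_hex(16),
        "expires_at": time.time() + ttl_seconds,
    }

a = issue_managed_credential("billing-service")
b = issue_managed_credential("billing-service")
print(a["token"] != b["token"])  # True: no shared, reusable value to leak
```

A stolen static secret is valid until someone notices and rotates it; a stolen managed credential is valid for minutes and names its subject, which changes both the blast radius and the audit story.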

Credential Management for Machines

At machine and workload scale, credential management solutions need to be architected differently to handle the dynamic nature of modern cloud environments. This is where purpose-built solutions come into play. These solutions must address the unique challenges of credential management in dynamic, cloud-native environments. Unlike secret management, which often relies on static secrets, credential management provides dynamic, attested identities for workloads. For example, SPIFFE (Secure Production Identity Framework for Everyone) offers a strategy for managing dynamic identities, providing the foundation for standards-based, scalable, and robust workload identity management.
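To make attestation-derived identifiers concrete, here is a minimal Python sketch of constructing a SPIFFE ID from attested workload properties. The trust domain and attestation selectors are invented for illustration and do not reflect any particular deployment.

```python
# Hypothetical sketch: building a SPIFFE ID from attested workload properties.
# The trust domain and attestation selectors below are invented examples.

def spiffe_id(trust_domain: str, *path_segments: str) -> str:
    """Build a SPIFFE ID of the form spiffe://<trust-domain>/<path>."""
    for seg in path_segments:
        if not seg or "/" in seg:
            raise ValueError(f"invalid path segment: {seg!r}")
    return f"spiffe://{trust_domain}/" + "/".join(path_segments)

# Attestation output (e.g., from a node or workload attestor) drives the
# identifier, so identity follows the workload rather than being pre-assigned:
attested = {"cluster": "prod-east", "namespace": "payments", "service": "api"}
workload_id = spiffe_id("example.org", attested["cluster"],
                        attested["namespace"], attested["service"])
print(workload_id)  # spiffe://example.org/prod-east/payments/api
```

Because the identifier is derived at runtime from attested facts, reissuing it is just a matter of re-running attestation, which is what makes this model fit ephemeral workloads.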

Higher-Level Concepts in Workload Credential Management:

  1. Provisioning and Deprovisioning:
    • Efficiently creating, updating, and removing machine identities as services start, stop, and scale.
  2. Role-Based Access Control (RBAC):
    • Assigning access rights based on predefined roles, ensuring that machines and services have appropriate permissions.
  3. Periodic Reviews:
    • Conducting regular reviews of machine access rights to ensure compliance and appropriateness.
  4. Policy Enforcement:
    • Defining and enforcing access control policies, ensuring that machines and services adhere to security guidelines.
  5. Audit and Reporting:
    • Generating comprehensive reports to demonstrate compliance with regulatory requirements and internal policies.
  6. Risk Analysis:
    • Identifying and mitigating risks associated with machine identities and their entitlements.
  7. Behavioral Analysis:
    • Monitoring machine behavior to detect anomalies and potential security threats.
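As a rough sketch of how role-based access control might look for machine identities, consider the following Python fragment; the role names and permission strings are hypothetical.

```python
# Illustrative role-based access control for workload identities.
# The roles and permission strings here are invented for this example.

ROLE_PERMISSIONS = {
    "payments-service": {"db:read", "db:write", "queue:publish"},
    "reporting-job":    {"db:read"},
}

def is_allowed(workload_role: str, permission: str) -> bool:
    """Return True if the role grants the requested permission."""
    return permission in ROLE_PERMISSIONS.get(workload_role, set())

print(is_allowed("reporting-job", "db:read"))   # True
print(is_allowed("reporting-job", "db:write"))  # False
```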

Just as legacy approaches to secret management fail for workloads, machines, and services, these legacy IGA approaches will not work either. We need built-for-purpose solutions to adapt and not repeat the mistakes of the past. Those solutions will look like credential management systems that enable these functions as a byproduct of how they work, rather than bolting them onto an existing secret management or IGA process. 

This is why I have been an advisor to SPIRL since its founding. The founders, Evan and Eli, are why I knew I had to be involved. As the authors of the standard and having had real production-at-scale experience with SPIFFE, I knew they had what it took, which is why I have been working with them in a more formal capacity as of this last month. They are building the infrastructure that makes, as I like to say, the right thing the easy thing, ensuring their customers don’t repeat the mistakes of the past when it comes to workload and service identity.

Conclusion

Secret management tools were a necessary step in addressing the sprawl of machine secrets, and they will continue to be needed. However, as the complexity and scale of cloud environments grow, so does the need for more sophisticated solutions for workload and service credentials. Credential management systems provide the dynamic, robust framework required for modern machine identity management.

By adopting credential management as the primary approach and using secret management as the exception, organizations can achieve greater security, scalability, and operational efficiency in their cloud-native architectures. To learn more about effective machine identity management, check out SPIRL and see how they are leading the way.

ACME vs. SPIFFE: Choosing the Right One

In the world of certificate lifecycle management for workloads, two approaches often come into focus: ACME (Automated Certificate Management Environment) and SPIFFE (Secure Production Identity Framework for Everyone). While both can deliver certificates to a device or workload, they cater to different use cases. Let’s explore their core differences and why these matter, especially for dynamic workload deployments.

ACME: Proving Control Over Identifiers

ACME is primarily designed to prove control over an identifier, typically a domain name, and to deliver a certificate that demonstrates successful control verification.

  • Control Verification: ACME verifies control through various methods such as HTTP-01, DNS-01, TLS-ALPN, and External Account Binding.
  • Attestation (Optional): Attestation in ACME is secondary and optional, primarily used to verify that the key is well protected.
  • Pre-assigned Identifiers: ACME assumes that the identifier (like a domain name) is pre-assigned and focuses on validating control over this identifier.

This approach is particularly useful for scenarios where identifiers are static and pre-assigned, making it ideal for server-authenticated TLS and applications that rely on domain names.
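To ground the control-verification point, here is a small Python sketch of how a DNS-01 challenge response is derived under RFC 8555: the TXT record value is the base64url-encoded SHA-256 digest of the key authorization (the token joined to the account key's JWK thumbprint). The token and thumbprint shown are placeholders, not real challenge data.

```python
# Deriving the DNS-01 TXT record value per RFC 8555 (ACME).
# The token and JWK thumbprint below are placeholders for illustration.
import base64
import hashlib

def b64url(data: bytes) -> str:
    """Base64url without padding, as ACME requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def dns01_txt_value(token: str, jwk_thumbprint: str) -> str:
    """Value published at _acme-challenge.<domain>:
    base64url(SHA-256(keyAuthorization))."""
    key_authorization = f"{token}.{jwk_thumbprint}"
    return b64url(hashlib.sha256(key_authorization.encode()).digest())

print(dns01_txt_value("example-token", "example-thumbprint"))
```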

SPIFFE: Dynamic Assignment of Identifiers Based on Attestation

SPIFFE, conversely, is designed for dynamic workloads, which frequently change as services scale or update. SPIFFE assigns identifiers to workloads dynamically, based on attestation.

  • Identifier Assignment: SPIFFE assigns an identifier to the subject (such as a workload) using attestation about the subject to construct this identifier.
  • Attestation: Unlike ACME, attestation is a core component in SPIFFE, enabling robust multi-factor authentication (MFA) for workloads based on what is running and where it is running.
  • Namespacing and Least Privilege: SPIFFE facilitates the namespacing of identifiers, building a foundation that enables authorization frameworks that promote least privilege, ensuring workloads only access necessary resources.
  • Minimal Security Domains: At the core of SPIFFE is the concept of security domains, which serve as trust boundaries between services, helping to minimize the attack surface.
  • JWT and X.509: SPIFFE supports SVIDs in both X.509 and JWT formats, enabling seamless integration with various systems and protocols within cloud deployments.
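Since SPIFFE issues SVIDs in JWT form as well as X.509, here is a minimal sketch of pulling the SPIFFE ID out of a JWT-SVID's sub claim. The token below is fabricated and unsigned; a real consumer must verify the signature against the trust bundle before trusting any claim.

```python
# Reading the SPIFFE ID from a JWT-SVID payload (signature checking omitted;
# real code must verify against the trust bundle first). Token is fabricated.
import base64
import json

def b64url_json(obj) -> str:
    return base64.urlsafe_b64encode(json.dumps(obj).encode()).rstrip(b"=").decode()

def jwt_svid_subject(token: str) -> str:
    """Extract the 'sub' claim (the SPIFFE ID) from an *unverified* JWT."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))["sub"]

# A fabricated, unsigned token purely for illustration:
token = ".".join([
    b64url_json({"alg": "ES256", "typ": "JWT"}),
    b64url_json({"sub": "spiffe://example.org/payments/api",
                 "aud": ["spiffe://example.org/backend"]}),
    "fake-signature",
])
print(jwt_svid_subject(token))  # spiffe://example.org/payments/api
```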

Importance in Dynamic Workload Deployments

The differences between ACME and SPIFFE are particularly significant in dynamic workload deployments:

  • Flexibility and Scalability: SPIFFE’s dynamic identifier assignment is highly suitable for environments where workloads are frequently spun up and down, such as in microservices architectures and Kubernetes clusters.
  • Role-Based Authentication: By making attestation a core component and promoting least privilege, SPIFFE ensures that each workload is authenticated and authorized precisely for its intended role.
  • Granular Policy Enforcement: SPIFFE’s namespacing and minimal security domain features enable fine-grained policy enforcement, enabling organizations to define and enforce policies at a more granular level, such as per-workload or per-service.
  • Identity Federation: SPIFFE supports identity federation, enabling different trust domains to interoperate securely. This is particularly beneficial in multi-cloud and hybrid environments where different parts of an organization, or third parties, must trust one another.

Conclusion

While both ACME and SPIFFE are used for certificate management, they cater to different scenarios and can complement each other effectively. ACME is ideal for static environments where identifiers are pre-allocated, focusing on certificate lifecycle management for issuing, renewing, and revoking certificates. It excels in managing the lifecycle of certificates for web servers and other relatively static resources. 

On the other hand, SPIFFE excels in dynamic, high-scale environments, emphasizing credential lifecycle management with flexible and robust authentication and authorization through dynamic identifiers and attestation. SPIFFE is designed for modern, cloud-native architectures where workloads are ephemeral, and identifiers need to be dynamically issued and managed. 

By understanding these differences, you can leverage both ACME and SPIFFE to enhance your identity management strategy. Use ACME for managing certificates in static environments to ensure efficient lifecycle management. Deploy SPIFFE for dynamic environments to provide strong, attested identities for your workloads.

Building Effective Roadmaps through Mission, Vision, and Strategy Alignment

Creating a cohesive and effective roadmap requires more than just a list of tasks and deadlines; it requires a clear mission, vision, and strategy that aligns the entire organization. This alignment ensures that every team member understands the overarching goals and the path to achieving them, fostering a unified and focused approach to success.

Mission: The Foundation of Purpose

The mission statement serves as the foundation, articulating the core purpose of the organization. It answers why the organization exists and what it aims to achieve. For example:

Google Cloud’s mission to enable global digital transformation through superior infrastructure and industry-specific solutions guides all efforts.

A strong mission focuses on the problem, providing clarity on the organization’s purpose.

Vision: The Aspirational Future

The vision statement paints an aspirational picture of the future the organization seeks to create within the next 2-5 years. It describes desired outcomes and the impact of the organization’s efforts, such as envisioning customers leveraging advanced features to transform their operations. For example:

Google Cloud’s vision is to empower businesses of all sizes to innovate faster and achieve operational excellence through cutting-edge technology and seamless integration with existing technology investments.

While the mission addresses the ‘why,’ the vision focuses on the ‘what’—what success looks like in practical terms.

Strategy: The Path to Achievement

Strategy bridges the gap between the current state and the envisioned future. It outlines ‘how’ the organization will achieve its mission and vision, including:

  • Assessing the current environment, including market conditions and internal capabilities.
    • Google Cloud operates in a highly competitive market with rapid technological advancements. Understanding customer needs for scalable and secure cloud solutions is crucial.
  • Using data and trends to inform strategic decisions.
    • Analyzing trends in cloud adoption, security requirements, and industry-specific needs helps Google Cloud identify opportunities for growth and areas where it can provide unique value.
  • Setting clear, measurable business outcomes.
    • Google Cloud aims to achieve a 25% increase in enterprise adoption of its cloud services within the next two years by enhancing its product offerings and expanding its market reach.
  • Deciding where to focus resources and explaining why certain paths are chosen.
    • Prioritizing the development of AI and machine learning capabilities tailored to industry-specific solutions to address top enterprise blockers. This decision is based on market demand and internal strengths in these technologies.

Roadmap: The Plan

The roadmap translates the strategic vision into a plan, providing an 18-24 month overview of key milestones. It breaks down big problems into manageable tasks, ensuring alignment with overall goals. A good roadmap is not just a schedule; it’s a strategic tool that shows how initiatives drive business success, allowing for flexibility and adaptation while staying true to the mission and vision.

For example:

  • Q1 2024: Launch Enhanced AI and Machine Learning Tools
    • Initiative: Develop and release new AI and machine learning tools designed to address specific industry needs, such as healthcare analytics and financial risk modeling.
    • Milestone: Achieve a 10% increase in customer adoption of AI services within three months of launch.

  • Q2 2024: Expand Data Security Solutions
    • Initiative: Introduce advanced data security features, including encryption and threat detection, to meet growing cybersecurity demands.
    • Milestone: Secure certifications for new security standards and onboard 50 new enterprise customers by the end of the quarter.

  • Q3 2024: Integrate Cloud Services with Major ERP Systems
    • Initiative: Develop seamless integrations with leading ERP systems to streamline operations for enterprise customers.
    • Milestone: Complete integration with three major ERP platforms and pilot with 10 key clients.

  • Q4 2024: Launch Global Data Centers Expansion
    • Initiative: Open new data centers in strategic locations worldwide to enhance global reach and reliability.
    • Milestone: Increase data center footprint by 20% and reduce latency by 15% for international clients.

  • Q1 2025: Roll Out Customer Success Programs
    • Initiative: Implement customer success programs aimed at improving user experience and satisfaction.
    • Milestone: Achieve a 15% increase in customer satisfaction scores and reduce churn rate by 10%.

Achieving Alignment and Focus

Aligning the mission, vision, and strategy, and clearly communicating them, helps maintain organizational focus. This alignment prevents reactive firefighting and directional resets that consume resources. By fostering proactive planning and execution, organizations can drive meaningful progress toward business goals.

Conclusion

Effective roadmaps and work plans stem from a well-aligned mission, vision, and strategy. This alignment provides a stable foundation, clear goals, and a practical path to achieve them. By maintaining this alignment, organizations can move from reactive to proactive operations, regain stakeholder trust, and achieve sustainable success.

Thanks to Michael Windsor for his advice on this and other topics over the years.

Transitioning from Reactive to Proactive

Building Effective, Evidence-Based Roadmaps for Business Success

In defining roadmaps, the primary goal is to deliver on product objectives that will generate revenue. These roadmaps are prioritized to achieve business goals as swiftly as possible. Typically, the prioritized items listed on roadmaps are broadly visible to customers but often depend on behind-the-scenes work that is not customer-visible, determined when building out work plans.

These work plans allocate time for roadmap items, customer obligations, and technical debt. For instance, a typical allocation might be 70% for roadmap items and their dependencies, 20% for customer obligations, and 10% for technical debt. In healthy organizations, these target allocations will ebb and flow but are largely adhered to, with the roadmap representing the largest commitment in the allocation.
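The split above is simple arithmetic; as a toy illustration (the team capacity and percentages are invented, not a recommendation):

```python
# Toy capacity split following the 70/20/10 allocation described above.
# Team size and percentages are illustrative only.

ALLOCATION_PCT = {"roadmap": 70, "customer_obligations": 20, "technical_debt": 10}
engineer_weeks_per_quarter = 120  # e.g., 10 engineers x 12 weeks

budget = {bucket: pct * engineer_weeks_per_quarter // 100
          for bucket, pct in ALLOCATION_PCT.items()}
print(budget)  # {'roadmap': 84, 'customer_obligations': 24, 'technical_debt': 12}
```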

When this is not the case, work plans are treated as roadmaps. In these situations, organizations are in a reactive position: the only work that gets done is what has been committed to customers, and technical debt has grown so severe that it is already harming the business. These organizations are perpetually in fight-or-flight mode.

Such reactive modes are a byproduct of a lack of structure in building the roadmap, a disconnect in the work supporting that roadmap, and a failure to recognize the current situation regarding customer commitments. This, in turn, creates a situation where leadership loses faith in the organization’s ability to produce a compelling product roadmap.

This can be the beginning of a death spiral, where the expectation is that every item in the work plan maps back to customer revenue. This results in the items necessary to achieve forward-growth business objectives being de-prioritized, with the committed sales funnel driving all engineering spend. The work plan, and the management of it, replaces accountability for building a business and the associated engineering investment with a list of customer asks. As a consequence, the organization can never address technical debt or implement the kinds of product design pivots needed to increase deal size by expanding its footprint within customers, reduce the cost of goods sold, shorten time to close deals, or create new parallel go-to-market strategies.

While there is no one-size-fits-all approach to product planning, the principles outlined here offer valuable lessons that can be adapted to various organizational contexts. To transition to a more proactive and accountable roadmap-driven model, organizations must commit to a disciplined approach where a significant portion of the effort is dedicated to developing a product that can deliver business outcomes, supported by a robust body of customer evidence.

This shift also involves ensuring alignment with internal stakeholders, such as product, engineering, and executive leadership. Achieving this alignment is key to preventing the cycle of reactive firefighting and directional resets that consume resources and hinder progress on business objectives.

By returning to first principles and building a compelling, evidence-based roadmap, organizations can, over time, move their culture from a reactive to a proactive operational mode, regain the trust of all participants, and as a result drive meaningful progress toward their business goals.

Automating Non-Human Identities: The Future of Production Key Management 

Historically, key management was seen as activities involving hardware security modules (HSMs), manual tasks, and audits. This approach was part of what we termed ‘responsible key management.’ However, HSMs were impractical for many use cases, and these manual tasks, typical of IT processes, were often poorly executed or never completed, frequently causing security incidents, outages, and unexpected work.

Simultaneously, as an industry, we began applying cryptography to nearly all communications and as a way to protect data at rest. This led to the adoption of cryptography as the method for authenticating hardware, machines, and workloads to apply access control to their activities. As a result, today, cryptography has become a fundamental component of every enterprise solution we depend on. This shift led us to attempt to apply legacy key management approaches at the enterprise scale. The increased use of cryptography within enterprises made it clear these legacy approaches ignored the majority of keys we relied on, so we took a tactical approach and created repositories to manage the sprawl of secrets. While a step forward, this approach also papered over the real problems with how we use, track, and manage keys.

It is time for us as an industry to start viewing cryptography and key management not just as a tax we must pay but as an investment. We need to manage these keys in an automated and scalable way that helps us manage risk in our businesses.

To do this, we need to start with a question: What are these keys, anyway? Broadly, I think of three categories of keys: long-lived asymmetric secrets like those associated with certificate authorities, long-lived shared secrets used for encryption and authentication, and modern-day asymmetric key credentials for users, devices, and workloads. The greatest growth in keys has been in the last category, so let’s focus on that for the purpose of this discussion.

Modern Credentials and Their Management

Modern-day asymmetric key-based credentials are not always “certificates,” but they generally bind some claim(s) to an asymmetric key pair. These credentials can be encoded as JSON, ASN.1, CBOR, TLVs, X.509, JWT, or some other format. They serve various purposes:

  1. User Certificates: Issued to individual users to authenticate their identity within an organization, these certificates provide secure access to corporate resources, such as an SSH certificate used by developers to access production. They bind a user’s identity to a cryptographic key pair, ensuring only authorized individuals access sensitive information and systems.
  2. Hardware Certificates: Assigned by manufacturers during production, these certificates uniquely identify hardware devices. They are often used to bootstrap the identity of machines or workloads, ensuring only authorized devices can access resources on your network.
  3. Machine Certificates: Common in operational IT environments, these certificates authenticate servers associated with domains, IP addresses, or device identifiers. They are typically used with TLS and for network access use cases like 802.1x, IKE, and various VPNs.
  4. Workload Certificates: In cloud and serverless environments, workload certificates perform access control close to the business logic to minimize security exposure and deliver on zero trust goals. These dynamic certificates often reflect both the underlying hardware and the workload running on it, acting like multi-factor authentication for devices. The frequent need to re-credential workloads makes issuing credentials mission-critical, as failure to do so can cause outages. This necessitates issuers in each failure domain (think of this as a cluster of clusters) hosting these workloads to ensure continuous operation.
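The common thread across these categories is a structure that binds claims to a key pair, typically with a short lifetime for workloads. A schematic sketch (field names and TTL are invented, and signing is omitted; a real issuer signs this structure):

```python
# Schematic credential: identity claims bound to a public key fingerprint.
# Field names and TTL are illustrative; a real issuer would sign this structure.
import hashlib
import time

def issue_credential(spiffe_id: str, public_key_pem: str,
                     ttl_seconds: int = 300) -> dict:
    """Bind identity claims to a key; short TTLs force frequent re-credentialing."""
    now = int(time.time())
    return {
        "sub": spiffe_id,
        "key_fingerprint": hashlib.sha256(public_key_pem.encode()).hexdigest(),
        "iat": now,
        "exp": now + ttl_seconds,
    }

cred = issue_credential("spiffe://example.org/payments/api",
                        "-----BEGIN PUBLIC KEY-----...")
print(sorted(cred))  # ['exp', 'iat', 'key_fingerprint', 'sub']
```

The short expiry is what makes issuance mission-critical: if the issuer in a failure domain goes down, workloads cannot re-credential and outages follow.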

What we can take from this is that we have been approaching credentials incorrectly by treating them as a key management problem. This approach is akin to using password managers for hardware, machines, and workloads, whereas, for users, we have moved toward multi-factor authentication and non-password-based authenticators.

Towards Automated and Scalable Key Management

If password managers or key vaults are not the right solution for machine authentication, what is? The answer is simpler than it might seem. Just as with users, these cases require built-for-purpose Identity Providers (IDPs). This is especially true for workloads, which dynamically spin up and down, making durable identifiers impractical. An IDP becomes a security domain for a given deployment, ensuring that workloads are accessible only by appropriate resources. This setup limits attackers’ lateral movement, allows for nearly instant granting and removal of access, minimizes the impact of compromises, and enables easy federation between deployments—all while providing a central point for identity governance and ensuring the cryptographic keys associated with credentials are well-managed and protected.

Getting Started

Modernizing key management starts with measurement. Identify the most common types of keys in your secret vaults, typically workload-related credentials. Deploy a workload-specific IDP, such as those enabled via SPIFFE, to transition these credentials out of the secret manager. Over time, the secret manager will store static secrets like API keys for legacy systems, while dynamic credentials are managed appropriately.
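That measurement pass can start as something as simple as classifying what is in the vault by name; the inventory entries and the name-based heuristics below are invented for illustration.

```python
# Hypothetical classification pass over a secret-vault inventory.
# The inventory entries and name-based heuristics are invented examples.
from collections import Counter

def classify(secret_name: str) -> str:
    name = secret_name.lower()
    if "cert" in name or "svid" in name or "tls" in name:
        return "workload-credential"  # candidates to move to a workload IDP
    if "api" in name or "token" in name:
        return "api-key"
    if "password" in name:
        return "password"
    return "other"

inventory = ["payments-tls-cert", "reporting-api-token", "legacy-db-password",
             "api-gateway-key", "frontend-svid"]
print(Counter(classify(s) for s in inventory))
```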

Avoid using your secret manager as an IDP from the start, especially for new systems. Teams responsible for the operational burden of these systems usually support this change, as automated end-to-end credentialing of workloads is more agile, scalable, and secure. The result is fewer outages and security incidents related to secret managers and non-production-quality dependencies.

From this point, the process becomes a cycle of identifying where static secrets or long-lived credentials are funneled through your secret manager and moving them to built-for-purpose credential lifecycle management solutions.

Multi-Factor Authentication for Workloads

Adopting a purpose-built IDP workload solution is a good start, but keys can still be stolen or leaked. For machines and workloads, use hardware attestations. Built-in hardware authenticators, such as Trusted Platform Modules (TPMs), create and secure keys within the device, ensuring they never leave. TPMs also verify device integrity during boot-up, adding an extra layer of security. This combination provides stronger multi-factor authentication without the usability issues associated with similar patterns for user authentication.

Avoiding Common Mistakes

The most common mistake organizations make is applying existing systems to workload credential management problems without fully analyzing operational, scale, uptime, and security needs. For example, enterprise PKI teams might mandate using their existing CA infrastructure for managing workload credentials, assuming a certificate suffices. However, this often violates the principle of least privilege and struggles with dynamic identification required for workloads.

Existing credential issuance systems are designed for static, long-lived subjects, such as directory names, DNS names, or IP addresses, which don’t change frequently. In contrast, workload credentials may change every few minutes. Provisioning devices like network appliances before assigning durable identifiers adds to this challenge. New workload-based systems, like SPIFFE, assign identifiers based on runtime elements, preventing the same bad practices that led to secret sprawl and mismanaged key problems.

Reducing Reliance on Shared Secrets 

Moving away from shared secrets won’t eliminate the need for secret vaults but will significantly reduce the problem’s scope. As systems modernize, password-based authenticators will be updated or deprecated. Over time, we will see fewer shared, long-lived secrets used for workload identity, driven by zero trust and least privilege principles.

At the same time, we can do much to improve overall key management practices in production systems. However, that’s a topic for another post.

Closing Note

The challenges and opportunities in modern key management are significant, but by leveraging innovative solutions and focusing on automation and scalability, we can make substantial progress. As we adopt built-for-purpose Identity Providers (IDPs) and hardware attestations, it’s important to have the right tools and frameworks in place to succeed.

I have been working with SPIRL, a company focused on making the right thing the easy thing for developers, operations, and compliance. I see firsthand how the right platform investments can simplify the creation and operation of production systems. SPIRL achieves this by authoring and adopting open standards for identity and building reliable, scalable infrastructure that provides greater visibility and control.

Even if you don’t use SPIRL, by focusing on these principles, organizations can better manage the complexities of modern workload-related key and credential management, ensuring greater productivity and security.

Rethinking Security in Complex Systems

Over the last few decades, we seem to have gotten better at the micro aspects of security, such as formally verifying protocols and designing cryptographic algorithms, but have worsened at, or at least failed to keep pace with, the macro aspects, such as building and managing effective, reproducible, risk-based security programs.

This can probably be attributed both to the scale of the systems we now depend on and, perhaps even more, to human factors. The quote, “Bureaucracy defends the status quo long past the time when the quo has lost its status,” is relevant here. Security organizations grow organically rather than strategically, usually in response to a past failure or a newly recognized risk. The result is new teams that persist indefinitely, steadily increasing the load the security organization places on the business. The role of these teams typically expands over time to justify that load, transforming them into data-gathering organizations. This explains why enterprises have so many dashboards of data that fail to drive action: organizations are overwhelmed and struggle to understand where their risks lie, which keeps them from effectively allocating their limited resources toward security-positive outcomes.

Building on this foundation of understanding risk, there is the question of how we measure a security organization’s success in managing it. As the saying goes, “If you can’t measure it, you can’t manage it.” The issue is that organizations often turn these measurements into metrics of success, which seems rational on the surface. In practice, however, we encounter another adage: “When a measure becomes a target, it ceases to be a good measure,” something that is especially true in security. For example, we often judge a security program by how well it responds to an incident, ships new capabilities, or passes audits, but this is an incomplete picture. Take audits: they are designed to fit every organization, which brings us to the saying, “If you try to please all, you please none.” In security, this approach guarantees you will miss major issues or, at an absolute minimum, prioritize activity over effectiveness. Concretely, it incentivizes passing audits (despite the incomplete picture they represent) and driving down the number of identified CVEs (despite these figures almost always reflecting false positives), which in turn misleads organizations into believing they are more secure than they actually are.

Transitioning from metrics alone to a combination of metrics, technology, and continual improvement can help with some of the scale problems above. For example, AI shows promise for triaging issues and accelerating the review of low-level findings that fit neatly into a small context window, but it can also give a false sense of security. The human problems, on the other hand, cannot simply be automated away; the best we can do is rethink the way we build organizations so that empowerment and accountability are woven into how they operate. This requires ensuring that those who take on accountability truly understand how systems work, and that they build their teams with a culture of critical, continual improvement. A second-order effect of these scale tools is the de-skilling of their creators and operators; for example, I am always struck by how little modern computer science graduates, even from the most prestigious schools, understand about the systems they write code on. We must also make continual education a key part of how we build our organizations.

Finally, to support this from a design standpoint, we also need to reconsider how we design systems. The simplest approach is to design them in the most straightforward way possible, but even then we need to consider the total operational nature of the system as part of the design. In user experience, we design for key users; early on, we talk about the concept of user stories. In system design, we often design first and only later figure out how the system will be operated and managed. We need to incorporate operational security into our designs. Do systems emit the information (metrics and contextualized logs) required to monitor them? Do we provide the tools needed to perform that detection? If not, how are their users to know they are operating securely? For example, do they enable monitoring for threats like token forgery over time? We must make the systems we ship less dependent on the humans involved in their operation, recognizing that those humans are generalists, not specialists, and give them simple answers to these operational questions if we want to help them achieve secure outcomes.

In conclusion, organizations need to examine the technological and human aspects of their security programs, as well as their technology choices, continuously and critically. This will almost certainly involve rethinking the metrics they use to drive security efforts, building a workplace culture centered on empowerment, accountability, and continuous improvement, and integrating the security lifecycle of systems into the design process from day one. By adopting these strategic pillars, organizations can build more resilient, effective, and adaptable security programs equipped to meet the challenges of today’s dynamic environment.