Balancing Innovation and Privacy: The Risk of Government Surveillance in the Age of AI

Imagine a world where every conversation, every movement, and every interaction is tracked in real-time by unseen eyes. This isn’t the plot of a dystopian novel—it’s a very real possibility enabled by today’s rapid technological advancements. As we develop and deploy powerful new tools, and maintain the ones the internet was built on, the line between technology that makes our lives easier and technology that enables invasive surveillance becomes increasingly blurred.

“In the future everyone will want to be anonymous for 15 minutes”, Banksy, 2011

This growing concern isn’t unfounded. In recent years, the push for censorship and surveillance has become more pronounced, driven by various geopolitical, social, and technological factors. While these measures are often framed around national sovereignty, combating terrorism, and protecting children, the unintended consequences could be far-reaching. Whether this occurs accidentally or purposely, the outcome can be equally invasive and harmful.

We stand at a point in history where the rapid development and deployment of new technologies, such as Microsoft’s Recall and Apple’s Private Cloud Compute, might inadvertently build the infrastructure that enables governments’ ultimate surveillance goals.

The advent of generative AI, which can extract insights at scale from video, audio, and text, opens the door to mass surveillance more than any other technology in history. These AI advancements enable detailed monitoring and analysis of vast amounts of data, making it easier for governments to track individuals and gain unprecedented insights into personal lives. To be clear, these technologies have amazing potential and are already making many people’s lives easier and making businesses more productive. In some cases, they also represent large and positive trends in privacy protection and should give us faith in our ability to enable these innovations in a privacy-respecting way.

However, their use raises crucial questions about how we, as system designers, should evolve our philosophies on security and privacy.

A Timeline of Surveillance and Encryption Battles

Let’s take a step back and examine the historical context of government attempts to monitor citizens and the corresponding technological responses.

1990s

  • Introduction of technologies to enable government access, such as the Clipper Chip.
  • Backlash against government surveillance efforts and exposure of security flaws in the Clipper Chip.
  • Encryption classified as a munition, leading to strict export controls, an effort that held back security advancement for years.

2000s

  • Relaxation of U.S. export controls on encryption products, a tacit acknowledgment of their failure.
  • Expansion of law enforcement access to personal data through legislation like the USA PATRIOT Act.

2010s

  • Major leaks reveal extensive government surveillance programs.
  • Public debates over government backdoors in encryption.
  • Expansion of government surveillance capabilities through new legislation.
  • Proposals for key escrow systems and the criticism they drew from security experts.

2020s

  • Controversies over commercial companies’ data collection and sale practices, and the government’s use of that data against citizens.
  • Legislative proposals and debates around encryption and privacy, such as the EARN IT Act.
  • EU pushes to limit root programs’ ability to remove EU CAs and to mandate trust of EU CAs.
  • Tech companies’ attempts to implement on-device monitoring of user data and the ensuing backlash.
  • Suppression of legitimate discourse about COVID-19, leading to long-term impacts on public trust.
  • Ongoing legislative efforts to expand surveillance capabilities in the EU.

New Technologies and Surveillance Risks

Fast forward to today, and we are witnessing the rise of new technologies designed to make our data more useful. However, in our rush to build out these new capabilities, we might be inadvertently paving the way for government surveillance on an unprecedented scale.

To give this some color, let’s look back at another technological innovation: payphones. Payphones became widespread in the 1930s, and by the 1950s, the FBI under J. Edgar Hoover’s COINTELPRO program was using phone tapping to target civil rights organizations, feminist groups, and others advocating for political change. Regardless of your political leanings, we can all hopefully agree that this was an infringement on civil liberties.

Pay Phone in the Haight, San Francisco, 1967

Similarly, generative AI, which can extract insights at scale from video, audio, and text, opens the door to mass surveillance on a scale no prior technology has matched. While these technologies are useful, they present a double-edged sword: the potential for more usable technology and improved productivity on one hand, and the risk of pervasive surveillance on the other.

Fast forward to the early 2000s: the USA PATRIOT Act expanded the surveillance powers of U.S. law enforcement agencies, laying the groundwork for extensive data collection practices. Those powers eventually led to the Snowden revelations in 2013, which exposed the vast scope of the NSA’s surveillance programs, including the bulk collection of phone metadata and internet communications.

“How would you like this wrapped?” John Jonik, early 2000s
Clay Bennett, Christian Science Monitor, 2006

As George Santayana famously said, ‘Those who cannot remember the past are condemned to repeat it.’ Technological advancements intended to be used for good can enable detailed, real-time surveillance on a national or even global scale. This dual-use nature underscores the critical need for robust, transparent frameworks to prevent misuse and ensure these technologies align with our fundamental rights and freedoms.

Striking a Balance: Innovation and Vigilance

We do not want, nor can we afford, to be Luddites; this is a world economy, and time and technology move on. Nor do we want to let fear convince us to give up essential liberty to purchase a little temporary safety. Instead, we must strive to strike a balance between innovation and vigilance. This balance involves advocating for robust, transparent, verifiable, and accountable frameworks that ensure the use of technology aligns with our fundamental rights and freedoms.

For example, many security practitioners have historically pushed back against the use of client-side encryption, claiming it is a distraction from real security work because the software or service provider can bypass these protections. However, as we see now, applying these technologies serves as a useful tool for protecting against accidental leakage and for defense in depth more broadly. This change exemplifies the need to continually reassess and adapt our security strategies so they match the actual threat model and not some ideal.

An essential aspect of this balance is the governance of the foundational elements that ensure our privacy on the web. Root programs that underpin the WebPKI play a critical role in maintaining trust and security online. The WebPKI provides the infrastructure for secure Internet communication, ensuring that users can trust the websites they visit and the transactions they perform. Efforts to weaken this system, such as the eIDAS regulation in the EU, pose significant risks. Certificate Transparency (CT) logs, which provide a transparent and publicly auditable record of all certificates issued, are crucial for detecting and preventing fraudulent certificates, thus ensuring the infrastructure remains robust and reliable. Trusted root certificate authorities (CAs) must demonstrate principled, competent, professional, secure, and transparent practices, with continual review to maintain the WebPKI’s integrity. By protecting and strengthening these foundational elements, we ensure that the infrastructure supporting our digital world remains secure and trustworthy. This vigilance is crucial as we continue to innovate and develop new technologies that rely on these systems.

Here are several guiding principles to consider:

  1. Transparency: Companies developing these technologies must be transparent about their data practices and the potential risks involved. Clear communication about how data is collected, stored, and used can help build trust.
  2. Accountability: Ensuring accountability involves creating systems and processes that allow for oversight and verification of how data is handled. This can include audits, reports, and other mechanisms that demonstrate a company’s commitment to ethical data management.
  3. Privacy by Design: Incorporating privacy features from the ground up is essential. Technologies should embed privacy controls, transparency, and encryption, as Private Cloud Compute does, as fundamental components, not as afterthoughts.
  4. Legal and Ethical Standards: There must be robust legal frameworks and ethical standards governing the use of surveillance technologies. These frameworks should ensure that surveillance is conducted legally, with proper oversight, that it respects individuals’ rights, and that it is discoverable after the fact so it is harder to abuse.
  5. Public Dialogue and Advocacy: Engaging in public dialogue about the implications of these technologies is crucial. Advocacy groups, policymakers, and the tech industry need to work together to educate the public and advocate for policies that protect privacy and civil liberties.
  6. Technological Safeguards: Implementing technological safeguards to prevent misuse is critical. This includes encryption, verifiable designs, and transparent auditability, so that ensuring surveillance capabilities are not abused does not depend on blind faith.

The Role of the Tech Community

The tech community plays a pivotal role in shaping the future of these technologies. By fostering a culture of ethical development and responsible innovation, developers, engineers, and tech companies can influence how new tools are designed and deployed. This involves prioritizing user privacy and security at every stage of development and ensuring that these principles are not compromised for convenience or profit.

Historical examples demonstrate the effectiveness of this approach. The breaking of the Clipper Chip in the 1990s is a notable case where researchers exposed significant security flaws, proving that government-mandated backdoors can undermine overall security. More recently, the advocacy for end-to-end encryption (E2EE) and its supporting technologies showcases the tech community’s positive influence. Companies like WhatsApp, Signal, and Apple have implemented E2EE, setting new standards for privacy and security in communications.

Continued innovation in this regard can be seen in developments like key transparency and contact discovery. Key transparency mechanisms allow users to verify that the encryption keys they are using have not been tampered with, minimizing the amount of trust required in service providers and showing leadership for the industry as a whole.

Developers should also advocate for and adopt open standards that enhance verifiability, auditability, security, and privacy. These standards can serve as benchmarks for best practices, helping to establish a baseline of trust and reliability in new technologies. Collaboration within the tech community can lead to the creation of robust, interoperable systems that are harder for malicious actors to exploit.

Additionally, academia plays a crucial role in this endeavor by researching and developing provably secure models for various technologies. Academic work often focuses on creating rigorous, mathematically sound methods for ensuring security and privacy. By integrating these academic findings into practical applications, we can build more secure and trustworthy systems. This partnership between industry and academia fosters innovation and helps establish a foundation for secure technology development.

Moreover, the tech community must actively engage in policy discussions. By providing expert insights and technical perspectives, they can help shape legislation that balances innovation with the protection of civil liberties. This engagement can also help policymakers understand the technical complexities and potential risks associated with surveillance technologies, leading to more informed and balanced decision-making.

And finally, it is important that we help regulators understand that technology alone cannot solve deeply rooted social problems and that over-reliance on it will lead to unintended consequences. A balanced approach that integrates technology with thoughtful policy, ethical considerations, and social engagement is essential if we want to see effective change.

Education and Awareness

Educating the public about the implications of these technologies is crucial. Many people may not fully understand how their data is collected, used, and potentially monitored. By raising awareness about these issues, we can empower individuals to make informed decisions about their digital privacy.

Educational initiatives can take many forms, from public awareness campaigns to integrating privacy and security topics into school curriculums. At the same time, we must be careful to recognize the concerns and motivations of those advocating for these measures. This is not a situation where preaching to the choir or pontificating about abstract possibilities will be effective. We must anchor this education in recent history and concrete examples to avoid sounding like zealots.

Conclusion

The increasing push for censorship and the risk of enabling government surveillance underscore the delicate balance between technological advancement and civil liberties. Emerging technologies bring both promise and peril, potentially enabling unprecedented surveillance.

To address these risks, we must advocate for transparency, accountability, and privacy by design. Establishing robust legal and ethical standards, engaging in public dialogue, and raising awareness about digital privacy are essential steps. The tech community must prioritize ethical development, adopt open standards, and participate in policy discussions to ensure that innovation does not come at the cost of individual rights.

While new technologies offer significant benefits, they also pose risks that demand careful management. By balancing innovation with vigilance, we can create a future where technology enhances usability without compromising our fundamental freedoms.

Speeding Up Development and Navigating Security Risks

In software development, time is often of the essence. Developers are constantly pushed to deliver faster and more efficiently. Tools like GitHub Copilot have emerged, promising to accelerate coding tasks significantly. A study by GitHub found that developers using Copilot completed tasks up to 55% faster compared to those who didn’t use it [1]. However, as with speeding on the road, increased velocity in coding can lead to more significant risks and potential accidents.

The Need for Speed and Its Consequences

When driving at high speeds, the time available to react to unexpected events diminishes. Similarly, when coding tasks are completed rapidly with the help of AI copilots, developers have less time to thoroughly review and understand the code. This accelerated pace can lead to over-reliance on AI suggestions and decreased familiarity with the codebase. This lack of deep understanding might obscure potential issues that could be caught with a slower, more deliberate approach.

Narrowing the Focus

At higher speeds, drivers experience “tunnel vision,” where their field of view narrows, making it harder to perceive hazards. In coding, this translates to a reduced ability to catch subtle issues or security vulnerabilities introduced by AI-generated suggestions. A study by New York University highlighted this risk, finding that nearly 40% of code suggestions by AI tools like Copilot contained security vulnerabilities [2].

For a very basic example of how this could play out, consider an authentication module implemented from Copilot’s suggestions that skips input sanitization, opening the door to vulnerabilities like SQL injection or cross-site scripting (XSS). This omission might go unnoticed until the feature is in production, where it could lead to a security breach.
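To make that concrete, here is a minimal Go sketch of the difference, assuming a hypothetical users table and the standard database/sql package; the first function concatenates untrusted input the way a quick AI suggestion might, while the second binds it as a parameter.

```go
package authn

import "database/sql"

// vulnerableLookup builds the statement by string concatenation, the kind of
// shortcut an AI suggestion can slip in. Input like "' OR '1'='1" changes the
// meaning of the SQL and bypasses the intended filter.
func vulnerableLookup(db *sql.DB, username string) (*sql.Rows, error) {
	query := "SELECT id, password_hash FROM users WHERE username = '" + username + "'"
	return db.Query(query)
}

// saferLookup passes the untrusted value as a bound parameter, so the driver
// treats it strictly as data and the statement's structure cannot change.
func saferLookup(db *sql.DB, username string) (*sql.Rows, error) {
	return db.Query("SELECT id, password_hash FROM users WHERE username = ?", username)
}
```

Both versions compile and pass a quick glance; only careful review, testing, or static analysis catches the first one, and those are exactly the activities that get squeezed when the pace picks up.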

This reduction in situational awareness can be attributed to the lack of contextual knowledge AI copilots have. While they can provide syntactically correct code, they lack a nuanced understanding of the specific application or environment, leading to contextually insecure or inappropriate suggestions.

The Impact of High-Speed Errors

Higher driving speeds result in more severe accidents due to the increased force of impact. In coding, the severity of errors introduced by AI suggestions can be significant, especially if they compromise security. The GitHub study noted improvements in general code quality, as Copilot often follows best practices and common coding patterns [1]. However, the rapid pace and reliance on AI can mean fewer opportunities for developers to learn from their work and catch potential issues, leading to severe consequences in the long run.

Balancing Speed and Safety

To harness the benefits of tools like Copilot while mitigating the risks, developers and teams can adopt several strategies:

  1. Enhanced Code Reviews: As the volume of code increases with the use of AI copilots, rigorous code reviews become even more crucial. Teams should ensure that every piece of code is thoroughly reviewed by experienced developers who can catch issues that the AI might miss.
  2. Integrating Security Tools: Using AI-based code reviewers and linters can help identify common issues and security vulnerabilities. These tools can act as a first line of defense, catching problems early in the development process.
  3. Continuous Learning: Establishing a culture of continuous learning ensures that developers stay updated with the latest security practices and coding standards. This ongoing education helps them adapt to new challenges and integrate best practices into their work.
  4. Practical Update Mechanisms: Implementing reliable and practical update mechanisms ensures that when issues are found, they can be addressed quickly and effectively. This proactive approach minimizes the impact of deployed vulnerabilities.
  5. Balanced Use of AI: Developers should use AI copilots as assistants rather than crutches. By balancing the use of AI with hands-on coding and problem-solving, developers can maintain a high level of familiarity with their codebases.

Conclusion

The advent of AI coding tools like GitHub Copilot offers a significant boost in productivity, enabling developers to code faster than ever before. However, this speed comes with inherent risks, particularly concerning security vulnerabilities and reduced familiarity with the code.

At the same time, we must recognize that attackers can also leverage these sorts of tools to accelerate their malicious activities, making it crucial to integrate these tools ourselves. By implementing robust code review processes, integrating security analysis tools into the build pipeline, and fostering a culture of continuous learning, teams can effectively manage these risks and use AI tools responsibly.

The game is getting real, and now is the time to put the scaffolding in place to manage both the benefits and the risks of this new reality.

Groundhog Day: Learning from Past Key and Credential Compromises

As they say, “Those who cannot remember the past are condemned to repeat it.” Looking back at the last decade, it seems we are caught in our own little Groundhog Day, re-experiencing the consequences of weak authentication and poor key management over and over.

It is not that we don’t know how to mitigate these issues; it’s more that, as organizations, we tend to avoid making uncomfortable changes. For example, the recent spate of incidents involving Snowflake customers being compromised appears to be related to Snowflake not mandating multi-factor authentication or a strong hardware-backed authenticator like Passkeys for its customers.

At the same time, not all of these key and credential thefts are related to users; many involve API keys, signing keys used by services, or code signing keys: keys we have been failing to manage in a way commensurate with the risks they represent to our systems and customers.

Timeline of Notable Incidents

  • Stuxnet (2010) – Code Signing Keys: Utilized stolen digital certificates from Realtek and JMicron to authenticate its code, making it appear legitimate.
  • Adobe (2012) – Code Signing Keys: Attackers compromised a build server and signed malware with valid Adobe certificates.
  • Target (2013) – Developer Credentials: Network credentials stolen from a third-party HVAC contractor via malware on an employee’s home computer.
  • Neiman Marcus (2013) – Developer Credentials: Malware installed on systems, possibly through compromised credentials or devices from third-party vendors.
  • Home Depot (2014) – Developer Credentials: Malware infected point-of-sale systems, initiated by stolen vendor credentials, potentially compromised via phishing.
  • Equifax (2017) – API Keys: Exploitation of a vulnerability in Equifax’s website application exposed API keys used for authentication.
  • CCleaner (2017) – Code Signing Keys: Attackers inserted malware into CCleaner’s build process, distributing malicious versions of the software.
  • Ticketmaster (2018) – Developer Credentials: Malware from a third-party customer support product led to the compromise of payment information.
  • ASUS Live Update (2018) – Code Signing Keys: Attackers gained access to ASUS’s code signing keys and distributed malware through the update mechanism.
  • Google Cloud Key Leak (2019) – API Keys: Internal tool inadvertently exposed customer API keys on the internet, leading to potential data exposure.
  • Facebook Instagram API Breach (2019) – API Keys: Plaintext passwords were accessible to employees due to misuse of an internal API.
  • DoorDash Data Breach (2019) – API Keys: Unauthorized access to systems included the compromise of API keys, exposing sensitive customer data.
  • Mission Lane (2020) – Developer Credentials: Malware introduced into the development environment from a developer’s compromised personal device.
  • BigBasket (2020) – Developer Credentials: Data breach with over 20 million customer records exposed, suspected initial access through compromised developer credentials.
  • Twitter API Key Compromise (2020) – API Keys: Attackers gained access to internal systems, possibly compromising API keys or administrative credentials.
  • Amazon AWS Key Exposure (2021) – API Keys: Misconfigured AWS S3 bucket led to the exposure of API keys and sensitive customer data.
  • Nvidia (2022) – Code Signing Keys: Stolen code signing certificates were used to sign malware, making it appear legitimate.
  • Microsoft Storm 0558 (2023) – Signing Keys: Attackers gained access to email accounts of government agencies and other organizations by forging authentication tokens using a stolen Microsoft signing key.

When we look at these incidents, we see a few common themes, including:

  1. Repetitive Failures in Security Practices:
    • Despite awareness of the issues, organizations continue to face the same security breaches repeatedly. This suggests a reluctance or failure to implement effective long-term solutions.
  2. Resistance to Necessary Changes:
    • Organizations often avoid making uncomfortable but necessary changes, such as enforcing multi-factor authentication (MFA) or adopting stronger authentication methods like Passkeys. This resistance contributes to ongoing vulnerabilities and compromises.
  3. Diverse Sources of Compromise:
    • Key and credential thefts are not solely due to user actions but also involve API keys, service signing keys, and code signing keys. This highlights the need for comprehensive key management and protection strategies that cover all types of keys.
  4. Broader Key Management Practices:
    • Moving from shared secrets to asymmetric credentials for workloads and machines can significantly enhance security. Asymmetric credentials, especially those backed by hardware, are harder to steal and misuse compared to shared secrets. Techniques like attestation, which is the device equivalent of MFA, can provide additional security by verifying the authenticity of devices and systems (see the sketch following this list).
  5. Third-Party Risks:
    • Even if we deploy the right technologies in our own environment, we often ignore the security practices of our upstream providers. Several incidents involved third-party vendors or services (e.g., Target, Home Depot), highlighting the critical need for comprehensive third-party risk management. Organizations must ensure that their vendors adhere to strong security practices and protocols to mitigate potential risks.
  6. Mismanagement of API and Signing Keys:
    • Incidents such as the Google Cloud Key Leak and Amazon AWS Key Exposure point to misconfigurations and accidental exposures of API keys. Proper configuration management, continuous monitoring, and strict access controls are essential to prevent these exposures.
  7. Importance of Multi-Factor Authentication (MFA):
    • The absence of MFA has been a contributing factor in several breaches. Organizations need to mandate MFA to enhance security and reduce the risk of credential theft.
  8. Need for Secure Code Signing Practices:
    • The use of stolen code-signing certificates in attacks like Stuxnet and Nvidia underscores the importance of securing code-signing keys. Implementing hardware security modules (HSMs) for key storage and signing can help mitigate these risks.
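To make point 4 concrete, here is a minimal sketch using Go’s standard crypto/ed25519 package of why asymmetric, challenge-based credentials change the economics of theft: the verifier stores only a public key, so there is no reusable bearer secret to steal from it. In practice the private key would be generated and held in a TPM or HSM and paired with attestation; this shows only the core exchange.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

func main() {
	// Provisioning: the workload generates a keypair; ideally the private key
	// is created inside, and never leaves, a TPM or HSM. Only the public key
	// is registered with the service.
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}

	// Authentication: the service issues a fresh random challenge...
	challenge := make([]byte, 32)
	if _, err := rand.Read(challenge); err != nil {
		panic(err)
	}

	// ...the workload signs it with its private key...
	sig := ed25519.Sign(priv, challenge)

	// ...and the service verifies the signature against the registered public
	// key. Unlike an API key or password, nothing the service stores can be
	// replayed by an attacker who steals it.
	fmt.Println("authenticated:", ed25519.Verify(pub, challenge, sig))
}
```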

Conclusion

Looking back at the past decade of key and credential compromises, it’s clear that we’re stuck in a loop, facing the same issues repeatedly. It’s not about knowing what to do—we know the solutions—but about taking action and making sometimes uncomfortable changes.

Organizations need to step up and embrace multi-factor authentication, adopt strong hardware-backed authenticators like Passkeys, and move workloads from shared secrets to hardware-backed asymmetric credentials for better security. We also can’t overlook the importance of managing third-party risks and ensuring proper configuration and monitoring to protect our API and signing keys.

Breaking free from this cycle means committing to these changes. By learning from past mistakes and taking a proactive approach to key and credential management, we can better protect our systems and customers. It’s time to move forward.

Integrating Security: Making Safe Software Development Seamless and Productive

As software progresses from the developer’s machine to staging and finally to production, it undergoes significant changes. Each environment presents unique challenges, and transitions between these stages often introduce security weaknesses. By integrating security practices early in the development process, we bridge these gaps and help ensure we deliver a secure product. If done right, we can also improve developer productivity.

The Journey of Software Through Different Stages

In the development stage, the focus is primarily on writing code and quickly iterating on features. Security is often an afterthought, and practices like deployments, hardening, key management, auditing, and authentication between components are usually minimal or nonexistent. As the software moves to staging, it encounters a more production-like environment, revealing a new set of problems. Dependencies, network configurations, and integrations that worked on the developer’s machine may fail or expose vulnerabilities in staging.

When the software finally reaches production, the stakes are higher. It must handle real-world traffic, maintain uptime, and protect sensitive data. Here, the absence of strong security measures overlooked in earlier stages can lead to significant issues. Weak authentication, lack of auditing, and insecure configurations become glaring problems that can result in breaches and compromises.

The Role of Security in the Software Lifecycle

Security compromises often occur in the gaps between development, staging, and production. These gaps exist because each stage is treated as an isolated entity, leading to inconsistencies in security practices. To secure the software supply chain, we must address these gaps from the very beginning.

Integrating security practices early in the development process ensures that security is not an afterthought but a core component of software development. Here’s how we can achieve this, in no particular order:

  1. Start from the beginning with a Threat Model
    The right place to start addressing this problem is with a threat model, involving everyone in its development so the lessons from its creation are factored into the product’s manifestation. This model also needs to continuously evolve and adapt as the project progresses. A well-defined threat model helps identify potential security risks and guides the implementation of appropriate security controls and design elements throughout software development.
  2. Early Implementation of Authentication and Authorization
    From the first line of code, understand how you will implement authentication and authorization mechanisms for both users and workloads. A good example of how to do this for workloads is SPIFFE, which provides a robust framework for securely identifying and authenticating workloads. SPIRL makes deploying and adopting SPIFFE easy, laying the groundwork for unifying authentication and authorization across various stages of development. Ensure that every component, service, and user interaction is authenticated and authorized appropriately. This practice prevents unauthorized access and reduces the risk of security breaches. A sketch of what this can look like with SPIFFE follows this list.
  3. Continuous Auditing and Monitoring
    Integrate auditing and monitoring tools into the development pipeline and runtime environments. By continuously monitoring code changes, dependencies, configurations, and runtime behavior, we can detect and address vulnerabilities early. Additionally, automated analysis tools that integrate into CI/CD pipelines, like those from Binarly, can perform security checks to ensure that vulnerabilities and back doors didn’t sneak their way in during the build. Synthetic load testing modeled after real use cases can help detect outages and abnormal behavior early, ensuring that issues are identified before reaching production, and can serve as useful production monitoring automation. 
  4. Automate Security Infrastructure
    Incorporate automated security infrastructure to offload security tasks from developers. Middleware frameworks, such as service meshes and security orchestration platforms, provide standard interfaces for security operations. For example, using a service mesh like Istio can automatically handle authentication, authorization, and encryption between microservices without requiring developers to implement these security features manually. This allows developers to focus on shipping code while ensuring consistent security practices are maintained. These frameworks enable faster deployment into production, ensuring that security checks are embedded and consistent across all stages.
  5. Implement Infrastructure as Code (IaC)
    Implement Infrastructure as Code (IaC) to ensure consistent configuration and reduce human error. By automating infrastructure management, IaC not only enhances the reliability and security of your systems but also significantly improves your ability to recover from incidents and outages when they occur. With IaC, you can quickly replicate environments, roll back changes, and restore services, minimizing downtime and reducing the impact of any disruptions.
  6. Build for Continuous Compliance
    Pair IaC with continuous compliance automation to create artifacts that demonstrate adherence to security policies, reducing the compliance burden over time. Do this by using verifiable data structures, similar to how Google and others use Trillian, to log all server actions verifiably. These immutable logs create a transparent and auditable trail of activities, improving traceability and compliance.
  7. Minimize Human Access to Production
    Minimize human access to production environments to reduce the risk of errors and unauthorized actions. For example, use a privileged access management system to implement access on demand, where temporary access is granted to production systems or data only when needed and with proper approvals. This ensures that developers can debug and maintain high availability without permanent access, enhancing security while allowing necessary interventions.
  8. Implement Code Signing and Supply Chain Monitoring
    Implement code signing to enable the verification of the authenticity and integrity of the software at every stage. Regularly audit your dependencies to identify and mitigate the risks that they come with. As Ken Thompson famously said, “You can’t trust code that you did not totally create yourself.” This highlights the importance of not relying on blind faith when it comes to the security of external code and components. 
  9. Proper Secret Management from Day One
    It’s common for organizations to overlook proper secret management practices from the onset and have to graft them on later. Ensure developers have the tools to adopt proper secret management from day one, emphasizing the use of fully managed credentials rather than treating credentials as secrets. Solutions like SandboxAQ’s Aqtive Guard can help you understand where proper secret management isn’t happening and solve the problem of last-mile key management. Tools like this turn secret management into a tool for risk management, not just sprawl management.
  10. Effective Vulnerability Management
    Vulnerability management is a critical aspect of maintaining a secure software environment. A key component of this is monitoring third-party dependencies, as vulnerabilities in external libraries and frameworks can pose significant risks. Implement tools that continuously scan for vulnerabilities in these dependencies and provide actionable insights without overwhelming your team with noise. By leveraging automated vulnerability management solutions, you can prioritize critical issues and reduce alert fatigue. These tools should integrate seamlessly into your development pipeline, allowing for real-time vulnerability assessments and ensuring that only relevant, high-priority alerts are raised. This approach not only enhances your security posture but also allows your team to focus on meaningful security tasks, rather than being bogged down by an excessive number of alerts.
  11. Shift Left with Developer-Friendly Security Tools
    Equip developers with tools and platforms that seamlessly integrate into their workflows, offering security features without adding friction. User-friendly and non-intrusive security tools increase the likelihood of early and correct adoption by developers. For instance, tools like GitHub’s Dependabot and Snyk help identify and fix vulnerabilities in dependencies, while CodeQL allows for deep code analysis to uncover security issues before they reach production. These tools make security a natural part of the development process, enabling developers to focus on writing code while ensuring robust security practices are maintained.
  12. Consider the Entire Lifecycle of The Offering
    Security is an ongoing process that extends beyond development and deployment. Regularly assess and enhance the security posture of your production environment by implementing frequent security reviews, patch management, and incident response plans. Ensure that security practices evolve alongside the software by utilizing comprehensive security playbooks. These playbooks should offer clear, repeatable steps for handling common security tasks and incidents and be updated regularly to address new threats and best practices. Crucially, feedback loops from incident response and post-mortems are essential for continuous improvement. These loops provide valuable insights into past issues, helping to prevent similar problems in the future and fostering a culture of ongoing enhancement.
  13. Profile Your System and Understand Bottlenecks
    Profile your system and understand where the bottlenecks are. Have plans to both prevent them and to respond when they are encountered. This proactive approach ensures that performance issues are addressed before they impact the user experience and that you have a strategy in place for quick remediation when problems do arise.
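As referenced in point 2, here is a rough sketch of what workload-to-workload authentication can look like in practice, assuming the go-spiffe v2 library and a SPIRE (or similar) Workload API endpoint available to the process; the SPIFFE ID and URL are placeholders. The client obtains a short-lived X.509 SVID from the Workload API and will only complete the TLS handshake with a server that presents the expected identity.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"

	"github.com/spiffe/go-spiffe/v2/spiffeid"
	"github.com/spiffe/go-spiffe/v2/spiffetls/tlsconfig"
	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Obtain this workload's identity (an X.509 SVID) and trust bundle from
	// the local Workload API; the source keeps the credentials rotated.
	source, err := workloadapi.NewX509Source(ctx)
	if err != nil {
		panic(err)
	}
	defer source.Close()

	// Only accept a server that presents this exact SPIFFE ID (placeholder).
	serverID := spiffeid.RequireFromString("spiffe://example.org/ns/payments/sa/api")

	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: tlsconfig.MTLSClientConfig(source, source, tlsconfig.AuthorizeID(serverID)),
		},
	}

	resp, err := client.Get("https://payments.internal.example:8443/balance")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```

A service mesh like Istio wires this same pattern in at the proxy layer, which is how it can provide authentication and encryption between microservices without application changes.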

Conclusion

Early on, normalizing security practices across all stages of software development may be challenging, but it’s crucial. By consistently applying security measures from development to production, we can bridge the gaps that lead to vulnerabilities. This not only improves security but also enhances developer productivity.

Platforms and middleware that automate security checks, enforce policies, and provide clear visibility into security issues help developers stay productive while maintaining a robust security posture, reducing outages and security incidents. By removing these security concerns from the developer’s workflow, we allow them to concentrate on what they do best: writing code.

Making security the easy and natural choice for developers enhances both security and productivity, leading to more secure and reliable software. While it may be difficult to fully implement these practices initially, developing a model that includes continuous threat analysis and plans to address any gaps will ensure long-term success. In doing so, we make the right thing the easy thing.

Thanks to Amir Omidi for his feedback on this post

From Static to Dynamic: Adapting PKI for Cloud-Native Architectures

When it comes to workload and service credential management, a common misconception is that you can simply reuse your existing Certificate Authority (CA) and Certificate Lifecycle Management (CLM) infrastructure to achieve your desired end-state more quickly. Given that nearly every organization has client and server TLS certificates for devices and web servers and many have made significant investments in managing and issuing these certificates, this notion seems logical. However, the design constraints of these use cases differ significantly. The transition from static to dynamic environments, along with the associated credential governance problems, makes it clear that a purpose-built approach is necessary for managing workload and service credentials.

The Static Nature of Traditional CA and CLM Infrastructure

Traditional CA and CLM infrastructure primarily deals with TLS server certificates, where the identities involved are usually domain names or IP addresses. These identifiers are largely static, pre-defined, and managed out-of-band. As a result, the lifecycle of these certificates, which includes issuance, renewal, and revocation, follows a relatively predictable pattern. These certificates are usually issued for validity periods ranging from 90 days to a year, and the processes and infrastructure are designed to handle these constraints. However, use cases surrounding workload and service credentials have a totally different set of constraints, given the highly dynamic environment and various regulatory frameworks they must adhere to.

The Dynamic Nature of Workloads and Services

Workloads and services operate in a significantly different environment. These identities often come and go as services scale up or down or undergo updates. The sheer scale at which workload and service credentials are issued and managed can be hundreds of times greater than that of traditional client and server TLS certificate use cases.

Unlike the static domain names or IP addresses used in these traditional TLS certificates, the identifiers for workloads and services are dynamically assigned as workloads spin up and down or are updated. Workload and service credentials often have a much shorter lifespan compared to server certificates. They might need to be reissued every few minutes or hours, depending on the nature of the workload and the policies of the environment. This changes the expectations of the availability and scalability of the issuing infrastructure, leading to a need for credentials to be issued as close to the workload as possible to ensure issuance doesn’t become a source of downtime.

These workloads and services are also often deployed in clusters across various geographic locations to ensure fault tolerance and scalability. Unlike server certificates that can rely on a centralized CA, workload and service credentials often need a more distributed approach to minimize latency and meet performance requirements.

Another critical difference lies in the scope of trust. For TLS client and server certificates, the security domain is essentially defined by a root certificate or the corresponding root store. However, in the context of workloads and services, the goal is to achieve least privilege. This often necessitates a much more complicated PKI to deliver on this least-privileged design.

Additionally, federation is often needed in these use cases, and it is usually not possible to standardize on just X.509. For example, you might need to interoperate with a service that works based on OAuth. Even in such cases, you want to manage all of this within one common framework to have unified control and visibility regardless of the technology used for the deployment.

Purpose-Built Solutions for Workload and Service Credential Management

Given the unique challenges of managing workload and service credentials in dynamic, cloud-native environments, existing CA and CLM investments often fall short. Purpose-built solutions, such as those based on SPIFFE (Secure Production Identity Framework for Everyone), tend to offer a more effective approach by providing dynamic, attested identities tailored for these environments.

Dynamic Identifier Assignment

Unlike static identifiers managed by traditional CLM solutions, SPIFFE dynamically assigns identifiers as workloads spin up and down, aligning with the nature of modern cloud environments.

Decentralized Issuance

By issuing credentials as close to the workload as possible, SPIFFE-based solutions reduce latency and align issuance with the availability goals of these deployments, ensuring that credentials are issued and managed efficiently without becoming a source of downtime.

Granular Policy Enforcement

SPIFFE-based solutions enable fine-grained policy enforcement through their namespacing and security domain concepts. This allows organizations to define and enforce policies at a more granular level, ensuring that workloads only access the resources they need.

Identity Governance and Federation

SPIFFE-based solutions tend to support extending existing identity governance programs to workload and service credentials while also facilitating seamless and secure access to external services across different trust domains.

Multifactor Authentication

These SPIFFE-based solutions also provide support for attestation, which can be thought of as multi-factor authentication for workloads. This attestation verifies a workload’s running state and environment, tying its credentials to that environment and helping minimize the risk of credential theft.

Integration with Other Systems and Protocols

These environments can seldom rely exclusively on just X.509, which is why SPIFFE supports both X.509 and JWT formats of credentials. This flexibility allows seamless integration with various systems and protocols within cloud deployments.
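For the cases above where a peer expects an OAuth-style bearer token rather than mTLS, the same Workload API can mint a short-lived JWT-SVID scoped to a specific audience. A minimal sketch, again assuming the go-spiffe v2 library and a local Workload API, with the audience value as a placeholder:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/spiffe/go-spiffe/v2/svid/jwtsvid"
	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// A JWTSource keeps a connection to the Workload API and mints
	// short-lived JWT-SVIDs on demand.
	source, err := workloadapi.NewJWTSource(ctx)
	if err != nil {
		panic(err)
	}
	defer source.Close()

	// Request a token scoped to the relying party's audience (placeholder);
	// that service validates it against the trust domain's public keys.
	svid, err := source.FetchJWTSVID(ctx, jwtsvid.Params{
		Audience: "https://billing.partner.example",
	})
	if err != nil {
		panic(err)
	}

	// svid.Marshal() returns the signed token to send, for example, in an
	// Authorization: Bearer header.
	fmt.Println("minted JWT-SVID for", svid.ID)
}
```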

Conclusion

While it might be tempting to reuse existing CA and CLM infrastructure for workload and service credential management, this approach fails to address the unique challenges posed by dynamic, cloud-native environments. The ephemeral nature of workload identities, the high frequency of credential issuance, and the need for granular trust domains necessitate purpose-built solutions like those based on SPIFFE.

These purpose-built systems are designed to handle the dynamic nature of modern environments, providing strong, attested identities and ensuring that credential management aligns with the evolving needs of cloud-native workloads. Understanding these distinctions is crucial for developing an effective identity management strategy for modern workloads and services without having to go through a reset event down the line.

To learn more about effective machine identity management, check out SPIRL and see how they are leading the way.

Credential Management vs. Secret Management: Choosing the Right Approach

If we examine the contents of most secret management solutions, like HashiCorp Vault, we will find that we primarily store the logical equivalent of user IDs and passwords for services, workloads, and machines. Much like the old-school practice of writing passwords on post-it notes and sharing them, these secrets are then distributed to whatever entity needs access to the protected resource. As a result, it is no surprise that we encounter the same problems as password management: weak secrets, shared secrets, and stolen secrets. This is why we see machine credentials and keys being stolen in events like STORM 0558.

Secret Management: Password Management at Machine Scale

Secret management can be seen as password management at a machine scale. These systems centralize secrets  (e.g., API keys, tokens, and passwords) to manage their sprawl and then make it easy to pass these shared secrets around various services and applications. However, just as simple passwords alone are seen as insufficient for user authentication—necessitating the adoption of multi-factor authentication (MFA) and migration to non-password authenticators like Passkeys—the same shortcomings apply to this legacy approach to managing these credentials.

The reality is that these secret managers were designed to address the symptoms rather than the root cause of secret sprawl. By storing and distributing these static secrets without treating them as credentials to be managed, you centralize the chaos but do not solve the fundamental problem: the need for robust, scalable machine identity management. Machines, services, and workloads, much like users, require purpose-built approaches to credential management. Just as we have applied multi-factor authentication, identity governance and administration (IGA), and solutions like Okta to user management, we need analogous systems for machines.

Credential Management for Machines

At machine and workload scale, credential management solutions need to be architected differently to handle the dynamic nature of modern cloud environments. This is where purpose-built solutions come into play. These solutions must address the unique challenges of credential management in dynamic, cloud-native environments. Unlike secret management, which often relies on static secrets, credential management provides dynamic, attested identities for workloads. For example, SPIFFE (Secure Production Identity Framework for Everyone) offers a strategy for managing dynamic identities, providing the foundation for standards-based, scalable, and robust workload identity management.
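The practical difference is easiest to see side by side. Here is a minimal sketch, assuming the go-spiffe v2 library, a local Workload API, and a placeholder environment variable name: the secret management pattern hands the workload a long-lived bearer value, while the credential management pattern has the platform attest the workload and issue it a short-lived, non-shared identity.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

func main() {
	// Secret management pattern: a long-lived shared secret delivered out of
	// band (env var, vault read, config file). Anyone who copies it can
	// replay it until someone notices and rotates it.
	fmt.Println("static API key present:", os.Getenv("PAYMENTS_API_KEY") != "")

	// Credential management pattern: the local Workload API attests this
	// process and issues a short-lived X.509 SVID bound to its identity.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	svid, err := workloadapi.FetchX509SVID(ctx)
	if err != nil {
		panic(err)
	}

	fmt.Println("workload identity:", svid.ID)
	fmt.Println("credential expires:", svid.Certificates[0].NotAfter) // hours, not months
}
```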

Higher-Level Concepts in Workload Credential Management:

  1. Provisioning and Deprovisioning:
    • Efficiently creating, updating, and removing machine identities as services start, stop, and scale.
  2. Role-Based Access Control (RBAC):
    • Assigning access rights based on predefined roles, ensuring that machines and services have appropriate permissions.
  3. Periodic Reviews:
    • Conducting regular reviews of machine access rights to ensure compliance and appropriateness.
  4. Policy Enforcement:
    • Defining and enforcing access control policies, ensuring that machines and services adhere to security guidelines.
  5. Audit and Reporting:
    • Generating comprehensive reports to demonstrate compliance with regulatory requirements and internal policies.
  6. Risk Analysis:
    • Identifying and mitigating risks associated with machine identities and their entitlements.
  7. Behavioral Analysis:
    • Monitoring machine behavior to detect anomalies and potential security threats.

Just as legacy approaches to secret management fail for workloads, machines, and services, these legacy IGA approaches will not work either. We need built-for-purpose solutions to adapt and not repeat the mistakes of the past. Those solutions will look like credential management systems that enable these functions as a byproduct of how they work, rather than bolting them onto an existing secret management or IGA process. 

This is why I have been an advisor to SPIRL since its founding. The founders, Evan and Eli, are why I knew I had to be involved. As the authors of the standard and having had real production-at-scale experience with SPIFFE, I knew they had what it took, which is why I have been working with them in a more formal capacity as of this last month. They are building the infrastructure that makes, as I like to say, the right thing the easy thing, ensuring their customers don’t repeat the mistakes of the past when it comes to workload and service identity.

Conclusion

Secret management tools were a necessary step in addressing the sprawl of machine secrets, and they will continue to be needed. However, as the complexity and scale of cloud environments grow, so does the need for more sophisticated solutions for workload and service credentials. Credential management systems provide the dynamic, robust framework required for modern machine identity management.

By adopting credential management as the primary approach and using secret management as the exception, organizations can achieve greater security, scalability, and operational efficiency in their cloud-native architectures. To learn more about effective machine identity management, check out SPIRL and see how they are leading the way.

ACME vs. SPIFFE: Choosing the Right One

In the world of certificate lifecycle management for workloads, two approaches often come into focus: ACME (Automated Certificate Management Environment) and SPIFFE (Secure Production Identity Framework for Everyone). While both can deliver certificates to a device or workload, they cater to different use cases. Let’s explore their core differences and why these matter, especially for dynamic workload deployments.

ACME: Proving Control Over Identifiers

ACME is primarily designed to prove control over an identifier, typically a domain name, and to deliver a certificate that demonstrates successful control verification.

  • Control Verification: ACME verifies control through various methods such as HTTP-01, DNS-01, TLS-ALPN, and External Account Binding.
  • Attestation (Optional): Attestation in ACME is secondary and optional, primarily used to verify whether the key is well protected.
  • Pre-assigned Identifiers: ACME assumes that the identifier (like a domain name) is pre-assigned and focuses on validating control over this identifier.

This approach is particularly useful for scenarios where identifiers are static and pre-assigned, making it ideal for server-authenticated TLS and applications that rely on domain names.
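To make the control-verification idea tangible, here is a deliberately simplified Go sketch of the HTTP-01 challenge response described in RFC 8555; the token and key-authorization values are placeholders for what a real ACME client receives from the CA and computes from its account key.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	// Placeholders: a real ACME client receives the token from the CA and
	// derives the key authorization from the token plus a digest of its
	// account key.
	token := "evaGxfADs6pSRb2LAv9IZf17Dt3juxGJ-PCt92wr-oA"
	keyAuthorization := token + ".<base64url SHA-256 of the account key thumbprint>"

	// The CA fetches this well-known URL on the domain being validated; a
	// correct response proves control over the identifier, after which the
	// certificate can be issued.
	http.HandleFunc("/.well-known/acme-challenge/", func(w http.ResponseWriter, r *http.Request) {
		if strings.TrimPrefix(r.URL.Path, "/.well-known/acme-challenge/") != token {
			http.NotFound(w, r)
			return
		}
		fmt.Fprint(w, keyAuthorization)
	})

	log.Fatal(http.ListenAndServe(":80", nil))
}
```

Notice that nothing here says anything about what the workload is or where it runs; the only thing proven is control of the name, which is precisely the gap SPIFFE’s attestation-based model addresses.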

SPIFFE: Dynamic Assignment of Identifier Based on Attestation

SPIFFE, conversely, is designed for dynamic workloads, which frequently change as services scale or update. SPIFFE assigns identifiers to workloads dynamically, based on attestation.

  • Identifier Assignment: SPIFFE assigns an identifier to the subject (such as a workload) using attestation about the subject to construct this identifier.
  • Attestation: Unlike ACME, attestation is a core component in SPIFFE, enabling robust multi-factor authentication (MFA) for workloads based on what is running and where it is running.
  • Namespacing and Least Privilege: SPIFFE facilitates the namespacing of identifiers, building a foundation that enables authorization frameworks that promote least privilege, ensuring workloads only access necessary resources.
  • Minimal Security Domains: At the core of SPIFFE is the concept of security domains, which serve as trust boundaries between services, helping to minimize the attack surface.
  • JWT and X.509: SPIFFE supports SVIDs in both X.509 and JWT formats, enabling seamless integration with various systems and protocols within cloud deployments.

Importance in Dynamic Workload Deployments

The differences between ACME and SPIFFE are particularly significant in dynamic workload deployments:

  • Flexibility and Scalability: SPIFFE’s dynamic identifier assignment is highly suitable for environments where workloads are frequently spun up and down, such as in microservices architectures and Kubernetes clusters.
  • Role-Based Authentication: By making attestation a core component and promoting least privilege, SPIFFE ensures that each workload is authenticated and authorized precisely for its intended role.
  • Granular Policy Enforcement: SPIFFE’s namespacing and minimal security domain features enable fine-grained policy enforcement, enabling organizations to define and enforce policies at a more granular level, such as per-workload or per-service.
  • Identity Federation: SPIFFE supports identity federation, enabling different trust domains to interoperate securely. This is particularly beneficial in multi-cloud and hybrid environments where different parts of an organization, or third parties, need to interoperate.

Conclusion

While both ACME and SPIFFE are used for certificate management, they cater to different scenarios and can complement each other effectively. ACME is ideal for static environments where identifiers are pre-allocated, focusing on certificate lifecycle management for issuing, renewing, and revoking certificates. It excels in managing the lifecycle of certificates for web servers and other relatively static resources. 

On the other hand, SPIFFE excels in dynamic, high-scale environments, emphasizing credential lifecycle management with flexible and robust authentication and authorization through dynamic identifiers and attestation. SPIFFE is designed for modern, cloud-native architectures where workloads are ephemeral, and identifiers need to be dynamically issued and managed. 

By understanding these differences, you can leverage both ACME and SPIFFE to enhance your identity management strategy. Use ACME for managing certificates in static environments to ensure efficient lifecycle management. Deploy SPIFFE for dynamic environments to provide strong, attested identities for your workloads.

Building Effective Roadmaps through Mission, Vision, and Strategy Alignment

Creating a cohesive and effective roadmap requires more than just a list of tasks and deadlines; it requires a clear mission, vision, and strategy that aligns the entire organization. This alignment ensures that every team member understands the overarching goals and the path to achieving them, helping facilitate a unified and focused approach to succeeding.

Mission: The Foundation of Purpose

The mission statement serves as the foundation, articulating the core purpose of the organization. It answers why the organization exists and what it aims to achieve. For example:

Google Cloud’s mission to enable global digital transformation through superior infrastructure and industry-specific solutions guides all efforts.

A strong mission focuses on the problem, providing clarity on the organization’s purpose.

Vision: The Aspirational Future

The vision statement paints an aspirational picture of the future the organization seeks to create within the next 2-5 years. It describes desired outcomes and the impact of the organization’s efforts, such as envisioning customers leveraging advanced features to transform their operations. For example:

Google Cloud’s vision is to empower businesses of all sizes to innovate faster and achieve operational excellence through cutting-edge technology and seamless integration with existing technology investments.

While the mission addresses the ‘why,’ the vision focuses on the ‘what’—what success looks like in practical terms.

Strategy: The Path to Achievement

Strategy bridges the gap between the current state and the envisioned future. It outlines ‘how’ the organization will achieve its mission and vision, including:

  • Assessing the current environment, including market conditions and internal capabilities. For example: Google Cloud operates in a highly competitive market with rapid technological advancements, so understanding customer needs for scalable and secure cloud solutions is crucial.
  • Using data and trends to inform strategic decisions. For example: analyzing trends in cloud adoption, security requirements, and industry-specific needs helps Google Cloud identify opportunities for growth and areas where it can provide unique value.
  • Setting clear, measurable business outcomes. For example: Google Cloud aims to achieve a 25% increase in enterprise adoption of its cloud services within the next two years by enhancing its product offerings and expanding its market reach.
  • Deciding where to focus resources and explaining why certain paths are chosen. For example: prioritizing the development of AI and machine learning capabilities tailored to industry-specific solutions to address top enterprise blockers, a decision based on market demand and internal strengths in these technologies.

Roadmap: The Plan

The roadmap translates the strategic vision into a plan, providing an 18-24 month overview of key milestones. It breaks down big problems into manageable tasks, ensuring alignment with overall goals. A good roadmap is not just a schedule; it’s a strategic tool that shows how initiatives drive business success, allowing for flexibility and adaptation while staying true to the mission and vision.

For example:

  • Q1 2024: Launch Enhanced AI and Machine Learning Tools
    • Initiative: Develop and release new AI and machine learning tools designed to address specific industry needs, such as healthcare analytics and financial risk modeling.
    • Milestone: Achieve a 10% increase in customer adoption of AI services within three months of launch.

  • Q2 2024: Expand Data Security Solutions
    • Initiative: Introduce advanced data security features, including encryption and threat detection, to meet growing cybersecurity demands.
    • Milestone: Secure certifications for new security standards and onboard 50 new enterprise customers by the end of the quarter.

  • Q3 2024: Integrate Cloud Services with Major ERP Systems
    • Initiative: Develop seamless integrations with leading ERP systems to streamline operations for enterprise customers.
    • Milestone: Complete integration with three major ERP platforms and pilot with 10 key clients.

  • Q4 2024: Launch Global Data Centers Expansion
    • Initiative: Open new data centers in strategic locations worldwide to enhance global reach and reliability.
    • Milestone: Increase data center footprint by 20% and reduce latency by 15% for international clients.

  • Q1 2025: Roll Out Customer Success Programs
    • Initiative: Implement customer success programs aimed at improving user experience and satisfaction.
    • Milestone: Achieve a 15% increase in customer satisfaction scores and reduce churn rate by 10%.

Achieving Alignment and Focus

Aligning the mission, vision, and strategy, and clearly communicating them, helps maintain organizational focus. This alignment prevents reactive firefighting and directional resets that consume resources. By fostering proactive planning and execution, organizations can drive meaningful progress toward business goals.

Conclusion

Effective roadmaps and work plans stem from a well-aligned mission, vision, and strategy. This alignment provides a stable foundation, clear goals, and a practical path to achieve them. By maintaining this alignment, organizations can move from reactive to proactive operations, regain stakeholder trust, and achieve sustainable success.

Thanks to Michael Windsor for his advice on this and other topics over the years.

Transitioning from Reactive to Proactive

Building Effective, Evidence-Based Roadmaps for Business Success

In defining roadmaps, the primary goal is to deliver on product objectives that will generate revenue. These roadmaps are prioritized to achieve business goals as swiftly as possible. Typically, the prioritized items listed on roadmaps are broadly visible to customers but often depend on behind-the-scenes work that is not customer-visible and is scoped when building out work plans.

These work plans allocate time for roadmap items, customer obligations, and technical debt. For instance, a typical allocation might be 70% for roadmap items and their dependencies, 20% for customer obligations, and 10% for technical debt. In healthy organizations, these target allocations will ebb and flow but are largely adhered to, with the roadmap representing the largest commitment in the allocation.

When this is not the case, work plans are treated as roadmaps. In these situations, the only work that gets done is what has been committed to customers, and technical debt has grown so severe that it is already harming the business. These organizations are reactive, perpetually operating in fight-or-flight mode.

Such reactive modes are a byproduct of a lack of structure in building the roadmap, a disconnect in the work supporting that roadmap, and a failure to recognize the current state of customer commitments. This, in turn, creates a situation where leadership loses faith in the organization’s ability to produce a compelling product roadmap.

This can be the beginning of a death spiral, where the expectation is that every item in the work plan maps back to customer revenue. The work needed to achieve forward-looking growth objectives gets de-prioritized, with the committed sales funnel driving all engineering spend. The work plan, and the management of it, replaces accountability for building a business and the associated engineering investment with a list of customer asks. As a consequence, the organization can never address technical debt or make the product design pivots needed to tackle issues like increasing deal size by expanding its footprint within customers, reducing the cost of goods sold, shortening the time to close deals, or creating new parallel go-to-market strategies.

While there is no one-size-fits-all approach to product planning, the principles outlined here offer valuable lessons that can be adapted to various organizational contexts. To transition to a more proactive and accountable roadmap-driven model, organizations must commit to a disciplined approach in which a significant portion of the effort is dedicated to developing a product that can deliver business outcomes, supported by a robust body of customer evidence.

This shift also involves ensuring alignment with internal stakeholders, such as product, engineering, and executive leadership. Achieving this alignment is key to preventing the cycle of reactive firefighting and directional resets that consume resources and hinder progress on business objectives.

By returning to first principles and building a compelling, evidence-based roadmap, organizations can, over time, move their culture from a reactive to a proactive operational mode, regain the trust of all participants, and as a result drive meaningful progress toward their business goals.

Automating Non-Human Identities: The Future of Production Key Management 

Historically, key management was seen as activities involving hardware security modules (HSMs), manual tasks, and audits. This approach was part of what we termed ‘responsible key management.’ However, HSMs were impractical for many use cases, and these manual tasks, typical of IT processes, were often poorly executed or never completed, frequently causing security incidents, outages, and unexpected work.

Simultaneously, as an industry, we began applying cryptography to nearly all communications and as a way to protect data at rest. This led to the adoption of cryptography as the method for authenticating hardware, machines, and workloads to apply access control to their activities. As a result, today, cryptography has become a fundamental component of every enterprise solution we depend on. This shift led us to attempt to apply legacy key management approaches at the enterprise scale. The increased use of cryptography within enterprises made it clear these legacy approaches ignored the majority of keys we relied on, so we took a tactical approach and created repositories to manage the sprawl of secrets. While a step forward, this approach also papered over the real problems with how we use, track, and manage keys.

It is time for us as an industry to start viewing cryptography and key management not just as a tax we must pay but as an investment. We need to manage these keys in an automated and scalable way that helps us manage risk in our businesses.

To do this, we need to start with a question: What are these keys, anyway? Broadly, I think of three categories of keys: long-lived asymmetric secrets like those associated with certificate authorities, long-lived shared secrets used for encryption and authentication, and modern-day asymmetric key credentials for users, devices, and workloads. The greatest growth in keys has been in the last category, so let’s focus on that for the purpose of this discussion.

Modern Credentials and Their Management

Modern-day asymmetric key-based credentials are not always “certificates,” but they generally bind some claim(s) to an asymmetric key pair. These credentials can be encoded as JSON, ASN.1, CBOR, TLVs, X.509, JWT, or some other format. They serve various purposes:

  1. User Certificates: Issued to individual users to authenticate their identity within an organization, these certificates provide secure access to corporate resources, such as an SSH certificate used by developers to access production. They bind a user’s identity to a cryptographic key pair, ensuring only authorized individuals access sensitive information and systems.
  2. Hardware Certificates: Assigned by manufacturers during production, these certificates uniquely identify hardware devices. They are often used to bootstrap the identity of machines or workloads, ensuring only authorized devices can access resources on your network.
  3. Machine Certificates: Common in operational IT environments, these certificates authenticate servers associated with domains, IP addresses, or device identifiers. They are typically used with TLS and for network access use cases like 802.1x, IKE, and various VPNs.
  4. Workload Certificates: In cloud and serverless environments, workload certificates perform access control close to the business logic to minimize security exposure and deliver on zero trust goals. These dynamic certificates often reflect both the underlying hardware and the workload running on it, acting like multi-factor authentication for devices. The frequent need to re-credential workloads makes issuing credentials mission-critical, as failure to do so can cause outages. This necessitates issuers in each failure domain (think of this as a cluster of clusters) hosting these workloads to ensure continuous operation.

What we can take from this is that we have been approaching credentials incorrectly by treating them as a key management problem. This approach is akin to using password managers for hardware, machines, and workloads, whereas, for users, we have moved toward multi-factor authentication and non-password-based authenticators.

Towards Automated and Scalable Key Management

If password managers or key vaults are not the right solution for machine authentication, what is? The answer is simpler than it might seem. Just as with users, these cases require built-for-purpose Identity Providers (IDPs). This is especially true for workloads, which dynamically spin up and down, making durable identifiers impractical. An IDP becomes a security domain for a given deployment, ensuring that workloads are accessible only by appropriate resources. This setup limits attackers’ lateral movement, allows for nearly instant granting and removal of access, minimizes the impact of compromises, and enables easy federation between deployments—all while providing a central point for identity governance and ensuring the cryptographic keys associated with credentials are well-managed and protected.

Getting Started

Modernizing key management starts with measurement. Identify the most common types of keys in your secret vaults, typically workload-related credentials. Deploy a workload-specific IDP, such as those enabled via SPIFFE, to transition these credentials out of the secret manager. Over time, the secret manager will store static secrets like API keys for legacy systems, while dynamic credentials are managed appropriately.
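
There is no single right way to do this measurement, but as a rough, hypothetical sketch, the following assumes you can export secret-manager metadata to a CSV file with a “type” column (the file name and column are assumptions about your tooling, not a real API) and tallies the types so the largest migration targets surface first:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"os"
	"sort"
)

// Assumes a CSV export with a header row containing at least a "type" column,
// e.g. values like "workload-cert", "api-key", or "db-password".
func main() {
	f, err := os.Open("secrets-inventory.csv") // hypothetical export file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	rows, err := csv.NewReader(f).ReadAll()
	if err != nil {
		log.Fatal(err)
	}
	if len(rows) < 2 {
		log.Fatal("no inventory rows found")
	}

	// Find the "type" column in the header row.
	typeCol := -1
	for i, name := range rows[0] {
		if name == "type" {
			typeCol = i
		}
	}
	if typeCol < 0 {
		log.Fatal(`no "type" column in export`)
	}

	// Count secrets per type.
	counts := map[string]int{}
	for _, row := range rows[1:] {
		counts[row[typeCol]]++
	}

	// Print types sorted by count, biggest migration targets first.
	types := make([]string, 0, len(counts))
	for t := range counts {
		types = append(types, t)
	}
	sort.Slice(types, func(i, j int) bool { return counts[types[i]] > counts[types[j]] })
	for _, t := range types {
		fmt.Printf("%6d  %s\n", counts[t], t)
	}
}
```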

From the start, avoid using your secret manager as an IDP, especially for new systems. Teams responsible for the operational burden of these systems usually support this change, as automated end-to-end credentialing of workloads is more agile, scalable, and secure, resulting in fewer outages and security incidents tied to secret managers and other non-production-quality dependencies.

From this point, the process becomes a cycle of identifying where static secrets or long-lived credentials are funneled through your secret manager and moving them to built-for-purpose credential lifecycle management solutions.

Multi-Factor Authentication for Workloads

Adopting a purpose-built workload IDP is a good start, but keys can still be stolen or leaked. For machines and workloads, use hardware attestation. Built-in hardware authenticators, such as Trusted Platform Modules (TPMs), create and secure keys within the device, ensuring they never leave it. TPMs also verify device integrity during boot, adding an extra layer of security. This combination provides stronger multi-factor authentication without the usability issues associated with similar patterns for user authentication.
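
As a rough sketch of what anchoring workload keys in hardware can look like, the example below uses Google's go-attestation library to open the TPM, enumerate its endorsement keys, and create an attestation key a verifier could challenge before issuing a workload credential. This is illustrative only; the full enrollment flow (credential activation, policy checks) is platform-specific and omitted here.

```go
package main

import (
	"fmt"
	"log"

	"github.com/google/go-attestation/attest"
)

func main() {
	// Open the platform TPM (requires appropriate OS permissions).
	tpm, err := attest.OpenTPM(nil)
	if err != nil {
		log.Fatalf("unable to open TPM: %v", err)
	}
	defer tpm.Close()

	// Endorsement keys are burned in by the manufacturer and anchor the
	// device's hardware identity.
	eks, err := tpm.EKs()
	if err != nil {
		log.Fatalf("unable to read EKs: %v", err)
	}
	fmt.Printf("found %d endorsement key(s)\n", len(eks))

	// Create an attestation key (AK). The private portion never leaves the
	// TPM; a verifier can challenge the AK's parameters to confirm it is
	// resident in genuine hardware before issuing a workload credential.
	ak, err := tpm.NewAK(nil)
	if err != nil {
		log.Fatalf("unable to create AK: %v", err)
	}
	defer ak.Close(tpm)

	params := ak.AttestationParameters()
	fmt.Printf("AK public blob is %d bytes\n", len(params.Public))
}
```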

Avoiding Common Mistakes

The most common mistake organizations make is applying existing systems to workload credential management problems without fully analyzing operational, scale, uptime, and security needs. For example, enterprise PKI teams might mandate using their existing CA infrastructure for managing workload credentials, assuming a certificate suffices. However, this often violates the principle of least privilege and struggles with dynamic identification required for workloads.

Existing credential issuance systems are designed for static, long-lived subjects, such as directory names, DNS names, or IP addresses, which don’t change frequently. In contrast, workload credentials may change every few minutes. Provisioning devices like network appliances before assigning durable identifiers adds to this challenge. New workload-based systems, like SPIFFE, assign identifiers based on runtime elements, preventing the same bad practices that led to secret sprawl and mismanaged key problems.
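
As a small illustration of runtime-derived naming, the sketch below builds a SPIFFE ID from a placeholder trust domain and path segments taken from runtime facts (a Kubernetes namespace and service account); in a real SPIRE deployment this mapping comes from registration entries and attestation rather than application code:

```go
package main

import (
	"fmt"
	"log"

	"github.com/spiffe/go-spiffe/v2/spiffeid"
)

func main() {
	// Placeholder trust domain acting as the deployment's security domain.
	td, err := spiffeid.TrustDomainFromString("prod.example.com")
	if err != nil {
		log.Fatal(err)
	}

	// The identifier is derived from runtime facts (namespace, service account)
	// rather than a static hostname or IP address, so it stays stable across
	// reschedules while remaining specific to the workload's role.
	id, err := spiffeid.FromPath(td, "/ns/payments/sa/billing")
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(id.String())     // spiffe://prod.example.com/ns/payments/sa/billing
	fmt.Println(id.MemberOf(td)) // true: membership checks enable trust-domain policy
}
```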

Reducing Reliance on Shared Secrets 

Moving away from shared secrets won’t eliminate the need for secret vaults but will significantly reduce the problem’s scope. As systems modernize, password-based authenticators will be updated or deprecated. Over time, we will see fewer shared, long-lived secrets used for workload identity, driven by zero trust and least privilege principles.

At the same time, we can do much to improve overall key management practices in production systems. However, that’s a topic for another post.

Closing Note

The challenges and opportunities in modern key management are significant, but by leveraging innovative solutions and focusing on automation and scalability, we can make substantial progress. As we adopt built-for-purpose Identity Providers (IDPs) and hardware attestations, it’s important to have the right tools and frameworks in place to succeed.

I have been working with SPIRL, a company focused on making the right thing the easy thing for developers, operations, and compliance. I see firsthand how the right platform investments can simplify the creation and operation of production systems. SPIRL achieves this by authoring and adopting open standards for identity and building reliable, scalable infrastructure that provides greater visibility and control.

Even if you don’t use SPIRL, focusing on these principles will help your organization better manage the complexities of modern workload-related key and credential management, ensuring greater productivity and security.