
Automating Non-Human Identities: The Future of Production Key Management 

Historically, key management was seen as activities involving hardware security modules (HSMs), manual tasks, and audits. This approach was part of what we termed ‘responsible key management.’ However, HSMs were impractical for many use cases, and these manual tasks, typical of IT processes, were often poorly executed or never completed, frequently causing security incidents, outages, and unexpected work.

Simultaneously, as an industry, we began applying cryptography to nearly all communications and as a way to protect data at rest. This led to the adoption of cryptography as the method for authenticating hardware, machines, and workloads to apply access control to their activities. As a result, today, cryptography has become a fundamental component of every enterprise solution we depend on. This shift led us to attempt to apply legacy key management approaches at the enterprise scale. The increased use of cryptography within enterprises made it clear these legacy approaches ignored the majority of keys we relied on, so we took a tactical approach and created repositories to manage the sprawl of secrets. While a step forward, this approach also papered over the real problems with how we use, track, and manage keys.

It is time for us as an industry to start viewing cryptography and key management not just as a tax we must pay but as an investment. We need to manage these keys in an automated and scalable way that helps us manage risk in our businesses.

To do this, we need to start with a question: What are these keys, anyway? Broadly, I think of three categories of keys: long-lived asymmetric secrets like those associated with certificate authorities, long-lived shared secrets used for encryption and authentication, and modern-day asymmetric key credentials for users, devices, and workloads. The greatest growth in keys has been in the last category, so let’s focus on that for the purpose of this discussion.

Modern Credentials and Their Management

Modern-day asymmetric key-based credentials are not always “certificates,” but they generally bind some claim(s) to an asymmetric key pair. These credentials can be encoded as JSON, ASN.1, CBOR, TLVs, X.509, JWT, or some other format. They serve various purposes (a short sketch of this claim-to-key binding follows the list):

  1. User Certificates:  Issued to individual users to authenticate their identity within an organization, these certificates provide secure access to corporate resources, such as an SSH certificate used by developers to access production. They bind a user’s identity to a cryptographic key pair, ensuring only authorized individuals access sensitive information and systems.
  2. Hardware Certificates: Assigned by manufacturers during production, these certificates uniquely identify hardware devices. They are often used to bootstrap the identity of machines or workloads, ensuring only authorized devices can access resources on your network.
  3. Machine Certificates: Common in operational IT environments, these certificates authenticate servers associated with domains, IP addresses, or device identifiers. They are typically used with TLS and for network access use cases like 802.1x, IKE, and various VPNs.
  4. Workload Certificates: In cloud and serverless environments, workload certificates perform access control close to the business logic to minimize security exposure and deliver on zero trust goals. These dynamic certificates often reflect both the underlying hardware and the workload running on it, acting like multi-factor authentication for devices. The frequent need to re-credential workloads makes issuing credentials mission-critical, as failure to do so can cause outages. This necessitates issuers in each failure domain (think of this as a cluster of clusters) hosting these workloads to ensure continuous operation.
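
To make the claim-to-key binding concrete, here is a minimal Go sketch that signs a couple of workload claims with a freshly generated ECDSA key to produce a JWT-style credential. The claim names, the SPIFFE-style identifier, and the golang-jwt library choice are illustrative assumptions, not a prescription for any particular format.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"fmt"
	"time"

	"github.com/golang-jwt/jwt/v5"
)

func main() {
	// The private key stays with the workload; only the public key
	// needs to be known to whoever verifies the credential.
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}

	// Illustrative claims: a workload identifier and a short lifetime.
	claims := jwt.MapClaims{
		"sub": "spiffe://example.org/ns/prod/sa/payments", // hypothetical identifier
		"exp": time.Now().Add(15 * time.Minute).Unix(),
	}

	// The credential is the signed token: claims bound to the key pair.
	token := jwt.NewWithClaims(jwt.SigningMethodES256, claims)
	signed, err := token.SignedString(key)
	if err != nil {
		panic(err)
	}
	fmt.Println(signed)
}
```

The encoding is interchangeable; what matters is that a verifier can tie the claims back to the key pair, not which container format carries them.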

What we can take from this is that we have been approaching credentials incorrectly by treating them as a key management problem. This approach is akin to using password managers for hardware, machines, and workloads, whereas, for users, we have moved toward multi-factor authentication and non-password-based authenticators.

Towards Automated and Scalable Key Management

If password managers or key vaults are not the right solution for machine authentication, what is? The answer is simpler than it might seem. Just as with users, these cases require built-for-purpose Identity Providers (IDPs). This is especially true for workloads, which dynamically spin up and down, making durable identifiers impractical. An IDP becomes a security domain for a given deployment, ensuring that workloads are accessible only by appropriate resources. This setup limits attackers’ lateral movement, allows for nearly instant granting and removal of access, minimizes the impact of compromises, and enables easy federation between deployments—all while providing a central point for identity governance and ensuring the cryptographic keys associated with credentials are well-managed and protected.

Getting Started

Modernizing key management starts with measurement. Identify the most common types of keys in your secret vaults, typically workload-related credentials. Deploy a workload-specific IDP, such as those enabled via SPIFFE, to transition these credentials out of the secret manager. Over time, the secret manager will store static secrets like API keys for legacy systems, while dynamic credentials are managed appropriately.
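
As a sketch of what moving these credentials out of the secret manager can look like, here is how a workload might fetch its identity from a SPIFFE Workload API endpoint using the go-spiffe library instead of reading a static secret. The agent socket path is an assumption and will vary by deployment.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// The agent socket path is deployment-specific; this one is an assumption.
	source, err := workloadapi.NewX509Source(ctx,
		workloadapi.WithClientOptions(workloadapi.WithAddr("unix:///run/spire/sockets/agent.sock")))
	if err != nil {
		panic(err)
	}
	defer source.Close()

	// The SVID (certificate plus key) is issued and rotated by the IDP;
	// nothing long-lived needs to sit in a secret manager.
	svid, err := source.GetX509SVID()
	if err != nil {
		panic(err)
	}
	fmt.Println("workload identity:", svid.ID)
	fmt.Println("valid until:", svid.Certificates[0].NotAfter)
}
```

The source keeps the SVID current as the issuer rotates it, which is exactly the property a vault full of static keys cannot give you.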

Avoid using your secret manager as an IDP from the start, especially for new systems. Teams responsible for the operational burden of these systems usually support this change, as automated end-to-end credentialing of workloads is more agile, scalable, and secure. The result is fewer outages and security incidents caused by secret managers and other non-production-quality dependencies.

From this point, the process becomes a cycle of identifying where static secrets or long-lived credentials are funneled through your secret manager and moving them to built-for-purpose credential lifecycle management solutions.

Multi-Factor Authentication for Workloads

Adopting a purpose-built workload IDP is a good start, but keys can still be stolen or leaked. For machines and workloads, use hardware attestations. Built-in hardware authenticators, such as Trusted Platform Modules (TPMs), create and secure keys within the device, ensuring they never leave it. TPMs can also verify device integrity during boot, adding an extra layer of security. This combination provides stronger multi-factor authentication without the usability issues associated with similar patterns for user authentication.

Avoiding Common Mistakes

The most common mistake organizations make is applying existing systems to workload credential management problems without fully analyzing operational, scale, uptime, and security needs. For example, enterprise PKI teams might mandate using their existing CA infrastructure for managing workload credentials, assuming a certificate suffices. However, this often violates the principle of least privilege and struggles with the dynamic identification workloads require.

Existing credential issuance systems are designed for static, long-lived subjects, such as directory names, DNS names, or IP addresses, which don’t change frequently. In contrast, workload credentials may change every few minutes. Devices like network appliances are often provisioned before any durable identifier is assigned, which adds to the challenge. Newer workload-focused systems, like SPIFFE, assign identifiers based on runtime elements, avoiding the same bad practices that led to secret sprawl and mismanaged keys.

Reducing Reliance on Shared Secrets 

Moving away from shared secrets won’t eliminate the need for secret vaults but will significantly reduce the problem’s scope. As systems modernize, password-based authenticators will be updated or deprecated. Over time, we will see fewer shared, long-lived secrets used for workload identity, driven by zero trust and least privilege principles.

At the same time, we can do much to improve overall key management practices in production systems. However, that’s a topic for another post.

Closing Note

The challenges and opportunities in modern key management are significant, but by leveraging innovative solutions and focusing on automation and scalability, we can make substantial progress. As we adopt built-for-purpose Identity Providers (IDPs) and hardware attestations, it’s important to have the right tools and frameworks in place to succeed.

I have been working with SPIRL, a company focused on making the right thing the easy thing for developers, operations, and compliance. I see firsthand how the right platform investments can simplify the creation and operation of production systems. SPIRL achieves this by authoring and adopting open standards for identity and by building reliable, scalable infrastructure that provides greater visibility and control.

Even if you don’t use SPIRL, by focusing on these principles, organizations can better manage the complexities of modern workload-related key and credential management, ensuring greater productivity and security.

A look at short-lived certificates, keys and the relevance of FIPS 140-2

Today the de facto standard purchasing criterion for a cryptographic component is a US Federal Standard called FIPS 140-2. It defines a set of assurance levels the US Federal Government uses to ensure that government agencies purchase cryptographic products that are interoperable and address threat-specific risks; Europe has a similar set of guidelines called Common Criteria.

These standards were adopted by the security industry because, in the beginning, the only purchasers of these products were government agencies; if you did not design your products to meet these requirements, they wouldn’t even be considered by your only customer segment.

As the security industry began selling outside of government agencies, it started with the Fortune 50 because they were the only ones who understood the risks their businesses were exposed to. This was a time when information security was, in essence, a new discipline, and the only tried-and-true examples these organizations had to learn from came from the government and military. For this reason, the solutions that were sold and deployed were watered-down versions of what was sold to governments.

As awareness of security risks spread to the rest of the corporate world, these same foundational standards continued to be used, in many respects without question. In fact, I am always surprised by how many customers I encounter who have mandated that a product support a specific FIPS assurance level yet have no understanding of what protection each level provides.

With the Snowden revelations, people are now starting to question these standing assumptions. Should we be using cryptography specified by governments at all? Is our adoption of government-approved cryptography making us more secure, or is it exposing us to new risks?

The real questions we must be asking ourselves are:

  1. What is the actual (vs perceived) threat model?
  2. Where are the assets that are valuable to the attacker in my system?
  3. Are we applying security technology and approaches in a balanced way relative to the risks?
  4. What are the consequences of each of the design decisions we are making?

Our reliance on blanket adoption of standards like FIPS 140-2 is, in many respects, a way to make ourselves feel better about not spending the time to answer the first two questions, while the last two questions represent areas where most organizations fall down.

First, let me temper what I am about to say: I still believe FIPS 140-2 and Common Criteria have value, and they are good solutions for what they were designed for, but in many cases they are a round peg in a square hole.

Let’s start this by first understanding the claims and the values of each:

Third-party evaluated – An organization deemed knowledgeable and capable by the government has reviewed the design relative to the stated requirements and found no unresolved issues.

Approved Algorithms – Supports a set of algorithms that the government has decided are necessary for interoperability. The selection of these algorithms by the government is plausibly the result of a rigorous process that determined they are sufficiently secure for its needs. Ex: RSA, ECC with secp256r1, SHA-2, etc.

Uses Approved Algorithms and Methods to Protect Keys – Uses a set of algorithms and approaches the government has decided are sufficient to keep keys of the types specified in approved algorithms secure. Ex: Use crypto and methods at least as strong as the keys being protected.

Production-Grade Components – An attempt to specify a qualitative set of requirements that are intended to ensure there is adequate quality in the solution to be used in production.

Tamper Evidence – Implements mechanisms such as seals and manufacturing techniques that make it visibly obvious that the device has been physically compromised.

Protects Once Compromised – Implements mechanisms that make it difficult to extract the keys from the device once it is physically compromised.

Tamper resistant – Implements mechanisms to destroy the protected keys when a compromise is attempted.

The following table shows you how these traits map across the various FIPS 140-2 assurance levels:

| Trait | Level 1 | Level 2 | Level 3 | Level 4 |
|---|---|---|---|---|
| Third-party evaluated | x | x | x | x |
| Approved Algorithms | x | x | x | x |
| Uses Approved Algorithms and Methods to Protect Keys | x | x | x | x |
| Production-Grade Components | x | x | x | x |
| Tamper Evidence | | x | x | x |
| Protects Once Compromised | | | x | x |
| Tamper resistant | | | | x |

Now, each of these traits is desirable, but each may also have consequences. For example:

Third-party evaluated – These audits take up to a year to prepare for and complete. Due to the specialized nature of the work and the near-monopoly the approved testers have, the tests are incredibly expensive. Additionally, these testing agencies perform their tasks based on guidelines published by governments that are slow to adapt and change and are focused on their own immediate needs, which restricts innovation.

This all becomes very complicated when you need to respond to security issues in short periods of time, and many have come to the conclusion that the bureaucracy associated with completing these audits reduces security.

Approved Algorithms – While I am pleased that NIST runs crypto competitions, in some cases the results are not used, and in others their choices may not be right for you. Additionally, there are questions about some of their decisions and what those decisions mean for the security of the algorithms involved.

In other cases, the requirements may actually hamper adoption of your solution; while the product may be “more secure,” it will not be usable in many cases. A great example: a software-only solution can only be evaluated to FIPS 140-2 Level 1, so if you specify anything higher you may significantly reduce the usability and applicability of your solution.

The important thing to remember is that there are many ways to mitigate a risk, and if we are not careful to take a step back and look at the problem and goals as a whole, we might, as they say, miss the forest for the trees.

For example, if we conclude that we require the use of a FIPS 140-2 Level 4 device, we preclude the un-augmented use of every Windows or ChromeOS computer that has a TPM, when arguably supporting those devices would expose the product to hundreds of millions more users. Is the increased security of that choice worth the cost?

Also, if we look at the Tamper Evidence, Protects Once Compromised, and Tamper resistant goals, we can mitigate these risks significantly if we simply generate new keys every 15 minutes. By doing this, we reduce the risk of compromise to a very small window and reduce the value of the key to the attacker.
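
To make that concrete, here is a minimal Go sketch that generates a fresh key pair and a certificate good for only 15 minutes. The subject name is hypothetical and the certificate is self-signed purely for brevity; a real deployment would have an online issuer sign it instead.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"math/big"
	"os"
	"time"
)

func main() {
	// Fresh key pair for every rotation; nothing long-lived to steal.
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}

	// The certificate is only trusted for 15 minutes.
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(time.Now().UnixNano()),
		Subject:      pkix.Name{CommonName: "service.example.internal"}, // hypothetical name
		NotBefore:    time.Now().Add(-1 * time.Minute),                  // small clock-skew allowance
		NotAfter:     time.Now().Add(15 * time.Minute),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}

	// Self-signed here for brevity; in practice an online issuer would
	// sign a request carrying the public key.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	pem.Encode(os.Stdout, &pem.Block{Type: "CERTIFICATE", Bytes: der})
}
```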

It’s this last approach I think we should, as an industry, apply more now; we no longer live in a world of disconnected systems. We are dynamically deploying services using technologies like Docker, Chef, and Puppet, and there is no reason we cannot deploy our keys to systems and users dynamically as well.

Key management and key lifetime

One of my favorite quotes about cryptography is this one from Bruce Schneier where he says:

“If you think cryptography can solve your problem, then you don’t understand your problem and you don’t understand cryptography.”

The point he is getting at is that oftentimes the introduction of cryptography carries its own baggage, which can itself be a problem to manage. One of the larger issues one is exposed to is that of key management.

Many of the key management practices we use today were actually designed around the concept of offline keys. You see, exchanging keys securely is hard, and it’s human nature to avoid hard things, so we (either explicitly or implicitly) choose to do them infrequently. For example, take a look at TLS private keys: the single most prominent “upgrade” on most CA websites is a longer-lived certificate (as much as 3 years per certificate).

People just don’t want the hassle of getting a new certificate and renewing it. The lifetimes of these certificates are well within the current guidance for cryptographic effectiveness, but there are other factors to consider when looking at cryptoperiods beyond how strong the cryptography is.

The reality is that crypto itself is seldom the direct attack vector; it is application logic, coding defects, and operational practices that prove to be the source of most vulnerabilities.

For this reason, how that key is protected is surely the most important factor. If “anyone” can access a key, encrypting or signing data with that key is nothing more than security theater. And remember, for keys to be used today they must be accessible to application logic, so the key is exposed to the risks of the full software and hardware stack that supports that service. As a result, a key on a system exposed to the internet should be changed more frequently than one in a locked box at a bank.

The key itself doesn’t actually have to be exposed in its raw form, either. For example, if malware can turn the software that has access to the key into a signing oracle, it doesn’t need raw access to the key. This is actually what happened to DigiNotar, the Dutch CA that was compromised: the attacker got into the system that had access to the HSM containing the CA keys and was able to sign virtually anything they wanted.

So what do we do about this? Of course, one needs to build systems using a process that incorporates security into all aspects of product development and operations, but above and beyond that, you really should change your keys as often as possible.

Fundamentally the longer a key is trusted the more valuable it is to an attacker and the more opportunity an attacker has had to compromise that key.

It is this paradigm that necessitates the existence of revocation protocols like OCSP in X.509. The CA/Browser Forum allows these revocation messages to be good for up to a week. This is important to understand because a CA’s ability to effectively revoke a certificate when a compromise is identified is limited by that window. If the CA instead issued certificates that were good for no longer than a week, there would, in essence, be no need for revocation checking at all.

If you can issue certificates that are good for a week and change them reliably each week, why not go shorter? What about certificates and keys that are trusted for only a few hours or minutes? Surely this would be better: it significantly reduces the value to the attacker and increases the amount of trust one can place in that certificate.
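
In practice, systems built around short-lived certificates renew well before expiry rather than depending on revocation at all. The sketch below shows the kind of check a workload or agent might run; the two-thirds threshold is an assumed policy, not a standard.

```go
package main

import (
	"crypto/x509"
	"fmt"
	"time"
)

// needsRenewal reports whether a certificate has used up more than
// two-thirds of its lifetime; the fraction is an assumed policy.
func needsRenewal(cert *x509.Certificate, now time.Time) bool {
	lifetime := cert.NotAfter.Sub(cert.NotBefore)
	elapsed := now.Sub(cert.NotBefore)
	return elapsed > lifetime*2/3
}

func main() {
	// Hypothetical example: a certificate issued 70 minutes ago with a
	// 90-minute lifetime has crossed the two-thirds threshold.
	cert := &x509.Certificate{
		NotBefore: time.Now().Add(-70 * time.Minute),
		NotAfter:  time.Now().Add(20 * time.Minute),
	}
	fmt.Println("renew now?", needsRenewal(cert, time.Now()))
}
```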

The same holds true for certificates stored on smartcards and hardware security modules; the more recently the key was created and the crypto operator authenticated, the more trustworthy the key is.

If that’s the case, why is it we still manage keys like they are on hardened offline systems? The answer is simple: key management is hard. What’s important to understand is that while it is hard, it is doable; we just need the will to do something about it as an industry.

NOTE: Though in my examples above I use certificates as the canonical example, they are just that: examples. The exact same issues exist with all uses of cryptographic keys (encryption keys, Bitcoin wallets, authentication keys, etc.).

Keeping long-term passwords secured

We all know that passwords should be changed regularly to reduce their value to an attacker and that they should be stored in ways that they cannot be easily compromised, which is why people are generally encouraged not to write passwords down.

The reality is that the human brain can only retain so much information and the less often you use something the more likely it is that you will forget it.

This is true regardless of how memorable your password happens to be.

This is especially true for passwords used in key management ceremonies. Imagine being there when the first keys were generated for the first root CA on the Internet; this is a key that will exist for decades, and the implications of losing access to it are huge. Moreover, the passwords involved in these ceremonies do not belong to an individual; they belong to an organization.

For these reasons key management ceremonies use password record forms; I have attached an example form to this post for your reference.

These forms, once filled out, are stored securely, with how securely depending on the security needs of the scenario. For example, if the password was associated with a share in a Shamir Secret Sharing scheme (an M of N set of keys), one would transport and store the shares securely in geographically distributed facilities under lock and key with dual-control procedures.
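
To illustrate the M of N idea itself, here is a minimal Go sketch using the Shamir implementation that ships with HashiCorp Vault; the package path and the passphrase are assumptions for illustration. The recorded password is split into five shares, any three of which are enough to reconstruct it.

```go
package main

import (
	"fmt"

	"github.com/hashicorp/vault/shamir"
)

func main() {
	secret := []byte("ceremony passphrase recorded on the form") // illustrative value

	// Split into 5 shares, any 3 of which are sufficient to recover.
	shares, err := shamir.Split(secret, 5, 3)
	if err != nil {
		panic(err)
	}

	// Each share would be stored in a separate secured facility.
	// Recombining any threshold-sized subset yields the original secret.
	recovered, err := shamir.Combine(shares[:3])
	if err != nil {
		panic(err)
	}
	fmt.Println(string(recovered))
}
```

No individual share reveals anything about the password on its own, which is what makes geographically distributing them under dual control worthwhile.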

Periodically, these stored values are retrieved and changed as part of a process to ensure continued access to systems and keys remains possible.

While not something the average person needs to deal with, it is relevant to those doing paper key management for large amounts of Bitcoin, important DNSSEC keys, or perhaps keys embedded into a mass-produced device.