Tag Archives: short-lived certificates

My response, to his response, to my response? or short-lived certificates part 3

The conversation on short-lived certificates and their value continuesIn the most recent conversation, we have started to shift from an either or position to one where we explore what is needed to make revocation checking a viable technology, this is a topic I am passionate about.

That said, there still seems to be some confusion on short-lived certificates, specifically the author states:

Of course, the cost of short-lived certs is very high as change the whole computing infrastructure so that certificates are renewed on a daily basis (daily for it to be secure enough vs 90 day certificates in my view) and introduce this new moving part that might cause vulnerability and operational issues.

Let’s explore this “cost” argument for a moment. First, when I issue an end-entity certificate I minimally have to perform two cryptographic signatures, one on the certificate and one on the OCSP response (it’s actually more than this in some cases but to keep the conversation simple I have omitted the others).

If we look at the performance optimization that Firefox has implemented where they do not require revocation checking when the certificate is within a subset the resolution of the revocation period short-lived certificates can, in fact, reduce the cost for a CA. This is because you no longer need to sign two things during the first few days/hours of the life of the certificate and you do not need to distribute that response.

The only way I think short lived certificates are more expensive for the CA is if you compare a model of certificates issued for a period better measured in hours to a model with certificate validity measured in years providing weekly revocation updates. This, however, is a bad model, and something in-between is needed.

The author also believes the use of automation represents a security vulnerability, so this deserves a response. It is true that complexity is the enemy of security, it is even true that automation can if poorly implemented can hurt security. The inverse is also true, however.  It is also generally accepted that availability is a component of security and one of the more common problems in the WebPKI is poor manual management practices resulting in the lack of understanding what is deployed and those certificates that are deployed expiring [see 1,2,3, and 4] and taking down services.

It is also important to look at the big picture when evaluating the net-security benefits of automation, for example, does anyone honestly believe we would ever get to a world where the majority of the web is encrypted if organizations have to staff people to manually generate certificate requests and hand carry them to CAs? Is the net-benefit of automation of reliability and scope of deployment worth its secondary effects?

The author also suggests that a certificate must have a validity period of only a day to be “secure enough”. This seems both arbitrary and wrong, as stated previously the User Agents and WebTrust allows an OCSP response or CRL can be a week old and still be trusted.

One of the largest reasons for this is that clock skew is a big problem in the real world and as a result, you need to keep validity periods of certificates and revocation messages outside this skew period to prevent skew related failures.

The decision to define “secure enough” at a day, both defines the problem in an intractable way and furthermore ignores the fact that it establishes a double standard that does nothing to address the issue if stale revocation information.

If we were to bring this conversation back to how we improve certificate revocation I would say there should be one standard for how recent the client’s understanding of the certificate’s validity would need to be.

On that topic, the author goes on to discuss how 32 bytes is better than 470 (the size of the smallest OCSP response). I could not agree more about this, in fact in the 90s’ when I was at Valicert we implemented a proposal from Paul Kocher called Certificate Revocation Trees. This approach uses of Merkle Trees (the heart of the Bitcoin ledger) to provide a very space efficient solution to this problem. Unfortunately, we were unable to popularize this at the time.

Ben Laurie began work on a variation on this approach that leveraged sparse Merkle Trees that he called Revocation Transparency. I personally like the idea of this approach because it leverages the work done to make Certificate Transparency scalable. For example, Trillian, the foundational server for Google’s next generation log server is designed to scale to Trillions of certificates.

That said, there are a number of similar approaches that could be equally scalable.

While I do think that an approach similar to the above could be made to work today, I also think it is more of a long-term solution in that even with the significantly increased rate of technological adoption it would take close to ten years given the state of things for such a solution to be fully deployed if we started right now.

As such I would start with the problem definition, which would need to involve a more formal analysis of the role of revocation checking today so that the right solution was built.

In parallel I would want to see the industry adopt a more strategic plan to address the more practical and immediately solvable problems, including:

  • Measuring and improving the revocation infrastructure operated by CAs,
  • Establishing global performance and reliability metrics and reporting that all CAs must meet,
  • Funding improvements to Nginx and Apache’s OCSP Stapling implementations,
  • Working with browsers to adopt the performance optimization firefox has implemented for revocation checking,
  • Working with TLS stacks, User Agents, Servers and Service Providers to adopt OCSP Must-Staple,
  • Defining an OCSP transport based on DNS that would reduce dependency on CA infrastructure reliability,
  • Evangalizing the adoption of OCSP stapling with administrators.

Ryan

P.S.

The author also has also added in someone else who has asked some questions or more correctly seems to question my version of the historical narrative. To provide some context, my narrative comes from my practical experience working with Microsoft, eBay, Amazon, and other large companies in the mid to late 90s and through the mid-2000s.

I too have worked with the BBN Safekeeper, I have a fun story how we hired some people to extract the keys from one of these boxes I would be happy to share over a beer sometime.

However, a cool device, the first one I remember working with was the KOV-8 in the 1993/4 timeframe.

Anyway, it is true that SSL started its life in 1994/5 at which point only software implementations of crypto were used (they were all BSAFE) but it is also true that mass deployment of SSL (relatively speaking) did not start until the late 90s and early 00s and that is the time my narrative was based on.

He also has also questioned the narrative of what Windows supported in the context of key protection. Since the author knows me personally he must have simply forgotten that I was the PM for these technologies and was at Microsoft working in this area for about 15 years.

Again I think there is some confusion here, the author states:

The software based Cryptographic Service Provider for RSA allowed keys to be marked ‘not for export’ from a very early release if not the first.

and:

The CAPI features used to protect private keys were expanded and exposed as a separate API in Windows 2000 as the Data Protection API.

As someone who worked at Microsoft on these technologies for a long time I can say with absolute confidence they were not built to provide key isolation, do not provide key isolation properties and were actually not used by the SSL implementation (SCHANNEL) for the server keys. If you’re interested in learning more about the capabilities of Windows in this area check out this post I did recently.

He has also questioned the role of ValiCert in the definition of the RFC, thankfully the IETF PKIX archives are still there and if you care to look you can see Mike was basically checked out, Warwick was not publicly paticipating and the work to finalize the protocol was largely done by Ambarish Malpani the founder of Valicert.

Why bother with short-lived certificates and keys in TLS?

There seems to be a lot of confusion and misinformation about the idea of short-lived certificates and keys so I thought I would pen some thoughts about the topic in the hope of providing some clarification.

I have seen some argue the rationale behind short-lived certificates is to address the shortcoming in the CA and browser revocation infrastructure, I would argue this is not the case at all.

In my mind, the main reason for them is to address the issue of key compromise. Long-lived keys have a long period in which they are exposed to theft (see Heartbleed) and therefor are a higher value to an attacker since a stolen key enables the attacker to impersonate the associated website for that period.

The most important thing to keep in mind is that the nature of key compromise is such that you almost never know it has happened until it is too late (consider Diginotar as an extreme example).

The importance of protecting the SSL private key is why in the 90s when SSL deployment started in earnest, large companies tried to deploy SSL in their environment using SSL “accelerators” and “security modules” that protected the keys. This was the “right technical” thing to do not not the right “practical thing” because it significantly reduced the scalability of SSL protected services and at the same time massively increased the cost to deploy SSL.

By the mid 00’s we recognized this was not workable and the use of accelerators basically stopped outside a few edge cases and software keys were used instead. Some implementations, like the Microsoft SCHANNEL implementation, tried to protect the keys by moving them to a separate process mitigating the risk of theft to some degree. Others simply loaded the keys into the web servers process and as a result exposed them to Heartbleed like attacks.

What’s important here is that when this shift happened no one pushed meaningful changes to the maximum validity period of certificates and by association their private keys. This meant that we had software keys exposed to theft for five years (max validity period at the time) with no reliable way to detect they were stolen and in the unlikely event we did find out of the theft we would rely on the unreliable CA revocation infrastructure to communicate that issue to User Agents — clearly this was not an ideal plan.

In short, by significantly reducing the validity of the certificate and keys, we also significantly reduce the value to the attacker.

Another issue that short-lived certificates help mitigate is the evolution of the WebPKI, with a long-lived certificate you get virtually no security benefits of policy and technology improvements until the old certificates and keys are expunged from the ecosystem. Today this is rollout of new policy largely accomplished via “natural expiration” which means you have to wait until the last certificate that was issued under the old policies expires before absolute enforcement is possible.

So what is the ideal certificate validity period then? I don’t think there is a one-size-fits-all answer to that question. The best I can offer is:

As short as possible but no shorter.

With some systems it is not possible to deploy automation and until it is one needs to pick a validity period that is short enough to mitigate the key compromise and policy risks I discussed while long enough to make management practical.

It is probably easier to answer the question what is the shortest validity period we can reliably use. The answer to that question is buried in the clock skew of the relying parties. Chrome recently released a new clock synchronization feature that significantly reduces errors related certificate validity periods. But until that is fully deployed and other UAs adopt similar solutions you are probably best to keep certificate validity periods at 30 days to accommodate skew and potential renewal failures.

In-short, long-lived keys for SSL exist because we never re-visited the threat model of key compromise when we stopped using hardware-protected keys in our SSL deployments and short-lived keys help deal with the modern reality of SSL deployments.

Ryan

– I have added a few pictures for fun and made some minor text changes for clarity.

A look at short lived certificates, keys and the relevance of FIPS 140-2

Today the defacto-standard for purchasing criteria for a cryptographic component is a US Federal Standard called FIPS 140-2. This is set of assurance levels the US Federal Government uses to ensure that government agencies purchase cryptographic products that are interoperable and address threat-specific risks; Europe has similar set of guidelines called Common Criteria.

These standards were adopted by the security industry because in the beginning the only purchasers of their products were government agencies and if you did not design your products to meet these requirements your product wouldn’t even be considered by your only customer segment.

As the security industry began selling outside of government agencies they started with the Fortune 50 because they were the only ones who understood the risks their businesses were exposed to. This was a time when information security was in-essence a new discipline and the only tried and true examples these organizations had to learn from were from the government and military. For this reason the solutions that were sold and deployed were watered down versions of what they sold to governments.

As the awareness of security risks spread to the rest of the corporate world these same foundational standards continued to be used — in many respects without question. In fact I am always surprised how many customers I encounter who have mandated a specific FIPS assurance level be supported by a product that have no understanding of what protection each level provides.

With the Snowden revelations people are now starting to question these standing assumptions. Should we be using cryptography that is specified governments at all? Is our adoption of government approved cryptography making us more secure or is it exposing us to new risks?

The real questions we must be asking ourselves are:

  1. What is the actual (vs perceived) threat model?
  2. Where are the assets that are valuable to the attacker in my system?
  3. Are we applying security technology and approaches in a balanced way relative to the risks?
  4. What are the consequences of each of the design decisions we are making?

Our reliance on blanket adoption of standards like FIPS 140-2 are in many respects a way to make ourselves feel better about not spending the time to answer the first two questions and the last two questions represent areas where most organizations fall down.

First let me temper what I am about to say with I still believe FIPS 140-2 and Common Criteria have value and they are good solutions for what they were designed for but in many cases they are a round peg in a square hole.

Let’s start this by first understanding the claims and the values of each:

Third-party evaluated – An organization deemed knowledgeable and capable by the government has reviewed the design relative to the stated requirements and found no unresolved issues.

Approved Algorithms – Supports a set of algorithms that the government has decided are necessary for interoperability. The selection of these algorithms by the government is plausibly a result of a rigorous process that determined they are sufficiently secure for their needs. Ex: RSA, ECC /w secp256r1, SHA2, etc.

Uses Approved Algorithms and Methods to Protect Keys – Uses a set of algorithms and approaches the government has decided are sufficient to keep keys of the types specified in approved algorithms secure. Ex: Use crypto and methods at least as strong as the keys being protected.

Production-Grade Components – An attempt to specify a qualitative set of requirements that are intended to ensure there is adequate quality in the solution to be used in production.

Tamper Evidence – Implements mechanisms such as seals and manufacturing techniques that make it visibly obvious that the device has been physically compromised.

Protects Once Compromised – Implements mechanisms that make it difficult to extract the keys from the device once it is physically compromised.

Tamper resistant – Implements mechanisms to destroy the protected keys when a compromise is attempted.

The following table shows you how these traits map across the various FIPS 140-2 assurance levels:

Third-party evaluated Approved Algorithms Uses Approved Algorithms and Methods to Protect Keys Production-Grade Components Tamper Evidence Protects Once Compromised Tamper resistant
Level 1 x x x x
Level 2 x x x x x
Level 3 x x x x x x
Level 4 x x x x x x x

Now each of these traits are desirable but they may also have consequences, for example:

Third-party evaluated – These audits take up to a year to prepare for and complete. Due to the specialized nature and near-monopoly the approved testers have the tests are incredibly expensive. Additionally these testing agencies perform their tasks based on guidelines based published by governments who are very slow to adapt and change and focused on their own immediate needs which restricts innovation.

This all becomes very complicated when you need to respond to security issues in short periods of time and many have come to the conclusion the bureaucracy associated with completing these audits reduces security.

Approved Algorithms – While I am pleased with the fact that NIST runs crypto competitions in some cases they are not used and in others their choices may not be right for you. Additionally there are questions about some of their decisions and what they mean to the security of the algorithms they implement.

In other cases  the requirements may actually hamper adoption of your solution and while the product may be “more secure” it will not be usable by in many cases. A great example being it is only possible to have a software only solution that is evaluated to FIPS 140-2 Level 1 so if you specify anything higher you may significantly reduce the usability and applicability of your solution.

The important thing to remember is there are many ways to mitigate a risk and if we are not careful to take a step back and take a look at the problem and goals as a whole we might as they say miss the forest through the trees.

For example if we come to the conclusion that we require the use of a FIPS 140-2 Level 4 device we preclude the un-augmented use of every Windows or ChromeOS computer that has a TPM when arguably that would expose the product to hundreds of millions of more users. Is the increased security of that that choice worth the it?

Also if we look at the Tamper EvidenceProtects Once Compromised and Tamper resistant goals we can mitigate these risks significantly if we simply generate new keys every 15 minutes. By doing this we reduce the risk of compromise to a very small window and we reduce the value of the key to the attacker.

It’s this last approach I think we should as an industry apply more now; we no longer live in a world of disconnected systems. We are dynamically deploying services using technologies like Docker, Chef, and Puppet and there is no reason we can not deploy our keys to systems and users dynamically as well.

Key management and key lifetime

One of my favorite quotes about cryptography is this one from Bruce Schneier where he says:

“If you think cryptography can solve your problem, then you don’t understand your problem and you don’t understand cryptography.”

The point he is getting at is often times the introduction of cryptography carries its own baggage that can itself be a problem to manage. One of the larger issues one is exposed to is that of Key Management.

Many of the key management practices we use today were actually designed around the concepts of offline keys. You see exchanging keys securely is hard and it’s human nature to avoid hard things so we (either explicitly or implicitly) choose to do them infrequently. For example take a look at TLS private keys — The single most prominent “upgrade” on most CA websites is a longer lived certificate (as much as 3 years per certificate).

People just don’t want to hassle with the idea of getting a new certificate and renewing it. The lifetimes of these certificates are well within the current guidance for crypto effectiveness but there are other factors to be considered when looking at cryptoperiods beyond how strong the cryptography is.

The reality is that crypto itself is seldom the direct attack vector it is application logic, coding defects and operational practices that prove to be the source of most vulnerabilities.

For this reason surely how that key is protected is the most important factor. If “anyone” can access a key encrypting or signing data with that key is nothing more than security theater. When you consider that remember today for keys to be used they must be accessible to application logic. The key is exposed to the risks of the full software and hardware stack that runs supports that service. As a result if that system is exposed to the internet it should be changed more frequently than one that is in a locked box at a bank.

The key itself doesn’t actually have to be exposed in its raw form either, for example if malware can turn the software that has access to the key into a signing oracle it doesn’t need raw access to the key — this is actually what happened to DigiNotar, the Dutch CA who was compromised the bad guy got into the system that had access to the HSM containing the CA keys and was able to sign virtually anything they wanted.

So what do we do about this? Of course one needs to build systems using a process that incorporates security into all aspects of product development and operations but above and beyond that you really should change your keys as often as possible.

Fundamentally the longer a key is trusted the more valuable it is to an attacker and the more opportunity an attacker has had to compromise that key.

It is this paradigm that necessitates the existence of revocation protocols like OCSP in X.509. The CABFORUM allows these revocation messages to be good for up-to a week. This is important to understand because a CA’s ability to revoke a certificate effectively in the event a compromise is identified is limited by this. If the CA instead issued certificates that were good for no longer than a week then there would in-essence be no need for revocation checking at all.

If you can issue certificates that are good for a week and change them reliably each week why not do shorter? What about certificates and keys that are trusted for only a few hours or minutes? Surely this would be better. This significantly reduces the value to the attacker and increases the amount of trust one can place in that certificate.

The same holds true for certificates that are stored on Smartcards and Hardware Security Modules; the more recently the key was created and the crypto operator authenticated the more trustworthy they key is.

If that’s the case why is it we still manage keys like they are on hardened offline systems? The answer is simple — Key Management is hard. What’s important to understand that while it is hard it is doable we just need the will to do something about it as an industry.

NOTE: Though in my examples above I use certificates as the canonical example they are just that examples; the exact same issues exist with all uses of cryptographic keys (encryption keys, bitcoin wallets, authentication keys, etc.).