Let’s talk about revocation checking, let’s talk about you and me.

I have been having a conversation with Melhi at Comodo, this is the most recent post in that series.

First of all, the author is unaware that my company has built the most sophisticated Certificate Management system that can automatically request, issue, renew and manage the whole lifecycle of the certificate. Many Fortune 500 companies rely on this amazing technology to manage their PKI infrastructure.

I am aware.

So it is with that expertise and insight I must insist that the author does not appreciate the nuance between “high frequency” renewal vs “low frequency” renewals. Short lived certs will require “high frequency” renewal system. To an IT admin this is a scary prospect! They tell us that!

Having been responsible for and or worked heavly on:

  • The Valicert OCSP responder and clients,
  • The Windows CA and the enrollment client at Microsoft, which is the most used CA software ever,
  • The Network Access Protection solution that used IPSEC and ~12-hour certificates to do segmentation of hosts at the IP layer and their health. based network isolation solution that was deployed into many of the Fortune 500,
  • All of GlobalSign’s technology offerings, in particular, the technology enabling their expansion into the Enterprise,
  • Helping the ISRG build out Let’s Encrypt,
  • Designing, building and operating numerous other high volume products and services in finance, healthcare, and government.

Above and beyond that I helped secure Bing Ads and Live ID and now work at Google on other very high scale systems.

Needless to say, I too have familiarity with “high frequency”,  “low frequency”, “disaster recovery” and “availability” problems.

As for fear, it is a natural part of life, I believe it was Nelson Mandela who famously said: Courage is not the absence of fear but the triumph over it.

The reality is that it is manual certificate lifecycle management that is the thing to be afraid of. That is why COMODO and other CAs have been building out cloud-based management and automation solutions over the last few years. For customers this reduces failure, reduces costs, gives more control, and nets higher customer satisfaction.

Importantly, one needs to remember it is manual processes that lead to the majority of outages [see 1,2,3, and 4]. The scale of this problem has even led to an entire market segment dedicated to managing the lifecycle of WebPKI certificates.

So to the question of fear, I would say the same thing my father told me, though change can be scary and uncomfortable in the short term, it usually turns out okay in the end, and often better.

I believe there can be a scalable revocation infrastructure that can serve status for all certs from all CAs that is backward compatible with existing issued certificates that can be called from a browser. As I said before I do believe in the ingenuity of our scientist and engineers to bring us this solution……soon….

You misunderstand me. I did not say it was impossible to have a scalable revocation infrastructure that is backward compatible. I said the creation of a new system that could deliver on the small message size promise we were discussing would take 10 years to design, build and deploy.

To understand my rational take a look at this post. It is a little old now, and update rates are a faster now but not massively. When you review it you will see that if you exclude IE/Edge it takes just under a year for a new version of the most common browsers to reach 90% market share.

To put that in a more concrete way, if today, Chrome, Firefox, and Safari decided to simultaneously release a new version with a new revocation scheme it would take a bit less than a year for us to see 90% deployment. That is, of course, an unrealistic goal. Apple, for example only releases new versions a few times a year and does not even support WebRTC yet. Additionally, browsers have a pretty deep-seated position on revocation checking at this point given all the problems of the past so convince them will take time.

A more realistic, but still optimistic period assuming pre-existing consensus and a will to solve the problem is 5 to 6 years. If you question this figure, just look at how long it took for TLS 1.2 to get deployed. TLS 1.2 was published in 2008 and was not enabled by default in Windows until Windows 8.1 in 2014.

Web servers are even worse, as an example consider Apache version 2 was released in 2002 and these later versions still have less than 50% deployment.

Lets call it DCSP 😉

DCSP isn’t a bad idea, but it has its own challenges. For example, some that come at the top of mind are:

  • Most browsers do not ship their own DNS clients, this is one of the problems DNSSEC had in deployment. If the operating system DNS APIs they use do not provide the information they need then they can not adopt any technology depend on it.
  • Middleboxes and captive networks make DNS-based distribution problematic. Again DNSSEC suffered from this.
  • If DCSP requires a custom record type, for example, DNS servers and tooling will need to be updated. DNS servers also do not get updated regularly, this is another thing that has held back DNSSEC as well as CAA. It is fair to say that it is fair for the WebPKI CAs to update their DNS but based on the glacial pace in which CAs adopt technology that is 3-4 years before you could see deployment even with that (see CT as an example).

To be clear, I think something like this is worth exploring but I don’t think it will see meaningful deployment in the near term.

If we were in Vegas, I would say the most expedient path is to build on what is there, while in parallel building the new thing. This can shave as much as 5 years off the time it takes to solve the problem. If nothing else this moves the CA ecosystem closer to an operational maturity capable of supporting the new system.

Ryan

My response, to his response, to my response? or short-lived certificates part 3

The conversation on short-lived certificates and their value continuesIn the most recent conversation, we have started to shift from an either or position to one where we explore what is needed to make revocation checking a viable technology, this is a topic I am passionate about.

That said, there still seems to be some confusion on short-lived certificates, specifically the author states:

Of course, the cost of short-lived certs is very high as change the whole computing infrastructure so that certificates are renewed on a daily basis (daily for it to be secure enough vs 90 day certificates in my view) and introduce this new moving part that might cause vulnerability and operational issues.

Let’s explore this “cost” argument for a moment. First, when I issue an end-entity certificate I minimally have to perform two cryptographic signatures, one on the certificate and one on the OCSP response (it’s actually more than this in some cases but to keep the conversation simple I have omitted the others).

If we look at the performance optimization that Firefox has implemented where they do not require revocation checking when the certificate is within a subset the resolution of the revocation period short-lived certificates can, in fact, reduce the cost for a CA. This is because you no longer need to sign two things during the first few days/hours of the life of the certificate and you do not need to distribute that response.

The only way I think short lived certificates are more expensive for the CA is if you compare a model of certificates issued for a period better measured in hours to a model with certificate validity measured in years providing weekly revocation updates. This, however, is a bad model, and something in-between is needed.

The author also believes the use of automation represents a security vulnerability, so this deserves a response. It is true that complexity is the enemy of security, it is even true that automation can if poorly implemented can hurt security. The inverse is also true, however.  It is also generally accepted that availability is a component of security and one of the more common problems in the WebPKI is poor manual management practices resulting in the lack of understanding what is deployed and those certificates that are deployed expiring [see 1,2,3, and 4] and taking down services.

It is also important to look at the big picture when evaluating the net-security benefits of automation, for example, does anyone honestly believe we would ever get to a world where the majority of the web is encrypted if organizations have to staff people to manually generate certificate requests and hand carry them to CAs? Is the net-benefit of automation of reliability and scope of deployment worth its secondary effects?

The author also suggests that a certificate must have a validity period of only a day to be “secure enough”. This seems both arbitrary and wrong, as stated previously the User Agents and WebTrust allows an OCSP response or CRL can be a week old and still be trusted.

One of the largest reasons for this is that clock skew is a big problem in the real world and as a result, you need to keep validity periods of certificates and revocation messages outside this skew period to prevent skew related failures.

The decision to define “secure enough” at a day, both defines the problem in an intractable way and furthermore ignores the fact that it establishes a double standard that does nothing to address the issue if stale revocation information.

If we were to bring this conversation back to how we improve certificate revocation I would say there should be one standard for how recent the client’s understanding of the certificate’s validity would need to be.

On that topic, the author goes on to discuss how 32 bytes is better than 470 (the size of the smallest OCSP response). I could not agree more about this, in fact in the 90s’ when I was at Valicert we implemented a proposal from Paul Kocher called Certificate Revocation Trees. This approach uses of Merkle Trees (the heart of the Bitcoin ledger) to provide a very space efficient solution to this problem. Unfortunately, we were unable to popularize this at the time.

Ben Laurie began work on a variation on this approach that leveraged sparse Merkle Trees that he called Revocation Transparency. I personally like the idea of this approach because it leverages the work done to make Certificate Transparency scalable. For example, Trillian, the foundational server for Google’s next generation log server is designed to scale to Trillions of certificates.

That said, there are a number of similar approaches that could be equally scalable.

While I do think that an approach similar to the above could be made to work today, I also think it is more of a long-term solution in that even with the significantly increased rate of technological adoption it would take close to ten years given the state of things for such a solution to be fully deployed if we started right now.

As such I would start with the problem definition, which would need to involve a more formal analysis of the role of revocation checking today so that the right solution was built.

In parallel I would want to see the industry adopt a more strategic plan to address the more practical and immediately solvable problems, including:

  • Measuring and improving the revocation infrastructure operated by CAs,
  • Establishing global performance and reliability metrics and reporting that all CAs must meet,
  • Funding improvements to Nginx and Apache’s OCSP Stapling implementations,
  • Working with browsers to adopt the performance optimization firefox has implemented for revocation checking,
  • Working with TLS stacks, User Agents, Servers and Service Providers to adopt OCSP Must-Staple,
  • Defining an OCSP transport based on DNS that would reduce dependency on CA infrastructure reliability,
  • Evangalizing the adoption of OCSP stapling with administrators.

Ryan

P.S.

The author also has also added in someone else who has asked some questions or more correctly seems to question my version of the historical narrative. To provide some context, my narrative comes from my practical experience working with Microsoft, eBay, Amazon, and other large companies in the mid to late 90s and through the mid-2000s.

I too have worked with the BBN Safekeeper, I have a fun story how we hired some people to extract the keys from one of these boxes I would be happy to share over a beer sometime.

However, a cool device, the first one I remember working with was the KOV-8 in the 1993/4 timeframe.

Anyway, it is true that SSL started its life in 1994/5 at which point only software implementations of crypto were used (they were all BSAFE) but it is also true that mass deployment of SSL (relatively speaking) did not start until the late 90s and early 00s and that is the time my narrative was based on.

He also has also questioned the narrative of what Windows supported in the context of key protection. Since the author knows me personally he must have simply forgotten that I was the PM for these technologies and was at Microsoft working in this area for about 15 years.

Again I think there is some confusion here, the author states:

The software based Cryptographic Service Provider for RSA allowed keys to be marked ‘not for export’ from a very early release if not the first.

and:

The CAPI features used to protect private keys were expanded and exposed as a separate API in Windows 2000 as the Data Protection API.

As someone who worked at Microsoft on these technologies for a long time I can say with absolute confidence they were not built to provide key isolation, do not provide key isolation properties and were actually not used by the SSL implementation (SCHANNEL) for the server keys. If you’re interested in learning more about the capabilities of Windows in this area check out this post I did recently.

He has also questioned the role of ValiCert in the definition of the RFC, thankfully the IETF PKIX archives are still there and if you care to look you can see Mike was basically checked out, Warwick was not publicly paticipating and the work to finalize the protocol was largely done by Ambarish Malpani the founder of Valicert.

It is my turn, or short-lived certificates part 2

My response to a recent post suggesting short-lived certificates were intended to remove the need to do revocation seems to have spurred a response from its author.

The thesis of the latest response is that:

Revocation can easily handle “key compromise” situation and do so by offering more security than short-lived certs.

The largest problem with this thesis is that it is based on an incorrect understanding. That being the author believes short-lived certificates do not get checked for revocation, or they believe it is a choice between short-lived certificates and revocation.

Both statements are not true. User Agents do not say “oh, this certification is short-lived so we won’t do revocation checking”. If User Agents do revocation checking of end-entity certificates, and not all do, the short-lived certificate will get checked just like the long-lived certificate.

Now there is a reasonable argument to be made that if CAs are allowed to produce revocation lists and OCSP responses every 7 days, and User agents will trust them for that time, that it wouldn’t be an unreasonable performance optimization. At least if the behavior was limited to certificates with validity periods shorter than 7 days. With that said, User Agents don’t do this today and I didn’t suggest they should in my post.

[Update 10:43PM May 4th 2017 It seems a changed happened while I was asleep at the wheel. Firefox has implemented the below optimization as of Firefox 41. A corrected statement follows in the next  two paragraphs]

Both statements are not true, well except for one performance optimization implemented by Firefox I will explain shortly. Basically, if User Agents do revocation checking of end-entity certificates, and not all do, the short-lived certificate will get checked just like the long-lived certificate as long as it is younger than the corresponding CRL or OCSP response would be (Firefox only).

This performance optimization Firefox implemented is based on the fact that  CAs are allowed to rely on 7-day old OCSP responses and CRLS. As a result, 7 days becomes the precision of revocation checking, It is not clear what value Firefox chose but it is a subset of that figure, not 90 days. But either way, no major CA that I am aware of issues such short-lived certs today due to time skew issues.

The second problem with this thesis is it presumes the user who does know of a compromise wants to announce it to the world and can. As I mentioned in an earlier post on revocation reasons subscribers are not keen to announce to the world that there was “a compromise”. I mention “can” because in some cases the one who knows about the comromise do not even have sufficient permissions to the associated CA consoles to request the revocation.

And finally, the last issue in this thesis is that it presumes an effectiveness of revocation checking. Today over 9% of OCSP responses fail due to issues with the CA’s revocation infrastructure (the connections time out).

That’s right 9 out of 100 revocation checks fail because CAs fail to operate capable enough infrastructure to meet the needs of the clients that rely on them. It is actually worse than that though, the largest websites use a technique called Domain Sharding to make their sites load faster, this means that the failure rate you would experience as a user if hard-fail was implemented could be 2-4x higher than that.

This is before we consider the fact that due to the poor performance and failures in CA revocation infrastructure revocation checking has been largely turned off in Chrome and other browsers.

I say this because for “revocation checking” to work for a key compromise case you need two things:

  • The CA to know the compromise occurred,
  • Revocation checking to actually work.

Now I want to be clear, I think revocation checking is a good thing and I would like to see the situation improved, as a proof of that statement here is an example of some of my work in this area:

  • While at ValiCert (the lead creators of OCSP) I worked on the standardization of RFC 2560 (OCSP) including running the interoperability testing that led to its ultimate standardization,
  • I also am the author of RFC 5019 which is the profile of OCSP in use by CAs and clients,
  • I have led the development of numerous PKI SDKs and servers which have implemented these standards (including leading the team that added support for OCSP to Windows),
  • I led the implementation of the first, and most reliable implementation of OCSP stapling (in SCHANNEL),
  • I led the CASC project to get OCSP stapling added Nginx,
  • I have helped create, and/or recreate numerous CAs and along with their revocation infrastructure,
  • And I led the first wide scale measurement and efforts of the CA revocation infrastructure via my X509LABS Revocation Report project which led to all CAs adopting the same design philosophies to begin to address abhorrent response times and uptime.

Basically, I see the value in revocation checking and think the investments need to be made to make it work and be relevant for the WebPKI of today. That said, this topic has zero to do with short-lived certificates.

Ryan

How to keep a secret in Windows

Protecting cryptographic keys is always a balancing act. For the keys to be useful they need to be readily accessible, recoverable and their use needs to be sufficiently performant so their use does not slow your application down. On the other hand, the more accessible and recoverable they are the less secure the keys are.

In Windows, we tried to build a number of systems to help with these problems, the most basic was the Window Data Protection API (DPAPI). It was our answer to the question: “What secret do I use to encrypt a secret”. It can be thought of as a policy or DRM system since as a practical matter it is largely a speed bump for privileged users who want access to the data it protects.

Over the years there have been many tools that leverage the user’s permissions to decrypt DPAPI protected data, one of the most recent was DPAPIPick. Even though I have framed this problem in the context of Windows, Here is a neat paper on this broad problem called “Playing hide and seek with stored keys”.

The next level of key protection offered by Windows is a policy mechanism called “non-exportable keys” this is primarily a consumer of DPAPI. Basically, when you generate the key you ask Windows to deny its export, as a result the key gets a flag set on it that can not, via the API, be changed. The key and this flag are then protected with DPAPI. Even though this is just a policy enforced with a DRM-like system it does serve its purpose, reducing the casual copying of keys.

Again over the years, there have been numerous tools that have leveraged the user’s permissions to access these keys, one of the more recent I remember was called Jailbreak (https://github.com/iSECPartners/jailbreak-Windows). There have also been a lot of wonderful walkthroughs of how these systems work, for example, this nice NCC Group presentation.

The problem with all of the above mechanisms is that they are largely designed to protect keys from their rightful user. In other words, even when these systems work they key usually ends up being loaded into memory in the clear where it is accessible to the user and their applications. This is important to understand since the large majority of applications that use cryptography in Windows do so in the context of the user.

A better solution to protecting keys from the user is putting them behind protocol specific APIs that “remote” the operation to a process in another user space. We would call this process isolation and the best example of this in Windows is SCHANNEL.

SCHANNEL is the TLS implementation in Windows, prior to Windows 2003 the keys used by SCHANNEL were loaded into the memory of the application calling it. In 2003 we moved the cryptographic operations into Local Security Authority Subsystem Service (LSAS) which is essentially RING 0 in Windows.

By moving the keys to this process we help protect, but don’t prevent, them from user mode processes but still enable applications to do TLS sessions. This comes at an expense, you now need to marshal data to and from user mode and LSAS which hurts performance.

[Nasko, a former SCHANNEL developer tells me he believes it was the syncronous nature of SSPI that hurt the perf the most, this is likely, but the net result is the same]

In fact, this change was cited as one of the major reasons IIS was so much slower than Apache for “real workloads” in Windows Server 2003. It is worth noting those of us involved in the decision to make this change surely felt vindicated when Heartbleed occurred.

This solution is not perfect either, again if you are in Ring 0 you can still access the cryptographic keys.

When you want to address this risk you would then remote the cryptographic operation to a dedicated system managed by a set of users that do not include the user. This could be TCP/IP remoted cryptographic service (like Microsoft KeyVault, or Google Cloud Key Manager) or maybe a Hardware Security Module (HSM) or smart card. This has all of the performance problems of basic process isolation but worse because the transition from the user mode to the protected service is even “further” or “bandwidth” constrained (smart cards often run at 115k BPS or slower).

In Windows, for TLS, this is accomplished through providers to CNG, CryptoAPI CSPs, and Smartcard minidrivers. These solutions are usually closed source and some of the security issues I have seen in them over the years are appalling but despite their failings, there is a lot of value in getting keys out of the user space and this is the most effective way of doing that.

These devices also provide, to varying degrees, protection from physical, timing and other more advanced attacks.

Well, that is my summary of the core key protection schemes available in Windows. Most operating systems have similar mechanisms, Windows just has superior documentation and has answers to each of these problems in “one logical place” vs from many different libraries from different authors.

Ryan

Why bother with short-lived certificates and keys in TLS?

There seems to be a lot of confusion and misinformation about the idea of short-lived certificates and keys so I thought I would pen some thoughts about the topic in the hope of providing some clarification.

I have seen some argue the rationale behind short-lived certificates is to address the shortcoming in the CA and browser revocation infrastructure, I would argue this is not the case at all.

In my mind, the main reason for them is to address the issue of key compromise. Long-lived keys have a long period in which they are exposed to theft (see Heartbleed) and therefor are a higher value to an attacker since a stolen key enables the attacker to impersonate the associated website for that period.

The most important thing to keep in mind is that the nature of key compromise is such that you almost never know it has happened until it is too late (consider Diginotar as an extreme example).

The importance of protecting the SSL private key is why in the 90s when SSL deployment started in earnest, large companies tried to deploy SSL in their environment using SSL “accelerators” and “security modules” that protected the keys. This was the “right technical” thing to do not not the right “practical thing” because it significantly reduced the scalability of SSL protected services and at the same time massively increased the cost to deploy SSL.

By the mid 00’s we recognized this was not workable and the use of accelerators basically stopped outside a few edge cases and software keys were used instead. Some implementations, like the Microsoft SCHANNEL implementation, tried to protect the keys by moving them to a separate process mitigating the risk of theft to some degree. Others simply loaded the keys into the web servers process and as a result exposed them to Heartbleed like attacks.

What’s important here is that when this shift happened no one pushed meaningful changes to the maximum validity period of certificates and by association their private keys. This meant that we had software keys exposed to theft for five years (max validity period at the time) with no reliable way to detect they were stolen and in the unlikely event we did find out of the theft we would rely on the unreliable CA revocation infrastructure to communicate that issue to User Agents — clearly this was not an ideal plan.

In short, by significantly reducing the validity of the certificate and keys, we also significantly reduce the value to the attacker.

Another issue that short-lived certificates help mitigate is the evolution of the WebPKI, with a long-lived certificate you get virtually no security benefits of policy and technology improvements until the old certificates and keys are expunged from the ecosystem. Today this is rollout of new policy largely accomplished via “natural expiration” which means you have to wait until the last certificate that was issued under the old policies expires before absolute enforcement is possible.

So what is the ideal certificate validity period then? I don’t think there is a one-size-fits-all answer to that question. The best I can offer is:

As short as possible but no shorter.

With some systems it is not possible to deploy automation and until it is one needs to pick a validity period that is short enough to mitigate the key compromise and policy risks I discussed while long enough to make management practical.

It is probably easier to answer the question what is the shortest validity period we can reliably use. The answer to that question is buried in the clock skew of the relying parties. Chrome recently released a new clock synchronization feature that significantly reduces errors related certificate validity periods. But until that is fully deployed and other UAs adopt similar solutions you are probably best to keep certificate validity periods at 30 days to accommodate skew and potential renewal failures.

In-short, long-lived keys for SSL exist because we never re-visited the threat model of key compromise when we stopped using hardware-protected keys in our SSL deployments and short-lived keys help deal with the modern reality of SSL deployments.

Ryan

– I have added a few pictures for fun and made some minor text changes for clarity.

Revocation reasons don’t make sense for the WebPKI of today

As the old saying goes, to err is human but to forgive is divine, in WebPKI you deal with errors through the concept of revocation.

In the event a CA determines it made a mistake in the issuance of a certificate or has been notified by the subscriber they would like to see a certificate marked invalid, they revoke the corresponding certificate.

The two mechanisms they have available to them to do this are called Certificate Revocation Lists and OCSP responses. The first is a like a phonebook of all known “revoked” certificates while the last is more like a lookup that it enables User Agents to ask the status of a particular set of certificates.

In both cases they have the option to indicate why the certificate was revoked, the available options include:

Reason Description
Key Compromise We believe the key to have been stolen or can no longer be sure it has not been.
CA Compromise The issuing CA itself has been compromised (think DigiNotar).
Affiliation Changed The certificate contained affiliation information, for example, it may have been an EV certificate and the associated business is no longer owned by the same entity.
Privilege Withdrawn The right to represent the given host was revoked for some reason.

There are a few other reasons but they are not materially relevant to WebPKI use cases so I left them out.

CA Compromise is clearly a special and exceptional case, so much so that browsers have developed their own mechanism for dealing with it (see OneCRL and CRLSets) so it’s not particularly relevant to the options available to CAs anymore, at least in the WebPKI.

Another thing to consider when looking at revocation reasons is that about 90% of all certificates in used in the WebPKI (for TLS) are considered to be Domain Validated (DV) certificates, this limits the applicability of the remaining reasons substantially.

DV OV EV
Key Compromise Yes Yes Yes
Affiliation Changed Maybe Yes Yes
Privilege Withdrawn Maybe Yes Yes


As you can see, the available “reasons” for DV certificates are largely reduced to just “Key Compromise”. There is an argument that both “Privilege Withdrawn” and “Affiliation Changed” could be applicable to DV but in my opinion, they are a bit of a stretch. If nothing else the difference between these two reasons in the case of DV is so subtle you will almost never see these two usages be used on a DV certificate.

So what about Key Compromise?  How useful is it? Clearly, it is an important use case, your trust in an SSL certificate is frankly only as good as your confidence in how well the key is protected!

The reality is that most key compromise cases can not be detected and worse yet is that it’s far too easy to compromise an SSL key. Take for example the Heartbleed (http://heartbleed.com/) bug from last year, SSL keys are often kept in the process and relatively minor coding mistakes, as a result, can potentially expose that key and users won’t necessarily know about it.

Some platforms go out of their way to prevent such attacks at the expense of performance like we did at Microsoft with SCHANNEL where we keept the key out of process but this is the exception and not the rule.

Also popping a web server due to some third-party code vulnerability and getting to the shell in the context of the server is unfortunately far too easy. As a result, when an attacker does get a shell on a web server one of the first things they should be doing is taking that key so they can use it later, but again, how is this detected?

That is not to say “Key Compromise” as a revocation reason is not important when there is a large-scale vulnerability like Hearbleed but this is not exactly a common case.

The other cases for the “Key Compromise” revocation reason are notably dependent on the subscriber:

  • Knowing the compromise happened,
  • Being mindful enough to notify the CA,
  • Being willing to let the world know there was a compromise.

This last point of public signaling of “compromise” combined with the fact user agents treat all revocations the same (if they check at all) means that we seldom see subscribers even specify a reason when they ask for a certificate to be revoked.

As I look at why a certificate gets revoked today I think there are a few cases, these include:

  • CA has decided the certificate should be revoked,
  • The subscriber has decided the certificate should be revoked,
  • A root program has decided a certificate should be revoked.

These are far more meaningful signals to relying parties that the ones listed above and are not so specific that these is any negative signaling in their use.

Today it is not possible to specify these things but maybe we should take a second look at revocation reasons and if we continue to use them make them more relevant to the way the WebPKI works today.

Ryan

Secondary effects when encrypting the web

Secondary effects

Effects are often classified as having primary and secondary impacts. As we evolve the Internet we can see numerous examples of secondary effects, this is especially true as we look at how we evolve the way we secure it.

To put this in perspective consider the amazing milestone we have just hit, as of 2017 over half of the web is encrypted. Browsers will now be able to gradually switch from showing the “lock” as an exceptional and positive indicator to showing an error when a session is not encrypted. In fact, Chrome has already started this transition and as soon as October 2017 we will start to see it rollout.

I can not overstate how huge a change this is. For the last 23 years, SSL has been the exception and not the norm, as a result the best we could do as software developers was throw site developers a few pixels (the lock) for going out of their way to encrypt their sites. Now, we can do what we always wanted, warn the user when there is something to be concerned about.

It took a lot to get us to this point, some of which included:

  • Competition driving the cost of a certificate from thousands of dollars to as little as $0,
  • Evolution and standardization of certificate formats and of the TLS protocol,
  • Maturation and improving performance of TLS implementations,
  • Increase in average computational power so the overhead was negligible,
  • The definition and evolution of standardized practices and audits for certificate authorities,
  • The definition and standardization of a certificate enrollment protocol suitable for TLS use cases,
  • Formalization of user agent root programs and criteria in which CAs are managed to,
  • Establishment of technologies to effectively monitor and manage the CA ecosystem,
  • Migration to SaaS and the Cloud services where management of deployments can be handled in a seamless way,
  • And of course, education tooling to help the operators deploy TLS correctly.

While we are not done with this journey we have made it a long way and the end is in sight.

Each of these changes had secondary effects, even if small, but let’s look at the meta change to get a better understanding of what the consequences of change on the TLS ecosystem were.

Nobody ever did, or ever will escape the consequences of his choices. – Alfred A. Montapert

One obvious one that immediately comes to mind is that we have a generation of non-technical people who believe “the lock means it is secure” when it never really meant that.

As they say, correlation is not the same as causation, I would argue this is the reason the “Lock” myth came to be. That is because in the beginning it literally did cost thousands of dollars to get an SSL certificate, and if you were committed enough to sign up for that there was a good chance you were a big company who was at least thinking about security. This was especially true because those companies were also typically existing brick and mortar companies with reputations to protect.

From this myth, evolved another belief, that being that phishers don’t use SSL. This was true in the beginning, there were a few reasons for this, of course, the cost was one but it was also cumbersome and phishing is a numbers game so if users don’t notice the lock then why bother?

A few things have changed, through education more users today are aware of encryption and absence of the “lock” means more today than it did then. Additionally, the process of getting a certificate has been reduced to a task that can literally take about a minute and then of course cost has reduced to the point it’s inconsequential.

These things basically invalidate the once true assumption that phishers don’t bother to use SSL. Now, thankfully after the creation of SSL, we got technologies like SafeBrowsing and SmartScreen which are actually designed for, and are effective at, addressing this issue. Also with the lock disappearing over the next few years the “credibility” boost of the SSL lock will go away diminishing that value to the phisher. But is this true for other secondary effects?

It is worth noting that another secondary effect we will surely see is confusion from users as this change roles out, even though it is unquestionably better to warn users when they are at risk.

Let’s look at a larger example of a secondary effect of the move to encryption, specifically I believe BGP is a good example. The other day Dan Goodin over at ARS did a great article on a potentially malicious BGP rerouting event. In this article, he talked about how the attacker could have used the Logjam or DROWN attacks to attempt an MITM and that is true, but it’s not the only way. It no longer takes hours to get a certificate, you can get Domain Validation (DV) SSL certificate from many different providers in a matter of minutes now, a decade ago, or even just five years ago that could have taken hours.

This is important because to get an SSL certificate all you have to do is prove you control the associated host and one of the ways you can do that is by dropping a file with a pre-shared value in it on the host. If you can get the network traffic redirected to you for seven minutes like happened in this BGP rerouting case you can also get an SSL certificate for a host in the associated network segments.

In the past, the cumbersome and slow process involved in getting an SSL certificate made that practically impossible, but today it is not. This means that the broad use of SSL has, as a secondary effect, pushed the attacker down a level to BGP as a means to achieve its goal.

As a result, we will surely see BGP get more attention from attackers and eventually shore up its security so it is not as susceptible to these rerouting attacks. I also wouldn’t be surprised if we see the WebPKI evolve to make it slightly harder for this sort of attack to take place (by prohibiting certain types of validations) but that’s the game of evolution.

My friend Cem has a saying, security is often an exercise of re-arranging deck chairs on a sinking ship.

As we move closer to an encrypted web sometimes I am reminded of this, no matter what I am sure we will find more of these secondary effects which will give us an opportunity to focus on shoring up these other systems, just like we have with TLS.

Ryan

P.S. Thanks to Josh at ISRG and Andrew Whalley in the Chrome team at Google for some related conversations that helped me solidify my thoughts on this topic. And also thanks to Vincent Lynch for his review of the post.

Digital Signatures and the fallacy of Good, Cheap and Fast.

It is common to hear you can choose two of the following, “Good”, “Cheap” or “Fast”. While there is clearly some truth to this, it is not an absolute truth.

This is especially true in the context of Digital Signatures in the United States where the law allows for parties to agree to rely on virtually any mechanism to capture an agreement.  The rationale behind this is that in the event of a dispute it is the claimant in the dispute who will need to prove consent, as such, it is the parties of an agreement who must assume the burden of choice.

This introduces economic incentives to the decision-making process, specifically, it encourages you to consider which digital signature solution to apply to a given problem. This is an inherently good goal because over-specification in regulations can create market forces that result in the wrong solution being applied to a given business problem.

This flexibility also has downsides, specifically when people look to build a product to enable digital signatures this flexibility diminishes the economic incentives for them to build the most secure solution. Instead, their goal becomes to build the solution they think they can sell with minimal investment.

This, in theory, should still encourage providers to build a strong solution in that, if documents signed with their product end up being thrown out in court their business will ultimately fail. The problem with this motivation is that it can take a decade or more for this to become a concern that can hurt their business. As such, they double down on the creation of a solution that can be taken to market quickly.

It is easy to see this play out in the US Digital Signature market, the large majority of solutions do little more than produce an image that looks like “a signature” and then embeds it into a document. In the European Union, they call this a “Simple Signature”. To produce a fraudulent signature with such a scheme you typically need to select the signature with your mouse, right-click, select copy and then paste the image into the fraudulent document. These solutions are typically very easy to use and cheap to build but is clearly not a technique you would want to use on an agreement that might need to be enforced in a court of law.

Some solutions try to make these Simple Signatures stronger by taking a hash of the document with the signature in it and placing it in a database on their servers. This is done so that in the event of a dispute you can ask the service provider to go to court with you and act as an expert, stand by your side, and say “the hash is the same one we saw when they signed the document”. This is better than the pure image approach but not by much. This is true if for no other reason than they are unlikely to assign legal support to every contract dispute that happens involving contracts signed with their service. Additionally, if the legitimacy of the contract is put into question and the site has had any security incidents at all since it was signed, the claims they make would be questionable at best.

Other Simple Signature solutions try to mitigate these issues by applying a cryptographic notarization to the document after the basic Simple Signature was applied. This adds the concept of a trustworthy notarization of when the document was signed and to some degree, makes the document capable of standing on its own. Now if the original electronic document was produced with standards-based software anyone can confirm it has not been tampered with since it was produced. This helps you reduce the legal exposure relating to a compromise of the signing service which certainly helps with the enforceability but it does not eliminate this risk. This is because a slightly more difficult compromise of the service could still result in a bad actor producing a false document.

The next step up in enforceability is generally referred to as an “Advanced Signature”. These Advanced Signatures almost always start as a “Simple Signature” but they will also include signing the document with a cryptographic key under the sole control of the signer. This provides assurances that it was the user, and not the digital signature service, that signed the document.

Practically speaking the final step in enforceability when it comes to digital signatures is referred to as a “Qualified Signature”. This builds on the previous two types and of signatures and adds in two requirements. The first of which is that the key to being managed on a specialized cryptographic appliance such as a smart card or hardware security module. The rationale for this is that signing keys are typically long-lived and keys stored in software are exposed to theft. This is important because if an attacker steals one of these keys then they can produce fraudulently signed documents with it. The second requirement is that a the legal identity of the signer needs to be verified and included in the digital signature. This is in contrast to the other types of signatures which in essence only require the signer to be uniquely identified, in other words, something as basic as an email ping could be considered sufficient.

What has happened in the US is that since the law does not require anything stronger than a Simple Signature outside of areas like healthcare and finance where there are either regulatory requirements or business risks that justify a better solution we see only the Simple Signatures in use.

The vendors of these products would probably tell you they don’t do these more enforceable solutions because of you have to choose two of the following Fast, Cheap or Secure.

In the European Union, you have had the opposite problem, the economic incentives of the vendors and the associated regulatory frameworks have sent their market down the path of trying to apply Qualified Signatures to nearly all transactions. This reality has hampered the adoption of electronic signatures limiting their use to the highest value transactions due to the associated onboarding and use time friction this approach typically entails.

I believe this is an example of market failure and that it is possible to build a solution that scales across a businesses enforceability needs. One of the key reasons the market has not yet delivered such a solution is that the economic incentives for the vendors are not well aligned with those of their customers.

Tides are changing in both the US and in the European Union and I believe we will see a convergence of the best of both solutions in the coming years but as of yet, such a solution has not found its way to the market.

CAs and SSL and Phishing Oh My!

 

 

 

 

 

 

NOTE: This post reflects my personal beliefs and is not necessarily those of my employer Google, or Let’s Encrypt where I am a member of their Technical Advisory Board.

Introduction

Recently Vincent from The SSL Store published a blog post calling out Let’s Encrypt for issuing certificates to domains that contain the world PayPal.

The TL;DR for his post is he believes that Let’s Encrypt is enabling phishers by issuing them SSL certificates that contain the word “PayPal” and then refusing to revoke them when arbitrary third-parties ask them to.

As a result of his post, several news sources have decided to write articles about how “Let’s Encrypt” is acting as an enabler of these Phishers [1] [2].

Unfortunately, Vincent’s post and the associated articles don’t cover this in the most complete and balanced way so over my morning coffee today I decided to write this post to discuss the other side of the argument.

If this is a topic that interests you please also check out the Let’s Encrypt blog post where they talk about why they have taken this position.

Exploration

Let’s explore the opportunities CAs have to check for phishing, the tools they have available to them, the effectiveness of those tools, the consequences of this approach, how complete a solution based on the tools available to them would be and what the resulting experience would be for users.

Opportunities

The WebPKI’s CAs role, historically, has been that of a Passport office, you present proof you control a domain, and possibly that you are an authorized member of an organization and you get a digital certificate that attests to that.

This certificate could be valid for up to 1095 days. Once the certificate is issued the CA, largely speaking, has no natural opportunity to verify this information again. It is worth noting that this month the CABForum voted to shorten this period to 825 days.

Tools

In the event a CA determines it made a mistake in the issuance of a certificate or has been notified by the subscriber they would like to see a certificate marked invalid, the tool they have available to them is called “revocation”.

The two types of revocation that are under the control of a Certificate Authority are called Certificate Revocation Lists and OCSP responses. The first is a like a phonebook of all known “revoked” certificates while the last is more like a lookup that it enables User Agents to ask the status of a particular set of certificates.

Effectiveness

Earlier we discussed the lifetime of certificates, this is important to understand because the large majority of phishing sites do not start out as Phishing sites, as such issuance time checks seldom net positive results.

After issuance, this leaves you with periodic checks of the site,  third-party reports of phishing and relying on revocation checking as an enforcement mechanism. This is a recipie for failure, there are a few reasons for this, but one of the more significant is the general ineffectiveness of revocation checking.

Revocation checking is the most taxing thing a CA does. This is because the revocation mechanisms available to them will result in every relying party contacting them to download a OCSP response or CRL covering that certificate.

As a result, OCSP has a tendency to be both slow and unreliable. This forced browsers to implement this check as a “soft fail”, in other words, if the connection times out or fails for some reason they assume the certificate as good.

To give that some context about 8% of all revocation checks done by Firefox fail and the median response time is over 200ms.

As a result of this in 2012 Chrome, which is used by about 50% of all users, more-or-less disabled revocation checking except for exceptional circumstances.

What this means is that revocation checking, even for its intended purpose, is far from an effective tool. Expanding its use to include protecting users from phishers would not improve its effectiveness and arguably it would (due to the infrastructure implications) make it even less reliable.

It is also important to note that every wildcard certificate can be used for a hostname containing “PayPal” without the CA ever being made aware, a good example is https://paypal.github.io/ which is protected by a wildcard certificate issued to Github.

Consequences

To understand the consequences of expanding the CAs role include protecting us from phishing we first need to understand what a certificate represents, or more importantly what it does not represent. It does not represent the content, it represents the host that is serving content and it is the content that “phishes”.

Today, in the age of cloud services, there is a good chance the host that is serving the content is a service operated by WordPress, or maybe Amazon’s S3. These services allow users to sign up and post arbitrary content for free or very little money.

If we decide that revocation checking is the right tool to get phishing content off the web we would be saying a CA should revoke WordPress’s certificate if one of it’s users posted something someone reported as phishing content. That would, for the situations where revocation checking takes place and happens to work, take WordPress off the Internet. Is that what we want to happen?

If so, who is it we are asking to perform this task? There are well over 400 CAs in the Microsoft Root Program do we believe these are the right organizations to be policing the internet for the appropriateness of content?

If so what criteria should they use to do so and what do we do if they abuse this censorship role?

Completeness

It is easy to say that a CA should not issue a certificate if it contains the word “PayPal”. I could even see an argument that those that would be hurt by such a rule, for example, http://www.PayPalSucks.com and (a theoretical) PayPalantir.com are an acceptable loss.

This would, however not catch homoglyphs like when a Cyrillic “a” is used instead of the latin “a” which would very likley require a manual review of the name and content to determine the intent of the domain owner which is near impossible to do with any level of accuracy or fairness.

Even with that, what about ING, as one of the world’s largest banks, they too are commonly phished, should a CA be able to issue a certificate to https://www.fishing.com. And if they do and the issuing CA receives a complaint that it is Phishing ING what should they do?

And what about global markets and languages? In Romania there is a company called Amazon that is a cleaning company, should anyone be able to request their website be revoked because it contains the word Amazon?

If we promote the CA to content police, how do we do so in a complete way?

User Experience

With CAs acting as the content police, what would a user see when they encounter a revoked site? While it varies browser to browser the experience is almost always a blocking “interstitial”, for example:

 chrome revoked firefox revoked


If you look closely you will see these are not screens that you can bypass, revoked sites are effectively removed from the internet.

This is in contrast to Safe Browsing and Smartscreen which were designed for this particular problem set and therefore provide the user a chance to visit the site after a contextually relevant warning:

 SafeBrowsing smartscreen

Conclusion

I hope you see from the above that relying on Certificate Authorities as content police as a means to protect users from phishers a bad idea, at a minimum, it would be:

  • Ineffective,
  • Incomplete,
  • Unmanageable,
  • and Duplicative.

But more importantly it would be establishing a large loosely managed group as the de-facto content censors on the internet and as Steven Spielberg said, there is a fine line between censorship, good taste, and moral responsibility.

So what should CAs do about phishing then? It is my position they should check the Google Safe Browsing API prior to issuance (which by the way, Let’s Encrypt does), and they should report Phishers to the Safe Browsing service if they encounter any.

It is also important to answer the question about what users should do to protect themselves from phishing. I understand the desire to say there is only one indicator they need to be worried about, it’s just not realistic.

When I talk to regular users I tell them to do three things, the first of which is to use an up-to-date and modern browser that uses Smart Screen or Safe Browsing. Second, you should only provide data to sites you know and only over SSL. And finally, try to only provide sites information when it was you initiated the exchange of information.

P.S.

Thanks To Vincent Lynch and the others who were kind enough to proof this post before publishing.

How do you show someone they are important to you?

Showing someone they are important to you is not always straightforward. Hackers, on the other hand, seem to have figured this out.

Over the years I have designed, built and operated a number of WebPKI Certificate Authorities. In these roles, one thing that has been common is that hackers are always willing to use their finest exploits for a chance at the keys that would let them impersonate any site on the Internet.

That is why it’s not uncommon for an attacker to spend a zero-day on a WebPKI CA, and when they do I would argue that is their way of showing you they think you are important.

The black market for high-quality remote exploits that have never seen the light of day is not exactly small. Exploits like this are essentially a rapidly depreciating asset, once they become known the clock starts ticking and their value decreases until it hits zero.

In a past role, over a period of about three years, we had at least three `zero-days` spent on us, and I was flattered.

Attackers are no different than everyone else, they prefer the easy path so they often go after oldest and cruftiest stuff you are dependent on, and Adobe Acrobat was one such component for us.

Acrobat has had nearly 800 CVEs in the last 18 years. Each CVE often covers many related vectors but if we are generous and say one CVE equals one exploit it averages over 40 a year. Now not all of those issues are of the same bar, but it’s one of the most attacked pieces of software out there.

What got me thinking about this was a great post on mitigation vs isolation and the different strategies used by Chrome and Edge in their security program.

While you clearly need to employ both strategies, perfection is impossible to achieve which means isolation and least privilege ends up being one of the most powerful tools you have in your mitigation toolbox.

This is one of the reasons I like projects like the PDF.js project, Electron and Progressive Web Applications, they inherently get the benefit of browser isolation and mitigation.

The trend to use web applications like these make it harder on the attackers and that’s a good thing.