Tag Archives: Revocation Checking

A look at revocation repository uptime

It is no secret that in the last two months GlobalSign was affected by outages relating to our use of CloudFlare. I won’t go into the specifics behind those outages because the CloudFlare team does a great job of documenting their outages as well as working to make sure the mistakes of the past do not recur. With that said, we have been working closely with CloudFlare to ensure that our services are better isolated from their other customers and to optimize their network for the traffic our services generate.

I should add that I have a ton of faith in the CloudFlare team. These guys are knowledgeable, incredibly hard working and very self-critical, and I consider them great partners.

When looking at these events it is important to look at them holistically; for example, one of the outages was a result of mitigating what has been called the largest publicly announced DDoS in the history of the Internet.

While no downtime is acceptable and I am embarrassed we have had any, it’s also important to look at the positives that come from these events. For one, we have had an opportunity to test our mitigations for such events and improve them so that in the future we can withstand even larger attacks.

Additionally, it’s useful to look at the actual uptime these services have had and, to give those numbers some context, to look at them next to one of our peers. Thankfully I have this data as a result of the revocation report, which tracks performance and uptime from 21 different network perspectives worldwide every minute.

For 05/2012-12/2012 we see:

Service                                             Uptime (%)  Avg (ms)
GlobalSign/AlphaSSL OCSP                            100.00      101.29
VeriSign/Symantec/Thawte/GeoTrust/Trustcenter OCSP  99.92       319.40
GlobalSign/AlphaSSL CRL                             100.00      96.86
VeriSign/Symantec/Thawte/GeoTrust/Trustcenter CRL   99.97       311.42

For 01/2013 to 04/2013 we see:

Service                                             Uptime (%)  Avg (ms)
GlobalSign/AlphaSSL OCSP                            99.98       76.44
VeriSign/Symantec/Thawte/GeoTrust/Trustcenter OCSP  99.85       302.88
GlobalSign/AlphaSSL CRL                             99.98       76.44
VeriSign/Symantec/Thawte/GeoTrust/Trustcenter CRL   99.22       296.97

NOTE: Symantec operates several different infrastructures; which one you hit depends on which brand you buy from and, in some cases, which product you buy. We operate only two brands, which share the same infrastructure. I averaged the results for each of their brands together to create these two tables. If you want to see the independent numbers, see the Excel document linked to this post.

As you can see, no one is perfect. I don’t share this to say our downtime is acceptable, because it is not. Instead I want to make it clear that this is data we track and use to improve our services, and to make it clear what the impact really was.
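To put those percentages in more tangible terms, here is a rough sketch that converts an uptime figure into minutes of downtime. It assumes the second table covers whole calendar months (1 January 2013 up to 1 May 2013); the post does not state the exact boundaries, and the function and variable names are purely illustrative. Because the published figures are rounded and averaged across brands, the results are approximations only.

```python
from datetime import date


def downtime_minutes(uptime_percent: float, start: date, end: date) -> float:
    """Minutes of downtime implied by an uptime percentage over [start, end)."""
    total_minutes = (end - start).days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


# Assumption: the 01/2013 to 04/2013 table covers whole calendar months,
# i.e. 2013-01-01 up to (but not including) 2013-05-01.
PERIOD_START, PERIOD_END = date(2013, 1, 1), date(2013, 5, 1)

# Uptime figures copied from the 01/2013 to 04/2013 table.
REPORTED_UPTIME = {
    "GlobalSign/AlphaSSL OCSP": 99.98,
    "VeriSign/Symantec/Thawte/GeoTrust/Trustcenter OCSP": 99.85,
    "GlobalSign/AlphaSSL CRL": 99.98,
    "VeriSign/Symantec/Thawte/GeoTrust/Trustcenter CRL": 99.22,
}

if __name__ == "__main__":
    for service, uptime in REPORTED_UPTIME.items():
        minutes = downtime_minutes(uptime, PERIOD_START, PERIOD_END)
        print(f"{service}: about {minutes:.0f} minutes of downtime")
```

Under that assumption, 99.98% works out to roughly 35 minutes over the four months, while 99.22% is roughly 22 hours.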

By the way if you want to see the data I used in the above computation you can download these spreadsheets.

Why we built the Revocation Report

For over a year I have been monitoring the industry’s largest OCSP and CRL repositories for performance and uptime. I started this project for a few reasons but to understand them I think it’s appropriate to start with why I joined GlobalSign.

If you’re reading this post you are likely aware of the last few years of attacks against public Certificate Authorities (CAs). Though I am no stranger to this space, like you I was watching it all unfold from the outside, as I was working at Microsoft in the Advertising division where I was responsible for Security Engineering for their platform.

I recall looking at the industry and feeling frustrated about how little had changed in the last decade, feeling like the Internet was evolving around the CA ecosystem – at least technologically. The focus seemed almost exclusively on policies, procedures and auditing, which are of course extremely important when you’re in this business, but by themselves they are not a solution.

When I looked at the CA ecosystem there were a few players who I thought understood this; the one I felt got it the most was GlobalSign. Instead of joining the race to the bottom they were developing solutions to help with key management, certificate lifecycle management, and publishing guides to help customers deploy certificates cost effectively.

As a result, when they approached me with the opportunity to join them as their CTO and set the technology direction for the company, I was intrigued. Those of you who know me know I love data. I believe above all things that successful businesses (whether they recognize it or not) leverage the Define, Measure, Analyze, Improve and Control cycle to ensure they are solving the right problems and doing so effectively.

To that end, when I joined GlobalSign as their CTO I wanted market intelligence on what the status quo was for technology, operating practices and standards compliance, so that I could use it to adjust my own priorities as I planned where GlobalSign was going to focus its investments.

It was more than that though. As many of you know, I am not new to PKI and especially not to revocation technologies, having developed several products / features in this area as well as contributing to the associated standards over the years. I was always frustrated by many public certificate authorities’ inability or unwillingness to acknowledge the inadequacy of their revocation infrastructure and its contribution to slow TLS adoption and bad user agent behavior when it comes to revocation checking.

More directly, the reliability and performance of major CA operational infrastructure was why browsers had to implement what is now called “soft-fail” revocation behavior: treating a failure to check the status of a certificate the same as a successful check. Yet it is these same people who point fingers at the browsers when the security implications of this behavior are discussed.
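The difference between soft-fail and hard-fail can be shown with a minimal sketch. The names below (`RevocationStatus`, `accept_certificate`, `hard_fail`) are invented for illustration and do not correspond to any browser’s actual code.

```python
from enum import Enum


class RevocationStatus(Enum):
    GOOD = "good"                # responder answered: certificate not revoked
    REVOKED = "revoked"          # responder answered: certificate revoked
    UNAVAILABLE = "unavailable"  # responder timed out or returned an error


def accept_certificate(status: RevocationStatus, hard_fail: bool) -> bool:
    """Decide whether to accept a certificate given its revocation check result."""
    if status is RevocationStatus.REVOKED:
        return False
    if status is RevocationStatus.UNAVAILABLE:
        # Soft-fail treats a failed check like a successful one;
        # hard-fail rejects the certificate instead.
        return not hard_fail
    return True
```

With `hard_fail=False`, an attacker who can block the revocation check gets the same result as a clean check, which is why responder reliability matters so much to this debate.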

I wanted to see that change.

This is why, from the very beginning of this project, I shared all the data I had with other CAs. My hope was they would use it to improve their infrastructure, but unfortunately, short of one or two smaller players, no one seemed concerned. I was shouting at the wind.

With the limited feedback I had received for the data I had been collecting I decided to put together what is now the revocation report. As part of this project I switched to a different monitoring provider (Monitis) because it gave me more control of what was being monitored and had a more complete API I could use to get at the data.

In parallel I began to work with CloudFlare to address what I felt was one barrier to optimally using a CDN to distribute OCSP responses (the inability to cache POSTs). The whole time I chronicled my experiences, thoughts and decisions on my blog so that others could learn from my experience and the industry as a whole could benefit.
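To illustrate why POSTs are a barrier: an HTTP cache normally keys on the URL, but every OCSP POST goes to the same URL and the question being asked lives in the request body. The sketch below shows one hypothetical way a cache could key on the body as well. It is not a description of how CloudFlare implemented this; the class and function names are invented, and the caller is assumed to supply an expiry time taken from the response’s validity period.

```python
import hashlib
from typing import Dict, Optional, Tuple


def post_cache_key(url: str, body: bytes) -> str:
    """Build a cache key from the URL plus a digest of the POST body."""
    return f"{url}#{hashlib.sha256(body).hexdigest()}"


class PostResponseCache:
    """Toy in-memory cache for responses to identical POST requests."""

    def __init__(self) -> None:
        # key -> (expiry timestamp, cached response bytes)
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def put(self, url: str, body: bytes, response: bytes, expires_at: float) -> None:
        self._entries[post_cache_key(url, body)] = (expires_at, response)

    def get(self, url: str, body: bytes, now: float) -> Optional[bytes]:
        entry = self._entries.get(post_cache_key(url, body))
        if entry is None:
            return None
        expires_at, response = entry
        # Never serve a response past the expiry the caller supplied.
        return response if now < expires_at else None
```

Two requests for the same certificate produce the same body and therefore hit the same cache entry, while requests for different certificates do not collide even though they share a URL.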

When I set up the Monitis account I licensed the ability to monitor the top responders from 21 locations worldwide every minute. At first I just published the graphical reports that Monitis had but they had a few problems:

  1. They did not perform very well (at the time).
  2. They were not laid out in such a way that you could see all the data at once (also now fixed).
  3. They did not exclude issues associated with Monitis’ monitoring sensors.
  4. They gave no context to the data that was being presented.

This is what led me to work with Eli to build the revocation report we have today. The internet now has a public view into approximately eleven months (and growing) of performance data for revocation repositories. Eli and I are also working on mining and quantizing the data so we can do something similar for responder uptime. This has taken longer than expected due to other priorities, but we will finish it.
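As an illustration of what excluding sensor issues could look like when computing uptime, here is a hypothetical sketch that only counts a minute as downtime when at least two probe locations fail at once. The post does not describe the rule the revocation report actually uses, so the threshold and the data layout here are assumptions.

```python
from typing import Dict, List


def uptime_percent(samples: List[Dict[str, bool]], min_failing: int = 2) -> float:
    """Uptime over per-minute samples, ignoring failures seen by too few probes.

    Each sample maps a probe location to True (check succeeded) or False.
    A minute counts as down only if at least `min_failing` probes failed,
    so a single misbehaving sensor does not register as an outage.
    """
    if not samples:
        raise ValueError("need at least one sample")
    down = 0
    for minute in samples:
        failures = sum(1 for ok in minute.values() if not ok)
        if failures >= min_failing:
            down += 1
    return 100.0 * (len(samples) - down) / len(samples)
```

A stricter or looser threshold changes the reported number, which is one reason raw monitoring graphs and a curated report can disagree.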

So the question at this point is “was the effort worth it?” I think so. Both of us put a lot of time into this project, but I believe it’s been a success for a few reasons:

  1. It allowed me to figure out how to improve our own revocation infrastructure; we now perform at about the same speed as gstatic.google.com for a similarly sized object which is what the bar should be.
  2. Both StartSSL and Entrust have now followed suit and made similar changes to their infrastructure improving their performance by about 3x (besting our performance by a few ms!).
  3. Symantec has improved their primary revocation repository performance by almost 40% and I understand more improvements are on the way.
  4. We are closer to having a data-based argument we can present to browsers about why they can and should re-enable hard-fail revocation checking by default.
  5. It gives customers visibility into the invisible performance hit associated with the decision of who you choose as your certificate provider.

What do you think? Do you find this valuable? Are there any other elements you think we should be tracking?

Revocation checking, Chrome and CRLsets

One of the things I often hear is that Chrome no longer does revocation checking. This isn’t actually true.

All major browsers do some form of revocation checking, that includes Opera, Safari, Chrome, Firefox and Internet Explorer.

Google still does revocation checking; it just does so through a proprietary mechanism called CRLsets.

As the name implies, CRLsets are basically a combination of CRLs: Google crawls the web, gathers CRLs and merges them together into a “mega-CRL”. This mega-CRL is formatted differently than other CRLs, but it’s essentially the same thing. There are some important differences, though, the most important being that, due to size concerns, Google selectively chooses which CAs it includes in the CRLset and, within those CRLs, which revoked certificates to include.

With this understanding you have to wonder: why would Google introduce this new mechanism if it is not as comprehensive as the standards-based ways to deal with revocation checking? The answer is simple: performance and reliability.

With CRLsets Google is distributing the revocation list, and as such they can make sure that it’s delivered quickly. They do this in part by taking a bet that they can intelligently pick which revoked certificates are important (IMHO they cannot – revoked = revoked) and by being the one that distributes the list.

This has implications for users: Chrome trusts certificate authorities for which it has no revocation information, and it also intentionally treats some revoked certificates as good, which exposes you to some risk.
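A toy model makes the coverage gap concrete. The sketch below is not Chrome’s actual CRLset format or code; the issuer names, serial numbers and function names are all made up. It only shows how a merged list built from a chosen subset of CAs yields three outcomes rather than two.

```python
from enum import Enum
from typing import Dict, Set


class SetResult(Enum):
    REVOKED = "revoked"          # serial is listed in the merged set
    NOT_LISTED = "not listed"    # issuer is covered, serial is absent
    NOT_COVERED = "not covered"  # issuer's CRL was never included


def build_set(crls: Dict[str, Set[int]], included: Set[str]) -> Dict[str, Set[int]]:
    """Merge per-issuer CRLs, keeping only the issuers chosen for inclusion."""
    return {issuer: set(serials) for issuer, serials in crls.items() if issuer in included}


def lookup(merged: Dict[str, Set[int]], issuer: str, serial: int) -> SetResult:
    """Check one certificate against the merged set."""
    if issuer not in merged:
        return SetResult.NOT_COVERED
    return SetResult.REVOKED if serial in merged[issuer] else SetResult.NOT_LISTED


# Hypothetical data: two CAs publish CRLs, but only one is selected.
all_crls = {"Public CA": {1001, 1002}, "Enterprise CA": {7}}
merged = build_set(all_crls, included={"Public CA"})
```

Here serial 7 from "Enterprise CA" is revoked according to its own CRL, yet the merged set has nothing to say about it, so a client relying on the set alone would treat it as good.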

This is especially problematic for enterprises that use Chrome and leverage PKI, since there is essentially no chance Google will decide to include your CRL. This is also problematic for those who encounter certificates from those CAs.

That’s not to say CRLsets do not have value; they do, but those values have been discussed elsewhere in detail.

But what do you do if you want a more holistic solution to revocation checking? It’s simple: you can turn on the standards-based revocation checking mechanisms and Chrome will use them in addition to the CRLset. To do that, go to Settings and choose Advanced Settings, where you will see:

[Screenshot of the Chrome Advanced Settings page]

Here you can re-enable the standards-based revocation checking mechanisms so Chrome can do a more holistic job protecting you from the known bad actors on the internet.

Ryan