uptime | UNMITIGATED RISK

The other day I had a great conversation with Robert Duncan over at Netcraft, he showed me some reports they have made public about CRL and OCSP performance and uptime.

One thing that I have been meaning to do is to look at average CRL size across the various CAs in a more formal way I just never got around to doing it; conveniently one of the Netcraft reports though included a column for CRL size. So while I was waiting for a meeting to start I decided to figure out what the average sizes were; I focused my efforts on the same CAs I include in the revocation report, this is what I came up with:

CA	Average CRL Size(K)	CRL Download Time @ 56k (s)
Entrust	512.33	74.95
Verisign	200.04	29.26
GoDaddy	173.79	25.42
Comodo	120.75	17.66
Cybertrust/Verizon	75.00	10.97
DigiCert	21.66	3.17
GlobalSign	21.25	3.11
Certum	20.00	2.93
StartSSL	9.40	1.38
TrendMicro	1.00	0.15

From this we can derive two charts one for size and another for download time at 56k (about 6% of internet users as of 2010):

I overlaid the red line at 10s because that is the timeout that most clients use to indicate when they will give up trying to download, some clients will continue trying in the background so that the next request would have the CRL already cached for the next call.

This threshold is very generous, after all what user is going to hang around for 10 seconds while a CRL is downloaded? This gets worse though the average chain is greater than 3 certificates per chain, two that need to have their status checked :/.

This is one of the reasons we have soft-fail revocation checking, until the Baseline Requirements were published inclusion of OCSP references was not mandatory and not every CA was managing their CRLs to be downloadable within that 10 second threshold.

There are a few ways CAs can manage their CRL sizes, one of the most common is simply roll new intermediate CAs when the CRL size gets unmanageable.

There is something you should understand about the data in the above charts; just because a CRL is published doesn’t mean it represents active certificates – this is one of the reasons I had put of doing this exercise because I wanted to exclude that case by cross-referencing the signing CA with crawler data to see if active certificates were associated with each CRL.

This would exclude the cases where a CA was taken out of operation and all of the associated certificates were revoked as a precautionary exercise – this can happen.

So why did I bother posting this then? It’s just a nice illustration as to why we cannot generally rely on CRLs as a form of revocation checking. In-fact this is very likely why some browsers do not bother trying to download CRLs.

All posts like this should end with a call to action (I need to do better about doing that), in this case I would say it is for CAs to review their revocation practices and how they make certificate status available to ensure it’s available in a fast and reliable manner.

It is no secret that in the last two months GlobalSign was affected by outages at relating to our use of CloudFlare. I won’t go into the specifics behind those outages because the CloudFlare team does a great job of documenting their outages as well as working to make sure the mistakes of the past do not reoccur. With that said we have been working closely with CloudFlare to ensure that our services are better isolated from their other customers and to optimize their network for the traffic our services generate.

I should add that I have a ton of faith in the CloudFlare team, these guys are knowledgeable, incredibly hard working and very self critical — I consider them great partners.

When looking at these events it is important to look at them holistically; for example one of the outages was a result of mitigating what has been called the largest publically announced DDOS in the history of the Internet.

While no downtime is acceptable and I am embarrassed we have had any downtime it’s also important to look at the positives that come from these events, for one we have had an opportunity to test our mitigations for such events and improve them so that in the future we can withstand even larger such attacks.

Additionally it’s also useful to look the actual uptime these services have had and to give those numbers some context look at them next to one of our peers. Thankfully I have this data as a result of the revocation report which tracks performance and uptime from 21 different network worldwide perspectives every minute.

For 05/2012-12/2012 we see:

	Service	Uptime(%)	Avg(ms)
GlobalSign/AlphaSSL	OCSP	100.00	101.29
VeriSign/Symantec/Thawte/GeoTrust/Trustcenter	OCSP	99.92	319.40
GlobalSign/AlphaSSL	CRL	100	96.86
VeriSign/Symantec/Thawte/GeoTrust/Trustcenter	CRL	99.97	311.42

For 01/2013 to 04/2013 we see:

	Service	Uptime(%)	Avg(ms)
GlobalSign/AlphaSSL	OCSP	99.98	76.44
VeriSign/Symantec/Thawte/GeoTrust/Trustcenter	OCSP	99.85	302.88
GlobalSign/AlphaSSL	CRL	99.98	76.44
VeriSign/Symantec/Thawte/GeoTrust/Trustcenter	CRL	99.22	296.97

_NOTE:_{Symantec operates several different infrastructures – which one you hit is dependent on which brand you buy from and some cases which product you buy. We operate only two brands which share the same infrastructure. I averaged the results for each of their brands together to create these two tables. If you want to see the independent numbers see the Excel document linked to this post.}

As you can see no one is perfect; I don’t share this to say our downtime is acceptable because it is not, but instead I want to make it clear this is data we track and use to improve our services and to make it clear what the impact really was.

By the way if you want to see the data I used in the above computation you can download these spreadsheets.

UNMITIGATED RISK

un.mit.i.gat.ed: Adj. Not diminished or moderated in intensity or severity; unrelieved. risk: N. The possibiity of suffering harm or loss; danger.

Tag Archives: uptime

Average CRL size and download time

A look at revocation repository uptime