
Average CRL size and download time

The other day I had a great conversation with Robert Duncan over at Netcraft; he showed me some reports they have made public about CRL and OCSP performance and uptime.

One thing I have been meaning to do is look at average CRL size across the various CAs in a more formal way; I just never got around to it. Conveniently, one of the Netcraft reports included a column for CRL size, so while I was waiting for a meeting to start I decided to figure out what the average sizes were. I focused my efforts on the same CAs I include in the revocation report; this is what I came up with:

 

CA                   Average CRL Size (KB)   Download Time @ 56k (s)
Entrust              512.33                  74.95
Verisign             200.04                  29.26
GoDaddy              173.79                  25.42
Comodo               120.75                  17.66
Cybertrust/Verizon    75.00                  10.97
DigiCert              21.66                   3.17
GlobalSign            21.25                   3.11
Certum                20.00                   2.93
StartSSL               9.40                   1.38
TrendMicro             1.00                   0.15
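
If you want to sanity-check the download-time column, the math is simple; here is a minimal Python sketch of the calculation (assuming an ideal 56,000 bit/s link and 1 KB = 1024 bytes):

```python
# Estimate CRL download time over a 56k modem.
# Assumes 1 KB = 1024 bytes and an ideal 56,000 bit/s link;
# real-world throughput would be somewhat lower.
avg_crl_sizes_kb = {
    "Entrust": 512.33,
    "Verisign": 200.04,
    "GoDaddy": 173.79,
    "DigiCert": 21.66,
}

LINK_BPS = 56_000  # 56k modem, in bits per second

for ca, size_kb in avg_crl_sizes_kb.items():
    seconds = (size_kb * 1024 * 8) / LINK_BPS
    print(f"{ca:>10}: {seconds:5.2f}s")
```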

 

From this we can derive two charts, one for size and another for download time at 56k (a connection speed still used by about 6% of internet users as of 2010):

[Charts: average CRL size by CA, and CRL download time at 56k by CA]

 

I overlaid the red line at 10s because that is the timeout most clients use before they give up trying to download; some clients will continue trying in the background so that the CRL is already cached for the next request.

This threshold is very generous; after all, what user is going to hang around for 10 seconds while a CRL is downloaded? It gets worse when you consider that the average chain contains more than three certificates, meaning at least two of them need to have their status checked :/

This is one of the reasons we have soft-fail revocation checking; until the Baseline Requirements were published, inclusion of OCSP references was not mandatory, and not every CA was managing their CRLs to be downloadable within that 10-second threshold.
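
To make that threshold concrete, here is a rough sketch of what a soft-fail client effectively does; the CRL URL below is a placeholder, not a real distribution point:

```python
# Fetch a CRL the way a soft-fail client effectively does: give up
# after 10 seconds and treat the certificate as good anyway.
import time
import urllib.request

CRL_URL = "http://crl.example-ca.com/root.crl"  # hypothetical URL

start = time.monotonic()
try:
    with urllib.request.urlopen(CRL_URL, timeout=10) as resp:
        crl_bytes = resp.read()
    elapsed = time.monotonic() - start
    print(f"Downloaded {len(crl_bytes) / 1024:.1f} KB in {elapsed:.2f}s")
except Exception:
    # Soft-fail: the check is abandoned and the cert is treated as valid.
    print("CRL fetch failed or timed out; proceeding anyway (soft-fail)")
```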

There are a few ways CAs can manage their CRL sizes; one of the most common is to simply roll new intermediate CAs when the CRL size becomes unmanageable.

There is something you should understand about the data in the above charts: just because a CRL is published doesn't mean it represents active certificates. This is one of the reasons I had put off doing this exercise; I wanted to exclude that case by cross-referencing the signing CA with crawler data to see if active certificates were associated with each CRL.

This would exclude the cases where a CA was taken out of operation and all of the associated certificates were revoked as a precautionary exercise – this can happen.

So why did I bother posting this then? It's simply a nice illustration of why we cannot generally rely on CRLs as a form of revocation checking. In fact, this is very likely why some browsers do not bother trying to download CRLs at all.

All posts like this should end with a call to action (I need to do better about that); in this case it is for CAs to review their revocation practices and how they make certificate status available, to ensure it's available in a fast and reliable manner.

How not to collect sensitive information

So I was chatting with a friend today about the recent Register article about TeliaSonera's application to add a new root into the Mozilla root program.

I could not recall ever visiting a site that used a certificate from TeliaSonera, so I started looking for one in my web crawler data; a few moments later my friend pointed me at Telia's own home page, where he noticed they were collecting user IDs and passwords via a CSS layer not served over SSL:

[Screenshot: Telia home page login form, not served over SSL]

More interestingly, they also accept your Social Security Number as an alternative to your user ID:

[Screenshot: Telia login form accepting a Social Security Number in place of a user ID]

Now, it does submit these over SSL, but if you're reading this you know the information is still susceptible to a man-in-the-middle (MITM) attack.

A quick look at its SSL configuration also shows the server has a number of SSL configuration problems:

1. Susceptible to DoS because it supports client-initiated renegotiation

2. Vulnerable to MITM because it supports insecure renegotiation

3. Supports weak and extra-weak (export-strength) ciphers

 

The certificate itself also doesn't include OCSP pointers (which are now required under the Baseline Requirements, though this certificate was issued before they became mandatory). More importantly, it only includes an LDAP reference to a CRL, which most clients won't chase and, even if they did, most firewalls won't allow out – in other words, the certificate can't be meaningfully revoked by anyone other than the browsers.

More concerning is that it is issued directly off of a root certificate authority, which means the root is likely an online CA rather than something kept in an offline vault. That implies all sorts of badness when you consider that not every browser patch gets adopted and the only way to revoke certificates like this is to patch the browser.
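
If you want to inspect a certificate for these pointers yourself, here is a minimal sketch using the third-party Python cryptography package (the file name is a placeholder):

```python
# Inspect a certificate's revocation pointers: OCSP URLs from the
# Authority Information Access extension and CRL Distribution Points.
# Requires the third-party 'cryptography' package; cert.pem is a placeholder.
from cryptography import x509
from cryptography.x509.oid import AuthorityInformationAccessOID, ExtensionOID

with open("cert.pem", "rb") as f:
    cert = x509.load_pem_x509_certificate(f.read())

try:
    aia = cert.extensions.get_extension_for_oid(
        ExtensionOID.AUTHORITY_INFORMATION_ACCESS).value
    ocsp_urls = [d.access_location.value for d in aia
                 if d.access_method == AuthorityInformationAccessOID.OCSP]
except x509.ExtensionNotFound:
    ocsp_urls = []

try:
    cdps = cert.extensions.get_extension_for_oid(
        ExtensionOID.CRL_DISTRIBUTION_POINTS).value
    crl_urls = [name.value for dp in cdps for name in (dp.full_name or [])]
except x509.ExtensionNotFound:
    crl_urls = []

print("OCSP:", ocsp_urls or "none")
print("CRL: ", crl_urls or "none")
# An ldap:// CRL URL with no http:// alternative is effectively
# unreachable for most clients sitting behind a firewall.
```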

A look at revocation repository uptime

It is no secret that in the last two months GlobalSign was affected by outages relating to our use of CloudFlare. I won't go into the specifics behind those outages because the CloudFlare team does a great job of documenting them, as well as working to make sure the mistakes of the past do not reoccur. With that said, we have been working closely with CloudFlare to ensure that our services are better isolated from their other customers and to optimize their network for the traffic our services generate.

I should add that I have a ton of faith in the CloudFlare team; these guys are knowledgeable, incredibly hard-working, and very self-critical — I consider them great partners.

When looking at these events it is important to look at them holistically; for example, one of the outages was a result of mitigating what has been called the largest publicly announced DDoS in the history of the Internet.

While no downtime is acceptable, and I am embarrassed we have had any, it's also important to look at the positives that come from these events; for one, we have had an opportunity to test our mitigations for such events and improve them so that in the future we can withstand even larger attacks.

Additionally it’s also useful to look the actual uptime these services have had and to give those numbers some context look at them next to one of our peers. Thankfully I have this data as a result of the revocation report which tracks performance and uptime from 21 different network worldwide perspectives every minute.

For 05/2012-12/2012 we see:

Service                                              Uptime (%)   Avg (ms)
GlobalSign/AlphaSSL OCSP                               100.00     101.29
VeriSign/Symantec/Thawte/GeoTrust/Trustcenter OCSP      99.92     319.40
GlobalSign/AlphaSSL CRL                                100.00      96.86
VeriSign/Symantec/Thawte/GeoTrust/Trustcenter CRL       99.97     311.42

 

For 01/2013 to 04/2013 we see:

Service                                              Uptime (%)   Avg (ms)
GlobalSign/AlphaSSL OCSP                                99.98      76.44
VeriSign/Symantec/Thawte/GeoTrust/Trustcenter OCSP      99.85     302.88
GlobalSign/AlphaSSL CRL                                 99.98      76.44
VeriSign/Symantec/Thawte/GeoTrust/Trustcenter CRL       99.22     296.97

NOTE: Symantec operates several different infrastructures – which one you hit depends on which brand you buy from and, in some cases, which product you buy. We operate only two brands, which share the same infrastructure. I averaged the results for each of their brands together to create these two tables. If you want to see the independent numbers, see the Excel document linked to this post.
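
For the curious, the averaging is nothing fancy; a sketch of the idea, with made-up per-brand numbers for illustration (the real data is in the linked spreadsheets):

```python
# Average per-brand monitoring results into a single row, as was done
# for the Symantec family in the tables above. These numbers are
# invented for illustration only.
brands = {
    "Brand A OCSP": {"uptime": 99.90, "avg_ms": 320.0},
    "Brand B OCSP": {"uptime": 99.95, "avg_ms": 310.0},
    "Brand C OCSP": {"uptime": 99.91, "avg_ms": 328.2},
}

uptime = sum(b["uptime"] for b in brands.values()) / len(brands)
avg_ms = sum(b["avg_ms"] for b in brands.values()) / len(brands)
print(f"Combined: {uptime:.2f}% uptime, {avg_ms:.2f} ms average")
```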

 

As you can see, no one is perfect. I don't share this to say our downtime is acceptable, because it is not; instead I want to make clear that this is data we track and use to improve our services, and to make clear what the impact really was.

By the way, if you want to see the data I used in the above computations, you can download these spreadsheets.

Why we built the Revocation Report

For over a year I have been monitoring the industry’s largest OCSP and CRL repositories for performance and uptime. I started this project for a few reasons but to understand them I think it’s appropriate to start with why I joined GlobalSign.

If you’re reading this post you are likely aware of the last few years of attacks against public Certificate Authorities (CA). Though I am no stranger to this space, like you I was watching it all unfold from the outside as I was working at Microsoft in the Advertising division where I was responsible for Security Engineering for their platform.

I recall looking at the industry and feeling frustrated about how little had changed in the last decade, feeling like the Internet was evolving around the CA ecosystem – at least technologically. The focus seemed almost exclusively on policies, procedures, and auditing, which are of course extremely important when you're in this business, but by themselves they are not a solution.

When I looked at the CA ecosystem there were a few players who I thought understood this; the one I felt got it the most was GlobalSign. Instead of joining the race to the bottom they were developing solutions to help with key management, certificate lifecycle management, and publishing guides to help customers deploy certificates cost effectively.

As a result, when they approached me with the opportunity to join them as their CTO and set the technology direction for the company, I was intrigued. Those of you who know me know I love data; I believe that, whether they recognize it or not, successful businesses above all leverage the Define, Measure, Analyze, Improve and Control cycle to ensure they are solving the right problems and doing so effectively.

To that end, when I joined GlobalSign as their CTO I wanted market intelligence on what the status quo was for technology, operating practices, and standards compliance that I could use to adjust my own priorities as I planned where GlobalSign would focus its investments.

It was more than that though; as many of you know, I am not new to PKI, and especially not to revocation technologies, having developed several products and features in this area as well as contributing to the associated standards over the years. I was always frustrated by many public certificate authorities' inability or unwillingness to acknowledge the inadequacy of their revocation infrastructure and its contribution to slow TLS adoption and bad user-agent behavior when it comes to revocation checking.

More directly, the reliability and performance of major CA operational infrastructure is why browsers had to implement what is now called "soft-fail" revocation behavior: treating a failure to check the status of a certificate the same as a successful check. Yet it is these same people who point fingers at the browsers when the security implications of this behavior are discussed.

I wanted to see that change.

This is why from the very beginning of this project I shared all the data I had with other CAs. My hope was they would use it to improve their infrastructure, but unfortunately, aside from one or two smaller players, no one seemed concerned – I was shouting into the wind.

With the limited feedback I had received on the data I had been collecting, I decided to put together what is now the revocation report. As part of this project I switched to a different monitoring provider (Monitis) because it gave me more control over what was being monitored and had a more complete API I could use to get at the data.

In parallel I began to work with CloudFlare to address what I felt was one barrier to optimally using a CDN to distribute OCSP responses: the inability to cache POSTs. The whole time I chronicled my experiences, thoughts, and decisions on my blog so that others could learn from my experience and the industry as a whole could benefit.
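
For background, the OCSP protocol allows small requests to be encoded into a GET URL (per RFC 5019), which is what makes responses CDN-cacheable; here is a sketch of building such a URL with the third-party Python cryptography package (file names and the responder URL are placeholders):

```python
# Build a cacheable OCSP GET URL (RFC 5019 style): the DER-encoded
# request is base64'd and appended to the responder URL. GETs like
# this can be cached by a CDN; POSTs generally cannot.
import base64
import urllib.parse

from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.x509 import ocsp

with open("cert.pem", "rb") as f:          # placeholder file names
    cert = x509.load_pem_x509_certificate(f.read())
with open("issuer.pem", "rb") as f:
    issuer = x509.load_pem_x509_certificate(f.read())

builder = ocsp.OCSPRequestBuilder().add_certificate(cert, issuer, hashes.SHA1())
der = builder.build().public_bytes(serialization.Encoding.DER)

responder = "http://ocsp.example-ca.com"   # hypothetical responder URL
url = responder + "/" + urllib.parse.quote(base64.b64encode(der).decode(), safe="")
print(url)  # fetch this with a plain HTTP GET; the response is cacheable
```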

When I set up the Monitis account I licensed the ability to monitor the top responders from 21 locations worldwide every minute. At first I just published the graphical reports Monitis provided, but they had a few problems:

  1. They did not perform very well (at the time).
  2. It was not laid out in such a way you could see all the data at once (also now fixed).
  3. It did not exclude issues associated with their monitoring sensors.
  4. It gave no context to the data that was being presented.

This is what took me to working with Eli to build the revocation report we have today; the internet now has a public view into approximately eleven months (and growing) of performance data for revocation repositories. Eli and I are also working on mining and quantizing the data so we can do something similar for responder uptime, but this has taken longer than expected due to other priorities — we will finish it though.

So the question at this point is “was the effort worth it?” — I think so, both of us put a lot of time into this project but I believe it’s been a success for a few reasons:

  1. It allowed me to figure out how to improve our own revocation infrastructure; we now perform at about the same speed as gstatic.google.com for a similarly sized object, which is what the bar should be.
  2. Both StartSSL and Entrust have now followed suit and made similar changes to their infrastructure, improving their performance by about 3x (besting our performance by a few ms!).
  3. Symantec has improved their primary revocation repository performance by almost 40% and I understand more improvements are on the way.
  4. We are closer to having a data-based argument we can present to browsers about why they can and should re-enable hard-fail revocation checking by default.
  5. It gives customers visibility into the invisible performance hit associated with the decision of who you choose as your certificate provider.

What do you think? Do you find this valuable? Are there any other elements you think we should be tracking?

Microsoft Root Program and Excel

The other day I was studying up on my Excel so I could help someone with a project; today I met with a friend to discuss the composition of root programs.

Since when you’re a plumber you fix everything with a wrench out came Excel and since sometimes I can’t let a problem sit still I spent far to much time slicing and dicing the Microsoft Root Program membership list.

There is a ton more that can be done, for example:

  1. Root CA adoption relative to UN membership.
  2. Root CA certificates based on validity dates.
  3. Comparing the Microsoft Root Program membership to the Mozilla Root Program membership.
  4. CAs per network (using AIA:OCSP URLs as an indicator)
  5. CA adoption of CDNs for OCSP and CRLs.
  6. A look at how many operational facilities are associated with each organization (for example, Symantec has 66 root CAs for their various brands – are they all in the same facilities?).

 

Anyways, it was a fun exercise and I thought I would share the results with you; here is the XLSX.

For those who do not want to look at the XLSX, here are some statistics you may find interesting:

  1. 31.30% of the CAs were owned explicitly by governments.
  2. 64.35% of the CAs were owned by commercial entities in the business of being third-party trust providers.
  3. The 352 certificates are owned by 115 organizations: 36 government, 74 commercial, and 5 enterprise.
  4. The USA has the most organizations that own CAs, coming in at 13.04%.
  5. Spain is next, with 12.17% of the organizations owning CAs based there.
  6. France is number three, with 5.22% of the organizations owning CAs based there.
  7. 17.33% of the roots use 1024-bit keys.
  8. 60.23% of the roots use 2048-bit keys.
  9. 18.47% of the roots use 4096-bit keys.
  10. 1.70% of the roots use ECDSA P-384 keys.
  11. 115 of the 206 sovereign nations have CAs within their borders.
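
If you would rather slice the data in code than in Excel, tallies like these fall out of a few lines of Python; a sketch, assuming a hypothetical CSV export of the membership list (the file and column names are placeholders):

```python
# Tally root key types from a CSV export of the root program list.
# 'roots.csv' and its column names are hypothetical; adjust to match
# whatever export you are working from.
import csv
from collections import Counter

key_types = Counter()
with open("roots.csv", newline="") as f:
    for row in csv.DictReader(f):
        key_types[row["Key Type"]] += 1  # e.g. "RSA 2048", "ECDSA P-384"

total = sum(key_types.values())
for key_type, count in key_types.most_common():
    print(f"{key_type:>12}: {count:4d} ({100 * count / total:.2f}%)")
```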

Is SSL Broken?

[This is a re-post of an article I wrote for the GlobalSign corporate blog; you can find it here]

It seems every month a new flaw is identified in SSL, and while that's a slight exaggeration, after a while one starts to ask the question – is SSL broken? My answer to that question would be no, but the protocol is nearly twenty years old, and even though it now carries a new name (TLS) it also carries much of the baggage of the past in its design.

Despite this fact, my faith in TLS is stronger today than it ever was. My reasoning is simple – today we understand the strengths and weaknesses of this protocol better than we ever have. It is continuously reviewed by the world’s best engineers and cryptographers, trying to find the bad assumptions their predecessors made, strengthening it in response to identified weaknesses, and modernizing it to use the strongest forms of cryptography available.

This continuous investment in this foundational technology gives me faith.

Today another attack on TLS was made public. "Lucky Thirteen" is a derivative of the work of French cryptographer Serge Vaudenay (padding oracles against CBC-based ciphers, 2002), though unlike Vaudenay's attack, Lucky Thirteen uses a known timing attack previously believed to be impractical. A successful application of this attack enables an attacker to decrypt your SSL communications.

Unlike other recent attacks, such as BEAST, Lucky Thirteen requires a server-side fix. This means that complete and effective protection against this attack will require all web servers to be updated or patched.

That said, it is possible to mitigate the attack by removing CBC cipher suites, since the attack is against SSL/TLS’s use of CBC. But what to use in its place? The consensus of security researchers is to adopt suites based on AES-GCM, and while I agree, this has one problem – the large population of clients that do not yet support it.

This recommendation is complicated slightly by the BEAST attack from last year, the resolution of which required a client-side fix that has, in all likelihood, not yet been deployed ubiquitously. As such, I still recommend prioritizing the older and less secure RC4-based suites above AES-GCM, since doing so addresses both issues.
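
Whatever priority order you settle on, it is worth verifying what servers actually negotiate; here is a quick sketch of checking the chosen suite from Python (the host name is a placeholder):

```python
# Connect to a server and report which protocol version and cipher
# suite were negotiated, so you can verify your priority order took
# effect. The host below is a placeholder.
import socket
import ssl

host = "www.example.com"  # hypothetical host to check

ctx = ssl.create_default_context()
with socket.create_connection((host, 443)) as sock:
    with ctx.wrap_socket(sock, server_hostname=host) as tls:
        name, version, bits = tls.cipher()
        print(f"{tls.version()}: {name} ({bits}-bit)")
```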

But should you be worried? It depends. If you are using TLS (and not its little brother DTLS) I would say your best bet is to walk calmly to the nearest exit, and use this as an excuse to ensure you are following industry best practices when deploying SSL – if you're not, this attack is the least of your worries. Specifically, I would recommend visiting the SSL Configuration Checker and making the critical (red) and important (yellow) configuration changes it suggests.

I would also encourage you to deploy HTTP Strict Transport Security on your site, since the attack it mitigates (SSL stripping) is much easier for an attacker to execute.

The good news is that if you were already following the advice of the SSL Configuration Checker you were prioritizing RC4 over other ciphers, and most sessions to your server were already resistant to this attack. This doesn't mean you should not be deploying the patch for this issue; you just don't need to do so in a crazed rush.

So are there any lessons we can take away from this? Of course there are. As a server operator, I would say this finding underscores the importance of regularly reviewing your server configuration to ensure that it follows industry best practices, and that you are always operating the most recent and stable release of your web server.

If you want a more technical walkthrough of this attack, I highly recommend this post by Matthew Green on TLS timing oracles or this one by Adam Langley.

How Facebook can avoid losing $100M in revenue when they switch to always-on SSL

Recently Facebook announced that they will be moving to Always-On-SSL; I for one am thrilled to see this happen – especially given how much personal data can be gleaned from observing a Facebook session.

When they announced this change they mentioned that users may experience a small performance tax due to the addition of SSL. This is unfortunately true, but when a server is well configured that tax should be minimal.

This performance tax is particularly interesting when you put it in the context of revenue, especially when you consider that Amazon found that every 100ms of latency cost them 1% of sales. What if the same holds true for Facebook? Their last quarter's revenue was $1.23 billion, so I wanted to take a few minutes to look at their SSL configuration and see what this tax might cost them.

First I started with WebPageTest; this is a great resource for the server administrator to see where time is spent when viewing a web page.

The way the site works is it downloads the content twice using real instances of browsers; the first time should always be slower than the second, since the second gets to take advantage of caching and session re-use.

The Details tab gives us a breakdown of where the time is spent (for the first-use experience); there's lots of good information here, but for this exercise we are interested only in the "SSL Negotiation" time.

Since Facebook requires authentication to see the "full experience", I just tested the log-on page; it should accurately reflect the SSL performance "tax" for the whole site.

I ran the test four times, each time summing the total number of milliseconds spent in "SSL Negotiation"; the average across these runs was 4.111 seconds (4,111 milliseconds).

That’s quite a bit but can we reduce it? To find out we need to look at their SSL configuration; when we do we see a few things they could do to improve things, these include:

Let’s explore this last point more, the status check the browser does is called an OCSP request. For the last 24 hours their current CA had an average world-wide OCSP response time of 287 ms, if they used OCSP Stapling the browser would need to do only one OCSP request, even with that optimization that request could be up to 7% of the SSL performance tax.

GlobalSign's average worldwide OCSP response time for the same period was 68 milliseconds, which in this case could have saved 219 ms per visit. To put that in context, Facebook gets 1.6 billion visits each week. If you do the math (219 ms × 1.6 billion visits ≈ 350 million seconds), that's over 4,000 days' worth of waiting saved every week – nearly 580 person-years every year. Or put another way, roughly every six weeks users would collectively save a lifetime's worth of time otherwise spent waiting for Facebook pages to load.
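
For those who want to check the arithmetic, here it is spelled out:

```python
# Back-of-the-envelope: total user time saved by shaving 219 ms of
# OCSP latency off each of Facebook's ~1.6 billion weekly visits.
saved_ms_per_visit = 287 - 68        # slower CA minus faster CA
visits_per_week = 1.6e9

seconds_per_week = saved_ms_per_visit / 1000 * visits_per_week
days_per_week = seconds_per_week / 86_400
years_per_year = days_per_week * 52 / 365.25

print(f"{days_per_week:,.0f} days of waiting saved per week")
print(f"{years_per_year:,.0f} person-years saved per year")
# ~4,056 days/week and ~577 person-years/year
```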

If we consider that in the context of the Amazon figure, simply changing their CA could be worth nearly one hundred million a year.

Before you start to pick apart these numbers, let me say this is intended to be illustrative of how performance can affect revenue and not a scientific exercise; so, to save you the trouble, some issues with these assumptions include:

  • Facebook's business is different than Amazon's, and the impact on their business will be different.
  • I only did four samples of the SSL negotiation and a scientific measurement would need more.
  • The performance measurement I used for OCSP was an average, not what was actually experienced in the sessions I tested – it would be awesome if WebPageTest could include a more granular breakdown of the SSL negotiation.

With that said, clearly even without switching CAs there are a few things Facebook can still do to improve how they are deploying SSL.

Regardless, I am still thrilled Facebook has decided to go down this route; the change to deploy Always-On-SSL will go a long way to help the visitors to their sites.

Ryan

A quick look at SSL performance

When people think about SSL performance they are normally concerned with the impact on the server; specifically, they talk about the computational and memory costs of negotiating the SSL session and maintaining the encrypted link. Today, though, it's rare for a web server to be CPU- or memory-bound, so this really shouldn't be a concern; with that said, you should still care about SSL performance.

Did you know that at Google SSL accounts for less than 1% of the CPU load, less than 10KB of memory per connection and less than 2% of network overhead?

Why? Because studies have shown that the slower your site is, the less people want to use it. I know it's a little strange that they needed studies to figure that out, but the upside is we now have some hard figures we can use to put this problem in perspective. One such study was done by Amazon in 2008; in it they found that every 100ms of latency cost them 1% in sales.

That should be enough to get anyone to pay attention so let’s see what we can do to better understand what can slow SSL down.

Before we go much further on this topic we have to start with what happens when a user visits a page; the process looks something like this:

  1. Look up the web server's IP address with DNS
  2. Create a TCP socket to the web server
  3. Initiate the SSL session
  4. Validate the certificates provided by the server
  5. Establish the SSL session
  6. Send the request content

What's important to understand is that, to a great extent, the steps described above happen serially, one right after another – so if they are not optimized they result in a delay to first render.
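
You can get a feel for this serialization with a few lines of Python; a rough sketch (the host name is a placeholder) that times the DNS, TCP, and TLS phases of a single connection:

```python
# Time the serial steps a browser performs before it can send an
# HTTPS request: DNS lookup, TCP handshake, then the TLS handshake
# (which includes certificate validation).
import socket
import ssl
import time

host = "www.example.com"  # hypothetical target

t0 = time.perf_counter()
addr = socket.getaddrinfo(host, 443)[0][4][0]     # 1. DNS lookup
t1 = time.perf_counter()

sock = socket.create_connection((addr, 443))      # 2. TCP handshake
t2 = time.perf_counter()

ctx = ssl.create_default_context()                # 3-5. TLS handshake,
tls = ctx.wrap_socket(sock, server_hostname=host) # incl. cert validation
t3 = time.perf_counter()

print(f"DNS: {(t1 - t0) * 1000:.1f} ms, "
      f"TCP: {(t2 - t1) * 1000:.1f} ms, "
      f"TLS: {(t3 - t2) * 1000:.1f} ms")
tls.close()
```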

To make things worse, this set of tasks can happen literally dozens if not hundreds of times for a given web page; just imagine that process being repeated for every resource (images, JavaScript, etc.) listed in the initial document.

Web developers have made an art out of optimizing content so that it can be served quickly, but they often forget about the impact of the above. There are lots of things that can be done to reduce the time users wait to get to your content, and I want to spend a few minutes discussing them here.

First (and often forgotten) is that you are dependent on the infrastructure of your CA partner. You can make your DNS as fast as possible, but you're still dependent on theirs; you can minify your web content, but the browser still needs to validate the certificate you use with the CA you got it from.

These taxes can be quite significant and add up to 1000ms or more.

Second, a mis- (or lazily) configured web server is going to result in a slower user experience; there are lots of TLS options that have a material impact on performance, ranging from simple certificate-related choices to more advanced SSL options and configuration tweaks.

Finally, basic networking concepts and configuration can have a big impact on your SSL performance, from the simple, like using a CDN so the SSL session terminates as close as possible to the user of your site, to the more advanced, like tuning TLS record sizes to be more optimal.

Over the next week or so I will be writing posts on each of these topics, but in the meantime there are some good resources available to help you learn about these problem areas.

Reading ocspreport and crlreport at x509labs.com

As you may know I have been hosting some performance and up-time monitors at: http://ocspreport.x509labs.com and http://crlreport.x509labs.com.

I started this project about six months ago when I walked the CAB Forum membership list, visited the sites of the larger CAs on that list, looked at their certificates, extracted both the OCSP and CRL URLs, and added them to a custom monitor running on AWS nodes.

Later I tried Pingdom and finally settled on Monitis, because Pingdom doesn't let you control which monitoring points are used and doesn't give you the ability to do comparison views. That said, as a product I liked Pingdom much better.

As for how I configured Monitis, I did not do much — I set the Service Level Agreement (SLA) for uptime to 10 seconds, which is the maximum time the CAB Forum requirements allow for revocation responses. I also selected all of the monitoring locations (30 of them) and set it loose.

I put this up for my own purposes, so I could work on improving our own service, but I have also shared it publicly; I know several of the other CAs being monitored are also using it, which I am happy to see.

OK, so today I found myself explaining a few things about these reports to someone, so I thought it would be worthwhile to summarize those points for others. They are:

  1. Why is it so slow to render? – Unfortunately, despite numerous requests to Monitis, there is nothing I can do about this – Monitis is just slow.
  2. Why does it show downtime so often? – I do not believe the downtime figures; most of the time the failures show up on all of the URLs at once. The times I have looked into these, it turned out the failures were at Monitis or due to regional network congestion or failures. Unfortunately this means we cannot rely on these figures for uptime assessment; at best they are indicators when looked at over long periods of time.
  3. Why do some tests show 0-1 ms? – This is likely because the Monitis testing servers are located in the same data center as the OCSP servers in question. This skews the performance numbers a little, but the inclusion of many perspectives should offset it.

At this point I suspect you’re wondering, with these shortcomings what is this thing good for anyways? That’s a good question; OCSP (and CRLs) are a hidden tax that you and your users pay when they visit your site.

This is important because studies have found a direct correlation between latency and user abandonment, and seriously, who doesn't want their site to be as fast as possible?

My hope is these resources help you understand what that tax is; if you’re a CA operator it can also help you tweak your performance as well as get an idea of what the global user experience is for the relying parties of your certificates.

On a related note I do think someone could make a pretty penny if they made an easy to use, yet powerful monitoring site 🙂

A look at revoked certificates

So today I have done posts on the browser user experience for expired and untrusted certificates, but we wouldn't have proper coverage of the topic of bad certificate user experience if we did not cover revoked certificates.

VeriSign is kind enough to host a test site that uses a revoked certificate (I know we do too, I just can't find it right now), so we will use that: https://test-sspev.verisign.com:2443/test-SSPEV-revoked-verisign.html

Again what we want to see here is:

  1. Users are warned or prohibited from going to the site in question.
  2. The warning language used is easy to understand and explains the risks.
  3. The warning language used is related to the fact that the certificate is revoked.
  4. The trust indicator does not show or is marked to indicate that there is a problem.

In this case I think again Internet Explorer and Chrome do the best; the worst experience is in Opera, as it leads the user to believe there is a connectivity problem unless they expand the error message.

[Screenshots of the revoked certificate warnings in Chrome, Internet Explorer, Mozilla, Opera, and Safari]