What value can a third-party provide users when browsing the web?

While at the CA/Browser Forum I was asked by a friend if we wanted to replace EV with a new class of certificate what would that certificate look like?

My response was that I would frame the question differently. The “real” question is what problems does a typical user have that a third-party with the strengths of a CA could help with?

With this in mind, you need to first understand who this stereotypical user is, a software engineer may have different needs than a grocery store clerk. They may also have common needs, you won’t know that until you do research.

The only way to do reliable research on this topic is to actually work with those users to understand what their needs are. While this is much harder than it sounds due to biases introduced in such processes a real needs analysis requires that you start here.

With that said, I suspect this exercise would show a broad swath of the target users is concerned with these questions:

  • Will I have a good experience working with the people behind the website?
  • Do the people behind this website have a good reputation?
  • Are the people behind this website experts in their craft?
  • How do I figure out how to reach a real human when and if I need to?

I would put those concerns into the context of the interaction they will have with the website (buying a product, downloading software, etc).

With that understanding I would then try to understand what the strengths of the CA are, having been a CA for a long time I would say:

CAs are good at verifying claims relating to the subject of a certificate.

I would then try to map the identified problems and strengths together to see what potential value the CA could provide that user.

Again the right thing to do is formally do those above explorations but for the purpose of this post I suspect these exercises would find that:

  • When a user visits a website they may struggle to find out how to contact the sales/support for that business,
  • When a user visits a site for the first time it may be hard for them to determine what the companies true line of business is,
  • After a user previously visited a website and completed a transaction with it they sometimes need to contact that business after the fact and could be assisted in finding the right contact information,
  • Before deciding to do a high-value transaction with a business, customers may want to find out the experience others have had with that business.

Now, just because a user may have these problems and a CA may be able to help solve them, it does not mean the SSL indicator is the right place to help answer these questions. It just means that there is a problem and skills intersection.

When, and how to solve this problem is another exercise altogether. Let’s explore EV for a second to give that some context.

Today if we assume the information in an EV certificate is correct (and not confusing see: this and this for context) we can say it provides the answer to “if I need to sue these people where do I tell my lawyer they are at?”.

The problem with that is that you may not have that information when you need it. I say this because you typically need to sue someone after you completed a transaction with them not before. After the fact, you have no assurance that this information in the certificate will be available at the site you did the transaction with.  The website may have gone away, they could have changed their certificate, or could some other change may have taken place that makes that information not readily available to you when you need it.

In any event, the point of this post is to say CAs should not be asking what they can put into certificates but what problems users have that CAs are well suited to solve. Unless they start there they will not be solving a real problem, they will just be bolting more things onto a certificate and asking why browsers and users don’t users see value in it.

Reality vs Fantasy – The DV vs EV argument

This morning I woke up to a blog post from Melih, the founder of Comodo titled “Problem vs Solution Value mapping”.

This is a follow-up to an ongoing discussion Melih and I have been having about the value of EV, and positive trust indicators. On my blog, the conversation started July 2017 if you’re interested.

Melih’s focuses his most recent post on the assessment of “value”, correctly attempting to define it as the basis of the rest of the post. He chooses to define it as  “the direct result of a resolution to a problem.” I think it is this definition is the first part of his argument I have an issue with. Namely, The Oxford Dictionary defines “value” as “the regard that something is held to deserve; the importance, worth, or usefulness of something.”

When considering “value” with this definition, I believe an analysis of “value” would start by building a case on what is “deserved”. To do that, we have to also define a context in which that value is assessed. I think this is probably the hardest part, and probably where most of the disagreement on “value” of EV stems from.

If we say the context of this assessment is “the security and privacy guarantees that can be provided to the user by user agents to users” EV’s value is no better than that of DV. It is not a hard case to make either.

The security model of the browser is based on the concept of “origin” where that origin is essentially the “hostname” that the content was retrieved from. Any external website or resource embedded in the site (with rare exception) has the same permission as the original website as a result of this model. This is how web analytics work, advertising and many other products and services that make up the web.

Until user agents required all of these entities that make up a given site to use EV and to have the legal entity in all of the associated certificates match; EV is a false flag. It says “you are talking to this legal entity” when in-fact your talking to many legal entities and any one of them could equally harm you.

The reality is that if this change were to be made that you would almost never see EV badges though. This is because virtually every site is made up of content and services from across the web and this condition would almost never be met. This is why we do not see CAs making the argument that this rule should be enforced by UAs.

If we say the context of this assessment is “the average users practical ability to protect themselves from phishing” again EV does not fair well. There have been lots of user studies done on how users do not understand positive trust indicators, and in general, do even notice them in most cases.

Furthermore, even if we disregard these well-run studies (and the associated common sense) as Ian Carroll showed with his Stripe, Inc business in Kentucky the values displayed in these indicators can trivially be made, at a very low cost and with no traceability, be made to say whatever an attacker wants. This again frames EV as a false flag because it can so easily be used to lend credence to a phisher’s site by giving them the EV badge that says the same thing as their target site.

If this was not enough, again if we disregard these well-run studies and say that people need to take the responsibility for looking at the EV badge to get confidence they are dealing with a trustworthy entity we need to look no further than the work James Burton did when he got a certificate for his business “Identity Verified”.  In this case, if a user has been taught to look at the EV indicator for an abstract concept of “trustworthiness” we are back to the user being mislead.

All of this ignores another very real problem, that being most phishing sites are not bespoke sites, instead, they are sites that are hacked and re-purposed. A good example of this is this one from a few weeks ago. What we appear to have here is a company called Northern Computer Services, LLC hosting a website for a business with the domain name “stampsbyjudith.com” hosting a Bank of America phishing site.

Now EV proponents surely see this as an example of EV working but if you look at it critically you will see it is exactly the opposite. First, could a customer believe that this “Northern Computer Services” is somehow a service provider to Bank of America? It seems reasonable to assume that the average user does not know anything about the way Bank Of America operates its services. In-fact even if you do have some level of understanding it’s incredibly common for banks to use service providers for different capabilities, maybe this Northern Computer Services hosts the BoFa website or provide billpay or mortgage services. How is the average user to know?

But what about the URL? There is no plausible way Bank Of America is hosting their site on the domain stampsbyjudith.com! Your absolutely right! it’s a fair expectation of us that if a user happens to look at the address bar that they should be able to figure that out. This is of course something you get when you use DV though, no EV necessary. Then there is the issue that studies also show that users do not look at the address bar either.

This is why Microsoft has created SmartScreen and Google has created Safe Browsing. These solutions utilize the massive scale and technology depth of these organizations along with machine learning and other advanced techniques to find phishing sites. As a result when a user navigates to a site similar to this one they get a interstitial warning them about proceeding.

In summary, in this context, I would argue that as EV exists today it actually makes things harder on the user and easier on the attacker.

With that context in mind let’s explore each of the arguments that Melih makes.

Users want protection from Transit Providers. Sure they do but I would say the if a user framed the topic this way it would demonstrate the how little they actually understand of the problem in question. It is not just “transit providers” they need protection from, it is every entity other than those that are necessary to serve the application hosted at a domain.

Networking is so complex it is not possible to expect even some of the most technical users to understand all of the nuances involved here.

I would like to point out that Melih again attempts to redefine terms, this time in a disingenuous way. Specifically, in this part of his post suggest there is some common understanding that there is a difference between “encipherment” and “encryption”.

Let’s again take a look at what the Oxford Dictionary says:

Encryption – The process of converting information or data into a code, especially to prevent unauthorized access.

Encipherment – Convert (a message or piece of text) into a coded form.

As you can see, these words mean the same thing. The only difference being the example use case in one of the definitions. But maybe this inconsistency is use  is because the Oxford Dictionary does not address a cryptographers view on these words? Unfortunately, that is not the case either, if you were to look at books like Serious Cryptography, Cryptography and Network Security, or even the very dated Applied Cryptography you will find no usage of these terms in this way.

What Melih has suggested in the past, and continues to do so in this section is that somehow if you authenticate only the domain and use that authentication as the basis for the session protection that this is not “encryption”.

Going so far to suggest that it is only encryption if you authenticate the legal entity. This is frankly ludicrous and I can not even respond to this more than I just have here.

I can say, that redefining a term, especially in such a specious way devalues any other valid points he may have.

But what about the users! The users want to know who they are dealing with! I actually agree with this but I also think it is far more complicated than users actually understand. So much so I would argue it is not possible to do in most cases. As a father when I run into situations where my kids want things that are not possible I sometimes joke with them and say “Well I want a pony!”.

It feels to me this is probably a case where that response is appropriate. The reality is there is not a globally unique business name, this is also the case with logos. Probably the best mainstream examples of this are the fake Starbucks stores and the notorial “Apple Stores” of Asia.

Fake Apple Store Highlights Counterfeit China

77778-full

This is the nature of brand names, in-fact there is an entire discipline of law (Trademark Law) dedicated to this topic and multilateral international agreements on how such disputes are to be handled.

So in the context of the url, does EV as it stands today add or remove value? From my perspective, it seems to me at a minimum in this context it provides no value but I could also make a reasonable argument it makes things worse here as well due to the introduction of more surface area for confusion.

User’s want to know if its “safe” to interact with the website! Again I can agree with this, the problem is names do not harm — we even teach our kids rhymes to remind them of this fact:

Sticks and stones may break my bones, but names can never hurt me.

To keep users safe we have to look at far more than the name a website is hosted under; there are literally thousands of features that a solution intending to protect users safety need to consider and I would not be surprised to find out that the name is one of the least important.

This is, again, why we have solutions like SmartScreen and Safe Browsing these solutions are constantly watching feeds of data to determine if a website is safe or not. It is not possible to solve the “safety” problem in any meaningful way without similar techniques.

But user’s want to be able to trust the content they see! Again, I also think this is something that users want, I just don’t think they can have everything they want.

But before I talk about this I want to talk about how Melih is redefining a term again, he suggests that “trust” means “having the ability to validate VISA, Paypal logo etc”. The oxford dictionary defines trust as “Firm belief in the reliability, truth, or ability of someone or something.”

With that, I would think that it would be more correct to say that they want to believe what they see. This is of course a very natural thing, something scammers have taken advantage of since the dawn of time.

When considering this desire I think we have to ask ourselves what the best way we have to service the desire. We also have to acknowledge that malicious content is everywhere in the world (don’t forget our Fake Starbucks and Apple Stores from above) that the best we can do is provide a speedbump.

This is, again, why we have solutions like SmartScreen and Safe Browsing as they were designed, engineered and continually evolve to address these risks.

In closing, I believe EV as it stands today is a round peg in a square hole. This does not mean there is not value in knowing the legal identity of the organization who operates a website, it is also not because these third-parties can’t do more to help users manage the risks they are exposed to.

It is because EV is being sold as something it is not, a anti-phishing tool. Simply put it is not well suited to help with that problem and I would go so far that when we teach users to see it as such it even helps phishers.

Risk variance and managing risk

One of my favorite security sayings is “My threat model is not your threat model”. We broadly accept there are different perspectives for every problem — the same is true with security.

Consider an Enterprises IT organization where you are chartered to support and secure a business. You need to meet this charter with a fairly fixed set of resources but the business requirements you must support are always changing. To deliver a reasonable level of service enterprises standardize on a core set of ways that certain issues (user and access management, compute, etc.) will be provided and force business units down the path of adopting them.  But that standardization often results in non-ideal user experiences, disjointed business workflows, slow innovation and importantly, in this case, it also commonly results in the either over or under mitigating security risks.

Startups, who are on the other end of the spectrum, are in a race to demonstrate market traction before their funding runs out. As a result, startups either virtually ignore security and privacy altogether or re-using a component they do not understand that is not well suited to the business problem they are solving or simply solves the wrong problems for their business risks.

Both enterprises and startups show the hard reality that we are often so close to our problems and set in our way of thinking we simply miss the big picture. This natural bias can hurt our businesses and in the context of security trade-offs, and result in incidents like the recent Equifax breach or the ever-growing list of Bitcoin exchange compromises.

The first step in preventing this “missizing” of risk is to make sure you understand what your risks actually are. The right way to do that is to think adversarially, taking a step outside of your business process or solution and thinking about the structure of the system your protecting and defining a threat model that captures those risks. This is not a one time exercise, it is something you need to constantly be re-visiting and getting new perspectives into.

Consider the typical Bitcoin exchange compromise, the exchanges usually start with a basic system limited to “online hot wallets” with weak architectural protections. They probably know better (for humanity’s sake I hope they do) but decided that the risk, in the beginning, is low enough because they have so little to lose that they go forward. Later they find success and are focused on other parts of their business and never get back to fixing that early trade-off.

As an aside, this bitcoin example showcases lots of problems, the largest being the asymmetric risk distribution. Specifically, the risk here is that of the depositor but the decision to take the risk is made by the exchange. I digress but this class of problem is a real problem in most startups and is the impact of which is multiplied 1000x in Bitcoin startups.

In any event, we can see how these sorts of things happen in hindsight so how can we limit how often these issues happen in the first place? As a technologist it hurts me to say this, the answer is process.

The good news is that process does not need to be heavy-weight. You need to make sure you approach the problem systematically and regularly, for example:

  • Instead of a threat model, you can do a simple threat tree,
  • You can use your bug tracking system to track the security issues you have found,
    • You can make sure those bugs capture the security decisions you have made,
    • The consequences of the identified risks for the actors in the system are captured,
    • What mitigations you have put in place for those issues along with how effective you think they will be are captured,
  • And importantly have a plan for how you will respond when things go wrong because they will regardless of how well you plan and invest in making the right security decisions.

You can then make sure you are reviewing these and acting accordingly on a regular schedule. This will ensure your organization at least has an inventory of issues that can last the individuals on a team and will make sure there is a point of conscious risk acceptance that the organization has taken.

This does not replace a full-on security program but it can at least make sure you are looking at the problem in the context of your business, your users and not just assuming every system has the same security needs.

On a related note for you startups, especially those who are operating in murky waters regarding regulation. The current regulations belong in your threat tree so you can make sure you understand them and begin to understand how they might apply to you even if you have to squint a bit to do it.

The Evolution of Security Thinking

In design sometimes we refer to the strategies used during the design process as Design thinking.  The application of these strategies helps ensure you are solving the right problems and doing so in a repeatable way. You can attribute much of the massive improvements in usability in software and devices over the two decades to these strategies.

If we look at how we have evolved thinking around building secure systems over the last two decades we can see that we have evolved similar strategies to help ensure positive security outcomes.

If we go back to the late 80s we see systems that were largely designed for a world of honest actors. There was little real business happening on the Internet at the time and the hard problems to be solved were all related to how do we enable a global network of interconnected systems so thats where efforts were put. These efforts led us to the Internet of today but it also gave us systems vulnerable to trivial attacks such as the Morris Worm.

By the 90s the modern “security industry” was born and products designed to protect these insecure systems from the internet started to come to market. One of the most impactful examples of this was the TIS Firewall Toolkit, other examples of this way of thinking include Antivirus products and other agents that promised to keep our applications and operating systems safe from “attackers”.

By the late 90s and early 2000s, it was clear that these agents were never going to be effective at keeping the bad guys out and that we needed to be building systems that were Secure by Default, Secure by Design and Private by Design. This shift in thinking meant that solution developers needed to develop their own strategies and tooling to ensure systems could be built to be inherently resilient to the risks they were exposed to. The concept of Threat Modeling is probably the most concrete example of this, believe it or not, this basic concept was essentially absent from software development up until this point.

By this time the technical debt in deployed systems was so great we spent most of a decade just trying to rectify the mistakes of the past. Windows XP SP2 and the Microsoft Security Stand Down is probably the most visible example of the industry making this shift, it also leads to the Security Development Lifecycle that largely informs how we as an industry, approach building secure systems today.

During this timeline, cryptography was treated as something that you sprinkled on top of existing systems with the hope to make them more confidential and secure. As an industry, we largely relied on the US Government to define the algorithms we used and to tell us how to use them securely. As a general rule only products designed for government use or for the small group of “cypherpunks” even considered the inclusion of cryptography due to the complexity of “getting it right”.

Things are changing again, we see the IETF via the CFRG working to standardize on international and independently created and cryptographic algorithms in lieu of relying exclusively on governments to do this standardization. We also see the concept of Formal Verification being applied to cryptographic systems (Galois is doing great work here with Cryptol as are other great projects in the verifiable computing space) which is leading us to have frameworks we can apply to build these concepts into other products securely (check out the Noise Protocol Framework as an example).

I think the Signal Protocol, Rough time, Certificate Transparency and even Blockchain Technologies are examples of the next phase of evolution in our thinking about how we build secure systems. Not because of “decentralization” or some anti-government bent in technologists, instead, these systems were designed with a more-complete understanding of security risks associated with their use.

Trust is a necessary component of human existence. It can give us peace of mind but It can also give us broken hearts. The same is true in the context of system design. Trust cautiously.

These systems, by design, go to great length to limit the need for “trust” for a system to work as intended. They do this by minimizing the dependencies that a system takes in its design, this is because each of those dependencies represents an attack vector as we advance technology our attackers become more advanced as well. They also make extensive use of cryptography to make that possible.

This focus on dependency reduction is why we see Blockchain enthusiasts taking the maximalist position of “Decentralize all the Things”. In my opinion, centralization is not always a bad thing, over-centralization maybe, but centralization can provide value to users and that value is what we should be focused on as solution developers.

My personal take is that when we look back on the next decade we will the say the trend was not “blockchain” but instead this is when we evolved our security thinking and tooling to better utilize cryptography. Specifically that this is when we started to use cryptography to make transparency, confidentiality and verifiability part of the core of the solutions we build instead of thinking of it as a layer we apply once we are done.

Positive Trust Indicators and SSL

[Full disclosure I work at Google, I do not speak for the Chrome team, and more generically am not speaking for my employer in this or any of my posts here]

Recently Melih did a blog post on the topic of browser trust indicators. In this post he makes the argument that DV certificates should not receive any positive indicator in the browser user interface.

I agree with him, just not for the same reasons. Positive trust indicators largely do not work and usability studies prove that is true. Browsers introduced the “lock” user interface indicator as part of a set of incentives and initiatives intending to encourage site operators to adopt SSL.

What is important is that these efforts to encrypt the web are actually working, over half the web is now encrypted and more importantly the adoption rate is demonstrating hockey stick growth.

As a result, in 2014 Chrome started down the path of deprecating positive trust indicators all together. In-fact today Chrome already marks HTTP pages as “Not secure” if they have password or credit card fields.

The eventual goal being able to mark all HTTP pages as insecure but for this to happen SSL adoption needs to be much higher, I suspect browsers will want to see adoption in the 90% to 95% range before they are willing to make this change.

This is relevant in this case because if all pages are encrypted what value does a positive trust indicator have? None. This means that when all HTTP pages get marked “Not secure” we will probably see the lock icon disappear.

So, as I said, I agree with Mehli, the “Secure” indicator should go away but so should the lock, the question is not if, but when.

But what does that mean for EV trust indicators? I am a member of a small group, a group that thinks that EV certificates can provide value. With that said today EV certificates have some major shortcomings wich significantly limit their value, some of which include:

  • It is not possible to get an EV wildcard certificate,
  • CAs, largely, have ignored automation for EV certificates,
  • Due to the lack of automation EV certificates are long lived and their keys more susceptible to theft as a result,
  • The vetting processes used in the issuance of EV certificates are largely manual making them expensive and impractical to use in many cases,
  • CAs market them as an antiphishing tool when there are no credible studies that support that,
  • The business name in the certificate is based on the legal name of the entity, not the name they do business under (DBA),
  • The business address details in the certificate are based on where business is registered (e.g. Delaware),
  • There is no contact information in the certificate, short of the taxation address, that a user can use to reach someone in case of an issue.

Addressing these issues have either been actively been resisted by CAs, for example, DigiCert has tried to get EV wildcard certificates to be a thing in the CA/Browser Forum a number of times but CAs have voted against it every time, or simply ignored.

There are some people who are working towards addressing these gaps, for example, the folks over at CertSimple but without CAs taking a leading role in redefining the EV certificate the whole body of issues can nott be resolved. Importantly until that happens you won’t see browsers even considering updates to the EV UI.

Given this reality, browsers have slowly been minimizing the details shown in EV certificates since they can give users the wrong impression and have limited value given the contents of the certificate.

It is my belief that unless the CAs work together to address the above systematic issues in EV certificates that minimization will continue and when the web is “encrypted” it won’t just be DV that loses its positive trust indicator, it will be EV also.

 

Let’s talk about revocation checking, let’s talk about you and me.

I have been having a conversation with Melhi at Comodo, this is the most recent post in that series.

First of all, the author is unaware that my company has built the most sophisticated Certificate Management system that can automatically request, issue, renew and manage the whole lifecycle of the certificate. Many Fortune 500 companies rely on this amazing technology to manage their PKI infrastructure.

I am aware.

So it is with that expertise and insight I must insist that the author does not appreciate the nuance between “high frequency” renewal vs “low frequency” renewals. Short lived certs will require “high frequency” renewal system. To an IT admin this is a scary prospect! They tell us that!

Having been responsible for and or worked heavly on:

  • The Valicert OCSP responder and clients,
  • The Windows CA and the enrollment client at Microsoft, which is the most used CA software ever,
  • The Network Access Protection solution that used IPSEC and ~12-hour certificates to do segmentation of hosts at the IP layer and their health. based network isolation solution that was deployed into many of the Fortune 500,
  • All of GlobalSign’s technology offerings, in particular, the technology enabling their expansion into the Enterprise,
  • Helping the ISRG build out Let’s Encrypt,
  • Designing, building and operating numerous other high volume products and services in finance, healthcare, and government.

Above and beyond that I helped secure Bing Ads and Live ID and now work at Google on other very high scale systems.

Needless to say, I too have familiarity with “high frequency”,  “low frequency”, “disaster recovery” and “availability” problems.

As for fear, it is a natural part of life, I believe it was Nelson Mandela who famously said: Courage is not the absence of fear but the triumph over it.

The reality is that it is manual certificate lifecycle management that is the thing to be afraid of. That is why COMODO and other CAs have been building out cloud-based management and automation solutions over the last few years. For customers this reduces failure, reduces costs, gives more control, and nets higher customer satisfaction.

Importantly, one needs to remember it is manual processes that lead to the majority of outages [see 1,2,3, and 4]. The scale of this problem has even led to an entire market segment dedicated to managing the lifecycle of WebPKI certificates.

So to the question of fear, I would say the same thing my father told me, though change can be scary and uncomfortable in the short term, it usually turns out okay in the end, and often better.

I believe there can be a scalable revocation infrastructure that can serve status for all certs from all CAs that is backward compatible with existing issued certificates that can be called from a browser. As I said before I do believe in the ingenuity of our scientist and engineers to bring us this solution……soon….

You misunderstand me. I did not say it was impossible to have a scalable revocation infrastructure that is backward compatible. I said the creation of a new system that could deliver on the small message size promise we were discussing would take 10 years to design, build and deploy.

To understand my rational take a look at this post. It is a little old now, and update rates are a faster now but not massively. When you review it you will see that if you exclude IE/Edge it takes just under a year for a new version of the most common browsers to reach 90% market share.

To put that in a more concrete way, if today, Chrome, Firefox, and Safari decided to simultaneously release a new version with a new revocation scheme it would take a bit less than a year for us to see 90% deployment. That is, of course, an unrealistic goal. Apple, for example only releases new versions a few times a year and does not even support WebRTC yet. Additionally, browsers have a pretty deep-seated position on revocation checking at this point given all the problems of the past so convince them will take time.

A more realistic, but still optimistic period assuming pre-existing consensus and a will to solve the problem is 5 to 6 years. If you question this figure, just look at how long it took for TLS 1.2 to get deployed. TLS 1.2 was published in 2008 and was not enabled by default in Windows until Windows 8.1 in 2014.

Web servers are even worse, as an example consider Apache version 2 was released in 2002 and these later versions still have less than 50% deployment.

Lets call it DCSP 😉

DCSP isn’t a bad idea, but it has its own challenges. For example, some that come at the top of mind are:

  • Most browsers do not ship their own DNS clients, this is one of the problems DNSSEC had in deployment. If the operating system DNS APIs they use do not provide the information they need then they can not adopt any technology depend on it.
  • Middleboxes and captive networks make DNS-based distribution problematic. Again DNSSEC suffered from this.
  • If DCSP requires a custom record type, for example, DNS servers and tooling will need to be updated. DNS servers also do not get updated regularly, this is another thing that has held back DNSSEC as well as CAA. It is fair to say that it is fair for the WebPKI CAs to update their DNS but based on the glacial pace in which CAs adopt technology that is 3-4 years before you could see deployment even with that (see CT as an example).

To be clear, I think something like this is worth exploring but I don’t think it will see meaningful deployment in the near term.

If we were in Vegas, I would say the most expedient path is to build on what is there, while in parallel building the new thing. This can shave as much as 5 years off the time it takes to solve the problem. If nothing else this moves the CA ecosystem closer to an operational maturity capable of supporting the new system.

Ryan

My response, to his response, to my response? or short-lived certificates part 3

The conversation on short-lived certificates and their value continuesIn the most recent conversation, we have started to shift from an either or position to one where we explore what is needed to make revocation checking a viable technology, this is a topic I am passionate about.

That said, there still seems to be some confusion on short-lived certificates, specifically the author states:

Of course, the cost of short-lived certs is very high as change the whole computing infrastructure so that certificates are renewed on a daily basis (daily for it to be secure enough vs 90 day certificates in my view) and introduce this new moving part that might cause vulnerability and operational issues.

Let’s explore this “cost” argument for a moment. First, when I issue an end-entity certificate I minimally have to perform two cryptographic signatures, one on the certificate and one on the OCSP response (it’s actually more than this in some cases but to keep the conversation simple I have omitted the others).

If we look at the performance optimization that Firefox has implemented where they do not require revocation checking when the certificate is within a subset the resolution of the revocation period short-lived certificates can, in fact, reduce the cost for a CA. This is because you no longer need to sign two things during the first few days/hours of the life of the certificate and you do not need to distribute that response.

The only way I think short lived certificates are more expensive for the CA is if you compare a model of certificates issued for a period better measured in hours to a model with certificate validity measured in years providing weekly revocation updates. This, however, is a bad model, and something in-between is needed.

The author also believes the use of automation represents a security vulnerability, so this deserves a response. It is true that complexity is the enemy of security, it is even true that automation can if poorly implemented can hurt security. The inverse is also true, however.  It is also generally accepted that availability is a component of security and one of the more common problems in the WebPKI is poor manual management practices resulting in the lack of understanding what is deployed and those certificates that are deployed expiring [see 1,2,3, and 4] and taking down services.

It is also important to look at the big picture when evaluating the net-security benefits of automation, for example, does anyone honestly believe we would ever get to a world where the majority of the web is encrypted if organizations have to staff people to manually generate certificate requests and hand carry them to CAs? Is the net-benefit of automation of reliability and scope of deployment worth its secondary effects?

The author also suggests that a certificate must have a validity period of only a day to be “secure enough”. This seems both arbitrary and wrong, as stated previously the User Agents and WebTrust allows an OCSP response or CRL can be a week old and still be trusted.

One of the largest reasons for this is that clock skew is a big problem in the real world and as a result, you need to keep validity periods of certificates and revocation messages outside this skew period to prevent skew related failures.

The decision to define “secure enough” at a day, both defines the problem in an intractable way and furthermore ignores the fact that it establishes a double standard that does nothing to address the issue if stale revocation information.

If we were to bring this conversation back to how we improve certificate revocation I would say there should be one standard for how recent the client’s understanding of the certificate’s validity would need to be.

On that topic, the author goes on to discuss how 32 bytes is better than 470 (the size of the smallest OCSP response). I could not agree more about this, in fact in the 90s’ when I was at Valicert we implemented a proposal from Paul Kocher called Certificate Revocation Trees. This approach uses of Merkle Trees (the heart of the Bitcoin ledger) to provide a very space efficient solution to this problem. Unfortunately, we were unable to popularize this at the time.

Ben Laurie began work on a variation on this approach that leveraged sparse Merkle Trees that he called Revocation Transparency. I personally like the idea of this approach because it leverages the work done to make Certificate Transparency scalable. For example, Trillian, the foundational server for Google’s next generation log server is designed to scale to Trillions of certificates.

That said, there are a number of similar approaches that could be equally scalable.

While I do think that an approach similar to the above could be made to work today, I also think it is more of a long-term solution in that even with the significantly increased rate of technological adoption it would take close to ten years given the state of things for such a solution to be fully deployed if we started right now.

As such I would start with the problem definition, which would need to involve a more formal analysis of the role of revocation checking today so that the right solution was built.

In parallel I would want to see the industry adopt a more strategic plan to address the more practical and immediately solvable problems, including:

  • Measuring and improving the revocation infrastructure operated by CAs,
  • Establishing global performance and reliability metrics and reporting that all CAs must meet,
  • Funding improvements to Nginx and Apache’s OCSP Stapling implementations,
  • Working with browsers to adopt the performance optimization firefox has implemented for revocation checking,
  • Working with TLS stacks, User Agents, Servers and Service Providers to adopt OCSP Must-Staple,
  • Defining an OCSP transport based on DNS that would reduce dependency on CA infrastructure reliability,
  • Evangalizing the adoption of OCSP stapling with administrators.

Ryan

P.S.

The author also has also added in someone else who has asked some questions or more correctly seems to question my version of the historical narrative. To provide some context, my narrative comes from my practical experience working with Microsoft, eBay, Amazon, and other large companies in the mid to late 90s and through the mid-2000s.

I too have worked with the BBN Safekeeper, I have a fun story how we hired some people to extract the keys from one of these boxes I would be happy to share over a beer sometime.

However, a cool device, the first one I remember working with was the KOV-8 in the 1993/4 timeframe.

Anyway, it is true that SSL started its life in 1994/5 at which point only software implementations of crypto were used (they were all BSAFE) but it is also true that mass deployment of SSL (relatively speaking) did not start until the late 90s and early 00s and that is the time my narrative was based on.

He also has also questioned the narrative of what Windows supported in the context of key protection. Since the author knows me personally he must have simply forgotten that I was the PM for these technologies and was at Microsoft working in this area for about 15 years.

Again I think there is some confusion here, the author states:

The software based Cryptographic Service Provider for RSA allowed keys to be marked ‘not for export’ from a very early release if not the first.

and:

The CAPI features used to protect private keys were expanded and exposed as a separate API in Windows 2000 as the Data Protection API.

As someone who worked at Microsoft on these technologies for a long time I can say with absolute confidence they were not built to provide key isolation, do not provide key isolation properties and were actually not used by the SSL implementation (SCHANNEL) for the server keys. If you’re interested in learning more about the capabilities of Windows in this area check out this post I did recently.

He has also questioned the role of ValiCert in the definition of the RFC, thankfully the IETF PKIX archives are still there and if you care to look you can see Mike was basically checked out, Warwick was not publicly paticipating and the work to finalize the protocol was largely done by Ambarish Malpani the founder of Valicert.

It is my turn, or short-lived certificates part 2

My response to a recent post suggesting short-lived certificates were intended to remove the need to do revocation seems to have spurred a response from its author.

The thesis of the latest response is that:

Revocation can easily handle “key compromise” situation and do so by offering more security than short-lived certs.

The largest problem with this thesis is that it is based on an incorrect understanding. That being the author believes short-lived certificates do not get checked for revocation, or they believe it is a choice between short-lived certificates and revocation.

Both statements are not true. User Agents do not say “oh, this certification is short-lived so we won’t do revocation checking”. If User Agents do revocation checking of end-entity certificates, and not all do, the short-lived certificate will get checked just like the long-lived certificate.

Now there is a reasonable argument to be made that if CAs are allowed to produce revocation lists and OCSP responses every 7 days, and User agents will trust them for that time, that it wouldn’t be an unreasonable performance optimization. At least if the behavior was limited to certificates with validity periods shorter than 7 days. With that said, User Agents don’t do this today and I didn’t suggest they should in my post.

[Update 10:43PM May 4th 2017 It seems a changed happened while I was asleep at the wheel. Firefox has implemented the below optimization as of Firefox 41. A corrected statement follows in the next  two paragraphs]

Both statements are not true, well except for one performance optimization implemented by Firefox I will explain shortly. Basically, if User Agents do revocation checking of end-entity certificates, and not all do, the short-lived certificate will get checked just like the long-lived certificate as long as it is younger than the corresponding CRL or OCSP response would be (Firefox only).

This performance optimization Firefox implemented is based on the fact that  CAs are allowed to rely on 7-day old OCSP responses and CRLS. As a result, 7 days becomes the precision of revocation checking, It is not clear what value Firefox chose but it is a subset of that figure, not 90 days. But either way, no major CA that I am aware of issues such short-lived certs today due to time skew issues.

The second problem with this thesis is it presumes the user who does know of a compromise wants to announce it to the world and can. As I mentioned in an earlier post on revocation reasons subscribers are not keen to announce to the world that there was “a compromise”. I mention “can” because in some cases the one who knows about the comromise do not even have sufficient permissions to the associated CA consoles to request the revocation.

And finally, the last issue in this thesis is that it presumes an effectiveness of revocation checking. Today over 9% of OCSP responses fail due to issues with the CA’s revocation infrastructure (the connections time out).

That’s right 9 out of 100 revocation checks fail because CAs fail to operate capable enough infrastructure to meet the needs of the clients that rely on them. It is actually worse than that though, the largest websites use a technique called Domain Sharding to make their sites load faster, this means that the failure rate you would experience as a user if hard-fail was implemented could be 2-4x higher than that.

This is before we consider the fact that due to the poor performance and failures in CA revocation infrastructure revocation checking has been largely turned off in Chrome and other browsers.

I say this because for “revocation checking” to work for a key compromise case you need two things:

  • The CA to know the compromise occurred,
  • Revocation checking to actually work.

Now I want to be clear, I think revocation checking is a good thing and I would like to see the situation improved, as a proof of that statement here is an example of some of my work in this area:

  • While at ValiCert (the lead creators of OCSP) I worked on the standardization of RFC 2560 (OCSP) including running the interoperability testing that led to its ultimate standardization,
  • I also am the author of RFC 5019 which is the profile of OCSP in use by CAs and clients,
  • I have led the development of numerous PKI SDKs and servers which have implemented these standards (including leading the team that added support for OCSP to Windows),
  • I led the implementation of the first, and most reliable implementation of OCSP stapling (in SCHANNEL),
  • I led the CASC project to get OCSP stapling added Nginx,
  • I have helped create, and/or recreate numerous CAs and along with their revocation infrastructure,
  • And I led the first wide scale measurement and efforts of the CA revocation infrastructure via my X509LABS Revocation Report project which led to all CAs adopting the same design philosophies to begin to address abhorrent response times and uptime.

Basically, I see the value in revocation checking and think the investments need to be made to make it work and be relevant for the WebPKI of today. That said, this topic has zero to do with short-lived certificates.

Ryan

How to keep a secret in Windows

Protecting cryptographic keys is always a balancing act. For the keys to be useful they need to be readily accessible, recoverable and their use needs to be sufficiently performant so their use does not slow your application down. On the other hand, the more accessible and recoverable they are the less secure the keys are.

In Windows, we tried to build a number of systems to help with these problems, the most basic was the Window Data Protection API (DPAPI). It was our answer to the question: “What secret do I use to encrypt a secret”. It can be thought of as a policy or DRM system since as a practical matter it is largely a speed bump for privileged users who want access to the data it protects.

Over the years there have been many tools that leverage the user’s permissions to decrypt DPAPI protected data, one of the most recent was DPAPIPick. Even though I have framed this problem in the context of Windows, Here is a neat paper on this broad problem called “Playing hide and seek with stored keys”.

The next level of key protection offered by Windows is a policy mechanism called “non-exportable keys” this is primarily a consumer of DPAPI. Basically, when you generate the key you ask Windows to deny its export, as a result the key gets a flag set on it that can not, via the API, be changed. The key and this flag are then protected with DPAPI. Even though this is just a policy enforced with a DRM-like system it does serve its purpose, reducing the casual copying of keys.

Again over the years, there have been numerous tools that have leveraged the user’s permissions to access these keys, one of the more recent I remember was called Jailbreak (https://github.com/iSECPartners/jailbreak-Windows). There have also been a lot of wonderful walkthroughs of how these systems work, for example, this nice NCC Group presentation.

The problem with all of the above mechanisms is that they are largely designed to protect keys from their rightful user. In other words, even when these systems work they key usually ends up being loaded into memory in the clear where it is accessible to the user and their applications. This is important to understand since the large majority of applications that use cryptography in Windows do so in the context of the user.

A better solution to protecting keys from the user is putting them behind protocol specific APIs that “remote” the operation to a process in another user space. We would call this process isolation and the best example of this in Windows is SCHANNEL.

SCHANNEL is the TLS implementation in Windows, prior to Windows 2003 the keys used by SCHANNEL were loaded into the memory of the application calling it. In 2003 we moved the cryptographic operations into Local Security Authority Subsystem Service (LSAS) which is essentially RING 0 in Windows.

By moving the keys to this process we help protect, but don’t prevent, them from user mode processes but still enable applications to do TLS sessions. This comes at an expense, you now need to marshal data to and from user mode and LSAS which hurts performance.

[Nasko, a former SCHANNEL developer tells me he believes it was the syncronous nature of SSPI that hurt the perf the most, this is likely, but the net result is the same]

In fact, this change was cited as one of the major reasons IIS was so much slower than Apache for “real workloads” in Windows Server 2003. It is worth noting those of us involved in the decision to make this change surely felt vindicated when Heartbleed occurred.

This solution is not perfect either, again if you are in Ring 0 you can still access the cryptographic keys.

When you want to address this risk you would then remote the cryptographic operation to a dedicated system managed by a set of users that do not include the user. This could be TCP/IP remoted cryptographic service (like Microsoft KeyVault, or Google Cloud Key Manager) or maybe a Hardware Security Module (HSM) or smart card. This has all of the performance problems of basic process isolation but worse because the transition from the user mode to the protected service is even “further” or “bandwidth” constrained (smart cards often run at 115k BPS or slower).

In Windows, for TLS, this is accomplished through providers to CNG, CryptoAPI CSPs, and Smartcard minidrivers. These solutions are usually closed source and some of the security issues I have seen in them over the years are appalling but despite their failings, there is a lot of value in getting keys out of the user space and this is the most effective way of doing that.

These devices also provide, to varying degrees, protection from physical, timing and other more advanced attacks.

Well, that is my summary of the core key protection schemes available in Windows. Most operating systems have similar mechanisms, Windows just has superior documentation and has answers to each of these problems in “one logical place” vs from many different libraries from different authors.

Ryan

Why bother with short-lived certificates and keys in TLS?

There seems to be a lot of confusion and misinformation about the idea of short-lived certificates and keys so I thought I would pen some thoughts about the topic in the hope of providing some clarification.

I have seen some argue the rationale behind short-lived certificates is to address the shortcoming in the CA and browser revocation infrastructure, I would argue this is not the case at all.

In my mind, the main reason for them is to address the issue of key compromise. Long-lived keys have a long period in which they are exposed to theft (see Heartbleed) and therefor are a higher value to an attacker since a stolen key enables the attacker to impersonate the associated website for that period.

The most important thing to keep in mind is that the nature of key compromise is such that you almost never know it has happened until it is too late (consider Diginotar as an extreme example).

The importance of protecting the SSL private key is why in the 90s when SSL deployment started in earnest, large companies tried to deploy SSL in their environment using SSL “accelerators” and “security modules” that protected the keys. This was the “right technical” thing to do not not the right “practical thing” because it significantly reduced the scalability of SSL protected services and at the same time massively increased the cost to deploy SSL.

By the mid 00’s we recognized this was not workable and the use of accelerators basically stopped outside a few edge cases and software keys were used instead. Some implementations, like the Microsoft SCHANNEL implementation, tried to protect the keys by moving them to a separate process mitigating the risk of theft to some degree. Others simply loaded the keys into the web servers process and as a result exposed them to Heartbleed like attacks.

What’s important here is that when this shift happened no one pushed meaningful changes to the maximum validity period of certificates and by association their private keys. This meant that we had software keys exposed to theft for five years (max validity period at the time) with no reliable way to detect they were stolen and in the unlikely event we did find out of the theft we would rely on the unreliable CA revocation infrastructure to communicate that issue to User Agents — clearly this was not an ideal plan.

In short, by significantly reducing the validity of the certificate and keys, we also significantly reduce the value to the attacker.

Another issue that short-lived certificates help mitigate is the evolution of the WebPKI, with a long-lived certificate you get virtually no security benefits of policy and technology improvements until the old certificates and keys are expunged from the ecosystem. Today this is rollout of new policy largely accomplished via “natural expiration” which means you have to wait until the last certificate that was issued under the old policies expires before absolute enforcement is possible.

So what is the ideal certificate validity period then? I don’t think there is a one-size-fits-all answer to that question. The best I can offer is:

As short as possible but no shorter.

With some systems it is not possible to deploy automation and until it is one needs to pick a validity period that is short enough to mitigate the key compromise and policy risks I discussed while long enough to make management practical.

It is probably easier to answer the question what is the shortest validity period we can reliably use. The answer to that question is buried in the clock skew of the relying parties. Chrome recently released a new clock synchronization feature that significantly reduces errors related certificate validity periods. But until that is fully deployed and other UAs adopt similar solutions you are probably best to keep certificate validity periods at 30 days to accommodate skew and potential renewal failures.

In-short, long-lived keys for SSL exist because we never re-visited the threat model of key compromise when we stopped using hardware-protected keys in our SSL deployments and short-lived keys help deal with the modern reality of SSL deployments.

Ryan

– I have added a few pictures for fun and made some minor text changes for clarity.