Tag Archives: Best Practices

Automating Non-Human Identities: The Future of Production Key Management 

Historically, key management was seen as activities involving hardware security modules (HSMs), manual tasks, and audits. This approach was part of what we termed ‘responsible key management.’ However, HSMs were impractical for many use cases, and these manual tasks, typical of IT processes, were often poorly executed or never completed, frequently causing security incidents, outages, and unexpected work.

Simultaneously, as an industry, we began applying cryptography to nearly all communications and as a way to protect data at rest. This led to the adoption of cryptography as the method for authenticating hardware, machines, and workloads to apply access control to their activities. As a result, today, cryptography has become a fundamental component of every enterprise solution we depend on. This shift led us to attempt to apply legacy key management approaches at the enterprise scale. The increased use of cryptography within enterprises made it clear these legacy approaches ignored the majority of keys we relied on, so we took a tactical approach and created repositories to manage the sprawl of secrets. While a step forward, this approach also papered over the real problems with how we use, track, and manage keys.

It is time for us as an industry to start viewing cryptography and key management not just as a tax we must pay but as an investment. We need to manage these keys in an automated and scalable way that helps us manage risk in our businesses.

To do this, we need to start with a question: What are these keys, anyway? Broadly, I think of three categories of keys: long-lived asymmetric secrets like those associated with certificate authorities, long-lived shared secrets used for encryption and authentication, and modern-day asymmetric key credentials for users, devices, and workloads. The greatest growth in keys has been in the last category, so let’s focus on that for the purpose of this discussion.

Modern Credentials and Their Management

Modern-day asymmetric key-based credentials are not always “certificates,” but they generally bind some claim(s) to an asymmetric key pair. These certificates can be formatted as JSON, ASN.1, CBOR, TLVs, X.509, JWT, or some other encoding. They serve various purposes:

  1. User Certificates:  Issued to individual users to authenticate their identity within an organization, these certificates provide secure access to corporate resources, such as an SSH certificate used by developers to access production. They bind a user’s identity to a cryptographic key pair, ensuring only authorized individuals access sensitive information and systems.
  2. Hardware Certificates: Assigned by manufacturers during production, these certificates uniquely identify hardware devices. They are often used to bootstrap the identity of machines or workloads, ensuring only authorized devices can access resources on your network.
  3. Machine Certificates: Common in operational IT environments, these certificates authenticate servers associated with domains, IP addresses, or device identifiers. They are typically used with TLS and for network access use cases like 802.1x, IKE, and various VPNs.
  4. Workload Certificates: In cloud and serverless environments, workload certificates perform access control close to the business logic to minimize security exposure and deliver on zero trust goals. These dynamic certificates often reflect both the underlying hardware and the workload running on it, acting like multi-factor authentication for devices. The frequent need to re-credential workloads makes issuing credentials mission-critical, as failure to do so can cause outages. This necessitates issuers in each failure domain (think of this as a cluster of clusters) hosting these workloads to ensure continuous operation.

What we can take from this is that we have been approaching credentials incorrectly by treating them as a key management problem. This approach is akin to using password managers for hardware, machines, and workloads, whereas, for users, we have moved toward multi-factor authentication and non-password-based authenticators.

Towards Automated and Scalable Key Management

If password managers or key vaults are not the right solution for machine authentication, what is? The answer is simpler than it might seem. Just as with users, these cases require built-for-purpose Identity Providers (IDPs). This is especially true for workloads, which dynamically spin up and down, making durable identifiers impractical. An IDP becomes a security domain for a given deployment, ensuring that workloads are accessible only by appropriate resources. This setup limits attackers’ lateral movement, allows for nearly instant granting and removal of access, minimizes the impact of compromises, and enables easy federation between deployments—all while providing a central point for identity governance and ensuring the cryptographic keys associated with credentials are well-managed and protected.

Getting Started

Modernizing key management starts with measurement. Identify the most common types of keys in your secret vaults, typically workload-related credentials. Deploy a workload-specific IDP, such as those enabled via SPIFFE, to transition these credentials out of the secret manager. Over time, the secret manager will store static secrets like API keys for legacy systems, while dynamic credentials are managed appropriately.

Prevent using your secret manager as an IDP from the start, especially for new systems. Teams responsible for the operational burden of these systems usually support this change, as automated end-to-end credentialing of workloads is more agile, scalable, and secure. This results in fewer outages and security incidents related to secret managers and non-production quality dependencies.

From this point, the process becomes a cycle of identifying where static secrets or long-lived credentials are funneled through your secret manager and moving them to built-for-purpose credential lifecycle management solutions.

Multi-factor authentication for workloads

Adopting a purpose-built IDP workload solution is a good start, but keys can still be stolen or leaked. For machines and workloads, use hardware attestations. Built-in hardware authenticators, such as Trusted Platform Modules (TPMs), create and secure keys within the device, ensuring they never leave. TPMs also verify device integrity during boot-up, adding an extra layer of security. This combination provides stronger multi-factor authentication without the usability issues associated with similar patterns for user authentication.

Avoiding Common Mistakes

The most common mistake organizations make is applying existing systems to workload credential management problems without fully analyzing operational, scale, uptime, and security needs. For example, enterprise PKI teams might mandate using their existing CA infrastructure for managing workload credentials, assuming a certificate suffices. However, this often violates the principle of least privilege and struggles with dynamic identification required for workloads.

Existing credential issuance systems are designed for static, long-lived subjects, such as directory names, DNS names, or IP addresses, which don’t change frequently. In contrast, workload credentials may change every few minutes. Provisioning devices like network appliances before assigning durable identifiers adds to this challenge. New workload-based systems, like SPIFFE, assign identifiers based on runtime elements, preventing the same bad practices that led to secret sprawl and mismanaged key problems.

Reducing Reliance on Shared Secrets 

Moving away from shared secrets won’t eliminate the need for secret vaults but will significantly reduce the problem’s scope. As systems modernize, password-based authenticators will be updated or deprecated. Over time, we will see fewer shared, long-lived secrets used for workload identity, driven by zero trust and least privilege principles.

At the same time, we can do much to improve overall key management practices in production systems. However, that’s a topic for another post.

Closing Note

The challenges and opportunities in modern key management are significant, but by leveraging innovative solutions and focusing on automation and scalability, we can make substantial progress. As we adopt built-for-purpose Identity Providers (IDPs) and hardware attestations, it’s important to have the right tools and frameworks in place to succeed.

I  have been working with SPIRL, a company focused on making the right thing the easy thing for developers, operations, and compliance. I see firsthand how the right platform investments can simplify the creation and operation of production systems. SPIRL achieves this by authoring and adopting open standards for identity and building reliable, scalable infrastructure that provides greater visibility and control.

Even if you don’t use SPIRL, by focusing on these principles, organizations can better manage the complexities of modern workload-related key and credential management, ensuring greater productivity and security.

Navigating Content Authentication In the Age of Generative AI

In 1995, SSL was introduced, and it took 21 years for 40% of web traffic to become encrypted. This rate changed dramatically in 2016 with Let’s Encrypt and the adoption of ACME, leading to an exponential increase in TLS usage. In the next 8 years, adoption nearly reached 100% of web traffic. Two main factors contributed to this shift: first, a heightened awareness of security risks due to high-profile data breaches and government surveillance, creating a demand for better security. Second, ACME made obtaining and maintaining TLS certificates much easier.

Similarly, around 2020, the SolarWinds incident highlighted the issue of software supply chain security. This, among other factors, led to an increase in the adoption of code signing technologies, an approach that has been in use at least since 1995 when Microsoft used this approach to help deal with the problem of authenticity as we shifted away from CDs and floppy disks to network-based distributions of software. However, the complexity and cost of using code signing severely limited its widespread use, and where it was used, thanks to poor tooling, key compromises often led to a failure for most deployments to achieve the promised security properties. Decades later, projects like Binary Transparency started popping up and, thanks to the SolarWinds incident, projects that spun out of that like Go ChecksumDB, SigStore, and SigSum projects led to more usage of code signing.

Though the EU’s digital signature laws in 1999 specified a strong preference for cryptographic-based document signing technologies, their adoption was very limited, in part due to the difficulty of using the associated solutions. In the US, the lack of a mandate for cryptographic signatures also resulted in an even more limited adoption of this more secure approach to signing documents and instead relied on font-based signatures. However, during the COVID-19 pandemic, things started changing; in particular, most states adopted remote online notary laws, mandating the use of cryptographic signatures which quickly accelerated the adoption of this capability.

The next shift in this story started around 2022 when generative AI began to take off like no other technology in my lifetime. This resulted in a rush to create tools to detect this generated content but, as I mentioned in previous posts [1,2], this is at best an arms race and more practically intractable on a moderate to long-term timeline.

So, where does this take us? If we take a step back, what we see is that societally we are now seeing an increased awareness of the need to authenticate digital artifacts’ integrity and origin, just like we saw with the need for encryption a decade ago. In part, this is why we already see content authentication initiatives and discussions, geared for different artifact types like documents, pictures, videos, code, web applications, and others. What is not talked about much is that each of these use cases often involves solving the same core problems, such as:

  • Verifying entitlement to acquire the keys and credentials to be used to prove integrity and origin.
  • Managing the logical and physical security of the keys and associated credentials.
  • Managing the lifecycle of the keys and credentials.
  • Enabling the sharing of credentials and keys across the teams that are responsible for the objects in question.
  • Making the usage of these keys and credentials usable by machines and integrating naturally into existing workflows.

This problem domain is particularly timely in that the rapid growth of generative AI has raised the question for the common technology user — How can I tell if this is real or not? The answer, unfortunately, will not be in detecting the fakes, because of generative AIs ability to create content that is indistinguishable from human-generated work, rather, it will become evident that organizations will need to adopt practices, across all modalities of content, to not only sign these objects but also make verifying them easy so these questions can be answered by everyday users.

This is likely to be accelerated once the ongoing shifts take place in the context of software and service liability for meeting security basics. All of this seems to suggest we will see broader adoption of these content authentication techniques over the next decade if the right tools and services are developed to make adoption, usage, and management easy.

While no crystal ball can tell us for sure what the progression will look like, it seems not only plausible but necessary in this increasingly digital world where the lines between real and synthetic content continue to blur that this will be the case.

Update: Just saw this while checking out my feed on X and it seems quite timely 🙂

Tenement Farming and Cloud HSMs

While it’s fair to say that using a Cloud HSM means your keys are protected by a device meeting FIPS 140-3 standards, assuming the HSM in use has this certification, it’s important to realize this doesn’t guarantee the security you might expect. The security model of HSMs was built for the threats of the 1980s. These devices were not network-connected and were single-tenant — if they were “online” it was usually via HSMs attached to physical computers running an application on a machine connected to private networks — not connected to a globally reachable endpoint.

At their core, these devices were designed to protect keys from physical theft, more precisely, to slow down and increase the cost of theft, much like safe ratings (UL TL-15, TL-30, TL30x6) indicate for how effective the associated safes are. For example, early in my career, I worked on a project where we built attacks to extract non-exportable keys from a specific HSM and then imported them into another vendor’s HSM because the prior vendor went out of business. There have also been a number of key exfiltration bugs in these devices over the years as well.

We didn’t see network-connected HSMs until around 1999, but even then, these devices were single-tenant, essentially just a network-connected Linux or BSD box containing fundamentally the same hardware as years earlier. While this change did allow a single company to share an HSM across different application workloads, the assumption was still that this HSM was managed by the company in charge of all of these applications.

Why is this important today? Most computing is now done in shared cloud infrastructure, administered by someone else, with your competitor or an attacker on the same hardware as you. This presents a very different set of security considerations and design constraints than these devices were originally built. You are now exposed to the risks of the physical and logical administrators of these Cloud HSMs, the services they are dependent on, as well as other tenants of the Cloud HSM.

Consider that the compute operator usually can technically access the handle the application uses to talk to the HSM, and likely the secret used to authenticate to this HSM as well, meaning they, or an attacker, could potentially use that handle, or secret,  to sign, or decrypt data as they wish. You might find that an acceptable risk, but did you know some HSMs allow the administrator to blindly add users as operators to the “virtual HSMs” within it? Yup, they do.

What about when keys are stored in a KMS and the key policy dictates the key be managed by an HSM? If the HSM hardware attests that the key is stored in the HSM, and this attestation is verified, it’s nearly the same threat profile we just discussed. In some cases, it could be argued it is better because access to the HSM can have traditional user and service RBAC controls, and rate limiting, and keys can be replicated to many other HSMs without any administrative burden for you, keeping you safe from a common disaster recovery scenario while normalizing the management of these devices so it fits into your normal operational practices which hopefully are well managed and monitored.

Regardless of the approach, the bigger question is whether your provider’s operational and security practices are up to your specific threat model. Imagine a Bitcoin wallet worth 100 million dollars. Has your cloud provider proportionally invested enough into controls and tests around their system to prevent a motivated attacker from using your key to sign a transaction that moves all that to another wallet? Probably not.

The fundamental issue is that today’s HSMs were largely designed for different eras with different security concerns than we typically have today, mainly to protect against physical theft of keys in environments where data centers were effectively closets in dedicated office space. That doesn’t reflect today’s cloud computing scale.

It is worth noting that there are a few HSM solutions on the market that are making efforts to tackle some of these issues, but they still fall short but that is a topic for another post.

In essence, Cloud HSMs are to HSMs what Tenement Farming is to Farming.

That’s not to say there’s no value in these offerings, but as built today, they often fail to deliver the value they are assumed to deliver. And if regulations mandated their use before, say, 2010, chances are they’re not delivering the intended value that those regulations had in mind.

So, how should we be protecting keys now?

To be clear, this is not a case against Cloud HSMs, it is an argument to think about the threat model and use case you are solving for — for example if we look at Storm-0558 where Microsoft appears to have been using the private key material in the process of their IDP, the attacker was able to get a memory dump to be created and then via another attack vector gather the memory dump, and as a result the private key, we can take away at least one solid lesson. Do not load keys into the process of the applications that rely on them. In this case, the least costly method to have prevented this key theft would be simply moving the key to another process running in another user context with a very simple API that is easy to defend and can at least limit the attacker to a handle vs. what happened in this case where the attacker was able to use the key with impunity for years. This approach is the rough equivalent of a workload or node-specific software HSM similar in spirit to the original HSMs.

Another common problem we see in the industry is, that solutions like Hashicorp vault were designed to centralize key management and provide a one-size-fits-all answer to “Where do I keep my secrets?” Architecturally these solutions look much like a passively encrypted database, if you have sufficient permissions you can read the key in the clear and then copy it to whatever node or workload needs to use the key. This took us from secret sprawl to secret spray where we pushed the keys out in environment variables and files on production machines that later get dumped into logs, and backups, continuously exposing the keys to users who should have never had access, and often leaving key remnants all over the place. This is only marginally better than checking keys into dedicated source control repositories.

The problem here isn’t limited to these secret sprawl solutions, considering that almost every web-server TLS private key is sitting in the file system often with weak ACLs without any encryption which is then loaded into memory on that web server in the process. Similarly, most SSH keys are also sitting in some file, usually with a poor ACL, with a key either in the clear or with an easily grindable password so a malicious actor that gains read access to the file system is all that is needed to walk away with the key, for example, see this incident from last week

In both of these cases, we would be much better off if we would move these keys into another user context that is more defensible and constrained.

So how did we end here with such abysmal practices for managing keys?

While there is seldom one single reason to see such neglect, in this case, I think one of the largest is the dogmatic “all keys must be kept in HSMs or smart cards”. It is just too easy of a get-out-of-jail-free card for a security professional. Instead of thinking about the real risks and operational practices and then designing strategies to mitigate those threats that are practical and appropriate people who can afford to complete that checkbox do and those who can not just copy keys around in the clear out of a database.

The reality is we can do a lot better but as they say, the first step is to accept that you have a problem.

In short, as security professionals we need to avoid dogmatic answers to complex questions and spend the time to look more critically at the risks, constraints, obligations, resources, and real-world scenarios those we work with are operating within before we throw generic playbook answers to those coming to us for advice.

Deploying SSL – Beyond the certificate and cipher suites

If you were to go do a search on the internet for “configuring SSL” you would find a ton of references on configuring your favorite web server to do SSL some of it good and some of it not so good. But what you don’t see a lot of content on is how to deploy it successfully.

What do I mean by successfully? These articles ignore the larger picture, for example:

  1. Are there changes to your content you will need to make?
  2. What about external content and script references?
  3. Are there any SEO considerations?
  4. Are there other related considerations?

To some these things may be common-sense but even for those a refresher never hurt so lets go over them again briefly.

 

Are there changes to your content you will need to make?

Probably, lots of content I encounter explicitly references a protocol serving it (aka href=”http://…” and src=”http://…”) and if that’s the way your content looks then yes you will want to update your content to use relative references, for example

href=”//{hostname}/{uri}”

src=”/{uri}”

This way your content is independent of what protocols are used to transport it, it will also help prevent your users from encountering “mixed content” warnings.

 

What about external content and script references?

Another scenario that causes mixed content warnings is when sites use of scripts and content hosted on other servers that is explicitly referenced over HTTP. The two most common I encounter are YouTube Embeds and Google Analytics but there are lots of different third-party content and scripts out there and each one you embed will also need to support SSL.

Thankfully I have never encountered one that does not support SSL and in most cases you will just need to make the reference relative (“//”) and let the browser decide what protocol to use to get the reference. In the very rare cases where this does not work a quick email to support at the content/script provider will get you the URL to the SSL version of the content/script.

Though this has always been the case one thing to keep in mind is that the perceived performance and actual security of your site is dependent on the performance and security of the providers you include in it. I strongly recommend you check their performance and SSL configuration and ask them to make any changes necessary to address issues this might identify.

 

Are there any SEO considerations?

Aren’t there always? So to achieve all of security benefits of SSL you have to deploy SSL across your entire site (this is commonly referred to as Always On SSL). This means that as far as a search engine is concerned there could be two copies of the same content. This is treated as a negative condition in most page ranking schemes, we address this in a few ways:

1. Tell the search engine which content is authoritative (aka which one we want them to index), we do this using:

    • Updating <link rel=”canonical”> to point to the HTTPS version.
    • Updating the XML Sitemap to refer to the HTTPS version of the content.

Making these two changes ensures the search engine will index the SSL version of the site so the first link the user visits will be your HTTPS version.

These things not only improve the users experience by making them get at the content quicker (instead of relying on a rewrite rule to get them to the HTTPS content) but also help to mitigate MITM attacks that would be possible for organic traffic based on your HTTP urls.

2. Ensure the robots.txt is available over SSL.

3. Redirect all HTTP requests to your site to the HTTPs version using a permanent redirect (a HTTP 301), this will transfer your PageRank to the SSL url.

4. Update the search engine webmaster tools to refer to the HTTPS url instead of the HTTP URL.

 

Are there other considerations?

There are a few, for one there is performance. There is a myth that SSL is computationally expensive, it’s simply not true (at least today) but that doesn’t mean you don’t need to be concerned with performance.

There are several settings you care about, for example it’s common for websites to use domain sharding means when you’re using SSL is each one of those requests represents a new SSL negotiation and the negotiation is the most costly part of the SSL session. While we can’t eliminate this cost we can ensure that the servers terminating our SSL sessions implement session caching and reuse to reduce the impact of the SSL overhead. We can also try to limit the number of domains we use when sharding so reduce the number of SSL sessions needed to finish rendering a site.

You may also want to look at deploying a forward proxy in front of your web servers where all SSL would be terminated; this can give you performance benefits beyond SSL and can simplify key and SSL management in your environment at the same time.

Then there is the question of cookies, while all sensitive cookies should already be marked “secure” so they won’t get sent over non-secure sessions you should consider marking all cookies as “secure” since the whole site is now supposed to be served over SSL.

Depending on how you have authored your rewrite rules there may be static references to HTTP buried in there, you will want to review your rewites to ensure they are protocol independent (where appropriate) so that you don’t end up forcing users through an unnecessary redirect.

And finally setting the HTTP Strict Transport Security header means browsers will visit you over HTTPS the every time, even if not from search results; this will improve relative perceived performance and help protect from MITM attacks.

 

Ryan

 

Resources

1. Choose the Right Certificate, CA Security

2. Deploying SSL – How to get your server configuration right, Ryan Hurst

3. SSL Configuration Checker, X509 Labs

4. SSL Pulse, Trustworthy Internet Movement

5. Bulletproof SSL/TLS and PKI, Ivan Ristic

6. High Performance Browser Networking, Ilya Grigorik

7. How to get the latest stable OpenSSL, Apache and Nginx, Ryan Hurst

8. Always On SSL, OTA

9. Revocation Report, X509 Labs

10. SSL/TLS Deployment Best Practices, Qualys Labs

11. Transport Layer Security, WikiPedia

12. How to botch TLS forward secrecy, AGL

Deploying SSL – How to get your server configuration right

They say the most complicated skill is to be simple; despite SSL and HTTPS having been around for a long time, they still are not as simple as they could be.

One of the reasons for this is that the security industry is constantly learning more about how to design and build secure systems; as a result, the protocols and software used to secure online services need to continuously evolve to keep up with the latest risks.

This situation creates a moving target for server administrators, creating a situation where this year’s “best practice” may not meet next year’s. So how is a web server administrator to keep up with the ever-changing SSL deployment best practices?

There is, of course, a ton of great resources on the web that you can use to follow industry trends and recent security research, but it’s often difficult to distill this information into actionable and interoperable SSL configuration choices.

To help manage this problem there are tools like the X509labs SSL Configuration Checker which look your server’s configuration and makes recommendations on what you should change to address current industry best practices. This tool makes recommendations that are based on current and past security research, trends, and both client and server behavior and capability.

The tool performs over 33 different tests on your server configuration and, based on the results, recommends specific changes you should make to address its findings.

In general, the guidance the tool provides can be categorized as follows:

 

Support latest versions of TLS protocol

Often organizations are slow to pick up newer versions of their web server and SSL implementations.  This is normally a conscious decision attributed to the old adages of “if it’s not broken don’t fix it.”

The problem is that these older versions are plagued with security issues. In many cases, these organizations pick up security patches, but these patches do not include the more recent (and more secure) versions of the protocols.

It is important that all sites add support for TLS 1.2 as this new version of the protocol offers security improvements over its predecessors and lays the groundwork for addressing future security concerns.

Disable older known insecure versions of the SSL protocol

SSL was defined in 1995 and has evolved significantly since then, SSL 2.0 in particular has been found to have a number of vulnerabilities. Thankfully these issues have been resolved in later version of the protocol.

Unfortunately at least 28% of sites today still support it (based on SSL-pulse data); when I speak to server administrators about why they enable this older version they commonly mention concerns over client interoperability. Thankfully browser statistics show us that TLS 1.0 support is ubiquitous and it is no longer necessary to support the older insecure version of the protocol.

 

Choose secure and modern cipher suites

This is one of the more confusing parts of configuring SSL; it’s also one of the most important. No matter how strong the cryptographic key material that goes into your certificate, the strength of your SSL is only as secure as the cryptography used to encrypt the session.

You don’t need to be a cryptographer or security researcher to make the right choices though, the X509Labs SSL configuration checker will help you keep on top of current recommendations. Based on current research, the following would be solid choices for you to go with:

Apache

SSLProtocol -ALL +SSLv3 +TLSv1 +TLSv1.1 +TLSv1.2;
SSLCipherSuite "EECDH+ECDSA+AESGCM EECDH+aRSA+AESGCM EECDH+ECDSA+SHA384 EECDH+ECDSA+SHA256 EECDH+aRSA+SHA384 EECDH+aRSA+SHA256 EECDH+aRSA+RC4 EECDH EDH+aRSA RC4 !aNULL !eNULL !LOW !3DES !MD5 !EXP !PSK !SRP !DSS"
SSLHonorCipherOrder on;

Nginx

ssl_protocols SSLv3 TLSv1 TLSv1.1 TLSv1.2;
ssl_ciphers "EECDH+ECDSA+AESGCM EECDH+aRSA+AESGCM EECDH+ECDSA+SHA384 EECDH+ECDSA+SHA256 EECDH+aRSA+SHA384 EECDH+aRSA+SHA256 EECDH+aRSA+RC4 EECDH EDH+aRSA RC4 !aNULL !eNULL !LOW !3DES !MD5 !EXP !PSK !SRP !DSS";
ssl_prefer_server_ciphers on;

These settings were chosen based on several factors including strength of the cryptography, interoperability and support for forward secrecy whenever it is supported by both the client and the server.

What is forward secrecy? You have forward secrecy when an attacker needs more than the encrypted traffic from your server and its private key to decrypt the traffic.

For you to be able to use the cipher suites that support forward secrecy here you will need to be using a version of OpenSSL and your web server that was built with ECDHE support. If you’re not you can still use these settings you just won’t offer forward secrecy to your users.

 

Disable insecure options in SSL and HTTP

As a general rule, protocols have options; these options can have unforeseen side-effects.

A great example of this is the option of SSL compression. Compression was added to SSL to improve performance of the protocol but it had a side effect – it enabled attackers to perform cryptanalysis on the cryptographic keys used in SSL. This attack was called CRIME (Compression Ratio Info-leak Made Easy) and, as such, this option is disabled today in secure SSL configurations.

Ensuring your configuration does not enable any such options is key to having a secure SSL configuration.

 

Enable performance optimizing options in SSL

To truly benefit from deploying SSL you need to apply it to your whole site—not doing so exposes sessions to attacks. The most common reason I hear from organizations as to why they are not deploying SSL across their whole site concerns performance.

This is a legitimate concern, according to Forester Research “The average online shopper expects your pages to load in two seconds or less, down from four seconds in 2006; after three seconds, up to 40% will abandon your site”.

And while it is true that an improperly configured web server will perform notably different than a properly configured one, it’s not difficult to configure your servers so that performance is not a major concern.

 

Ryan

 

Resources

  1. Getting the Most Out of SSL Part 1: Choose the Right Certificate, CA Security
  2. SSL Configuration Checker, CA Security
  3. SSL Pulse, Trustworthy Internet Movement
  4. Bulletproof SSL/TLS and PKI, Ivan Ristic
  5. High Performance Browser Networking, Ilya Grigorik
  6. How to get the latest stable OpenSSL, Apache and Nginx, Ryan Hurst
  7. Always On SSL, OTA
  8. Revocation Report, X509 Labs
  9. Transport Layer Security, WikiPedia
  10. Perfect forward secrecy , Wikipedia
  11. SSL Labs: Deploying Forward Secrecy, Qualys
  12. Intercepted today, decrypted tomorrow, Netcraft
  13. How to Build Your Own OpenSSL, Ryan Hurst
  14. Deploying forward secrecy on RedHat, Centos or Fedora based systems, Ryan Hurst

Abstract: Revocation reality and the path to becoming effective

Just submitted my first abstract for the NIST workshop on “Workshop on Improving Trust in the Online Marketplace” in April, the title of the talk is “Revocation reality and the path to becoming effective”, the abstract of which is:

 

The concept of certificate revocation is core to the X.509 trust model however 18 years after its introduction the reality is as implemented and deployed it falls short of its promise to enable an issuer certificate issuers to protect relying parties from malicious actors and miss-issuance

This talk will discuss the findings of a project where I have observed the behavior (https://revocation-report.x509labs.com), up time and performance of revocation repositories for a number of commercial Certificate Authorities for a period of over six months.

Additionally I will overview the revocation behavior of the most common browsers, identifying the gaps as they exist in those implementations.

And finally I will provide a set of recommendations that I believe if followed can address the current gaps which would move us to a world where revocation checking is an effective means of protecting relying parties from known bad actors and miss-issuance.

How Facebook can avoid losing $100M in revenue when they switch to always-on SSL

Recently Facebook announced that they will be moving to Always-On-SSL, I for one am thrilled to see this happen – especially given how much personal data can be gleamed from observing a Facebook session.

When they announced this change they mentioned that users may experience a small performance tax due to the addition of SSL. This is unfortunately true, but when a server is well configured that tax should be minimal.

This performance tax is particularly interesting when you put it in the context of revenue, especially when you consider that Amazon found that every 100ms of latency cost them 1% of sales. What if the same holds true for Facebook? Their last quarter revenue was 1.23 billion, I wanted to take a few minutes and look at their SSL configuration to see what this tax might cost them.

First I started with WebPageTest; this is a great resource for the server administrator to see where time is spent when viewing a web page.

The way this site works is it downloads the content twice, using real instances of browsers, the first time should always be slower than the second since you get to take advantage of caching and session re-use.

The Details tab on this site gives us a break down of where the time is spent (for the first use experience), there’s lots of good information here but for this exercise we are interested in only the “SSL Negotiation” time.

Since Facebook requires authentication to see the “full experience” I just tested the log-on page, it should accurately reflect the SSL performance “tax” for the whole site.

I ran the test four times, each time summing the total number of milliseconds spent in “SSL Negotiation”, the average of these three runs was 4.111 seconds (4111 milliseconds).

That’s quite a bit but can we reduce it? To find out we need to look at their SSL configuration; when we do we see a few things they could do to improve things, these include:

Let’s explore this last point more, the status check the browser does is called an OCSP request. For the last 24 hours their current CA had an average world-wide OCSP response time of 287 ms, if they used OCSP Stapling the browser would need to do only one OCSP request, even with that optimization that request could be up to 7% of the SSL performance tax.

Globalsign’s average world-wide OCSP response time for the same period was 68 milliseconds, which in this case could have saved 219 ms. To put that in context Facebook gets 1.6 billion visits each week. If you do the math (219 * 1.6 billion / 1000 / 60 / 24), that’s 12.7 million days’ worth of time saved every year. Or put another way, it’s a lifetime worth of time people would have otherwise spent waiting for Facebook pages to load saved every two and a half hours!

If we consider that in the context of the Amazon figure simply changing their CA could be worth nearly one hundred million a year.

Before you start to pick apart these numbers let me say this is intended to be illustrative of how performance can effect revenue and not be a scientific exercise, so to save you the trouble some issues with these assumptions include:

  • Facebook’s business is different than Amazons and the impact on their business will be different.
  • I only did four samples of the SSL negotiation and a scientific measurement would need more.
  • The performance measurement I used for OCSP was an average and not what was actually experienced in the sessions I tested – It would be awesome if WebPageTest could include a more granular breakdown of the SSL negotiation.

With that said clearly even without switching there are a few things Facebook still can do to improve how they are deploying SSL.

Regardless I am still thrilled Facebook has decided to go down this route, the change to deploy Always-On-SSL will go a long way to help the visitors to their sites.

Ryan

A quick look at SSL performance

When people think about SSL performance they are normally concerned with the performance impact on the server, specifically they talk about the computational and memory costs of negotiating the SSL session and maintaining the encrypted link.  Today though it’s rare for a web server to be CPU or memory bound so this really shouldn’t be a concern, with that said you should still be concerned with SSL performance.

Did you know that at Google SSL accounts for less than 1% of the CPU load, less than 10KB of memory per connection and less than 2% of network overhead?

Why? Because studies have shown that the slower your site is the less people want to use it. I know it’s a little strange that they needed to do studies to figure that out but the upside is we now have some hard figures we can use to put this problem in perspective. One such study was done by Amazon in 2008, in this study they found that every 100ms of latency cost them 1% in sales.

That should be enough to get anyone to pay attention so let’s see what we can do to better understand what can slow SSL down.

Before we go much further on this topic we have to start with what happens when a user visits a page, the process looks something like this:

  1. Lookup the web servers IP address with DNS
  2. Create a TCP socket to the web server
  3. Initiate the SSL session
  4. Validate the certificates provided by the server
  5. Establish the SSL session
  6. Send the request content

What’s important to understand is that to a great extent the steps described above tasks happen serially, one right after another – so if they are not optimized they result in a delay to first render.

To make things worse this set of tasks can happen literally dozens if not a hundred times for a given web page, just imagine that processes being repeated for every resource (images, JavaScript, etc.) listed in the initial document.

Web developers have made an art out of optimizing content so that it can be served quickly but often forget about impact of the above, there are lots of things that can be done to reduce the time users wait to get to your content and I want to spend a few minutes discussing them here.

First (and often forgotten) is that you are dependent on the infrastructure of your CA partner, as such you can make your DNS as fast as possible but your still dependent on theirs, you can minify your web content but the browser still needs to validate the certificate you use with the CA you get your certificate from.

These taxes can be quite significant and add up to 1000ms or more.

Second a mis(or lazily)-configured web server is going to result in a slower user experience, there are lots of options that can be configured in TLS that will have a material impact on TLS performance. These can range from the simple certificate related to more advanced SSL options and configuration tweaks.

Finally simple networking concepts and configuration can have a big impact on your SSL performance, from the basic like using a CDN to get the SSL session to terminate as close as possible to the user of your site to the more advanced like tuning TLS record sizes to be more optimum.

Over the next week or so I will be writing posts on each of these topics but in the meantime here are some good resources available to you to learn about some of these problem areas:

Serving OCSP POST responses on a CDN

The other day I did a blog post on how we are using a CDN to front our OCSP services, the CDN we are using is CloudFlare who is one of our partners.

In that blog post I mentioned that POSTs requests from an OCSP client would normally be a cache miss for a CDN and this introduces additional latency in serving these responses.

Even though the response times we were getting had this additional latency the performance was still acceptable but we wanted to do better so a few weeks ago I went to the CloudFlare office and worked with them on making their services OCSP aware.

Specifically we made it so that when they get an OCSP POST they can determine what cached response to return (for example from a prior GET) instead of going back to our responders to have our responder do that for them.

As of today this change has gone live, as you know I love numbers which is why I have been publishing these repository performance numbers. What you will see if you look at those is that our worldwide average is right around 100ms, if we take out china and Australia that figure drops to around 60ms.

The problem with these numbers is they only test the GET variant of the protocol, that is because neither Pingdom nor Monitis let me simulate binary POSTs (which is what the OCSP POST variant looks like).

With that said we can extrapolate what the numbers look like fairly easily; from my current network location (which is slow) this is what I see:

 

 

What I want you to notice here is that over 15 requests when I do a POST to the CDN instance of our responder I get the same response time average as I do when I perform the same test with the GET variant.

What this tells us is that POST is performing the exact same as a GET which from this we can safely say that in our case the performance numbers I have been publishing for GET are also accurate for POSTs.

What you will also see that our responder is slower to server via POST, this is because it is designed around nonced OCSP requests and as such isn’t optimized as much as it could be for caching them.

I should also note that our decision to put a CDN in front of our OCSP responder does not break clients that want to send nonced requests (no browsers do by default btw), these are simply treated as a cache miss.

Ryan

Measuring OCSP Responder Performance with Powershell

The other day I did a post about testing if OCSP stapling was working, but what about telling how fast your OCSP responder is?

Well that is what this post is about, on Unix based systems with curl you would do something along these lines:

time curl –verbose –url http://ocsp2.globalsign.com/gsextendvalg2/MFMwUTBPME0wSzAJBgUrDgMCGgUA

BBSgcg6ganxiAlTyqPWd0nuk87cvpAQUsLBK/Rx1KPgcYaoT9vrBkD1rFqMCEhEh

D0Xjo/V7lgq3ziGoWG69rA==

But there is no time on Windows so how can you do it there?

PowerShell.exe
Measure-Command {Start-Process -wait “curl” “-verbose “”http://ocsp2.globalsign.com/gsextendvalg2/MFMwUTBPME0wSzAJBgUrDgMCGgUA

BBSgcg6ganxiAlTyqPWd0nuk87cvpAQUsLBK/Rx1KPgcYaoT9vrBkD1rFqMCEhEh

D0Xjo/V7lgq3ziGoWG69rA==”””}

 

Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 53
Ticks : 531200
TotalDays : 6.14814814814815E-07
TotalHours : 1.47555555555556E-05
TotalMinutes : 0.000885333333333333
TotalSeconds : 0.05312
TotalMilliseconds : 53.12

 

Here you see that it took 53 milliseconds to run the command, its important to note that there different overhead buried each time measurement approach (time vs Measure-Command) that make it inappropriate to compare the results from two approaches. With that said its still a good way to measure performance over time when using a single measurement approach.

Also if you wanted to do the same measurement for POSTs the command would look like one of these :

Measure-Command {Start-Process -wait “openssl” “ocsp -noverify -no_nonce -respout ocspglobalsignca.resp -reqout ocspglobalsignca.req

-issuer globalsigng2.cer -cert globalsign.com.cer -url “”http://ocsp2.globalsign.com/gsextendvalg2″” -header “”HOST”” “”ocsp2.globalsign.com”””}

 

Measure-Command {Start-Process -wait “curl” “-verbose -data-binary  @ocspglobalsignca.req -H “”Content-Type:application/ocsp-request””

http://ocsp2.globalsign.com/gsextendvalg2″}