The Importance of Proper Measurement in Enterprise IT Management: Lessons from Cloud Deployments

Peter Drucker is often credited with saying, “You cannot manage what you cannot measure.” The idea applies to many aspects of technology development and business management, and neglecting measurement reliably leads to problems. Another relevant saying is “Lies, damned lies, and statistics.” When using data to manage complex systems, the importance of what we measure and how we measure it cannot be overstated.

In enterprise IT, endpoint management systems are often touted as the source of truth for the environment, but this is misleading. They measure the intended state of devices, not their actual state. In other words, they track what they believe they have done, rather than whether the change was deployed correctly and successfully. This might seem like a sufficient approximation, but it is not: endpoints run many different pieces of software whose settings can interact in unexpected ways, which leaves the data these systems provide of little use for risk assessment.

As for EDR systems, I would argue that they are simply advanced anti-virus products. They try to protect the operating system, a largely intractable problem, and give the enterprise visibility into attacks and infections while providing minimal tools for response. To be clear, EDR is valuable, but its overhead is high and it is not designed for device intelligence or observability; its purpose is detection and response.

If enterprises had proper investments in observability, they could discover outages before users report them. They could use intelligence about the state and usage of their devices to proactively defend and isolate their assets, turning what has been a reactive and inaccurate dataset into a tool for protecting their networks and improving user productivity.

There is a lot to learn from cloud deployments when looking at how to solve these problems. Unlike most IT estates, cloud assets are heavily instrumented: logs are centralized, and dashboards report uptime and health in real time. There is an entire market of solutions focused on enabling this visibility; just consider how many log aggregation and analytics offerings exist, such as Stackdriver, CloudWatch, and New Relic.

Additionally, these assets typically have cross-platform machine identities used to enforce security domain separation, and, interestingly, those identities increasingly rely on hardware-backed keys to secure their credentials. The same credentials help each endpoint achieve a basic level of self-protection: they capture the posture of the machine and of the peers it interacts with, particularly when crossing security domains, and policy is enforced based on those interactions.

I believe that over the next decade, enterprise IT deployments will increasingly resemble cloud deployments, with a focus on zero-trust design patterns. However, for that to happen, several product gaps need to be filled. For example, there is no turnkey solution for desktop state observability that produces structured data usable for real-time policy decisions; the big tech companies have built such systems for their own deployments, but nothing turnkey exists in the market. Then there is the issue of how to authenticate the device and its state. There are lots of pieces in this area but no complete solutions. If you are going to use observed device state for policy, you also have to worry about the validity of the associated data; while existing systems cannot totally solve that problem, there is plenty that can be done to make the data more reliable. A sketch of what such a policy decision could look like follows.
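To make this concrete, here is a minimal sketch of structured, observed (rather than intended) device state feeding a real-time access decision. Everything here, the field names, thresholds, and freshness rule, is invented for illustration and not taken from any existing product:

package main

import (
	"fmt"
	"time"
)

// DevicePosture captures observed (not intended) device state.
type DevicePosture struct {
	DiskEncrypted   bool          // queried from the OS, not from the deployment record
	FirewallEnabled bool
	PatchAge        time.Duration // time since the last security update was observed
	LastObserved    time.Time     // when this posture was last collected
}

// allowAccess is a toy policy: stale telemetry is treated as untrusted,
// because without fresh observation we only know the intended state.
func allowAccess(p DevicePosture) bool {
	if time.Since(p.LastObserved) > 15*time.Minute {
		return false
	}
	return p.DiskEncrypted && p.FirewallEnabled && p.PatchAge < 30*24*time.Hour
}

func main() {
	p := DevicePosture{
		DiskEncrypted:   true,
		FirewallEnabled: true,
		PatchAge:        12 * 24 * time.Hour,
		LastObserved:    time.Now(),
	}
	fmt.Println("access allowed:", allowAccess(p))
}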

Finally, as this evolution plays out, enterprises will need to rethink the way they approach identity; this new identity infrastructure will become an alternative to Active Directory. No security practitioner designing the enterprise's source of truth today would design it the way Active Directory works.

The Changing Landscape of Internet Protection

The United States government and big companies like Google have both played significant protector roles in their respective spheres. The US government, as the world's leading military power, has tried to serve as a global peacekeeper and defender of democracy through its high military spending. Similarly, Google, as one of the technology industry's great successes, has leveraged the financial power of its advertising business to fund various initiatives to protect and grow the internet and the web.

However, in recent years, Google has undergone significant changes. The COVID-19 pandemic accelerated the need for companies to adapt and re-envision themselves, and Google was no exception. In a relatively short period the company hired many externally sourced professional managers and went through a series of re-organizations and corresponding cultural changes that have fundamentally altered the company. This, combined with recent fiscal irresponsibility in hiring and an inability to drive a cohesive and integrated product strategy (how many messengers does one company need, after all?), led to Google's first-ever layoffs, which significantly impacted its internet investments and, to be transparent, my own job.

This raises important questions about the future of internet protection. While Google was clearly not the exclusive protector of the internet, its direct and indirect investments have played a significant role in protecting it. It seems unlikely that, moving forward, Google can be counted on in the same way. We need other large organizations whose businesses similarly take a critical dependency on a safe and secure internet to step up and fill the gap.

This is problematic for a number of reasons, one of the largest being that public companies are, to a fault, focused on quarter-to-quarter growth. That is of course the right thing for their shareholders, at least in the near term, but the infrastructure investments necessary to secure the Internet can take a decade or more. That is quite hard for an organization to justify when its most valuable resource is engineers, and those engineers could instead be securing and improving its private networks or directly generating revenue.

Many of these organizations already donate to security non-profits such as ISRG and OpenSSF and work through them to make some of these improvements to the Internet, but the funding is often piecemeal and usually aligned only with the latest security trends. Furthermore, these investments often reflect the pet projects of the sponsoring companies rather than a strategic, proportional investment across different problem areas. A telling example is how many of the most important open-source projects lack the basic funding necessary to secure them for the long term.

One approach to mitigating this is to lean on these security non-profits more and give them more autonomy and funding. A larger funding commitment would allow them to take on a bigger and more independent role, with multi-year roadmaps, milestones, and success criteria approved by some mix of funding stakeholders, invited experts in the specific areas in question, and key engineering participants from the relevant funding organizations. This would create predictable funding and a space where long-term roadmaps for researching, securing, and improving the internet can be established without relying on a small number of companies whose business models happen to support direct investment in those activities.

This approach would have its own challenges; like most non-profits, these organizations will struggle to stay impact-focused rather than drifting toward their own pet projects or philosophies. That said, this is largely part of the human condition and something that always needs to be managed. The answer surely lies in the systematic approach used for choosing initiatives, measuring their success, and setting enough meaningful milestones along the way to checkpoint, course-correct, and, where appropriate, summon the willpower to kill projects.

In summary, the role of a protector comes with great responsibility. The recent changes at Google highlight the need for a diverse set of stakeholders to come together to safeguard the internet and the web, rather than depending on one company to carry so much of the load.

The Limitations of Audits: What You Need to Know

In recent years, SOC2 and ISO 27001 badges have become a staple at the bottom of nearly every SaaS website. This is largely due to the growing number of startups, such as Vanta and Secureframe, that provide solutions for SOC2 and ISO 27001 audits and greatly reduce the time and money needed to comply with the various audit regimes. The fact that large companies increasingly incorporate SOC2 and ISO 27001 requirements into their procurement processes of course also drives demand for these audits. With that said, it is important to understand what these badges really mean, and what they don't.

One common misunderstanding is that audits are ongoing assessments of an organization’s security practices. In reality, audits are almost always point-in-time retrospectives, meaning they only reflect what was the case at the time of the audit and not what is currently the case. This is a material point, as many small organizations view the work associated with these audits as an annual tax, rather than integrating the associated practices into how they work.

Another factor to consider is the scope of the audits. The entity being audited gets to choose the scope, and it’s common for startups to exclude their IT environment (such as desktops) from their audits. This means that not all SOC2 audits are created equal, and you may not be getting the full picture.

Beyond that, when audits involve sampling data, it is usually the subject of the audit that chooses the sample. This means that, whether by accident or on purpose, the analysis may be based on the most favorable data.

There is also the concept of accepted and managed risks. An organization can often get by with poor security practices as long as they acknowledge the risk and have a plan to resolve it within a fixed period of time. This period of time can be continuously extended, allowing the organization to avoid addressing the issue indefinitely.

Then you need to remember that the auditor works for the organization being audited. They want to be hired again next year, so they may be willing to accept the organization’s interpretation of the expectations in order to secure future business. This means that the results of the audit may not be completely impartial.

And finally, there is the question of the qualifications of the auditors conducting these assessments. Often, they are accountants rather than technologists, meaning they are not equipped to evaluate the technical security or correctness of the systems being audited. Instead, they are essentially tasked with assessing whether the checklist represented by the audit regime can reasonably be deemed met.

In conclusion, while SOC2 and ISO 27001 audits have good intentions, they are far from sufficient in assessing an organization’s security practices. The fact that the audits are point-in-time retrospectives, the scope of the audit is chosen by the entity being audited, the auditor works for the organization, and the results may be based on favorable data, all contribute to the limitations of these audits. As a result, it’s important to be aware of these limitations and not rely solely on a SOC2 badge as a sign of a secure organization.

Using Caddy with Google Trust Services

Caddy is a powerful and easy-to-use web server that can be configured to use a variety of certificate authorities (CAs) to issue SSL/TLS certificates. One popular CA is Google Trust Services, which offers an ACME endpoint that is compatible with Caddy out of the box because it implements the industry-standard ACME protocol (RFC 8555).

This means that Caddy can automatically handle the process of certificate issuance and renewal with Google Trust Services, once the required External Account Binding (EAB) credentials have been configured.

How do I use it? 

Using global options

To use the Google Trust Services ACME endpoint you will need an API key so you can use an ACME feature called External Account Binding. This enables us to associate your certificate requests with your Google Cloud account and allows us to impose rate limits on a per-customer basis. You can easily get an API key using the following commands:

$ gcloud config set project <project ID>
$ gcloud projects add-iam-policy-binding <project ID> \
    --member=user:[email protected] \
    --role=roles/publicca.externalAccountKeyCreator
# Request a key:
$ gcloud alpha publicca external-account-keys create

You will need to add this API key and specify the Google Trust Services ACME directory endpoint along with your email address in your Caddyfile:

{
    acme_ca https://dv.acme-v02.api.pki.goog/directory
    email [email protected]
    acme_eab {
        key_id <key_id>
        mac_key <mac_key>
    }
}

It is important to remember that with this configuration, Google Trust Services is used as the only CA.

If you want to use multiple CAs for redundancy, which is recommended, the configuration would look something like this:

{
    cert_issuer acme https://dv.acme-v02.api.pki.goog/directory {
        eab <key_id> <mac_key>
    }
    cert_issuer acme
}

In this example, Google Trust Services will be tried first, and if there is a problem, Caddy will fall back to Let's Encrypt.

It is also worth noting that the Google Trust Services EAB key is one-time use only. This means that once Caddy has created your ACME account, the EAB credentials can safely be removed from your configuration.

Using the tls directive

If you want to use Google Trust Services for only some of your sites, you can use the tls directive in your Caddyfile like you’re used to:

tls [email protected] {
    ca https://dv.acme-v02.api.pki.goog/directory
   eab <key_id> <mac_key>
}

The email address in this example identifies the ACME account to be used when doing enrollment.

In conclusion, using Caddy with Google Trust Services is a simple and secure way to issue and manage SSL/TLS certificates for your websites. With the easy-to-use Caddyfile configuration, you can quickly configure your server to use Google Trust Services for all of your sites or just a select few. With Google Trust Services, you can trust that your websites will be secure and your visitors' data will be protected.

To err is human, to forgive is divine

Every service should strive for perfect availability; the reality is that perfection is not possible. Mistakes happen, and it's how you deal with them that matters.

Successfully dealing with availability issues requires planning, and when dealing with a client-server solution, it requires both parties to make improvements.

In the context of ACME today, the large majority of clients have no failover logic. This means that if the specified CA is unreachable, has an outage, or suspends operations for one reason or another, enrollment simply fails.

This is a problem for any service that wants to be highly available and is protected with TLS, which today is basically every service. One way services deal with this is to proactively acquire two certificates for every endpoint. That is viable, but I would argue it is not the right solution.

The right solution is to not rely on a single CA, and instead to fail over between multiple CAs capable of servicing your needs. This way, if any single CA fails, you just keep chugging along.

That alone is not sufficient to address the risk, though. Not all certificates enjoy the same level of device trust, and clients sometimes make bad assumptions, trusting specific CA hierarchies as though those configurations were static, even when they are not.

To help mitigate this behavior, clients should implement round-robin or random CA selection so that subsequent renewals hit different CAs. This forces you to make sure your services work with any of your chosen CAs, so you won't find yourself breaking apps that made those bad assumptions when your primary CA fails. A sketch of this selection logic appears below.
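As a rough illustration of that client-side logic, here is a minimal sketch, with placeholder directory URLs and a stubbed-out enrollment function, of random CA selection with failover:

package main

import (
	"errors"
	"fmt"
	"math/rand"
)

var caDirectories = []string{
	"https://acme-v02.api.letsencrypt.org/directory",
	"https://dv.acme-v02.api.pki.goog/directory",
}

// enrollWith stands in for a full ACME enrollment against one CA.
func enrollWith(directory string) error {
	fmt.Println("trying", directory)
	return errors.New("simulated outage") // placeholder result
}

// issueCertificate shuffles the CA list on every renewal so that
// subsequent renewals hit different CAs, then fails over on error.
func issueCertificate() error {
	for _, i := range rand.Perm(len(caDirectories)) {
		if err := enrollWith(caDirectories[i]); err == nil {
			return nil
		}
	}
	return errors.New("all configured CAs failed")
}

func main() {
	fmt.Println(issueCertificate())
}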

As I understand it, Caddy already implements both of these strategies, but every ACME client should be doing the same thing.

A Boy Scout is always prepared

My father always says it's not a problem to make a mistake; what is important is how you deal with it.

The same is true when it comes to WebPKI CAs. Broadly, the incident response process used in this ecosystem could be categorized as blameless post-mortem: the focus is on what happened, what contributed to it, and what was done to address the issue, not on fault.

A few years ago a number of large CAs had to do millions of revocations; in all of the cases I am thinking of, the required deadline for those revocations was five days. Revoking a large number of certificates involves challenges that are not immediately obvious, but if you are a CA that has done even a moderate level of planning, it is something you should be up for.

The hard part is that doing so can cause harm. For example, the issue that necessitates the revocation might be incredibly subtle and have no security impact. Nonetheless, the requirement is what the requirement is: the certificate needs to be revoked.

The question then becomes how you meet that timeline without creating an unnecessary outage for customers. If you defy the rules you risk being distrusted; if you act blindly you could take down your customers' services.

In other words, how do you contact millions of customers and give them enough time to replace their certificates, without an outage, under these constraints?

Like most scale problems, the answer is automation; in the context of certificate lifecycle management, that means an extension to the ACME protocol. To that end, there is now a draft for something called “ACME Renewal Information” (ARI) which, when implemented by CAs and ACME clients, will enable a CA to signal that a certificate may need to be replaced earlier than expected.

The basic idea of the proposal is that the CA makes available hints about when it would like certificates to be updated, and the client periodically checks this information and uses it to guide its renewal behavior.

To be clear, this is just a hint; a CA might provide it simply to smooth out load, and there is no mandate to rely on it. With that said, I do hope all major ACME clients implement this standard and respect it by default, because it will make the WebPKI a lot less fragile. A sketch of the client side follows.
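As a sketch of what respecting the hint could look like on the client side, the following assumes an ARI-style endpoint returning a JSON suggestedWindow; the URL shape and field names follow the draft only loosely and should be treated as illustrative rather than normative:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// renewalInfo mirrors the draft's response shape at a high level.
type renewalInfo struct {
	SuggestedWindow struct {
		Start time.Time `json:"start"`
		End   time.Time `json:"end"`
	} `json:"suggestedWindow"`
}

// shouldRenewNow fetches the CA's hint for one certificate and reports
// whether we are inside the suggested renewal window. A real client
// would pick a random instant within the window to spread load.
func shouldRenewNow(ariURL string) (bool, error) {
	resp, err := http.Get(ariURL) // ariURL embeds the certificate's ARI identifier
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var info renewalInfo
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		return false, err
	}
	now := time.Now()
	return now.After(info.SuggestedWindow.Start) && now.Before(info.SuggestedWindow.End), nil
}

func main() {
	ok, err := shouldRenewNow("https://ca.example/acme/renewal-info/<cert-id>")
	fmt.Println(ok, err)
}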

Baseline requirements are just that, a baseline

I’ll start by saying this post is just a collection of personal thoughts and not a statement from my employer, nor does it reflect anyone’s opinions other than my own.

A common problem in the WebPKI is that CAs forget why they exist. No one cares about their business objectives. They exist to serve the public interest; more explicitly, they exist to help the web solve the TOFU problem for domain control verification.

They exist because user agents (usually browsers) made the decision to delegate the TOFU problem to certificate authorities. To understand the browsers' motivation, you simply need to look at the name “user agent”: they directly serve the user.

These user agents used to meet individually with each and every one of these CAs to discuss their requirements for CAs, but ultimately decided to do this in a more collaborative way by participating in the CA/Browser Forum.

Simply put, this organization would very likely stop existing if the user agents stopped participating and stopped relying on the documents it produces.

We know this to be true because of the voting structure of the forum and because the forum does not produce documents that govern the user agents; it produces documents that govern the practices of CAs.

To understand the nature of the CA/Browser Forum, you just have to look at the name of the first and most important document it produced: the Server Certificate Baseline Requirements.

base·line

noun

     a minimum or starting point used for comparisons.

That’s right, it sets the minimum bar that a user agent should consider when trusting a certificate authority.

Far too often, compliance programs end up driving the roadmaps and operational practices of CAs, and it stops being about what is right and becomes about what is minimally required.

This leads to all kinds of spectacular failures, including basic failures in communication.

Instead of this compliance-focused mindset, CAs must be focused on why they exist, who they serve, and what is at risk if they mess things up.

I would argue that every major failure in the WebPKI has stemmed from forgetting about these simple questions.

TOFU and the Web

I’ll start by saying this post is just a collection of personal thoughts and not a statement from my employer, nor does it reflect anyone’s opinions other than my own.

For the uninitiated, TOFU is an acronym for Trust On First Use. The basic idea is that the user makes the trust decision associated with a communication channel on the first visit.

To understand the concept, you need to know what we mean by “communication channel” and what we “trust” it for.

Let’s start with what we mean by the “communication channel”. In the context of the web, for the most part, a user agent (often a browser) initiates communication to a service located at a fully qualified domain name. This is usually done using the Transport Layer Security (TLS) protocol. 

Next, let’s look at what “trust decision” is being made. Notice we don’t have an application yet, we have just established a session that an application and a service can use to exchange information. As a result, the trust decision we are making is:

Did my traffic end up at a service associated with the aforementioned fully qualified domain name?

The naive way to answer the question is to ask the user every time they visit the service. In fact, some browsers support this model explicitly.

This is the moral equivalent of saying “do you feel lucky?” because the user has no way to answer this question.

It is actually worse than that: the few ways you can answer that question require the participation of the service being authenticated, which means any solution potentially has to scale to the five billion users on the internet.

Since each of these users has varying degrees of technical prowess and the service in question has limited time and resources, doing this for every user interaction would be impractical.

To give this some modern color, Signal solves this problem by having the users in a conversation verify out of band that they are talking to the right person. Almost no one does this, of course; most just begin chatting without going through this dance.

The gray hairs of the internet decided to solve this problem by delegating the trust decision to entities known as Certificate Authorities (CAs). They act as indirect middlemen for session establishment, issuing the service a credential it can use to prove that someone did this verification for it.

These CAs do this verification by having the service perform some authenticatable action that only someone who controls the fully qualified domain name in question could complete. This whole dance is basically delegated TOFU over control of the fully qualified domain name.
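To make that “authenticatable action” concrete, here is a minimal sketch of the service side of an ACME HTTP-01 challenge; the token and key-authorization values are placeholders that a real ACME client would receive from the CA during the challenge exchange:

package main

import (
	"fmt"
	"log"
	"net/http"
)

const (
	token   = "<token-from-ca>"
	keyAuth = "<token-from-ca>.<account-key-thumbprint>"
)

func main() {
	// The CA fetches this well-known path over port 80 of the name being
	// validated; only someone who controls the name can serve the answer.
	http.HandleFunc("/.well-known/acme-challenge/"+token,
		func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprint(w, keyAuth)
		})
	log.Fatal(http.ListenAndServe(":80", nil))
}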

Over the last 28 years, this delegation has taken us from no HTTPS to near-ubiquitous HTTPS.

This is of course fantastic, but it raises the question: what has this journey taught us, and what might we change moving forward?

For one, right now, if we believe this table, there are about 89 organizations in the world that are approved certificate authorities, but the top 5 of those CAs issue 99% of all certificates in use on the web.

That's right: roughly 94% of the CAs that can issue certificates for the web could disappear tomorrow and most people would not notice.

What concerns me is that running a CA is hard, expensive, and has materially different operating constraints than other services. This, combined with the love nation-state attackers have for these services, has led to numerous colorful events over the years (1, 2, 3, 4, 5, 6).

To be clear, I am not making the case that those CAs should be removed, just that the web is carrying a much larger surface area than it strictly needs to deliver on its original design goal.

That takes us back to the question of what lessons the last 28 years have taught us and what we might change moving forward. The answer is likely buried in the adage that the key is to have one joint and keep it well oiled.

The journey from 0% of the web being protected with HTTPS to nearly 100% has been one of simplification and automation. There is plenty of opportunity to do more of both, and doing so will be key to the survivability of encryption on the web. Some problems we might want to see addressed in that spirit include:

  1. Too many single points of failure in the WebPKI. If any of the top 5 CAs were to fail, we would experience a massive failure on the web. To address this we need to standardize on a single certificate enrollment protocol, making CAs plug-and-play replacements for one another. We already have a de-facto standard, RFC 8555 (ACME), which is responsible for around 70% of all certificate issuance; we should embrace that and make it official. Once we see sufficient ACME adoption, we need ACME clients to fail over and round-robin across all of their associated CAs so that clients can survive any single CA failure.
  2. We need the WebPKI to be less fragile. There are lots of examples of WebPKI fragility, some of which are:
  • Clients sometimes get hard-coded to expect a single issuing CA. This is a very bad practice, one some CAs have already started to dissuade by randomly selecting issuing CAs.
  • Reliance on very long-lived root certificates instead of very frequent root updates. This is thankfully also starting to change, but solving the problem will take a very long time thanks to root programs that do not continually manage which roots are trusted in deployed systems.
  • The inability of the web to survive CA mistakes and potential security issues without creating outages. Work is being done here too, in an extension to ACME called ARI, to help the web seamlessly renew certificates in such cases.
  3. Too many CAs with too-broad permissions. Since most CAs are not necessary for the web to operate but may still have a legitimate need to exist, we should make greater use of usage and name constraints in root trust. This would go a long way toward reducing both the value to an attacker of compromising one of these long-tail CAs and the exposure of the web as a whole to their practices. We should also periodically revisit whether CAs still need the permissions they hold.
  4. Over-reliance on manual assessments by unqualified auditors and security professionals, which introduces a number of issues.
  5. Root programs are too slow to respond to bad or questionable actors. Suspension of a CA essentially never happens, and total distrust can take months or longer. There are a lot of politics involved, but equally important is the tooling to remove CAs, both procedural and technical; these things are simply not “well oiled”. They need to be made far more agile so that this long tail can be more effectively managed.

So what can you do about all of this? Well, if you are a root program, you should be looking at how you can help address the above issues through the way you manage that program.

If you are a website operator? Well, the answer there is easy: adopt ACME-based certificate lifecycle management, keep an eye on Certificate Transparency logs for certificates issued by that long tail of CAs, and consider specifying CAA restrictions that limit issuance to a small number of CAs (a sketch of such records follows).
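CAA restrictions are just DNS records. As a minimal sketch, using an illustrative domain and the CAA identifiers for Let's Encrypt and Google Trust Services, the records in a zone file might look like this:

example.com.  IN  CAA  0 issue "letsencrypt.org"
example.com.  IN  CAA  0 issue "pki.goog"
example.com.  IN  CAA  0 iodef "mailto:[email protected]"

With records like these in place, conforming CAs other than the two listed will refuse to issue for the name, and violation reports go to the iodef address.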

Top 5 CAs by issuance volume as of 09/19/22

As Google Trust Services has been available for a few weeks I thought it would be interesting to look at where it stands relative to other CAs based on its issuance volume.

#   CA Owner                           Certificates (All)   Certificates (Unexpired)   Pre-Certificates (All)   Pre-Certificates (Unexpired)
1   Internet Security Research Group   2,972,072,131        270,986,317                2,689,923,862            233,570,311
2   DigiCert                           115,808,406          7,603,151                  443,129,508              138,144,685
3   Sectigo                            580,982,481          45,262,868                 517,794,477              46,204,659
4   Google Trust Services LLC          13,909,548           467,650                    120,070,013              21,232,287
5   Microsoft Corporation              17,567               170                        32,448,453               16,959,805

For more on this methodology of counting see this post.

Why hasn’t SHAKEN/STIR made a big dent in the volume of robocalls?

If you live in the United States you surely get a ton of robocalls. As an aside, for whatever reason, it appears the problem is less severe in other countries.

If you look at these robocalls, one common element is that the phone numbers used are usually made up to look like they are hyper-local calls or simply do not disclose the origin. They do this to increase the chance you pick up the call.

This is possible because the phone system was largely designed on blind trust in telephone operators. You need look no further than caller ID metadata: today the originator and their carrier get to set any phone number they like, so you can't reliably block a call based on caller ID or carrier metadata. Back when there was one Ma Bell, that was probably a rational choice, but after the Bell System was broken up and telephone services expanded, that design assumption became far more problematic.

That takes us to STIR/SHAKEN. This blind trust still exists in much of the telephone ecosystem, and one place it manifests today is robocalls.

STIR/SHAKEN is intended to solve this problem, or at least be a key foundation for solving the problem. By making each telephone operator cryptographically sign the call metadata it becomes possible to hold the originator and their carrier accountable.

While this will not by itself solve the problem of robocalls, it introduces a durable, cryptographically verifiable credential that can be used to build reputation databases for both originating phone numbers and telephone companies.

In other words, adopting SHAKEN/STIR is about making it possible to fight robocalls, not fighting them directly. To be clear, there may be some ancillary reduction of robocalls due to the increased cost and complexity associated with participating in the STIR/SHAKEN ecosystem but this is at best a speed bump to a motivated attacker.

The good news is that once there is near-ubiquitous adoption of the standard, it becomes far more practical for the needed reputation systems to be built. This in turn will hopefully enable the telephone companies to make a big dent in the problem.

In my opinion, the development and maintenance of these reputation systems is the harder of the two problems. I say this because today there is no signal from end-users indicating that they believe a call was in fact a robocall.

While it is technically possible to look at aggregate user behavior, for example how quickly users hang up or whether they answer at all, to estimate the probability that a call is a robocall, without this positive signal it is a guess. The same is true for traffic-pattern reputation systems; without affirmative labeling from the user they are essentially guesses. The toy sketch below illustrates the difference.
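As a toy sketch of that difference, consider scoring a calling number from behavioral signals versus explicit user flags; every signal and weight here is invented purely for illustration:

package main

import "fmt"

type callStats struct {
	calls       int     // total call attempts from this number
	answered    int     // how many were picked up
	avgSeconds  float64 // average duration of answered calls
	userFlagged int     // explicit "this was spam" reports
}

// robocallScore mixes weak behavioral guesses with strong user labels.
func robocallScore(s callStats) float64 {
	score := 0.0
	if s.calls > 0 && float64(s.answered)/float64(s.calls) < 0.2 {
		score += 0.3 // rarely answered: suggestive, but only a guess
	}
	if s.avgSeconds < 5 {
		score += 0.2 // instant hang-ups: also just a guess
	}
	total := float64(s.calls)
	if total < 1 {
		total = 1
	}
	score += 0.5 * float64(s.userFlagged) / total // affirmative labels dominate
	return score
}

func main() {
	fmt.Printf("%.2f\n", robocallScore(callStats{calls: 100, answered: 10, avgSeconds: 3, userFlagged: 40}))
}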

The other problem here is that this behavioral data is only available to handset manufacturers and telephone network operators. I also suspect that in both cases each party faces usage restrictions on what they can do with the data, for example, whether they are allowed to share it with other telephone network operators or handset manufacturers.

There are several robocall-abatement applications that try to address this problem; the ones I have looked at appear to have horrible privacy policies, and many are free, suggesting their revenue model is based on leveraging your call history.

Once installed, these applications surely send this behavioral data to the cloud to analyze it for this specific purpose. The apps I have seen also expose ways to explicitly flag calls as spam or unwanted. This is probably the best a third party can do to build such a reputation system.

All of this basically means that you are, in essence, being asked to give up your privacy, and the privacy of those you communicate with, in exchange for avoiding nuisance calls.

For some, this may be a good trade-off but for others, it likely is not.

I would rather see a model where telephone operators were required to share data with each other under clear data usage constraints. If that were combined with standardization by platforms like Android and iOS that made it possible for users to flag unwanted calls, and for that information to be shared across providers, we might someday see a reliable, privacy-respecting solution to this problem become reality.