Tag Archives: Security

Another Sleeping Giant: Microsoft’s Root Program and the 1.1.1.1 Certificate Slip

This morning (September 3, 2025), someone posted an incident to the Mozilla dev-security-policy list that exposed a serious incident: “Incident Report: Mis-issued Certificates for SAN iPAddress: 1.1.1.1 by Fina RDC 2020.” An obscure CA, Fina RDC 2020, issued certificates containing the IP address 1.1.1.1, used by Cloudflare for encrypted DNS. These certificates should never have been issued.

Reference: https://groups.google.com/a/mozilla.org/g/dev-security-policy/c/SgwC1QsEpvc

Why this matters

1.1.1.1 is a critical bootstrap endpoint

This IP anchors encrypted DNS (DoH, DoT). A mis-issued certificate, combined with a BGP hijack, allows an attacker to intercept traffic before secure tunnels form. Some may argue that an attacker would also need to perform a BGP hijack to exploit such a certificate, but this is no real mitigation. BGP hijacks are a regular occurrence – often state-sponsored — and when combined with a valid certificate, they turn a routing incident into a full man-in-the-middle compromise. Cloudflare documentation:

Mainstream browsers like Chrome and Firefox typically use the domain name cloudflare-dns.com when bootstrapping DoH/DoT, so they would ignore a certificate that only listed IP:1.1.1.1. However, both Google and Cloudflare also support direct IP endpoints (https://8.8.8.8/dns-query and https://1.1.1.1/dns-query). In these cases, an IP SAN certificate would validate just as Alex Radocea’s curl test showed: “subjectAltName: host “8.8.8.8” matched cert’s IP address!”.

The behavior differs slightly between providers:

Google accepts raw-IP DoH connections and does not mandate a specific HTTP Host header.
Cloudflare accepts raw-IP DoH connections but requires the Host: cloudflare-dns.com header.

This means raw-IP endpoints are real-world usage, not just theoretical. If a mis-issued IP SAN cert chains to a CA trusted by the platform (like in Microsoft’s store), exploitation becomes practical.

A Chronic Pattern of Mis-issuance

The incident report cites two mis-issued certs are still valid as of September 3 2025:

https://crt.sh/?id=18603461241 (issued 2025-05; SAN includes 1.1.1.1)
https://crt.sh/?id=19749721864 (issued 2025-05; SAN includes 1.1.1.1)

This was not an isolated event. A broader search of CT logs reveals a recurring pattern of these mis-issuances from the same CA, indicating a systemic problem.

Reference: https://crt.sh/?q=1.1.1.1

CT monitoring failed to trigger remediation

Cloudflare runs its own CT monitoring system: https://developers.cloudflare.com/ssl/ct/monitor/

If these certs surfaced in crt.sh, Cloudflare should have flagged them. Instead, there was no visible remediation.

What’s going wrong?

Are IP SANs being monitored effectively enough to detect a pattern?
Is there a proper pipeline from alerts to action?
Why were these certs still valid after months?

It is easy to focus on where CT monitoring fell short in this case, and clearly it did. But we should not lose sight of its value. Without CT, we would not have visibility into many of these mis-issuances, and we do not know how many attackers were dissuaded simply because their activity would be logged. CT is not perfect, but this incident reinforces why it is worth doubling down: improving implementations, making monitoring easier, and investing in features like static CT and verifiable indices that make it easier to use this fantastic resource.

What this means for users

The impact is scoped to users of browsers and applications that rely on the Windows operating system’s root store, such as Microsoft Edge. Browsers like Chrome and Firefox, which manage their own root stores. However, because both Google and Cloudflare accept raw-IP DoH endpoints, the risk extends beyond pure edge cases.

For an affected user, at a public hotspot, if a BGP hijack redirects their traffic and the client connects directly to 1.1.1.1, their system checks the cert and sees a valid Cloudflare-like certificate. The attacker succeeds not just in breaking DNS, but in controlling secure web sessions.

This enables:

Silent proxying of all traffic
Token and session theft via impersonation
Decryption of DoH queries (for raw-IP clients)
In-flight alteration of pages and updates

This is the full man-in-the-middle playbook.

A tiny CA with outsized impact

Certificate issuance across the WebPKI is highly skewed, as of today:

CA Owner	% of Unexpired Pre-Certificates
Internet Security Research Group	46.52%
DigiCert	22.19%
Sectigo	11.89%
Google Trust Services	8.88%
GoDaddy	5.77%
Microsoft Corporation	3.45%
IdenTrust	0.63%
All others	<0.70%

Refference: https://crt.sh/cert-populations and https://docs.google.com/spreadsheets/d/1gshICFyR6dtql-oB9uogKEvb2HarjEVLkrqGxYzb1C4/

Fina RDC 2020 has ~201 unexpired certs, accounting for <0.00002% of issuance. Yet Microsoft trusts it, while Chrome, Firefox, and Safari do not. That asymmetry leaves Edge and Windows users vulnerable.

Fina RDC 2020 is also listed in the EU Trusted List (EUTL) as authorized to issue Qualified Website Authentication Certificates (QWACs): https://eidas.ec.europa.eu/efda/trust-services/browse/eidas/tls/tl/HR/tsp/1/service/14. While no mainstream browsers currently import the EUTL for domain TLS, eIDAS 2.0 risks forcing them to. That means what is today a Microsoft-specific trust asymmetry could tomorrow be a regulatory mandate across all browsers.

Moreover, Microsoft requires that CAs provide “broad value to Windows customers and the internet community.”

Reference: https://learn.microsoft.com/en-us/security/trusted-root/program-requirements

Fina clearly does not meet that threshold.

Certificate lifespan context

The CA/Browser Forum (April 2025) approved ballot SC-081v3: TLS certificates will be limited to 47 days by March 2029:

March 15 2026: max 200 days
March 15 2027: max 100 days
March 15 2029: max 47 days

Reference: https://www.cyberark.com/resources/white-papers/how-47-day-tls-certificates-are-reshaping-digital-trust

IP SAN certs, however, are much shorter:

Google Trust Services (via ACME) issues these certs with a maximum lifetime of 10 days: https://pki.goog/faq/
Let’s Encrypt began issuing IP certs in July 2025, valid for ~6 days: https://letsencrypt.org/2025/07/01/issuing-our-first-ip-address-certificate and https://letsencrypt.org/2025/01/16/6-day-and-ip-certs

The missed opportunity: If short-lived IP certs were required by policy, this incident would have already expired, reducing exposure from months to days. Baseline Requirements currently do not differentiate IP SAN certs from DNS names. If they did, such exposure could be avoided.

On intent and attribution

Attribution in WebPKI failures is notoriously difficult. The noisy, repeated issuance pattern suggests this may have been a systemic accident rather than a targeted attack. Still, intent is irrelevant. A certificate that shouldn’t exist was issued and left valid for months. Once issued, there is no way to tell what a certificate and its key were used for. The governance failure is what matters.

From an attacker’s perspective, this is even more concerning. CT logs can be mined as a reconnaissance tool to identify the weakest CAs – those with a track record of mis-issuance. An adversary doesn’t need to compromise a major CA; they only need to find a small one with poor controls, like Fina RDC 2020, that is still widely trusted in some ecosystems. That makes weak governance itself an attack surface.

Reference: https://unmitigatedrisk.com/?p=850

Root governance failures

We need risk-aware, continuous root governance:

Blast radius control for low-volume CAs
Tiered validation for high-value targets (like IP SAN certificates)
Real CT monitoring that triggers remediation
Cross-vendor accountability (one vendor’s trust shouldn’t be universal if others reject it)

I wrote on this governance model here: https://unmitigatedrisk.com/?p=923

A broader pattern: software and the WebPKI

This incident is not a standalone. It follows previous failures:

DigiNotar (2011, Netherlands): Compromised, issuing hundreds of rogue certificates – including for Google – that were used in live MITM surveillance against Iranian users. The CA collapsed and was distrusted by all major browsers.

Reference: https://unmitigatedrisk.com/?p=850

TÜRKTRUST (2013, Turkey): Issued an intermediate CA certificate later used to impersonate Google domains.

Reference: https://unmitigatedrisk.com/?p=850

ANSSI (2013, France): Government CA issued Google-impersonation certs. Browsers blocked them.

Reference: https://arstechnica.com/information-technology/2013/12/french-agency-caught-minting-ssl-certificates-impersonating-google/

Microsoft code-signing trust (2025): I documented how poor governance turned Microsoft code signing into a subversion toolchain: https://unmitigatedrisk.com/?p=1085

The pattern is clear: long-lived trust, weak oversight, repeated governance failures. And notably, most of the major real-world MITM events trace back to European CAs. That history underscores how dangerous it is when regulatory frameworks like eIDAS 2.0 mandate trust regardless of technical or governance quality.

Why eIDAS 2.0 makes it worse

The EU’s provisional agreement on eIDAS 2.0 (November 2023) mandates that browsers must trust CAs approved by EU member states, issuing QWACs (Qualified Website Authentication Certificates).

References:

EU text: https://www.europarl.europa.eu/cmsdata/278103/eIDAS-4th-column-extract.pdf
My notes on this: https://docs.google.com/document/d/1sGzaE9QTs-qorr4BTqKAe0AaGKjt5GagyEevDoavWU0/

This comes as EV certificates (part of the “OV” slice) have declined to only ~4% of issuance as tracked on https://merkle.town (OV includes EV in their charts). EV failed because it misled users. The Stripe-name example shows that easily: https://arstechnica.com/information-technology/2017/12/nope-this-isnt-the-https-validated-stripe-website-you-think-it-is/

Browsers are negotiating with the EU to separate QWAC trust from normal trust. For now, however, eIDAS still forces browsers to trust potentially weak or misbehaving CAs. This repeats the Microsoft problem but now globally and legislatively enforced.

And the oversight problem runs deeper. Under eIDAS, Conformity Assessment Bodies (CABs) are responsible for auditing and certifying qualified trust service providers. Auditors should have caught systemic mis-issuance like the 1.1.1.1 case, and so should CABs. Yet in practice many CABs are general IT auditors with limited PKI depth. If they miss problems, the entire EU framework ends up institutionalizing weak governance rather than correcting it.

A parallel exists in other ecosystems: under Entrust’s audits, roughly 3% of issued certificates are supposed to be reviewed. Because CAs select the audit samples, in theory one could envision a CA trying to hide bad practices. But the more likely situation is that auditors simply missed it. With enough audits over time, all but deliberate concealment should have brought at least one of these mis-issued certificates into scope. That points to an auditing and CAB oversight gap, not just a rogue CA.

The bigger picture

A few CAs dominate WebPKI. The long tail, though small, can create massive risk if trusted. Microsoft’s root store is broader and more passive than others. The critical issue is the number of trusted organizations, each representing a potential point of failure. Microsoft’s own list of Trusted Root Program participants includes well over one hundred CAs from dozens of countries, creating a vast and difficult-to-audit trust surface.

Reference: https://ccadb-public.secure.force.com/microsoft/IncludedCACertificateReportForMSFT

Forced or passive trust is a recipe for systemic risk.

Microsoft’s root store decisions expose its users to risks that Chrome, Firefox, and Safari users are shielded from. Yet Microsoft participates minimally in WebPKI governance. If the company wants to keep a broader and more permissive root store, it should fund the work needed to oversee it – staffing and empowering a credible root program, actively participating in CA/Browser Forum policy, and automating the monitoring Certificate Transparency logs for misuse. Without that investment, Microsoft is effectively subsidizing systemic risk for the rest of the web.

Thanks to Alex Radocea for double-checking DoH/DoT client behavior in support of this post.

Why CP and CPSs Matter More Than You Think

2 Replies

I’ve been in the PKI space for a long time, and I’ll be honest, digging through Certificate Policies (CPs) and Certification Practice Statements (CPSs) is far from my favorite task. But as tedious as they can be, these documents serve real, high-value purposes. When you approach them thoughtfully, the time you invest is anything but wasted.

What a CPS Is For

Beyond satisfying checkbox compliance, a solid CPS should:

Build trust by showing relying parties how the CA actually operates.
Guide subscribers by spelling out exactly what is required to obtain a certificate.
Clarify formats by describing certificate profiles, CRLs, and OCSP responses so relying parties know what to expect.
Enable oversight by giving auditors, root store programs, and researchers a baseline to compare against real-world issuance.

If a CPS fails at any of these, it fails in its primary mission.

Know Your Audience

A CPS is not just for auditors. It must serve subscribers who need to understand their obligations, relying parties weighing whether to trust a certificate, and developers, security researchers, and root store operators evaluating compliance and interoperability.

The best documents speak to all of these readers in clear, plain language without burying key points under mountains of boilerplate.

A useful parallel is privacy policies or terms of service documents. Some are written like dense legal contracts, full of cross-references and jargon. Others aim for informed consent and use plain language to help readers understand what they are agreeing to. CPs and CPSs should follow that second model.

Good Examples Do Exist

If you’re looking for CPS documents that get the basics right, Google Trust Services and Fastly are two strong models:

There are many ways to evaluate a CPS, but given the goals of these documents, fundamental tests of “good” would certainly include:

Scope clarity: Is it obvious which root certificates the CPS covers?
Profile fidelity: Could a reader recreate reference certificates that match what the CA actually issues?

Most CPSs fail even these basic checks. Google and Fastly pass, and their structure makes independent validation relatively straightforward. Their documentation is not just accurate, it is structured to support validation, monitoring, and trust.

Where Reality Falls Short

Unfortunately, most CPSs today don’t meet even baseline expectations. Many lack clear scope. Many don’t describe what the issued certificates will look like in a way that can be independently verified. Some fail to align with basics like RFC 3647, the framework they are supposed to follow.

Worse still, many CPS documents fail to discuss how or if they meet requirements they claim compliance with. That includes not just root program expectations, but also standards like:

Server Certificate Baseline Requirements
S/MIME Baseline Requirements
Network and Certificate System Security Requirements

These documents may not need to replicate every technical detail, but they should objectively demonstrate awareness of and alignment with these core expectations. Without that, it’s difficult to expect trust from relying parties, browsers, or anyone else depending on the CA’s integrity.

Even more concerning, many CPS documents don’t fully reflect the requirements of the root programs that grant them inclusion:

The Cost of Getting It Wrong

These failures are not theoretical. They have led to real-world consequences.

Take Bug 1962829, for example, a recent incident involving Microsoft PKI Services. “A typo” introduced during a CPS revision misstated the presence of the keyEncipherment bit in some certificates. The error made it through publication and multiple reviews, even as millions of certificates were issued under a document that contradicted actual practice.

The result? Distrust risks, revocation discussions, and a prolonged, public investigation.

The Microsoft incident reveals a deeper problem, CAs that lack proper automation between their documented policies and actual certificate issuance. This wasn’t just a documentation error, it exposed the absence of systems that would automatically catch such discrepancies before millions of certificates were issued under incorrect policies.

This isn’t an isolated case. CP and CPS “drift” from actual practices has played a role in many other compliance failures and trust decisions. This post discusses CA distrust and misissuance due to CP or CPS not matching observable reality is certainly a common factor.

Accuracy Is Non-Negotiable

Some voices in the ecosystem now suggest that when a CPS is discovered to be wrong, the answer is simply to patch the document retroactively and move on. This confirms what I have said for ages, too many CAs want the easy way out, patching documents after problems surface rather than investing in the automation and processes needed to prevent mismatches in the first place.

That approach guts the very purpose of a CPS. Making it easier for CAs to violate their commitments creates perverse incentives to avoid investing in proper compliance infrastructure.

Accountability disappears if a CA can quietly “fix” its promises after issuance. Audits lose meaning because the baseline keeps shifting. Relying-party trust erodes the moment documentation no longer reflects observable reality.

A CPS must be written by people who understand the CA’s actual issuance flow. It must be updated in lock-step with code and operational changes. And it must be amended before new types of certificates are issued. Anything less turns it into useless marketing fluff.

Make the Document Earn Its Keep

Treat the CPS as a living contract:

Write it in plain language that every audience can parse.
Tie it directly to automated linting so profile deviations are caught before issuance. Good automation makes policy violations nearly impossible; without it, even simple typos can lead to massive compliance failures.
Publish all historical versions so the version details in the document are obvious and auditable. Better yet, maintain CPS documents in a public git repository with markdown versions that make change history transparent and machine-readable.
Run every operational change through a policy-impact checklist before it reaches production.

If you expect others to trust your certificates, your public documentation must prove you deserve that trust. Done right, a CPS is one of the strongest signals of a CA’s competence and professionalism. Done wrong, or patched after the fact, it is worse than useless.

Root programs need to spend time documenting the minimum criteria that these documents must meet. Clear, measurable standards would give CAs concrete targets and make enforcement consistent across the ecosystem. Root programs that tolerate retroactive fixes inadvertently encourage CAs to cut corners on the systems and processes that would prevent these problems entirely.

CAs, meanwhile, need to ask themselves hard questions: Can someone unfamiliar with internal operations use your CPS to accomplish the goals outlined in this post? Can they understand your certificate profiles, validation procedures, and operational commitments without insider knowledge?

More importantly, CAs must design their processes around ensuring these documents are always accurate and up to date. This means implementing testing to verify that documentation actually matches reality, not just hoping it does.

The Bottom Line

CPS documents matter far more than most people think. They are not busywork. They are the public guarantee that a CA knows what it is doing and is willing to stand behind it, in advance, in writing, and in full view of the ecosystem.

Rethinking Authentication: “Something You Have,” “Something You Know,” and “Something You Are” for Workloads and Machines

1 Reply

Passwords have existed for millennia, and their weaknesses have persisted just as long. Their simplicity led to widespread adoption, but as their use expanded, so did the frequency of their abuse. To address this, we implemented stricter password policies—longer lengths, special characters, regular changes—much like hiding vulnerable software behind firewalls. When these efforts fell short, we evolved to multi-factor authentication (MFA), introducing the principles of “Something You Have,” “Something You Know,” and “Something You Are.”

MFA brought its own challenges—deployment complexity and user friction. Password managers helped bridge some gaps by generating and storing random passwords for us. While each of these steps enhanced security, none addressed the core problem: passwords and shared secrets are fundamentally flawed authenticators. This realization spurred efforts like WebAuthn, FIDO, and Passkeys, which replaced passwords with cryptographic keys and secure protocols, eliminating shared secrets entirely.

However, while user authentication evolved, workload and machine authentication lagged behind. Instead of passwords, workloads relied on API keys—essentially shared passwords—managed through “password managers” rebranded as secret vaults. These shared secrets are just as fragile and inadequate for today’s complex, scaled environments as passwords were for users.

The path forward is clear: workloads and machines need their own authentication revolution. We must replace shared secrets with cryptographic keys and implement MFA for workloads. But what does machine-focused MFA look like? Let’s explore how the three fundamental authentication factors apply to workloads and machines.

Applying Authentication Factors to Workloads and Machines

1. Something the Workload Has

This encompasses physical or cryptographic elements unique to the workload:

Hardware Roots of Trust: Security processors like TPM, Microsoft Pluton, Google Titan, and Apple’s Secure Enclave provide tamper-resistant foundations for device identity and posture assessment.
Cryptographic Keys: Private keys secured within hardware security processors serve as a robust “something you have,” enabling strong authentication.
Credentials: Like OIDC tokens and X.509 certificates, uniquely identify machines and workloads within trusted environments.

These mechanisms form the backbone of secure workload authentication, much like physical security tokens do for human users.

2. Something the Workload Knows

This parallels knowledge-based authentication but focuses on workload-specific secrets:

Shared Secrets: Shared API keys, symmetric keys, and asymmetric credentials that are used for authentication.
Configuration Data: Runtime-specific information like environment configuration.

Although often necessary for a service’s functionality, these weak attributes are highly susceptible to exposure, reuse, and theft. Implementing credentialing systems like SPIFFE can significantly mitigate these risks by replacing shared secrets with cryptographically secure, short-lived credentials uniquely tailored to each workload.

3. Something the Workload Is

This represents inherent characteristics of the workload, similar to human biometrics:

Trusted Execution Environments (TEEs): Secure enclaves like Intel SGX or AWS Nitro verify and attest to the integrity of the workload’s execution environment.
Immutable Code or Container Hashes: Binary or container image hashes verify workload integrity.
Runtime Attestation: Environmental and configuration validation ensures compliance with security policy.
POSIX Process Names and Metadata: Process information and runtime metadata provide operational context.

By combining these attributes, workloads can demonstrate their role and environment to enable more contextual identification and authorization.

The Future of Workload Identity

Authentication factors vary in strength—cryptographic keys and runtime attestations can provide strong confidence in what you are talking to while process names and secrets offer weaker assurance. Combining these elements creates a more comprehensive picture of workload authenticity. Standards like SPIFFE leverage this combination, creating strong workload identities by incorporating hardware roots of trust, runtime attestations, and other security properties. Over time, these attestations can be enriched with supply chain provenance, vulnerability assessments, and other compliance data.

As we look further ahead to agentic workloads and AI systems, we will need to develop protocols and policies that enable us to consider both workload identity and the entities they represent. For example, an AI model handling financial transactions needs both verified workload identity and specific policies for each user it serves so that the agent does not become a way to escalate privileges and access data that would otherwise have been unreachable.

This evolving, layered approach ensures that workload authentication scales with increasingly complex machine ecosystems while maintaining security and accountability. By unifying identity, actions, and policies, we prepare for a future of autonomous yet accountable workloads.

When Words Mislead: Cybersecurity’s Terminology Problem

Leave a reply

At Black Hat this year, I did my usual walk around the vendor floor. I talked to lots of companies about their products. One thing that stood out to me is vendors either by accident or on purpose are redefining terms in a way that does harm. One vendor in particular was calling “bearer tokens” “attestations” in both their marketing and product documentation. Let’s use this as an example and break down why this matters.

What’s an attestation?

An attestation is when someone says something is true and puts their reputation behind that statement. It’s like when your friend vouches for you at a new job. A good technology example is a TPM attestation. The TPM in your computer can prove certain things about how your machine started up. When we trust the chip’s design and the company that made it, we can believe what it tells us.

What’s a claim?

A claim is just something someone says. It might be true, but there’s no proof. If I tell you my name is Ryan Hurst, that’s a claim. I haven’t shown you my ID or anything. Claims can also be about other people or things. If I say Cloudflare is safe to use, that’s just my opinion unless I back it up with something.

What’s a bearer token?

A bearer token is like a temporary password. It’s a secret that proves who you are to a service. Anyone who has the token can pretend to be you. We use them because they’re necessary, but we try to limit their use in modern systems.

You usually get a bearer token by trading in something more permanent, like an API key, which is essentially a long-lived password. It’s like swapping a house key for a hotel room key. The hotel key only works for a short time, but anyone who finds it can get into your room.

Why does any of this matter?

When companies use the wrong terms to explain what they do it can lead people to make bad security choices. For example, If you hear a vendor say their system relies on bearer tokens and then you do a search on the term, you’ll find experts talking about their risks and how to manage them. But if you search for attestations, you’ll find different info about how they help prove things are reliable, trustworthy or factual.

If a company selling security products tells you it does one thing, but it does another, it’s a bad sign. They either have some technical debt buried in the design that may have a negative impact, don’t know what they’re talking about, or they’re trying to confuse you. Either way, you might want to look at other options.

So, when you’re buying security products for your company, pay attention to how vendors use words. If they’re changing the meaning of important terms, be careful. It could mean you’re not getting what you think you are.

Navigating Content Authentication In the Age of Generative AI

Leave a reply

In 1995, SSL was introduced, and it took 21 years for 40% of web traffic to become encrypted. This rate changed dramatically in 2016 with Let’s Encrypt and the adoption of ACME, leading to an exponential increase in TLS usage. In the next 8 years, adoption nearly reached 100% of web traffic. Two main factors contributed to this shift: first, a heightened awareness of security risks due to high-profile data breaches and government surveillance, creating a demand for better security. Second, ACME made obtaining and maintaining TLS certificates much easier.

Similarly, around 2020, the SolarWinds incident highlighted the issue of software supply chain security. This, among other factors, led to an increase in the adoption of code signing technologies, an approach that has been in use at least since 1995 when Microsoft used this approach to help deal with the problem of authenticity as we shifted away from CDs and floppy disks to network-based distributions of software. However, the complexity and cost of using code signing severely limited its widespread use, and where it was used, thanks to poor tooling, key compromises often led to a failure for most deployments to achieve the promised security properties. Decades later, projects like Binary Transparency started popping up and, thanks to the SolarWinds incident, projects that spun out of that like Go ChecksumDB, SigStore, and SigSum projects led to more usage of code signing.

Though the EU’s digital signature laws in 1999 specified a strong preference for cryptographic-based document signing technologies, their adoption was very limited, in part due to the difficulty of using the associated solutions. In the US, the lack of a mandate for cryptographic signatures also resulted in an even more limited adoption of this more secure approach to signing documents and instead relied on font-based signatures. However, during the COVID-19 pandemic, things started changing; in particular, most states adopted remote online notary laws, mandating the use of cryptographic signatures which quickly accelerated the adoption of this capability.

The next shift in this story started around 2022 when generative AI began to take off like no other technology in my lifetime. This resulted in a rush to create tools to detect this generated content but, as I mentioned in previous posts [1,2], this is at best an arms race and more practically intractable on a moderate to long-term timeline.

So, where does this take us? If we take a step back, what we see is that societally we are now seeing an increased awareness of the need to authenticate digital artifacts’ integrity and origin, just like we saw with the need for encryption a decade ago. In part, this is why we already see content authentication initiatives and discussions, geared for different artifact types like documents, pictures, videos, code, web applications, and others. What is not talked about much is that each of these use cases often involves solving the same core problems, such as:

Verifying entitlement to acquire the keys and credentials to be used to prove integrity and origin.
Managing the logical and physical security of the keys and associated credentials.
Managing the lifecycle of the keys and credentials.
Enabling the sharing of credentials and keys across the teams that are responsible for the objects in question.
Making the usage of these keys and credentials usable by machines and integrating naturally into existing workflows.

This problem domain is particularly timely in that the rapid growth of generative AI has raised the question for the common technology user — How can I tell if this is real or not? The answer, unfortunately, will not be in detecting the fakes, because of generative AIs ability to create content that is indistinguishable from human-generated work, rather, it will become evident that organizations will need to adopt practices, across all modalities of content, to not only sign these objects but also make verifying them easy so these questions can be answered by everyday users.

This is likely to be accelerated once the ongoing shifts take place in the context of software and service liability for meeting security basics. All of this seems to suggest we will see broader adoption of these content authentication techniques over the next decade if the right tools and services are developed to make adoption, usage, and management easy.

While no crystal ball can tell us for sure what the progression will look like, it seems not only plausible but necessary in this increasingly digital world where the lines between real and synthetic content continue to blur that this will be the case.

Update: Just saw this while checking out my feed on X and it seems quite timely 🙂

Tenement Farming and Cloud HSMs

Leave a reply

While it’s fair to say that using a Cloud HSM means your keys are protected by a device meeting FIPS 140-3 standards, assuming the HSM in use has this certification, it’s important to realize this doesn’t guarantee the security you might expect. The security model of HSMs was built for the threats of the 1980s. These devices were not network-connected and were single-tenant — if they were “online” it was usually via HSMs attached to physical computers running an application on a machine connected to private networks — not connected to a globally reachable endpoint.

At their core, these devices were designed to protect keys from physical theft, more precisely, to slow down and increase the cost of theft, much like safe ratings (UL TL-15, TL-30, TL30x6) indicate for how effective the associated safes are. For example, early in my career, I worked on a project where we built attacks to extract non-exportable keys from a specific HSM and then imported them into another vendor’s HSM because the prior vendor went out of business. There have also been a number of key exfiltration bugs in these devices over the years as well.

We didn’t see network-connected HSMs until around 1999, but even then, these devices were single-tenant, essentially just a network-connected Linux or BSD box containing fundamentally the same hardware as years earlier. While this change did allow a single company to share an HSM across different application workloads, the assumption was still that this HSM was managed by the company in charge of all of these applications.

Why is this important today? Most computing is now done in shared cloud infrastructure, administered by someone else, with your competitor or an attacker on the same hardware as you. This presents a very different set of security considerations and design constraints than these devices were originally built. You are now exposed to the risks of the physical and logical administrators of these Cloud HSMs, the services they are dependent on, as well as other tenants of the Cloud HSM.

Consider that the compute operator usually can technically access the handle the application uses to talk to the HSM, and likely the secret used to authenticate to this HSM as well, meaning they, or an attacker, could potentially use that handle, or secret, to sign, or decrypt data as they wish. You might find that an acceptable risk, but did you know some HSMs allow the administrator to blindly add users as operators to the “virtual HSMs” within it? Yup, they do.

What about when keys are stored in a KMS and the key policy dictates the key be managed by an HSM? If the HSM hardware attests that the key is stored in the HSM, and this attestation is verified, it’s nearly the same threat profile we just discussed. In some cases, it could be argued it is better because access to the HSM can have traditional user and service RBAC controls, and rate limiting, and keys can be replicated to many other HSMs without any administrative burden for you, keeping you safe from a common disaster recovery scenario while normalizing the management of these devices so it fits into your normal operational practices which hopefully are well managed and monitored.

Regardless of the approach, the bigger question is whether your provider’s operational and security practices are up to your specific threat model. Imagine a Bitcoin wallet worth 100 million dollars. Has your cloud provider proportionally invested enough into controls and tests around their system to prevent a motivated attacker from using your key to sign a transaction that moves all that to another wallet? Probably not.

The fundamental issue is that today’s HSMs were largely designed for different eras with different security concerns than we typically have today, mainly to protect against physical theft of keys in environments where data centers were effectively closets in dedicated office space. That doesn’t reflect today’s cloud computing scale.

It is worth noting that there are a few HSM solutions on the market that are making efforts to tackle some of these issues, but they still fall short but that is a topic for another post.

In essence, Cloud HSMs are to HSMs what Tenement Farming is to Farming.

That’s not to say there’s no value in these offerings, but as built today, they often fail to deliver the value they are assumed to deliver. And if regulations mandated their use before, say, 2010, chances are they’re not delivering the intended value that those regulations had in mind.

So, how should we be protecting keys now?

To be clear, this is not a case against Cloud HSMs, it is an argument to think about the threat model and use case you are solving for — for example if we look at Storm-0558 where Microsoft appears to have been using the private key material in the process of their IDP, the attacker was able to get a memory dump to be created and then via another attack vector gather the memory dump, and as a result the private key, we can take away at least one solid lesson. Do not load keys into the process of the applications that rely on them. In this case, the least costly method to have prevented this key theft would be simply moving the key to another process running in another user context with a very simple API that is easy to defend and can at least limit the attacker to a handle vs. what happened in this case where the attacker was able to use the key with impunity for years. This approach is the rough equivalent of a workload or node-specific software HSM similar in spirit to the original HSMs.

Another common problem we see in the industry is, that solutions like Hashicorp vault were designed to centralize key management and provide a one-size-fits-all answer to “Where do I keep my secrets?” Architecturally these solutions look much like a passively encrypted database, if you have sufficient permissions you can read the key in the clear and then copy it to whatever node or workload needs to use the key. This took us from secret sprawl to secret spray where we pushed the keys out in environment variables and files on production machines that later get dumped into logs, and backups, continuously exposing the keys to users who should have never had access, and often leaving key remnants all over the place. This is only marginally better than checking keys into dedicated source control repositories.

The problem here isn’t limited to these secret sprawl solutions, considering that almost every web-server TLS private key is sitting in the file system often with weak ACLs without any encryption which is then loaded into memory on that web server in the process. Similarly, most SSH keys are also sitting in some file, usually with a poor ACL, with a key either in the clear or with an easily grindable password so a malicious actor that gains read access to the file system is all that is needed to walk away with the key, for example, see this incident from last week.

In both of these cases, we would be much better off if we would move these keys into another user context that is more defensible and constrained.

So how did we end here with such abysmal practices for managing keys?

While there is seldom one single reason to see such neglect, in this case, I think one of the largest is the dogmatic “all keys must be kept in HSMs or smart cards”. It is just too easy of a get-out-of-jail-free card for a security professional. Instead of thinking about the real risks and operational practices and then designing strategies to mitigate those threats that are practical and appropriate people who can afford to complete that checkbox do and those who can not just copy keys around in the clear out of a database.

The reality is we can do a lot better but as they say, the first step is to accept that you have a problem.

In short, as security professionals we need to avoid dogmatic answers to complex questions and spend the time to look more critically at the risks, constraints, obligations, resources, and real-world scenarios those we work with are operating within before we throw generic playbook answers to those coming to us for advice.

Effortless Certificate Lifecycle Management for S/MIME

2 Replies

In September 2023, the SMIME Baseline Requirements (BRs) officially became a requirement for Certificate Authorities (CAs) issuing S/MIME certificates (for more details, visit CA/Browser Forum S/MIME BRs).

The definition of these BRs served two main purposes. Firstly, they established a standard profile for CAs to follow when issuing S/MIME certificates. Secondly, they detailed the necessary steps for validating each certificate, ensuring a consistent level of validation was performed by each CA.

One of the new validation methods introduced permits mail server operators to verify a user’s control over their mailbox. Considering that these services have ownership and control over the email addresses, it seems only logical for them to be able to do domain control verification on behalf of their users since they could bypass any individual domain control challenge anyway. This approach resembles the HTTP-01 validation used in ACME (RFC 8555), where the server effectively ‘stands in’ for the user, just as a website does for its domain.

Another new validation method involves delegating the verification of email addresses through domain control, using any approved TLS domain control methods. Though all domain control methods are allowed for in TLS certificates as supported its easiest to think of the DNS-01 method in ACME here. Again the idea here is straightforward: if someone can modify a domain’s TXT record, they can also change MX records or other DNS settings. So, giving them this authority suggests they should reasonably be able to handle certificate issuance.

Note: If you have concerns about these two realities, it’s a good idea to take a few steps. First, ensure that you trust everyone who administers your DNS and make sure it is securely locked down.

To control the issuance of S/MIME certificates and prevent unauthorized issuance, the Certification Authority Authorization (CAA) record can be used. Originally developed for TLS, its recently been enhanced to include S/MIME (Read more about CAA and S/MIME).

Here’s how you can create a CAA record for S/MIME: Suppose an organization, let’s call it ‘ExampleCo’, decides to permit only a specific CA, ‘ExampleCA’, to issue S/MIME certificates for its domain ‘example.com’. The CAA record in their DNS would look like this:

example.com. IN CAA 0 smimeemail "ExampleCA.com"

This configuration ensures that only ‘ExampleCA.com’ can issue S/MIME certificates for ‘example.com’, significantly bolstering the organization’s digital security.

If you wanted to stop any CA from issuing a S/MIME certificate you would create a record that looks like this:

example.com. IN CAA 0 issuemail ";"

Another new concept introduced in this round of changes is a new concept called an account identifier in the latest CAA specification. This feature allows a CA to link the authorization to issue certificates to a specific account within their system. For instance:

example.com. IN CAA 0 issue "ca-example.com; account=12345"

This means that ‘ca-example.com’ can issue certificates for ‘example.com’, but only under the account number 12345.

This opens up interesting possibilities, such as delegating certificate management for S/MIME or CDNs to third parties. Imagine a scenario where a browser plugin, is produced and managed by a SaaS on behalf of the organization deploying S/MIME. This plug-in takes care of the initial enrollment, certificate lifecycle management, and S/MIME implementation acting as a sort of S/MIME CDN.

This new capability, merging third-party delegation with specific account control, was not feasible until now. It represents a new way for organizations to outsource the acquisition and management of S/MIME certificates, simplifying processes for both end-users and the organizations themselves.

To the best of my knowledge, no one is using this approach yet, and although there is no requirement yet to enforce CAA for SMIME it is in the works. Regardless the RFC has been standardized for a few months now but despite that, I bet that CAs that were issuing S/MIME certificates before this new CAA RFC was released are not respecting the CAA record yet even though they should be. If you are a security researcher and have spare time that’s probably a worthwhile area to poke around 😉

The Rise of Key Transparency and Its Potential Future in Email Security

2 Replies

Key Transparency has slowly become a crucial part of providing truly secure end-to-end encrypted messaging. Don’t believe me? The two largest providers of messaging services, Apple and Facebook (along with their WhatsApp service), have openly adopted it, and I am hopeful that Google, one of its early advocates, will follow suit.

Google has been a pioneer in prototyping Key Transparency at a large scale. Read more about Google’s Key Transparency Project.
Apple has successfully deployed Key Transparency into production. Learn about Apple’s advancements in user security.
Facebook and WhatsApp have also implemented Key Transparency into their production systems. Discover how WhatsApp has utilized Key Transparency.

At the same time, we are on the precipice of interoperable group messaging as Messaging Layer Security (MLS) was recently standardized. Its contributors included representatives from employees of the mentioned services and more, which suggests they may adopt it eventually. What does this have to do with Key Transparency? It acknowledges the need for secure, privacy-preserving key discovery through its inclusion of Key Transparency in its architecture.

It’s also noteworthy to see that Apple has agreed to support RCS, Android’s messaging protocol. While there is no public hint of this yet, it’s possible that since they have positioned themselves as privacy champions over the last decade frequently toting their end-to-end encryption, we may see them push for MLS to be adopted within RCS, which could net the world its first interoperable cross-network messaging with end-to-end encryption, and that would need a key discovery mechanism.

In that spirit, recently the Internet Engineering Task Force (IETF) has established a Working Group on Key Transparency, and based on the participation in that group, it seems likely we will see some standardization around how to do Key Transparency in the future.

What’s Next for Key Transparency Adoption Then?

I suspect the focus now shifts to S/MIME, a standard for public key encryption and signing of emails. Why? Well, over the last several years, the CA/Browser Forum adopted Baseline Requirements (BRs) for S/MIME to help facilitate uniform and interoperable S/MIME, and those became effective on September 1, 2023 – this means CAs that issue these certificates will need to conform to those new standards.

At the same time, both Google and Microsoft have made strides in their implementations of S/MIME for their webmail offerings.

In short, despite my reservations about S/MIME due to its inability to address certain security challenges (such as metadata confidentiality, etc), it looks like it’s witnessing a resurgence, particularly fueled by government contracts. But does it deliver on the promise today? In some narrow use cases like mail signing or closed ecosystem deployments of encrypted mail where all participants are part of the same deployment, it is probably fair to say yes.

With that said, mail is largely about interoperable communications, and for that to work with encrypted S/MIME, we will need to establish a standard way for organizations and end-users to discover the right keys to use with a recipient outside of their organization. This is where Key Transparency would fit in.

Key Transparency and S/MIME

Today, it is technically possible for two users to exchange certificates via S/MIME, enabling them to communicate through encrypted emails. However, the process is quite awkward and non-intuitive. How does it work? You either provide the certificate out of band to those in the mail exchange, and they add it to their contact, or some user agents automatically use the keys associated with S/MIME signatures from your contacts to populate the recipient’s keys.

This approach is not ideal for several reasons beyond usability. For instance, I regularly read emails across three devices, and the private keys used for signing may not be the same on each device. Since consistent signing across devices isn’t required, if I send you an email from my phone and then you send me an encrypted message that I try to open on my desktop, it won’t open.

Similarly, if I roll over my key to a new one because it was compromised or lost, we would need to go through this certificate distribution workflow again. While Key Transparency doesn’t solve all the S/MIME-related problems, it does provide a way to discover keys without the cumbersome process, and at the same time, it allows recipients to know all of my active and published certificates, not just the last one they saw.

One of the common naive solutions to this problem is to have a public directory of keys like what was used for PGP. However, such an approach often becomes a directory for spammers. Beyond that, you have the problem of discovering which directory to use with which certificate. The above Key Transparency implementations are all inspired by the CONIKS work, which has an answer to this through the use of a Verifiable Random Function (VRF). The use of the VRF in CONIKS keeps users’ email addresses (or other identifiers) private. When a user queries the directory for a key, the VRF is used to generate a unique, deterministic output for each input (i.e., the user’s email). This output is known only to the directory and the user, preserving privacy.

The generic identifier-based approach in Key Transparency means it can neatly address the issue of S/MIME certificate discovery. The question then becomes, how does the sender discover the Key Transparency server?

Key Transparency Service Discovery

The answer to that question probably involves DNS resource records (RRs). We use DNS every day to connect domain names with IP addresses. It also helps us find services linked to a domain. For instance, this is how your email server is located. DNS has a general tool, known as an SRV record, which is designed to find other services. This tool would work well for discovering the services we’re discussing.

_sm._keytransparency._https.example.com. 3600 IN SRV 10 5 443 sm-kt.example.com.

In this example, _sm the identifier is placed before _keytransparency. and _https shows that this SRV record is specifically for a Key Transparency service for Secure Messaging. This allows someone to ask DNS for a S/MIME-specific Key Transparency service. It also means we can add other types of identifiers later, allowing for the discovery of various KT services for the same domain.

Conclusion

While S/MIME faces many challenges, such as key roaming, message re-encrypting on key rollover, and cipher suite discoverability, before it becomes easy to use and widely adopted—not to mention whether major mail services will invest enough in this technology to make it work—there’s potential for a directory based on Key Transparency if they do.

Hopefully, the adoption of Key Transparency will happen if this investment in S/MIME continues, as it’s the only large-scale discovery service for user keys we’ve seen in practice. Unlike other alternatives, it’s both privacy-respecting and transparently verifiable, which are important qualities in today’s world. Only time will tell, though.

Exploring the Potential of Domain Control Notaries for MPDV in WebPKI

Leave a reply

In an earlier post on the Role of Multiple Perspective Domain Control Validation (MPDV) in the WebPKI, I discussed how there was an opportunity for CAs to work together to reduce the cost of meeting the upcoming requirements while also improving the security of the ultimate solution.

In this post, I want to talk a little about an idea I have been discussing for a while: specifically, Domain Control Notaries.

Before I explain the idea, let’s first look at how domain control verification (MPDV) happens. In short, the CA generates a random number and asks the certificate requestor to place that number in a location that only an administrator can access, for example, in a DNS record, in part of the TLS exchange, or in a well-known location.

If the CA is able to fetch the number, and the underlying network is living up to its promises, it can have confidence that the requestor is likely authorized for the given domain.

To understand Domain Control Notaries you will also need to have a basic understanding of what MPDV means. Thankfully, that’s easy: do that from multiple network locations and require a quorum policy to be met. This basically means that an attacker would have to tick enough of the quorum participants to bypass the quorum policy.

So that takes us back to the idea of Domain Control Notaries. As you can see, Domain Control Verification is super simple. That means you could put it on a very small computer, and it could be able to perform this simple task. For example, imagine a USB Armory that was fused at manufacturing time with firmware that only did these domain control checks. That this hardware also had a cryptographically unique key derived from the hardware fused to the device at manufacturing time.

Now imagine an aggregator handling admissions control for a network of these Domain Control Notaries. This aggregator would only allow devices that were manufactured to meet these basic requirements. It would enforce this because the manufacturer would publish a list of the public keys of each of the devices to the aggregator, which would use for admission control.

This aggregator would expose a simple REST API that would let the requestor specify some basic policy on how many of these Domain Control Notaries to broadcast their request to, what domain control methods to use, and a callback URL to be used by the aggregator when the verification is complete.

These aggregators would only accept responses from the Domain Control Notaries that signed their responses and whose keys were on this authorized list and were not added to their deny lists.

This basic framework sets you up with a network of very cheap network endpoints that can be used to perform domain control verification. You could even have a few of these aggregators each with its own Domain Control Notaries. CAs could use multiple of these aggregator networks to reduce centralization risk.

You might be asking yourself how these tiny computers could deal with the scale and performance of this task! The reality is that in the grand scheme of things, the WebPKI is relatively small! It is responsible for only 257,035 certificates every hour. The real number is actually smaller than this too because that includes some pre-certificates and in the context of domain control verification. CAs are also allowed to do some re-use of past validations if recent enough. This means we should be able to use this as a worst-case number safely. Simply put, that is only 1.18 certificates every second. That is tiny. If you spread that out over a few hundred Domain Control Notaries, the number of transactions gets that much smaller.

Then there is the question of who would run these Domain Control Notaries? A lesson learned from Certificate Transparency is that if you make it easy and cheap to participate and make it easy to both come and go, most organizations are willing to help. You’re basically asking an operator to provide a few lightbulbs of electricity and a tiny amount of network connectivity. This is easy for most organizations to sign up for since there is no tax in turning it down, and no impact if there is an outage.

Remember how I said there could be multiple aggregators? These aggregators could also expose more expensive heavy-weight nodes that were not reliant on Domain Control Notaries to provide a more reliable substrate to power such a network.

That’s Domain Control Notaries. Maybe they can be a tool to help us get to this world of MPDV everywhere.

Why crawling is not an adequate measurement methodology for the WebPKI

Leave a reply

The answer is simple — It’s an incomplete view of the use of the WebPKI.

There are a number of different methodologies a web crawler-based approach might take in measuring the size of the WebPKI. The most naive approach would be to simply scan all IPv4 address space and log all of the certificates you see during this scan.

The problem is that this only shows a small fraction of the certificates that are out there. When you connect to an IP address and the associated web server doesn’t know what host you are trying to connect to it will return its “default” website and use the associated certificate.

That same IP address may literally be responsible for serving millions of sites based on the client’s indicated hostname. With this IP-based enumeration approach at best you would get one certificate from that host, at worse you wouldn’t even get that because some servers are not configured with a default site. This is just one problem with this approach there are many more.

Though most WebPKI market share reports do not document their methodology anecdotally it appears most work on this crawler approach and at least historically some have taken periodic drops from CAs to make their view “more complete”.

Today though the only way to measure CA market share that should be used is by relying on the pre-certificate counts in Certificate Transparency logs.