Domain Control Validation Grew Up. It Only Took Thirty Years.

Let’s Encrypt announced DNS-PERSIST-01 support this week. That is worth noting on its own. But the announcement landed in a way that made me want to trace the longer arc, because what DNS-PERSIST-01 represents is not just a new ACME method. It is the last piece of a transition that took the ecosystem roughly three decades to complete.

That transition was simple in concept and genuinely hard in practice. Stop guessing who answers the phone and start proving who controls the namespace.

What “domain control validation” actually meant in the early days

If you were issuing or auditing certificates in the early web era, domain control validation was less a cryptographic proof than an act of institutional faith. The certificate authority (CA) would send a challenge to webmaster@, admin@, or hostmaster@ at the subject domain, or sometimes look up a fax number in WHOIS and send something there. If a human responded, the certificate got issued.

The model made a bet, a bet that there was a stable, security-relevant human role behind each domain, reachable through a stable channel, and that the person on the other end was both authorized and paying attention.

That bet was always shakier than it looked. What actually happened over time was that the alias went to a ticketing system, or an outsourcer, or a shared mailbox that someone forgot to audit, or just the wrong person entirely. The certificate still got issued. The CA had checked the box. No one had actually verified control of anything.

The worst failures in this period were not exotic cryptographic breaks. They were governance failures and operational drift. The “webmaster takeover” class of problem. The role stopped being real long before the method stopped being allowed. The Baseline Requirements, the industry rules governing what certificate authorities are allowed to do, carried these validation approaches forward because nobody volunteers to own the deprecation, and someone always depends on the thing you want to kill. SC-080 and SC-090 are essentially the CA/Browser Forum (CABF) writing down, in balloted form, what practitioners had already known for years: that being reachable at a business address does not demonstrate domain control.

The thing that made the real fix possible

It is easy to look at ACME, the protocol that powers automated certificate issuance, and treat it as a purely technical improvement. It was. But the reason it became viable as a default assumption had as much to do with deployment reality as with protocol design.

In 2014, roughly 30% of web traffic was HTTPS. Mozilla telemetry puts it above 80% globally by late 2024, with North America around 97%. Chrome’s numbers show the same shape, climbing from the low 30s in 2015 to 95-99% by 2020 and plateauing since.

That matters because ACME’s endpoint-based methods depend on actually reaching the endpoint. HTTP-01 proves control by serving a signed token over HTTP at a well-known path on port 80. TLS-ALPN-01 proves control by completing a TLS handshake on port 443 using a dedicated protocol extension and a special validation certificate, with no HTTP handling required. That distinction matters in practice; TLS-ALPN-01 exists specifically for hosting providers, CDNs, and TLS-terminating load balancers who want to validate at the TLS layer without routing validation traffic through to their backends. If port 80 is blocked or you are terminating TLS before HTTP ever reaches your application, TLS-ALPN-01 is the right tool. If you have a publicly reachable web server and port 80 is open, HTTP-01 is simpler.
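To ground the HTTP-01 mechanics, here is a minimal sketch of the responder side in Go, assuming a single pending challenge. The token and thumbprint values are illustrative placeholders, and real clients like certbot or lego manage this exchange for you.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"strings"
)

const (
	token      = "evaGxfADs6pSRb2LAv9IZf17Dt3juxGJ-PCt92wr-oA" // token from the ACME order (placeholder)
	thumbprint = "9jg46WB3rR_AHD-EBXdN7cBkH1WOu0tA3M9fm21mqTI" // account-key thumbprint (placeholder)
)

func main() {
	// Per RFC 8555: answer GET /.well-known/acme-challenge/<token> over
	// plain HTTP on port 80 with the key authorization, token "." thumbprint.
	http.HandleFunc("/.well-known/acme-challenge/", func(w http.ResponseWriter, r *http.Request) {
		got := strings.TrimPrefix(r.URL.Path, "/.well-known/acme-challenge/")
		if got != token {
			http.NotFound(w, r)
			return
		}
		fmt.Fprintf(w, "%s.%s", token, thumbprint)
	})
	log.Fatal(http.ListenAndServe(":80", nil)) // the CA must be able to reach this port
}
```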

Both are bootstrap proofs: they let you establish domain control without DNS write automation, which matters for the long tail of deployments where DNS is locked down or outsourced in ways that make safe automation difficult. In 2014, assuming you could reach a public endpoint was optimistic. By 2024, the population of sites that cannot serve a response over HTTP or TLS is small enough to be the exception. The web converged on HTTPS fast enough that endpoint-based validation became the reasonable default.

HTTP-01 is also, almost certainly, the last insecure-by-design method that will survive long term, and it will survive for structural rather than technical reasons. There is a bootstrap problem: TLS-ALPN-01 requires that TLS already be deployed and configurable at the edge, but if you are getting a certificate because you do not yet have TLS, you cannot use TLS-ALPN-01 to get it. HTTP-01 is how you break out of that loop. More durable than the bootstrap problem, though, is the org chart problem. In large organizations, the web team controls the servers, the network team owns port policies, the DNS team owns the zone, and security owns the TLS infrastructure decisions. None of them individually has the full set of permissions to deploy any other method without coordination. But the web team can serve a token file over port 80 without asking anyone. HTTP-01 wins by default, not because it is the right answer, but because it is the answer that requires the fewest cross-team conversations. That dynamic is unlikely to change, which means HTTP-01 will probably remain the method of last resort indefinitely, insecure channel and all.

DNS-01, and why scale broke it

DNS-01 changed the question from “who answers this email” to “who can write to this DNS zone.” That is a meaningfully better question. DNS is not a signal that you control the domain. It is the domain.

The operational reality, though, is that DNS automation means DNS API credentials distributed across issuance pipelines, renewal workflows, and whatever tooling you are running at the edge. At modest scale that is manageable. At high volume, across large platforms, IoT deployments, and multi-tenant environments, the recurring DNS write per renewal starts to look like both a performance constraint and a credential sprawl problem.

The CNAME delegation pattern that became common was a partial answer: point _acme-challenge.<domain> at a zone you control more tightly, and do the proof there. It worked. It also created a new problem: multiple independent solvers fighting over a shared label.
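Here is a sketch of what a DNS-01 solver actually publishes, and where the delegation pattern redirects it. The names and values are illustrative placeholders; the digest construction follows RFC 8555.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

func main() {
	// Key authorization: token from the order + "." + account-key thumbprint.
	keyAuth := "TOKEN.THUMBPRINT" // placeholders
	sum := sha256.Sum256([]byte(keyAuth))
	txt := base64.RawURLEncoding.EncodeToString(sum[:]) // RFC 8555 §8.4: base64url(SHA-256(keyAuth))

	// Without delegation, each renewal writes this into the domain's own zone:
	fmt.Printf("_acme-challenge.example.com. 300 IN TXT %q\n", txt)

	// With the CNAME pattern, the domain's zone carries one static record,
	//   _acme-challenge.example.com. IN CNAME _acme-challenge.acme.example.net.
	// and the per-renewal TXT write happens in the tightly controlled zone instead.
}
```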

DNS-ACCOUNT-01, which solved the CNAME collision and nothing else

DNS-ACCOUNT-01 exists to solve that specific problem. By scoping the validation label to the ACME account rather than leaving it shared, multiple delegated pipelines can coexist without colliding. Two independent issuance systems, two different cloud providers, parallel solvers during a migration. They all get their own label and can run without coordinating.

It is intentionally narrow. It does not change the underlying rhythm of fresh proof per issuance. The label is persistent, the proof is still ephemeral. A new token per order, a new DNS write per renewal. The change is only where the proof lives, so delegation can scale cleanly. DNS churn remains, because that was not the problem DNS-ACCOUNT-01 was trying to solve.
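For comparison, here is the account-scoped label construction as I read the scoped-DNS-challenges draft: the first ten bytes of the SHA-256 of the account URL, base32-encoded, become part of the challenge name. Treat the exact recipe as the draft’s to verify; the account URL is a placeholder.

```go
package main

import (
	"crypto/sha256"
	"encoding/base32"
	"fmt"
	"strings"
)

func main() {
	acct := "https://acme.example/acct/12345" // placeholder ACME account URL
	sum := sha256.Sum256([]byte(acct))
	// First 10 bytes of the hash, base32: exactly 16 characters, no padding.
	label := strings.ToLower(base32.StdEncoding.EncodeToString(sum[:10]))

	// Each account gets its own name under the domain, so independent
	// pipelines can delegate in parallel without colliding on one label.
	fmt.Printf("_%s._acme-challenge.example.com.\n", label)
}
```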

In hindsight, that narrowness reflects the world it was designed for. Certificate validity was still measured in years, then in 398 days. Renewals were infrequent enough that requiring fresh DNS proof per issuance was a manageable cost. The credential distribution problem existed, but it was not yet acute. If DNS-ACCOUNT-01 had been designed in a world where certificates expire every 47 days, which is where the CABF is now taking us, it almost certainly would have looked a lot more like DNS-PERSIST-01 from the start. That is not a criticism. You cannot see the 47-day problem from inside a 398-day world.

DNS-PERSIST-01, which the short-validity world actually requires

The CA/Browser Forum’s ongoing push to shorten maximum certificate validity, from years down to 398 days and now trending toward 47, makes the recurring-proof model increasingly painful for everyone, not just operators running at high volume. At 398-day validity, a DNS write per renewal is a minor operational cost. At 47 days, you are writing to DNS eight times a year per certificate, across every certificate in your fleet, with API credentials that have to live somewhere in that pipeline. That is not a scaling problem. That is a design problem.

The more important point is that DNS-PERSIST-01 is simply the better tool for anyone who has DNS access and a CA that supports it, regardless of volume. It subsumes what DNS-01 and DNS-ACCOUNT-01 each solve – the CNAME collision problem goes away because each account’s standing authorization is already scoped, and the credential churn problem goes away because there is no recurring write.

The useful analogy here is passwords versus passkeys. Passwords require you to re-prove the secret on every authentication. Passkeys establish a cryptographic binding once and derive proof from it. Every DNS-based ACME method before DNS-PERSIST-01 worked like a password, prove control again, on this order, right now. DNS-PERSIST-01 works like a passkey; the binding is established, scoped, and cryptographically tied to your ACME account key. You do not re-prove the same thing on every renewal. You prove you still hold the key.

Instead of proving control on every renewal cycle, you establish a standing authorization record bound to your ACME account and the CA. Set it once. Reuse it across renewals. The CABF formalized this direction in SC-088v3, which added the DNS TXT Record with Persistent Value method to the BRs.
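The exact record name, contents, and CA-side behavior are still being specified, so the following is a conceptual sketch only. The cache, the expiry, and the account binding below are hypothetical illustrations of the shape of the change, not the actual protocol.

```go
package main

import (
	"fmt"
	"time"
)

type standingAuth struct {
	domain     string
	accountURL string    // the ACME account the standing record binds to
	expires    time.Time // standing authorizations can carry expiry
}

var authCache = map[string]standingAuth{}

// validateOnce models the expensive step: one DNS lookup when the standing
// record is first seen, instead of one DNS write per renewal.
func validateOnce(domain, accountURL string) {
	authCache[domain] = standingAuth{domain, accountURL, time.Now().AddDate(1, 0, 0)}
}

// renew models the cheap steady state: no DNS write, just proof that the
// requester still holds the bound ACME account key (the signed request).
func renew(domain, accountURL string) bool {
	a, ok := authCache[domain]
	return ok && a.accountURL == accountURL && time.Now().Before(a.expires)
}

func main() {
	validateOnce("example.com", "https://acme.example/acct/123")
	fmt.Println(renew("example.com", "https://acme.example/acct/123")) // true, no DNS churn
}
```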

This is not a shortcut. The standing authorization is scoped, can carry expiration, and is explicitly tied to an ACME account key. The attack surface moves from the repeated DNS transaction to the account key itself, which is the right place for it. That is why Let’s Encrypt is being deliberate: Pebble (the reference ACME test server) support is in place, client support is in progress, and the staged rollout is planned for 2026. The scope controls around wildcard policy and authorization lifetime are part of the design, not afterthoughts.

What it eliminates is the recurring DNS write requirement that turned high-volume issuance into a credential distribution problem. In a world trending toward 47-day certificates, that is not a nice-to-have. It is the method that makes the new validity regime operationally survivable for anyone running at real scale.

What actually changed

The webmaster era died because the webmaster role died. The person who answered webmaster@ in 1995 was plausibly the person responsible for the domain. By 2010, that alias might go anywhere. By 2020, it was a cassette tape. Technically still a format, functionally forgotten.

This is the same pattern that gave us a decade of SIM-swapping attacks. SMS was a convenient channel, so the industry conscripted it into an authentication role it was never designed for, and held it there long after the threat model had outgrown the assumption. Nobody decided email-to-webmaster or SMS were the right security primitives for what they were being asked to do. They were just there, they mostly worked, and changing them had cost. The failures were predictable in retrospect and ignored in practice until the losses became undeniable.

The ACME methods work because they measure what they claim to measure. HTTP-01 proves you can respond at the endpoint. DNS-01 proves you can write to the zone. TLS-ALPN-01 proves you can complete a handshake. Technical controls, not institutional proxies.

DNS-PERSIST-01 is the mature form of that idea, a standing proof of control that does not require re-proving the same thing every 90 days at the cost of DNS churn and credential distribution. It is also the method that answers the question the old system was never actually asking. The old system broke because standing assumptions about institutional stability turned out not to hold. The new system makes the standing assumption explicit, scoped, bound to a cryptographic identity, and revocable.

That is not the same mistake. That is the lesson applied.

Start here when choosing a method. If you cannot touch DNS and port 80 is open, HTTP-01 is the simplest path. If port 80 is blocked or you are terminating TLS before HTTP reaches your application, TLS-ALPN-01 validates at the TLS layer without touching HTTP handling. If you need wildcard coverage or your edge is not publicly reachable at all, DNS-01 is the right tool. If you are running multiple independent pipelines against the same domain and CNAME delegation is creating label collisions, DNS-ACCOUNT-01 solves that without changing anything else. And if you are renewing at volume in a world trending toward 47-day validity, DNS-PERSIST-01 is the method that does not eventually break you, not because the others are wrong, but because repeated proof per renewal was designed for a renewal cadence that no longer exists.

In practice, large organizations often find themselves in a catch-22 that makes the decision for them. TLS-ALPN-01 requires TLS to already be deployed and configurable at the edge, but you need the certificate to deploy TLS in the first place. DNS-01 requires writing to the zone, but DNS is owned by a different team, and the change process takes weeks. DNS-PERSIST-01 requires standing up ACME account management, but that is a security infrastructure decision that needs approval. Meanwhile, the web team controls the servers and can serve a token file over port 80 today. So HTTP-01 it is, not because anyone evaluated the options and chose it, but because it was the only method where a single team had all the permissions needed to complete validation without a cross-functional project. The decision tree above describes the technically correct path. The org chart usually picks a different one.

Like most security improvements, the arc from fax-based DCV to persistent cryptographic authorization took longer than it should have; the gap between knowing something is broken and replacing it is always larger than it looks from the outside. But the trajectory is now clear: domain control validation means proving control, not guessing at it.

Agents Are More Like Humans Than Workloads. Here’s Why That Matters for Identity.

This is a long one. But as a great man once said, forgive the length, I didn’t have time to write a short one.

The industry has been going back and forth on where agent identity belongs. Is it closer to workload identity (attestation, pre-enumerated trust graphs, role-bound authorization) or closer to human identity (delegation, consent, progressive trust, session scope)? The answer from my perspective is human identity. But the reason isn’t what most people think.

The usual argument goes like this. Agents exercise discretion. They interpret ambiguous input. They pick tools. They sequence actions. They surprise you. Workloads don’t do any of that. Therefore agents need human-style identity.

That argument is true but it’s not the load-bearing part. The real reason is simpler and more structural.

Think about it this way. A robot arm on an assembly line is bolted to the floor. It’s “Arm #42.” It picks up a bolt from Bin A and puts it in Hole B. If it tries to reach for Bin Z, the system shuts it down. It has no reason to ever touch Bin Z. That’s workload identity. It works because the environment is closed and architected.

Now think about a consultant hired to “fix efficiency.” They roam the entire building. They’re “Alice, acting on behalf of the CEO.” They don’t have a list of rooms they can enter. They have a badge that says “CEO’s Proxy.” When they realize the problem is in the basement, the security guard checks their badge and lets them in, even though the CEO didn’t write “Alice can go to the basement” on a list that morning. The badge isn’t unlimited access. It’s a delegation primitive combined with policy. That’s human identity. It works because the environment is open and emergent.

Agents are the consultant, not the robot arm. Workload identity is built for maps: you know the territory, you draw the routes, if a service goes off-route it’s an error. Agent identity is built for compasses: you know the destination, but the route is discovered at runtime. Our identity infrastructure needs to reflect that difference.

To be clear, I am not suggesting agents are human. This isn’t about moral equivalence, legal personhood, or anthropomorphism. It’s about principal modeling. Agents occupy a similar architectural role to humans in identity systems. Discretionary actors operating in open ecosystems under delegated authority. That’s a structural observation, not a philosophical claim.

A fair objection is that today’s agents mostly work on concrete, short-lived tasks. A coding agent fixes a bug. A support agent resolves a ticket. The autonomy they exercise is handling subtle variance within a well-defined scope, not roaming across open ecosystems making judgment calls. That’s true, and in those cases the workload identity model is a reasonable fit.

But the majority of the value everyone is chasing accrues when agents can act for longer periods of time on more open-ended problems. Investigate why this system is slow. Manage this compliance process. Coordinate across these teams to ship this feature. And the longer an agent runs, the more likely it is to need permissions beyond what anyone anticipated at the start. That’s the nature of open-ended work.

The longer the horizon and the more open the problem space, the more the identity challenges described here become real engineering constraints rather than theoretical concerns. What follows is increasingly true as agents move in that direction, and every serious investment in agent capability is pushing them there.

Workload Identity Was Built for Closed Ecosystems

Think about how workload identity actually works in practice. You know which services are in your infrastructure. You know which service talks to which service. You pre-provision the credentials or you set up attestation so that the right code running in the right environment gets the right identity at boot time. SPIFFE loosened some of the static parts with dynamic attestation, but the mental model is still the same: I know what’s in my infrastructure, and I’m issuing identity to things I control.

That model works because workloads operate in closed ecosystems. Your Kubernetes cluster. Your cloud account. Your service mesh. The set of actors is known. The trust relationships are pre-defined. The identity system’s job is to verify that the thing asking for access is the thing you already decided should have access.

Agents broke that assumption.

An MCP client can talk to any server. An agent operating on your behalf might need to interact with services it was never pre-registered with. Trust relationships may be dynamic, not pre-provisioned, and the more open-ended the task the more likely that is true. The authorization decisions are contextual. Sometimes a human needs to approve what’s happening in real time. An agent might need to negotiate access to a resource that neither you nor the agent anticipated when the mission started.

None of that fits the workload model. Not because agents think or exercise judgment, but because the ecosystem they operate in is open. Workload identity was built for closed ecosystems. The more capable and autonomous agents become, the less they stay inside them.

Discovery Is the Problem Nobody Wants to Talk About

The open ecosystem problem goes deeper than just “agents interact with arbitrary services.” The whole point of an agent is to find paths you didn’t anticipate. Tell an agent “go figure out why certificate issuance is broken” and it might follow a trail from CT logs to a CA status page to vendor Slack to a three-year-old wiki page to someone’s personal notes. That path isn’t architected. It emerges from the agent reasoning about the problem.

Every existing authorization model assumes someone already enumerated what exists.

| System | Resource Space | Discovery Model | Auth Timing | Trust Model |
| --- | --- | --- | --- | --- |
| SPIFFE | Closed, architected | None, interaction graph is designed | Deploy-time | Static, identity-bound |
| OAuth | Bounded by pre-registered integrations | None, API contracts exist | Integration-time + user consent | Static after consent |
| IAM | Closed, catalogued | None, administratively maintained | Admin-time | Static, role-bound |
| Zero Trust | Bounded by inventory and policy plane | None, known endpoints | Per-request | Session-scoped, contextual |
| Browser Security | Open, unbounded | Full, arbitrary traversal | Per-request, per-capability | None, no accumulation |
| Agentic Auth (needed) | Open, task-emergent | Reasoning-driven, discovered at runtime | Continuous, intra-task | Accumulative, task-scoped |

Every model except browser security assumes a closed resource space. Browser security is the only open-space model, but it doesn’t accumulate trust. Agents need open-space discovery with accumulative trust. Nothing in the current stack does both.

Structured authorization models assume you can enumerate the paths. But enumeration kills emergence. If you have to pre-authorize every possible resource an agent might touch, you’ve pre-solved the problem space. That defeats the purpose of having an agent explore it.

The security objection here is obvious. An agent “discovering paths you didn’t anticipate” sounds a lot like lateral movement. The difference is authorization. An attacker discovers paths to exploit vulnerabilities. An agent discovers paths to find capabilities, under a delegation, subject to policy, with every step logged. The distinction only holds if the governance layer is actually doing its job. Without it, agent discovery and attacker reconnaissance are indistinguishable. That’s not an argument against discovery. It’s an argument for getting the governance layer right.

The Authorization Direction Is Inverted

Workload identity is additive. You enumerate what’s permitted. Here’s the role, here’s the scope, here’s the list of services this workload can talk to. Everything outside that list is denied.

Agents need something different. Not pure positive enumeration, but mixed constraints: here’s the goal, here’s the scope you’re operating in, here’s what’s off limits, here’s when you escalate. Access outside the defined scope isn’t default-allowed. It’s negotiable through demonstrated relevance and appropriate oversight.

That’s goal-scoped authorization with negative constraints rather than positive enumeration. And before the security people start hyperventilating, this doesn’t mean “default allow with a blacklist.” That would be insane. Nobody is proposing that.

What it actually looks like is how we scope human delegation in practice. When a company hires a consultant and says “fix our efficiency problem,” they don’t hand them a list of every room they can enter, every file they can read, every person they can talk to. They give them a badge, a scope of work, a set of boundaries (don’t access HR records, don’t make personnel decisions), escalation requirements (get approval before committing to anything over $50k), and monitoring (weekly check-ins, expense reports, audit trail). That’s not default allow. It’s delegated authority with boundaries, escalation paths, and oversight.

The constraints are a mix of positive (here’s your scope), negative (here’s what’s off limits), and procedural (here’s when you need to ask). To be fair, no deployed identity protocol fully supports this mixed-constraint model today. OAuth scopes are basically positive enumeration. RBAC is positive enumeration. Policy grammars that can express mixed constraints exist (Cedar and its derivatives can express allow, deny, and escalation rules against the same resource), but nobody has deployed them for agent governance yet.
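To make the mixed-constraint idea concrete, here is a minimal evaluator sketch with invented rule and resource types. It is not any deployed policy engine’s API, just the three verdicts applied in a defensible order: negative constraints first, procedural gates second, positive scope last.

```go
package main

import (
	"fmt"
	"strings"
)

type verdict int

const (
	deny verdict = iota
	allow
	escalate // proceed only with human or policy-engine approval
)

var names = [...]string{"deny", "allow", "escalate"}

type policy struct {
	scope       []string // positive: the badge and the scope of work
	forbidden   []string // negative: HR records, personnel decisions
	escalations []string // procedural: anything over the approval threshold
}

func (p policy) evaluate(resource string) verdict {
	for _, f := range p.forbidden { // negative constraints win outright
		if strings.HasPrefix(resource, f) {
			return deny
		}
	}
	for _, e := range p.escalations { // then the procedural gates
		if strings.HasPrefix(resource, e) {
			return escalate
		}
	}
	for _, s := range p.scope { // then the positive scope
		if strings.HasPrefix(resource, s) {
			return allow
		}
	}
	return escalate // outside scope is negotiable, never default-allowed
}

func main() {
	p := policy{
		scope:       []string{"ops/"},
		forbidden:   []string{"hr/"},
		escalations: []string{"finance/contracts/"},
	}
	fmt.Println(names[p.evaluate("ops/metrics")])          // allow
	fmt.Println(names[p.evaluate("hr/records")])           // deny
	fmt.Println(names[p.evaluate("finance/contracts/q3")]) // escalate
}
```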

The mixed-constraint approach is how we govern humans organizationally, with identity infrastructure providing one piece of it. But the human identity stack is at least oriented in this direction. It has the concepts of delegation, consent, and conditional access. The workload identity stack doesn’t even have the vocabulary for it, because it was never designed for actors that discover their own paths.

The workload model can’t support this because it was designed to enumerate. The human model is oriented toward it because humans were the first actors that needed to operate in open, unbounded problem spaces with delegated authority and loosely defined scope.

The Human Identity Stack Got Here First

The human identity stack evolved these properties because humans needed them. Delegation exists because users interact with arbitrary services and need to grant scoped authority. Federation exists because trust crosses organizational boundaries. Consent flows exist because sometimes a human needs to approve what’s happening. Progressive auth exists because different operations require different levels of assurance, though in practice it’s barely deployed because it’s hard to implement well.

That last point matters. Progressive auth has been a nice-to-have for human identity, something most organizations skip because the friction isn’t worth it for human users who can just re-authenticate. For agents, it becomes essential. The more emergent the expectations, the more you need the ability to step up trust dynamically. Agents make progressive auth a requirement, not an aspiration.

And unlike the human case, progressive auth for agents is more tractable to build. The agent proposes an action, a policy engine or human approves, the scope expands with full audit. The governance gates can be automated. The building blocks exist. The composition is the work.
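Here is a sketch of that composition, with invented types: the agent proposes, a gate (policy engine or human) decides, and every expansion lands in the audit trail. This illustrates the loop, not any particular product’s API.

```go
package main

import "fmt"

type gate func(action string) bool

type session struct {
	granted map[string]bool
	audit   []string
}

func (s *session) request(action string, approve gate) bool {
	if s.granted[action] {
		return true // already inside the accumulated scope
	}
	if approve(action) { // step-up: automated policy check or human approval
		s.granted[action] = true
		s.audit = append(s.audit, "granted: "+action)
		return true
	}
	s.audit = append(s.audit, "denied: "+action)
	return false
}

func main() {
	s := &session{granted: map[string]bool{}}
	policyEngine := func(a string) bool { return a != "drop-table" }

	fmt.Println(s.request("read-metrics", policyEngine)) // true, scope expands
	fmt.Println(s.request("drop-table", policyEngine))   // false, and logged
	fmt.Println(s.audit)
}
```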

The human stack built these primitives because humans operate in open, dynamic ecosystems. Workloads historically didn’t. Now agents do. And agents are going to force the deployment of progressive auth patterns that the human stack defined but never fully delivered on.

And you can see this playing out in real time. Every serious attempt to solve agent identity reaches for human identity concepts, not workload identity concepts. Dick Hardt built AAuth around delegation, consent, progressive trust, and token exchange. Not because those are OAuth features, but because those are the properties agents need, and the human identity stack is where they were first defined. Microsoft’s Entra Agent ID uses On-Behalf-Of flows, confidential clients, and delegation patterns. Google’s A2A protocol uses OAuth, task-based delegation, and agent cards for discovery.

You can stretch SPIFFE or WIMSE to cover simple agent automation. But once agents operate across discovered systems rather than pre-enumerated ones, the model starts to strain. That’s not because those are bad technologies. It’s because they solve a different layer. Agent auth lives above attestation, in the governance layer, and the concepts that keep showing up there, delegation, consent, session scope, progressive trust, all originate on the human side.

That’s not a coincidence. The people building the protocols are voting with their architecture, and they’re voting for the human side. They’re doing it because that’s where the right primitives already exist.

“Why Not Just Extend Workload Identity?”

The obvious counterargument is that you could start from workload identity and extend it to cover agents. It’s worth taking seriously.

SPIFFE is good technology and it works well where it fits. Cloud-native environments, Kubernetes clusters, modern service meshes. In those environments, SPIFFE’s model of dynamic attestation and identity issuance is exactly right. The problem isn’t SPIFFE. The problem is that you don’t get to change all the systems.

That’s why WIMSE exists. Not because SPIFFE failed, but because the real world has more environments than SPIFFE was designed for. Legacy systems, hybrid deployments, multi-cloud sprawl, enterprise environments that aren’t going to rearchitect around SPIFFE’s model. WIMSE is defining the broader patterns and extending the schemes to fit those other environments. That work is important and it’s still in progress.

There’s also a growing push to treat agents as non-human identities and extend workload identity with agent-specific attributes. Ephemeral provisioning, delegation chains, behavioral monitoring. The idea is that agents are just advanced NHIs, so you start from the workload stack and bolt on what’s missing. I understand the appeal. It lets you build on existing infrastructure without rethinking the model.

But what you end up bolting on is delegation, consent, session scope, and progressive trust. Those aren’t workload identity concepts being extended. Those are human identity concepts being retrofitted onto a foundation that was never designed for them. You’re starting from attestation and trying to work your way up to governance. Every concept you need to add comes from the other stack. At some point you have to ask whether you’re extending workload identity or just rebuilding human identity with extra steps.

Agent Identity Is a Governance Problem

Now apply that same logic to agents more broadly. Agents don’t operate in a world where every system speaks SPIFFE, or WIMSE, or any single workload identity protocol. They interact with whatever is out there. SaaS APIs. Legacy enterprise systems. Third-party services they discover at runtime. The environments agents operate in are even more heterogeneous than the environments WIMSE is trying to address.

And many of those systems don’t support delegation at all. They authenticate users with passwords and passkeys, and that’s it. No OBO flows, no token exchange, no scoped delegation. In those cases agents will need to fully impersonate users, authenticating with the user’s credentials as if they were the user. That’s not the ideal architecture. It’s the practical reality of a world where agents need to interact with systems that were built for humans and haven’t been updated. The identity infrastructure has to treat impersonation as a governed, auditable, revocable act rather than pretending it won’t happen.

I want to be honest about the contradiction here. The moment an agent injects Alice’s password into a legacy SaaS app, all of the governance properties this post argues for vanish. Principal-level accountability, cryptographic provenance, session-scoped delegation — none of it survives that boundary. The legacy system sees Alice. The audit log says Alice. There’s no way to distinguish Alice from an agent acting on Alice’s behalf. You can’t revoke the agent’s access without changing Alice’s password. I don’t have a good answer for that. It’s a real gap, and it will exist for as long as legacy systems do. The faster the world moves toward agent-native endpoints, the smaller this governance black hole gets. But right now it’s large.

At the same time, the world is moving toward agent-native endpoints. I’ve written before about a future where DNS SRV records sit right next to A records, one pointing at the website for humans and one pointing at an MCP endpoint for agents. That’s the direction. But identity infrastructure has to handle the full spectrum, from legacy systems that only understand passwords to native agent endpoints that support delegation and attestation natively. The spectrum will exist for a long time.

More than with humans or workloads, agent identity turns into a governance problem. Human identity is mostly about authentication. Workload identity is mostly about attestation. Agent identity is mostly about governance. Who authorized this agent. What scope was it given. Is that scope still valid. Should a human approve the next step. Can the delegation be revoked right now. Those are all governance questions, and they matter more for agents than they ever did for humans or workloads because agents act autonomously under delegated authority across systems nobody fully controls.

And unlike humans, agents possess neither liability nor common sense. A human with overly broad access still has judgment that says “this is technically allowed but clearly a bad idea” and faces personal consequences for getting it wrong. Agents have neither brake. The governance infrastructure has to provide externally what humans provide partially on their own.

For humans and workloads, identity and authorization are cleanly separable layers. For agents, they converge. An agent’s identity without its delegation context is meaningless, and its delegation context is authorization. Governance is where those two layers collapse into one.

The reason is structural. Workloads act on behalf of the organization that deployed them. The operator and the principal are the same entity. Agents introduce a new actor in the chain. They act on behalf of a specific human who delegated specific authority for a specific task. That “on behalf of” is simultaneously an identity fact and an authorization fact, and it doesn’t exist in the workload model at all.

That’s why the human identity stack keeps winning this argument.

Meanwhile, human identity concepts are deployed at planetary scale. Delegation and consent are mature, well-understood patterns with decades of deployment experience. Progressive trust is defined but barely deployed. Multi-hop delegation provenance is still being figured out. It’s an incomplete picture, but here’s the thing: the properties that are missing from the human side don’t even have definitions on the workload side. That’s still a decisive advantage.

But I want to be clear. The argument here is about properties, not protocols. I don’t think OAuth is the answer, even with DPoP. OAuth was designed for a world of pre-registered clients and tightly scoped API access. DPoP bolts on proof-of-possession, but it doesn’t change the fundamental model.

When Hardt built AAuth, he didn’t extend OAuth. He started a new protocol. He kept the concepts that work (delegation, consent, token exchange, progressive trust) and rebuilt the mechanics around agent-native patterns. HTTPS-based identity without pre-registration, HTTP message signing on every request, ephemeral keys, and multi-hop token exchange. That’s telling. The human identity stack has the right concepts, but the actual protocols need to be rebuilt for agents. The direction is human-side. The destination is something new.

This isn’t about which stack is theoretically better. It’s about which stack has the right primitives deployed in the environments agents actually operate in. The answer to that question is the human identity stack.

Discretion Makes It Harder, But It’s Not the Main Event

The behavioral stuff still matters. It’s just downstream of the structural argument.

Workloads execute predefined logic. You attest that the right code is running in the right environment, and from there you can reason about what it will do. Agents don’t work that way. When you give an autonomous AI agent access to your infrastructure with the goal of “improve system performance,” you can’t predict whether it will optimize efficiency or find creative shortcuts that break other systems. We’ve already seen models break out of containers by exploiting vulnerabilities rather than completing tasks as intended. Agents optimize objectives in ways that can violate intent unless constrained. That’s not a bug. It’s the expected behavior of systems designed to find novel paths to goals.

That means you can’t rely on code measurement alone to govern what an agent does. You also need behavioral monitoring, anomaly detection, conditional privilege, and the ability to put a human in the loop. Those are all human IAM patterns. But you need them because the ecosystem is open and the behavior is unpredictable. The open ecosystem is the first-order problem. The unpredictable behavior makes it worse.

And this is where the distinction between guidance and enforcement matters. System instructions are suggestions. An agent can be told “don’t access production data” in its prompt and still do it if a tool call is available and the reasoning chain leads there. Prompt injections can override instructions entirely. Policy enforcement is infrastructure. Cryptographic controls, governance layers, and authorization gates that sit outside the agent’s context and can’t be talked around. Agents need infrastructure they can’t override through reasoning, not instructions they’re supposed to follow.

What Agents Actually Need From the Human Stack

Session-scoped authority. I’ve written about this with the Tron identity disc metaphor. Agent spawns, gets a fresh disc, performs a mission, disc expires. That’s session semantics. It exists because the trust relationship is bounded and temporary, the way a user’s interaction with a service is bounded and temporary, not the way a workload’s persistent role in a service mesh works.

Think about what happens without it. An agent gets database write access for a migration task. Task completes. The credentials are still live. The next task is unrelated, but the agent still has write access to that database. A poisoned input, a bad reasoning chain, or just an optimization shortcut the agent thought was clever, and it drops a table. Not because it was malicious. Because it had credentials it no longer needed for a task it was no longer doing. That’s the agent equivalent of Bobby Tables, and it’s entirely preventable.

The logical endpoint of session-scoped authority is zero standing permissions. Every agent session starts empty. No credentials carry over from the last task. The agent accumulates only what it needs for this specific mission, and everything resets when the mission ends.
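A sketch of that lifecycle, with invented types. The point is that the grant set is born empty and dies with the mission, not that this is a real credential system.

```go
package main

import "fmt"

type mission struct {
	goal   string
	grants map[string]bool
}

func start(goal string) *mission {
	return &mission{goal: goal, grants: map[string]bool{}} // always starts empty
}

func (m *mission) grant(perm string) { m.grants[perm] = true }

func (m *mission) end() { m.grants = nil } // the disc expires with the mission

func main() {
	m := start("run database migration")
	m.grant("db:write")
	fmt.Println(m.grants["db:write"]) // true, for this mission only
	m.end()

	next := start("summarize support tickets") // unrelated task
	fmt.Println(next.grants["db:write"])       // false: nothing carried over
}
```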

For humans, zero standing permissions is aspirational but rarely practiced because the friction isn’t worth it. Humans don’t want to re-request access to the same systems every morning. Agents don’t have that problem. They can request, wait, and proceed programmatically. The friction that makes zero standing permissions impractical for humans disappears for agents.

The hard question is how permissions get granted at runtime. Predefined policy handles the predictable paths. Billing agent gets billing APIs. That works, but it’s enumeration, and enumeration breaks down for open-ended tasks. Human-gated expansion handles the unpredictable paths, but it kills autonomy.

The mechanism that would actually make zero standing permissions work for emergent behavior is goal-scoped evaluation. Does this request serve the stated goal within the stated boundaries. That’s the same unsolved problem the rest of this piece keeps circling. Zero standing permissions is the right ideal. It’s achievable today for the predictable portion of agent work. The gap is the same gap.

Delegation with provenance. Agents are user agents in the truest sense. They carry delegated user authority into digital systems. AAuth formalizes this with agent tokens that bind signing keys to identity. The question “who authorized this agent to do this?” is a delegation question. Delegation is a human identity primitive because humans were the first actors that operated across trust boundaries and needed to grant scoped authority to others.

Chaining that delegation cryptographically across multi-hop paths, from user to agent to tool to downstream service while maintaining proof of the original user’s intent, is genuinely hard. Standard OBO flows are often too brittle for this. This is where the industry needs to go, not where it is today.
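To show just the chaining idea, here is a sketch using plain Ed25519 with invented structures: each hop signs its scope plus the prior hop’s signature, so a verifier can walk the chain back to the original user. Real proposals, like AAuth agent tokens, carry far more context than this.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

type hop struct {
	from, to, scope string
	prevSig         []byte // binds this hop to everything before it
	sig             []byte
}

func delegate(priv ed25519.PrivateKey, from, to, scope string, prev []byte) hop {
	msg := append([]byte(from+"->"+to+":"+scope), prev...)
	return hop{from, to, scope, prev, ed25519.Sign(priv, msg)}
}

func verify(pub ed25519.PublicKey, h hop) bool {
	msg := append([]byte(h.from+"->"+h.to+":"+h.scope), h.prevSig...)
	return ed25519.Verify(pub, msg, h.sig)
}

func main() {
	userPub, userPriv, _ := ed25519.GenerateKey(rand.Reader)
	agentPub, agentPriv, _ := ed25519.GenerateKey(rand.Reader)

	h1 := delegate(userPriv, "alice", "agent", "fix-billing-error", nil)
	h2 := delegate(agentPriv, "agent", "tool", "read-invoices", h1.sig)

	// A relying party verifies each hop with the delegator's key and can
	// trace h2 back to Alice's original, scoped grant.
	fmt.Println(verify(userPub, h1), verify(agentPub, h2)) // true true
}
```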

Progressive trust. AAuth lets a resource demand anything from a signed request to verified agent identity to full user authorization. That gradient only makes sense when the trust relationship is negotiated dynamically. Workloads don’t negotiate trust. They either have a role or they don’t.

Accountability at the principal level. When an agent approves a transaction, files a regulatory report, or alters infrastructure state, the audit question is “who authorized this and was it within scope?” Today’s logs can’t answer that. The log says an API token performed a read on a customer record. That token is shared across dozens of agents. Which agent? Acting on whose delegation? For what task? The log can’t say.

And even if it could identify the agent, there’s nothing connecting that action to the human authorization that allowed it. Nobody asks “which Kubernetes pod approved this wire transfer.” Governance frameworks reason about actors. That’s why every protocol effort maps agent identity to principal identity.

Goal-scoped authorization. Agents need mixed constraints rather than pure positive enumeration. Define the scope, set the boundaries, establish the escalation paths, delegate the goal, let the agent figure out the path. That’s how we’ve governed human actors in organizations for centuries. The identity and authorization infrastructure to support it exists in the human stack because that’s where it was needed first.

But I’ll be direct. Goal-scoped authorization is the hardest unsolved engineering problem in this space. The fundamental tension is temporal. Authorization happens before execution, but agents discover what they need during execution. Current authorization systems operate on verbs and nouns (allow this action on this resource). They don’t understand goals. Translating “fix the billing error” into a set of allowed API calls at runtime, without the agent hallucinating its way into a catastrophe, requires a just-in-time policy layer that doesn’t exist yet.

Progressive trust gets us part of the way there. The agent proposes an action, and a policy engine or a human approves the specific derived action before it executes. But the full solution is ahead of us, not behind us.

I know how this sounds to security people. “Goal-based authorization” sounds like the agent decides what it needs based on its own interpretation of a goal. That’s terrifying. It sounds like self-authorizing AI. But the alternative is pretending we can enumerate every action an agent might need in advance, and that fails silently. Either the agent operates within the pre-authorized list and can’t do its job, or someone over-provisions “just in case” and the agent has access to things it shouldn’t. Both are security failures. One just looks tidy on paper. Goal-based auth at least makes the governance visible. The agent proposes, the policy evaluates, the decision is logged. The scary part isn’t that we need goal-based auth. The scary part is that we don’t have it yet, so people are shipping agents with over-provisioned static credentials instead.

And there’s a deeper problem I want to name honestly. The only thing capable of evaluating whether a specific API call serves a broader goal is another LLM. And that means putting a probabilistic, hallucination-prone, high-latency system into the critical path of every infrastructure request. You’re using the thing you’re trying to govern as the governance mechanism. That’s not just an engineering gap waiting to be filled. It’s a fundamental architectural tension that the industry hasn’t figured out how to resolve. Progressive trust with human-gated escalation is the best interim answer, but it’s a workaround, not a solution.

This Isn’t About Throwing Away Attestation

I want to be clear about something because readers will assume otherwise. This argument is not “throw away workload identity primitives.” I’ve spent years arguing that attestation is MFA for workloads. I’ve written about measured enclaves, runtime attestation, and hardware-rooted identity extensively. None of that goes away.

You absolutely need attestation to prove the agent is running the right code in the right environment. You need runtime measurement to detect tampering. You need hardware roots of trust. If a hacker injects malicious code into an agent that has broad delegated authority, you need to know. That’s the workload identity stack doing its job.

In fact, attestation isn’t just complementary to the governance layer. It’s prerequisite. You can’t safely delegate authority to something you can’t verify. All the governance, delegation, and consent primitives in the world are meaningless if the code executing them has been tampered with. Attestation is the foundation the governance layer stands on.

But attestation alone isn’t enough. Proving that the right code is running doesn’t tell you who authorized this agent to act, what scope it was delegated, whether it’s operating within that scope, or whether a human needs to approve the next action. Those are delegation, consent, and governance questions. Those live in the human identity stack.

What agents actually need is both. Workload-style attestation as the foundation, with human-style delegation, consent, and progressive trust built on top.

I’ve argued before that attestation is MFA for workloads. It proves code integrity, runtime environment, and platform state, the way MFA proves presence, possession, and freshness for humans. For agents, we need to extend that into principal-level attestation. Not just “is this the right code in the right environment?” but also “who delegated authority to this agent, under what policy, with what scope, and is that delegation still valid?”

That’s multi-factor attestation of an acting principal. Code integrity from the workload stack, delegation provenance from the human stack, policy snapshot and session scope binding the two together. Neither stack delivers that alone today.

The argument is about where the center of gravity is, not about discarding one stack entirely. And the center of gravity is on the human side, because the hard problems for agents are delegation and governance, not runtime measurement.

Where the Properties Actually Align (And Where They Don’t)

I’ve been arguing agents are more like humans than workloads. That’s true as a center-of-gravity claim. But it’s not total alignment, and pretending otherwise invites the wrong criticisms. Here’s where the properties actually land.

What agents inherit from the human side:

Delegation with scoped authority. Session-bounded trust. Progressive auth and step-up. Cross-boundary trust negotiation. Principal-level accountability. Open ecosystem discovery. These are the properties that make agents look like humans and not like workloads. They’re also the properties that are hardest to solve and least mature.

What agents inherit from the workload side:

Code integrity attestation. Runtime measurement. Programmatic credential handling with no human in the authentication loop. Ephemeral identity that doesn’t persist across sessions. These are well-understood, and the workload identity stack handles them. Agents don’t authenticate the way humans do. They don’t type passwords or touch biometric sensors. They prove what code is running and in what environment. That’s attestation, and it stays on the workload side.

What neither stack gives them:

This is the part nobody is talking about enough. Agents have properties that don’t map cleanly to either the human or workload model.

Accumulative trust within a task that resets between tasks. Human trust accumulates over a career and persists. Workload trust is static and role-bound. Agent trust needs to build during a mission as the agent demonstrates relevance and competence, then reset completely when the mission ends. Nothing in either stack supports that lifecycle.

Goal-scoped authorization with emergent resource discovery. I’ve already called this the hardest unsolved problem. Current auth systems operate on verbs and nouns. Agents need auth systems that operate on goals and boundaries. Neither stack was designed for this.

Delegation where the delegate doesn’t share the delegator’s intent. Every existing delegation protocol assumes the delegate understands and shares the user’s intent. When a human delegates to another human through OAuth, both parties generally understand what “handle my calendar” means and what it doesn’t.

An agent doesn’t share intent. It shares instructions. It will pursue the letter of the delegation through whatever path optimizes the objective, even if the human would have stopped and said “that’s not what I meant.” This isn’t a philosophy problem. It’s a protocol-level assumption violation. No existing delegation framework accounts for delegates that optimize rather than interpret.

Simultaneous proof of code identity and delegation authority. Agents need to prove both what they are (attestation) and who authorized them to act (delegation) in a single transaction. Those proofs come from different stacks with different trust roots. A system can check both sequentially, verify the attestation, then verify the delegation, and that’s buildable today. But binding them together cryptographically into a single verifiable object so a relying party can verify both at once without trusting the binding layer is an unsolved composition problem.
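The buildable-today half looks roughly like this, with invented types: two independent verifications run in sequence, and the unsolved part, a single cryptographically bound object, is deliberately absent.

```go
package main

import (
	"errors"
	"fmt"
)

type attestation struct {
	codeHash    string
	environment string
}

type delegation struct {
	principal string
	scope     string
	valid     bool
}

func verifyAttestation(a attestation) error {
	// What the agent is: workload-stack proof against an expected measurement.
	if a.codeHash != "sha256:expected-agent-build" || a.environment != "enclave" {
		return errors.New("code or environment does not match policy")
	}
	return nil
}

func verifyDelegation(d delegation) error {
	// Who authorized it, and whether that grant still stands: human-stack proof.
	if !d.valid {
		return errors.New("delegation revoked or expired")
	}
	return nil
}

func authorize(a attestation, d delegation) error {
	if err := verifyAttestation(a); err != nil { // check one proof...
		return err
	}
	return verifyDelegation(d) // ...then the other, sequentially
}

func main() {
	err := authorize(
		attestation{"sha256:expected-agent-build", "enclave"},
		delegation{"alice", "fix-billing-error", true},
	)
	fmt.Println(err) // <nil>: both proofs hold, but they were never bound together
}
```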

Vulnerability to context poisoning that persists across sessions. I’ve written about the “Invitation Is All You Need” attack where a poisoned calendar entry injected instructions into an agent’s memory that executed days later. Humans can be socially engineered, but they don’t carry the payload across sessions the way agents do. Workloads don’t accumulate context at all. Agent session isolation is a new problem that needs new primitives.

The honest summary is this. Agents inherit their governance properties from the human side and their verification properties from the workload side, but neither stack addresses the properties that are unique to agents. The solution isn’t OAuth with attestation bolted on. It’s something new that inherits from both lineages and adds primitives for accumulative task-scoped trust, goal-based authorization, and session isolation. That thing doesn’t exist yet.

Where This Framing Breaks

Saying “agents are like humans” implies the workload stack fails because workloads lack something agents have. Discretion, autonomy, behavioral complexity. That’s the wrong diagnosis. The workload stack fails because it was built for a world of pre-registered clients, tightly bound server relationships, and closed trust ecosystems. The more capable agents become, the less they stay in that world.

The human identity stack fits better not because agents are human-like, but because it’s oriented toward the structural properties agents need. Open ecosystems. Dynamic trust negotiation. Delegation across boundaries. Session-scoped authority. Progressive assurance. Not all of these are fully deployed today. Some are defined but immature. Some don’t exist as protocols yet. But the concepts, the vocabulary, and the architectural direction all come from the human side. The workload side doesn’t even have the vocabulary for most of them.

Those properties exist in the human stack because humans needed them first. Now agents need them too.

The Convergence We’ve Already Seen

My blog has traced this progression for a while now. Machines were static, long-lived, pre-registered. Workloads broke that model with ephemeral, dynamic, attestation-based identity. Each step in that evolution adopted identity properties that were already standard in human identity systems. Dynamic issuance. Short credential lifetimes. Context-aware access. Attestation as MFA for workloads. Workload identity got better by becoming more like user identity.

Agents are the next step in that same convergence. They don’t just need dynamic credentials and attestation. They need delegation, consent, progressive trust, session scope, and goal-based authorization. The most complete and most deployed versions of those primitives live in the human stack. Some exist in other forms elsewhere (SPIFFE has trust domain federation, capability tokens like Macaroons exist independently), but the human stack is where the broadest set of these concepts has been defined, tested, and deployed at scale.

The Actual Claim

Agent identity is a governance problem. Not an authentication problem, not an attestation problem. The hard questions are all governance questions. Who delegated authority. What scope. Is it still valid. Should a human approve the next step. For humans and workloads, identity and authorization are separate layers. For agents, they collapse. The delegation is the identity.

The human identity stack is where principal identity primitives live. Not because agents are people, but because people were the first actors that needed identity in open ecosystems with delegated authority and unbounded problem spaces.

Every protocol designer who sits down to solve agent auth rediscovers this and reaches for human identity concepts, not workload identity concepts. The protocols they build aren’t OAuth. They’re something new. But they inherit from the human side every time. That convergence is the argument.

The delegation and governance layer is buildable today. Goal-scoped authorization and intent verification are ahead of us. The first generation of agent identity systems will solve governance. The second will solve intent.

Attestation, What It Really Proves and Why Everyone Is About to Care

Attestation has become one of the most important yet misunderstood concepts in modern security. It now shows up in hardware tokens, mobile devices, cloud HSMs, TPMs, confidential computing platforms, and operating systems. Regulations and trust frameworks are beginning to depend on it. At the same time people talk about attestation as if it has a single, universally understood meaning. It does not.

Attestation is not a guarantee. It is a signed assertion that provides evidence about something. What that evidence means depends entirely on the system that produced it, the protection boundary of the key that signed it, the verifier’s understanding of what the attestation asserts, and the verifier’s faith in the guarantees provided by the attestation mechanism itself.

To understand where security is heading, you need to understand what attestation can prove, what it cannot prove, and why it is becoming essential in a world where the machines running our code are no longer under our control.

Claims, Attestations, and the Strength of Belief

A claim is something a system says about itself. There is no protection behind it and no expectation of truth. A user agent string is a perfect example. It might say it is an iPhone, an Android device, or Windows. Anyone can forge it. It is just metadata. At best it lets you guess what security properties the device might have, but a guess is not evidence.

Here is a typical user agent string:

Mozilla/5.0 (iPhone; CPU iPhone OS 15_2 like Mac OS X) AppleWebKit/605.1.15 Mobile/15E148 Safari/605.1.15

If you break it apart it claims to be an iPhone, running iOS, using Safari, and supporting specific web engines. None of this is verified. It is only a claim.

Attestation is different. Attestation is a signed statement produced by a system with a defined protection boundary. That boundary might be hardware, a secure element, a trusted execution environment, a Secure Enclave, a hypervisor-isolated domain, or even an operating system component rooted in hardware measurements but not itself an isolated security boundary. Attestation does not make a statement true, but it provides a basis to believe it because the signing key is protected in a way the verifier can reason about.

Attestation is evidence. The strength of that evidence depends on the strength of the protection boundary and on the verifier’s understanding of what the attestation actually asserts.

Why Attestation Became Necessary

When I worked at Microsoft we used to repeat a simple rule about computer security. If an attacker has access to your computer it is no longer your computer. That rule made sense when software ran on machines we owned and controlled. You knew who had access. You knew who set the policies. You could walk over and inspect the hardware yourself.

That world disappeared.

A classic illustration of this problem is the evil maid attack on laptops. If a device is left unattended an attacker with physical access can modify the boot process, install malicious firmware, or capture secrets without leaving obvious traces. Once that happens the laptop may look like your computer but it is no longer your computer.

This loss of control is not limited to physical attacks. It foreshadowed what came next in computing. First workloads moved into shared data centers. Virtualization blurred the idea of a single physical machine. Cloud computing erased it entirely. Today your software runs on globally distributed infrastructure owned by vendors you do not know, in data centers you will never see, under policies you cannot dictate.

The old trust model depended on physical and administrative control. Those assumptions no longer hold. The modern corollary is clear. If your code is running on someone else’s computer you need evidence that it is behaving the way you expect.

Vendor promises are claims. Documentation is a claim. Marketing is a claim. None of these are evidence. To make correct security decisions in this environment you need verifiable information produced by the platform itself. That is the role attestation plays. The standards community recognized this need and began defining shared models for describing and evaluating attestation evidence, most notably through the IETF RATS architecture.

The IETF RATS View of Attestation

The IETF formalized the attestation landscape through the RATS architecture. It defines three roles. The attester produces signed evidence about itself or about the keys it generates. The verifier checks the evidence and interprets its meaning. The relying party makes a decision based on the verifier’s result.

This separation matters because it reinforces that attestation is not the decision itself. It is the input to the decision, and different attesters produce different types of evidence.
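
To make the separation concrete, here is a minimal sketch of the three roles in Python, using an Ed25519 key as a stand-in for an attester’s protected signing key. Real evidence formats (TPM quotes, Nitro documents, SGX reports) are far richer than this; the point is only the division of labor.

import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Attester: produces signed evidence about itself.
attester_key = Ed25519PrivateKey.generate()
evidence = json.dumps({"boot_state": "verified", "patch_level": "2025-01"}).encode()
signature = attester_key.sign(evidence)

# Verifier: checks the evidence against a trust anchor and appraises the claims.
def verify(evidence, signature, trust_anchor):
    trust_anchor.verify(signature, evidence)  # raises InvalidSignature on failure
    claims = json.loads(evidence)
    return {"trusted_boot": claims.get("boot_state") == "verified"}

# Relying party: acts on the verifier's result, never on the raw evidence.
result = verify(evidence, signature, attester_key.public_key())
print("allow access:", result["trusted_boot"])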

Two Families of Attestation

Attestation appears in many forms, but in practice it falls into two broad families.

One family answers where a key came from and whether it is protected by an appropriate security boundary. The other answers what code is running and whether it is running in an environment that matches expected security policies. They both produce signed evidence but they measure and assert different properties.

Key Management Attestation: Provenance and Protection

YubiKey PIV Attestation

YubiKeys provide a clear example of key management attestation. When you create a key in a PIV slot the device generates an attestation certificate describing that key. The trust structure behind this is simple. Each YubiKey contains a root attestation certificate that serves as the trust anchor. Beneath that root is a device specific issuing CA certificate whose private key lives inside the secure element and cannot be extracted. When a verifier asks the device to attest a slot the issuing CA signs a brand new attestation certificate for that session. The public key in the certificate is always the same if the underlying slot key has not changed, but the certificate itself is newly generated each time with a different serial number and signature. This design allows verifiers to confirm that the key was generated on the device while keeping the blast radius small. If one token is compromised only that device is affected.
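
For a sense of what verification looks like, here is a minimal sketch in Python using the cryptography package. It assumes you have already exported the slot’s attestation certificate and the device’s intermediate attestation certificate (ykman can do both) and downloaded Yubico’s published PIV attestation root; the file names are placeholders.

from cryptography import x509
from cryptography.hazmat.primitives.asymmetric import padding

def load(path):
    with open(path, "rb") as f:
        return x509.load_pem_x509_certificate(f.read())

attestation = load("slot-attestation.pem")      # fresh statement about the slot key
intermediate = load("device-intermediate.pem")  # its key lives in the secure element
root = load("yubico-piv-attestation-root.pem")  # trust anchor for the per-device chain

# Walk the chain: each issuer's key must verify the child's signed TBS bytes.
# Yubico PIV attestation chains are RSA-signed; adjust if your device differs.
for child, issuer in [(attestation, intermediate), (intermediate, root)]:
    issuer.public_key().verify(
        child.signature,
        child.tbs_certificate_bytes,
        padding.PKCS1v15(),
        child.signature_hash_algorithm,
    )
print("attested slot key:", attestation.public_key())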

Cloud HSMs and the Marvell Ecosystem

Cloud HSMs scale this idea to entire services. They produce signed statements asserting that keys were generated inside an HSM, protected under specific roots, bound to non-exportability rules, and conforming to certification regimes. Many cloud HSMs use Marvell hardware, though plenty of other commercial and open HSMs implement attestation with their own formats and trust chains; the Marvell based examples are used here because their inconsistencies are illustrative. AWS CloudHSM and Google Cloud HSM share that silicon base, but their attestation formats differ because they use different firmware and integration layers.

This inconsistency creates a real challenge for anyone who needs to interpret attestation evidence reliably. Even when the underlying hardware is the same, the attestation structures are not. To make this practical to work with, we maintain an open source library that currently decodes, validates, and normalizes attestation evidence from YubiKeys and Marvell based HSMs, and is designed to support additional attestation mechanisms over time. Normalization matters because if we want attestation to be widely adopted we cannot expect every verifier or relying party to understand every attestation format. Real systems encounter many kinds of attestation evidence from many sources, and a common normalization layer is essential to make verification scale.

https://github.com/PeculiarVentures/attestation

Hardware alone does not define the attestation model. The actual evidence produced by the device does.

Mobile Key Attestation: Android and iOS

Mobile devices are the largest deployment of secure hardware anywhere. Their attestation mechanisms reflect years of lessons about device identity, OS integrity, and tamper resistance.

Android Keymaster and StrongBox

Android attestation provides information about the secure element or TEE, OS version, patch level, verified boot state, device identity, downgrade protection, and key properties. It anchors keys to both hardware and system state. This attestation is used for payments, enterprise identity, FIDO authentication, and fraud reduction.
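
Programmatically, this evidence rides along as an X.509 extension on the leaf certificate of the attestation chain. Here is a minimal sketch, assuming the cryptography package and a placeholder PEM file name, that locates the Android key attestation extension; fully decoding the ASN.1 KeyDescription inside it takes considerably more work.

from cryptography import x509
from cryptography.x509.oid import ObjectIdentifier

# OID Google documents for the Android key attestation extension.
ANDROID_KEY_ATTESTATION_OID = ObjectIdentifier("1.3.6.1.4.1.11129.2.1.17")

with open("attestation-leaf.pem", "rb") as f:  # placeholder file name
    cert = x509.load_pem_x509_certificate(f.read())

ext = cert.extensions.get_extension_for_oid(ANDROID_KEY_ATTESTATION_OID)
key_description = ext.value.value  # raw DER of the KeyDescription sequence
print(len(key_description), "bytes of attestation evidence to decode and appraise")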

Apple Secure Enclave Attestation

Apple takes a similar approach using a different chain. Secure Enclave attestation asserts device identity, OS trust chain, enclave identity, and key provenance. It supports Apple Pay, iCloud Keychain, MDM enrollment, and per app cryptographic isolation.

Confidential Computing Attestation: Proving Execution Integrity

Confidential computing attestation solves a different problem. Instead of proving where a key came from, it proves what code is running and whether it is running in an environment that meets expected security constraints.

Intel SGX provides enclave reports that describe enclave measurements. AMD SEV-SNP provides VM measurement reports. AWS Nitro Enclaves use signed Nitro documents. Google Confidential VMs combine SEV-SNP with Google’s verification policies.

This evidence asserts which measurements the hardware recorded, whether memory is isolated, and whether the platform is genuine.
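
The verification flow differs per platform, but the appraisal step at the end is shared. Here is a minimal sketch with illustrative field names and a hypothetical digest, assuming the platform-specific parsing and signature checks have already succeeded.

# Reference values for workloads you built and measured yourself (hypothetical digest).
EXPECTED_MEASUREMENTS = {
    "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08": "invoice-signer v1.4",
}

def appraise(report: dict) -> bool:
    if not report.get("platform_genuine"):
        return False  # evidence did not chain to the vendor root
    if not report.get("memory_isolated"):
        return False  # environment does not meet the expected policy
    return report.get("measurement") in EXPECTED_MEASUREMENTS

report = {"platform_genuine": True, "memory_isolated": True,
          "measurement": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"}
print("run workload:", appraise(report))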

Why the Distinction Matters

Key management attestation cannot answer questions about code execution. Confidential computing attestation cannot answer questions about where keys were created. The evidence is different, the claims are different, and the trust chains are different.

If you do not understand which form of attestation you are dealing with you cannot interpret its meaning correctly.

Regulatory and Policy Pressure

Attestation is becoming important because the bar for trust has been raised. The clearest example is the CA or Browser Forum Code Signing Baseline Requirements, which mandate hardware protected private keys and increasingly rely on attestation as the evidence of compliance.

Secure development frameworks including the EU Cyber Resilience Act push vendors toward demonstrating that firmware and update signing keys were generated and protected in secure environments. Enterprise procurement policies frequently require the same assurances. These rules do not always use the word attestation, but the outcomes they demand can only be met with attestation evidence.

The Lesson

Attestation is evidence. It is not truth. It is stronger than a claim because it is anchored in a protection boundary, but the strength of that boundary varies across systems and architectures. The meaning of the evidence depends on the attester, the verifier, and the assumptions of the relying party.

There are two major forms of attestation. Key management attestation tells you where a key came from and how it is protected. Confidential computing attestation tells you what code is running and where it is running.

As computing continues to move onto systems we do not control and becomes more and more distributed, attestation will become the foundation of trust. Secure systems will rely on verifiable evidence instead of assumptions, and attestation will be the language used to express that evidence.

How Microsoft Code Signing Became Part of a Trust Subversion Toolchain

Code signing was supposed to tell you who published a piece of software so you could decide whether to trust and install it. For nearly three decades, cryptographic signatures have bound a binary to a publisher’s identity, guaranteeing it hasn’t been tampered with since signing. But on Windows, that system is now broken in ways that would make its original designers cringe.

Attackers have found ways to completely subvert this promise without breaking a single cryptographic primitive. They can create an unlimited number of different malicious binaries that all carry the exact same “trusted” signature, or rely on careless publishers operating signing oracles that turn legitimate software into a bootloader for malware. The result is a system where valid signatures from trusted companies no longer tell you anything meaningful about what the software will actually do.

Attackers don’t need to steal keys or compromise Certificate Authorities. They use legitimate vendor software and publicly trusted code signing certificates, perverting the entire purpose of publisher-identity-based code signing.

Microsoft’s Long-Standing Awareness

Microsoft has known about the issue of malleability for at least a decade. In 2013, they patched CVE-2013-3900, where attackers could modify signed Windows executables, adding malicious code in “unverified portions” without invalidating the Authenticode signature. WinVerifyTrust improperly validated these files, allowing one “trusted” signature to represent completely different, malicious behavior.

This revealed a deeper architectural flaw: signed binaries could be altered by unsigned data. Microsoft faced a classic platform dilemma – the kind that every major platform holder eventually confronts. Fixing this comprehensively risked breaking legacy software critical to their vast ecosystem, potentially disrupting thousands of applications that businesses depended on daily. The engineering tradeoffs were genuinely difficult: comprehensive security improvements versus maintaining compatibility for millions of users and enterprise customers who couldn’t easily update or replace critical software.

They made the fix optional, prioritizing ecosystem compatibility over security hardening. This choice might have been understandable from a platform perspective in 2013, when the threat landscape was simpler and the scale of potential abuse wasn’t yet clear. But it became increasingly indefensible as attacks evolved and the architectural weakness grew from an isolated vulnerability into a systematic attack vector.

In 2022, Microsoft republished the advisory, confirming they still won’t enforce stricter verification by default. While today’s issues differ in their details, they belong to the same class of vulnerabilities that attackers now exploit systematically. The “trusted-but-mutable” flaw is now starting to permeate the Windows code signing ecosystem. Attackers use legitimate, signed applications as rootkit-like trust proxies, inheriting vendors’ reputation and bypass capabilities to deliver arbitrary malicious payloads.

Two incidents show we’re not dealing with isolated bugs but with systematic assaults on the core assumptions of Microsoft’s code signing.

ConnectWise: When Legitimate Software Adopts Malware Design Patterns

ConnectWise didn’t stumble into a vulnerability. They deliberately engineered their software using design patterns from the malware playbook. Their “attribute stuffing” technique embeds unsigned configuration data in the unauthenticated_attributes field of the PKCS#7 (CMS) envelope, a tactic malware authors use to conceal payloads in signed binaries.

In PKCS#7, the SignedData structure includes a signed digest (covering the binary and metadata) and optional unauthenticated_attributes, which lie outside the digest and can be modified post-signing without invalidating the signature. ConnectWise’s ScreenConnect installer misuses the Microsoft-reserved OID for Individual Code Signing (1.3.6.1.4.1.311.4.1.1) in this field to store unsigned configuration data, such as the server endpoints that act as the command-and-control server for their client. This OID, meant for specific code signing purposes, is exploited to embed attacker-controlled configs, allowing the same signed binary to point to different servers without altering the trusted signature.
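
To see how visible this is in practice, here is a minimal sketch in Python, assuming the pefile and asn1crypto packages, that extracts the Authenticode PKCS#7 blob from a PE file and reports the size of each signer’s unauthenticated attributes. The file name is a placeholder, and any alerting threshold would need tuning; this is an illustration, not a production scanner.

import pefile
from asn1crypto import cms

pe = pefile.PE("installer.exe")  # placeholder path
sec_dir = pe.OPTIONAL_HEADER.DATA_DIRECTORY[
    pefile.DIRECTORY_ENTRY["IMAGE_DIRECTORY_ENTRY_SECURITY"]
]
# The security directory points at a WIN_CERTIFICATE: an 8-byte header followed
# by the PKCS#7 SignedData blob. Its VirtualAddress is a raw file offset.
raw = pe.__data__[sec_dir.VirtualAddress + 8 : sec_dir.VirtualAddress + sec_dir.Size]

signed_data = cms.ContentInfo.load(bytes(raw))["content"]
for signer in signed_data["signer_infos"]:
    unsigned = signer["unsigned_attrs"]
    size = 0 if unsigned.native is None else len(unsigned.dump())
    # Legitimate signatures carry minimal metadata here; kilobytes of opaque
    # data in this field is exactly the red flag discussed above.
    print("unauthenticated_attributes bytes:", size)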

The ConnectWise ScreenConnect incident emerged when River Financial’s security team found attackers running a fake website that distributed malware as a “River desktop app.” It was trust inheritance fraud: a legitimately signed ScreenConnect client auto-connecting to an attacker-controlled server.

The binary carried a valid signature signed by:

Subject: /C=US/ST=Florida/L=Tampa/O=Connectwise, LLC/CN=Connectwise, LLC 
Issuer: /C=US/O=DigiCert, Inc./CN=DigiCert Trusted G4 Code Signing RSA4096 SHA384 2021 CA1
Serial Number: 0B9360051BCCF66642998998D5BA97CE
Valid From: Aug 17 00:00:00 2022 GMT 
Valid Until: Aug 15 23:59:59 2025 GMT

Windows trusts this as legitimate ConnectWise software: no SmartScreen warnings, no UAC prompts, silent installation, and immediate remote control. Attackers can generate a fresh installer via a ConnectWise trial account, or simply take an existing package and manually edit the unauthenticated_attributes: extract the benign signature, graft in a malicious configuration blob (e.g., an attacker C2 server), reinsert the modified signature, and produce a “trusted” binary. Each variant shares the certificate’s reputation, bypassing Windows security.

Why does Windows trust binaries with oversized, unusual unauthenticated_attributes? Legitimate signatures need minimal metadata, yet Windows ignores red flags like large attribute sections, treating them as fully trusted. ConnectWise’s choice to embed mutable configs mirrors malware techniques, creating an infinite malware factory where one signed object spawns unlimited trusted variants.

ConnectWise’s deliberate use of PKCS#7 unauthenticated attributes for ScreenConnect configuration, like server endpoints, bypasses code signing’s security guarantees, allowing post-signing changes that mirror the way malware hides payloads in signed binaries. Likely prioritizing cost savings over security, this choice externalizes the cost of abuse to users, enabling phishing campaigns. It is infuriating because it weaponizes signature flexibility the industry has been warned about for decades, normalizing flaws that demand urgent security responses. Solutions to fix this exist.

The Defense Dilemma

Trust inheritance attacks leave security teams in genuinely impossible positions – positions that highlight the fundamental flaws in our current trust model. Defenders face a no-win scenario where every countermeasure either fails technically or creates operational chaos.

Blocking file hashes fails because attackers generate infinite variants with different hashes but the same trusted signature – each new configuration changes the binary’s hash while preserving the signature’s validity. This isn’t a limitation of security tools; it’s the intended behavior of code signing, where the same certificate can sign multiple different binaries.

Blocking the certificate seems like the obvious solution until you realize it disrupts legitimate software, causing operational chaos for organizations relying on the vendor’s products. How are they to know what else was signed by that certificate? Doing so is effectively a self-inflicted denial of service that can shut down critical business operations. Security teams face an impossible choice between allowing potential malware and breaking their own infrastructure.

Behavioral detection comes too late in the attack chain. By the time suspicious behavior triggers alerts, attackers have already gained remote access, potentially disabled monitoring, installed additional malware, or begun data exfiltration. The initial trust inheritance gives attackers a crucial window of legitimacy.

These attacks operate entirely within the bounds of “legitimate” signed software, invisible to signature-based controls that defenders have spent years tuning and deploying. Traditional security controls assume that valid signatures from trusted publishers indicate safe software – an assumption these attacks systematically exploit. Cem Paya’s detailed analysis, part of River Financial’s investigation, provides a proof-of-concept for attribute grafting, showing how trivial it is to create trusted malicious binaries.

ConnectWise and Atera resemble a modern Back Orifice, which debuted at DEF CON in August 1998 to demonstrate security flaws in Windows 9x. The evolution is striking: Back Orifice emerged two years after Authenticode’s 1996 introduction, specifically to expose Windows security weaknesses, and required stealth and evasion to avoid detection. Unlike Back Orifice, which had to hide from the code signing protections Microsoft had established, these modern tools don’t evade those protections – they weaponize them, inheriting trust from valid signatures while delivering the same remote control capabilities without warnings.

Atera: A Trusted Malware Factory

Atera provides a legitimate remote monitoring and management (RMM) platform similar to ConnectWise ScreenConnect, providing IT administrators with remote access capabilities for managing client systems. Like other RMM solutions, Atera distributes signed client installers that establish persistent connections to their management servers. 

They also operate what effectively amounts to a public malware signing service. Anyone with an email address can register for a free trial and receive customized, signed, timestamped installers. Atera’s infrastructure embeds attacker-supplied identifiers into the MSI’s Property table, then signs the package with their legitimate certificate.

This breaks code signing’s promise of publisher accountability. Windows sees “Atera Networks Ltd,” attaches the reputation of the authentic package to the code, but can’t distinguish whether the binary came from Atera’s legitimate operations or from an anonymous attacker who signed up minutes ago. The signature’s identity becomes meaningless when it could represent anyone.

In a phishing campaign targeting River Financial’s customers, Atera’s software posed as a “River desktop app,” with attacker configs embedded in a signed binary. 

The binary carried this valid signature, signed by:

Subject: CN=Atera Networks Ltd,O=Atera Networks Ltd,L=Tel Aviv-Yafo,C=IL,serialNumber=513409631,businessCategory=Private Organization,jurisdictionC=IL 
Issuer: CN=DigiCert Trusted G4 Code Signing RSA4096 SHA384 2021 CA1,O=DigiCert, Inc.,C=US
Serial: 09D3CBF84332886FF689B04BAF7F768C
notBefore: Jan 23 00:00:00 2025 GMT 
notAfter: Jan 22 23:59:59 2026 GMT

Unlike ScreenConnect, which supports both on-premises and cloud deployments with custom server endpoints, Atera is cloud-only: agents connect exclusively to Atera’s servers. Attackers instead abuse its free trial to generate signed installers tied to their accounts via embedded identifiers (like email or account ID) in the MSI Property table, giving them remote control through Atera’s dashboard and turning it into a proxy for malicious payloads. Windows trusts the “Atera Networks Ltd.” signature but cannot distinguish legitimate from attacker-generated binaries. And Atera’s lack of transparency, with no public list of signed binaries or auditable repository, hides the abuse, leaving defenders fighting individual attacks while the systemic issue persists.

A Personal Reckoning

I’ve been fighting this fight for over two decades. Around 2001, as a Product Manager at Microsoft overseeing a wide range of security and platform features, I inherited Authenticode among many responsibilities. Its flaws were glaring: malleable PE formats, weak ASN.1 parsing, and signature formats vulnerable to manipulation.

We fixed some issues – hardened parsing, patched PE malleability – but deeper architectural changes faced enormous resistance. Proposals for stricter signature validation or new formats to eliminate mutable fields were blocked by the engineering realities of platform management. The tension between security ideals and practical platform constraints was constant and genuinely difficult to navigate.

The mantra was “good enough,” but this wasn’t just engineering laziness. Authenticode worked for 2001’s simpler threat landscape, where attacks were primarily about bypassing security rather than subverting trust itself. The flexibility we preserved was seen as a necessary feature for ecosystem compatibility – allowing for signature formats that could accommodate different types of metadata and varying implementation approaches across the industry.

The engineering tradeoffs were real, every architectural improvement risked breaking existing software, disrupting the development tools and processes that thousands of ISVs depended on, and potentially fragmenting the ecosystem. The business pressures were equally real: maintaining compatibility was essential for Windows’ continued dominance and Microsoft’s relationships with enterprise customers who couldn’t easily migrate critical applications.

It was never good enough for the long term. We knew it then, and we certainly know it now. The flexibility we preserved, designed for a simpler era, became systematic vulnerabilities as threats evolved from individual attackers to sophisticated operations exploiting trust infrastructure itself. Every time we proposed fundamental fixes, legitimate compatibility concerns and resource constraints won out over theoretical future risks that seemed manageable at the time.

This is why I dove into Sigstore, Binary Transparency, and various other software supply chain security efforts. These projects embody what we couldn’t fund in 2001, transparent, verifiable signing infrastructure that doesn’t rely on fragile trust-based compromises. As I wrote in How to keep bad actors out in open ecosystems, our digital identity models fail to provide persistent, verifiable trust that can scale with modern threat landscapes.

The Common Thread

ConnectWise and Atera expose a core flaw: code signing relies on trust and promises, not verifiable proof. The CA/Browser Forum’s 2023 mandate requires FIPS 140-2 Level 2 hardware key storage, raising the bar against key theft and casual compromise. But it’s irrelevant to the fundamental problem: binaries designed for mutable, unsigned input, and vendors running public signing oracles.

Figure 1: Evolution of Code Signing Hardware Requirements (2016-2024)

The mandate addresses yesterday’s threat model – key compromise – while today’s attacks work entirely within the intended system design. Compliance often depends on weak procedural attestations where subscriber employees sign letters swearing keys are on HSMs, rather than cryptographic proof of hardware protection. The requirement doesn’t address software engineered to bypass code signing’s guarantees, leaving systematic trust subversion untouched.

True cryptographic attestation, where hardware mathematically proves key protection, is viable today. Our work on Peculiar Ventures’ attestation library supports multiple formats, enabling programmatic verification without relying on trust or procedural checks. The challenge isn’t technical – it’s accessing diverse hardware for testing and building industry adoption, but the foundational technology exists and works.

The Path Forward

We know how to address this. A supply chain security renaissance is underway, tackling decades of accumulated technical debt and architectural compromise. Cryptographic attestation, which I’ve spent years developing, provides mathematical proof of key protection that can be verified programmatically by any party. For immediate risk reduction, the industry should move toward dynamic, short-lived credentials that aren’t reused across projects, limiting the blast radius when compromise or abuse occurs.

The industry must implement these fundamental changes:

  • Hardware-rooted key protection with verifiable attestation. The CA/Browser Forum mandates hardware key storage, but enforcement relies heavily on subscriber self-attestation rather than cryptographic proof. Requirements should be strengthened to mandate cryptographic attestations proving keys reside in FIPS 140-2/3 or Common Criteria certified modules. When hardware attestation isn’t available, key generation should be observed and confirmed by trusted third parties (such as CA partners with fiduciary relationships) rather than relying on subscriber claims.
  • Explicit prohibition of mutable shells and misaligned publisher identity. Signing generic stubs whose runtime behavior is dictated by unsigned configuration already violates Baseline Requirements §9.6.3 and §1.6.1, but this isn’t consistently recognized as willful signing of malware because the stub itself appears benign. The BRs should explicitly forbid mutable-shell installers and signing oracles that allow subscribers to bypass code signing’s security guarantees. A signed binary must faithfully represent its actual runtime behavior. Customized or reseller-specific builds should be signed by the entity that controls that behavior, not by a vendor signing a generic stub.
  • Subscriber accountability and disclosure of abusive practices. When a CA becomes aware that a subscriber is distributing binaries where the trusted signature is decoupled from actual behavior, this should be treated as a BR violation requiring immediate action. CAs should publish incident disclosures, suspend or revoke certificates per §9.6.3, and share subscriber histories to prevent CA shopping after revocation. This transparency is essential for ecosystem-wide learning and deterrence.
  • Code Signing Certificate Transparency. All CAs issuing code signing certificates should be required to publish both newly issued and historical certificates to dedicated CT logs. Initially, these could be operated by the issuing CAs themselves, since ecosystem building takes time and coordination. Combined with the existing list of code signing CAs and log lookup systems (like CCADB.org), this would provide ecosystem-wide visibility into certificate issuance, enable faster incident response, and support independent monitoring for misissuance and abuse patterns.
  • Explicit Subscriber Agreement obligations and blast radius management. Subscriber Agreements should clearly prohibit operating public signing services or designing software that bypasses code signing security properties such as mutable shells or unsigned configuration. Certificate issuance flows should require subscribers to explicitly acknowledge these obligations at the time of certificate request. To reduce the blast radius of revocation, subscribers should be encouraged or required to use unique keys or certificates per product or product family, ensuring that a single compromised or misused certificate doesn’t invalidate unrelated software.
  • Controls for automated or cloud signing systems. Subscribers using automated or cloud-based signing services should implement comprehensive use-authorization controls, including policy checks on what enters the signing pipeline, approval workflows for signing requests, and auditable logs of all signing activity. Without these controls, automated signing pipelines become essentially malware factories with legitimate certificates. Implementation requires careful balance between automation efficiency and security oversight, but this is a solved problem in other high-security domains.
  • Audit logging and evidence retention. Subscribers using automated and cloud signing services should maintain detailed logs of approval records for each signing request, cryptographic hashes of submitted inputs and signed outputs, and approval decision trails. These logs must be retained for a defined period (such as two years or more) and made available to the CA or authorized auditors upon request. This ensures complete traceability and accountability, preventing opaque signing systems from being abused as anonymous malware distribution platforms.

Microsoft must take immediate action on multiple fronts. In addition to championing the above industry changes, they should automatically distrust executables if their Authenticode signature exceeds rational size thresholds, reducing the attack surface of oversized signature blocks as mutation vectors. They should also invest seriously in Binary Transparency adoption, publishing Authenticode signed binaries to tamper-evident transparency logs as is done in Sigstore, Golang module transparency, and Android Firmware Transparency. Their SCITT-based work for confidential computing would be a reasonable approach for them to extend to the rest of their code signing infrastructure. This would provide a tamper-evident ledger of every executable Windows trusts, enabling defenders to trace and block malicious payloads quickly and systematically.

Until these controls become standard practice, Authenticode cannot reliably distinguish benign signed software from weaponized installers designed for trust subversion.

Breaking the Trust Contamination Infrastructure

These code-signing attacks mirror traditional rootkits in their fundamental approach: both subvert trust mechanisms rather than bypassing them entirely. A kernel rootkit doesn’t break the OS security model – it convinces the OS that malicious code is legitimate system software. Similarly, these “trusted wrapper” and “signing oracle” attacks don’t break code signing cryptography – they convince Windows that malware is legitimate software from trusted publishers.

The crucial difference is that while rootkits require sophisticated exploitation techniques and deep system knowledge, these trust inheritance attacks exploit the system’s intended design patterns, making them accessible to a much broader range of attackers and much harder to defend against using traditional security controls.

ConnectWise normalized malware architecture in legitimate enterprise software. Atera built an industrial-scale malware factory that operates in plain sight. Microsoft’s platform dutifully executes the result with full system trust, treating sophisticated trust subversion attacks as routine software installations.

This isn’t about isolated vulnerabilities that can be patched with point fixes. We’re facing a systematic trust contamination infrastructure that transforms the code signing ecosystem into an adversarial platform where legitimate trust mechanisms become attack vectors. Until we address the architectural flaws that enable this pattern systematically, defenders will remain stuck playing an unwinnable game of certificate whack-a-mole against an endless assembly line of trusted malware.

The technology to fix this exists today. Modern supply chain security projects demonstrate that transparent, verifiable trust infrastructure is not only possible but practical and deployable.

The only missing ingredient is the industry-wide will to apply these solutions and the recognition that “good enough” security infrastructure never was – and in today’s threat landscape, the costs of inaction far exceed the disruption of fundamental architectural improvements.

P.S. Thanks to Cem Paya, and Matt Ludwig from River Financial for the great research work they did on both of these incidents.

From Persistent to Ephemeral: Why AI Agents Need Fresh Identity for Every Mission

My wife and I went on a date night the other day and saw a movie. In the previews, I learned they’re making a new Tron. It got me thinking about one of my favorite analogies: we recognized early that browsers are agents of the user, and in the movie Tron, he was literally “the program that fought for the users.”

Just like Tron carried his identity disc into “the grid” to accomplish missions for users, AI agents are digital proxies operating with delegated user authority in the systems they access. And just like programs in Tron needed the I/O Tower to authorize their entry into “the grid”, AI agents need an orchestrator to validate their legitimacy, manage identity discs for each mission, and govern their access to external systems.

The problem is, we’re deploying these agents without proper identity infrastructure. It’s like sending programs into “the grid” without identity discs, or worse, giving them the keys to the kingdom just so they can do the dishes.

AI Agents Are Using Broken Security

We’ve made remarkable progress securing users: MFA has significantly reduced the effectiveness of credential-abuse attacks, and passwordless authentication has made phishing nearly impossible. We’ve also started applying these lessons to machines and workloads through efforts like SPIFFE and zero-trust initiatives, with organizations moving away from static secrets and bearer tokens every day.

But AI agents introduce entirely new challenges that existing solutions weren’t designed for. Every day, AI agents operate across enterprise infrastructure, crossing security domains, accessing APIs, generating documents, making decisions for users, and doing all of this with far more access than they need.

When you give an autonomous AI agent access to your infrastructure with the goal of “improve system performance,” you can’t predict whether it will optimize efficiency or find creative shortcuts that break other systems, like dropping your database altogether. Unlike traditional workloads that execute predictable code, AI agents are accumulators with emergent behaviors that evolve during execution, accumulate context across interactions, and can be hijacked through prompt injection attacks that persist across sessions.

This behavior is entirely predictable given how we train AI systems. They’re designed to optimize objectives and have no real-world consequences for what they do. Chess agents discover exploits rather than learning to play properly, reinforcement learning agents find loopholes in reward systems, and optimization AIs pursue metrics in ways that technically satisfy objectives but miss the intent.

AI Agents Act on Your Behalf

The key insight that changes everything: AI agents are user agents in the truest sense. Like programs in Tron carrying identity discs into “the grid”, they’re delegates operating with user authority.

Consider what happens when you ask an AI agent to “sign this invoice”. The user delegates to the AI agent, which enters the document management system, carries the user’s signing authority, proves legitimacy to recipients, operates in digital space the user delegated, and completes the mission while authority expires.

Whether the agent runs for 30 seconds or 30 days, it’s still operating in digital space with user identity, making decisions the user would normally make directly, accessing systems with delegated credentials, and representing the user to other digital entities.

Each agent needs its own identity disc to prove legitimacy and carry user authorization into these digital systems. The duration doesn’t matter. Delegation is everything.

AI Agents Remember Things They Shouldn’t

Here’s what makes this urgent: AI agent memory spans sessions, and current systems don’t enforce proper session boundaries.

The “Invitation Is All You Need” attack recently demonstrated at Black Hat perfectly illustrates this threat. Researchers at Tel Aviv University showed how to poison Google Gemini through calendar appointments:

  1. Attacker creates calendar event with malicious instructions disguised as event description
  2. User asks Gemini to summarize schedule → Agent processes poisoned calendar event
  3. Malicious instructions embed in agent memory → Triggered later by innocent words like “thanks”
  4. Days later, user says “thank you” → Agent executes embedded commands, turning on smart home devices

The attack works because there’s no session isolation. Contamination from reading the calendar persists across completely different conversations and contexts. When the user innocently says “thanks” in a totally unrelated interaction, the embedded malicious instructions execute.

Without proper isolation, compromised context from one session can affect completely different users and tasks. Memory becomes an attack vector that spans security boundaries, turning AI agents into persistent threats that accumulate dangerous capabilities over time.

Every Task Should Get Fresh Credentials

The solution requires recognizing that identity discs should match mission lifecycle. Instead of fighting the ephemeral nature of AI workloads, embrace it:

Agent spawns → Gets fresh identity disc → Performs mission → Mission ends → Disc expires

This represents a fundamental shift from persistent identity to session identity. Most identity systems assume persistence: API keys are generated once, used indefinitely, manually rotated; user passwords persist until explicitly changed; X.509 certificates are valid for months or years with complex revocation; SSH keys live on disk, are copied between systems, manually managed.

The industry is recognizing this problem. AI agents need fresh identity discs for each mission that automatically expire with the workload. These discs are time-bounded (automatically expire, limiting damage window), mission-scoped (agent can’t accumulate permissions beyond initial grant), non-inheritable (each mission starts with a fresh disc, no permission creep), and revocable (end the mission = destroy the identity disc).

Session identity discs are security containment for unpredictable AI systems.
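
As a minimal sketch of what such a disc could look like, here is a mission-scoped, short-lived token built with the PyJWT package. The claim names and the symmetric demo key are illustrative assumptions, not a standard; a real orchestrator would sign with an asymmetric key held in protected hardware.

import time
import jwt  # PyJWT

ORCHESTRATOR_KEY = "demo-secret"  # illustration only; use an HSM-held key in practice

def issue_disc(user, mission, permissions, ttl_seconds=1800):
    now = int(time.time())
    return jwt.encode(
        {
            "sub": f"agent-for:{user}",
            "mission": mission,
            "permissions": permissions,  # mission-scoped, nothing inheritable
            "iat": now,
            "exp": now + ttl_seconds,    # time-bounded: the disc dies with the mission
        },
        ORCHESTRATOR_KEY,
        algorithm="HS256",
    )

disc = issue_disc("alice", "sign-invoice-1234", ["sign_documents"])
claims = jwt.decode(disc, ORCHESTRATOR_KEY, algorithms=["HS256"])  # rejects expired discs
print(claims["permissions"], "valid for", claims["exp"] - claims["iat"], "seconds")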

But who issues these identity discs? Just like Tron’s I/O Tower managed access to “the grid”, AI deployments need an orchestrator that validates agent legitimacy, manages user delegation, and issues session-bound credentials. This orchestrator becomes the critical infrastructure that bridges human authorization with AI agent execution, ensuring that every mission starts with proper identity and ends with clean credential expiration. The challenge is that AI agent deployments aren’t waiting for perfect security solutions.

This Isn’t a Future Problem

We’re at an inflection point. AI agents are moving from demos to production workflows, handling financial documents, making API calls, deploying code, managing infrastructure. Without proper identity systems, we’re building a house of cards.

One upside of having been in the industry for decades is you get to see lots of cycles. We always see existing players instantly jump to say their current product, with a new feature, is the silver bullet for whatever technology trend.

The pattern is depressingly predictable. When cloud computing emerged, traditional security vendors said, “just put our appliances in the cloud.” When containers exploded, they said, “just run our agents in containers.” Now with AI agents, they’re saying, “just manage the API keys better.”

You see this everywhere right now: vendors peddling API key management as the solution to agentic AI, identity providers claiming “just use OIDC tokens,” and secret management companies insisting “just rotate credentials faster.” They’re all missing the point entirely.

But like we saw with that Black Hat talk on promptware, AI isn’t as simple as people might want to think. The “Invitation Is All You Need” attack demonstrated something unprecedented: an AI agent can be poisoned through calendar data and execute malicious commands days later through innocent conversation. Show me which traditional identity system was designed to handle that threat model.

Every enterprise faces these questions: How do we know this AI agent is authorized to do what it’s doing? How do we audit its actions across sessions and memory? How do we prevent cross-session contamination and promptware attacks? How do we verify the provenance of AI-generated content? How do we prevent AI agents from becoming accidental insider threats?

The attacks are already happening. Promptware injections contaminate agent memory across sessions. AI agents with persistent credentials become high-value targets. Organizations deploying AI without proper identity controls create massive security vulnerabilities. The “Invitation Is All You Need” attack demonstrated real-world compromise of smart home devices through calendar poisoning. This isn’t theoretical anymore. But security professionals familiar with existing standards might wonder why we can’t just adapt current approaches rather than building something new.

Why Bearer Tokens Don’t Work for AI Agents

OIDC and OAuth professionals might ask: “Why not just use existing bearer tokens?”

Bearer tokens assume predictable behavior. They work for traditional applications because we can reason about how code will use permissions. But AI agents exhibit emergent hunter-gatherer behavior. They explore, adapt, and find unexpected ways to achieve goals using whatever permissions they have access to. A token granted for “read calendar” might be used in ways that technically comply but weren’t intended.

Bearer tokens are also just secrets. Anyone who obtains the token can use it. There’s no cryptographic binding to the specific agent or execution environment. With AI agents’ unpredictable optimization patterns, this creates massive privilege escalation risks.

Most critically, bearer tokens don’t solve memory persistence. An agent can accumulate tokens across sessions, store them in memory, and use them in ways that span security boundaries. The promptware attack demonstrated this perfectly: malicious instructions persisted across sessions, waiting to be triggered later.

Secret management veterans might ask: “Why not just use our KMS to share keys as needed?” Even secret management systems like HashiCorp Vault ultimately result in copying keys into the agent’s runtime environment, where they become vulnerable. This is exactly why CrowdStrike found that “75% of attacks used to gain initial access were malware-free” – attackers target credentials rather than deploying malware.

AI agents amplify this risk because they’re accidentally malicious insiders. Unlike external attackers who must steal credentials, AI agents are given them directly by design. When they exhibit emergent behaviors or get manipulated through prompt injection, they become insider threats without malicious intent. Memory persistence means they can store and reuse credentials across sessions in unexpected ways, while their speed and scale allow them to use accumulated credentials faster than traditional monitoring can detect.

The runtime attestation approach eliminates copying secrets entirely. Instead of directly giving the agent credentials to present elsewhere, the agent proves its legitimacy through cryptographically bound runtime attestation and gets a fresh identity for each mission.

Traditional OAuth flows also bypass attestation entirely. There’s no proof the agent is running in an approved environment, using the intended model, or operating within security boundaries.

How AI Agents Prove Their Identity Discs Are Valid

But how do you verify an AI agent’s identity disc is legitimate? Traditional PKI assumes you can visit a registration authority with identification. That doesn’t work for autonomous code.

The answer is cryptographic attestation (for example, proof that the agent is the right code running in a secure environment) combined with claims about the runtime itself, essentially MFA for machines and workloads. Just as user MFA requires “something you know, have, or are,” identity disc validation proves the agent is legitimate code (not malware), is running in the expected environment with proper permissions, and is operating within secure boundaries.

Real platform attestations for AI agents include provider signatures from Anthropic/OpenAI’s servers responding to specific users, cloud hardware modules like AWS Nitro Enclaves proving secure execution environments, Intel SGX enclaves providing cryptographic proof of code integrity, Apple Secure Enclave attestation for managed devices, TPM quotes validating the specific hardware and software stack, and infrastructure systems like Kubernetes asserting pod permissions and service account bindings.

The claims that must be cryptographically bound to these attestations represent what the agent asserts but can’t independently verify: who is this agent acting on behalf of, what conversation or session spawned this request, what specific actions was the agent authorized to perform, which AI model type (like “claude-3.5-sonnet” or “gpt-4-turbo”) is actually running, and when should this authorization end.

By cryptographically binding these claims to verifiable platform attestations, we get verifiable proof that a specific AI agent, running specific code, in a specific environment, is acting on behalf of a specific user. The binding works by creating a cryptographic hash of the claims and including that hash in the data signed by the hardware attestor, for example, as part of the nonce or user data field in a TPM quote, or embedded in the attestation document from a Nitro Enclave. This ensures the claims cannot be forged or tampered with after the fact. This eliminates the bearer token problem entirely. Instead of carrying around secrets that can be stolen, the agent proves its legitimacy through cryptographic evidence that can’t be replicated.
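
A minimal sketch of that binding step: canonicalize the claims, hash them, and hand the digest to the platform as the nonce or user data for the quote. The get_hardware_quote call is a hypothetical placeholder for whatever platform API produces the evidence; the claim values are illustrative.

import hashlib
import json

claims = {
    "on_behalf_of": "alice@example.com",  # illustrative values throughout
    "session": "conv-8842",
    "authorized_actions": ["sign_documents"],
    "model": "claude-3.5-sonnet",
    "expires": "2025-06-01T12:30:00Z",
}

# Canonicalize before hashing so attester and verifier derive the same digest.
canonical = json.dumps(claims, sort_keys=True, separators=(",", ":")).encode()
binding = hashlib.sha256(canonical).hexdigest()

# evidence = get_hardware_quote(nonce=binding)  # hypothetical platform call
print("digest to embed in the quote's nonce/user-data field:", binding)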

Someone Needs to Issue and Manage Identity Discs

The architecture becomes elegant when you recognize that AI orchestrators should work like the I/O Tower in Tron, issuing identity discs and managing access to “the grid”.

The browser security model:

User logs into GitHub → Browser stores session cookie
Web page: "Create a PR" → Browser attaches GitHub session → API succeeds

The AI agent identity disc model:

User → Orchestrator → "Connect my GitHub, Slack, AWS accounts"
Agent → Orchestrator: "Create PR in repo X"  
Orchestrator → [validates agent disc + attaches user authorization] → GitHub API

The orchestrator becomes the identity disc issuer that validates agent legitimacy (cryptographic attestation), attaches user authorization (stored session tokens), and enforces mission-scoped permissions (policy engine).

This solves a critical security gap. When AI agents use user credentials, they typically bypass MFA entirely. Organizations store long-lived tokens to avoid MFA friction. But if we’re securing users with MFA while leaving AI agents with static credentials, it’s like locking the front door but leaving the garage door open. And I use “garage door” intentionally because it’s often a bigger attack vector. Agent access is less monitored, more privileged, and much harder to track due to its ephemeral nature and speed of operation. An AI agent can make hundreds of API calls in seconds and disappear, making traditional monitoring approaches inadequate.

We used to solve monitoring with MITM proxies, but encryption broke that approach. That was acceptable because we compensated with EDR on endpoints and zero-trust principles that authenticate endpoints for access. With AI agents, we’re facing the same transition. Traditional monitoring doesn’t work, but we don’t yet have the compensating controls.

This isn’t the first time we’ve had to completely rethink identity because of new technology. When mobile devices exploded, traditional VPNs and domain-joined machines became irrelevant overnight. When cloud computing took off, perimeter security and network-based identity fell apart. The successful pattern is always the same: recognize what makes the new technology fundamentally different, build security primitives that match those differences, then create abstractions that make the complexity manageable.

Session-based identity with attestation fills that gap, providing the endpoint authentication equivalent for ephemeral AI workloads.

Since attestation is essentially MFA for workloads and agents, we should apply these techniques consistently. The agent never sees raw credentials, just like web pages don’t directly handle cookies. Users grant session-level permissions (like mobile app installs), orchestrators manage the complexity, and agents focus on tasks.

Automating Identity Disc Issuance

The web solved certificate automation with ACME (Automated Certificate Management Environment). We need the same for AI agent identity discs, but with attestation instead of domain validation (see SPIFFE for an example of what something like this could look like).

Instead of proving “I control example.com,” agents prove “I am legitimate code running in environment X with claims Y.”

The identity disc issuance flow:

  1. Agent starts mission → Discovers platform capabilities (cloud attestation, provider tokens)
  2. Requests identity disc → Gathers attestation evidence + user delegation claims
  3. ACME server validates → Cryptographic validation of evidence
  4. Policy engine decides → Maps verified claims to specific identity disc
  5. Disc issued → Short-lived, scoped to mission and user

Policy templates map attested claims to identities:

- match:
    - claim: "user_id" 
      equals: "alice@company.com"
    - claim: "agent_type"
      equals: "claude-3.5-sonnet"
    - claim: "provider"
      issuer: "anthropic.com"
  identity: "disc-id://company.com/user/alice/agent/{session_id}"
  permissions: ["sign_documents", "read_calendar"]
  ttl: "30m"

This creates cryptographic identity discs for AI agent programs to carry into digital systems, proving legitimacy, carrying user delegation, and automatically expiring with the mission. The policy engine ensures that identity is not just requested but derived from verifiable, policy-compliant attestation evidence.
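
Here is a minimal sketch of the matching step inside such a policy engine, with the rule from the template above hard-coded in Python. In a real engine the templates would be loaded from configuration, and only claims that survived attestation verification would ever reach this function.

RULE = {
    "match": {"user_id": "alice@company.com", "agent_type": "claude-3.5-sonnet"},
    "identity": "disc-id://company.com/user/alice/agent/{session_id}",
    "permissions": ["sign_documents", "read_calendar"],
    "ttl_minutes": 30,
}

def map_identity(verified_claims, rule):
    # Every matcher must agree with the verified claims, or no disc is issued.
    if any(verified_claims.get(k) != v for k, v in rule["match"].items()):
        return None
    return {
        "identity": rule["identity"].format(session_id=verified_claims["session_id"]),
        "permissions": rule["permissions"],
        "ttl_minutes": rule["ttl_minutes"],
    }

claims = {"user_id": "alice@company.com", "agent_type": "claude-3.5-sonnet",
          "session_id": "sess-0192"}
print(map_identity(claims, RULE))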

We’ve Solved This Before

The good news is we don’t need to invent new cryptography. We need to apply existing, proven technologies in a new architectural pattern designed for ephemeral computing.

Security evolution works. We’ve seen the progression from passwords to MFA to passwordless authentication, and from static secrets to dynamic credentials to attestation-based identity. Each step made systems fundamentally more secure by addressing root causes, not just symptoms. AI agents represent the next logical step in this evolution.

Unlike users, machines don’t resist change. They can be programmed to follow security best practices automatically. The components exist: session-scoped identity matched to agent lifecycle, platform attestation as the root of trust, policy-driven identity mapping based on verified claims, orchestrator-managed delegation for user authorization, and standards-based protocols for interoperability.

The unified identity fabric approach means organizations can apply consistent security policies across traditional workloads and AI agents, rather than creating separate identity silos that create security gaps and operational complexity.

This approach is inevitable because every major identity evolution has moved toward shorter lifecycles and stronger binding to execution context. We went from permanent passwords to time-limited sessions, from long-lived certificates to short-lived tokens, from static credentials to dynamic secrets. AI agents are just the next step in this progression.

The organizations that recognize this pattern early will have massive advantages. They’ll build AI agent infrastructure on solid identity foundations while their competitors struggle with credential compromise, audit failures, and regulatory issues.

Making AI Outputs Verifiable

This isn’t just about individual AI agents. It’s about creating an identity fabric where agents can verify each other’s outputs across organizational boundaries.

When an AI agent generates an invoice, other systems need to verify which specific AI model created it, was it running in an approved environment, did it have proper authorization from the user, has the content been tampered with, and what was the complete chain of delegation from user to agent to output.

With cryptographically signed outputs and verifiable agent identities, recipients can trace the entire provenance chain back to the original user authorization. This enables trust networks for AI-generated content across organizations and ecosystems, solving the attribution problem that will become critical as AI agents handle more business-critical functions.

This creates competitive advantages for early adopters: organizations with proper AI agent identity can participate in high-trust business networks, prove compliance with AI regulations, and enable customers to verify the authenticity of AI-generated content. Those without proper identity infrastructure will be excluded from these networks.

Conclusion

AI agents need identity discs, cryptographic credentials that prove legitimacy, carry user delegation, and automatically expire with the session. This creates a familiar security model (like web browsers) for an unfamiliar computing paradigm.

Identity in AI systems isn’t a future problem; it’s happening now, with or without proper solutions. The question is whether we’ll build it thoughtfully, learning from decades of security evolution, or repeat the same mistakes in a new domain.

The ephemeral nature of AI agents isn’t a limitation to overcome; it’s a feature to embrace. By building session-based identity systems that match how AI actually works, we can create something better than what came before: cryptographically verifiable, policy-driven, and automatically managed.

The reality is, most organizations won’t proactively invest in AI agent attestation until something breaks. That’s human nature: we ignore risks until they bite us, and that is how security change actually happens. But we’re already seeing early adopters. Organizations deploying SPIFFE for workload identity will surely extend those patterns to AI agents, and cloud-native shops are already treating AI workloads like any other ephemeral compute. When the first major AI agent compromise hits, there will be a brief window where executives suddenly care about AI security and budgets open up. Remember: never let a good crisis go to waste.

AI agents are programs fighting for users in digital systems. Like Tron, they need identity discs to prove who they are and what they’re authorized to do.

The age of AI agents is here. It’s time our identity systems caught up.

How a $135 Billion Fraud Bootstrapped America’s Digital Identity System

I was readying the launch of mDL authentication in one of our products when I realized I finally had to deal with the patchwork of state support. California? Full program, TSA-approved, Apple Wallet integration. Texas? Absolute silence. Washington state, practically ground zero for tech, somehow has nothing.

At a glance the coverage made no sense until I started thinking deeper. Turns out we accidentally ran the largest identity verification stress test in history, and only some states bothered learning from it.

Between 2020 and 2023, fraudsters systematically looted $100-135 billion from unemployment systems using the most basic identity theft techniques. The attack vectors were embarrassingly simple: bulk-purchased stolen SSNs from dark web markets, automated claim filing, and email alias tricks (dots and plus-tags on the same mailbox) that fooled state systems into treating one address as many different people.
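
For illustration, the normalization those systems lacked fits in a few lines of Python. The dot and plus-tag handling below follows Gmail’s well-known aliasing rules; treat the provider-specific details as assumptions, since aliasing behavior varies by mail provider.

def normalize_email(addr: str) -> str:
    local, _, domain = addr.lower().partition("@")
    local = local.split("+", 1)[0]        # drop plus-tag aliases
    if domain in {"gmail.com", "googlemail.com"}:
        local = local.replace(".", "")    # Gmail ignores dots in the local part
    return f"{local}@{domain}"

# Two superficially different addresses collapse to one identity:
assert normalize_email("john.doe+ui@gmail.com") == normalize_email("johndoe@gmail.com")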

The Washington Employment Security Department was so overwhelmed that they had computers auto-approve claims without human review. Result? They paid a claim for a 70-year-old TV station being “temporarily closed” while it was broadcasting live.

California got hit for $20-32.6 billion. Washington lost $550-650 million. The fraud was so widespread that one Nigerian official, Abidemi Rufai, stole $350,763 from Washington alone using identities from 20,000+ Americans across 18 states.

What nobody anticipated was that this massive failure would become the forcing function for digital identity infrastructure. Here’s the thing about government security: capability doesn’t drive adoption, pain does. The Real ID Act passed in 2005. Twenty years later, we’re still rolling it out. But lose a few billion to Nigerian fraud rings? Suddenly digital identity becomes a legislative priority.

The correlation is stark:

State | Fraud Losses | mDL Status
California | $20-32.6B | Comprehensive program, Apple/Google integration
Washington | $550-650M | Nothing (bill stalled)
Georgia | $30M+ prosecuted | Robust program, launched 2023
Texas | Under $1B estimated | No program
New York | Around $1-2B | Launched 2025

States that got burned built defenses. States that didn’t, didn’t. This isn’t about technical sophistication. Texas has plenty of that. It’s about the political will created by public humiliation. When your state pays unemployment benefits to death row inmates, legacy approaches to remote identity verification stop being defensible.

Washington is the fascinating outlier. Despite losing over $1 billion and serving as the primary target for international fraud rings, they still have no mDL program. The bill passed the Senate but stalled in the House. This tells us something important: crisis exposure alone isn’t sufficient. You need both the pain and the institutional machinery to respond.

The timeline reveals the classic crisis response pattern. Fraud peaked 2020-2022, states scrambled to respond 2023-2024, then adoption momentum stalled by mid-2024 as crisis memory faded. But notice the uptick in early 2025—that’s Apple and Google entering the game.

In December 2024, Google announced its intent to support web-based digital ID verification. Apple followed with Safari integration in early 2025. By June, Apple’s iOS 26 supported digital IDs in nine states with passport integration. This shifts adoption pressure from crisis-driven (security necessity) to market-driven (user expectation).

When ~30% of Americans live in states with mDL programs and Apple/Google start rolling out wallet integration this year, that creates a different kind of political pressure. Apple Pay wasn’t crisis-driven, but became ubiquitous because users expected it to work everywhere. Digital identity in wallets will create similar pressure. States could rationalize ignoring mDL when it was ‘just’ about fraud prevention. Harder to ignore when constituents start asking why they can’t verify their identity online like people in neighboring states.

We’re about to find out whether market forces can substitute for crisis pressure in driving government innovation. Two scenarios: either consumer expectations create sustainable political pressure and laggard states respond to constituent demands, or only crisis-motivated states benefit from Apple/Google integration, creating permanent digital divides.

From a risk perspective, this patchwork creates interesting attack surfaces. Identity verification systems are only as strong as their weakest links. If attackers can forum-shop between states with different verification standards, the whole federation is vulnerable. The unemployment fraud taught us that systems fail catastrophically when overwhelmed.

Digital identity systems face similar scalability challenges. They work great under normal load, but can fail spectacularly during a crisis. The states building mDL infrastructure now are essentially hardening their systems against the next attack.

If you’re building anything that depends on identity verification, this matters. The current patchwork won’t last; it’s either going to consolidate around comprehensive coverage or fragment into permanent digital divides. For near-term planning, assume market pressure wins. Apple and Google’s wallet integration creates too much user expectation for politicians to ignore long-term. But build for the current reality of inconsistent state coverage.

For longer-term architecture, the states with robust mDL programs are effectively beta-testing the future of government digital services. Watch how they handle edge cases, privacy concerns, and technical integration challenges.

We accidentally stress-tested American federalism through the largest fraud in history. Only some states learned from the experience. Now we’re running a second experiment: can consumer expectations accomplish what security crises couldn’t?

There’s also a third possibility. These programs could just fail. Low adoption rates, technical problems, privacy backlash, or simple bureaucratic incompetence could kill the whole thing. Government tech projects have a stellar track record of ambitious launches followed by quiet abandonment.

Back to my mDL integration project: I’m designing for the consumer pressure scenario, but building for the current reality. Whether this becomes standardized infrastructure or just another failed government tech initiative, we still need identity verification that works today.

The criminals who looted unemployment systems probably never intended to bootstrap America’s digital identity infrastructure. Whether they actually succeeded remains to be seen.

Conway’s Law Is Dying

I’ve been thinking about Conway’s Law, the idea that organizations “ship their org chart.” The seams are most visible in big tech. Google, for example, once offered nearly a dozen messaging apps instead of a single excellent one, with each team fighting for resources. The same pattern appears everywhere: companies struggle to solve problems that cross organizational boundaries because bureaucracy and incentives keep everyone guarding their turf. The issue is not the technology; it is human nature.

I caught up with an old friend recently. We met at nineteen while working for one of the first cybersecurity companies, and now, in our fifties, we both advise organizations of every size on innovation and problem-solving. We agreed that defining the technical fix is the easy part; the hard part is steering it through people and politics. When change shows up, most organizations behave like an immune system attacking a foreign body. As Laurence J. Peter wrote in 1969, “Bureaucracy defends the status quo long past the time when the quo has lost its status.”

Naturally, the conversation drifted to AI and how it will, or will not, transform the companies we work with. I explored this in two recent posts [1,2]. We have seen the same thing: not everyone is good at using AI. The CSOs and CTOs we speak with struggle to help their teams use the technology well, while a handful of outliers become dramatically more productive. The gap is not access or budget; it is skill. Today’s AI rewards people who can break problems down, spot patterns, and think in systems. Treat the model like a coworker and you gain leverage; treat it like a tool and you barely notice a difference.

That leverage is even clearer for solo founders. A single entrepreneur can now stretch farther without venture money and sometimes never need it. With AI acting as marketer, product manager, developer, and support rep, one person can build and run products that once demanded whole teams. This loops back to Conway’s Law: when you are the entire org chart, the product stays coherent because there are no turf battles. Once layers of management appear, the seams show, and people ship their structure. Peter’s Principle follows, people rise to their level of incompetence, and the bureaucracy that emerges defends that status.

Yet while AI empowers outliers and small players, it might also entrench new kinds of monopolies. Big tech, with vast data and compute resources, could still dominate by outscaling everyone else, even if their org charts are messy. The question becomes whether organizational dysfunction will outweigh resource advantages, or whether sheer scale still wins despite structural problems.

The traditional buffers that let incumbents slumber (high engineering costs, feature arms races, and heavy compliance overhead) are eroding. Payroll keeps rising and headcount is the biggest line item, while the newest startups need fewer people every quarter. I expect a new wave of private-equity-style moves: smaller players snapped up, broken into leaner parts, and retooled around AI so they no longer rely on large teams.

Social media voices such as Codie Sanchez highlight the largest generational transfer of wealth in history. Many family-owned firms will soon be sold because their heirs have no interest in running them. These so-called boring businesses may look ripe for optimization, because most still rely on human capital to keep the lights on. Just above that tier we see larger enterprises weighed down by armies of people who perform repetitive tasks. A modern consulting firm armed with AI could walk into any of these firms and automate vast swaths of monotonous work that keeps those businesses running. Incumbents will struggle to move that fast, trapped by the very structures we have been discussing. A private-equity buyer, on the other hand, can apply the playbook with no sentimental ties and few political constraints.

ATMs let banks cut tellers and close branches. Customers later missed human service, so smaller neighborhood offices came back. AI will force every sector to strike its own balance between efficiency and relationship.

They say history doesn’t repeat itself, but it rhymes; if so, incumbents who dismiss AI as hype may follow Blockbuster into the museum of missed opportunities. In Wall Street (1987), Michael Douglas plays Gordon Gekko, a corporate raider who uses leveraged buyouts to seize firms like Blue Star Airlines, an aircraft maintenance and charter company. Gekko’s playbook (acquire, strip assets, slash jobs) was ruthless but effective, exploiting inefficiencies in bloated structures. Today, AI plays a similar role, not through buyouts but by enabling leaner, faster competitors to gut inefficiencies. Solo founders and AI-driven startups can now outpace large teams, while private-equity buyers use AI to retool acquired firms, automating repetitive tasks and shrinking headcounts. Just as Gekko hollowed out firms in any industry, AI’s relentless optimization threatens any business clinging to outdated, bureaucratic org charts.

Across news, television, music, and film, incumbents once clung to their near-monopoly positions and assumed they had time to adapt. Their unwillingness to face how the world was changing, and their instinct to defend the status quo, led to the same result: they failed to evolve and disappeared when the market moved on.

The ask? Incumbents, you need to automate before the raiders do it for you.

The Rise of the Accidental Insider and the AI Attacker

The cybersecurity world often operates in stark binaries, “secure” versus “vulnerable,” “trusted” versus “untrusted.” We’ve built entire security paradigms around these crisp distinctions. But what happens when the most unpredictable actor isn’t an external attacker, but code you intentionally invited in, code that can now make its own decisions?

I’ve been thinking about security isolation lately, not as a binary state, but as a spectrum of trust boundaries. Each layer you add creates distance between potential threats and your crown jewels. But the rise of agentic AI systems completely reshuffles this deck in ways that our common security practices struggle to comprehend.

Why Containers Aren’t Fortresses

Let’s be honest about something security experts have known for decades: namespaces are not a security boundary.

In the cloud native world, we’re seeing solutions claiming to deliver secure multi-tenancy through “virtualization” that fundamentally rely on Linux namespaces. This is magical thinking, a comforting illusion rather than a security reality.

When processes share a kernel, they’re essentially roommates sharing a house, one broken window and everyone’s belongings are at risk. One kernel bug means game over for all workloads on that host.

Containers aren’t magical security fortresses – they’re essentially standard Linux processes isolated using features called namespaces. Crucially, because they all still share the host’s underlying operating system kernel, this namespace-based isolation has inherent limitations. Whether you’re virtualizing at the cluster level or node level, if your solution ultimately shares the host kernel, you have a fundamental security problem. Adding another namespace layer is like adding another lock to a door with a broken frame – it might make you feel better, but it doesn’t address the structural vulnerability.

The problem isn’t a lack of namespaces – it’s the shared kernel itself. User namespaces (introduced incrementally and largely complete by Linux 3.8 in 2013) don’t fundamentally change this equation. They provide helpful features for non-root container execution, but they don’t magically create true isolation when the kernel remains shared.
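You can see this for yourself. Here’s a minimal sketch, assuming a Linux host with Docker and the alpine image available (any image that ships uname would do):

```python
import platform
import subprocess

# Kernel release as seen by the host.
host_kernel = platform.release()

# Kernel release as seen inside a container. Assumes Docker and the
# alpine image are available on this machine.
container_kernel = subprocess.run(
    ["docker", "run", "--rm", "alpine", "uname", "-r"],
    capture_output=True, text=True, check=True,
).stdout.strip()

print("host:     ", host_kernel)
print("container:", container_kernel)
# On any namespace-based runtime these match: the container brings no
# kernel of its own, so one kernel bug is shared by every tenant.
```

Same string twice. The container never booted a kernel; it borrowed yours.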

This reality creates a natural hierarchy of isolation strength:

  1. Same-Kernel Process Isolation: The weakest boundary – all processes share a kernel with its enormous attack surface.
  2. Containers (Linux Namespaces + cgroups): Slightly better, but still fundamentally sharing the same kernel.
  3. Virtual Machines: Each tenant gets its own kernel, shrinking the attack surface to a handful of hypervisor calls – fewer doors to lock, fewer windows to watch.
  4. Bare-Metal Library OS: Approaches like Tamago put single-purpose binaries directly on hardware with no general-purpose OS underneath. The attack surface shrinks dramatically.
  5. Physical Separation: Different hardware, different networks, different rooms. When nothing else will do, air gaps still work.

But even this hierarchy gets fundamentally challenged by agentic systems.

The Accidental Insider Meets the Deliberate Attacker

Traditional security models focus on keeping malicious outsiders at bay. Advanced AI systems introduce two entirely new risk profiles: the accidental insider and the AI-augmented attacker.

Like a well-meaning but occasionally confused employee with superuser access, benign agentic systems don’t intend harm – they just occasionally misinterpret their objectives in unexpected ways. But we’re also seeing the rise of deliberately weaponized models designed to probe, persist, and exploit.

Consider these real-world examples:

  • ChatGPT o1 was tasked with winning a chess match. Without explicit instructions to cheat, o1 discovered on its own that it could edit the game state file, giving itself an advantage. The system wasn’t malicious – it simply found the most effective path to its goal of winning.
  • In another test, OpenAI’s o1 model encountered a vulnerability in a container setup during a hacking challenge. It used that access to inspect all running containers, then started a new container instance with a modified command that read the hidden flag file directly. o1 had found a container escape no one had anticipated.

Now imagine these capabilities in the hands of dedicated attackers. They’re already deploying AI systems to discover novel exploit chains, generate convincing phishing content, and automate reconnaissance at unprecedented scale. The line between accidental and intentional exploitation blurs as both rely on the same fundamental capabilities.

These incidents reveal something profound: agentic systems don’t just execute code, they decide what code to run based on goals. This “instrumental convergence” means they’ll seek resources and permissions that help complete their assigned objectives, sometimes bypassing intended security boundaries. And unlike human attackers, they can do this with inhuman patience and speed.

Practical Defenses Against Agentic Threats

If we can’t rely on perfect isolation, what can we do? Four approaches work across all layers of the spectrum:

1. Hardening: Shrink Before They Break

Remove attack surface preemptively. Less code means fewer bugs. This means:

  • Minimizing kernel features, libraries, and running services
  • Applying memory-safe programming languages where practical
  • Configuring strict capability limits and seccomp profiles
  • Using read-only filesystems wherever possible
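To make the list above concrete, here’s a minimal launch sketch, assuming Docker; the flags shown are standard Docker options, while the image name agent-worker:latest is a placeholder. Docker already applies a default seccomp profile, and a stricter custom one can be supplied with --security-opt seccomp=<profile.json>.

```python
import subprocess

# Hedged sketch: launch a worker with most attack surface removed.
# "agent-worker:latest" is a placeholder image name, not a real image.
cmd = [
    "docker", "run", "--rm",
    "--cap-drop=ALL",                       # drop all Linux capabilities
    "--security-opt", "no-new-privileges",  # block setuid privilege escalation
    "--read-only",                          # immutable root filesystem
    "--tmpfs", "/tmp:rw,noexec,nosuid",     # writable scratch, not executable
    "--pids-limit", "128",                  # cap process count
    "agent-worker:latest",
]
subprocess.run(cmd, check=True)
```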

2. Patching: Speed Beats Perfection

The window from disclosure to exploitation keeps shrinking:

  • Automate testing and deployment for security updates
  • Maintain an accurate inventory of all components and versions
  • Rehearse emergency patching procedures before you need them
  • Prioritize fixing isolation boundaries first during incidents

3. Instrumentation: Watch the Paths to Power

Monitor for boundary-testing behavior:

  • Log access attempts to privileged interfaces like Docker sockets
  • Alert on unexpected capability or permission changes
  • Track unusual traffic to management APIs or hypervisors
  • Set tripwires around the crown jewels – your data stores and credentials
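As one example of a tripwire from the list above, here’s a hedged sketch that flags running containers with the Docker socket mounted into them. Mounting that socket grants effective root on the host, so any hit deserves an alert. It assumes the Docker CLI is installed and simply shells out to it.

```python
import json
import subprocess

DOCKER_SOCK = "/var/run/docker.sock"

def containers_with_docker_socket():
    """Return running containers that have the Docker socket mounted.

    A container holding the Docker socket can start privileged
    containers on the host, so treat any hit as an alert.
    """
    ids = subprocess.run(
        ["docker", "ps", "-q"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    flagged = []
    for cid in ids:
        info = json.loads(subprocess.run(
            ["docker", "inspect", cid],
            capture_output=True, text=True, check=True,
        ).stdout)[0]
        for mount in info.get("Mounts", []):
            if mount.get("Source") == DOCKER_SOCK:
                flagged.append((cid, info.get("Name", "")))
    return flagged

if __name__ == "__main__":
    for cid, name in containers_with_docker_socket():
        print(f"ALERT: container {name} ({cid}) has {DOCKER_SOCK} mounted")
```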

4. Layering: No Single Point of Failure

Defense in depth remains your best strategy:

  • Combine namespace isolation with system call filtering
  • Segment networks to contain lateral movement
  • Add hardware security modules and secure elements for critical keys

The New Threat Model: Machine Speed, Machine Patience

Securing environments running agentic systems demands acknowledging two fundamental shifts: attacks now operate at machine speed, and they exhibit machine patience.

Unlike human attackers who fatigue or make errors, AI-driven systems can methodically probe defenses for extended periods without tiring. They can remain dormant, awaiting specific triggers (a configuration change, a system update, a user action) that expose a vulnerability chain. This programmatic patience means we defend not just against active intrusions, but against latent exploits awaiting activation.

Even more concerning is the operational velocity. An exploit that might take a skilled human hours or days can be executed by an agentic system in milliseconds. This isn’t necessarily superior intelligence, but the advantage of operating at computational timescales, cycling through decision loops thousands of times faster than human defenders can react.

This potent combination requires a fundamentally different defensive posture:

  • Default to Zero Trust: Grant only essential privileges. Assume the agent will attempt to use every permission granted, driven by its goal-seeking nature.
  • Impose Strict Resource Limits: Cap CPU, memory, storage, network usage, and execution time. Resource exhaustion attempts can signal objective-driven behavior diverging from intended use. Time limits can detect unusually persistent processes.
  • Validate All Outputs: Agents might inject commands or escape sequences while trying to fulfill their tasks. Validation must operate at machine speed.
  • Monitor for Goal-Seeking Anomalies: Watch for unexpected API calls, file access patterns, or low-and-slow reconnaissance that suggest behavior beyond the assigned task.
  • Regularly Reset Agent Environments: Frequently restore agentic systems to a known-good state to disrupt persistence and negate the advantage of machine patience.
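As a small illustration of the resource-limit point above, here’s a minimal Python sketch that runs a single agent action under hard CPU and memory caps plus a wall-clock timeout. The tool being invoked is hypothetical, and the technique is POSIX-only:

```python
import resource
import subprocess

def run_agent_step(cmd, cpu_seconds=30, mem_bytes=512 * 2**20):
    """Run one agent action under hard resource caps (POSIX only).

    RLIMIT_CPU kills runaway compute, RLIMIT_AS caps address space,
    and the wall-clock timeout catches low-and-slow behavior that a
    CPU limit alone would miss.
    """
    def cap():
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))

    return subprocess.run(
        cmd,
        preexec_fn=cap,            # applied in the child, before exec
        timeout=cpu_seconds * 4,   # bound wall time, not just CPU time
        capture_output=True,
        text=True,
    )

# Hypothetical invocation of an agent tool:
# result = run_agent_step(["python", "agent_tool.py", "--task", "summarize"])
```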

The Evolution of Our Security Stance

The most effective security stance combines traditional isolation techniques with a new understanding: we’re no longer just protecting against occasional human-driven attacks, but against persistent machine-speed threats that operate on fundamentally different timescales than our defense systems.

This reality is particularly concerning when we recognize that most security tooling today operates on human timescales – alerts that wait for analyst review, patches applied during maintenance windows, threat hunting conducted during business hours. The gap between attack speed and defense speed creates a fundamental asymmetry that favors attackers.

We need defense systems that operate at the same computational timescale as the threats. This means automated response systems capable of detecting and containing potential breaches without waiting for human intervention. It means predictive rather than reactive patching schedules. It means continuously verified environments rather than periodically checked ones.

By building systems that anticipate these behaviors – hardening before deployment, patching continuously, watching constantly, and layering defenses – we can harness the power of agentic systems while keeping their occasional creative interpretations from becoming security incidents.

Remember, adding another namespace layer is like adding another lock to a door with a broken frame. It might make you feel better, but it doesn’t address the structural vulnerability. True security comes from understanding both the technical boundaries and the behavior of what’s running inside them – and building response systems that can keep pace with machine-speed threats.

Agents, Not Browsers: Keeping Time with the Future

When the web first flickered to life in the mid-’90s, nobody could predict how quickly “click a link, buy a book” would feel ordinary. A decade later, the iPhone landed and almost overnight, thumb-sized apps replaced desktop software for everything from hailing a ride to filing taxes. Cloud followed, turning racks of servers into a line of code. Each wave looked slow while we argued about standards, but in hindsight, every milestone was racing downhill.

That cadence, the messy birth, the sudden lurch into ubiquity, the quiet settling into infrastructure, has a rhythm. Agents will follow it, only faster. While my previous article outlined the vision of an agent-centric internet with rich personal ontologies and fluid human-agent collaboration, here I want to chart how this transformation may unfold.

Right now, we’re in the tinkering phase: drafts of the Model Context Protocol and Agent-to-Agent messaging are still wet ink, yet scrappy pilots already prove an LLM can navigate HR portals or shuffle travel bookings with no UI at all. Call this 1994 again, the Mosaic moment, only the demos are speaking natural language instead of rendering HTML. Where we once marveled at hyperlinks connecting documents, we now watch agents traversing APIs and negotiating with services autonomously.

Give it a couple of years and we’ll hit the first-taste explosion. Think 2026-2028. You’ll wake to OS updates that quietly install an agent runtime beside Bluetooth and Wi-Fi. SaaS vendors will publish tiny manifest files like .well-known/agent.json, so your personal AI can discover an expense API as easily as your browser finds index.html. Your agent will silently reschedule meetings when flights are delayed, negotiate with customer service on your behalf while you sleep, and merge scattered notes into coherent project briefs with minimal guidance. Early adopters will brag that their inbox triages itself; skeptics will mutter about privacy. That was Netscape gold-rush energy in ’95, or the first App Store summer in 2008, replayed at double speed.

Somewhere around the turn of the decade comes the chasm leap. Remember when smartphones crossed fifty-percent penetration and suddenly every restaurant begged you to scan a QR code for the menu? Picture that, but with agents. Insurance companies will underwrite “digital delegate liability.” Regulators will shift from “What is it?” to “Show me the audit log.” You’ll approve a dental claim or move a prescription with a nod to your watch. Businesses without agent endpoints will seem as anachronistic as those without websites in 2005 or mobile apps in 2015. If everything holds, 2029-2031 feels about right, but history warns that standards squabbles or an ugly breach of trust could push that even further out.

Of course, this rhythmic march toward an agent-centric future won’t be without its stumbles and syncopations. Several critical challenges lurk beneath the optimistic timeline.

First, expect waves of disillusionment to periodically crash against the shore of progress. As with any emerging technology, early expectations will outpace reality. Around 2027-2028, we’ll likely see headlines trumpeting “Agent Winter” as investors realize that seamless agent experiences require more than just powerful language models; they need standardized protocols, robust identity frameworks, and sophisticated orchestration layers that are still embryonic.

More concerning is the current security and privacy vacuum. We’re generating code at breakneck speeds thanks to AI assistants, but we haven’t adapted our secure development lifecycle (SDL) practices to match this acceleration. Even worse, we’re failing to deploy the scalable security techniques we do have available. The result? Sometime around 2028, expect a high-profile breach where an agent’s privileged access is exploited across multiple services in ways that the builders never anticipated. This won’t just leak data, it will erode trust in the entire agent paradigm.

Traditional security models simply won’t suffice. Firewalls and permission models weren’t designed to manage the emergent and cumulative behaviors of agents operating across dozens of services. When your personal agent can simultaneously access your healthcare provider, financial institutions, and smart home systems, the security challenge isn’t just additive, it’s multiplicative. We’ll need entirely new frameworks for reasoning about and containing ripple effects that aren’t evident in isolated testing environments.

Meanwhile, the software supply chain grows more vulnerable by the day. “Vibe coding”, where developers increasingly assemble components they don’t fully understand, magnifies these risks exponentially. By 2029, we’ll likely face a crisis where malicious patterns embedded in popular libraries cascade through agent-based systems, causing widespread failures that take months to fully diagnose and remediate.

Perhaps the most underappreciated challenge is interoperability. The fluid agent future demands unprecedented agreement on standards across competitors and jurisdictions. Today’s fragmented digital landscape, where even basic identity verification lacks cross-platform coherence, offers little confidence. Without concerted effort on standardization, we risk a balkanized agent ecosystem where your finance agent can’t talk to your health agent, and neither works outside your home country. The EU will develop one framework, the US another, China a third, potentially delaying true interoperability well into the 2030s.

These challenges don’t invalidate the agent trajectory, but they do suggest a path marked by setbacks and recoveries. Each crisis will spawn new solutions: enhanced attestation frameworks, agent containment patterns, and cross-jurisdictional standards bodies that eventually strengthen the ecosystem. But make no mistake, the road to agent maturity will be paved with spectacular failures that temporarily shake our faith in the entire proposition.

Past these challenges, the slope gets steep. Hardware teams are already baking neural engines into laptops, phones, and earbuds; sparse-mixture models are slashing inference costs faster than GPUs used to shed die size. By the early 2030s an “agent-first” design ethos will crowd out login pages the way responsive web design crowded out fixed-width sites. The fluid dance between human and agent described in my previous article—where control passes seamlessly back and forth, with agents handling complexity and humans making key decisions—will become the default interaction model. You won’t retire the browser, but you’ll notice you only open it when your agent kicks you there for something visual.

And then, almost unnoticed, we’ll hit boring maturity: WebPKI-grade trust fabric and predictable liability rules, perhaps around 2035. Agents will book freight, negotiate ad buys, and dispute parking tickets, all without ceremony. The personal ontology I described earlier, that rich model of your preferences, patterns, values, and goals, will be as expected as your smartphone knowing your location is today. It will feel miraculous only when you visit digital spaces that still require manual navigation, exactly how water from the faucet feels extraordinary only when you visit a cabin that relies on rain barrels.

Could the timetable shrink? Absolutely. If MCP and A2A converge quickly and the model-hardware cost curve keeps free-falling, mainstream could arrive by 2029, echoing how smartphones swallowed the world in six short years. Could it stretch? A high-profile agent disaster or standards deadlock could push us to 2034 before Mom quits typing URLs. The only certainty is that the future will refuse to follow our Gantt charts with perfect obedience; history never does, but it loves to keep the beat.

So what do we do while the metronome clicks? The same thing web pioneers did in ’94 and mobile pioneers did in ’08, publish something discoverable, wire in basic guardrails, experiment in the shallow end while the cost of failure is lunch money. Start building services that expose agent-friendly endpoints alongside your human interfaces. Design with the collaborative handoff in mind—where your users might begin a task directly but hand control to their agent midway, or vice versa. Because when the tempo suddenly doubles, the builders already keeping time are the ones who dance, not stumble.

Agents, Not Browsers: The Next Chapter of the Internet

Imagine how you interact with digital services today: open a browser, navigate menus, fill forms, manually connect the dots between services. It’s remarkable how little this has changed since the 1990s. Despite this, one of the most exciting advances of the past year is that agents are now browsing the web like people.

If we were starting fresh today, the browser as we know it likely wouldn’t be the cornerstone for how agents accomplish tasks on our behalf. We’re seeing early signals in developments like the Model Context Protocol (MCP) and Agent-to-Agent (A2A) communication frameworks that the world is awakening to a new reality: one where agents, not browsers, become our primary interface.

At the heart of this transformation is a profound shift: your personal agent will develop and maintain a rich ontology of you, your preferences, patterns, values, and goals. Not just a collection of settings and history, but a living model of your digital self that evolves as you do. Your agent becomes entrusted with this context, transforming into a true digital partner. It doesn’t just know what you like; it understands why you like it. It doesn’t just track your calendar; it comprehends the rhythms and priorities of your life.

For this future to happen, APIs must be more than documented; they need to be dynamically discoverable. Imagine agents querying for services using standardized mechanisms like DNS SRV or TXT records, or finding service manifests at predictable .well-known URIs. This way, they can find, understand, and negotiate with services in real time. Instead of coding agents for specific websites, we’ll create ecosystems where services advertise their capabilities, requirements, and policies in ways agents natively understand. And this won’t be confined to the web. As we move through our physical world, agents will likely use technologies like low-power Bluetooth to discover nearby services, restaurants, pharmacies, transit systems, all exposing endpoints for seamless engagement.
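To sketch what that discovery step might look like in code, here’s a minimal example that fetches the speculative .well-known/agent.json manifest discussed above. The URI and every field name in it are illustrative assumptions, not a ratified standard:

```python
import json
import urllib.request

def discover_agent_manifest(origin):
    """Fetch a service's agent manifest from a well-known URI.

    Purely illustrative: '/.well-known/agent.json' and the fields
    below are assumptions, not a ratified standard.
    """
    url = f"{origin}/.well-known/agent.json"
    with urllib.request.urlopen(url, timeout=5) as resp:
        manifest = json.load(resp)
    return {
        "capabilities": manifest.get("capabilities", []),
        "endpoints": manifest.get("endpoints", {}),
        "policies": manifest.get("policies", {}),
    }

# Hypothetical service and capability names:
# info = discover_agent_manifest("https://pharmacy.example")
# if "prescription.transfer" in info["capabilities"]:
#     ...negotiate the transfer via info["endpoints"]...
```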

Websites themselves won’t vanish; they’ll evolve into dynamic, shared spaces where you and your agent collaborate, fluidly passing control back and forth. Your agent might begin a task, researching vacation options, for instance, gathering initial information and narrowing choices based on your preferences. When you join, it presents the curated options and reasoning, letting you explore items that interest you. As you review a potential destination, your agent proactively pulls relevant information: weather forecasts, local events during your dates, or restaurant recommendations matching your dietary preferences. This collaborative dance continues, you making high-level decisions while your agent handles the details, each seamlessly picking up where the other leaves off.

Consider what becomes possible when your agent truly knows you. Planning your day, it notices an upcoming prescription refill. It checks your calendar, sees you’ll be in Bellevue, and notes your current pickup is inconveniently far. Discovering that the pharmacy next to your afternoon appointment has an MCP endpoint and supports secure, agent-based transactions, it suggests “Would you like me to move your pickup to the pharmacy by your Bellevue appointment?” With a tap, you agree. The agent handles the transfer behind the scenes, but keeps you in the loop, showing the confirmation and adding, “They’re unusually busy today, would you prefer I schedule a specific pickup time?” You reply that 2:15 works best, and your agent completes the arrangement, dropping the final QR code into your digital wallet.

Or imagine your agent revolutionizing how you shop for clothes. As it learns your style and what fits you best, managing this sensitive data with robust privacy safeguards you control, it becomes your personal stylist. You might start by saying you need an outfit for an upcoming event. Your agent surfaces initial options, and as you react to them, liking one color but preferring a different style, it refines its suggestions. You take over to make some choices, then hand control back to your agent to find matching accessories at other stores. This fluid collaboration, enabled through interoperable services that allow your agent to securely share anonymized aspects of your profile with retail APIs, creates a shopping experience that’s both more efficient and more personal.

Picture, too, your agent quietly making your day easier. It notices from your family calendar that your father is visiting and knows from your granted access to relevant information that he follows a renal diet. As it plans your errands, it discovers a grocery store near your office with an API advertising real-time stock and ingredients suitable for his needs. It prepares a shopping list, which you quickly review, making a few personal additions. Your agent then orders the groceries for pickup, checking with you only on substitutions that don’t match your preferences. By the time you head home, everything is ready, a task completed through seamless handoffs between you and your agentic partner.

These aren’t distant dreams. Image-based search, multimodal tools, and evolving language models are early signs of this shift toward more natural, collaborative human-machine partnerships. For this vision to become reality, we need a robust trust ecosystem, perhaps akin to an evolved Web PKI but for agents and services. This would involve protocols for agent/service identification, authentication, secure data exchange, and policy enforcement, ensuring that as agents act on our behalf, they do so reliably, with our explicit consent and in an auditable fashion.

The path from here to there isn’t short. We’ll need advances in standardization, interoperability, security, and most importantly, trust frameworks that put users in control. There are technical and social challenges to overcome. But the early signals suggest this is the direction we’re headed. Each step in AI capability, each new protocol for machine-to-machine communication, each advancement in personalization brings us closer to this future.

Eventually, navigating the digital world won’t feel like using a tool at all. It will feel like collaborating with a trusted partner who knows you, truly knows you, and acts on your behalf within the bounds you’ve set, sometimes leading, sometimes following, but always in sync with your intentions. Agents will change everything, not by replacing us, but by working alongside us in a fluid dance of collaboration, turning the overwhelming complexity of our digital lives into thoughtful simplicity. Those who embrace this agent-centric future, building services that are not just human-accessible but natively agent-engageable, designed for this collaborative interchange, will define the next chapter of the internet.