If we examine the contents of most secret management solutions, like HashiCorp Vault, we will find that we primarily store the logical equivalent of user IDs and passwords for services, workloads, and machines. Much like the old-school practice of writing passwords on post-it notes and sharing them, these secrets are then distributed to whatever entity needs access to the protected resource. As a result, it is no surprise that we encounter the same problems as password management: weak secrets, shared secrets, and stolen secrets. This is why we see machine credentials and keys being stolen in incidents like Storm-0558.
Secret Management: Password Management at Machine Scale
Secret management can be seen as password management at a machine scale. These systems centralize secrets (e.g., API keys, tokens, and passwords) to manage their sprawl and then make it easy to pass these shared secrets around various services and applications. However, just as simple passwords alone are seen as insufficient for user authentication—necessitating the adoption of multi-factor authentication (MFA) and migration to non-password authenticators like Passkeys—the same shortcomings apply to this legacy approach to managing these credentials.
The reality is that these secret managers were designed to address the symptoms rather than the root cause of secret sprawl. By storing and distributing these static secrets without treating them as credentials to be managed, you centralize the chaos but do not solve the fundamental problem: the need for robust, scalable machine identity management. Machines, services, and workloads, much like user credentials, require purpose-built approaches. Just as we have applied multi-factor authentication, identity governance and administration (IGA), and solutions like Okta to user management, we need analogous systems for machines.
Credential Management for Machines
At machine and workload scale, credential management solutions need to be architected differently to handle the dynamic nature of modern cloud environments. This is where purpose-built solutions come into play. These solutions must address the unique challenges of credential management in dynamic, cloud-native environments. Unlike secret management, which often relies on static secrets, credential management provides dynamic, attested identities for workloads. For example, SPIFFE (Secure Production Identity Framework for Everyone) offers a strategy for managing dynamic identities, providing the foundation for standards-based, scalable, and robust workload identity management.
Higher-Level Concepts in Workload Credential Management:
Provisioning and Deprovisioning:
Efficiently creating, updating, and removing machine identities as services start, stop, and scale.
Role-Based Access Control (RBAC):
Assigning access rights based on predefined roles, ensuring that machines and services have appropriate permissions (a minimal sketch follows this list).
Periodic Reviews:
Conducting regular reviews of machine access rights to ensure compliance and appropriateness.
Policy Enforcement:
Defining and enforcing access control policies, ensuring that machines and services adhere to security guidelines.
Audit and Reporting:
Generating comprehensive reports to demonstrate compliance with regulatory requirements and internal policies.
Risk Analysis:
Identifying and mitigating risks associated with machine identities and their entitlements.
Behavioral Analysis:
Monitoring machine behavior to detect anomalies and potential security threats.
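To make one of these functions concrete, below is a minimal sketch, in Python, of role-based access control keyed on workload identities rather than shared secrets. The identifiers, roles, and resources are hypothetical and purely illustrative; this is not any particular product’s API.

    # Minimal sketch: RBAC keyed on workload identities (all values hypothetical).
    WORKLOAD_ROLES = {
        "spiffe://example.org/ns/payments/sa/api": {"payments-reader"},
        "spiffe://example.org/ns/reporting/sa/batch": {"reports-writer"},
    }

    ROLE_PERMISSIONS = {
        "payments-reader": {("ledger-db", "read")},
        "reports-writer": {("reports-bucket", "write")},
    }

    def is_authorized(workload_id: str, resource: str, action: str) -> bool:
        """Return True if any role assigned to the workload permits the action."""
        roles = WORKLOAD_ROLES.get(workload_id, set())
        return any((resource, action) in ROLE_PERMISSIONS.get(role, set()) for role in roles)

    # A payments API workload may read the ledger but not write reports.
    assert is_authorized("spiffe://example.org/ns/payments/sa/api", "ledger-db", "read")
    assert not is_authorized("spiffe://example.org/ns/payments/sa/api", "reports-bucket", "write")

Enforcing decisions against identities like these, rather than against whoever happens to hold a copied secret, is what makes the review, audit, and policy functions above tractable.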
Just as legacy approaches to secret management fail for workloads, machines, and services, these legacy IGA approaches will not work either. We need built-for-purpose solutions to adapt and not repeat the mistakes of the past. Those solutions will look like credential management systems that enable these functions as a byproduct of how they work, rather than bolting them onto an existing secret management or IGA process.
This is why I have been an advisor to SPIRL since its founding. The founders, Evan and Eli, are why I knew I had to be involved. As authors of the standard with real production-at-scale experience running SPIFFE, they have what it takes, which is why I have been working with them in a more formal capacity as of this past month. They are building the infrastructure that makes, as I like to say, the right thing the easy thing, ensuring their customers don’t repeat the mistakes of the past when it comes to workload and service identity.
Conclusion
Secret management tools were a necessary step in addressing the sprawl of machine secrets, and they will continue to be needed. However, as the complexity and scale of cloud environments grow, so does the need for more sophisticated solutions for workload and service credentials. Credential management systems provide the dynamic, robust framework required for modern machine identity management.
By adopting credential management as the primary approach and using secret management as the exception, organizations can achieve greater security, scalability, and operational efficiency in their cloud-native architectures. To learn more about effective machine identity management, check out SPIRL and see how they are leading the way.
In the world of certificate lifecycle management for workloads, two approaches often come into focus: ACME (Automated Certificate Management Environment) and SPIFFE (Secure Production Identity Framework for Everyone). While both can deliver certificates to a device or workload, they cater to different use cases. Let’s explore their core differences and why these matter, especially for dynamic workload deployments.
Control Verification: ACME verifies control through various methods such as HTTP-01, DNS-01, TLS-ALPN, and External Account Binding.
Attestation (Optional): Attestation in ACME is secondary and optional, used primarily to verify that the key is well protected.
Pre-assigned Identifiers: ACME assumes that the identifier (like a domain name) is pre-assigned and focuses on validating control over this identifier.
This approach is particularly useful for scenarios where identifiers are static and pre-assigned, making it ideal for server-authenticated TLS and applications that rely on domain names.
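As a concrete illustration of control verification, here is a small Python sketch of how an ACME client derives the TXT record value for a DNS-01 challenge as described in RFC 8555: the base64url-encoded SHA-256 digest of the key authorization (the server-issued token joined to the account key’s JWK thumbprint). The token and thumbprint below are placeholder values.

    import base64
    import hashlib

    def dns01_txt_value(token: str, account_key_thumbprint: str) -> str:
        """Value published at _acme-challenge.<domain> for a DNS-01 challenge."""
        key_authorization = f"{token}.{account_key_thumbprint}"
        digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
        # ACME uses unpadded base64url encoding.
        return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")

    # Placeholder inputs; real values come from the ACME server and your account key.
    print(dns01_txt_value("example-token", "example-jwk-thumbprint"))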
SPIFFE: Dynamic Assignment of Identifier Based on Attestation
SPIFFE, conversely, is designed for dynamic workloads, which frequently change as services scale or update. SPIFFE assigns identifiers to workloads dynamically, based on attestation.
Identifier Assignment: SPIFFE assigns an identifier to the subject (such as a workload), using attestation about the subject to construct this identifier (see the sketch after this list).
Namespacing and Least Privilege: SPIFFE facilitates the namespacing of identifiers, building a foundation that enables authorization frameworks that promote least privilege, ensuring workloads only access necessary resources.
Minimal Security Domains: At the core of SPIFFE is the concept of security domains, which serve as trust boundaries between services, helping to minimize the attack surface.
JWT and X.509: SPIFFE supports SVIDs in both X.509 and JWT formats, enabling seamless integration with various systems and protocols within cloud deployments.
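To give a feel for what dynamic assignment means in practice, here is a rough Python sketch of deriving a SPIFFE ID from attested properties of a workload. The trust domain, attestation fields, and path layout are illustrative; real implementations such as SPIRE define their own attestors and path conventions.

    def spiffe_id_from_attestation(trust_domain: str, attested: dict) -> str:
        """Build a SPIFFE ID from attested workload properties (illustrative only)."""
        # e.g., a node attestor proves the machine, a workload attestor proves
        # the Kubernetes namespace and service account the workload runs under.
        namespace = attested["k8s_namespace"]
        service_account = attested["k8s_service_account"]
        return f"spiffe://{trust_domain}/ns/{namespace}/sa/{service_account}"

    attested = {"k8s_namespace": "payments", "k8s_service_account": "api"}
    print(spiffe_id_from_attestation("example.org", attested))
    # -> spiffe://example.org/ns/payments/sa/api

Because the identifier is a function of what the platform attests about the workload, it can be minted and retired as quickly as the workload itself.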
Importance in Dynamic Workload Deployments
The differences between ACME and SPIFFE are particularly significant in dynamic workload deployments:
Flexibility and Scalability: SPIFFE’s dynamic identifier assignment is highly suitable for environments where workloads are frequently spun up and down, such as in microservices architectures and Kubernetes clusters.
Role-Based Authentication: By making attestation a core component and promoting least privilege, SPIFFE ensures that each workload is authenticated and authorized precisely for its intended role.
Granular Policy Enforcement: SPIFFE’s namespacing and minimal security domain features enable fine-grained policy enforcement, allowing organizations to define and enforce policies at a more granular level, such as per-workload or per-service.
Identity Federation: SPIFFE supports identity federation, enabling different trust domains to interoperate securely. This is particularly beneficial in multi-cloud and hybrid environments where different parts of an organization, or third parties, operate their own trust domains.
Conclusion
While both ACME and SPIFFE are used for certificate management, they cater to different scenarios and can complement each other effectively. ACME is ideal for static environments where identifiers are pre-allocated, focusing on certificate lifecycle management for issuing, renewing, and revoking certificates. It excels in managing the lifecycle of certificates for web servers and other relatively static resources.
On the other hand, SPIFFE excels in dynamic, high-scale environments, emphasizing credential lifecycle management with flexible and robust authentication and authorization through dynamic identifiers and attestation. SPIFFE is designed for modern, cloud-native architectures where workloads are ephemeral, and identifiers need to be dynamically issued and managed.
By understanding these differences, you can leverage both ACME and SPIFFE to enhance your identity management strategy. Use ACME for managing certificates in static environments to ensure efficient lifecycle management. Deploy SPIFFE for dynamic environments to provide strong, attested identities for your workloads.
Creating a cohesive and effective roadmap requires more than just a list of tasks and deadlines; it requires a clear mission, vision, and strategy that aligns the entire organization. This alignment ensures that every team member understands the overarching goals and the path to achieving them, fostering a unified and focused approach to success.
Mission: The Foundation of Purpose
The mission statement serves as the foundation, articulating the core purpose of the organization. It answers why the organization exists and what it aims to achieve.
A strong mission focuses on the problem, providing clarity on the organization’s purpose.
Vision: The Aspirational Future
The vision statement paints an aspirational picture of the future the organization seeks to create within the next 2-5 years. It describes desired outcomes and the impact of the organization’s efforts. For example, envisioning customers leveraging advanced features to transform operations provides a clear goal.
While the mission addresses the ‘why,’ the vision focuses on the ‘what’—what success looks like in practical terms.
Strategy: The Path to Achievement
Strategy bridges the gap between the current state and the envisioned future. It outlines ‘how’ the organization will achieve its mission and vision, including:
Assessing the current environment, including market conditions and internal capabilities.
Using data and trends to inform strategic decisions.
Setting clear, measurable business outcomes.
Deciding where to focus resources and explaining why certain paths are chosen.
Roadmap: The Plan
The roadmap translates the strategic vision into a plan, providing an 18-24 month overview of key milestones. It breaks down big problems into manageable tasks, ensuring alignment with overall goals. A good roadmap is not just a schedule; it’s a strategic tool that shows how initiatives drive business success, allowing for flexibility and adaptation while staying true to the mission and vision.
For example:
Q1 2024: Launch Enhanced AI and Machine Learning Tools
Initiative: Develop and release new AI and machine learning tools designed to address specific industry needs, such as healthcare analytics and financial risk modeling.
Milestone: Achieve a 10% increase in customer adoption of AI services within three months of launch.
Q2 2024: Expand Data Security Solutions
Initiative: Introduce advanced data security features, including encryption and threat detection, to meet growing cybersecurity demands.
Milestone: Secure certifications for new security standards and onboard 50 new enterprise customers by the end of the quarter.
Q3 2024: Integrate Cloud Services with Major ERP Systems
Initiative: Develop seamless integrations with leading ERP systems to streamline operations for enterprise customers.
Milestone: Complete integration with three major ERP platforms and pilot with 10 key clients.
Q4 2024: Launch Global Data Centers Expansion
Initiative: Open new data centers in strategic locations worldwide to enhance global reach and reliability.
Milestone: Increase data center footprint by 20% and reduce latency by 15% for international clients.
Q1 2025: Roll Out Customer Success Programs
Initiative: Implement customer success programs aimed at improving user experience and satisfaction.
Milestone: Achieve a 15% increase in customer satisfaction scores and reduce churn rate by 10%.
Achieving Alignment and Focus
Aligning the mission, vision, and strategy, and clearly communicating them, helps maintain organizational focus. This alignment prevents reactive firefighting and directional resets that consume resources. By fostering proactive planning and execution, organizations can drive meaningful progress toward business goals.
Conclusion
Effective roadmaps and work plans stem from a well-aligned mission, vision, and strategy. This alignment provides a stable foundation, clear goals, and a practical path to achieve them. By maintaining this alignment, organizations can move from reactive to proactive operations, regain stakeholder trust, and achieve sustainable success.
Thanks to Michale Windsor for his advice on this and other topics over the years.
Building Effective, Evidence-Based Roadmaps for Business Success
In defining roadmaps, the primary goal is to deliver on product objectives that will generate revenue. These roadmaps are prioritized to achieve business goals as swiftly as possible. Typically, the prioritized items listed on roadmaps are broadly visible to customers but often depend on behind-the-scenes work that is not customer-visible, determined when building out work plans.
These work plans allocate time for roadmap items, customer obligations, and technical debt. For instance, a typical allocation might be 70% for roadmap items and their dependencies, 20% for customer obligations, and 10% for technical debt. In healthy organizations, these target allocations will ebb and flow but are largely adhered to, with the roadmap representing the largest commitment in the allocation.
When this is not the case, work plans are treated as roadmaps. In these situations, organizations are in a reactive position, where the only work that gets done is what has been committed to customers, and technical debt has gone unaddressed for so long that it is already causing harm to the business. These organizations are reactive and are always in a fight-or-flight mode.
Such reactive modes become a byproduct of the lack of structure in building the roadmap, the disconnect in the work supporting that roadmap, and a lack of recognition of the current situation regarding customer commitments. This, in turn, creates a situation where leadership loses faith in the organization’s ability to produce a compelling product roadmap.
This can be the beginning of a death spiral, where the expectation is that every item in the work plan is mapped back to customer revenue. This results in the items necessary to achieve forward growth business objectives being de-prioritized, with the committed sales funnel driving all engineering spend. The work plan and the management of it replace the accountability to build a business and associated engineering investment with a list of customer asks. As a consequence, the organization can never address technical debt or implement the kind of product design pivots needed to tackle issues like improving the size of deals by expanding its footprint within customers, reducing the cost of goods sold, shortening the time to close deals, or creating new parallel go-to-market strategies.
While there is no one-size-fits-all approach to product planning, the principles outlined here offer valuable lessons that can be adapted to various organizational contexts. To transition to a more proactive and accountable roadmap-driven model, organizations must commit to a disciplined approach where a significant portion of the effort is dedicated to developing a product that can deliver business outcomes, supported by a robust body of customer evidence.
This shift also involves ensuring alignment with internal stakeholders, such as product, engineering, and executive leadership. Achieving this alignment is key to preventing the cycle of reactive firefighting and directional resets that consume resources and hinder progress on business objectives.
By returning to first principles and building a compelling, evidence-based roadmap, organizations can, over time, move their culture from a reactive to a proactive operational mode, regain the trust of all participants, and as a result drive meaningful progress toward their business goals.
Historically, key management was seen as activities involving hardware security modules (HSMs), manual tasks, and audits. This approach was part of what we termed ‘responsible key management.’ However, HSMs were impractical for many use cases, and these manual tasks, typical of IT processes, were often poorly executed or never completed, frequently causing security incidents, outages, and unexpected work.
Simultaneously, as an industry, we began applying cryptography to nearly all communications and as a way to protect data at rest. This led to the adoption of cryptography as the method for authenticating hardware, machines, and workloads to apply access control to their activities. As a result, today, cryptography has become a fundamental component of every enterprise solution we depend on. This shift led us to attempt to apply legacy key management approaches at the enterprise scale. The increased use of cryptography within enterprises made it clear these legacy approaches ignored the majority of keys we relied on, so we took a tactical approach and created repositories to manage the sprawl of secrets. While a step forward, this approach also papered over the real problems with how we use, track, and manage keys.
It is time for us as an industry to start viewing cryptography and key management not just as a tax we must pay but as an investment. We need to manage these keys in an automated and scalable way that helps us manage risk in our businesses.
To do this, we need to start with a question: What are these keys, anyway? Broadly, I think of three categories of keys: long-lived asymmetric secrets like those associated with certificate authorities, long-lived shared secrets used for encryption and authentication, and modern-day asymmetric key credentials for users, devices, and workloads. The greatest growth in keys has been in the last category, so let’s focus on that for the purpose of this discussion.
Modern Credentials and Their Management
Modern-day asymmetric key-based credentials are not always “certificates,” but they generally bind some claim(s) to an asymmetric key pair. These credentials can be formatted as JSON, ASN.1, CBOR, TLVs, X.509, JWT, or some other encoding. They serve various purposes:
User Certificates: Issued to individual users to authenticate their identity within an organization, these certificates provide secure access to corporate resources, such as an SSH certificate used by developers to access production. They bind a user’s identity to a cryptographic key pair, ensuring only authorized individuals access sensitive information and systems.
Hardware Certificates: Assigned by manufacturers during production, these certificates uniquely identify hardware devices. They are often used to bootstrap the identity of machines or workloads, ensuring only authorized devices can access resources on your network.
Machine Certificates: Common in operational IT environments, these certificates authenticate servers associated with domains, IP addresses, or device identifiers. They are typically used with TLS and for network access use cases like 802.1x, IKE, and various VPNs.
Workload Certificates: In cloud and serverless environments, workload certificates perform access control close to the business logic to minimize security exposure and deliver on zero trust goals. These dynamic certificates often reflect both the underlying hardware and the workload running on it, acting like multi-factor authentication for devices. The frequent need to re-credential workloads makes issuing credentials mission-critical, as failure to do so can cause outages. This necessitates issuers in each failure domain (think of this as a cluster of clusters) hosting these workloads to ensure continuous operation.
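As a simplified illustration of what such a workload credential can look like on the wire, the Python sketch below uses the cryptography library to mint a short-lived X.509 certificate whose Subject Alternative Name carries a SPIFFE-style URI. It is self-signed purely to keep the example self-contained; in a real deployment the certificate would be issued by the workload identity provider and the identifier would come from attestation.

    import datetime
    from cryptography import x509
    from cryptography.x509.oid import NameOID
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import ec

    # Hypothetical identifier; a real one is assigned by the issuer based on attestation.
    workload_id = "spiffe://example.org/ns/payments/sa/api"

    key = ec.generate_private_key(ec.SECP256R1())
    name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "payments-api")])
    now = datetime.datetime.now(datetime.timezone.utc)

    cert = (
        x509.CertificateBuilder()
        .subject_name(name)
        .issuer_name(name)  # self-signed only for illustration
        .public_key(key.public_key())
        .serial_number(x509.random_serial_number())
        .not_valid_before(now)
        .not_valid_after(now + datetime.timedelta(hours=1))  # short-lived, reissued often
        .add_extension(
            x509.SubjectAlternativeName([x509.UniformResourceIdentifier(workload_id)]),
            critical=False,
        )
        .sign(key, hashes.SHA256())
    )

    san = cert.extensions.get_extension_for_class(x509.SubjectAlternativeName).value
    print(san.get_values_for_type(x509.UniformResourceIdentifier))

The short lifetime is the point: these credentials are meant to be reissued continuously rather than stored and shared.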
What we can take from this is that we have been approaching credentials incorrectly by treating them as a key management problem. This approach is akin to using password managers for hardware, machines, and workloads, whereas, for users, we have moved toward multi-factor authentication and non-password-based authenticators.
Towards Automated and Scalable Key Management
If password managers or key vaults are not the right solution for machine authentication, what is? The answer is simpler than it might seem. Just as with users, these cases require built-for-purpose Identity Providers (IDPs). This is especially true for workloads, which dynamically spin up and down, making durable identifiers impractical. An IDP becomes a security domain for a given deployment, ensuring that workloads are accessible only by appropriate resources. This setup limits attackers’ lateral movement, allows for nearly instant granting and removal of access, minimizes the impact of compromises, and enables easy federation between deployments—all while providing a central point for identity governance and ensuring the cryptographic keys associated with credentials are well-managed and protected.
Getting Started
Modernizing key management starts with measurement. Identify the most common types of keys in your secret vaults, typically workload-related credentials. Deploy a workload-specific IDP, such as those enabled via SPIFFE, to transition these credentials out of the secret manager. Over time, the secret manager will store static secrets like API keys for legacy systems, while dynamic credentials are managed appropriately.
Avoid using your secret manager as an IDP from the start, especially for new systems. Teams responsible for the operational burden of these systems usually support this change, as automated end-to-end credentialing of workloads is more agile, scalable, and secure. This results in fewer outages and security incidents related to secret managers and non-production-quality dependencies.
From this point, the process becomes a cycle of identifying where static secrets or long-lived credentials are funneled through your secret manager and moving them to built-for-purpose credential lifecycle management solutions.
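Measurement does not need to be elaborate. A sketch like the one below, run against a hypothetical export of your secret store’s inventory, is often enough to show which entries are really workload credentials in disguise and therefore candidates to move first:

    from collections import Counter

    # Hypothetical inventory export; in practice this would come from your secret
    # manager's listing or audit API.
    inventory = [
        {"path": "prod/payments/db-password", "kind": "static-password"},
        {"path": "prod/payments/svc-tls-key", "kind": "workload-credential"},
        {"path": "prod/reporting/svc-token", "kind": "workload-credential"},
        {"path": "legacy/erp/api-key", "kind": "static-api-key"},
    ]

    counts = Counter(entry["kind"] for entry in inventory)
    for kind, count in counts.most_common():
        print(f"{kind}: {count}")
    # Entries classified as workload credentials are the first candidates to move
    # to a built-for-purpose workload IDP.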
Multi-factor authentication for workloads
Adopting a purpose-built IDP workload solution is a good start, but keys can still be stolen or leaked. For machines and workloads, use hardware attestations. Built-in hardware authenticators, such as Trusted Platform Modules (TPMs), create and secure keys within the device, ensuring they never leave. TPMs also verify device integrity during boot-up, adding an extra layer of security. This combination provides stronger multi-factor authentication without the usability issues associated with similar patterns for user authentication.
Avoiding Common Mistakes
The most common mistake organizations make is applying existing systems to workload credential management problems without fully analyzing operational, scale, uptime, and security needs. For example, enterprise PKI teams might mandate using their existing CA infrastructure for managing workload credentials, assuming a certificate suffices. However, this often violates the principle of least privilege and struggles with dynamic identification required for workloads.
Existing credential issuance systems are designed for static, long-lived subjects, such as directory names, DNS names, or IP addresses, which don’t change frequently. In contrast, workload credentials may change every few minutes. Provisioning devices like network appliances before assigning durable identifiers adds to this challenge. New workload-based systems, like SPIFFE, assign identifiers based on runtime elements, preventing the same bad practices that led to secret sprawl and mismanaged key problems.
Reducing Reliance on Shared Secrets
Moving away from shared secrets won’t eliminate the need for secret vaults but will significantly reduce the problem’s scope. As systems modernize, password-based authenticators will be updated or deprecated. Over time, we will see fewer shared, long-lived secrets used for workload identity, driven by zero trust and least privilege principles.
At the same time, we can do much to improve overall key management practices in production systems. However, that’s a topic for another post.
Closing Note
The challenges and opportunities in modern key management are significant, but by leveraging innovative solutions and focusing on automation and scalability, we can make substantial progress. As we adopt built-for-purpose Identity Providers (IDPs) and hardware attestations, it’s important to have the right tools and frameworks in place to succeed.
I have been working with SPIRL, a company focused on making the right thing the easy thing for developers, operations, and compliance. I see firsthand how the right platform investments can simplify the creation and operation of production systems. SPIRL achieves this by authoring and adopting open standards for identity and building reliable, scalable infrastructure that provides greater visibility and control.
Even if you don’t use SPIRL, by focusing on these principles, organizations can better manage the complexities of modern workload-related key and credential management, ensuring greater productivity and security.
Over the last few decades, we seem to have gotten better at the micro aspects of security, such as formally verifying protocols and designing cryptographic algorithms, but we have worsened, or at least failed to keep up, at the macro aspects, such as building and managing effective, reproducible, risk-based security programs.
This can probably be attributed both to the scale of the systems we now depend on and, perhaps even more, to human factors. The quote, “Bureaucracy defends the status quo long past the time when the quo has lost its status,” is relevant here. Security organizations grow organically, not strategically, usually in response to a past failure or a newly recognized risk, which ultimately results in new teams that exist forever, increasing the overall load the security organization places on the business. The role of these teams typically expands over time to justify this load, transforming them into data-gathering organizations. This explains why enterprises have so many dashboards of data that fail to drive action. Organizations are overwhelmed and struggle to understand where their risks lie so they can effectively allocate their limited resources toward achieving security-positive outcomes.
Building on this foundation of understanding risk, there is the question of how we measure the success of security organizations in managing it; as the saying goes, “If you can’t measure it, you can’t manage it.” The issue is that organizations often turn these measurements into metrics of success, which seems rational on the surface. However, in practice, we encounter another adage: “When a measure becomes a target, it ceases to be a good measure,” something that is especially true in security. For example, we often measure a security program on responding to an incident well, shipping new capabilities, or passing audits, but this is an incomplete picture. Using audits as an example, they are designed to fit every organization, which takes us to the saying, “If you try to please all, you please none.” In the case of security, this approach guarantees you are missing major issues or, at an absolute minimum, prioritizing activity over effectiveness. To make this more concrete, it incentivizes successful audits (despite the incomplete picture they represent) and seeing the number of identified CVEs go down (despite these figures almost always being false positives), which in turn misleads organizations into believing they are more secure than they actually are.
Moving from metrics alone to a combination of metrics, technology, and continual improvement can help with some of the scale problems above. For example, AI shows promise for triaging issues and accelerating the review of low-level issues that fit neatly into a small context window, but at the same time it can give a false sense of security. A second-order effect of these scale tools is the de-skilling of their creators and operators; for example, I am always blown away by how little modern computer science graduates, even from the most prestigious schools, understand about the way the systems they write code on operate. The human problems, on the other hand, are something we cannot simply automate; the best we can do is rethink the way we build organizations so that empowerment and accountability are woven into how the organization operates. This will require ensuring those who take on that accountability truly understand how systems work and build their teams with a culture of critical, continual improvement. We must also make continual education a key part of how we build our organizations.
Finally, to support this from a design standpoint, we also need to consider how we design systems. The simplest approach is to design them in the most straightforward way possible, but even then we need to consider the total operational nature of the system as part of the design. In user experience, we design for key users; early on, we talk about the concept of user stories. In system design, we often design first and only later figure out how the system will be operated and managed. We need to incorporate operational security into our designs. Do our systems emit the information (metrics and contextualized logs) required to monitor them, and do we provide the tools needed to do this detection? If not, how are their users to know they are operating securely? For example, do they enable monitoring for threats like token forgery over time? We must make the systems we ship less dependent on the humans involved in their operation, recognizing that they are generalists, not specialists, and give them simple answers to these operational questions if we want to help them achieve secure outcomes.
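To make that operational point concrete, here is a toy Python sketch of the kind of signal a system could emit by design: token issuance events are recorded with context, validations are checked against them, and a token the issuer never minted surfaces as a possible forgery instead of going unnoticed. The event fields and in-memory store are illustrative, not any product’s telemetry.

    import json
    import logging

    logging.basicConfig(level=logging.INFO, format="%(message)s")
    log = logging.getLogger("token-audit")

    issued_token_ids = set()

    def record_issuance(token_id, subject, issuer):
        """Emit a contextualized issuance event and remember the token ID."""
        issued_token_ids.add(token_id)
        log.info(json.dumps({"event": "token_issued", "token_id": token_id,
                             "subject": subject, "issuer": issuer}))

    def record_validation(token_id, relying_party):
        """Emit a validation event; flag tokens the issuer has no record of minting."""
        known = token_id in issued_token_ids
        log.info(json.dumps({"event": "token_validated", "token_id": token_id,
                             "relying_party": relying_party, "known_issuance": known}))
        if not known:
            log.warning(json.dumps({"event": "possible_token_forgery",
                                    "token_id": token_id,
                                    "relying_party": relying_party}))

    record_issuance("tok-123", subject="user@example.com", issuer="idp.example.com")
    record_validation("tok-123", relying_party="mail.example.com")
    record_validation("tok-999", relying_party="mail.example.com")  # never issued: alert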
In conclusion, organizations need to look at the technological and human aspects of their security programs, as well as their technology choices, continuously and critically. This will almost certainly involve rethinking the metrics they use to drive security efforts, while also building a workplace culture centered on empowerment, accountability, and continuous improvement, and fundamentally integrating the security lifecycle of systems into the design process from day one. By adopting these strategic pillars, organizations can build more resilient, effective, and adaptable security programs that are equipped to meet the challenges of today’s dynamic environment.
I started my career at Microsoft in the 90s, initially working on obscure third-party networking issues, printing, and later Internet Explorer. Back then, even though I had gotten into computers through what today would probably be categorized as security research, it would have been nearly impossible in those days to find someone who wanted to hire me for those skills. I left the company a few years later and found my first job in computer security and never looked back.
I came back to Microsoft in 2000, but this time I was working on authentication and cryptography. This was just a few years before the infamous security standdown that was kicked off by Bill Gates’ Trustworthy Computing memo. This gave me a firsthand view into what led to that pivotal moment and how it evolved afterward. The work that was done during the subsequent years changed the way the industry looked at building secure software.
The thing is, at the same time, the concepts of third-party operated applications (SaaS) and shared computing platforms (Cloud Computing) were gaining traction. The adoption of these concepts required us to rethink how we build secure software for these new use cases and environments. For example, this shift introduced the concepts of massive multi-tenancy and operational shared fate between customers and their providers and made shipping updates much easier on a large scale. This accelerated rate of change also drove the need to rethink how we manage a security program, as the approaches used by the traditional software business often did not apply in this fast-paced world. My initial exposure to this problem came from my last role at Microsoft, where I was responsible for security engineering for the Advertising business.
The company had not yet defined mature approaches to securing online services, which created the opportunity for us to find similar but different models that could fit the realities of these new environments, realities with both positive aspects (agility to remediate) and negative ones (scale and speed), and, through that, to try to build a security program that could work in this new reality.
I share that context to give a bit of color to the bias and background I bring to the current situation Microsoft finds itself in. Having lived through what was surely the world’s single largest investment in making software and services secure to that point, and having spent decades working in security, I have had the chance to see several cycles in the way we look at building systems.
A New Chapter Unfolds
All things in life have natural cycles, and the same is true for how the industry views security. Organizations ebb and flow as a result of market changes, leadership changes, and evolving customer demands. In the case of security, the false idea that it is a destination, or a barrier to delivering on business objectives, also factors into these cycles.
As a result, it’s no surprise that over the following decade and a half we saw Microsoft lessen its commitment to security – especially in the fast-moving and growing opportunity for cloud services. As an outsider looking in, it felt like they lost their commitment around the time they began viewing security as a business rather than as the way you keep your promises to customers. At some point it felt like every month you would see outages related to mishandling the basics, with the same types of issues, for example multi-tenancy violations, recurring with increasing frequency, one right after another.
This increase in basic security issues was paired with poor handling of incidents, which is why it was no surprise to see the incident known as Storm-0558 come about. As soon as this incident became public, it was clear what had happened: the organization adopted the most convenient practices to ship and operate, and under-invested in the most basic lessons of the preceding two decades, in a trade-off that externalized the consequences of those decisions to their customers.
Microsoft had no choice but to respond in some way, so three months after the issue became public they announced the Secure Future Initiative, which can be summarized as:
Applying AI to Scale Security
Using More Secure Defaults
Rolling Out Zero Trust Principles
Adopting Better Key Management
Ensuring Consistency in Incident Response
Advocating for Broader Security Investments
This was lauded by some as the next Trustworthy Computing initiative, but on the surface it’s a far cry from the kind of investment made during those days. To me it sounds more like a mix of how Microsoft intends to meet the CISA Secure by Design initiative and how they think they need to respond to the Storm-0558 incident. There is always a question of messaging versus reality, so I personally held, and still hold, hope that this was the first organizational sign of an awakening that could lead to a similar level of investment.
Shortly after the Storm-0558 incident, I appeared on “Security Conversations” with Ryan Naraine. We discussed how the situation might have unfolded and identified the root causes—my answer was lack of security leadership. Therefore, it was no surprise to me to see that when the CSRB came out, the reviewers reached the same conclusion.
“Microsoft’s security culture was inadequate and requires an overhaul, particularly in light of the company’s centrality in the technology ecosystem and the level of trust customers place in the company to protect their data and operations.”
Despite these challenges, it’s important to recognize that not all teams within Microsoft have been equally impacted by these systemic issues. As William Gibson famously stated, ‘The future is already here — it’s just not very evenly distributed.’ This is evident within Microsoft, where the Windows team, for example, appears to have continued to do well relative to its peers.
Beyond Metrics to Meaningful Reform
The other day Satya, someone who did what many thought was impossible by turning Microsoft around from the company it became during the Ballmer years, wrote an internal memo that, amongst other things, stated:
“If you’re faced with the tradeoff between security and another priority, your answer is clear: Do security.”
In this memo, he also stated:
“In addition, we will instill accountability by basing part of the compensation of the senior leadership team on our progress towards meeting our security plans and milestones.”
At the same time, Charlie Bell released more information on the intended implementation of the Microsoft Secure Future Initiative, which puts some more meat on what the initial announcement promised and how they are expanding it as a result of the CSRB findings, the most impactful organizational change probably being the decision to add deputy CISOs to product teams.
So the good news is that this does signal that Microsoft has heard the message from the CSRB. These are good first steps in addressing the cultural issues that have contributed to Microsoft’s broad decline as a leader in security over the last several decades.
The question then becomes: does the executive leadership under Satya understand how their personal choices, organizational structure, approach to culture, approach to staffing, and overall business management decisions have contributed to the current situation? In my experience, the types of changes needed to achieve the transformational shifts required to address security neglect often necessitate leadership changes. Merely issuing a strong directive from the CEO and allocating additional budget is seldom enough to materially create the needed changes to a company’s approach to security.
What concerns me is the wording of the statement about tying compensation to security. What it actually does is link compensation to progress in meeting their “security plans and milestones”.
So the question becomes, do those security plans and milestones manifest into the technical changes and cultural changes needed to address the problem and the hole they have dug for themselves?
If we look at the work items called out in the Secure Future Initiative as the definition of what they believe their problems are, I have some doubts about the value they will realize by executing on them.
Bridging Visions and Realities
If we look at the CSRB report and classify the issues identified into five categories, we see that the largest share of the identified issues related to design decisions.
Category (Proportion): Example
Security Design Issues (40%): Inadequate cryptographic key management, failure to detect forged tokens
Incident Response (20%): Delays in updating the public on the true nature of incidents, slow response to key compromise
Operational Issues (20%): Failure to rotate keys automatically, using aging keys, allowing consumer keys to access enterprise data
Vulnerability Management (10%): Lack of controls to alert for aging keys, not detecting unauthorized token use
Risk Management (10%): Inadequate security practices compared to other CSPs, not having a detection system for forged tokens
Key:
Security Design Issues: This includes fundamental flaws in how security measures were architected.
Incident Response: Refers to the overall handling and transparency of the incident, including the timeliness and accuracy of public communications.
Operational Issues: These are failures in the operational handling of security mechanisms.
Vulnerability Management: Concerns the lack of proactive measures to detect and mitigate vulnerabilities.
Risk Management: Describes the overall approach to assessing and managing risks, highlighting a lack of comparable security controls relative to industry standards.
If we compare those issues with the areas of investment outlined above in the Secure Future Initiative announcement, it’s not clear to me that they would have made a meaningful dent in the root causes of the identified issues, and more generally they don’t seem to look at the larger systemic issues Microsoft is experiencing at all.
Let’s just take the plan to use AI to help scale their security program as an example: certainly a worthwhile initiative, but the root cause here was a design issue. Today’s AI systems are good at automating the tasks we understand how to do well, and even then they struggle, and that’s not even touching on the more nuanced issue of “design.”
For example, there is a great paper from Dan Boneh and his students showing that code from solutions like OpenAI’s Codex may contribute to the creation of less secure code. Another research effort focused on GitHub Copilot reported similar findings.
This doesn’t mean that this technology isn’t promising or that it can’t help manage security issues in the massive software systems we rely on today. However, it’s unlikely to significantly impact the types of issues currently seen at Microsoft. That’s why the CSRB has emphasized the need for a cultural overhaul in how Microsoft approaches security organizationally. Satya Nadella’s message about prioritizing security is a step in the right direction, and Charlie Bell’s blog post does outline a systemization of how they will go about that, but meaningful cultural change and making Microsoft a leader in security again will require much more than a blog post and incentives to execute in a timely manner.
Conclusion
Microsoft was once known for its poor track record in building secure software and services; it made huge investments and became a leader, and then, over time, it lost its edge. The Secure Future Initiative marks a step forward, as does the recent memo from Satya Nadella prioritizing security above all else. However, true progress will depend on Microsoft’s ability to roll out organizational changes and rebuild a culture that prioritizes security, not just meeting milestones.
The good news is that they have the talent, the resources, and still some of the muscle memory on how to get this done at scale. If Satya can turn around the company from those ailing Ballmer years, I have faith he can address this issue too.
The other day Microsoft announced something it calls Zero Trust DNS. At a high level, it is leveraging clients’ underlying name resolution capabilities to establish and enforce security controls below the application.
In design, it is quite similar to what we did in Windows Server 2008 with Network Access Protection (NAP), a now-deprecated solution that, if released today, would be considered some flavor of “Zero Trust” network access control.
NAP supported several different enforcement approaches for network policy. One of the less secure methods assessed a client’s posture; if found to be non-compliant with organizational requirements, the DHCP server would assign a restricted IP address configuration. The most secure approach relied on certificate-based IPSEC and IPv6. Healthy clients were provisioned with what we called a health certificate. All participating devices would then communicate only with IPSEC-authenticated traffic and drop the rest.
If Zero Trust DNS had been released in 2008, it would have been another enforcement mode for Network Access Protection. It operates very similarly, essentially functioning on the basis that:
The enterprise controls the endpoint and, through group policy, dictates DNS client behavior and network egress policy,
Mutual TLS and DNS over HTTPS (DoH) are used to authenticate clients, and
The DNS server is transformed into a policy server for network access (a simplified sketch follows).
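A rough sketch of the policy-server idea, reduced to its decision logic: once the client has been authenticated (by mutual TLS in the actual design), the resolver consults per-client policy before answering. The client identities and policy entries below are hypothetical, and this is in no way Microsoft’s implementation, just an illustration of the pattern.

    # Hypothetical egress policy: authenticated client identity -> resolvable suffixes.
    EGRESS_POLICY = {
        "laptop-0042.corp.example.com": ("corp.example.com", "sharepoint.com"),
        "build-agent-7.corp.example.com": ("corp.example.com", "packages.example.org"),
    }

    def resolve_decision(client_identity: str, qname: str) -> str:
        """Return 'resolve' if policy allows the lookup, otherwise 'refuse'."""
        allowed = EGRESS_POLICY.get(client_identity, ())
        if any(qname == s or qname.endswith("." + s) for s in allowed):
            return "resolve"  # answer normally; egress filtering admits the returned IPs
        return "refuse"       # deny and log; egress filtering blocks direct connections

    print(resolve_decision("laptop-0042.corp.example.com", "teams.sharepoint.com"))    # resolve
    print(resolve_decision("laptop-0042.corp.example.com", "exfil.attacker.example"))  # refuse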
I was always disappointed to see NAP get deprecated, especially the 802.1X and IPSEC-based enforcement models. Don’t get me wrong, there were many things we should have done differently, but the value of this kind of enforcement and visibility in an enterprise is immense. This is why, years later, the patterns in NAP were reborn in solutions like BeyondCorp and the myriad of “Zero Trust” solutions we see today.
So why might this Zero Trust DNS be interesting? One of the more common concerns I have heard from large enterprises is that the mass adoption of encrypted communications has made it hard for them to manage the security posture of their environment. This is because many of the controls they have historically relied on for security were designed around monitoring cleartext traffic.
This is how we ended up with abominations like Enterprise Transport Security which intentionally weakens the TLS 1.3 protocol to enable attackers—erm, enterprises—to continue decrypting traffic.
One of the theses of this Zero Trust DNS solution appears to be that by turning DNS into what we used to call a Policy Enforcement Point for the network, enterprises get some of that visibility back. While they do not get cleartext traffic, they do get to reliably control and audit what domain names you resolve. When you combine that with egress network filtering, it has the potential to create a closed loop where an enterprise can have some confidence about where traffic is going and when. While I would not want my ISP to do any of this, I think it’s quite reasonable for an enterprise to do so; it’s their machine, their data, and their traffic. It also has the potential to make lateral movement within a network harder when a compromise takes place, and maybe, in some cases, even make exfiltration harder.
Like all solutions that try to deliver network isolation properties, the sticking point comes back to how you create useful policies that reduce your risk but still let work happen as usual. Basing the rules on high-level concepts like a DNS name should make this easier than, for example, IPSEC-based isolation models, but it still won’t be trivial to manage. This approach will still have all of those challenges, but that is true of all network segmentation approaches.
What I appreciate most about this is its potential to reduce an organization’s perceived need to deploy MiTM solutions. From a security protocol design perspective, what is happening here is a trade-off between metadata leakage and confidentiality. The MiTM solutions in use today cause numerous problems; they hinder the adoption of more secure protocol variants and objectively reduce enterprise security in at least one dimension. They also make rolling out new versions of TLS and more secure cryptography that much harder. Therefore, in my opinion, this is likely a good trade-off for some organizations.
To be clear, I do not believe that host-based logging of endpoint communications will lead all enterprises to abandon these MiTM practices. For example, some MiTM use cases focus on the intractable problems of Data Leak Protection, network traffic optimization, or single sign-on and privilege access management through protocol-level manipulation. These solutions clearly require cleartext access, and name-based access control and logging won’t be enough to persuade enterprises that rely on these technologies to move away from MiTM. However, there are some use cases where it might.
So, is this a good change or a bad change? I would say on average it’s probably good, and with some more investment from Microsoft, it could be a reasonable pattern to adopt for more granular network segmentation while giving enterprises more visibility in this increasingly encrypted world without needing to break TLS and other encryption schemes.
Have you found a use case where you think DANE and DNSSEC might be helpful? For example, the discovery of some configuration associated with a domain? A practically useful DNSSEC deployment requires individual domains (e.g., example.com) to adopt DNSSEC and relevant clients to use a fully validating DNSSEC resolver, neither of which has happened yet at any reasonable scale, so maybe consider using certificates to sign the values you place in DNS instead.
SXG certificates are, in essence, signing certificates tied to a domain. While you could use a regular TLS certificate, for today’s post let’s assume that SXG certificates are the path you chose.
You can enroll for SXG certificates for free through Google Trust Services. This would allow you to benefit from signed data in DNS that is verifiable and deployable today, not only once DNSSEC gets broad deployment.
Assuming a short certificate chain, since DNS does have size restrictions, this could be done as simply as:
Get an SXG certificate that you can use for signing….
In practice, due to DNS size constraints, you would likely use a simpler signature format, such as encoding just the signing certificate and the signature in Base64 as a type-length-value representation; with a 256-bit ECC certificate and its signature, this record would total around 1152 bytes, which would comfortably fit inside a DNS TXT record thanks to EDNS, which supports messages of up to 4096 bytes.
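Under those assumptions, producing such a record might look roughly like the Python sketch below, using the cryptography library. The tag numbers, file names, and the configuration value are illustrative, not a defined format.

    import base64
    import struct
    from cryptography import x509
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import ec

    # Hypothetical inputs: the configuration value to publish, plus a P-256 signing
    # key and its certificate (SXG or otherwise) loaded from disk.
    config_value = b"v=example-config1; endpoint=https://api.example.com"
    signing_key = serialization.load_pem_private_key(
        open("signing-key.pem", "rb").read(), password=None)
    signing_cert = x509.load_pem_x509_certificate(open("signing-cert.pem", "rb").read())

    def tlv(tag: int, value: bytes) -> bytes:
        """Encode a value as a simple type-length-value triple (1-byte tag, 2-byte length)."""
        return struct.pack("!BH", tag, len(value)) + value

    signature = signing_key.sign(config_value, ec.ECDSA(hashes.SHA256()))
    payload = tlv(1, signing_cert.public_bytes(serialization.Encoding.DER)) + tlv(2, signature)
    txt_value = base64.b64encode(payload).decode("ascii")

    # EDNS allows messages up to 4096 bytes; individual TXT character-strings are
    # capped at 255 bytes, so the value is split into multiple strings when published.
    print(len(txt_value), txt_value[:60] + "...")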
This does not provide the authenticated denial-of-existence property that DNSSEC has (which exists to address downgrade attacks), meaning relying parties could not detect if the record was removed, nor does it provide security to DNS overall; but if we are just talking about configuration data, it might be a viable approach to consider.
On the other hand, this form of signing a TXT record with a certificate is never going to take down your domain, and DNSSEC adoption cannot say that!
As the old saying goes, “You can take the boy out of the farm, but you can’t take the farm out of the boy.” Although I was raised in metro Seattle, my father grew up on a farm in Eastern Washington, in the city of Walla Walla. We made regular trips there during my childhood, especially when my great-grandmother lived there by herself. These visits were more than just familial obligations; they were my introduction to values like hard work, family, and the joy of being close to the earth—values that have profoundly influenced who I am today.
I also fondly recall visits and weekend trips to my uncle’s, where my cousins and I would ride in the bed of his Chevy 3100, sliding around as we drove down the road, laughing and jostling around — back when the world was less concerned about safety regulations. Those moments of freedom are treasures I still carry.
This sense of nostalgia may explain why, after a career in information security, I felt compelled to restore several late 19th and early 20th-century safes. A few years ago, I embarked on a project to restore a Dodge Power Wagon, which encapsulates the strength, reliability, and spirit of those farmstead adventures.
Power Wagons Origin Story
The Dodge Power Wagon earned its legendary status on American farmlands shortly after World War II. Returning servicemen recognized the potential of the Dodge WCs they had used in the war. These vehicles could navigate the rugged farm terrain much like the battlefields they’d left behind. Equipped with a Power Take Off (PTO) and winch, the Dodge WC was not just a means of transport; it transformed into a tool that could till the fields or haul away a fallen tree. Recognizing its demand, Dodge released a civilian version—the Dodge Power Wagon.
The Power Wagon was the first mass-produced civilian 4×4 vehicle, ultimately symbolizing an era when durability and utility were paramount in vehicle design. Its introduction led to the widespread adoption of 4×4 capabilities by nearly every truck manufacturer.
Power Wagons also played a vital role in developing early infrastructure, aiding transportation and communication networks for rail and telephone companies. Coachbuilders would modify these trucks by combining two Power Wagons to create multi-door vehicles that could transport crews to remote or difficult-to-access sites. It wasn’t until International Harvester introduced the Travelette in 1957 that a production truck with three or more doors became available.
Anyone who has ever done a high-end restoration of a vehicle will tell you it takes way longer than you expect, and my project is no different. While we are getting close to the end of the project (after all, it runs, drives, stops, has been put back together, has been painted, and is now getting its interior done), I would surely be wrong with whatever guess I gave.
My Power Wagon Restoration
Restoring this piece of history isn’t just about reviving a classic vehicle; in a way, it’s a tribute to my father, my family’s legacy. It’s a pilgrimage back to my roots, a way to share my family’s story with my children and, eventually, my grandchildren.