Monthly Archives: May 2024

Rethinking Security in Complex Systems

Over the last few decades, we seem to have gotten better at the micro aspects of security, such as formally verifying protocols and designing cryptographic algorithms, but have worsened, or at least failed to keep up at the macro aspects, such as building and managing effective, reproducible, risk-based security programs.

This can probably be attributed to both the scale of the systems we now depend on and perhaps even more to human factors. The quote, “Bureaucracy defends the status quo long past the time when the quo has lost its status,” is relevant here. Security organizations grow organically, not strategically, usually in response to a past failure or a newly recognized risk, which ultimately results in new teams that exist forever increasing the overall load of the security organization on the business. The role of these organizations typically expands over time to justify this load transforming them into data-gathering organizations. This explains why enterprises have so many dashboards of data that fail to drive action. Organizations are overwhelmed and struggle to understand where their risks lie so they can effectively allocate their limited resources toward achieving security-positive outcomes.

Building on this foundation of understanding risks, then there is the question of how we manage the success of security organizations in managing risk, as the saying goes, “If you can’t measure it, you can’t manage it.” The issue is that organizations often turn these measurements into metrics of success, which seems rational on the surface. However, in practice, we encounter another adage: “When a measure becomes a target, it ceases to be a good measure,” something that is especially true in security. For example, we often measure a security program for responding to an incident well, shipping new capabilities or passing audits, but this is an incomplete picture, using audits as an example, they are designed to fit every organization, which takes us to the saying, “If you try to please all, you please none.” In the case of security, you are guaranteed to be missing major issues or, at an absolute minimum, prioritizing activity over effectiveness with this approach. To make this more concrete this incentivizes successful audits (despite the incomplete picture this represents), and seeing the number of CVEs identified go down (despite these figures almost always being false positives) which in turn mislead organizations into believing they are more secure than they actually are.

Transitioning from metrics a combination of metrics, technology, and continual improvement, can help with some of the scale problems above, for example, AI shows promise for triaging issues and accelerating the review of low-level issues that fit neatly into a small context window but at the same time give a false sense of security. The human problems, on the other hand, are something we cannot simply automate; the best we can do is to rethink the way we build organizations so that empowerment and accountability are woven into how the organization operates. This will require ensuring those who take on that accountability truly understand how systems work and build their teams with a culture of critical, continual improvement. The second-order effects of these scale tools is de-skilling of the creators and operators of the tools — for example, I am always blown away how little modern computer science graduates understand of the way the systems they write code on operate — even from the most prestigious schools. We must also make continual education a key part of how we build our organizations.

Finally, focusing on system design, to support this from a design standpoint, we also need to consider how we design systems. The simplest approach is to design these systems in the most straightforward way possible, but even then, we need to consider the total operational nature of the system as part of the design. In user experience, we design for key users; early on, we talk about the concept of user stories. In system design, we often design and then figure out how that system will be operated and managed. We need to incorporate operational security into our designs. Do they emit the information (metrics and contextualized logs) required to monitor, do we provide the tools to use them to do this detection? If not how are their users to know they are operating securely? For example, do they enable monitoring for threats like token forgery over time? We must make the systems we ship less dependent on the human being involved in their operation, recognizing that they are generalists, not specialists, and give them simple answers to these operational issues if we want to help them achieve secure outcomes.

In conclusion, organizations need to look at the technological, and human aspects of their security program as well as their technology choices continuously and critically. This will almost certainly involve rethinking the metrics they use to drive security efforts, while also building a workplace culture centered on empowerment, accountability, continuous improvement and fundamentally integrating the security lifecycle of systems into the design process from day one. By adopting these strategic pillars, organizations can build more resilient, effective, and adaptable security programs that are equipped to meet the challenges of today’s dynamic environment.

Navigating Security and Innovation

I started my career at Microsoft in the 90s, initially working on obscure third-party networking issues, printing, and later Internet Explorer. Back then, even though I had gotten into computers through what today would probably be categorized as security research, it would have been nearly impossible in those days to find someone who wanted to hire me for those skills. I left the company a few years later and found my first job in computer security and never looked back.

I came back to Microsoft in 2000, but this time I was working on authentication and cryptography. This was just a few years before the infamous security standdown that was kicked off by the Bill Gates’ Trustworthy Computing Memo. This gave me a firsthand view into what led to that pivotal moment and how it evolved afterward. The work that was done during the subsequent years changed the way the industry looked at building secure software.

The thing is, at the same time, the concepts of third-party operated applications (SaaS) and shared computing platforms (Cloud Computing) were gaining traction. The adoption of these concepts required us to rethink how we build secure software for these new use cases and environments. For example, this shift introduced the concepts of massive multi-tenancy and operational shared fate between customers and their providers and made shipping updates much easier on a large scale. This accelerated rate of change also drove the need to rethink how we manage a security program, as the approaches used by the traditional software business often did not apply in this fast-paced world. My initial exposure to this problem came from my last role at Microsoft, where I was responsible for security engineering for the Advertising business.

The company had not defined mature approaches to how to secure online services yet which created the opportunity for us to find ways to use similar but different models that could fit into the realities of these new environments, which had both positive impacts (agility to remediate) and negative impacts (scale and speed) and through that, try to build a security program that could work in this new reality.

I share that context to give a bit of color to the bias and background I bring to the current situation Microsoft finds itself in. Having lived through what was surely the world’s single largest investment in making software and services secure to that point, and having spent decades working in security, I have had the chance to see several cycles in the way we look at building systems.

A New Chapter Unfolds

All things in life have natural cycles, and the same is true for how the industry views security. Organizations ebb and flow as a result of market changes, leadership changes, and as customer demands evolve. In the case of security, there is also the false idea that it is a destination or a barrier to delivering on business objectives that factors into these cycles.

As a result, it’s no surprise that over the following decade and a half, we saw Microsoft lessen its commitment to security – especially in the fast-moving and growing opportunity for cloud services. As an outsider looking in, it felt like they lost their commitment around the time they began viewing security as a business rather than the way you keep your promises to customers. At some point it felt like every month, you would see outages related to mishandling basics, with increased frequency of the same type of issues, for example, multi-tenancy violations, one right after another.

This increase in basic security issues was paired with poor handling of incidents which is why it was no surprise to see the incident known as STORM-0558 come about. As soon as this incident became public it was clear what had happened, the organization adopted the most convenient practices to ship and operate and under-invested in the most basic lessons of the preceding two decades in a trade-off that externalized the consequences of those decisions to their customers.

Microsoft had no choice but to respond in some way, so three months after the issue became public they announced the Secure Future Initiative which can be summarized as:

  1. Apply AI to Scale Security
  2. Using More Secure Defaults
  3. Rolling Out Zero Trust Principles 
  4. Adopting Better Key Management
  5. Consistency in Incident Response
  6. Advocating for Broader Security Investments

This was lauded by some as the next Trustworthy Computing initiative, but on the surface, that’s a far cry from the kind of investment made during those days. To me it sounds more like a mix of how Microsoft intends to meet the CISA Secure By Design initiative and how they think they need to respond to the STORM-0558 incident. There is always a question of messaging versus reality, so I personally held, and still do, hope that this was the first organizational sign of an awakening that could lead to a similar level of investment.

Shortly after the Storm-0558 incident, I appeared on “Security Conversations” with Ryan Naraine. We discussed how the situation might have unfolded and identified the root causes—my answer was lack of security leadership. Therefore, it was no surprise to me to see that when the CSRB came out, the reviewers reached the same conclusion.

“Microsoft’s security culture was inadequate and requires an overhaul, particularly in light of the company’s centrality in the technology ecosystem and the level of trust customers place in the company to protect their data and operations.” 

Despite these challenges, it’s important to recognize that not all teams within Microsoft have been equally impacted by these systemic issues. As William Gibson famously stated, ‘The future is already here — it’s just not very evenly distributed.’ This is evident within Microsoft, where the Windows team, for example, appears to have continued to do well relative to its peers.

Beyond Metrics to Meaningful Reform

The other day Satya, someone who did what many thought was impossible by turning Microsoft around from the company it became during the Balmer years, wrote an internal memo, that amongst other things stated: 

“If you’re faced with the tradeoff between security and another priority, your answer is clear: Do security.”

In this memo, he also stated:

“In addition, we will instill accountability by basing part of the compensation of the senior leadership team on our progress towards meeting our security plans and milestones.”

At the same time, Charlie Bell released more information on the intended implementation of the Microsoft Secure Future Initiative which puts some more meat on what the initial announcement promised, and how they are expanding it as a result of the CSRB findings, the most impactful organizational change probably being the decision to add deputy CISOs in product teams.

So the good news is that this does signal that Microsoft has heard the message from the CSRB. These are good first steps in addressing the cultural issues that have contributed to Microsoft’s broad decline as a leader in security over the last several decades.

The question then becomes: does the executive leadership under Satya understand how their personal choices, organizational structure, approach to culture, approach to staffing, and overall business management decisions have contributed to the current situation? In my experience, the types of changes needed to achieve the transformational shifts required to address security neglect often necessitate leadership changes. Merely issuing a strong directive from the CEO and allocating additional budget is seldom enough to materially create the needed changes to a company’s approach to security.

What concerns me is the wording of the statement about tying compensation to security. What it actually does is link compensation to progress in meeting their “security plans and milestones”.

So the question becomes, do those security plans and milestones manifest into the technical changes and cultural changes needed to address the problem and the hole they have dug for themselves?

If we look at the value they will realize by executing on work items called out in the Security First Initiative as the definition of what they believe their problems are I have some doubts.

Bridging Visions and Realities

If we look at the CSRB report and classify the issues identified into 5 categories we see that the majority of the identified issues were related to design decisions. 

CategoryExampleProportion
Security Design IssuesInadequate cryptographic key management, failure to detect forged tokens40%
Incident ResponseDelays in updating the public on the true nature of incidents, slow response to key compromise20%
Operational IssuesFailure to rotate keys automatically, using aging keys, allowing consumer keys to access enterprise data20%
Vulnerability ManagementLack of controls to alert for aging keys, not detecting unauthorized token use10%
Risk ManagementInadequate security practices compared to other CSPs, not having a detection system for forged tokens10%

Key:

  • Security Design Issues: This includes fundamental flaws in how security measures were architected.
  • Incident Response: Refers to the overall handling and transparency of the incident, including the timeliness and accuracy of public communications.
  • Operational Issues: These are failures in the operational handling of security mechanisms.
  • Vulnerability Management: Concerns the lack of proactive measures to detect and mitigate vulnerabilities.
  • Risk Management: Describes the overall approach to assessing and managing risks, highlighting a lack of comparable security controls relative to industry standards.

If we compare those issues with the areas of investment outlined above in the Secure Futures Investment announcement it’s not clear to me that they would have made a meaningful dent to the root cause of the identified issues, and more generally they don’t seem to look at the larger systemic issues Microsoft is experiencing at all.

Let’s just take the plan to use AI to help scale out their security program as an example, certainly a worthwhile initiative but the root cause here was a design issue. Today’s AI systems are good at automating the tasks we understand how to do well and even then they struggle at that, and that’s not even touching on the more nuanced issue of “design”.

For example, there is a great paper from Dan Boneh and his students that shows that code from solutions like  OpenAI’s Codex, may contribute to the creation of less secure code. Another research effort focused on GitHub CoPilot reported similar findings.

This doesn’t mean that this technology isn’t promising or that it can’t help manage security issues in the massive software systems we rely on today. However, it’s unlikely to significantly impact the types of issues currently seen at Microsoft. That’s why the CSRB has emphasized the need for a cultural overhaul in how Microsoft approaches security organizationally. Satya Nadella’s message about prioritizing security is a step in the right direction, and the Charlie blog post does outline a systemization of how they will go about that but meaningful cultural changes and making Microsoft a leader in security again will require much more than a blog post and incentives to execute in a timely manner.

Conclusion

Microsoft was once known for its poor track record in building secure software and services, they made huge investments and became a leader, over time, they lost their edge. The Secure Future Initiative marks a step forward, as does the recent memo from Satya Nadella prioritizing security above all else. However, true progress will depend on Microsoft’s ability to roll out organizational changes and rebuild a culture that prioritizes security not just meeting milestones.

The good news is that they have the talent, the resources, and still some of the muscle memory on how to get this done at scale. If Satya can turn around the company from those ailing Ballmer years, I have faith he can address this issue too.

The Rebirth of Network Access Protection with Microsoft’s Zero Trust DNS

The other day Microsoft announced something it calls Zero Trust DNS. At a high level, it is leveraging clients’ underlying name resolution capabilities to establish and enforce security controls below the application.

In design, it is quite similar to what we did in Windows 2008 and Network Access Protection (NAP), a now deprecated solution that if released today would be considered some flavor of “Zero Trust” network access control.

NAP supported several different enforcement approaches for network policy. One of the less secure methods assessed a client’s posture; if found to be non-compliant with organizational requirements, the DHCP server would assign a restricted IP address configuration. The most secure approach relied on certificate-based IPSEC and IPv6. Healthy clients were provisioned with what we called a health certificate. All participating devices would then communicate only with IPSEC-authenticated traffic and drop the rest.

If ZeroTrust DNS had been released in 2008, it would have been another enforcement mode for Network Access Protection. It operates very similarly, essentially functioning on the basis that :

  1. The enterprise controls the endpoint and, through group policy, dictates DNS client behavior and network egress policy,
  2. Leverage mutual TLS and DNS over HTTPS (DoH) to authenticate clients, and
  3. Transform the DNS server into a policy server for network access.

I was always disappointed to see NAP get deprecated, especially the 802.1X and IPSEC-based enforcement models. Don’t get me wrong, there were many things we should have done differently, but the value of this kind of enforcement and visibility in an enterprise is immense. This is why, years later, the patterns in NAP were reborn in solutions like BeyondCorp and the myriad of “Zero Trust” solutions we see today.

So why might this ZeroTust DNS be interesting? One of the more common concerns I have heard from large enterprises is that the mass adoption of encrypted communications has made it hard for them to manage the security posture of their environment. This is because many of the controls they have historically relied on for security were designed around monitoring clear text traffic.

This is how we ended up with abominations like Enterprise Transport Security which intentionally weakens the TLS 1.3 protocol to enable attackers—erm, enterprises—to continue decrypting traffic. 

One of the theses of this Zero Trust DNS solution appears to be that by turning DNS into what we used to call a Policy Enforcement Point for the network enterprises get some of that visibility back. While they do not get cleartext traffic, they do get to reliably control and audit what domain names you resolve. When you combine that with egress network filtering, it has the potential to create a closed loop where an enterprise can have some confidence about where traffic is going and when. While I would not want my ISP to do any of this, I think it’s quite reasonable for an enterprise to do so; it’s their machine, their data, and their traffic. It also has the potential to be used as a way to make lateral movement in a network, when a compromise takes place, harder and maybe, in some cases, even make exfiltration harder.

Like all solutions that try to deliver network isolation properties, the sticking point comes back to how do you create useful policies that reduce your risk but still let work happen as usual. Having the rules based on high-level concepts like a DNS name should make this better, than with, for example, IPSEC-based isolation models, but it still won’t be trivial to manage. It looks like this will have all of those challenges still but that is true of all network segmentation approaches.

What I appreciate most about this is its potential to reduce an organization’s perceived need to deploy MiTM solutions. From a security protocol design perspective, what is happening here is a trade-off between metadata leakage and confidentiality. The MiTM solutions in use today cause numerous problems; they hinder the adoption of more secure protocol variants and objectively reduce enterprise security in at least one dimension. They also make rolling out new versions of TLS and more secure cryptography that much harder. Therefore, in my opinion, this is likely a good trade-off for some organizations.

To be clear, I do not believe that host-based logging of endpoint communications will lead all enterprises to abandon these MiTM practices. For example, some MiTM use cases focus on the intractable problems of Data Leak Protection, network traffic optimization, or single sign-on and privilege access management through protocol-level manipulation. These solutions clearly require cleartext access, and name-based access control and logging won’t be enough to persuade enterprises that rely on these technologies to move away from MiTM. However, there are some use cases where it might.

So, is this a good change or a bad change? I would say on average it’s probably good, and with some more investment from Microsoft, it could be a reasonable pattern to adopt for more granular network segmentation while giving enterprises more visibility in this increasingly encrypted world without needing to break TLS and other encryption schemes.