
Rethinking Security in Complex Systems

Over the last few decades, we seem to have gotten better at the micro aspects of security, such as formally verifying protocols and designing cryptographic algorithms, but have gotten worse, or at least failed to keep up, at the macro aspects, such as building and managing effective, reproducible, risk-based security programs.

This can probably be attributed to both the scale of the systems we now depend on and, perhaps even more, to human factors. The quote, “Bureaucracy defends the status quo long past the time when the quo has lost its status,” is relevant here. Security organizations grow organically, not strategically, usually in response to a past failure or a newly recognized risk. This ultimately results in new teams that exist forever, increasing the overall load of the security organization on the business. The role of these organizations typically expands over time to justify this load, transforming them into data-gathering organizations. This explains why enterprises have so many dashboards of data that fail to drive action. Organizations are overwhelmed and struggle to understand where their risks lie so they can effectively allocate their limited resources toward achieving security-positive outcomes.

Building on this foundation of understanding risks, there is the question of how we measure the success of security organizations in managing risk. As the saying goes, “If you can’t measure it, you can’t manage it.” The issue is that organizations often turn these measurements into metrics of success, which seems rational on the surface. However, in practice, we encounter another adage: “When a measure becomes a target, it ceases to be a good measure,” something that is especially true in security. For example, we often judge a security program by whether it responds to an incident well, ships new capabilities, or passes audits, but this is an incomplete picture. Audits, for example, are designed to fit every organization, which takes us to the saying, “If you try to please all, you please none.” With this approach you are guaranteed to be missing major issues or, at an absolute minimum, prioritizing activity over effectiveness. To make this more concrete, it incentivizes passing audits (despite the incomplete picture they represent) and driving down the number of CVEs identified (despite these findings almost always being false positives), which in turn misleads organizations into believing they are more secure than they actually are.

Moving beyond metrics alone, a combination of metrics, technology, and continual improvement can help with some of the scale problems above. For example, AI shows promise for triaging issues and accelerating the review of low-level issues that fit neatly into a small context window, but at the same time it can give a false sense of security. The human problems, on the other hand, are something we cannot simply automate; the best we can do is to rethink the way we build organizations so that empowerment and accountability are woven into how the organization operates. This will require ensuring those who take on that accountability truly understand how systems work and build their teams with a culture of critical, continual improvement. A second-order effect of these scale tools is the de-skilling of their creators and operators. For example, I am always blown away by how little modern computer science graduates, even those from the most prestigious schools, understand about how the systems they write code on operate. We must also make continual education a key part of how we build our organizations.

Finally, to support this from a design standpoint, we also need to consider how we design systems. The simplest approach is to design these systems in the most straightforward way possible, but even then, we need to consider the total operational nature of the system as part of the design. In user experience, we design for key users; early on, we talk about the concept of user stories. In system design, we often design first and only then figure out how that system will be operated and managed. We need to incorporate operational security into our designs. Do these systems emit the information (metrics and contextualized logs) required to monitor them, and do we provide the tools to use that information for detection? If not, how are their users to know they are operating securely? For example, do they enable monitoring for threats like token forgery over time? We must make the systems we ship less dependent on the human beings involved in their operation, recognizing that those operators are generalists, not specialists, and give them simple answers to these operational issues if we want to help them achieve secure outcomes.
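
To make the idea of contextualized logs a bit more concrete, here is a minimal sketch in Python of a token validation event that carries the signals a generalist operator could alert on. Everything in it is an assumption made for illustration: the field names, the `ACTIVE_KEY_IDS` set, and the 90-day `MAX_KEY_AGE` policy are hypothetical and not drawn from any particular product.

```python
import json
from datetime import datetime, timedelta, timezone

# Hypothetical policy values, chosen only for illustration.
ACTIVE_KEY_IDS = {"key-2024-05"}
MAX_KEY_AGE = timedelta(days=90)


def token_validation_event(key_id, key_created_at, issuer, audience, now=None):
    """Build a contextualized log record for one token validation."""
    now = now or datetime.now(timezone.utc)
    key_age = now - key_created_at
    return {
        "event": "token_validated",
        "timestamp": now.isoformat(),
        "issuer": issuer,
        "audience": audience,
        "signing_key_id": key_id,
        "signing_key_age_days": key_age.days,
        # Simple yes/no signals an operator can alert on directly.
        "key_not_active": key_id not in ACTIVE_KEY_IDS,
        "key_too_old": key_age > MAX_KEY_AGE,
    }


def emit(event):
    """Write the event as one JSON line so log tooling can parse it."""
    line = json.dumps(event, sort_keys=True)
    print(line)
    return line
```

The point of the sketch is the design choice rather than the code: the system itself states whether the signing key was expected and how old it is, so the operator does not have to reconstruct that context later.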

In conclusion, organizations need to look at the technological and human aspects of their security program, as well as their technology choices, continuously and critically. This will almost certainly involve rethinking the metrics they use to drive security efforts, while also building a workplace culture centered on empowerment, accountability, and continuous improvement, and fundamentally integrating the security lifecycle of systems into the design process from day one. By adopting these strategic pillars, organizations can build more resilient, effective, and adaptable security programs that are equipped to meet the challenges of today’s dynamic environment.

Navigating Security and Innovation

I started my career at Microsoft in the 90s, initially working on obscure third-party networking issues, printing, and later Internet Explorer. Back then, even though I had gotten into computers through what today would probably be categorized as security research, it would have been nearly impossible in those days to find someone who wanted to hire me for those skills. I left the company a few years later and found my first job in computer security and never looked back.

I came back to Microsoft in 2000, but this time I was working on authentication and cryptography. This was just a few years before the infamous security standdown that was kicked off by Bill Gates’ Trustworthy Computing memo. This gave me a firsthand view into what led to that pivotal moment and how it evolved afterward. The work that was done during the subsequent years changed the way the industry looked at building secure software.

The thing is, at the same time, the concepts of third-party operated applications (SaaS) and shared computing platforms (Cloud Computing) were gaining traction. The adoption of these concepts required us to rethink how we build secure software for these new use cases and environments. For example, this shift introduced the concepts of massive multi-tenancy and operational shared fate between customers and their providers and made shipping updates much easier on a large scale. This accelerated rate of change also drove the need to rethink how we manage a security program, as the approaches used by the traditional software business often did not apply in this fast-paced world. My initial exposure to this problem came from my last role at Microsoft, where I was responsible for security engineering for the Advertising business.

The company had not yet defined mature approaches to securing online services. This created the opportunity for us to find similar but different models that could fit the realities of these new environments, which had both positive impacts (agility to remediate) and negative impacts (scale and speed), and through that, to try to build a security program that could work in this new reality.

I share that context to give a bit of color to the bias and background I bring to the current situation Microsoft finds itself in. Having lived through what was surely the world’s single largest investment in making software and services secure to that point, and having spent decades working in security, I have had the chance to see several cycles in the way we look at building systems.

A New Chapter Unfolds

All things in life have natural cycles, and the same is true for how the industry views security. Organizations ebb and flow as a result of market changes, leadership changes, and as customer demands evolve. In the case of security, there is also the false idea that it is a destination or a barrier to delivering on business objectives that factors into these cycles.

As a result, it’s no surprise that over the following decade and a half, we saw Microsoft lessen its commitment to security – especially in the fast-moving and growing opportunity for cloud services. As an outsider looking in, it felt like they lost their commitment around the time they began viewing security as a business rather than the way you keep your promises to customers. At some point it felt like every month, you would see outages related to mishandling basics, with increased frequency of the same type of issues, for example, multi-tenancy violations, one right after another.

This increase in basic security issues was paired with poor handling of incidents, which is why it was no surprise to see the incident known as Storm-0558 come about. As soon as this incident became public, it was clear what had happened: the organization had adopted the most convenient practices to ship and operate and under-invested in the most basic lessons of the preceding two decades, in a trade-off that externalized the consequences of those decisions to their customers.

Microsoft had no choice but to respond in some way, so three months after the issue became public, they announced the Secure Future Initiative, which can be summarized as:

  1. Applying AI to Scale Security
  2. Using More Secure Defaults
  3. Rolling Out Zero Trust Principles
  4. Adopting Better Key Management
  5. Bringing Consistency to Incident Response
  6. Advocating for Broader Security Investments

This was lauded by some as the next Trustworthy Computing initiative, but on the surface, that’s a far cry from the kind of investment made during those days. To me it sounds more like a mix of how Microsoft intends to meet the CISA Secure by Design initiative and how they think they need to respond to the Storm-0558 incident. There is always a question of messaging versus reality, so I personally held, and still hold, hope that this was the first organizational sign of an awakening that could lead to a similar level of investment.

Shortly after the Storm-0558 incident, I appeared on “Security Conversations” with Ryan Naraine. We discussed how the situation might have unfolded and identified the root causes; my answer was a lack of security leadership. Therefore, it was no surprise to me that when the CSRB report came out, the reviewers had reached the same conclusion.

“Microsoft’s security culture was inadequate and requires an overhaul, particularly in light of the company’s centrality in the technology ecosystem and the level of trust customers place in the company to protect their data and operations.” 

Despite these challenges, it’s important to recognize that not all teams within Microsoft have been equally impacted by these systemic issues. As William Gibson famously stated, ‘The future is already here — it’s just not very evenly distributed.’ This is evident within Microsoft, where the Windows team, for example, appears to have continued to do well relative to its peers.

Beyond Metrics to Meaningful Reform

The other day Satya Nadella, someone who did what many thought was impossible by turning Microsoft around from the company it became during the Ballmer years, wrote an internal memo that, amongst other things, stated:

“If you’re faced with the tradeoff between security and another priority, your answer is clear: Do security.”

In this memo, he also stated:

“In addition, we will instill accountability by basing part of the compensation of the senior leadership team on our progress towards meeting our security plans and milestones.”

At the same time, Charlie Bell released more information on the intended implementation of the Microsoft Secure Future Initiative, which puts some more meat on what the initial announcement promised and explains how they are expanding it as a result of the CSRB findings. The most impactful organizational change is probably the decision to add deputy CISOs in product teams.

So the good news is that this does signal that Microsoft has heard the message from the CSRB. These are good first steps in addressing the cultural issues that have contributed to Microsoft’s broad decline as a leader in security over the last several decades.

The question then becomes: does the executive leadership under Satya understand how their personal choices, organizational structure, approach to culture, approach to staffing, and overall business management decisions have contributed to the current situation? In my experience, the types of changes needed to achieve the transformational shifts required to address security neglect often necessitate leadership changes. Merely issuing a strong directive from the CEO and allocating additional budget is seldom enough to materially create the needed changes to a company’s approach to security.

What concerns me is the wording of the statement about tying compensation to security. What it actually does is link compensation to progress in meeting their “security plans and milestones”.

So the question becomes, do those security plans and milestones manifest into the technical changes and cultural changes needed to address the problem and the hole they have dug for themselves?

If we take the work items called out in the Secure Future Initiative as the definition of what they believe their problems are, and look at the value they will realize by executing on them, I have some doubts.

Bridging Visions and Realities

If we look at the CSRB report and classify the issues identified into 5 categories we see that the majority of the identified issues were related to design decisions. 

| Category | Example | Proportion |
| --- | --- | --- |
| Security Design Issues | Inadequate cryptographic key management, failure to detect forged tokens | 40% |
| Incident Response | Delays in updating the public on the true nature of incidents, slow response to key compromise | 20% |
| Operational Issues | Failure to rotate keys automatically, using aging keys, allowing consumer keys to access enterprise data | 20% |
| Vulnerability Management | Lack of controls to alert for aging keys, not detecting unauthorized token use | 10% |
| Risk Management | Inadequate security practices compared to other CSPs, not having a detection system for forged tokens | 10% |

Key:

  • Security Design Issues: This includes fundamental flaws in how security measures were architected.
  • Incident Response: Refers to the overall handling and transparency of the incident, including the timeliness and accuracy of public communications.
  • Operational Issues: These are failures in the operational handling of security mechanisms.
  • Vulnerability Management: Concerns the lack of proactive measures to detect and mitigate vulnerabilities.
  • Risk Management: Describes the overall approach to assessing and managing risks, highlighting a lack of comparable security controls relative to industry standards.

If we compare those issues with the areas of investment outlined above in the Secure Future Initiative announcement, it’s not clear to me that they would have made a meaningful dent in the root cause of the identified issues, and more generally they don’t seem to look at the larger systemic issues Microsoft is experiencing at all.

Let’s just take the plan to use AI to help scale out their security program as an example. It is certainly a worthwhile initiative, but the root cause here was a design issue. Today’s AI systems are good at automating the tasks we understand how to do well, and even then they struggle, and that’s not even touching on the more nuanced issue of “design”.

For example, there is a great paper from Dan Boneh and his students that shows that code from solutions like OpenAI’s Codex may contribute to the creation of less secure code. Another research effort focused on GitHub Copilot reported similar findings.

This doesn’t mean that this technology isn’t promising or that it can’t help manage security issues in the massive software systems we rely on today. However, it’s unlikely to significantly impact the types of issues currently seen at Microsoft. That’s why the CSRB has emphasized the need for a cultural overhaul in how Microsoft approaches security organizationally. Satya Nadella’s message about prioritizing security is a step in the right direction, and Charlie Bell’s blog post does outline a systemization of how they will go about that, but meaningful cultural change and making Microsoft a leader in security again will require much more than a blog post and incentives to execute in a timely manner.

Conclusion

Microsoft was once known for its poor track record in building secure software and services. They made huge investments and became a leader, but over time, they lost their edge. The Secure Future Initiative marks a step forward, as does the recent memo from Satya Nadella prioritizing security above all else. However, true progress will depend on Microsoft’s ability to roll out organizational changes and rebuild a culture that prioritizes security, not just meeting milestones.

The good news is that they have the talent, the resources, and still some of the muscle memory on how to get this done at scale. If Satya can turn around the company from those ailing Ballmer years, I have faith he can address this issue too.

Restoring Memories

As the old saying goes, “You can take the boy out of the farm, but you can’t take the farm out of the boy.” Although I was raised in metro Seattle, my father grew up on a farm in Eastern Washington, in the city of Walla Walla. We made regular trips there during my childhood, especially when my great-grandmother lived there by herself. These visits were more than just familial obligations; they were my introduction to values like hard work, family, and the joy of being close to the earth—values that have profoundly influenced who I am today.

I also fondly recall visits and weekend trips to my uncle’s, where my cousins and I would ride in the bed of his Chevy 3100, sliding around as we drove down the road, laughing and jostling around — back when the world was less concerned about safety regulations. Those moments of freedom are treasures I still carry.

This sense of nostalgia may explain why, after a career in information security, I felt compelled to restore several late 19th and early 20th-century safes. A few years ago, I embarked on a project to restore a Dodge Power Wagon, which encapsulates the strength, reliability, and spirit of those farmstead adventures.

The Power Wagon’s Origin Story

The Dodge Power Wagon earned its legendary status on American farmlands shortly after World War II. Returning servicemen recognized the potential of the Dodge WCs they had used in the war. These vehicles could navigate the rugged farm terrain much like the battlefields they’d left behind. Equipped with a Power Take Off (PTO) and winch, the Dodge WC was not just a means of transport; it transformed into a tool that could till the fields or haul away a fallen tree. Recognizing its demand, Dodge released a civilian version—the Dodge Power Wagon.

The Power Wagon was the first mass-produced civilian 4×4 vehicle, ultimately symbolizing an era when durability and utility were paramount in vehicle design. Its introduction led to the widespread adoption of 4×4 capabilities by nearly every truck manufacturer.

Power Wagons also played a vital role in developing early infrastructure, aiding transportation and communication networks for rail and telephone companies. Coachbuilders would modify these trucks by combining two Power Wagons to create multi-door vehicles that could transport crews to remote or difficult-to-access sites. It wasn’t until International Harvester introduced the Travelette in 1957 that a production truck with three or more doors became available.

Anyone who has ever done a high-end restoration of a vehicle will tell you it takes way longer than you expect, and my project is no different. We are getting close to the end of the project: it runs, drives, and stops, has been put back together and painted, and is now getting its interior done. Even so, I would surely be wrong with whatever guess I gave for when it will be finished.

My Power Wagon Restoration

Restoring this piece of history isn’t just about reviving a classic vehicle; in a way, it’s a tribute to my father, my family’s legacy. It’s a pilgrimage back to my roots, a way to share my family’s story with my children and, eventually, my grandchildren.

Navigating Content Authentication In the Age of Generative AI

In 1995, SSL was introduced, and it took 21 years for 40% of web traffic to become encrypted. This rate changed dramatically in 2016 with Let’s Encrypt and the adoption of ACME, leading to an exponential increase in TLS usage. In the next 8 years, adoption nearly reached 100% of web traffic. Two main factors contributed to this shift: first, a heightened awareness of security risks due to high-profile data breaches and government surveillance, creating a demand for better security. Second, ACME made obtaining and maintaining TLS certificates much easier.

Similarly, around 2020, the SolarWinds incident highlighted the issue of software supply chain security. This, among other factors, led to an increase in the adoption of code signing technologies, an approach that has been in use at least since 1995, when Microsoft used it to help deal with the problem of authenticity as we shifted away from CDs and floppy disks to network-based distribution of software. However, the complexity and cost of code signing severely limited its widespread use, and where it was used, poor tooling and the key compromises that followed often meant deployments failed to achieve the promised security properties. Decades later, projects like Binary Transparency started popping up and, thanks to the SolarWinds incident, related projects such as the Go checksum database, Sigstore, and Sigsum led to more usage of code signing.

Though the EU’s digital signature laws in 1999 specified a strong preference for cryptographic-based document signing technologies, their adoption was very limited, in part due to the difficulty of using the associated solutions. In the US, the lack of a mandate for cryptographic signatures resulted in even more limited adoption of this more secure approach to signing documents, with most signers relying on font-based signatures instead. However, during the COVID-19 pandemic, things started changing; in particular, most states adopted remote online notary laws mandating the use of cryptographic signatures, which quickly accelerated the adoption of this capability.

The next shift in this story started around 2022 when generative AI began to take off like no other technology in my lifetime. This resulted in a rush to create tools to detect this generated content but, as I mentioned in previous posts [1,2], this is at best an arms race and more practically intractable on a moderate to long-term timeline.

So, where does this take us? If we take a step back, what we see is that societally we are now seeing an increased awareness of the need to authenticate digital artifacts’ integrity and origin, just like we saw with the need for encryption a decade ago. In part, this is why we already see content authentication initiatives and discussions, geared for different artifact types like documents, pictures, videos, code, web applications, and others. What is not talked about much is that each of these use cases often involves solving the same core problems, such as:

  • Verifying entitlement to acquire the keys and credentials to be used to prove integrity and origin.
  • Managing the logical and physical security of the keys and associated credentials.
  • Managing the lifecycle of the keys and credentials.
  • Enabling the sharing of credentials and keys across the teams that are responsible for the objects in question.
  • Making the usage of these keys and credentials usable by machines and integrating naturally into existing workflows.

This problem domain is particularly timely in that the rapid growth of generative AI has raised the question for the common technology user: How can I tell if this is real or not? The answer, unfortunately, will not be in detecting the fakes, because of generative AI’s ability to create content that is indistinguishable from human-generated work. Rather, it will become evident that organizations need to adopt practices, across all modalities of content, to not only sign these objects but also make verifying them easy, so these questions can be answered by everyday users.

This is likely to be accelerated once the ongoing shifts take place in the context of software and service liability for meeting security basics. All of this seems to suggest we will see broader adoption of these content authentication techniques over the next decade if the right tools and services are developed to make adoption, usage, and management easy.

While no crystal ball can tell us for sure what the progression will look like, it seems not only plausible but necessary in this increasingly digital world where the lines between real and synthetic content continue to blur that this will be the case.

Update: Just saw this while checking out my feed on X and it seems quite timely 🙂

Tenement Farming and Cloud HSMs

While it’s fair to say that using a Cloud HSM means your keys are protected by a device meeting FIPS 140-3 standards, assuming the HSM in use has this certification, it’s important to realize this doesn’t guarantee the security you might expect. The security model of HSMs was built for the threats of the 1980s. These devices were not network-connected and were single-tenant — if they were “online” it was usually via HSMs attached to physical computers running an application on a machine connected to private networks — not connected to a globally reachable endpoint.

At their core, these devices were designed to protect keys from physical theft or, more precisely, to slow down and increase the cost of theft, much like safe ratings (UL TL-15, TL-30, TL-30x6) indicate how long the associated safes can resist attack. For example, early in my career, I worked on a project where we built attacks to extract non-exportable keys from a specific HSM and then imported them into another vendor’s HSM because the prior vendor went out of business. There have also been a number of key exfiltration bugs in these devices over the years.

We didn’t see network-connected HSMs until around 1999, but even then, these devices were single-tenant, essentially just a network-connected Linux or BSD box containing fundamentally the same hardware as years earlier. While this change did allow a single company to share an HSM across different application workloads, the assumption was still that this HSM was managed by the company in charge of all of these applications.

Why is this important today? Most computing is now done in shared cloud infrastructure, administered by someone else, with your competitor or an attacker on the same hardware as you. This presents a very different set of security considerations and design constraints than the ones these devices were originally built for. You are now exposed to the risks of the physical and logical administrators of these Cloud HSMs, the services they are dependent on, as well as other tenants of the Cloud HSM.

Consider that the compute operator usually can technically access the handle the application uses to talk to the HSM, and likely the secret used to authenticate to this HSM as well, meaning they, or an attacker, could potentially use that handle or secret to sign or decrypt data as they wish. You might find that an acceptable risk, but did you know some HSMs allow the administrator to blindly add users as operators to the “virtual HSMs” within it? Yup, they do.

What about when keys are stored in a KMS and the key policy dictates the key be managed by an HSM? If the HSM hardware attests that the key is stored in the HSM, and this attestation is verified, it’s nearly the same threat profile we just discussed. In some cases, it could be argued it is better because access to the HSM can have traditional user and service RBAC controls, and rate limiting, and keys can be replicated to many other HSMs without any administrative burden for you, keeping you safe from a common disaster recovery scenario while normalizing the management of these devices so it fits into your normal operational practices which hopefully are well managed and monitored.

Regardless of the approach, the bigger question is whether your provider’s operational and security practices are up to your specific threat model. Imagine a Bitcoin wallet worth 100 million dollars. Has your cloud provider proportionally invested enough into controls and tests around their system to prevent a motivated attacker from using your key to sign a transaction that moves all that to another wallet? Probably not.

The fundamental issue is that today’s HSMs were largely designed for different eras with different security concerns than we typically have today, mainly to protect against physical theft of keys in environments where data centers were effectively closets in dedicated office space. That doesn’t reflect today’s cloud computing scale.

It is worth noting that there are a few HSM solutions on the market that are making efforts to tackle some of these issues. They still fall short, but that is a topic for another post.

In essence, Cloud HSMs are to HSMs what Tenement Farming is to Farming.

That’s not to say there’s no value in these offerings, but as built today, they often fail to deliver the value they are assumed to deliver. And if regulations mandated their use before, say, 2010, chances are they’re not delivering the intended value that those regulations had in mind.

So, how should we be protecting keys now?

To be clear, this is not a case against Cloud HSMs; it is an argument to think about the threat model and use case you are solving for. For example, look at Storm-0558, where Microsoft appears to have been using the private key material in the process of their IdP. The attacker was able to cause a memory dump to be created and then, via another attack vector, obtain that memory dump and, as a result, the private key. We can take away at least one solid lesson: do not load keys into the process of the applications that rely on them. In this case, the least costly way to have prevented this key theft would have been simply moving the key to another process running in another user context, with a very simple API that is easy to defend. That would at least limit the attacker to a handle, versus what happened in this case, where the attacker was able to use the key with impunity for years. This approach is the rough equivalent of a workload or node-specific software HSM, similar in spirit to the original HSMs.
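
The shape of that idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a hardened design: `SignerClient` and `SIGNER_SOURCE` are hypothetical names, an HMAC key stands in for a real private key, and the helper process is not actually started under a different OS user here (a real deployment would need that, plus authentication of the caller).

```python
import subprocess
import sys

# Source of the hypothetical signer process. It creates its own key and only
# ever returns MACs, so the key never enters the calling application's memory.
SIGNER_SOURCE = r"""
import hashlib, hmac, secrets, sys
key = secrets.token_bytes(32)  # stand-in for a real private key
for line in sys.stdin:
    message = bytes.fromhex(line.strip())
    tag = hmac.new(key, message, hashlib.sha256).hexdigest()
    sys.stdout.write(tag + "\n")
    sys.stdout.flush()
"""


class SignerClient:
    """Application-side handle: it can request signatures, not read the key."""

    def __init__(self):
        # Assumption: in a real deployment this process would run as a
        # different OS user; this sketch only separates the processes.
        self._proc = subprocess.Popen(
            [sys.executable, "-c", SIGNER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def sign(self, message: bytes) -> str:
        """Send one hex-encoded message and read back one hex-encoded tag."""
        self._proc.stdin.write(message.hex() + "\n")
        self._proc.stdin.flush()
        return self._proc.stdout.readline().strip()

    def close(self):
        """Closing stdin ends the signer's read loop, and with it the key."""
        self._proc.stdin.close()
        self._proc.wait()
```

A memory dump of the application in this arrangement would contain the pipe to the signer, not the key itself, which is the "handle versus key" distinction described above.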

Another common problem we see in the industry is that solutions like HashiCorp Vault were designed to centralize key management and provide a one-size-fits-all answer to “Where do I keep my secrets?” Architecturally, these solutions look much like a passively encrypted database: if you have sufficient permissions, you can read the key in the clear and then copy it to whatever node or workload needs to use it. This took us from secret sprawl to secret spray, where we push the keys out in environment variables and files on production machines that later get dumped into logs and backups, continuously exposing the keys to users who should never have had access, and often leaving key remnants all over the place. This is only marginally better than checking keys into dedicated source control repositories.

The problem here isn’t limited to these secret sprawl solutions. Almost every web-server TLS private key sits in the file system, often with weak ACLs and without any encryption, and is then loaded into memory in the web server process. Similarly, most SSH keys also sit in some file, usually with a poor ACL, with the key either in the clear or protected by an easily grindable password, so read access to the file system is all a malicious actor needs to walk away with the key. For example, see this incident from last week.

In both of these cases, we would be much better off if we would move these keys into another user context that is more defensible and constrained.
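Until keys are moved out of the file system, the weak ACL problem described above is at least easy to audit. The following is a minimal sketch for POSIX systems; the function name weak_key_file_findings is hypothetical, and the check only looks at permission bits, not at whether the key is encrypted or who owns the file.

```python
import stat
from pathlib import Path


def weak_key_file_findings(path: Path) -> list[str]:
    """Return findings for a private key file based on its POSIX mode bits."""
    findings = []
    mode = path.stat().st_mode
    # Any read bit beyond the owner means other local accounts can copy the key.
    if mode & (stat.S_IRGRP | stat.S_IROTH):
        findings.append("readable by group or others")
    # Any write bit beyond the owner means the key can be silently replaced.
    if mode & (stat.S_IWGRP | stat.S_IWOTH):
        findings.append("writable by group or others")
    return findings
```

Running this over the paths where TLS and SSH keys usually live gives a quick, if incomplete, picture of how exposed they are. It does nothing about the larger issue that the key still ends up in the memory of the process that uses it.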

So how did we end up here with such abysmal practices for managing keys?

While there is seldom one single reason for such neglect, in this case, I think one of the largest is the dogmatic “all keys must be kept in HSMs or smart cards”. It is just too easy a get-out-of-jail-free card for a security professional. Instead of thinking about the real risks and operational practices and then designing practical, appropriate strategies to mitigate those threats, people who can afford to complete that checkbox do so, and those who cannot just copy keys around in the clear out of a database.

The reality is we can do a lot better but as they say, the first step is to accept that you have a problem.

In short, as security professionals we need to avoid dogmatic answers to complex questions and spend the time to look more critically at the risks, constraints, obligations, resources, and real-world scenarios those we work with are operating within before we throw generic playbook answers at those coming to us for advice.

Evolving Challenges in Software Security

In 2023, we observed an average month-to-month increase in CVEs of approximately 1.64%, with this rate accelerating as the year progressed. At the same time, several trends emerged that are associated with this increase. These include a heightened focus on supply chain security by governments and commercial entities, intensified regulatory discussions around how to roll out concepts of software liability, and the expanded application of machine learning technologies in software security analysis.

Despite the broad use of open source, the large majority of software is still delivered and consumed in binary form. There are a few reasons for this, but the most obvious is that the sheer size and complexity of code bases, combined with the limited availability of expertise and time within consuming organizations, make the use of the source to manage risk impractical.

At the same time, it’s clear this issue is not new. For example, in 1984, Ken Thompson, in his Turing Award Lecture, said, “No amount of source-level verification or scrutiny will protect you from using untrusted code.” This statement has been partially vindicated recently: intelligent code analysis agents, although they offer faster ways to produce code, have been found to exert downward pressure on code quality while also reducing developers’ understanding of the code they produce, which is a bad combination.

To the extent these problems are resolved we can expect the attackers to be using the same tools to more rapidly identify new and more complex attack chains. In essence, it has become an arms race to build and apply these technologies to both offensive and defensive use cases.

It is this reality that has led to DARPA’s creation of the Artificial Intelligence Cyber Challenge and its various projects on using AI to both identify and fix security defects at scale.

The saying “In the middle of difficulty lies opportunity” aptly describes the current situation, where numerous security-focused startups claim to offer solutions to our problems. However, the truth is often quite different.

Some of those racing to take advantage of this opportunity are focusing on software supply chain security, particularly software composition analysis. This is largely driven by regulatory pressures to adopt the Software Bill of Materials concept. Yet most tools that generate these documents only examine interpreted code and declared dependencies. As previously mentioned, the majority of code is delivered and consumed in compiled form, which leaves customers without enough data to assess the correctness and completeness of these documents. As a result, although these tools may help with compliance, they inadvertently cause harm by giving a false sense of security.

There are other vendors still that are essentially scaling up traditional source code reviews using large language models (LLMs). But as we’ve discussed, these tools are currently showing signs of reducing code quality and developers’ understanding of their own code. At the same time, because this analysis has so little context available to it, these tools produce such a high volume of false positives that triaging the outputs can turn into a full-time job. This suggests that negative outcomes could ensue over time if we don’t adjust how we apply this technology or see significant improvements in the underlying technology itself.

These efforts are all concentrated on the software creators, but if we expand the problem domain to include the consumers of software, we see that outside of cloud environments, where companies like Wiz and Aqua Security provide vulnerability assessments at scale, there are hardly any resources aiding software consumers in making informed decisions about the risks they face from the software they use. A big part of this is the sheer amount of noise even these products produce, combined with the lack of actionability in such data for the consumer of the software. With that said, these are tractable problems if we just choose to invest in new solutions rather than apply the same old approaches we have in the past.

As we look toward the next decade, it is clear that software security is at a pivotal point, and navigating it goes beyond just technology; it requires a change in mindset towards more holistic security strategies that consider both technical and human factors. The next few years will be critical as we see whether the industry can adapt to these challenges.

Echoes of the Past and Their Impact on Security Today

When I was a boy, my parents often made me read books they thought were important. One of these was “The Republic” by Plato, written around 380 BC. After reading each book, they’d ask me to talk about what I learned. Reading this one, I realized that politics haven’t changed much over time and that people always seem to believe their group should be the ones making the big decisions. This was the first time I truly understood the saying “History doesn’t repeat itself, but it often rhymes.” As someone who works in security, I think it’s important we all remember this. For example, these days, there’s a huge focus on Supply Chain Security in software, almost like it’s a brand-new idea. But if we look back to 1984, Ken Thompson talked about this very concept in his Turing Award lecture where he said, “No amount of source-level verification or scrutiny will protect you from using untrusted code.”

This is a common thread in information security in general. Take, for example, the original padding oracle attack on RSA, Bleichenbacher’s Oracle attack: it was published at the CRYPTO ’98 conference, and nearly two decades later we saw the Return Of Bleichenbacher’s Oracle Threat. Or consider the recent key recovery attack on SIDH, the basis of SIKE, one of the candidates in the NIST PQC process: the attack relied on a “glue-and-split” theorem developed in 1997!

While there is certainly an element of human nature involved here, there are also extenuating factors like the sheer amount of knowledge that we as a society have amassed. One of the exciting things about Large Language Models and AI more generally is that these techniques have the potential to harness that entire body of knowledge and to do so with far fewer mistakes, enabling us to advance even faster.

With that said, there is a larger problem: as security practitioners, we often frame our problem wrong. Back in 1998, when Dan Geer was at CertCo (I worked at a competitor called Valicert back then), he gave an excellent talk titled “Risk Management is Where the Money Is”. In it, he argued that the security industry as it was would be transformed into a risk management industry, something that has certainly happened. He also eloquently framed how customers look at risk-reward trade-offs, how the internet would evolve into a data center (e.g., the Cloud as we know it today), and more.

The reality is there is a lot to learn from our predecessors. By understanding historical patterns and better utilizing the lessons learned from the past, we can better prepare for and address the security issues we face today.

Challenges in Digital Content Authentication and the Persistent Battle Against Fakes

Efforts have been made for years to detect modified content by enabling content-creation devices, such as cameras, to digitally sign or watermark the content they produce. Significant efforts in this area include the Content Authenticity Initiative and the Coalition for Content Provenance and Authenticity. However, these initiatives face numerous issues, including privacy concerns and fundamental flaws in their operation, as discussed here.

It is important to understand that detecting fakes differs from authenticating originals. This distinction may not be immediately apparent, but it is essential to realize that without 100% adoption of content authentication technology—an unachievable goal—the absence of a signature or a watermark does not mean that it is fake. To give that some color just consider that photographers to this day love antique Leica cameras and despite modern alternatives, these are still often their go-to cameras.

Moreover, even the presence of a legitimate signature on content does not guarantee its authenticity. If the stakes are high enough, it is certainly possible to extract signing key material from an authentic device and use it to sign AI-generated content. For instance, a foreign actor attempting to influence an election may find the investment of time and money to extract the key from a legitimate device worthwhile. The history of DVD CSS demonstrates how easy it is for these keys to be extracted from devices and how even just being able to watch a movie on your favorite device can provide enough motivation for an attacker to extract keys. Once extracted, you cannot unring this bell.

This has not stopped researchers from developing alternative authenticity schemes. For example, Google recently published a new scheme they call SynthID. That said, this approach faces the same fundamental problem: authenticating generative AI content produced by trustworthy sources isn’t the same thing as detecting fakes.

It may also be interesting to note that the problem of detecting authentic digital content isn’t limited to generative AI content. For example, the Costco virtual member card uses a server-generated QR code that rotates periodically to limit the exposure of sharing of that QR code via screenshots.
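To illustrate the general idea of a server-generated code that rotates, here is a minimal sketch. I don’t know how the Costco scheme is actually built, so everything here is an assumption: the 30-second ROTATION_SECONDS period, the function names, and the use of an HMAC over the member ID and a time window (similar in spirit to TOTP).

```python
import hashlib
import hmac
import time
from typing import Optional

ROTATION_SECONDS = 30  # hypothetical rotation period


def rotating_code(server_secret: bytes, member_id: str, now: Optional[float] = None) -> str:
    """Derive the code for the time window containing `now` (defaults to the current time)."""
    window = int((time.time() if now is None else now) // ROTATION_SECONDS)
    message = f"{member_id}:{window}".encode()
    # Truncated for display in a QR code; the server can always recompute it.
    return hmac.new(server_secret, message, hashlib.sha256).hexdigest()[:16]


def is_current(server_secret: bytes, member_id: str, code: str, now: Optional[float] = None) -> bool:
    """Accept a presented code only if it matches the current window."""
    return hmac.compare_digest(code, rotating_code(server_secret, member_id, now))
```

A screenshot of the code stops working once the window rolls over, which limits the exposure without eliminating it: anyone who relays the live code within the window is still accepted.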

This does not mean that the approach of signing or watermarking content to make it authenticatable lacks value; rather, it underscores the need to recognize that detecting fakes is not the same as authenticating genuine content — and even then we must temper the faith we put in those claims.

Another use case for digital signatures and watermarking techniques involves their utility in combating the use of generative AI to create realistic-looking fake driver’s licenses and generative AI videos capable of bypassing liveness tests. There have also been instances of generative AI being used in real-time to impersonate executives in video conferences, leading to significant financial losses.

Mobile phones, such as iPhones and Android devices, offer features that help remote servers authenticate the applications they communicate with. While not foolproof, assuming a hardened and unmodified mobile device, these features provide a reasonable level of protection against specific attacks. However, if a device is rooted at the kernel level, or physically altered, these protective measures become ineffective. For instance, attaching an external, virtual camera could allow an attacker to input their AI-generated content without the application detecting the anomaly.

There have also been efforts to extend similar capabilities to browsers, enabling modern web applications to benefit from them. Putting aside the risks of abuse of these capabilities to make a more closed web, the challenge here, at least in these use cases, is that browsers are used on a wider range of devices than just mobile phones, including desktops, which vary greatly in configuration. A single driver update by an attacker could enable AI-generated content sources to be transmitted to the application undetected.

This does not bode well for the future of remote identification on the web, as these problems are largely intractable. In the near term, the best option that exists is to force users from the web to mobile applications where the server captures and authenticates the application, but even this should be limited to lower-value use cases because it too is bypassable by a motivated attacker.

In the longer term, it seems that this will fuel the fire for governments to become de facto authentication service providers, a role in which they have proven ineffective. Beyond that, if these solutions do become common, we can certainly expect their use to be mandated in cases that create long-lasting privacy problems for our children and grandchildren.

UPDATE: A SecurityWeek article came out today on this topic that has some interesting figures.

UPDATE: Another SecurityWeek article on this came out today.

Gov ID: If at First You Don’t Succeed, Try, Try Again

In the eIDAS 2.0 framework, the identity wallet is central to its expanded scope, mirroring early European government efforts at smart card-based national identity cards as well as subsequent identity wallet attempts. These efforts saw limited adoption, except for a few cases such as the Estonian national identity card, the Swedish e-identification, and the Dutch eID schemes. It seems that this part of eIDAS 2.0 is an effort to blend the best aspects of these projects with elements of Web3 in an attempt to achieve a uniform solution.

A significant shift from these past identity wallet efforts is the government’s role in identity verification, reminiscent of the earlier smart card national ID initiatives. This approach diverges from the prior identity wallet models, where external entities such as banks, telecoms, and commercial identity verification companies were responsible for verification. This combination potentially helps pave the way for holistic public sector adoption, similar to the success of Estonia’s national ID project, just on a much larger scale.

With that said it is important to remember that the majority of past efforts have struggled to achieve broad adoption. For example, the GOV.UK Verify platform encountered substantial usability issues, leading to resistance and eventually discontinued use by organizations that were mandated to use it. While the software-based nature of identity wallets may reduce deployment costs relative to smart cards, and government mandates could kick-start some level of adoption, the challenge of achieving widespread acceptance does not go away.

As it stands, it does seem that European CAs are betting on this to bootstrap a larger market for themselves. However, in a system as described above, this raises questions about the broader value and future role of third-party trust providers especially in a world where HTTPS on the web is protected with domain-validated certificates that these CAs have largely ignored or resisted.

This brings us to the contentious issue of the eIDAS 2.0 framework’s push for Qualified Web Authentication Certificates (QWACs) and the enforced support by browsers. While it is tempting to look at these two parts of the effort in isolation it is important to remember that regulations like these are made up of horse trading, so it is not surprising to see how clumsily this has all progressed. 

As an aside if you have not seen it there was an interesting talk at Chaos Computer Club last month about how badly these identity schemes have been executed that is worth watching. Only time will tell how effectively eIDAS 2.0 navigates these challenges and whether it can achieve the broad adoption that has eluded past initiatives.

Rethinking How We Assess Risk in the Software We Rely On

Despite today’s widespread use of open-source software, most software is still delivered in binary form. This includes everything from the foundational firmware of our computers to the applications we use for work, extending all the way to the containers running our server software in the cloud.

A significant challenge is that even when the source code of the software is available, reproducing the exact binary from it is often impossible. Consequently, companies and users are essentially operating on blind faith regarding any qualitative or quantitative assurances received from software suppliers. This stark reality played a critical role in the rapid and broad spread of the SolarWinds incident across the industry.

The SolarWinds Wake-Up Call

The SolarWinds attack underscored the risks inherent in placing our trust in software systems. In this incident, attackers infiltrated build systems, embedding malware into the legitimate SolarWinds software. Customers updating to the latest software version unwittingly became victims in this attack chain. It’s crucial to acknowledge that targeting a software supply chain for widespread distribution is not a new tactic. Ken Thompson, in his 1984 Turing Award Lecture, famously stated, “No amount of source-level verification or scrutiny will protect you from using untrusted code.” Regrettably, our approaches to this challenge haven’t significantly evolved since then.

Progress in the domain of supply chain security was initially slow. In 1996, Microsoft began promoting the concept of code signing with its Authenticode support, allowing customers to verify that their software hadn’t been altered post-distribution. Subsequently, the open-source movement gained traction, particularly following the release of Netscape Navigator’s source code. Over the next two decades, the adoption of open source, and to a lesser extent, code signing increased. The use of interpreted languages aided in understanding software operations, but as software grew in size and complexity, the demand for software engineers began to outstrip the supply. The adage “Given enough eyeballs, all bugs are shallow” suggests that greater openness can enhance security, yet the industry has struggled to develop a talent pool and incentive models robust enough to leverage source code availability effectively.

Before the SolarWinds incident, the industry, apart from some security engineers advocating for practices like reproducible builds, memory-safe languages, and interpreted languages, largely overlooked the topic of supply chain security. Notable initiatives like Google’s work on Binary Transparency, which predates SolarWinds, began to create an environment for broader adoption of code signing-like technologies with efforts like Go SumDB, Sigstore, and Android’s Binary Transparency (each of which I had the opportunity to contribute to). However, even these solutions don’t fully address the challenge of understanding the issues within a binary, a problem that remains at the forefront of security.

The industry’s response to SolarWinds also included embracing the concept of Software Bill of Materials (SBOM). These artifacts, envisioned to be produced by the build system, document the, often third-party, components used in software. However, this approach faces challenges, such as the possibility of attackers manipulating SBOMs if they compromise the build system.

The complexity of compiled software adds another layer of difficulty. Each compiled dependency has its own dependencies, not all of which are publicly declared, as is the case with static dependencies. When software is compiled, only portions of the dependencies that are used get included, potentially incorporating multiple versions of a single dependency into the final binary. This complexity makes simple statements about software components, like “I use OpenSSL 1.0,” inaccurate for even moderately complex code. Moreover, the information derived from SBOMs is often not actionable. Without access to all sources or the ability to build binaries independently, users are left with CVE lists that provide more noise than actionable insight.
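The point about multiple versions of one dependency is easy to see with a small sketch. The graph below is entirely hypothetical (package names, versions, and the function names are made up for illustration); it simply walks the transitive closure of a resolved dependency graph and reports any package that appears at more than one version.

```python
from collections import defaultdict

# Hypothetical resolved dependency graph: (name, version) -> direct dependencies.
GRAPH = {
    ("app", "1.0"): [("libtls", "2.1"), ("libhttp", "0.9")],
    ("libtls", "2.1"): [("openssl", "3.0.13")],
    ("libhttp", "0.9"): [("openssl", "1.0.2k"), ("zlib", "1.3")],
    ("openssl", "3.0.13"): [],
    ("openssl", "1.0.2k"): [],
    ("zlib", "1.3"): [],
}


def versions_in_closure(graph, root):
    """Map each package name to every version reachable from `root`."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    versions = defaultdict(set)
    for name, version in seen:
        versions[name].add(version)
    return versions


def duplicated(graph, root):
    """Return only the packages that are present at more than one version."""
    found = versions_in_closure(graph, root)
    return {name: sorted(v) for name, v in found.items() if len(v) > 1}
```

For this graph, `duplicated(GRAPH, ("app", "1.0"))` reports openssl at both 1.0.2k and 3.0.13, so a single-line statement such as “I use OpenSSL 3.0” would be incomplete. Note that this only works when the graph is known; statically linked or undeclared dependencies never show up in it.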

To make things worse, compilers can, through build optimizations, even remove security fixes that developers carefully put in to mitigate known issues, for example, code that zeroes memory to keep cryptographic keys and passwords from lingering in memory or getting paged to disk.

The Critical Role of Binary Analysis

If all we have is a binary, the only way to understand the risks it represents is to analyze it in the same way an attacker would. However, doing this at scale and making the analysis actionable is challenging. Recent advancements in machine learning and language development are key to addressing this challenge.

Currently, tools that operate on binaries alone fall into two categories. The first are solutions akin to 1990s antivirus programs – matching binaries to known issues. The second category helps skilled professionals reverse engineer the binary’s contents more quickly.
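As a rough illustration of the first category, the sketch below scans a binary blob for embedded OpenSSL version banners and looks them up in a table of known issues. This is a deliberately naive example of signature-style matching, not a description of how any vendor’s product works. The KNOWN_ISSUES table and its advisory identifier are hypothetical placeholders, and the pattern assumes the binary still contains the library’s version banner, which stripped or modified builds may not.

```python
import re

# Matches banners such as "OpenSSL 1.0.2k" embedded in a binary.
OPENSSL_BANNER = re.compile(rb"OpenSSL (\d+\.\d+\.\d+[a-z]*)")

# Hypothetical lookup table: version -> advisory identifiers (placeholders).
KNOWN_ISSUES = {
    "1.0.2k": ["EXAMPLE-ADVISORY-0001"],
}


def embedded_openssl_versions(blob: bytes) -> set[str]:
    """Return every distinct OpenSSL version banner found in the blob."""
    return {m.group(1).decode("ascii") for m in OPENSSL_BANNER.finditer(blob)}


def match_known_issues(blob: bytes) -> dict[str, list[str]]:
    """Map each embedded version to its known issues, if any are listed."""
    return {v: KNOWN_ISSUES.get(v, []) for v in embedded_openssl_versions(blob)}
```

This kind of matching can find a statically linked library that no manifest declares, but it says nothing about whether the vulnerable code is actually present or reachable, which is exactly where this category of tool falls short.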

Both categories have struggled to keep pace with the rapid changes in software over the past few decades. A new category of tools is emerging, led by companies like Binarly, which I advise. Binarly’s approach to automated binary analysis began with key goals such as achieving processor architecture independence and language independence. This enables the analysis of binaries across different architectures without duplicating threat intelligence and identifying insecure patterns stemming from ported code or common insecure Stack Overflow examples. Identifying static dependencies and which parts of them are used in a binary is both challenging and crucial for understanding the security issues that lie beneath the surface.

Their approach is remarkable in its ability to detect “known unknowns,” enabling the identification of classes of security vulnerabilities within a binary alone. Furthermore, through symbolic execution, they can perform reachability analysis, ensuring that flagged issues are not just theoretical but can potentially be exploited by attackers.

Though their approaches are not firmware-specific, firmware is a great example of the problems that come from binary-only distributions and customers’ reliance on blind faith that their vendors are making the right security investments. It is their unique approach to binary analysis that has enabled them to file and report more CVEs in the last two years than have ever been reported before.

Binary analysis of this kind is crucial as it scrutinizes software in its final, executable form—the form in which attackers interact with it.

Conclusion

The lesson from the SolarWinds attack is clear: no build system-based approach to articulate dependencies is entirely secure. Ken Thompson’s 1984 assertion about the limitations of trusting any code you didn’t produce yourself remains relevant. In a world where software vulnerabilities have extensive and far-reaching impacts, binary analysis is indispensable. Binarly’s approach represents a paradigm shift in how we secure software, offering a more robust and comprehensive solution in our increasingly connected world.