Reading the Tea Leaves: What Led to the Largest IT Outage in History

Last night, on July 18, 2024, a significant IT outage disrupted businesses worldwide, affecting airlines, financial services, TV broadcasters, and more. Some have described it as potentially the “largest IT outage in history.” The outage was linked to a recent update from cybersecurity giant CrowdStrike; according to CEO George Kurtz, the disruption was caused by a defect in a content update for Windows hosts (source). This incident underscores a critical lesson in software deployment: the importance of thorough testing and staged rollouts, often referred to as canarying.

What is Canarying?

Canarying a release involves deploying updates to a small subset of users or systems before a full-scale rollout. Named after the canaries once used in coal mines to detect toxic gases, canarying in software development serves as an early warning system. By releasing software to a limited audience first, potential issues can be identified in a controlled environment, allowing for fixes before broader deployment.
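
To make this concrete, here is a minimal, hypothetical sketch of one common approach: deterministically hashing each host into a small canary ring that receives the update before the rest of the fleet. The host IDs, the two-ring split, and the one-percent threshold are illustrative assumptions, not a description of how CrowdStrike or any other vendor actually assigns cohorts.

```python
import hashlib

CANARY_PERCENT = 1  # expose the update to roughly 1% of the fleet first (assumed value)


def assigned_ring(host_id: str) -> str:
    """Deterministically place a host in the 'canary' or 'broad' ring.

    Hashing the host ID keeps the assignment stable across runs, so the
    same small cohort always receives new updates first.
    """
    digest = hashlib.sha256(host_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "canary" if bucket < CANARY_PERCENT else "broad"


if __name__ == "__main__":
    # Hypothetical fleet of 10,000 endpoints.
    fleet = [f"host-{i:05d}" for i in range(10_000)]
    canaries = [h for h in fleet if assigned_ring(h) == "canary"]
    print(f"{len(canaries)} of {len(fleet)} hosts would receive the update first")
```

The point of the deterministic hash is simply that the canary population is stable and small: if a defective update ships, only that cohort is exposed while the problem is investigated.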

Did Canarying Happen?

The scale of the outage suggests that canarying was either skipped or inadequately performed. In complex IT environments, the space of possible configurations on general-purpose computing endpoints is vast, making it impossible to test every combination of usage patterns, hardware, software, and settings. This is where canarying proves invaluable: issues that are hard to reproduce in a lab can be caught by exposing the update to a limited audience first, and failing to do so can lead to significant disruptions, as we saw here. Nor is this solely CrowdStrike’s problem; it affects customers too. Enterprises often use tools like Microsoft’s Windows Server Update Services (WSUS) to control which updates reach their devices, and when, precisely to mitigate risks like this. On the vendor side, a promotion gate of the kind sketched below is one way to hold back a broad rollout until the canary cohort has proven healthy.
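
The following is a hedged sketch of such a gate: the broader rollout is held until the canary cohort has soaked for a minimum period without an elevated crash rate. The metric names, thresholds, and soak window are assumptions chosen for illustration, not taken from any vendor’s actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class CanaryStats:
    hosts: int          # hosts running the new content update
    crashes: int        # hosts that failed badly (e.g., crash-looped on boot)
    soak_minutes: int   # how long the cohort has been under observation


MAX_CRASH_RATE = 0.001   # abort if more than 0.1% of canaries crash (assumed threshold)
MIN_SOAK_MINUTES = 60    # require at least an hour of observation (assumed window)


def should_promote(stats: CanaryStats) -> bool:
    """Return True only if the canary cohort looks healthy enough to
    justify a broader rollout; otherwise hold and investigate."""
    if stats.hosts == 0 or stats.soak_minutes < MIN_SOAK_MINUTES:
        return False  # not enough signal yet
    crash_rate = stats.crashes / stats.hosts
    return crash_rate <= MAX_CRASH_RATE


if __name__ == "__main__":
    print(should_promote(CanaryStats(hosts=100, crashes=0, soak_minutes=90)))  # True
    print(should_promote(CanaryStats(hosts=100, crashes=5, soak_minutes=90)))  # False
```

Even a gate this simple changes the blast radius: a defective update that crashes canary hosts never reaches the rest of the fleet, turning a global outage into a contained incident.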

Looking at the Scale

The broader implications of this incident echo concerns from the 2000s about the risks of a Windows monoculture (source). Back then, critics argued that Microsoft’s dominance created a single point of failure. If a vulnerability or defect affected Windows, it could potentially impact millions of systems worldwide. This monoculture made it easier for malware to propagate, as seen with worms like Code Red (source) and Nimda (source), which exploited Windows systems en masse.

Historical incidents like the WannaCry ransomware attack (source) and the recent CDK Global outage in the automotive industry illustrate the widespread impact software vulnerabilities can have. The CDK outage, caused by two cyberattacks, forced the company to shut down its systems for approximately two weeks, resulting in an estimated $1.02 billion loss for dealerships (source). During that time, affected dealerships could at least fall back to pen-and-paper processes and third-party software workarounds. Such fallbacks will likely not be an option in the CrowdStrike outage; as a result, its financial impact will probably be far larger, especially given the broader scope and scale of the affected industries. This incident also echoes the monoculture concerns raised around the STORM-0558 incident, where Google’s response to the CSRB report on Microsoft emphasized the risks of relying on a single vendor (source).

Conclusion

This outage reminds us of a key lesson: always thoroughly test software before full deployment. But there is a bigger lesson here, one that echoes Ken Thompson’s famous words: “You can’t trust code that you did not totally create yourself.” This outage highlights the risks of relying too heavily on third-party software without appropriate mitigating controls.

To build more reliable systems, we need to be vigilant. This means not blindly trusting vendors but verifying their work, and doing so is as much about basic business continuity as it is about avoiding security defects (source) and regressions like the one we recently saw with regreSSHion (source). And let’s not forget the risks of software monocultures: we should diversify our software, to the extent we can, to reduce the risk of widespread failures, and where we can’t, implement appropriate mitigating controls.

In essence, while we may not write every line of code ourselves, we must manage the risks posed by each piece of software we rely on, or we are doomed to repeat outages like today’s.
