Category Archives: Security

Towards Greater Accountability: A Proposal for CA Issuance Decision Logs

It took us a long time, but objectively, Certificate Transparency is a success. We had to make numerous technology tradeoffs to make it something that the CAs would adopt, some of which introduced problems that took even longer to tackle. However, we managed to create an ecosystem that makes the issuance practices of the web transparent and verifiable.

We need to go further, though. One of the next bits of transparency I would like to see is CAs producing logs of what went into their issuance decisions and making this information public. This would be useful in several scenarios. For example, imagine a domain owner who discovers a certificate was issued for their domain that they didn’t directly request. They could look at this log and see an audit history of all the inputs that went into the decision to issue, such as:

  • When was the request for the certificate received?
  • What is the hash of the associated ACME key that made the request?
  • When CAA was checked, what did it say? Was it checked via multiple perspectives? Did all perspectives agree with the contents?
  • When Domain Control was checked, did it pass? Which methods were used? Were multiple perspectives used? Did all perspectives agree with the contents?
  • What time was the pre-certificate published? What CT logs was it published to?
  • What time was the certificate issued?
  • What time was the certificate picked up?

This is just an example list, but hopefully it gives the idea enough shape for the purpose of this post. The CA could publish this information as files in cheap block storage. I imagine a directory structure something like: “/<CA CERTHASH>/<SUBJECT CERTHASH>/log”
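To make the shape of such a log entry concrete, here is a minimal sketch of one record and the path it might be stored under. The field names, the use of SHA-256 for both certificate hashes, and the JSON encoding are all assumptions made for illustration; the post itself only proposes the kinds of inputs to record and the rough directory layout.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class IssuanceDecisionLog:
    """One hypothetical record of the inputs behind an issuance decision."""
    request_received: str        # timestamp the request arrived
    acme_key_hash: str           # hash of the ACME account key that made the request
    caa_results: List[dict]      # one entry per network perspective
    dcv_method: str              # e.g. "http-01" or "dns-01"
    dcv_results: List[dict]      # one entry per network perspective
    precert_published: str       # timestamp the pre-certificate was logged
    ct_logs: List[str]           # CT logs the pre-certificate was sent to
    issued_at: str               # timestamp the certificate was issued
    picked_up_at: Optional[str] = None  # timestamp the certificate was retrieved


def log_path(ca_cert_der: bytes, subject_cert_der: bytes) -> str:
    """Build the /<CA CERTHASH>/<SUBJECT CERTHASH>/log path (SHA-256 is an assumption)."""
    ca_hash = hashlib.sha256(ca_cert_der).hexdigest()
    subject_hash = hashlib.sha256(subject_cert_der).hexdigest()
    return f"/{ca_hash}/{subject_hash}/log"


def to_json(entry: IssuanceDecisionLog) -> str:
    """Serialize with sorted keys so the same entry always yields the same bytes."""
    return json.dumps(asdict(entry), sort_keys=True)
```

Stable serialization matters here because the bytes of each entry are what would later be hashed into a tamper-evident structure.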

The log itself could be a Merkle tree of these values, and at the root of the directory structure, there could be a Merkle tree of all the associated logs. Though the verifiability would not necessarily be relied upon initially, doing the logging in this fashion would make it possible for these logs to be tamper-evident over time with the addition of monitors.
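As a rough illustration of why this makes the logs tamper-evident, the sketch below computes a Merkle root over a list of serialized entries. The 0x00 and 0x01 prefixes separate leaf hashes from interior hashes in the style Certificate Transparency uses, but this is only a sketch under that assumption and is not claimed to be byte-compatible with any existing log.

```python
import hashlib
from typing import List


def leaf_hash(data: bytes) -> bytes:
    # Prefix leaves with 0x00 so a leaf can never be confused with an interior node.
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # Prefix interior nodes with 0x01.
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(entries: List[bytes]) -> bytes:
    """Return the root hash committing to every entry, in order."""
    if not entries:
        return hashlib.sha256(b"").digest()
    level = [leaf_hash(e) for e in entries]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                next_level.append(node_hash(level[i], level[i + 1]))
            else:
                # An unpaired node is carried up to the next level unchanged.
                next_level.append(level[i])
        level = next_level
    return level[0]
```

A monitor that remembers a previously published root can detect any later change to an entry, because changing a single byte of any entry changes the root.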

The idea is that these logs could be issued asynchronously, signed with software-backed keys, and produced in batches, which would make them very inexpensive to produce. Not only would these logs help the domain owner, but they would also help researchers who try to understand the WebPKI, and ultimately, it could help the root programs better manage the CA ecosystem.

This would go a long way to improving the transparency into CA operations and I hope we see this pattern or something similar to it adopted sooner rather than later.

Strengthening Domain Control Verification: The Role of Multiple Perspectives and Collaboration

The security and stability of encryption on the web rely on robust domain control verification. Certificate Authorities in the WebPKI are increasingly facing attacks that exploit weaknesses in the Border Gateway Protocol (BGP) ecosystem to acquire certificates for domains they do not legitimately control.

While Resource Public Key Infrastructure (RPKI) has the potential to mitigate these threats, several structural barriers have slowed its adoption.

In response, larger, more security-minded CAs have started embracing the concept of Multiple Perspective Domain Control Verification (MPDV) to enhance their defenses. The fundamental idea of MPDV is that before issuing a certificate, the CA will require multiple network perspectives to agree that the domain control verification criteria have been met.
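The agreement step can be sketched as a simple quorum check. The perspective names, the minimum of two perspectives, and the "at most one failure" threshold below are hypothetical values chosen for illustration, not a policy taken from any CA or root program.

```python
from typing import Dict


def quorum_met(results: Dict[str, bool], max_failures: int = 1) -> bool:
    """Decide whether enough network perspectives agree that validation passed.

    `results` maps a perspective name to whether domain control validation
    succeeded from that vantage point. The thresholds are illustrative only.
    """
    if len(results) < 2:
        # A single vantage point is not "multiple perspectives".
        return False
    failures = sum(1 for passed in results.values() if not passed)
    return failures <= max_failures


# Usage: one dissenting perspective is tolerated, two are not.
print(quorum_met({"us-east": True, "eu-west": True, "ap-south": False}))
print(quorum_met({"us-east": True, "eu-west": False, "ap-south": False}))
```

An attacker who hijacks the route seen from one network would then also need to fool the other perspectives, which is the property that makes placing them in different networks important.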

Researchers at Princeton University have played a significant role in this journey in various ways, including raising awareness about the issue, evaluating the effectiveness of different MPDV implementations, and helping determine efficient quorum policies.

This combination has led to Google Chrome signaling an intention to require MPDV from all CAs. This indicates that there is enough data to demonstrate this is both valuable and doable and I agree with this conclusion.

This new requirement will have several consequences, because implementing a competent MPDV solution is more difficult than it appears on the surface. For instance, these network perspectives need to be located in different networks for this to be an effective tool to mitigate these risks. One of the most expensive aspects of operating a transactional service is managing the environment in which the service runs. This means that if CAs distribute the entire MPDV checking process to alternative network perspectives, they will need to manage multiple such environments. The cost and complexity go up as perspectives are added.

This should not be a problem for the largest CAs, and since the top 10 CAs by issuance volume account for 99.58% of all WebPKI certificates, achieving broad coverage of the web only requires a few companies to make these investments and they should be able to assume those costs. But what about the smaller CAs?

These smaller, regional CAs are often focused on language-specific support in the markets they operate in, assisting with local certificate-related product offerings such as document signing or identity certificates, and adhering to regional regulations. These are much smaller markets and leave them with far fewer resources and skills to tackle problems like this. The larger CAs, on the other hand, will end up duplicating much of the same infrastructure as they work toward meeting these requirements.

This suggests there is an opportunity for CAs to collaborate in building a shared network of perspectives. By working together, CAs can pool resources to create a more diverse network of perspectives. This can help them meet the new requirements more efficiently and effectively, while also strengthening the internet’s overall security.

Key Management and preparing for the Crypto Apocalypse

Today, keeping sensitive information secure is more critical than ever. Although I’m not overly concerned about the looming threat of quantum computers breaking cryptography, I do worry about our approach to key management, which significantly impacts how we will ultimately migrate to new algorithms if necessary.

Traditional key management systems are often just simple encrypted key-value stores with access controls that release keys to applications. Although this approach has served us well in the past, moving away from bearer tokens to asymmetric key-based authentication and, ultimately, the era of post-quantum cryptography (PQC) demands a different approach.

Why am I concerned about bearer tokens? Well, the idea of a long-lived value that is passed around, allowing anyone who sees the token to impersonate its subject, is inherently risky. While there are ways to mitigate this risk to some extent, most of the time, these tokens are poorly managed or, worse, never changed at all. It’s much easier to protect a key that no one sees than one that many entities see.

The old key-value approach was designed around this paradigm, and although some systems have crude capabilities that work with asymmetric keys, they leave much to be desired. If we want seamless, downtime-free rollover of cryptographic keys across complex systems, we need a model that keeps keys isolated even from the entities that use them. This change will significantly improve key rollover.

This shift will require protocol-level integration into the key management layer, but once it’s done in a reusable way, keys can be changed regularly. A nice side effect of this transition is that these components will provide the integration points allowing us to move to new algorithms in the future.
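A minimal sketch of that model follows: callers hold only a key name, the key material never leaves the service, and rollover happens behind the interface. The `KeyService` class and its methods are hypothetical, and HMAC from the standard library stands in for the asymmetric signing a real system would use, purely to keep the example self-contained.

```python
import hashlib
import hmac
import secrets
from typing import Dict, List, Tuple


class KeyService:
    """Hypothetical key service: signs on behalf of callers, never releases keys."""

    def __init__(self) -> None:
        # key name -> list of key versions; the list index is the version number
        self._keys: Dict[str, List[bytes]] = {}

    def create(self, name: str) -> None:
        self._keys[name] = [secrets.token_bytes(32)]

    def rotate(self, name: str) -> int:
        """Add a new key version; callers do not need to change anything."""
        self._keys[name].append(secrets.token_bytes(32))
        return len(self._keys[name]) - 1

    def sign(self, name: str, message: bytes) -> Tuple[int, bytes]:
        """Sign with the newest version and report which version was used."""
        version = len(self._keys[name]) - 1
        tag = hmac.new(self._keys[name][version], message, hashlib.sha256).digest()
        return version, tag

    def verify(self, name: str, message: bytes, version: int, tag: bytes) -> bool:
        """Verify against the stated version so older signatures survive a rollover."""
        versions = self._keys[name]
        if not 0 <= version < len(versions):
            return False
        expected = hmac.new(versions[version], message, hashlib.sha256).digest()
        return hmac.compare_digest(expected, tag)
```

Because the application only ever sees a name and a version number, swapping the algorithm behind `sign` later is a change to the service rather than to every caller, which is the integration point the paragraph above describes.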

What does this have to do with PQC? Unlike the shift from DES to AES or RSA to ECC, the post-quantum algorithms available now are substantially larger and slower than their predecessors, meaning the gradual migration from the once state-of-the-art to the new state-of-the-art won’t start until it absolutely has to. Instead, the migration to PQC starts by changing the way we build systems, specifically in how we architect key rollover and the lifecycle of keys. I personally believe the near-term impetus for this change will be the deprecation of bearer tokens.

Seamless and automated rollover of keys is crucial for making systems secure, even if the post-quantum apocalypse never happens.

I also think we will see PQC readiness in credentialing systems. For example, we may see ACME clients support enrolling for PQC certificates simultaneously as they enroll for their ECC certificates, or perhaps support the (more bloated) hybrid certificates.

In conclusion, rethinking our key management approach is increasingly important. So far, I have not seen anyone come to market with what I would call a different approach to key management, and we need one.

The Growing Security Concerns of Modern Firmware and the Need for Change

Today’s firmware is larger and more complex than ever before. In 1981, the IBM PC BIOS was a mere 8 KB, but now UEFI, even without considering machines with BMCs, can be 32 MB or even larger! To illustrate the magnitude of the problem, Intel will soon release its next-generation SoCs with support for 128 MB of firmware!

Essentially, UEFI has become a real-time OS with over 6 million lines of code and is still growing. This is larger than most modern operating systems. Furthermore, the various boot phases and hardware layers create significant complexity for defenders. Increased surface area leads to more vulnerabilities.

The most impactful and difficult-to-patch vulnerabilities reside in the underbelly of technology. Firmware, file systems, BGP, and other foundational aspects of technology that we often take for granted are becoming more vulnerable to attacks. It’s time to prioritize security for the very foundation of our tech. Benjamin Franklin once said, “A failure to plan is a plan to fail.” This adage often applies to long-term vulnerabilities in technology. Insufficient planning can lead to an inability to detect issues, inadequate data to assess their true severity, and a lack of responsiveness.

Firmware serves as a prime example. Many firmware-level issues remain unpatched because firmware often lacks the measurement and patching middleware we expect from software. Moreover, hardware vendors frequently behave as if their job is complete once they release a patch. Imagine if, in 2023, software vendors merely dropped a patched piece of software into a barely discoverable HTTP-accessible folder and proclaimed, “Thank goodness we’ve done our part.” This scenario largely reflects the current state of firmware.

One reason for this situation is that the problem on the surface appears intractable. A typical PC may house dozens of firmware components, with no inventory of what exists. This firmware often originates from multiple vendors and may include outdated chips that have not been updated.

Another fitting saying is, “You can’t manage what you can’t measure.” Combine this with the exponential growth of firmware surface area and the increasing number of internet-connected devices containing firmware, and you have a massive security issue arising from decades of neglect.

There is no silver bullet here. One aspect to address is the way firmware is built. USB Armory aims to solve this by making firmware memory safe, easy to read, and with minimal dependencies. While this is a positive step, it is not sufficient on its own. Binarly.io has created the best automation available for detecting firmware issues automatically, which is invaluable considering that old approaches will persist for decades.

To drive change, we need better measurement and widespread adoption of automatic update mechanisms for firmware of all sizes. These mechanisms must be safe, reliable, and robust. Misaligned incentives contribute to the problem, often resulting from a lack of accountability and transparency. This is why I dedicated as much time as I could to binary.transparency.dev while at Google.

The momentum around software supply chain security is essential, as it sheds some light on the problem, but alone it is not enough to bring about the necessary change. If you create a chip with firmware that has a vulnerability, your responsibility does not end with shipping a patch. If you ship devices without providing a way to seamlessly patch all firmware, you are failing.

Relying on the next hardware refresh cycle to update firmware on your devices in the field is insufficient. With cloud adoption, refresh cycles might even lengthen. A long-term strategy to support your devices is necessary; not doing so externalizes the consequences of your inaction on society.

If you have devices in the field that are in use, and you don’t have a confident inventory of the dependencies that exist in them, and you’re not monitoring those dependencies and the firmware itself for issues, you are part of the problem, externalizing consequences on society.

We can do better.

To improve firmware security, the industry must collaborate and adopt best practices. This includes embracing transparency, robust patch management systems, and long-term support strategies for devices. By working together to address these challenges, we can build a more secure foundation for the technology that underpins our modern world.

In conclusion, it’s crucial that we prioritize firmware security, as it plays a fundamental role in the safety and reliability of our devices and systems. By implementing more effective measurement, automatic update mechanisms, and long-term support strategies, we can reduce the risks associated with outdated and vulnerable firmware. This will help create a safer digital environment for everyone.

P.S. Thanks to @matrosov and @zaolin for their insights on the problem on Twitter.

Global Trend in Tech Regulation and its Consequences for the EU and the US

The United States has long been a leader in technological innovation, with companies such as Google, Apple, Facebook, and Amazon paving the way. As of October 2021, 62% of global tech unicorns have emerged from the US, with China accounting for 19%, while only 12% have come from the EU. One explanation for this delta is the size of the regulatory regime in the EU, which is known to favor larger companies and make it more expensive and harder for small companies to enter the market.

It’s been only 29 years since the EU transitioned from a trading bloc to a union of 27 member states with coordinating policies. Considering the complications that such a transition represents, it’s not surprising that, relative to the US and China, the EU has more bureaucracy. However, things are changing, as there is now an adult generation that has grown up with the EU as part of their national identity. They have also seen the benefits of that partnership manifest for much of their lives. While the member states of the EU will continue to evolve how they work with each other, they have come a long way in terms of coordination and cooperation and have a solid foundation to build upon.

Another argument that I have heard is that the EU’s focus on creating a stable and cooperative union took away from the focus on technological and market growth. That may be true, but over the last decade the EU has focused on regulations intended to create a Digital Single Market, which it hopes will address this problem. During this same period, the US regulatory framework largely stood still, yet the US also experienced the most rapid growth of technology companies of any nation.

It’s worth noting that the EU’s approach to regulation has been very implementation-specific when compared to the U.S. approach to similar regulation, as seen with eIDAS, the EU’s digital signature legislation, and the associated supporting efforts, which choose which technologies must be used. The first version left the topic of interoperability as a footnote and ignored the concept of reciprocity. This essentially created member-state monopolies around the legislation, where country-to-country transactions would still be signed with pen and paper. That did change a few years ago, but better technological approaches to solving the associated problems were established and proven after the initial legislation was proposed two decades ago, and their adoption was held back by this legislation’s technical specificity.

On the other hand, there is a credible argument to be made that the US has failed when it comes to creating meaningful legislation to protect its citizens from the overreach of tech giants and the increasingly online nature of daily life. In fact, many would argue that, at least for the last decade, they have failed when it comes to creating meaningful legislation, period. This failure has created an opportunity for the EU to step up and leave its mark on the global technology industry, which it certainly has been doing.

What is noteworthy here is that many of these regulations are being framed as globally binding. The concept being applied here is called extraterritoriality, which basically means that the jurisdiction of the law extends beyond the physical borders of the country or region that has enacted it. The argument is that by extending the reach of its laws to cover EU citizens wherever they may be, the EU is protecting the rights of its citizens and promoting a level playing field for all companies, regardless of where they are based.

But what is a regulation without teeth? This is why these regulations usually empower the EU member states to impose fines on companies, regardless of where they exist, if the associated regulations are found not to have been met. The trend to leverage extraterritoriality is sure to continue and likely accelerate. In the case of security, one such upcoming regulation from the EU is NIS2, which is looking to regulate incident response handling, again with potential fines for non-compliance.

It’s not to say that all of this EU legislation is bad, though I would argue that the more explicit it is on technical details, the worse it is. For example, consider the EU Cookie legislation–it has resulted in the online equivalent of graffiti all over the internet with no meaningful improvement for users.

As I think about this, one of the things I struggle with is that the geographic nexus of a business is increasingly not relevant to users. In one respect, this is an argument for exactly what the EU is doing. But there are also 195 sovereign nations, each with its own cultural norms and political interests. Can they all regulate online businesses regardless of where they are located? What will the long-term effects of this global regulatory trend be?

Of course, the answer is that only those countries that have enough financial clout can get away with this, but even then, would the EU want its businesses regulated in this way by the US or China? And what do you do when the regulations conflict? Does the future require proof of citizenship before using any online service?

The Importance of Proper Measurement in Enterprise IT Management: Lessons from Cloud Deployments

Peter Drucker once said, “You cannot manage what you cannot measure.” This quote is applicable in many aspects of technology development and business management. Neglecting measurement often leads to problems. Another relevant saying is “Lies, damned lies, and statistics.” The importance of what we measure and how we measure it cannot be overstated when using data to manage complex systems.

In enterprise IT, endpoint management systems are often touted as the source of truth for the environment, but this is misleading. They measure the intended state of the devices, not their actual state. In other words, they track what they believe they have done, rather than whether the change was correctly and successfully deployed. This might seem similar or sufficient, but it is not. Endpoint systems have many different software and settings that can interact in unexpected ways, making the data they provide nearly useless for risk assessment, according to most security professionals.
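The gap between intended and actual state can be made concrete with a small comparison. This is only a sketch: the setting names are made up, and it assumes both the management system's record and an independent observation from the device are available as simple key/value data.

```python
from typing import Any, Dict, Optional, Tuple


def find_drift(
    intended: Dict[str, Any], observed: Dict[str, Any]
) -> Dict[str, Tuple[Optional[Any], Optional[Any]]]:
    """Return every setting whose observed value differs from the intended one.

    Each result maps a setting name to (intended, observed). A value of None
    means that side has no record of the setting at all.
    """
    drift = {}
    for key in sorted(intended.keys() | observed.keys()):
        want = intended.get(key)
        have = observed.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Usage: the management system believes encryption is on; the device disagrees,
# and it is also running something the management system never deployed.
intended = {"disk_encryption": "on", "os_version": "14.2"}
observed = {"disk_encryption": "off", "os_version": "14.2", "unmanaged_agent": "1.0"}
print(find_drift(intended, observed))
```

A report built only from `intended` would show this device as compliant, which is the failure mode described above.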

As for EDR systems, I would argue that they are simply advanced anti-viruses. They try to protect the operating system, which is largely intractable, and give the enterprise visibility into attacks and infections, while providing minimal tools for a response. To be clear, EDR is valuable, but its overhead is high and it is not designed for device intelligence or observability; its purpose is detection and response.

If enterprises had proper investments in observability, they could discover outages before users report them. They could use the intelligence about the state and usage of their devices to proactively defend and isolate their assets, turning what has been a reactive and inaccurate dataset into a tool to protect their networks and improve user productivity.

There is a lot to learn from cloud deployments when looking at how to solve these problems. For example, unlike the assets most IT departments manage, cloud assets are heavily instrumented, with logs being centralized and dashboards reporting real-time uptime and health. There is an entire market of solutions focused on enabling this visibility; just consider how many log aggregation and analytics offerings such as Stackdriver, CloudWatch, and New Relic exist.

Additionally, these assets typically have cross-platform machine identities that are used to facilitate security domain separation, and interestingly these identities are increasingly using hardware-backed keys to secure those credentials. These credentials are also used to help each endpoint in those deployments achieve some basic level of self-protection, where the credentials used by these assets will capture the posture of the machines and the peers they interact with, particularly when crossing security domains and enforcing policy based on these interactions.

I believe that over the next decade, enterprise IT deployments will increasingly resemble cloud deployments, with a focus on zero-trust design patterns. However, for that to happen there are product gaps that will need to be filled. For example, there is no turnkey solution for desktop state observability with structured data that can be used for real-time policy decisions. The big tech companies have built these solutions for their own deployments, but there is nothing turnkey in the market that addresses this need. Then there is the issue of how to authenticate the device and its state. There are lots of pieces in this area but no solutions. If you are going to use the observed device state for policy, you also have to worry about the validity of the associated data. While it is not possible to totally solve this problem in existing systems, there are lots of things that can be done to make the data more reliable.

Finally, as we see this evolution, there is a need to rethink the way enterprises approach identity. Whatever emerges will become an alternative to Active Directory. No security practitioner would design their source of truth in the enterprise in the same way as Active Directory is today.

The Changing Landscape of Internet Protection

The United States government and big companies like Google have both played a significant role as protectors in their respective spheres. The US government, as the world’s leading military power, has tried to serve as a global peacekeeper and defender of democracy through its high military spending. Similarly, Google, as a success in the technology industry, has leveraged its financial power from advertising to fund various initiatives to protect and grow the internet and the web.

However, in recent years, Google has undergone significant changes. The COVID-19 pandemic accelerated the need for companies to adapt and re-envision themselves, and Google was no exception. The company, in a relatively short period, hired many externally sourced professional managers and underwent a series of re-organizations and corresponding cultural changes that have fundamentally changed the company forever. This, combined with recent fiscal irresponsibility in hiring and an inability to drive a cohesive and integrated product strategy (how many messengers does one company need after all?), led to Google’s first-ever layoffs, which included a significant impact on internet investments and, to be transparent, my own job.

This raises important questions about the future of internet protection. While Google was clearly not the exclusive protector of the internet, its direct and indirect investments in the internet have played a significant role in protecting it. It seems unlikely that moving forward, they can be counted on in the same way. We need to see other large organizations whose businesses similarly take a critical dependency on a safe and secure internet step up and fill the gap.

This is problematic for a number of reasons, one of the largest being that public companies are, to a fault, focused on quarter-to-quarter growth. That is of course the right thing for their shareholders, at least in the near term, but the sort of infrastructure investments necessary to secure the Internet can take a decade or more of investment. That’s quite hard for an organization to justify when its most valuable resource is engineers, and those engineers could instead be spent securing and improving its private networks or directly generating revenue.

Many of these organizations already donate to security non-profits such as ISRG and OpenSSF and work through them to make some of these improvements to the Internet, but the funding to these organizations is often piecemeal and usually only aligned with the latest security trends. Furthermore, these investments are often associated with pet projects of the sponsoring companies rather than taking a strategic and proportional investment into different problem areas. A great example of this is how many of the most important open-source projects lack the basic funding necessary to sufficiently secure them long-term.

One approach to mitigating this is to lean on these security non-profits more and give them more autonomy and funding. This would allow them to take on a larger and more independent role via a larger commitment to funding with appropriate multi-year roadmaps, milestones, and success criteria approved by some mix of funding stakeholders, invited experts in the specific areas in question, and key engineering participants from relevant funding organizations. This would create predictable funding and a space where long-term roadmaps for research, securing, and improving the internet can be established without relying on a small number of companies with business models that support direct investment in those activities.

This approach would have its own challenges. For example, like most non-profits, these organizations will struggle to be impact focused rather than focused on their own pet projects or philosophies. That said, this is largely part of the human condition and something that always needs to be managed. The answer to how to manage this is surely buried in what systematic approach is used for choosing initiatives, measuring their success, and having lots of meaningful milestones along the way to enable them to checkpoint, course correct, and have the willpower to kill projects as appropriate.

In summary, the role of a protector comes with great responsibility. The recent changes at Google highlight the need for a diverse set of stakeholders to come together to safeguard the internet and the web, and to depend less on one company carrying too much of the load when it comes to protecting the internet.

The Limitations of Audits: What You Need to Know

In recent years, the SOC2 and ISO 27001 badges have become a staple at the bottom of nearly every SaaS website. This is largely due to the growing number of startups providing solutions for SOC2 and ISO 27001 audits, such as Vanta and Secureframe. These companies greatly reduce the time and money necessary to comply with various audit regimes. The fact that large companies are increasingly incorporating requirements for SOC2 and ISO 27001 audits into their procurement processes also drives the demand for these audits. With that said, it’s important to understand what these badges really mean and what they don’t.

One common misunderstanding is that audits are ongoing assessments of an organization’s security practices. In reality, audits are almost always point-in-time retrospectives, meaning they only reflect what was the case at the time of the audit and not what is currently the case. This is a material point, as many small organizations view the work associated with these audits as an annual tax, rather than integrating the associated practices into how they work.

Another factor to consider is the scope of the audits. The entity being audited gets to choose the scope, and it’s common for startups to exclude their IT environment (such as desktops) from their audits. This means that not all SOC2 audits are created equal, and you may not be getting the full picture.

Beyond that, when audits involve sampling data, it is usually the subject of the audit who chooses the sample. This means that either by accident or on purpose, the analysis may be based on the most favorable data.

There is also the concept of accepted and managed risks. An organization can often get by with poor security practices as long as they acknowledge the risk and have a plan to resolve it within a fixed period of time. This period of time can be continuously extended, allowing the organization to avoid addressing the issue indefinitely.

Then you need to remember that the auditor works for the organization being audited. They want to be hired again next year, so they may be willing to accept the organization’s interpretation of the expectations in order to secure future business. This means that the results of the audit may not be completely impartial.

And finally, there is the question of the qualifications of the auditors conducting these assessments. Often, they are exclusively accountants and not technologists, meaning that they are not equipped to evaluate the technical security or correctness of the systems being audited. Instead, they are essentially tasked with assessing if the checklist represented in the audit regime can be reasonably deemed as met. 

In conclusion, while SOC2 and ISO 27001 audits have good intentions, they are far from sufficient in assessing an organization’s security practices. The fact that the audits are point-in-time retrospectives, the scope of the audit is chosen by the entity being audited, the auditor works for the organization, and the results may be based on favorable data, all contribute to the limitations of these audits. As a result, it’s important to be aware of these limitations and not rely solely on a SOC2 badge as a sign of a secure organization.

To err is human, to forgive is divine

Every service should strive to provide perfect availability, but the reality is that it’s not possible to be perfect. Mistakes happen; it’s how you deal with them that is important.

Successfully dealing with availability issues requires planning, and when dealing with a client-server solution, it requires both parties to make improvements.

In the context of ACME today, the large majority of clients have no failover logic. This means that if enrollment fails due to connectivity issues to the specified CA, the CA has an outage, or the CA suspends operations for one reason or another, the enrollment will fail.

This is a problem for any service protected with TLS, which is basically every service, that wants to be highly available. One of the ways services deal with this is to proactively acquire two certificates for every endpoint. This is viable, but I would argue it is not the right solution.

The right solution is to not rely on a single CA and instead failover between many CAs that are capable of servicing your needs. This way if any single CA fails you are fine, you can just keep chugging along.

That is not sufficient to address this risk though. Not all certificates have the same level of device trust. Sometimes clients make bad assumptions and trust specific CA hierarchies assuming these configurations are static, even when they are not.

To help mitigate this behavior clients should implement a round-robin or random CA selection logic so that subsequent renewals will hit different CAs. This will force clients to make sure they work with any of your chosen CAs. This way you won’t find yourself breaking apps that make those bad assumptions when your CA fails.
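Both strategies can be combined in a few lines. This is a minimal sketch, not how any particular ACME client does it: the directory URLs are placeholders, and `enroll` is a hypothetical callable standing in for whatever a real client does to complete an order against one CA.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


class EnrollmentError(Exception):
    """Raised when every configured CA failed to issue."""


def enroll_with_failover(
    cas: Sequence[str],
    enroll: Callable[[str], str],
    rng: Optional[random.Random] = None,
) -> Tuple[str, str]:
    """Try the configured CAs in random order and return (ca, certificate).

    Shuffling on every call means successive renewals land on different CAs,
    and walking the shuffled list gives failover when one of them is down.
    """
    rng = rng or random.Random()
    order = list(cas)
    rng.shuffle(order)
    errors: List[str] = []
    for ca in order:
        try:
            return ca, enroll(ca)
        except Exception as exc:  # broad on purpose: any failure moves to the next CA
            errors.append(f"{ca}: {exc}")
    raise EnrollmentError("all CAs failed: " + "; ".join(errors))


# Usage with a fake enroll function in which one CA is unreachable.
def fake_enroll(ca: str) -> str:
    if ca == "https://ca-a.example/directory":
        raise ConnectionError("CA unreachable")
    return "certificate from " + ca


print(enroll_with_failover(
    ["https://ca-a.example/directory", "https://ca-b.example/directory"],
    fake_enroll,
))
```

Random selection is used here instead of round-robin because it needs no state between renewals; a round-robin variant would have to remember which CA was used last.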

Caddy Server already implements both of these strategies, as I understand it, but every ACME client should be doing the same thing.

A Boy Scout is always prepared

My father always says it’s not a problem to make a mistake, what is important is how you deal with it. 

The same thing is true when it comes to WebPKI CAs. Broadly, the incident response process used in this ecosystem could be categorized as Blameless Post Mortem. The focus is on what happened, what contributed to it, and what was done to address the issue, not on fault.

A few years ago, a number of large CAs had to do millions of revocations; in all of the cases I am thinking of, the required deadline for those revocations was 5 days. How to revoke that many certificates in that window is not directly obvious, but if you’re a CA that has done any moderate level of planning, it’s something you should be up for.

The thing is that doing so can cause harm, for example, the issue that necessitates the revocation might be incredibly subtle and not security-impacting. Nonetheless, the requirement is what the requirement is — the certificate needs to be revoked.

The question then becomes how can you meet that timeline objective without creating an unnecessary outage for customers? If you defy the rules you risk being distrusted, if you act blindly you could take down your customer’s services.

More specifically, how do you contact millions of customers and give them enough time to replace their certificates without an outage, given these constraints?

Like most scale problems, the answer is automation; in the context of certificate lifecycle management, that means an extension to the ACME protocol. To that end, there is now a draft for something called “ACME Renewal Information” which, when implemented by CAs and ACME clients, will enable a CA to signal that there may be a need to replace a certificate earlier than expected.

The basic idea with this proposal is that the CA will make available hints on when it would like certificates to be updated and the client will periodically check this information and use it to guide its renewal behavior.
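On the client side, using the hint can be sketched as follows. The response shape, a `suggestedWindow` object with `start` and `end` timestamps, reflects my understanding of the draft and should be treated as an assumption; fetching the document from the CA is left out entirely, and the function names are made up for this example.

```python
import random
from datetime import datetime, timedelta
from typing import Optional


def parse_ts(value: str) -> datetime:
    """Parse an RFC 3339 timestamp such as 2025-01-02T04:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def pick_renewal_time(renewal_info: dict, rng: Optional[random.Random] = None) -> datetime:
    """Pick a uniformly random moment inside the CA's suggested renewal window.

    Choosing a random point, instead of the start of the window, spreads clients
    out so they do not all hit the CA at the same instant.
    """
    rng = rng or random.Random()
    window = renewal_info["suggestedWindow"]  # field name assumed from the draft
    start = parse_ts(window["start"])
    end = parse_ts(window["end"])
    if end <= start:
        raise ValueError("suggested window must end after it starts")
    offset_seconds = rng.uniform(0, (end - start).total_seconds())
    return start + timedelta(seconds=offset_seconds)


def should_renew(selected: datetime, now: datetime) -> bool:
    """Renew as soon as the selected time has arrived or already passed."""
    return selected <= now
```

If a CA needs certificates replaced quickly, it can publish a window that is already in the past; a client following this logic would then renew on its next check instead of waiting for its usual schedule.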

To be clear, this is just a hint. A CA might be providing it just to smooth out the load, and there is no mandate to rely on it. With that said, I do hope that all major ACME clients implement this standard and respect it by default because it will make the WebPKI a lot less fragile.