SHA1 and Mercurial security

Why you shouldn't panic yet (Feb 2017).

Contents

The classes of attacks against hash functions
The Google SHA1 announcement
What can you do with a collision attack?
Version control: you already have worse problems!
Conclusion
Future directions

1. The classes of attacks against hash functions

There are two distinct classes of attack against hash functions: collision attacks and preimage attacks.

A collision attack (also known as a birthday attack involves creating two brand new documents with the same hash. An attacker who can do a collision attack can give you one document (like an SSL certificate) and then later swap it for another one. For a N-bit hash, this should require roughly 2**(N/2) hash computations, which should make it impossible to break. But more on that shortly.

The second class of attacks, the preimage attacks, involve taking an existing hash or document and finding a document with the same hash. This is a much harder problem, and the best known attacks still take on the order of 2**N steps.

2. The Google SHA1 announcement

In February 2017, Google announced that it has found the first known SHA1 collision. This took over 6500 years of CPU time (divided across thousands of computers). So we now know it's possible for sophisticated attackers to carry out collision attacks against SHA1

3. What can you do with a collision attack?

The idea with a collision attack is for the attacker to create two documents, one benign and one hostile. If the attacker can get the first one accepted, they may later be able to swap in the second one.

We know from earlier attacks against MD5 that documents like SSL certificates can be subverted in this way. The attacker creates twin certificates for myhost.com and yourbank.com, tweaks bits in them until they collide, and then gets the first one signed by a certificate authority. Now the attacker can take that signature and use it on the second one, which will allow them to man-in-the-middle attack your connection to your bank.

Attacks like this require the document to have insignificant bits that can be tweaked without causing suspicion. More on this later.

Also, remember that the collision attacker cannot take an existing document and change it into a hostile document. That is a preimage attack.

4. Version control: you already have worse problems!

Now we're ready to talk about version control and the threat model therein. Mercurial (and Git) use SHA1 hashes to uniquely identify changesets and documents, so let's imagine what a collision attack would require here.

4.1. What if the attacker controls the repository?

Our attacker would create a pair of documents. Let's say they're executables. The attacker then places the benign version in a public repository on the web. Various people pull the code, run it, and attest that a version with a particular changeset hash is safe. But when you go to pull it, the attacker swaps the executable for the malicious version. Despite checking the changeset, you can still be compromised.

This analysis is all correct as far as it goes.. but is hiding a major problem! And that problem is that the middle step where people check for malicious code is extremely hard. By the same token, creating subtly malicious code (either executable or source) is relatively easy. In fact, it's many orders of magnitude easier for the attacker to just make a secretly malicious executable that is hard to detect than it is to mount a collision attack.

For better or worse, our primary defenses against this today are trust and reputation: a good file signature from SuspiciousCorp is not worth much and if you run code from them, you're going to have a bad day.

4.2. What if the attacker is a project contributor?

Here, an attacker could construct a pair of changes and attempt to get the benign one accepted by the project. After it's accepted, the attacker could then present his malicious one as the canonical change.

There are a couple significant challenges here, though. First, the benign change now lives on the canonical server, so swapping in the malicious change is tricky. Second, the change would have to be in a fairly opaque form like a binary file to allow known collision attacks to work without being very suspicious.

And again, it's still far far easier to simply create hard-to-spot malicious code that passes review. See Apple's notorious goto fail bug for an example of how subtle such code can be,

4.3. What if the attacker breaks into the server of a trusted project?

First, remember the attacker can't simply swap a pre-existing source file with their own hostile version because that's the much harder preimage attack. So our attacker won't be able to sneak a change in without changing some hashes. But here again, the simplest thing to do is just add a new changeset without using a collision attack. If the user isn't diligently checking changeset hashes against an external list of known good hashes (or signed changesets), they will be none the wiser.

If the attacker manages to submit a hash collision, get it signed by a trusted party, and then break into the server and subvert the repository, they will be able to carry out a full collision attack. But hopefully, you can see by now that this approach is far more work than more conventional attacks.

5. Conclusion

The existing conventional attacks against systems like Mercurial (and Git) are already significantly easier than SHA1 collision attacks. This is simply the nature of managing code: it's very difficult to identify malicious code and once you run it, all bets are off.

If you're not already being extremely diligent about vetting your project's contributors and contributions, cryptography will provide very little defense.

6. Future directions

In spite of the larger problems present above, we are still pressing forward on stronger hash functions. In fact, work has been underway for years to retrofit Mercurial's data structures and protocols for SHA1's successors. Storage space was allocated for larger hashes in our revlog structure over 10 years ago in Mercurial 0.9 with the the introduction of RevlogNG. The bundle2 format introduced more recently supports the exchange of different hash types over the network. The only remaining pieces are choice of a replacement function and choosing a backwards-compatibility strategy.

mpm/SHA1