SHA1TransitionPlan

SHA-1 is cryptographically weakened. Mercurial needs to switch to a strong hash function.

Goals

New hash algorithm should be cryptographically secure.
New hash algorithm should be fast, if possible (SHA-1 hashing is already a bottleneck in some operations).
Mercurial should support N hash algorithms without requiring invasive changes to storage data structures, wire protocol communication, etc. (This is because whatever we replace SHA-1 with will presumably be broken in several years anyway, and we shouldn't need to retool everything to roll out a new hash algorithm.)
The transition plan will be up to the repository owner, not a strict requirement for a specific version of Mercurial.
Repos and servers will be able to have a flag day after which all new commits use a specific hash.

Non-Goals

Commit signing implications. Commit signing and cryptographic chain of custody are an independent topic (though related to repo security). See CommitSigningPlan for more.

Goals Not Yet Classified

Do we support a repo owner deciding to rehash to a new algorithm? If so, how do we allow old hashes to be used for lookups (e.g. hgweb links that use old hashes must keep working)? Also, how do we mitigate downgrade attacks in this scenario?

Selection of a Hash Algorithm

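Mostly TODO. BLAKE2b with a 30- or 31-byte digest currently has the inside track. Those sizes leave 1 or 2 bytes of the existing 32-byte revlog hash field free to record the hash type (see the storage section below).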

Storage / Requirements Changes

A new repository requirement will need to be created to specify support for non-SHA-1 hashes.

There may need to be a repository requirement to specify the *primary* hash for new commits.

Revlogs already support 32 bytes for hash storage but only use 20 bytes for SHA-1. Assuming we use the existing revlog for storage, we'll reserve 1 or 2 bytes in the hash field to record the hash type then use the remaining bytes for hash storage. This allows multiple hash formats to be stored in the hash entry.
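
As a rough illustration of the tagged-field idea above, here is a minimal Python sketch that reserves 1 byte for the hash type and stores the digest in the rest of the 32-byte field. The tag values, the names `pack_node` and `unpack_node`, and the choice of a leading (rather than trailing) type byte are all assumptions made for this example; the plan does not specify them. The sketch also does not address how existing untagged SHA-1 entries would be told apart from tagged ones.

```python
import hashlib

FIELD_SIZE = 32  # bytes available for the hash in the existing revlog entry

# Hypothetical type tags; the plan does not assign real values.
HASH_SHA1 = 0x00
HASH_BLAKE2B_31 = 0x01

# Digest length in bytes for each known hash type.
DIGEST_SIZES = {HASH_SHA1: 20, HASH_BLAKE2B_31: 31}


def pack_node(hash_type, digest):
    """Return a 32-byte field: 1 type byte, the digest, then zero padding."""
    if len(digest) != DIGEST_SIZES[hash_type]:
        raise ValueError("wrong digest length for hash type %d" % hash_type)
    return (bytes([hash_type]) + digest).ljust(FIELD_SIZE, b"\0")


def unpack_node(field):
    """Split a 32-byte field back into (hash_type, digest)."""
    if len(field) != FIELD_SIZE:
        raise ValueError("hash field must be %d bytes" % FIELD_SIZE)
    hash_type = field[0]
    if hash_type not in DIGEST_SIZES:
        raise ValueError("unknown hash type %d" % hash_type)
    return hash_type, field[1:1 + DIGEST_SIZES[hash_type]]


# A 31-byte BLAKE2b digest fills the field exactly once the tag is added.
digest = hashlib.blake2b(b"example revision text", digest_size=31).digest()
field = pack_node(HASH_BLAKE2B_31, digest)
assert unpack_node(field) == (HASH_BLAKE2B_31, digest)
```

With a 1-byte tag, a 31-byte digest is the largest that fits; reserving 2 bytes would cap the digest at 30 bytes, which matches the sizes mentioned in the hash selection section.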

Future: in the next revlog design, the hash field should be variable width per revlog. This will allow using full 32-byte hashes and allow >32-byte hashes in the future. The revlog/store will need to be rewritten/upgraded to support wider hashes, but this one-time operation is acceptable because hash transitions should be rare.

Future: consider something like https://github.com/multiformats/multihash for declaring which hash is used. This will likely require a new revlog with >32 bytes for hash storage.
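
To show why a multihash-style identifier would not fit in the current field, here is a small Python sketch of the multihash layout (an unsigned varint function code, an unsigned varint digest length, then the digest). The helper names are made up for this example, and only the two function codes used here (0x11 for SHA-1, 0x12 for SHA2-256) are taken from the multihash table; nothing here is an existing Mercurial API.

```python
import hashlib

# Function codes from the multihash table.
SHA1_CODE = 0x11
SHA2_256_CODE = 0x12


def encode_varint(n):
    """Encode a non-negative integer as an unsigned LEB128-style varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, offset=0):
    """Decode a varint starting at offset; return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


def multihash_encode(code, digest):
    """Prefix a digest with its function code and length."""
    return encode_varint(code) + encode_varint(len(digest)) + digest


def multihash_decode(data):
    """Return (code, digest) from a multihash byte string."""
    code, pos = decode_varint(data)
    length, pos = decode_varint(data, pos)
    digest = data[pos:pos + length]
    if len(digest) != length:
        raise ValueError("truncated multihash")
    return code, digest


mh = multihash_encode(SHA2_256_CODE, hashlib.sha256(b"example").digest())
assert len(mh) == 34  # 2 bytes of prefix plus a 32-byte digest
```

A full 256-bit digest therefore needs at least 34 bytes in this encoding, which is why the note above expects a new revlog with more than 32 bytes of hash storage.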

Wire Protocol Transition

Capabilities negotiation will need to exchange information about which hashes each side supports.

Servers that have transitioned to a new hash will need to reject clients not supporting that hash and tell them to upgrade. The rejection should ideally be fast. This may be difficult in some cases because clients don't expose their features until bundle request time. We may have to error during discovery when SHA-1 hashes are used to request data stored under <HGHASH>.
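
One possible shape for this negotiation is sketched below in Python. Everything named here is hypothetical: the `hashes=` capability token, the hash names, and the functions do not exist in Mercurial today, and the sketch assumes a peer that advertises nothing supports only SHA-1. It is meant only to show the fast-rejection logic, not how it would be wired into discovery or bundle requests.

```python
class UnsupportedHashError(Exception):
    """Raised when a peer cannot handle any hash the repository uses."""


def parse_hash_capability(capabilities):
    """Extract advertised hash names from a space-separated capability string.

    Looks for a hypothetical token of the form "hashes=name1,name2".
    Peers without that token are assumed to support only SHA-1.
    """
    for token in capabilities.split():
        name, sep, value = token.partition("=")
        if name == "hashes" and sep:
            return set(value.split(","))
    return {"sha1"}


def negotiate_hash(server_hashes, client_hashes):
    """Pick the first server-preferred hash that the client also supports."""
    for name in server_hashes:
        if name in client_hashes:
            return name
    raise UnsupportedHashError(
        "server requires one of %s; please upgrade your client"
        % ", ".join(server_hashes)
    )


# A server that has fully transitioned lists only the new hash.
client = parse_hash_capability("lookup branchmap hashes=sha1,blake2b")
assert negotiate_hash(["blake2b"], client) == "blake2b"
```

An old client that sends no `hashes=` token would fall back to `{"sha1"}` and be rejected by a transitioned server as soon as its capabilities are seen, which is the fast failure the paragraph above asks for. It does not solve the case where client features are only known at bundle request time.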

TODO: audit the wire protocol and figure out how to do this.

Feedback from Git People

> * Did you encounter any unexpected issues that you wished you had thought
> about beforehand?

The main issues in the Git codebase were some coding practices which
didn't anticipate changing the hash function.  For example, there were a
lot of "unsigned char sha1[20]" declarations in the code, as well as
magic numbers like 48 ("shallow " plus a hex SHA-1 value), which all had
to be identified and converted.

There was also some reticence at first on the part of the community.
People didn't think it was that important, so I started by introducing a
set of #define constants and a structure for object IDs and pitched it
as a code cleanup with the vague possibility of a hash function
transition in the future.

I often had multiple series of work that hadn't been sent upstream and
found that other topics had conflicted with my changes.  I probably
should have been better about sending out a lot of these patches sooner,
which would have decreased the number of conflicts.

There are also people who expected us to have completed this work
already and who questioned the decisions we have made, including why we
did not pick their preferred hash algorithm.  This being the Internet,
this is not entirely unexpected, but it is something to be aware of.  I
recommend easily accessible pointers to documentation you can provide.

> * How much time did you spend on that sha256 conversion already, and how
> much more do you expect to spend?

I've sent 17 sets of patches that converted all the uses of "unsigned
char sha1[20]" into a C structure (so we could extend it in the future),
there are 9 sets of patches which update the testsuite to make it work
with SHA-256, and then three sets of patches that actually implement
SHA-256, and that's just to get us to the point where a repository can
be either entirely SHA-1 or entirely SHA-256.  Interoperability and
transition (storing in SHA-256 but allowing input or output in SHA-1)
will require more patches, most of which haven't yet been written.

I can't estimate how many hours I've spent on this, but it started in
2015 and has been going on during my free time for years.  If you
consider that there are about 20-30 patches in each set of patches, then
that gives you a rough idea of the scope.  I anticipate writing at least
ten more series of patches before the entire thing is done.  This is our
equivalent of your Python 3 work.

If y'all already have a structure or data type for the hash, or some
sort of abstraction for it, then I expect you'll spend a lot less time,
especially since Python (and now Rust, AIUI) are a little more object
oriented.  I highly recommend starting there with some abstractions,
switching everything to use them, and then seeing what works and
doesn't.  If your test suite has any hard-coded hash values, prepare to
spend a good amount of time fixing assumptions there.

> * Do you have any advice for other people trying the same endeavor in
> Mercurial?

It's been my view that moving away from SHA-1 is essential to the
viability of Git as a project.  If you can't store arbitrary data in
your repository, you're going to have a problem, and any signatures you
make are going to be meaningless if the hash is weak.  So my suggestion
is to consider it as important, reasonably urgent work, not to the point
of panicking, but something to prioritize.

I also think it's helpful to have a plan.  We have a transition plan
and added documentation (in
Documentation/technical/hash-function-transition.txt) and are
implementing it reasonably well.  Some things haven't gone exactly
according to the plan, but it's helpful that everyone is on the same
page.  We also planned for interoperability between the old and new so
people can switch over one repo at a time, which I think is enormously
important (but is going to be a lot of work).

My approach after making all of the struct object_id conversions was to
compile a binary that switched the hash wholesale (without any config
options) and then find what broke.  I fixed the most basic things that
prevented repository creation from working and then went from there,
fixing tests as I went.  I also made our testsuite care less about hash
values by computing them in a lot of tests, since tests about, say, the
diff format care about the format, not the specific values involved.

Of course, there may be other approaches that work as well, but that one
worked for me.

> * What motivated the choice of sha256 as a replacement? Have other hash
> functions been considered? And if so, what made you discard them?

When I started the work, I started with BLAKE2b-256.  I wanted a 256-bit
hash because it fits on an 80-column terminal.  I started with BLAKE2b
because it's fast, and I wanted to give people a reason to switch.  A
lot of people don't know or care about why SHA-1 is weak, and saying,
"You should switch because it's much faster _and_ more secure," is a
compelling argument.

We discussed several alternatives: BLAKE2b-256, SHA-256, SHA3-256,
SHAKE256, SHA-512/256, K12 (a Keccak-based hash), and others.  We
settled on SHA-256 because it's ubiquitous and we depend on platform
crypto libraries for fast implementations.  Windows and macOS have a
tiny number of hash algorithms implemented, and SHA-256 is really the
only 256-bit option.

The fact that it is vulnerable to length-extension attacks is irrelevant
to us because we hash the type and length as a prefix to the object, so
we aren't vulnerable to it.  SHA-256 also has hardware acceleration on
newer Intel and AMD processors, as well as on ARM, which was a
compelling reason.

My advice is to pick a SHA-2 or SHA-3 algorithm (including SHAKE256)
and, if your object format is not immune to length-extension attacks, to
not pick SHA-256.  The reason is that you have government agencies and
contractors (all over the world) who are legally required to pick and
use only approved algorithms, and you don't want people to not pick
Mercurial because of some silly policy reason.  I love BLAKE2b, and I
certainly don't love those policies, but that's the world we live in.
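
For reference, the type-and-length prefix mentioned in the reply can be sketched in a few lines of Python. This mirrors Git's object header (`<type> <size>\0` followed by the content), not Mercurial's node hashing; the function name `git_object_id` is made up for this example. It also illustrates the testing advice above: expected IDs are computed for whichever algorithm is in use rather than hard-coded.

```python
import hashlib


def git_object_id(obj_type, content, algo="sha1"):
    """Hash content the way Git does: header with type and length, then data."""
    header = b"%s %d\0" % (obj_type, len(content))
    return hashlib.new(algo, header + content).hexdigest()


# The same content gets a different ID under each algorithm, so tests that
# care about format rather than values should compute the ID they expect.
for algo in ("sha1", "sha256"):
    expected = git_object_id(b"blob", b"hello\n", algo)
    assert expected == git_object_id(b"blob", b"hello\n", algo)
    print(algo, expected)
```

Because the length is fixed in the header before the content, appending data to an object changes the header as well, which is why the length-extension property of SHA-256 does not matter for this format.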

Related work

Git's migration plan
Fossil's approach
