[PATCH RFC] similar: allow similarity detection to use sha256 for digesting file contents

Wed Mar 1 15:02:09 UTC 2017

# HG changeset patch
# User FUJIWARA Katsunori <foozy at lares.dti.ne.jp>
# Date 1488380487 -32400
#      Thu Mar 02 00:01:27 2017 +0900
# Node ID 018d9759cb93f116007d4640341a82db6cf2d45c
# Parent  0bb3089fe73527c64f1afc40b86ecb8dfe7fd7aa
similar: allow similarity detection to use sha256 for digesting file contents

Before this patch, similarity detection logic (used for addremove and
automv) uses SHA-1 digesting. But this cause incorrect rename
detection, if:

  - removing file A and adding file B occur at same committing, and
  - SHA-1 hash values of file A and B are same

This may prevent security experts from managing sample files for
SHAttered issue in Mercurial repository, for example.

  https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html
  https://shattered.it/

Hash collision itself isn't so serious for core repository
functionality of Mercurial, described by mpm as below, though.

  https://www.mercurial-scm.org/wiki/mpm/SHA1

HOW ABOUT:

  - which should we use default algorithm SHA-1, SHA-256 or SHA-512 ?

    ease (handling problematic files safely by default) or
    performance?

  - what name of config knob is reasonable to control digesting algorithm ?

    or should we fully switch to more secure digest alrgorithm ?

BTW, almost all of SHA-1 clients in Mercurial source tree applies it
on other than file contents (e.g. list of parent hash ids, file path,
URL, and so on). But some SHA-1 clients below apply it on file
contents.

  - patch.trydiff()

    SHA-1 is applied on "blob SIZE\0" header + file content (only if
    experimental.extendedheader.index is enabled)

  - largefiles

The former should be less serious.

On the other hand, the latter causes unintentional unification between
largefiles, which cause same SHA-1 hash value, at checking files out.

diff --git a/mercurial/similar.py b/mercurial/similar.py
--- a/mercurial/similar.py
+++ b/mercurial/similar.py
@@ -12,9 +12,15 @@ import hashlib
 from .i18n import _
 from . import (
     bdiff,
+    error,
     mdiff,
 )
 
+DIGESTERS = {
+    'sha1': hashlib.sha1,
+    'sha256': hashlib.sha256,
+}
+
 def _findexactmatches(repo, added, removed):
     '''find renamed files that have no changes
 
@@ -23,19 +29,25 @@ def _findexactmatches(repo, added, remov
     '''
     numfiles = len(added) + len(removed)
 
+    digest = repo.ui.config('ui', 'similaritydigest', 'sha256')
+    if digest not in DIGESTERS:
+        raise error.ConfigError(_("ui.similaritydigest value is invalid: %s")
+                                % digest)
+    digester = DIGESTERS[digest]
+
     # Get hashes of removed files.
     hashes = {}
     for i, fctx in enumerate(removed):
         repo.ui.progress(_('searching for exact renames'), i, total=numfiles,
                          unit=_('files'))
-        h = hashlib.sha1(fctx.data()).digest()
+        h = digester(fctx.data()).digest()
         hashes[h] = fctx
 
     # For each added file, see if it corresponds to a removed file.
     for i, fctx in enumerate(added):
         repo.ui.progress(_('searching for exact renames'), i + len(removed),
                 total=numfiles, unit=_('files'))
-        h = hashlib.sha1(fctx.data()).digest()
+        h = digester(fctx.data()).digest()
         if h in hashes:
             yield (hashes[h], fctx)