[PATCH RFC] similar: allow similarity detection to use sha256 for digesting file contents
FUJIWARA Katsunori
foozy at lares.dti.ne.jp
Wed Mar 1 15:02:09 UTC 2017
# HG changeset patch
# User FUJIWARA Katsunori <foozy at lares.dti.ne.jp>
# Date 1488380487 -32400
# Thu Mar 02 00:01:27 2017 +0900
# Node ID 018d9759cb93f116007d4640341a82db6cf2d45c
# Parent 0bb3089fe73527c64f1afc40b86ecb8dfe7fd7aa
similar: allow similarity detection to use sha256 for digesting file contents
Before this patch, similarity detection logic (used for addremove and
automv) uses SHA-1 digesting. But this cause incorrect rename
detection, if:
- removing file A and adding file B occur at same committing, and
- SHA-1 hash values of file A and B are same
This may prevent security experts from managing sample files for
SHAttered issue in Mercurial repository, for example.
https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html
https://shattered.it/
Hash collision itself isn't so serious for core repository
functionality of Mercurial, described by mpm as below, though.
https://www.mercurial-scm.org/wiki/mpm/SHA1
HOW ABOUT:
- which should we use default algorithm SHA-1, SHA-256 or SHA-512 ?
ease (handling problematic files safely by default) or
performance?
- what name of config knob is reasonable to control digesting algorithm ?
or should we fully switch to more secure digest alrgorithm ?
BTW, almost all of SHA-1 clients in Mercurial source tree applies it
on other than file contents (e.g. list of parent hash ids, file path,
URL, and so on). But some SHA-1 clients below apply it on file
contents.
- patch.trydiff()
SHA-1 is applied on "blob SIZE\0" header + file content (only if
experimental.extendedheader.index is enabled)
- largefiles
The former should be less serious.
On the other hand, the latter causes unintentional unification between
largefiles, which cause same SHA-1 hash value, at checking files out.
diff --git a/mercurial/similar.py b/mercurial/similar.py
--- a/mercurial/similar.py
+++ b/mercurial/similar.py
@@ -12,9 +12,15 @@ import hashlib
from .i18n import _
from . import (
bdiff,
+ error,
mdiff,
)
+DIGESTERS = {
+ 'sha1': hashlib.sha1,
+ 'sha256': hashlib.sha256,
+}
+
def _findexactmatches(repo, added, removed):
'''find renamed files that have no changes
@@ -23,19 +29,25 @@ def _findexactmatches(repo, added, remov
'''
numfiles = len(added) + len(removed)
+ digest = repo.ui.config('ui', 'similaritydigest', 'sha256')
+ if digest not in DIGESTERS:
+ raise error.ConfigError(_("ui.similaritydigest value is invalid: %s")
+ % digest)
+ digester = DIGESTERS[digest]
+
# Get hashes of removed files.
hashes = {}
for i, fctx in enumerate(removed):
repo.ui.progress(_('searching for exact renames'), i, total=numfiles,
unit=_('files'))
- h = hashlib.sha1(fctx.data()).digest()
+ h = digester(fctx.data()).digest()
hashes[h] = fctx
# For each added file, see if it corresponds to a removed file.
for i, fctx in enumerate(added):
repo.ui.progress(_('searching for exact renames'), i + len(removed),
total=numfiles, unit=_('files'))
- h = hashlib.sha1(fctx.data()).digest()
+ h = digester(fctx.data()).digest()
if h in hashes:
yield (hashes[h], fctx)
More information about the Mercurial-devel
mailing list