[PATCH RFC] similar: allow similarity detection to use sha256 for digesting file contents
Martin von Zweigbergk
martinvonz at google.com
Wed Mar 1 12:05:39 EST 2017
On Wed, Mar 1, 2017 at 7:02 AM, FUJIWARA Katsunori
<foozy at lares.dti.ne.jp> wrote:
> # HG changeset patch
> # User FUJIWARA Katsunori <foozy at lares.dti.ne.jp>
> # Date 1488380487 -32400
> # Thu Mar 02 00:01:27 2017 +0900
> # Node ID 018d9759cb93f116007d4640341a82db6cf2d45c
> # Parent 0bb3089fe73527c64f1afc40b86ecb8dfe7fd7aa
> similar: allow similarity detection to use sha256 for digesting file contents
>
> Before this patch, similarity detection logic (used for addremove and
> automv) uses SHA-1 digesting. But this cause incorrect rename
> detection, if:
>
> - removing file A and adding file B occur at same committing, and
> - SHA-1 hash values of file A and B are same
If rename detection in this very special case is all that's lost, I
don't think the extra complexity (code and config) is worth it.
>
> This may prevent security experts from managing sample files for
> SHAttered issue in Mercurial repository, for example.
>
> https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html
> https://shattered.it/
>
> Hash collision itself isn't so serious for core repository
> functionality of Mercurial, described by mpm as below, though.
>
> https://www.mercurial-scm.org/wiki/mpm/SHA1
>
> HOW ABOUT:
>
> - which should we use default algorithm SHA-1, SHA-256 or SHA-512 ?
>
> ease (handling problematic files safely by default) or
> performance?
>
> - what name of config knob is reasonable to control digesting algorithm ?
>
> or should we fully switch to more secure digest alrgorithm ?
Yes, I'd rather wait until we do that.
>
> BTW, almost all of SHA-1 clients in Mercurial source tree applies it
> on other than file contents (e.g. list of parent hash ids, file path,
> URL, and so on). But some SHA-1 clients below apply it on file
> contents.
>
> - patch.trydiff()
>
> SHA-1 is applied on "blob SIZE\0" header + file content (only if
> experimental.extendedheader.index is enabled)
>
> - largefiles
>
> The former should be less serious.
>
> On the other hand, the latter causes unintentional unification between
> largefiles, which cause same SHA-1 hash value, at checking files out.
>
> diff --git a/mercurial/similar.py b/mercurial/similar.py
> --- a/mercurial/similar.py
> +++ b/mercurial/similar.py
> @@ -12,9 +12,15 @@ import hashlib
> from .i18n import _
> from . import (
> bdiff,
> + error,
> mdiff,
> )
>
> +DIGESTERS = {
> + 'sha1': hashlib.sha1,
> + 'sha256': hashlib.sha256,
> +}
> +
> def _findexactmatches(repo, added, removed):
> '''find renamed files that have no changes
>
> @@ -23,19 +29,25 @@ def _findexactmatches(repo, added, remov
> '''
> numfiles = len(added) + len(removed)
>
> + digest = repo.ui.config('ui', 'similaritydigest', 'sha256')
> + if digest not in DIGESTERS:
> + raise error.ConfigError(_("ui.similaritydigest value is invalid: %s")
> + % digest)
> + digester = DIGESTERS[digest]
> +
> # Get hashes of removed files.
> hashes = {}
> for i, fctx in enumerate(removed):
> repo.ui.progress(_('searching for exact renames'), i, total=numfiles,
> unit=_('files'))
> - h = hashlib.sha1(fctx.data()).digest()
> + h = digester(fctx.data()).digest()
> hashes[h] = fctx
>
> # For each added file, see if it corresponds to a removed file.
> for i, fctx in enumerate(added):
> repo.ui.progress(_('searching for exact renames'), i + len(removed),
> total=numfiles, unit=_('files'))
> - h = hashlib.sha1(fctx.data()).digest()
> + h = digester(fctx.data()).digest()
> if h in hashes:
> yield (hashes[h], fctx)
>
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel at mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
More information about the Mercurial-devel
mailing list