[PATCH RFC] similar: allow similarity detection to use sha256 for digesting file contents

Wed Mar 1 12:05:39 EST 2017

On Wed, Mar 1, 2017 at 7:02 AM, FUJIWARA Katsunori
<foozy at lares.dti.ne.jp> wrote:
> # HG changeset patch
> # User FUJIWARA Katsunori <foozy at lares.dti.ne.jp>
> # Date 1488380487 -32400
> #      Thu Mar 02 00:01:27 2017 +0900
> # Node ID 018d9759cb93f116007d4640341a82db6cf2d45c
> # Parent  0bb3089fe73527c64f1afc40b86ecb8dfe7fd7aa
> similar: allow similarity detection to use sha256 for digesting file contents
>
> Before this patch, similarity detection logic (used for addremove and
> automv) uses SHA-1 digesting. But this cause incorrect rename
> detection, if:
>
>   - removing file A and adding file B occur at same committing, and
>   - SHA-1 hash values of file A and B are same

If rename detection in this very special case is all that's lost, I
don't think the extra complexity (code and config) is worth it.

>
> This may prevent security experts from managing sample files for
> SHAttered issue in Mercurial repository, for example.
>
>   https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html
>   https://shattered.it/
>
> Hash collision itself isn't so serious for core repository
> functionality of Mercurial, described by mpm as below, though.
>
>   https://www.mercurial-scm.org/wiki/mpm/SHA1
>
> HOW ABOUT:
>
>   - which should we use default algorithm SHA-1, SHA-256 or SHA-512 ?
>
>     ease (handling problematic files safely by default) or
>     performance?
>
>   - what name of config knob is reasonable to control digesting algorithm ?
>
>     or should we fully switch to more secure digest alrgorithm ?

Yes, I'd rather wait until we do that.

>
> BTW, almost all of SHA-1 clients in Mercurial source tree applies it
> on other than file contents (e.g. list of parent hash ids, file path,
> URL, and so on). But some SHA-1 clients below apply it on file
> contents.
>
>   - patch.trydiff()
>
>     SHA-1 is applied on "blob SIZE\0" header + file content (only if
>     experimental.extendedheader.index is enabled)
>
>   - largefiles
>
> The former should be less serious.
>
> On the other hand, the latter causes unintentional unification between
> largefiles, which cause same SHA-1 hash value, at checking files out.
>
> diff --git a/mercurial/similar.py b/mercurial/similar.py
> --- a/mercurial/similar.py
> +++ b/mercurial/similar.py
> @@ -12,9 +12,15 @@ import hashlib
>  from .i18n import _
>  from . import (
>      bdiff,
> +    error,
>      mdiff,
>  )
>
> +DIGESTERS = {
> +    'sha1': hashlib.sha1,
> +    'sha256': hashlib.sha256,
> +}
> +
>  def _findexactmatches(repo, added, removed):
>      '''find renamed files that have no changes
>
> @@ -23,19 +29,25 @@ def _findexactmatches(repo, added, remov
>      '''
>      numfiles = len(added) + len(removed)
>
> +    digest = repo.ui.config('ui', 'similaritydigest', 'sha256')
> +    if digest not in DIGESTERS:
> +        raise error.ConfigError(_("ui.similaritydigest value is invalid: %s")
> +                                % digest)
> +    digester = DIGESTERS[digest]
> +
>      # Get hashes of removed files.
>      hashes = {}
>      for i, fctx in enumerate(removed):
>          repo.ui.progress(_('searching for exact renames'), i, total=numfiles,
>                           unit=_('files'))
> -        h = hashlib.sha1(fctx.data()).digest()
> +        h = digester(fctx.data()).digest()
>          hashes[h] = fctx
>
>      # For each added file, see if it corresponds to a removed file.
>      for i, fctx in enumerate(added):
>          repo.ui.progress(_('searching for exact renames'), i + len(removed),
>                  total=numfiles, unit=_('files'))
> -        h = hashlib.sha1(fctx.data()).digest()
> +        h = digester(fctx.data()).digest()
>          if h in hashes:
>              yield (hashes[h], fctx)
>
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel at mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel