[PATCH RFC] similar: allow similarity detection to use sha256 for digesting file contents

Wed Mar 1 19:34:43 EST 2017

On Wed, Mar 1, 2017 at 7:02 AM, FUJIWARA Katsunori <foozy at lares.dti.ne.jp>
wrote:

> # HG changeset patch
> # User FUJIWARA Katsunori <foozy at lares.dti.ne.jp>
> # Date 1488380487 -32400
> #      Thu Mar 02 00:01:27 2017 +0900
> # Node ID 018d9759cb93f116007d4640341a82db6cf2d45c
> # Parent  0bb3089fe73527c64f1afc40b86ecb8dfe7fd7aa
> similar: allow similarity detection to use sha256 for digesting file
> contents
>
> Before this patch, similarity detection logic (used for addremove and
> automv) uses SHA-1 digesting. But this cause incorrect rename
> detection, if:
>
>   - removing file A and adding file B occur at same committing, and
>   - SHA-1 hash values of file A and B are same
>
> This may prevent security experts from managing sample files for
> SHAttered issue in Mercurial repository, for example.
>
>   https://security.googleblog.com/2017/02/announcing-first-
> sha1-collision.html
>   https://shattered.it/
>
> Hash collision itself isn't so serious for core repository
> functionality of Mercurial, described by mpm as below, though.
>
>   https://www.mercurial-scm.org/wiki/mpm/SHA1
>
> HOW ABOUT:
>
>   - which should we use default algorithm SHA-1, SHA-256 or SHA-512 ?
>

SHA-512 should be faster than SHA-256 on 64-bit hardware. So, there's
likely no good reason to use SHA-256 for simple identity checks.

>
>     ease (handling problematic files safely by default) or
>     performance?
>

On my Skylake at 4.0 GHz, SHA-1 is capable of running at ~975 MB/s and
SHA-512 at ~700 MB/s. Both are fast enough that for simple one-time content
identity checks, hashing shouldn't be a bottleneck, at least not for most
repos.

So I think it is fine to change this function from SHA-1 to SHA-512
assuming the hashes don't "leak" into storage. If they end up being stored
or used for something other than identity checks, then we need to bloat
scope to discuss our general hashing future. And that needs its own thread
;)

>
>   - what name of config knob is reasonable to control digesting algorithm ?
>
>     or should we fully switch to more secure digest alrgorithm ?
>
> BTW, almost all of SHA-1 clients in Mercurial source tree applies it
> on other than file contents (e.g. list of parent hash ids, file path,
> URL, and so on). But some SHA-1 clients below apply it on file
> contents.
>
>   - patch.trydiff()
>
>     SHA-1 is applied on "blob SIZE\0" header + file content (only if
>     experimental.extendedheader.index is enabled)
>
>   - largefiles
>
> The former should be less serious.
>
> On the other hand, the latter causes unintentional unification between
> largefiles, which cause same SHA-1 hash value, at checking files out.
>

>
> diff --git a/mercurial/similar.py b/mercurial/similar.py
> --- a/mercurial/similar.py
> +++ b/mercurial/similar.py
> @@ -12,9 +12,15 @@ import hashlib
>  from .i18n import _
>  from . import (
>      bdiff,
> +    error,
>      mdiff,
>  )
>
> +DIGESTERS = {
> +    'sha1': hashlib.sha1,
> +    'sha256': hashlib.sha256,
> +}
> +
>  def _findexactmatches(repo, added, removed):
>      '''find renamed files that have no changes
>
> @@ -23,19 +29,25 @@ def _findexactmatches(repo, added, remov
>      '''
>      numfiles = len(added) + len(removed)
>
> +    digest = repo.ui.config('ui', 'similaritydigest', 'sha256')
> +    if digest not in DIGESTERS:
> +        raise error.ConfigError(_("ui.similaritydigest value is invalid:
> %s")
> +                                % digest)
> +    digester = DIGESTERS[digest]
> +
>      # Get hashes of removed files.
>      hashes = {}
>      for i, fctx in enumerate(removed):
>          repo.ui.progress(_('searching for exact renames'), i,
> total=numfiles,
>                           unit=_('files'))
> -        h = hashlib.sha1(fctx.data()).digest()
> +        h = digester(fctx.data()).digest()
>          hashes[h] = fctx
>
>      # For each added file, see if it corresponds to a removed file.
>      for i, fctx in enumerate(added):
>          repo.ui.progress(_('searching for exact renames'), i +
> len(removed),
>                  total=numfiles, unit=_('files'))
> -        h = hashlib.sha1(fctx.data()).digest()
> +        h = digester(fctx.data()).digest()
>          if h in hashes:
>              yield (hashes[h], fctx)
>
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel at mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.mercurial-scm.org/pipermail/mercurial-devel/attachments/20170301/2cc20c2a/attachment.html>