[PATCH RFC] similar: allow similarity detection to use sha256 for digesting file contents

Wed Mar 1 19:46:07 EST 2017

On Wed, Mar 01, 2017 at 04:34:43PM -0800, Gregory Szorc wrote:
> On Wed, Mar 1, 2017 at 7:02 AM, FUJIWARA Katsunori <foozy at lares.dti.ne.jp>
> wrote:
> 
> > # HG changeset patch
> > # User FUJIWARA Katsunori <foozy at lares.dti.ne.jp>
> > # Date 1488380487 -32400
> > #      Thu Mar 02 00:01:27 2017 +0900
> > # Node ID 018d9759cb93f116007d4640341a82db6cf2d45c
> > # Parent  0bb3089fe73527c64f1afc40b86ecb8dfe7fd7aa
> > similar: allow similarity detection to use sha256 for digesting file
> > contents
> >
> > Before this patch, similarity detection logic (used for addremove and
> > automv) uses SHA-1 digesting. But this cause incorrect rename
> > detection, if:
> >
> >   - removing file A and adding file B occur at same committing, and
> >   - SHA-1 hash values of file A and B are same
> >
> > This may prevent security experts from managing sample files for
> > SHAttered issue in Mercurial repository, for example.
> >
> >   https://security.googleblog.com/2017/02/announcing-first-
> > sha1-collision.html
> >   https://shattered.it/
> >
> > Hash collision itself isn't so serious for core repository
> > functionality of Mercurial, described by mpm as below, though.
> >
> >   https://www.mercurial-scm.org/wiki/mpm/SHA1
> >
> > HOW ABOUT:
> >
> >   - which should we use default algorithm SHA-1, SHA-256 or SHA-512 ?
> >
> 
> SHA-512 should be faster than SHA-256 on 64-bit hardware. So, there's
> likely no good reason to use SHA-256 for simple identity checks.
> 
> 
> >
> >     ease (handling problematic files safely by default) or
> >     performance?
> >
> 
> 
> On my Skylake at 4.0 GHz, SHA-1 is capable of running at ~975 MB/s and
> SHA-512 at ~700 MB/s. Both are fast enough that for simple one-time content
> identity checks, hashing shouldn't be a bottleneck, at least not for most
> repos.
> 
> So I think it is fine to change this function from SHA-1 to SHA-512
> assuming the hashes don't "leak" into storage. If they end up being stored
> or used for something other than identity checks, then we need to bloat
> scope to discuss our general hashing future. And that needs its own thread
> ;)

With hashing, there is *always* the risk of collision. It might be tiny,
but it still exists. Why not just compare the contents when the hash
match? Then it doesn't really matter what the hash is. The hash is just
a shortcut to avoid comparing full contents in a O(n^2) fashion.

There aren't going to be that many hash matches anyways, comparing the
content then should not make a significant difference in speed, but
would guarantee that the "similarity" is real.

(BTW, interestingly, in terms of similarity detection, while the
SHAttered PDFs are not 100% identical, they are 80%+ similar)

Mike