[PATCH V2] similar: compare between actual file contents for exact identity

Thu Mar 2 18:02:27 UTC 2017

# HG changeset patch
# User FUJIWARA Katsunori <foozy at lares.dti.ne.jp>
# Date 1488477426 -32400
#      Fri Mar 03 02:57:06 2017 +0900
# Node ID d7d47f54019fa900968245163e67ca6f02378995
# Parent  0bb3089fe73527c64f1afc40b86ecb8dfe7fd7aa
similar: compare between actual file contents for exact identity

Before this patch, the similarity detection logic (for addremove and
automv) relies entirely on SHA-1 digests when searching for exact
renames. This causes incorrect rename detection if:

  - file A is removed and file B is added in the same commit, and
  - the SHA-1 hash values of files A and B are the same

For example, this may prevent security experts from managing sample
files for the SHAttered issue in a Mercurial repository.

  https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html
  https://shattered.it/

A hash collision itself isn't so serious for the core repository
functionality of Mercurial, though, as described by mpm at the page
below.

  https://www.mercurial-scm.org/wiki/mpm/SHA1

This patch compares the actual file contents after the hash
comparison, to confirm exact identity.
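
The idea can be sketched outside Mercurial as follows. This is only an
illustration, not the patched code: find_exact_matches, its dict
arguments (file name -> bytes), and the digest parameter are
hypothetical names introduced here, and the digest is pluggable only so
that a collision can be simulated with a deliberately weak function.

```python
import hashlib


def _sha1(data):
    """Default digest: SHA-1 over the raw bytes."""
    return hashlib.sha1(data).digest()


def find_exact_matches(removed, added, digest=_sha1):
    """Yield (removed_name, added_name) for byte-identical files.

    `removed` and `added` are hypothetical dicts mapping a file name to
    its content as bytes; they stand in for the file contexts used in
    the real code.
    """
    # Index removed files by digest; this is the cheap pre-filter.
    hashes = {}
    for rname, rdata in removed.items():
        hashes[digest(rdata)] = rname

    for aname, adata in added.items():
        h = digest(adata)
        if h in hashes:
            rname = hashes[h]
            # An equal digest only marks a candidate; compare the
            # actual contents before reporting an exact rename.
            if adata == removed[rname]:
                yield (rname, aname)
```

With a weak digest such as "lambda d: bytes([len(d)])", two different
files of equal length collide, and the final comparison is what keeps
them from being reported as a rename.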

Even after this patch, SHA-1 is still used, because it is good enough
to quickly detect the existence of an "(almost) same" file, and
because:

  - replacing SHA-1 would decrease performance, and
  - it is not yet clear what should replace it

Getting the content of the removed file (= rfctx.data()) at each exact
comparison should be cheap enough, even though getting the content of
the added one is costly.

  ======= ============== =====================
  file    fctx           data() reads from
  ======= ============== =====================
  removed filectx        in-memory revlog data
  added   workingfilectx storage
  ======= ============== =====================

diff --git a/mercurial/similar.py b/mercurial/similar.py
--- a/mercurial/similar.py
+++ b/mercurial/similar.py
@@ -35,9 +35,13 @@ def _findexactmatches(repo, added, remov
     for i, fctx in enumerate(added):
         repo.ui.progress(_('searching for exact renames'), i + len(removed),
                 total=numfiles, unit=_('files'))
-        h = hashlib.sha1(fctx.data()).digest()
+        adata = fctx.data()
+        h = hashlib.sha1(adata).digest()
         if h in hashes:
-            yield (hashes[h], fctx)
+            rfctx = hashes[h]
+            # compare between actual file contents for exact identity
+            if adata == rfctx.data():
+                yield (rfctx, fctx)
 
     # Done
     repo.ui.progress(_('searching for exact renames'), None)