[PATCH] mdiff: Compare content of binary files directly

Martin Geisler mg at daimi.au.dk
Fri Aug 8 19:20:07 CDT 2008


# HG changeset patch
# User Martin Geisler <mg at daimi.au.dk>
# Date 1218240622 -7200
# Node ID 14cc0aa138d3138b77900a0310c7cdd0d7093551
# Parent  08a88ccca36107c2f3ec572fb83d1b1acc140d72
mdiff: Compare content of binary files directly

A plain Python string comparison stops at the first mismatch, whereas
the md5 approach must hash both strings in full before the digests can
be compared.

A simple test with the timeit module shows that comparing 50 MiB
strings which differ in the first byte is quite fast:

  % python -m timeit -s "x = 'x' + ('abcdefghij' * 5 * 2**20)" \
                     -s "y = 'y' + ('abcdefghij' * 5 * 2**20)" 'x == y'
  10000000 loops, best of 3: 0.187 usec per loop

It is actually almost as fast as comparing 50-byte strings:

  % python -m timeit -s "x = 'x' + ('abcdefghij' * 5)" \
                     -s "y = 'y' + ('abcdefghij' * 5)" 'x == y'
  1000000 loops, best of 3: 0.173 usec per loop

Using md5 takes longer for a short string:

  % python -m timeit -s 'import md5' \
           -s "x = 'x' + ('abcdefghij' * 5)" \
           -s "y = 'y' + ('abcdefghij' * 5)" \
           'md5.new(x).digest() == md5.new(y).digest()'
  100000 loops, best of 3: 3.38 usec per loop

and even longer for a long string (as expected):

  % python -m timeit -s 'import md5' \
           -s "x = 'x' + ('abcdefghij' * 5 * 2**20)" \
           -s "y = 'y' + ('abcdefghij' * 5 * 2**20)" \
           'md5.new(x).digest() == md5.new(y).digest()'
  10 loops, best of 3: 807 msec per loop

If the strings differ only in the very last byte, a plain Python
comparison is still roughly five times faster than the md5 version:

  % python -m timeit -s "x = ('abcdefghij' * 5 * 2**20) + 'x'" \
                     -s "y = ('abcdefghij' * 5 * 2**20) + 'y'" 'x == y'
  10 loops, best of 3: 156 msec per loop
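
For anyone who wants to repeat the measurements on a current
interpreter, here is a minimal sketch of the same comparison as a
script. It assumes Python 3, so bytes objects and hashlib.md5 stand in
for the Python 2 strings and the md5 module used above. The helper
names make_pair, equal_direct and equal_md5 are made up for this
example and are not part of Mercurial. Timings will of course differ
from the figures quoted above.

```python
import hashlib
import timeit

# Assumption: Python 3 bytes and hashlib.md5 replace the Python 2 str
# objects and the md5 module used in the measurements above.
SIZE = 5 * 2**20  # repetitions of a 10-byte pattern, about 50 MiB


def make_pair(size, differ_at_start):
    """Build two byte strings that differ in exactly one byte."""
    body = b'abcdefghij' * size
    if differ_at_start:
        return b'x' + body, b'y' + body
    return body + b'x', body + b'y'


def equal_direct(a, b):
    """Compare the contents directly."""
    return a == b


def equal_md5(a, b):
    """Compare md5 digests; both inputs are hashed in full first."""
    return hashlib.md5(a).digest() == hashlib.md5(b).digest()


if __name__ == '__main__':
    for label, at_start in (('first byte', True), ('last byte', False)):
        x, y = make_pair(SIZE, at_start)
        for func in (equal_direct, equal_md5):
            # Best of three single runs, in seconds.
            t = min(timeit.repeat(lambda: func(x, y), number=1, repeat=3))
            print('%-10s %-12s %.6f s' % (label, func.__name__, t))
```

Both helpers return the same answer for any pair of inputs whose
digests do not collide; only the amount of work differs.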

diff -r 08a88ccca361 -r 14cc0aa138d3 mercurial/mdiff.py
--- a/mercurial/mdiff.py	Sat Aug 09 01:56:23 2008 +0200
+++ b/mercurial/mdiff.py	Sat Aug 09 02:10:22 2008 +0200
@@ -78,10 +78,7 @@
     epoch = util.datestr((0, 0))
 
     if not opts.text and (util.binary(a) or util.binary(b)):
-        def h(v):
-            # md5 is used instead of sha1 because md5 is supposedly faster
-            return util.md5(v).digest()
-        if a and b and len(a) == len(b) and h(a) == h(b):
+        if a and b and len(a) == len(b) and a == b:
             return ""
         l = ['Binary file %s has changed\n' % fn1]
     elif not a:
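
To show the patched condition in isolation, here is a small
stand-alone sketch. It is not the code in mdiff.py: the function
binary_summary is hypothetical, and the local binary() helper only
assumes that util.binary treats data containing a NUL byte as binary.
Bytes objects are used so that it runs on Python 3.

```python
def binary(s):
    # Assumption: stand-in for util.binary, which is taken here to flag
    # any non-empty data containing a NUL byte.
    return bool(s and b'\0' in s)


def binary_summary(a, b, fn1, text=False):
    """Hypothetical helper mirroring the branch touched by the patch.

    Returns '' when both binary contents are identical, a notice when
    they differ, and None when neither side is treated as binary.
    """
    if not text and (binary(a) or binary(b)):
        # Direct comparison: no hashing, stops at the first mismatch.
        if a and b and len(a) == len(b) and a == b:
            return ''
        return 'Binary file %s has changed\n' % fn1
    return None


if __name__ == '__main__':
    print(repr(binary_summary(b'\0abc', b'\0abc', 'a.bin')))
    print(repr(binary_summary(b'\0abc', b'\0abd', 'a.bin')))
```

The length check is kept as in the patch, although the equality test
alone would give the same result.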

