[PATCH] mdiff: Compare content of binary files directly
Martin Geisler
mg at daimi.au.dk
Fri Aug 8 19:20:07 CDT 2008
# HG changeset patch
# User Martin Geisler <mg at daimi.au.dk>
# Date 1218240622 -7200
# Node ID 14cc0aa138d3138b77900a0310c7cdd0d7093551
# Parent 08a88ccca36107c2f3ec572fb83d1b1acc140d72
mdiff: Compare content of binary files directly

A plain Python string comparison stops when the first mismatch is
found, whereas md5 must hash both strings in full before the digests
can be compared.

A simple test with the timeit module shows that comparing 50 MiB
strings that differ in the first byte is quite fast:

  % python -m timeit -s "x = 'x' + ('abcdefghij' * 5 * 2**20)" \
      -s "y = 'y' + ('abcdefghij' * 5 * 2**20)" 'x == y'
  10000000 loops, best of 3: 0.187 usec per loop

It is actually almost as fast as comparing 50-byte strings:

  % python -m timeit -s "x = 'x' + ('abcdefghij' * 5)" \
      -s "y = 'y' + ('abcdefghij' * 5)" 'x == y'
  1000000 loops, best of 3: 0.173 usec per loop

Using md5 takes longer for a short string:

  % python -m timeit -s 'import md5' \
      -s "x = 'x' + ('abcdefghij' * 5)" \
      -s "y = 'y' + ('abcdefghij' * 5)" \
      'md5.new(x).digest() == md5.new(y).digest()'
  100000 loops, best of 3: 3.38 usec per loop

and even longer for a long string (as expected):

  % python -m timeit -s 'import md5' \
      -s "x = 'x' + ('abcdefghij' * 5 * 2**20)" \
      -s "y = 'y' + ('abcdefghij' * 5 * 2**20)" \
      'md5.new(x).digest() == md5.new(y).digest()'
  10 loops, best of 3: 807 msec per loop

If the strings differ in the very last byte, then a normal Python
comparison is still faster than the md5 version:

  % python -m timeit -s "x = ('abcdefghij' * 5 * 2**20) + 'x'" \
      -s "y = ('abcdefghij' * 5 * 2**20) + 'y'" 'x == y'
  10 loops, best of 3: 156 msec per loop
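
The measurements above use the old md5 module. For anyone who wants to
repeat them on a current Python, here is a rough sketch that assumes
hashlib.md5 as the stand-in for md5.new and uses bytes instead of str.
The helper names (same_by_compare, same_by_md5, best_time, run) are
made up for this example, and the default input size is much smaller
than 50 MiB so that the script finishes quickly. Timings will differ
from the ones quoted above.

```python
import hashlib
import timeit


def same_by_compare(a, b):
    """Direct equality, the check the patch switches to."""
    return a == b


def same_by_md5(a, b):
    """Digest equality, the check the patch removes (hashlib is assumed here)."""
    return hashlib.md5(a).digest() == hashlib.md5(b).digest()


def best_time(func, a, b, number=3, repeat=3):
    """Best average seconds per call over `repeat` runs of `number` calls."""
    timer = timeit.Timer(lambda: func(a, b))
    return min(timer.repeat(repeat=repeat, number=number)) / number


def run(size=2**20):
    """Time both checks on inputs that differ in the first or the last byte."""
    body = b"abcdefghij" * (size // 10)
    cases = {
        "first byte differs": (b"x" + body, b"y" + body),
        "last byte differs": (body + b"x", body + b"y"),
    }
    results = {}
    for label, (a, b) in cases.items():
        results[label] = (
            best_time(same_by_compare, a, b),
            best_time(same_by_md5, a, b),
        )
    return results


if __name__ == "__main__":
    for label, (t_cmp, t_md5) in run().items():
        print("%s: == %.3g s, md5 %.3g s" % (label, t_cmp, t_md5))
```

Passing size=50 * 2**20 to run() should bring the inputs close to the
50 MiB strings used in the original test.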
diff -r 08a88ccca361 -r 14cc0aa138d3 mercurial/mdiff.py
--- a/mercurial/mdiff.py Sat Aug 09 01:56:23 2008 +0200
+++ b/mercurial/mdiff.py Sat Aug 09 02:10:22 2008 +0200
@@ -78,10 +78,7 @@
     epoch = util.datestr((0, 0))
 
     if not opts.text and (util.binary(a) or util.binary(b)):
-        def h(v):
-            # md5 is used instead of sha1 because md5 is supposedly faster
-            return util.md5(v).digest()
-        if a and b and len(a) == len(b) and h(a) == h(b):
+        if a and b and len(a) == len(b) and a == b:
             return ""
         l = ['Binary file %s has changed\n' % fn1]
     elif not a:
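
To make the effect of the hunk easier to see in isolation, the patched
branch can be sketched as a stand-alone function. This is only an
illustration: binary_change_notice is a hypothetical name, the real
code sits inside a larger function in mercurial/mdiff.py, and the
util.binary() test and opts.text handling are left out.

```python
def binary_change_notice(a, b, fn1):
    """Return "" for identical non-empty blobs, else the one-line notice.

    Hypothetical stand-alone version of the patched branch; the real
    code collects the notice in a list instead of returning it here.
    """
    # Same condition as the added line in the hunk: non-empty inputs,
    # equal lengths, then a direct content comparison instead of md5.
    if a and b and len(a) == len(b) and a == b:
        return ""
    return "Binary file %s has changed\n" % fn1


if __name__ == "__main__":
    print(repr(binary_change_notice(b"\0abc", b"\0abc", "image.png")))
    print(repr(binary_change_notice(b"\0abc", b"\0abd", "image.png")))
```

Unlike a digest comparison, the direct comparison gives an exact
answer and can return as soon as the first differing byte is found.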