[PATCH] bdiff: replace hash algorithm

Gregory Szorc gregory.szorc at gmail.com
Tue Nov 8 11:33:52 EST 2016



> On Nov 8, 2016, at 08:21, Augie Fackler <raf at durin42.com> wrote:
> 
>> On Mon, Nov 07, 2016 at 12:19:47AM -0800, Gregory Szorc wrote:
>> 
>> 
>>> On Nov 6, 2016, at 19:06, Gregory Szorc <gregory.szorc at gmail.com> wrote:
>>> 
>>> # HG changeset patch
>>> # User Gregory Szorc <gregory.szorc at gmail.com>
>>> # Date 1478487117 28800
>>> #      Sun Nov 06 18:51:57 2016 -0800
>>> # Node ID bb7c6d6f4a10e80ff4bdf88919692f08497d2d66
>>> # Parent  1c7269484883804b6f960e87309169ef4ae85043
>>> bdiff: replace hash algorithm
>>> 
>>> This patch replaces lyhash with the hash algorithm used by diffutils.
>>> The algorithm has its origins in Git commit 2e9d1410, which is all the
>>> way back from 1992. The license header in the code at that revision
>>> in GPL v2.
>>> 
>>> I have not performed an extensive analysis of the distribution
>>> (and therefore buckets) of hash output. However, `hg perfbdiff`
>>> gives some clear wins. I'd like to think that if it is good enough
>>> for diffutils it is good enough for us?
>> 
>> Searching the Internets seems to reveal that xxHash is the state of the art for fast string hashing with great distribution. We'll have a copy of xxHash vendored as part of zstd and it should be relatively easy to plug in then.
>> 
>> Honestly, I'm not sure if we should take the quick win or hold out for xxHash in a few weeks (assuming my compression engine series and zstd vendoring moves forward...).
> 
> I'm going to take the quick win now (bird in the hand two in the bush
> etc), so this is queued.

OK. I have a hacked patch for xxHash and it appears roughly the same speed. But looking at assembly level profiling output, I think this is due to how the loop is compiled. I'll keep casually poking at this.

FWIW, we're still currently spending more time in bdiff_splitlines than the "real" algorithm part of diff. If only Big-O notation captured the reality of compilers, assembly, and CPU cache lines...

> 
>> 
>>> 
>>> From the mozilla-unified repository:
>>> 
>>> $ perfbdiff -m 3041e4d59df2
>>> ! wall 0.053271 comb 0.060000 user 0.060000 sys 0.000000 (best of 100)
>>> ! wall 0.035827 comb 0.040000 user 0.040000 sys 0.000000 (best of 100)
>>> 
>>> $ perfbdiff 0e9928989e9c --alldata --count 100
>>> ! wall 6.204277 comb 6.200000 user 6.200000 sys 0.000000 (best of 3)
>>> ! wall 4.309710 comb 4.300000 user 4.300000 sys 0.000000 (best of 3)
>>> 
>>> From the hg repo:
>>> 
>>> $ perfbdiff 35000 --alldata --count 1000
>>> ! wall 0.660358 comb 0.660000 user 0.660000 sys 0.000000 (best of 15)
>>> ! wall 0.534092 comb 0.530000 user 0.530000 sys 0.000000 (best of 19)
>>> 
>>> Looking at the generated assembly and statistical profiler output
>>> from the kernel level, I believe there is room to make this function
>>> even faster. Namely, we're still consuming data character by character
>>> instead of at the word level. This translates to more loop iterations
>>> and more instructions.
>>> 
>>> At this juncture though, the real performance killer is that we're
>>> hashing every line. We should get a significant speedup if we change
>>> the algorithm to find the longest prefix, longest suffix, treat those
>>> as single "lines" and then only do the line splitting and hashing on
>>> the parts that are different. That will require a lot of C code,
>>> however. I'm optimistic this approach could result in a ~2x speedup.
>>> 
>>> diff --git a/mercurial/bdiff.c b/mercurial/bdiff.c
>>> --- a/mercurial/bdiff.c
>>> +++ b/mercurial/bdiff.c
>>> @@ -17,6 +17,10 @@
>>> #include "bitmanipulation.h"
>>> #include "bdiff.h"
>>> 
>>> +/* Hash implementation from diffutils */
>>> +#define ROL(v, n) ((v) << (n) | (v) >> (sizeof(v) * CHAR_BIT - (n)))
>>> +#define HASH(h, c) ((c) + ROL(h ,7))
>>> +
>>> struct pos {
>>>   int pos, len;
>>> };
>>> @@ -44,8 +48,7 @@ int bdiff_splitlines(const char *a, ssiz
>>>   /* build the line array and calculate hashes */
>>>   hash = 0;
>>>   for (p = a; p < a + len; p++) {
>>> -        /* Leonid Yuriev's hash */
>>> -        hash = (hash * 1664525) + (unsigned char)*p + 1013904223;
>>> +        hash = HASH(hash, *p);
>>> 
>>>       if (*p == '\n' || p == plast) {
>>>           l->hash = hash;
>> _______________________________________________
>> Mercurial-devel mailing list
>> Mercurial-devel at mercurial-scm.org
>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


More information about the Mercurial-devel mailing list