[PATCH] bdiff: replace hash algorithm
Gregory Szorc
gregory.szorc at gmail.com
Tue Nov 8 11:33:52 EST 2016
> On Nov 8, 2016, at 08:21, Augie Fackler <raf at durin42.com> wrote:
>
>> On Mon, Nov 07, 2016 at 12:19:47AM -0800, Gregory Szorc wrote:
>>
>>
>>> On Nov 6, 2016, at 19:06, Gregory Szorc <gregory.szorc at gmail.com> wrote:
>>>
>>> # HG changeset patch
>>> # User Gregory Szorc <gregory.szorc at gmail.com>
>>> # Date 1478487117 28800
>>> # Sun Nov 06 18:51:57 2016 -0800
>>> # Node ID bb7c6d6f4a10e80ff4bdf88919692f08497d2d66
>>> # Parent 1c7269484883804b6f960e87309169ef4ae85043
>>> bdiff: replace hash algorithm
>>>
>>> This patch replaces lyhash with the hash algorithm used by diffutils.
>>> The algorithm has its origins in Git commit 2e9d1410, which dates all
>>> the way back to 1992. The license header in the code at that revision
>>> is GPL v2.
>>>
>>> I have not performed an extensive analysis of the distribution
>>> (and therefore buckets) of hash output. However, `hg perfbdiff`
>>> gives some clear wins. I'd like to think that if it is good enough
>>> for diffutils it is good enough for us?
>>
>> Searching the Internets seems to reveal that xxHash is the state of the art for fast string hashing with great distribution. We'll have a copy of xxHash vendored as part of zstd and it should be relatively easy to plug in then.
>>
>> Honestly, I'm not sure if we should take the quick win or hold out for xxHash in a few weeks (assuming my compression engine series and zstd vendoring moves forward...).
>
> I'm going to take the quick win now (a bird in the hand is worth two
> in the bush, etc.), so this is queued.
OK. I have a hacked-up patch for xxHash and it appears to be roughly the same speed. But looking at assembly-level profiling output, I think this is due to how the loop is compiled. I'll keep casually poking at this.
FWIW, we're still currently spending more time in bdiff_splitlines than the "real" algorithm part of diff. If only Big-O notation captured the reality of compilers, assembly, and CPU cache lines...
>
>>
>>>
>>> From the mozilla-unified repository:
>>>
>>> $ perfbdiff -m 3041e4d59df2
>>> ! wall 0.053271 comb 0.060000 user 0.060000 sys 0.000000 (best of 100)
>>> ! wall 0.035827 comb 0.040000 user 0.040000 sys 0.000000 (best of 100)
>>>
>>> $ perfbdiff 0e9928989e9c --alldata --count 100
>>> ! wall 6.204277 comb 6.200000 user 6.200000 sys 0.000000 (best of 3)
>>> ! wall 4.309710 comb 4.300000 user 4.300000 sys 0.000000 (best of 3)
>>>
>>> From the hg repo:
>>>
>>> $ perfbdiff 35000 --alldata --count 1000
>>> ! wall 0.660358 comb 0.660000 user 0.660000 sys 0.000000 (best of 15)
>>> ! wall 0.534092 comb 0.530000 user 0.530000 sys 0.000000 (best of 19)
>>>
>>> Looking at the generated assembly and statistical profiler output
>>> from the kernel level, I believe there is room to make this function
>>> even faster. Namely, we're still consuming data character by character
>>> instead of at the word level. This translates to more loop iterations
>>> and more instructions.
>>>
>>> At this juncture though, the real performance killer is that we're
>>> hashing every line. We should get a significant speedup if we change
>>> the algorithm to find the longest prefix, longest suffix, treat those
>>> as single "lines" and then only do the line splitting and hashing on
>>> the parts that are different. That will require a lot of C code,
>>> however. I'm optimistic this approach could result in a ~2x speedup.
>>>
>>> diff --git a/mercurial/bdiff.c b/mercurial/bdiff.c
>>> --- a/mercurial/bdiff.c
>>> +++ b/mercurial/bdiff.c
>>> @@ -17,6 +17,10 @@
>>> #include "bitmanipulation.h"
>>> #include "bdiff.h"
>>>
>>> +/* Hash implementation from diffutils */
>>> +#define ROL(v, n) ((v) << (n) | (v) >> (sizeof(v) * CHAR_BIT - (n)))
>>> +#define HASH(h, c) ((c) + ROL(h, 7))
>>> +
>>> struct pos {
>>> int pos, len;
>>> };
>>> @@ -44,8 +48,7 @@ int bdiff_splitlines(const char *a, ssiz
>>> /* build the line array and calculate hashes */
>>> hash = 0;
>>> for (p = a; p < a + len; p++) {
>>> - /* Leonid Yuriev's hash */
>>> - hash = (hash * 1664525) + (unsigned char)*p + 1013904223;
>>> + hash = HASH(hash, *p);
>>>
>>> if (*p == '\n' || p == plast) {
>>> l->hash = hash;
>> _______________________________________________
>> Mercurial-devel mailing list
>> Mercurial-devel at mercurial-scm.org
>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel