[PATCH] diff: use a threshold on similarity index before using word-diff (issue5965)

Denis Laxalde denis at laxalde.org
Tue Aug 21 11:11:51 EDT 2018


Yuya Nishihara wrote:
> On Tue, 21 Aug 2018 14:10:33 +0200, Denis Laxalde wrote:
>> # HG changeset patch
>> # User Denis Laxalde <denis.laxalde at logilab.fr>
>> # Date 1534853203 -7200
>> #      Tue Aug 21 14:06:43 2018 +0200
>> # Node ID c43df6ff42d26163d19e99e15a3cf3094020d822
>> # Parent  c62184c6299c09d2e8e7be340f9aee138229cb86
>> # Available At http://hg.logilab.org/users/dlaxalde/hg
>> #              hg pull http://hg.logilab.org/users/dlaxalde/hg -r c43df6ff42d2
>> # EXP-Topic issue5965
>> diff: use a threshold on similarity index before using word-diff (issue5965)
>>
>> The threshold is chosen quite arbitrarily with a value of 0.5. It does
>> not change the results of test-diff-color.t whereas higher values (e.g.
>> 0.6) would. Looking at what this produces on some changesets in recent
>> history (e.g. 037debbf869c or 7acec9408e1c), this significantly improves
>> diff readability.
>>
>> Similarity index is computed using difflib.SequenceMatcher's ratio()
>> method; this is documented as being "expensive", but other faster methods
>> (that compute an upper bound value) do not give good results.
>> Nevertheless, since we compute this ratio for each hunk, and hunks are
>> usually small, this might not be problematic in most cases. Also, as we'd
>> short-circuit computation of inline colors for those hunks that are not
>> similar enough, the cost of this "expensive" ratio computation might also
>> be offset.
> 
> Can you test this against a large BLOB-ish diff (such as machine-generated
> 10k-line JSON, a binary in Intel HEX format, etc.)? Last time I faced that,
> the original difflib-based algorithm was painfully slow (~100s-ish to yield
> one hunk), which made me think the word-diff should never be turned on by
> default.

I've set up a test repo with some JSON at
https://bitbucket.org/dlax/hg-worddiff-tests. As far as I can tell,
there's no significant difference when diffing the last changeset; it's
even a bit faster with the similarity ratio patch. Is this what you had
in mind?


> Also, for me, it was a bad UX that the word-diff was disabled at an arbitrary
> threshold. I needed to guess the reason why a hunk wasn't highlighted.

I don't get this. Before this patch, word-diff is also off if one side of
the hunk is empty, IIUC. Do you mean we should make the threshold configurable?
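
To make the threshold discussion concrete, here is a minimal sketch of the
idea described in the commit message. It is not the code from the patch: the
helper names (similarity, use_worddiff), the ">=" comparison and the way the
two sides of a hunk are joined into strings are assumptions made for
illustration. Only difflib.SequenceMatcher.ratio() and the 0.5 value come
from the patch description.

```python
import difflib

# 0.5 is the value quoted in the commit message; exposing it as a
# parameter is an assumption, not something the patch is known to do.
THRESHOLD = 0.5


def similarity(removed, added):
    """Return difflib's similarity ratio (0.0 to 1.0) for one hunk.

    `removed` and `added` are lists of lines (hypothetical inputs).
    """
    a = "".join(removed)
    b = "".join(added)
    return difflib.SequenceMatcher(None, a, b).ratio()


def use_worddiff(removed, added, threshold=THRESHOLD):
    """Decide whether inline (word) highlighting is worth computing."""
    # Assumption: a hunk with an empty side is never word-diffed.
    if not removed or not added:
        return False
    # Assumption: the comparison is inclusive of the threshold.
    return similarity(removed, added) >= threshold


if __name__ == "__main__":
    print(use_worddiff(["hello world\n"], ["hello there world\n"]))
    print(use_worddiff(["aaaa\n"], ["zzzzzzzz\n"]))
```

In this sketch a hunk whose two sides share most of their characters gets
word-diff, while a hunk rewritten almost entirely falls back to plain line
coloring. SequenceMatcher also offers quick_ratio() and real_quick_ratio(),
which difflib documents as upper bounds on ratio(); those are presumably the
"faster methods" the commit message refers to.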


