[PATCH] diff: use a threshold on similarity index before using word-diff (issue5965)

Thu Aug 23 08:48:17 EDT 2018

On Wed, 22 Aug 2018 21:35:31 +0900, Yuya Nishihara wrote:
> On Tue, 21 Aug 2018 17:11:51 +0200, Denis Laxalde wrote:
> > Yuya Nishihara wrote:
> > > On Tue, 21 Aug 2018 14:10:33 +0200, Denis Laxalde wrote:
> > >> # HG changeset patch
> > >> # User Denis Laxalde <denis.laxalde at logilab.fr>
> > >> # Date 1534853203 -7200
> > >> #      Tue Aug 21 14:06:43 2018 +0200
> > >> # Node ID c43df6ff42d26163d19e99e15a3cf3094020d822
> > >> # Parent  c62184c6299c09d2e8e7be340f9aee138229cb86
> > >> # Available At http://hg.logilab.org/users/dlaxalde/hg
> > >> #              hg pull http://hg.logilab.org/users/dlaxalde/hg -r c43df6ff42d2
> > >> # EXP-Topic issue5965
> > >> diff: use a threshold on similarity index before using word-diff (issue5965)
> > >>
> > >> The threshold is chosen quite arbitrarily with a value of 0.5. It does
> > >> not change the results of test-diff-color.t whereas higher values (e.g.
> > >> 0.6) would. Looking at what this produces on some changesets in recent
> > >> history (e.g. 037debbf869c or 7acec9408e1c), this significantly improves
> > >> diff readability.
> > >>
> > >> Similarity index is computed using difflib.SequenceMatcher's ratio()
> > >> method; this is documented as being "expensive", but other faster methods
> > >> (that compute an upper bound value) do not give good results.
> > >> Nevertheless, since we compute this ratio per hunk, and hunks are
> > >> usually small, this might not be problematic in most cases. Also, as
> > >> we'd short-circuit computation of inline colors for those hunks that
> > >> are not similar enough, the cost of this "expensive" ratio computation
> > >> might also be offset.
> > > 
> > > Can you test this against a large BLOB-ish diff (such as machine-generated
> > > 10k-line JSON, a binary in Intel HEX format, etc.)? Last time I faced that,
> > > the original difflib-based algorithm was painfully slow (~100s-ish to yield
> > > one hunk), which made me think the word-diff should never be turned on by
> > > default.
> > 
> > I've set up a test repo with some JSON at
> > https://bitbucket.org/dlax/hg-worddiff-tests. As far as I can tell,
> > there's no significant difference when diffing the last changeset;
> 
> Thanks, but it looks cheaper to compute than the stuff I had at work. I'll
> try to collect some numbers if I get a chance.

  $ hg diff -c REV --color=always --config diff.word-diff=true --time > /dev/null
  (orig) 1.250sec
  (new)  1259.490sec

The input is an ASCII-fied FPGA image (a so-called tabular text file)
containing ~320k decimal numbers plus commas (so ~1000k words in our
word-diff). There are some large hunks, since it is a diff of two similar
BLOBs split into chunks of N bytes.
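
For reference, a rough sketch of the kind of gating the patch description
talks about. This is not the code from the patch: similar_enough() and
SIMILARITY_THRESHOLD are made-up names, and checking the cheap upper bounds
(real_quick_ratio() and quick_ratio()) before ratio() is an assumption of
this sketch, not something the patch does. Only the 0.5 value and the use of
difflib.SequenceMatcher.ratio() come from the description.

```python
import difflib

# 0.5 is the value quoted in the patch description; the name is hypothetical.
SIMILARITY_THRESHOLD = 0.5


def similar_enough(old, new, threshold=SIMILARITY_THRESHOLD):
    """Return True if word-diff looks worth attempting for this hunk.

    Hypothetical helper, not Mercurial's API. 'old' and 'new' are the
    removed and added text of one hunk.
    """
    matcher = difflib.SequenceMatcher(None, old, new)
    # Both quick variants are documented as upper bounds on ratio(), so a
    # value below the threshold already proves ratio() is below it too.
    if matcher.real_quick_ratio() < threshold:
        return False
    if matcher.quick_ratio() < threshold:
        return False
    # The expensive part: this is the call that dominates on large hunks.
    return matcher.ratio() >= threshold
```

Note that the early exits only help for dissimilar hunks. For two similar
BLOBs, as in the case above, both upper bounds pass and ratio() is still
computed in full.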