[PATCH] diff: use a threshold on similarity index before using word-diff (issue5965)

Denis Laxalde denis at laxalde.org
Tue Aug 21 12:10:33 UTC 2018


# HG changeset patch
# User Denis Laxalde <denis.laxalde at logilab.fr>
# Date 1534853203 -7200
#      Tue Aug 21 14:06:43 2018 +0200
# Node ID c43df6ff42d26163d19e99e15a3cf3094020d822
# Parent  c62184c6299c09d2e8e7be340f9aee138229cb86
# Available At http://hg.logilab.org/users/dlaxalde/hg
#              hg pull http://hg.logilab.org/users/dlaxalde/hg -r c43df6ff42d2
# EXP-Topic issue5965
diff: use a threshold on similarity index before using word-diff (issue5965)

The threshold of 0.5 is chosen somewhat arbitrarily. It does not change
the results of test-diff-color.t, whereas higher values (e.g. 0.6)
would. On some changesets from recent history (e.g. 037debbf869c or
7acec9408e1c), this significantly improves diff readability.

The similarity index is computed using difflib.SequenceMatcher's ratio()
method. This method is documented as being "expensive", but the faster
alternatives (which only compute an upper bound) do not give good
results. Still, since the ratio is computed per hunk and hunks are
usually small, this should not be a problem in most cases. Also, since
we short-circuit the computation of inline colors for hunks that are
not similar enough, the cost of the "expensive" ratio computation may
be compensated for.
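
As a rough illustration of the check described above, here is a minimal
standalone sketch, not the code from this patch. The WORDS regex and
the similarity() and use_word_diff() helpers are hypothetical stand-ins
(the real wordsplitter in mercurial/patch.py is not shown here); only
the difflib calls and the 0.5 threshold come from the patch. It uses
the line pairs from test-diff-color.t.

```python
import difflib
import re

# Assumption: a simplified tokenizer (runs of whitespace, runs of word
# characters, or single punctuation characters). This is NOT the
# wordsplitter used by mercurial/patch.py.
WORDS = re.compile(r'\s+|\w+|[^\s\w]')

THRESHOLD = 0.5  # value used in the patch


def similarity(a, b):
    """Return (ratio, quick_ratio) for the word lists of a and b."""
    sm = difflib.SequenceMatcher(None, WORDS.findall(a), WORDS.findall(b))
    # quick_ratio() is only an upper bound on ratio()
    return sm.ratio(), sm.quick_ratio()


def use_word_diff(a, b):
    """Hypothetical helper: True if word-diff coloring looks worthwhile."""
    return similarity(a, b)[0] >= THRESHOLD


old = 'this line will almost completely change\n'
new = 'so many things have changed in this line\n'
ratio, quick = similarity(old, new)
print('ratio=%.3f quick_ratio=%.3f word-diff=%s'
      % (ratio, quick, use_word_diff(old, new)))

print(use_word_diff('three of those lines will\n',
                    'three of those lines have\n'))
```

With this tokenizer, the "almost completely change" pair falls well
below the threshold with ratio(), while quick_ratio() lands above 0.5
because it only counts shared tokens (mostly spaces) regardless of
their order. That is consistent with the remark that the upper-bound
methods do not give good results.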

diff --git a/mercurial/patch.py b/mercurial/patch.py
--- a/mercurial/patch.py
+++ b/mercurial/patch.py
@@ -11,6 +11,7 @@ from __future__ import absolute_import, 
 import collections
 import contextlib
 import copy
+import difflib
 import email
 import errno
 import hashlib
@@ -2445,6 +2446,12 @@ def diffsinglehunkinline(hunklines):
     # re-split the content into words
     al = wordsplitter.findall(a)
     bl = wordsplitter.findall(b)
+    # if similarity index between word lists is not high enough, fall back to
+    # diffsinglehunk since word-diff coloring will be useless
+    if difflib.SequenceMatcher(None, al, bl).ratio() < 0.5:
+        for t in diffsinglehunk(hunklines):
+            yield t
+        return
     # re-arrange the words to lines since the diff algorithm is line-based
     aln = [s if s == '\n' else s + '\n' for s in al]
     bln = [s if s == '\n' else s + '\n' for s in bl]
diff --git a/tests/test-diff-color.t b/tests/test-diff-color.t
--- a/tests/test-diff-color.t
+++ b/tests/test-diff-color.t
@@ -304,6 +304,8 @@ test inline color diff
   > three of those lines will
   > collapse onto one
   > (to see if it works)
+  > 
+  > this line will almost completely change
   > EOF
   $ hg add file1
   $ hg ci -m 'commit'
@@ -326,12 +328,14 @@ test inline color diff
   > 
   > three of those lines have
   > collapsed onto one
+  > 
+  > so many things have changed in this line
   > EOF
   $ hg diff --config diff.word-diff=False --color=debug
   [diff.diffline|diff --git a/file1 b/file1]
   [diff.file_a|--- a/file1]
   [diff.file_b|+++ b/file1]
-  [diff.hunk|@@ -1,16 +1,17 @@]
+  [diff.hunk|@@ -1,18 +1,19 @@]
   [diff.deleted|-this is the first line]
   [diff.deleted|-this is the second line]
   [diff.deleted|-    third line starts with space]
@@ -360,11 +364,14 @@ test inline color diff
   [diff.deleted|-(to see if it works)]
   [diff.inserted|+three of those lines have]
   [diff.inserted|+collapsed onto one]
+   
+  [diff.deleted|-this line will almost completely change]
+  [diff.inserted|+so many things have changed in this line]
   $ hg diff --config diff.word-diff=True --color=debug
   [diff.diffline|diff --git a/file1 b/file1]
   [diff.file_a|--- a/file1]
   [diff.file_b|+++ b/file1]
-  [diff.hunk|@@ -1,16 +1,17 @@]
+  [diff.hunk|@@ -1,18 +1,19 @@]
   [diff.deleted|-][diff.deleted.changed|this][diff.deleted.unchanged| is the first ][diff.deleted.changed|line]
   [diff.deleted|-][diff.deleted.unchanged|this is the second line]
   [diff.deleted|-][diff.deleted.changed|    ][diff.deleted.unchanged|third line starts with space]
@@ -393,6 +400,9 @@ test inline color diff
   [diff.deleted|-][diff.deleted.changed|(to see if it works)]
   [diff.inserted|+][diff.inserted.unchanged|three of those lines ][diff.inserted.changed|have]
   [diff.inserted|+][diff.inserted.changed|collapsed][diff.inserted.unchanged| onto one]
+   
+  [diff.deleted|-this line will almost completely change]
+  [diff.inserted|+so many things have changed in this line]
 
 multibyte character shouldn't be broken up in word diff:
 

