D2601: xdiff: reduce indent heuristic overhead

quark (Jun Wu) phabricator at mercurial-scm.org
Sat Mar 3 20:07:41 UTC 2018


quark created this revision.
Herald added a subscriber: mercurial-devel.
Herald added a reviewer: hg-reviewers.

REVISION SUMMARY
  Adds some threshold to avoid expensive cases, like:
  
    #!python
    open('a', 'w').write(" \n" * 1000000)
    open('b', 'w').write(" \n" * 1000001)
  
  The indent heuristic is O(N * 20) (N = 1000000) for the above case, and
  makes diff much slower.
  
  Before this patch (system git 2.14.2):
  
    git diff --no-indent-heuristic a b  0.21s user 0.03s system 100% cpu 0.239 total
    git diff --indent-heuristic a b     0.77s user 0.02s system 99% cpu 0.785 total
  
  After this patch (git 2fc74f41, with xdiffi.c patched):
  
    # with the changed xdiffi.c
    git diff --indent-heuristic a b      0.16s user 0.01s system 90% cpu 0.188 total
    git diff --no-indent-heuristic a b   0.18s user 0.01s system 99% cpu 0.192 total
  
  Now turning on indent-heuristic has no visible impact on performance.

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D2601

AFFECTED FILES
  mercurial/third-party/xdiff/xdiffi.c

CHANGE DETAILS

diff --git a/mercurial/third-party/xdiff/xdiffi.c b/mercurial/third-party/xdiff/xdiffi.c
--- a/mercurial/third-party/xdiff/xdiffi.c
+++ b/mercurial/third-party/xdiff/xdiffi.c
@@ -801,6 +801,14 @@
 	exit(1);
 }
 
+/*
+ * For indentation heuristic, skip searching for better slide position after
+ * checking MAX_BORING lines without finding an improvement. This defends the
+ * indentation heuristic logic against pathological cases. The value is not
+ * picked scientifically but should be good enough.
+ */
+#define MAX_BORING 100
+
 /*
  * Move back and forward change groups for a consistent and pretty diff output.
  * This also helps in finding joinable change groups and reducing the diff
@@ -897,19 +905,43 @@
 			long shift, best_shift = -1;
 			struct split_score best_score;
 
-			for (shift = earliest_end; shift <= g.end; shift++) {
+			/*
+			 * This is O(N * MAX_BLANKS) (N = shift-able lines).
+			 * Even with MAX_BLANKS bounded to a small value, a
+			 * large N could still make this loop take several
+			 * times longer than the main diff algorithm. The
+			 * "boring" value is to help cut down N to something
+			 * like (MAX_BORING + groupsize).
+			 *
+			 * Scan from bottom to top. So we can exit the loop
+			 * without compromising the assumption "for a same best
+			 * score, pick the bottommost shift".
+			 */
+			int boring = 0;
+			for (shift = g.end; shift >= earliest_end; shift--) {
 				struct split_measurement m;
 				struct split_score score = {0, 0};
+				int cmp;
 
 				measure_split(xdf, shift, &m);
 				score_add_split(&m, &score);
 				measure_split(xdf, shift - groupsize, &m);
 				score_add_split(&m, &score);
-				if (best_shift == -1 ||
-				    score_cmp(&score, &best_score) <= 0) {
+
+				if (best_shift == -1) {
+					cmp = -1;
+				} else {
+					cmp = score_cmp(&score, &best_score);
+				}
+				if (cmp < 0) {
+					boring = 0;
 					best_score.effective_indent = score.effective_indent;
 					best_score.penalty = score.penalty;
 					best_shift = shift;
+				} else {
+					boring += 1;
+					if (boring >= MAX_BORING)
+						break;
 				}
 			}
 



To: quark, #hg-reviewers
Cc: mercurial-devel


More information about the Mercurial-devel mailing list