[PATCH 1 of 1] bdiff.c: implemented block-delimiting to better deal with long "lines"

Friedrich Kastner-Masilko kastner-masilko at at.festo.com
Wed May 14 16:54:33 CDT 2014


# HG changeset patch
# User Friedrich Kastner-Masilko <kastner-masilko at at.festo.com>
# Date 1400101077 -7200
#      Wed May 14 22:57:57 2014 +0200
# Node ID 1b57d1650cd2d5aa8c6cc103c344ecbd8fbabc39
# Parent  1ae3cd6f836c3c96ee3e9a872c8e966750910c2d
bdiff.c: implemented block-delimiting to better deal with long "lines"

Recent XML-based file formats often resemble human-readable text without a single line-break. This mostly comes from serialization of binary data into the XML format without well-forming the content for viewing. Storing such files with the current revlog implementation results in ineffective storage due to the used bdiff line-based algorithm. Since bdiff creates chunks based on the line-break mark, the whole file content is considered as one chunk, thus creating a delta as big (or even bigger) as the file itself.

This patch is introducing block-limiting of lines. All lines encountered will be split into 4k blocks, thus giving the algorithm a chance to create smaller deltas, especially if the changes are at the end of the file. Especially for growing content where the header of the file is never changed, this patch increases the storage efficiency. However, with changes at the beginning of the file the block-limiting is not changing the results w.r.t. the original algorithm. The same is true for standard usage with text-files: because these usually contain lines shorter than 4k characters, the patch never kicks in.

diff -r 1ae3cd6f836c -r 1b57d1650cd2 mercurial/bdiff.c
--- a/mercurial/bdiff.c	Tue May 13 19:29:45 2014 -0500
+++ b/mercurial/bdiff.c	Wed May 14 22:57:57 2014 +0200
@@ -36,16 +36,17 @@
 static int splitlines(const char *a, Py_ssize_t len, struct line **lr)
 {
 	unsigned hash;
-	int i;
+	int i, j;
 	const char *p, *b = a;
 	const char * const plast = a + len - 1;
 	struct line *l;
 
 	/* count the lines */
 	i = 1; /* extra line for sentinel */
-	for (p = a; p < a + len; p++)
-		if (*p == '\n' || p == plast)
-			i++;
+	j = 0; /* block counter for fixed line delimiting */
+	for (p = a; p < a + len; p++, j++)
+		if (*p == '\n' || p == plast || j >= 4096)
+			i++, j = 0;
 
 	*lr = l = (struct line *)malloc(sizeof(struct line) * i);
 	if (!l)
@@ -53,11 +54,11 @@
 
 	/* build the line array and calculate hashes */
 	hash = 0;
-	for (p = a; p < a + len; p++) {
+	for (p = a, j = 0; p < a + len; p++, j++) {
 		/* Leonid Yuriev's hash */
 		hash = (hash * 1664525) + (unsigned char)*p + 1013904223;
 
-		if (*p == '\n' || p == plast) {
+		if (*p == '\n' || p == plast || j >= 4096) {
 			l->hash = hash;
 			hash = 0;
 			l->len = p - b + 1;
@@ -65,6 +66,7 @@
 			l->n = INT_MAX;
 			l++;
 			b = p + 1;
+			j = 0;
 		}
 	}
 


More information about the Mercurial-devel mailing list