[PATCH 1 of 1] bdiff.c: implemented block-delimiting to better deal with long "lines"

Wed May 14 17:25:08 CDT 2014

On 05/14/2014 02:54 PM, Friedrich Kastner-Masilko wrote:
> # HG changeset patch
> # User Friedrich Kastner-Masilko <kastner-masilko at at.festo.com>
> # Date 1400101077 -7200
> #      Wed May 14 22:57:57 2014 +0200
> # Node ID 1b57d1650cd2d5aa8c6cc103c344ecbd8fbabc39
> # Parent  1ae3cd6f836c3c96ee3e9a872c8e966750910c2d
> bdiff.c: implemented block-delimiting to better deal with long "lines"

http://mercurial.selenic.com/wiki/ContributingChanges

In particular, note the bit about "all lines less than 80 characters" in 
the patch description.

>
> Recent XML-based file formats often resemble human-readable text without a single line-break. This mostly comes from serialization of binary data into the XML format without well-forming the content for viewing. Storing such files with the current revlog implementation results in ineffective storage due to the used bdiff line-based algorithm. Since bdiff creates chunks based on the line-break mark, the whole file content is considered as one chunk, thus creating a delta as big (or even bigger) as the file itself.
>
> This patch is introducing block-limiting of lines. All lines encountered will be split into 4k blocks, thus giving the algorithm a chance to create smaller deltas, especially if the changes are at the end of the file. Especially for growing content where the header of the file is never changed, this patch increases the storage efficiency. However, with changes at the beginning of the file the block-limiting is not changing the results w.r.t. the original algorithm. The same is true for standard usage with text-files: because these usually contain lines shorter than 4k characters, the patch never kicks in.

On the face of it, I think this is a pretty decent idea.

It isn't clear from the patch description whether diffs produced by 
block-limiting are backwards-compatible. i.e. would a diff produced by 
this tweaked bdiff be applicable by an older Mercurial? It doesn't seem 
like you've changed the code that applies diffs, so I presume the answer 
is yes. Even so, it'll be good to have that explicitly stated in the 
description.

Can you add tests to make sure that this works and produces a split output?

There's a pure-Python implementation at 
http://selenic.com/hg/file/1ae3cd6f836c/mercurial/pure/bdiff.py -- could 
you make the same change to this as well?

>
> diff -r 1ae3cd6f836c -r 1b57d1650cd2 mercurial/bdiff.c
> --- a/mercurial/bdiff.c	Tue May 13 19:29:45 2014 -0500
> +++ b/mercurial/bdiff.c	Wed May 14 22:57:57 2014 +0200
> @@ -36,16 +36,17 @@
>   static int splitlines(const char *a, Py_ssize_t len, struct line **lr)
>   {
>   	unsigned hash;
> -	int i;
> +	int i, j;
>   	const char *p, *b = a;
>   	const char * const plast = a + len - 1;
>   	struct line *l;
>   
>   	/* count the lines */
>   	i = 1; /* extra line for sentinel */
> -	for (p = a; p < a + len; p++)
> -		if (*p == '\n' || p == plast)
> -			i++;
> +	j = 0; /* block counter for fixed line delimiting */
> +	for (p = a; p < a + len; p++, j++)
> +		if (*p == '\n' || p == plast || j >= 4096)
> +			i++, j = 0;
>   
>   	*lr = l = (struct line *)malloc(sizeof(struct line) * i);
>   	if (!l)
> @@ -53,11 +54,11 @@
>   
>   	/* build the line array and calculate hashes */
>   	hash = 0;
> -	for (p = a; p < a + len; p++) {
> +	for (p = a, j = 0; p < a + len; p++, j++) {
>   		/* Leonid Yuriev's hash */
>   		hash = (hash * 1664525) + (unsigned char)*p + 1013904223;
>   
> -		if (*p == '\n' || p == plast) {
> +		if (*p == '\n' || p == plast || j >= 4096) {
>   			l->hash = hash;
>   			hash = 0;
>   			l->len = p - b + 1;
> @@ -65,6 +66,7 @@
>   			l->n = INT_MAX;
>   			l++;
>   			b = p + 1;
> +			j = 0;
>   		}
>   	}
>   
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel at selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel