[PATCH 1 of 1] bdiff.c: implemented block-delimiting to better deal with long "lines"

Matt Mackall mpm at selenic.com
Tue May 27 16:36:45 CDT 2014


On Wed, 2014-05-14 at 23:54 +0200, Friedrich Kastner-Masilko wrote:
> # HG changeset patch
> # User Friedrich Kastner-Masilko <kastner-masilko at at.festo.com>
> # Date 1400101077 -7200
> #      Wed May 14 22:57:57 2014 +0200
> # Node ID 1b57d1650cd2d5aa8c6cc103c344ecbd8fbabc39
> # Parent  1ae3cd6f836c3c96ee3e9a872c8e966750910c2d
> bdiff.c: implemented block-delimiting to better deal with long "lines"
> 
> Recent XML-based file formats often resemble human-readable text
> without a single line break. This mostly comes from serializing
> binary data into XML without formatting the content for
> viewing. Storing such files with the current revlog implementation
> is inefficient because the bdiff algorithm is line-based.
> Since bdiff creates chunks based on the line-break character,
> the whole file content is treated as one chunk, producing a
> delta as big as (or even bigger than) the file itself.
> 
> This patch introduces block-limiting of lines. Long lines
> are split into 4k blocks, giving the algorithm a
> chance to create smaller deltas, especially if the changes are at the
> end of the file. For growing content where the header of
> the file never changes, this patch increases storage
> efficiency. However, with changes at the beginning of the file,
> block-limiting gives the same results as the original
> algorithm. The same is true for ordinary text files:
> because these usually contain lines shorter than 4k characters, the
> patch never kicks in.

So this looks fine as far as it goes, but I think we should try to go
one step further: getting somewhat repeatable block boundaries even
when data is inserted in the middle of a long line.

For instance, we could have a hierarchy of block break rules:

- break on newline
- break on [?_%] if > 1k  # some other rare characters?
- break if > 4k

This might give the delta engine some opportunities to realign.
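
A minimal sketch of such a rule hierarchy in C, purely as an
illustration. The function names, the soft/hard limit parameters and the
reading of "> 1k" and "> 4k" as block lengths in bytes are assumptions;
none of this is taken from the patch or from bdiff.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical set of "rare" characters used as soft break points. */
int is_soft_break(char c)
{
    return c == '?' || c == '_' || c == '%';
}

/*
 * Return the length of the next block starting at buf, with len bytes
 * remaining. Rules, in priority order:
 *   1. break right after a newline
 *   2. break right after one of [?_%] once the block is longer than soft
 *   3. break unconditionally once the block reaches hard bytes
 * If no rule fires, the rest of the buffer is one block.
 */
size_t next_block_len(const char *buf, size_t len, size_t soft, size_t hard)
{
    size_t i;

    assert(hard > 0 && soft <= hard);
    for (i = 0; i < len; i++) {
        if (buf[i] == '\n')
            return i + 1;
        if (i + 1 > soft && is_soft_break(buf[i]))
            return i + 1;
        if (i + 1 >= hard)
            return i + 1;
    }
    return len;
}

/* Count how many blocks the whole buffer splits into. */
size_t count_blocks(const char *buf, size_t len, size_t soft, size_t hard)
{
    size_t n = 0;

    while (len > 0) {
        size_t b = next_block_len(buf, len, soft, hard);
        buf += b;
        len -= b;
        n++;
    }
    return n;
}
```

With soft = 1024 and hard = 4096 this matches the list above. Because
the soft rule ties a boundary to the content rather than to a fixed
offset, blocks after an insertion can line up again at the next rare
character instead of staying shifted for the rest of the line.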

-- 
Mathematics is the supreme nostalgia of our time.



