Internal textdiff using bsdiff algo

Ralf Leibold Ralf.Leibold at nuance.com
Mon Jul 30 09:31:27 CDT 2007


Hi,

here is my third and last patch for now. 

I had some problems importing a fairly large CVS repository. The
repository contained a large XML file (more than 4 million lines), and
a later revision changed nearly every tenth line. Mercurial died during
the commit (after ~90 minutes) with a stack overflow in the recurse()
function of mercurial/bdiff.c. I read through the mailing list and
found several complaints about commits with large files, as well as a
suggestion to add a different diff algorithm for those cases. Although
a "simpler" one was suggested, I chose "bsdiff" (see
http://www.daemonology.net/bsdiff/) because it should also handle
binary files well. [The new file "mercurial/bsdiff.cpp" contains a
copyright notice that has to be kept. As far as I can see, there is no
other restriction on distributing this algorithm.]

I kept the algorithm but changed the post-processing to create a patch
that is backward compatible with the way Mercurial stores diffs, so
repositories using this algorithm can still be used with stock
Mercurial. The algorithm is enabled as an extension. Here is an example
.hgrc:

[extensions]
hgext.bsdiff=

[bsdiff]
switch_size = 10000000

This means that the bsdiff algorithm is used for any file whose input
or output version (the old or new text being diffed) is larger than
10000000 bytes. I limited the use of this algorithm to file contents
only, so changes to the manifest etc. are still stored with the
existing bdiff algorithm.
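
The size check described above can be sketched in Python as follows.
This is only an illustration, not code from the patch: the function
name use_bsdiff and its parameters are hypothetical, and the default
simply mirrors the switch_size value from the example .hgrc.

```python
def use_bsdiff(old_size, new_size, switch_size=10000000):
    """Decide whether the large-file algorithm should be used.

    Hypothetical helper: returns True when either the old (input) or
    the new (output) text is strictly larger than switch_size bytes.
    """
    return old_size > switch_size or new_size > switch_size


if __name__ == "__main__":
    # A small file stays on the default algorithm.
    print(use_bsdiff(4096, 8192))
    # A file that grows past the threshold switches algorithms.
    print(use_bsdiff(9000000, 12000000))
```

Checking both sizes means that a file switches algorithms as soon as
either side of the diff crosses the threshold.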

The original bsdiff bzip2-compresses the patch it creates and stores it
in a different layout, so its results are not optimized for "your"
repository patch format. But it worked fine for my use cases.
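
To make "repository patch format" concrete, here is a minimal sketch.
It assumes that a Mercurial delta is a sequence of fragments, each with
a 12-byte header of three big-endian 32-bit integers (start offset, end
offset, length of the replacement data) followed by the replacement
bytes. The helper names make_fragment and apply_delta are made up for
this illustration. They are not Mercurial APIs and not part of the
patch.

```python
import struct

HEADER = ">lll"  # assumed layout: start, end, length (big-endian)
HEADER_SIZE = struct.calcsize(HEADER)  # 12 bytes


def make_fragment(start, end, data):
    """Encode 'replace base[start:end] with data' as one fragment."""
    return struct.pack(HEADER, start, end, len(data)) + data


def apply_delta(base, delta):
    """Apply a delta made of consecutive fragments to base.

    Fragments are assumed to be sorted by offset and non-overlapping.
    """
    out = []
    last = 0  # end of the previously replaced region in base
    pos = 0   # read position in delta
    while pos < len(delta):
        start, end, length = struct.unpack(
            HEADER, delta[pos:pos + HEADER_SIZE])
        pos += HEADER_SIZE
        out.append(base[last:start])           # unchanged bytes
        out.append(delta[pos:pos + length])    # replacement bytes
        pos += length
        last = end
    out.append(base[last:])                    # unchanged tail
    return b"".join(out)


if __name__ == "__main__":
    base = b"hello world"
    delta = make_fragment(6, 11, b"there")
    print(apply_delta(base, delta))
```

Any diff algorithm that can express its result as such "replace this
byte range with these bytes" fragments could feed a format like this
one, which is presumably why only the post-processing had to change.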

I hope this is of use
Ralf



As the patch ("hg export" against revision "5026:48ebd6a83994" of
http://selenic.com/repo/hg) is quite big, I attached it as a text file.
I hope this is fine.

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: export.txt
Url: http://selenic.com/pipermail/mercurial-devel/attachments/20070730/8073390b/attachment-0001.txt 

