Regular repository corruption -- help needed.

Bryan O'Sullivan bos at serpentine.com
Wed Dec 19 18:49:09 CST 2012


On Wed, Dec 19, 2012 at 3:50 PM, Alexander Krauss <krauss at in.tum.de> wrote:

> I appreciate any help on how to get to the source of the problem.
>

This is extremely likely to be an NFS-related problem, but I don't have
quite enough information to call it with 100% certainty yet.

But really, NFS implementations are famously buggy, Linux's has
historically been even more buggy, and you clearly already know that NFS is
likely to be the root of the problem.

In effect, you have been playing Russian roulette, and are now asking for
help diagnosing the hole in your head, just in case it might not have been
the loaded gun.

In your bad repo, the corruption in the changelog starts at byte 7,581,130.

Here's what we see in a hex dump of the 00changelog.d file (offsets are
decimal).

7581120 52 1f a0 df f3 1f e0 90 20 fa 00 00 00 00 00 00
7581136 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        ...
7581584 00 00 00 00 00 00 00 00 00 00 00

As you can see, where there should be valid data, there is instead a big
span consisting of hundreds of bytes of zeroes. The file contains exactly
the correct number of bytes; they're just not the right bytes.
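
If you want to look for the same kind of damage in other revlog files, a
scan for long runs of zero bytes is a reasonable first pass. The following
is only a rough sketch: the function name find_zero_runs and the 64-byte
threshold are my own assumptions, not anything Mercurial provides, and a
legitimate file can contain short runs of zeroes, so treat hits as leads
rather than proof.

```python
import re


def find_zero_runs(data, min_len=64):
    """Return (offset, length) for every run of at least min_len zero bytes.

    `data` is the raw file content as bytes. The threshold is an arbitrary
    assumption; tune it to taste.
    """
    # Greedy match, so each hit covers one maximal run of NUL bytes.
    pattern = re.compile(rb"\x00{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]


def scan_file(path, min_len=64):
    """Read `path` fully into memory and report suspicious zero runs."""
    with open(path, "rb") as f:
        return find_zero_runs(f.read(), min_len)
```

Calling scan_file on the 00changelog.d file of a suspect repository should
list the decimal offset and length of each zero-filled span, which you can
compare against a hex dump.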

During a push, Mercurial uses a special redirection mechanism to add
entries to the changelog file. That mechanism is very simple: it writes
data to a temporary file, then reads the contents of that file and writes
them to the end of the real changelog after all of the changegroup has
streamed through.
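
To make the shape of that mechanism concrete, here is a minimal sketch of
the write-then-copy pattern described above. It is not Mercurial's actual
code: the function name append_via_tempfile is hypothetical, and placing
the temporary file next to the target is my assumption. The point is only
that the data is written once, read back, and then appended, so a bad
read-back ends up in the real file.

```python
import os
import shutil
import tempfile


def append_via_tempfile(target_path, chunks):
    """Buffer `chunks` in a temporary file, then append them to target_path.

    Illustrative only; the temp file is created in the target's directory
    (an assumption) so that it lives on the same filesystem.
    """
    target_dir = os.path.dirname(os.path.abspath(target_path))
    with tempfile.TemporaryFile(dir=target_dir) as tmp:
        # Phase 1: stream all incoming data into the temporary file.
        for chunk in chunks:
            tmp.write(chunk)
        tmp.flush()

        # Phase 2: read the temporary file back and append it to the
        # real file. Whatever the read returns is what gets appended.
        tmp.seek(0)
        with open(target_path, "ab") as target:
            shutil.copyfileobj(tmp, target)
```

Note that the number of bytes appended matches the number written in phase
1 even if the read in phase 2 returns the wrong contents.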

The forensics that I see are 100% consistent with a situation where your
NFS client happily says "here's your file full of zeroes" when you read
back a file (not full of zeroes) that you wrote earlier. You are not
guaranteed write-to-read consistency even on a single node (whatever the
specs may say).
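
If you want to poke at this yourself, a crude write-then-read-back check
looks something like the sketch below. The helper name readback_matches and
the 1 MiB payload size are assumptions of mine. Be aware that a passing run
proves very little, since this kind of failure is typically intermittent
and the read may be served from a local cache; a failing run, on the other
hand, would be quite telling.

```python
import os
import tempfile


def readback_matches(directory, size=1 << 20):
    """Write random bytes to a new file in `directory`, read them back,
    and return True if the contents match what was written."""
    payload = os.urandom(size)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # Write the payload and close the file before reading it back.
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        with open(path, "rb") as f:
            return f.read() == payload
    finally:
        # Always clean up the scratch file.
        os.unlink(path)
```

Run it in a loop against a directory on the NFS mount while pushes are in
flight, and log any call that returns False.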

I have a few questions around your report.

What is your NFS server setup? What OS is the server running, and what
version of NFS and transport are you using?

You say you've seen this happen on a "plain local push", too - but was that
also on an NFS filesystem? (I bet you a pfennig it was.)

When are you going to switch to Bitbucket? You know it's free, right?