Corrupted repositories on NFS

John Hein jhein at symmetricom.com
Sun Nov 28 17:56:46 CST 2010


Jesper Noehr wrote at 16:01 +1100 on Nov 26, 2010:
 > I've managed to corrupt a repository on an NFS mount by accessing it
 > from several clients at once. This shouldn't be possible, should it?

I've had trouble with NFS when a barrage of clones hits each night
from various clients running nightly builds.

I've never had enough good data to provide a good bug report (or to
decisively cast blame on hg vs. some network issue vs. NFS vs. a
particular OS' implementation of NFS, server or client).  Since
someone else is having trouble, I'll jump on the bandwagon and add my
two cents.  If it does turn out to be an hg "problem", apologies for
sitting on the report.  We worked around it by using ssh to the central
repository host.

In our case, it hasn't been repo corruption, but failed clones.
It does not always fail in the same place, which was one reason it
was hard to generate a good report.  But here's one example:

+ hg --traceback clone -U /base/hg/Release Release
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
Traceback (most recent call last):
  File "/usr/local/lib/python2.6/site-packages/mercurial/dispatch.py", line 58, in _runcatch
    return _dispatch(ui, args)
  File "/usr/local/lib/python2.6/site-packages/mercurial/dispatch.py", line 590, in _dispatch
    cmdpats, cmdoptions)
  File "/usr/local/lib/python2.6/site-packages/mercurial/dispatch.py", line 401, in runcommand
    ret = _runcommand(ui, options, cmd, d)
  File "/usr/local/lib/python2.6/site-packages/mercurial/dispatch.py", line 641, in _runcommand
    return checkargs()
  File "/usr/local/lib/python2.6/site-packages/mercurial/dispatch.py", line 595, in checkargs
    return cmdfunc()
  File "/usr/local/lib/python2.6/site-packages/mercurial/dispatch.py", line 588, in <lambda>
    d = lambda: util.checksignature(func)(ui, *args, **cmdoptions)
  File "/usr/local/lib/python2.6/site-packages/mercurial/util.py", line 427, in check
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.6/site-packages/mercurial/commands.py", line 736, in clone
    branch=opts.get('branch'))
  File "/usr/local/lib/python2.6/site-packages/mercurial/hg.py", line 337, in clone
    dest_repo.clone(src_repo, heads=revs, stream=stream)
  File "/usr/local/lib/python2.6/site-packages/mercurial/localrepo.py", line 1880, in clone
    return self.pull(remote, heads)
  File "/usr/local/lib/python2.6/site-packages/mercurial/localrepo.py", line 1289, in pull
    return self.addchangegroup(cg, 'pull', remote.url(), lock=lock)
  File "/usr/local/lib/python2.6/site-packages/mercurial/localrepo.py", line 1733, in addchangegroup
    if fl.addgroup(source, revmap, trp) is None:
  File "/usr/local/lib/python2.6/site-packages/mercurial/revlog.py", line 1336, in addgroup
    chunkdata = bundle.parsechunk()
  File "/usr/local/lib/python2.6/site-packages/mercurial/changegroup.py", line 174, in parsechunk
    data = self.read(l - 80)
  File "/usr/local/lib/python2.6/site-packages/mercurial/changegroup.py", line 141, in read
    return self._stream.read(l)
  File "/usr/local/lib/python2.6/site-packages/mercurial/util.py", line 976, in read
    for chunk in self.iter:
  File "/usr/local/lib/python2.6/site-packages/mercurial/util.py", line 954, in splitbig
    for chunk in chunks:
  File "/usr/local/lib/python2.6/site-packages/mercurial/localrepo.py", line 1615, in gengroup
    raise util.Abort(_("empty or missing revlog for %s") % fname)
Abort: empty or missing revlog for tbsg/extra-files/usr/tsc/hise/ats6000/ConfigScreen.xml
abort: empty or missing revlog for tbsg/extra-files/usr/tsc/hise/ats6000/ConfigScreen.xml


 > Mercurial version 1.7.1, python 2.6.

I've noticed it in various Mercurial versions since 1.3 at least.


 > I wrote this script: http://paste.pocoo.org/show/296128/
 > 
 > Please excuse the quality, it was hacked up quickly.
 > 
 > Anyway, after running this for a while, the repository becomes
 > corrupted. It leaves an abandoned transaction in the repository, and
 > you need to run "hg recover" to get it back into a working state.
 > 
 > I think I managed to trace down at least some of the reason why this happens.
 > 
 > In http://bitbucket.org/mirror/mercurial-crew/src/tip/mercurial/lock.py#cl-78,
 > it tries to make a lock, and if it fails due to the lock already being
 > there, it will call self.testlock(). self.testlock() can naturally
 > assume that the file exists (cause the OS just said so!), but over
 > NFS, it can happen that the file will no longer exist inside
 > 'testlock()'. We saw that happen, at least.
 > 
 > I modified http://bitbucket.org/mirror/mercurial-crew/src/tip/mercurial/util.py#cl-593
 > (util.readlock) to return a dummy string in case os.readlink raised
 > errno.ENOENT, triggering Mercurial's error.LockHeld, which seems to
 > have fixed that race condition.
 > 
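
For anyone following along, the readlock change described above could
look roughly like the sketch below.  This is not Mercurial's code and
not the actual patch: read_lock_tolerant and VANISHED are made-up
names, and it assumes the lock is a symlink whose target names the
owner.

```python
import errno
import os

# Hypothetical placeholder returned when the lock vanished under us.
# Any string that cannot be mistaken for a real lock owner would do.
VANISHED = "vanished:0"


def read_lock_tolerant(path):
    """Return the symlink target of a lock, or VANISHED if it is gone.

    The lock can disappear between a failed attempt to create it and
    this read, so a missing file is reported instead of raised.
    """
    try:
        return os.readlink(path)
    except OSError as err:
        if err.errno == errno.ENOENT:
            return VANISHED
        # Anything other than "no such file" is still a real error.
        raise
```

A caller would presumably treat the placeholder as "somebody held the
lock a moment ago" and retry, rather than dying on the missing file.
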
 > Secondly, unlinking on NFS is not atomic. The recommended way to go
 > about it is to 1. rename the file (which is atomic), and 2. unlink it.
 > Then you get the same guarantees you can get from a normal filesystem.
 > I've modified mercurial to rename, then unlink, in cases where it
 > deals with lockfiles. That fixes the other race.
 > 
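
The rename-then-unlink idea described above might be sketched as
follows.  Again, this is only an illustration under assumptions, not
the actual change: release_lock and the ".released" suffix are
invented names, and it takes the atomicity of rename as stated above
rather than proving it.

```python
import os
import uuid


def release_lock(path):
    """Release a lock file by renaming it first, then unlinking it.

    Returns True if this call removed the lock, False if the lock was
    already gone when we tried to rename it.
    """
    # Unique temporary name so two releasers never collide on it.
    tmp = "%s.%s.released" % (path, uuid.uuid4().hex)
    try:
        # Step 1: move the lock out of the way in a single operation.
        os.rename(path, tmp)
    except FileNotFoundError:
        return False
    # Step 2: remove the renamed file; nobody else looks at this name.
    os.unlink(tmp)
    return True
```

Once the rename succeeds, the lock name is free for the next taker,
and the unlink only touches a name no other client is waiting on.
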
 > I've run this on 2 clients, 8 threads on each, for about 4 hours now.
 > I haven't seen any corruptions. I do keep seeing a lot of these
 > though:
 > 
 > ERROR:root:working directory has unknown parent '8e16e3e8db02'!
 > Traceback (most recent call last):
 >   File "bombard.py", line 45, in bombard
 >     commands.commit(u, r, afn, message="lolol", user="lolol <lol at farm.org>", addremove=True)
 >   File "/home/jnoehr/env/lib/python2.6/site-packages/mercurial/commands.py", line 780, in commit
 >     node = cmdutil.commit(ui, repo, commitfunc, pats, opts)
 >   File "/home/jnoehr/env/lib/python2.6/site-packages/mercurial/cmdutil.py", line 1333, in commit
 >     return commitfunc(ui, repo, message, match(repo, pats, opts), opts)
 >   File "/home/jnoehr/env/lib/python2.6/site-packages/mercurial/commands.py", line 775, in commitfunc
 >     editor=e, extra=extra)
 >   File "/home/jnoehr/env/lib/python2.6/site-packages/mercurial/localrepo.py", line 873, in commit
 >     merge = len(wctx.parents()) > 1
 >   File "/home/jnoehr/env/lib/python2.6/site-packages/mercurial/context.py", line 120, in parents
 >     return self._parents
 >   File "/home/jnoehr/env/lib/python2.6/site-packages/mercurial/util.py", line 174, in __get__
 >     result = self.func(obj)
 >   File "/home/jnoehr/env/lib/python2.6/site-packages/mercurial/context.py", line 667, in _parents
 >     self._parents = [changectx(self._repo, x) for x in p]
 >   File "/home/jnoehr/env/lib/python2.6/site-packages/mercurial/context.py", line 27, in __init__
 >     self._node = self._repo.lookup(changeid)
 >   File "/home/jnoehr/env/lib/python2.6/site-packages/mercurial/localrepo.py", line 513, in lookup
 >     % short(key))
 > Abort: working directory has unknown parent '8e16e3e8db02'!
 > 
 > .... however, they don't seem to corrupt anything.
 > 
 > I'm chiming in here as I'm kind of in the dark whether this is an
 > actual bug in Mercurial, and whether my fix is actually "good."
 > 
 > Any comments appreciated.

I didn't see a patch showing your changes.  Could you send those

