Corrupted repositories on NFS

Jesper Noehr jesper at noehr.org
Thu Nov 25 23:01:52 CST 2010


Hi list,

I've managed to corrupt a repository on an NFS mount by accessing it
from several clients at once. This shouldn't be possible, should it?

Mercurial version 1.7.1, Python 2.6.

I wrote this script: http://paste.pocoo.org/show/296128/

Please excuse the quality, it was hacked up quickly.

Anyway, after running this for a while, the repository becomes
corrupted. It leaves an abandoned transaction in the repository, and
you need to run "hg recover" to get it back into a working state.

I think I managed to track down at least part of the reason this happens.

In http://bitbucket.org/mirror/mercurial-crew/src/tip/mercurial/lock.py#cl-78,
the lock code tries to create the lock, and if that fails because the
lock already exists, it calls self.testlock(). self.testlock() can
naturally assume that the file exists (because the OS just said so!),
but over NFS the file may already be gone by the time 'testlock()'
reads it. We saw that happen, at least.

I modified http://bitbucket.org/mirror/mercurial-crew/src/tip/mercurial/util.py#cl-593
(util.readlock) to return a dummy string when os.readlink raises
errno.ENOENT. That triggers Mercurial's error.LockHeld, which seems to
have fixed that race condition.
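
A minimal sketch of the idea, not the actual patch: read_lock and
VANISHED are hypothetical names, the placeholder string is an
assumption (the post does not say which dummy string was used), and
the code uses Python 3 syntax rather than the Python 2.6 mentioned
above.

```python
import errno
import os

# Hypothetical placeholder; any string that cannot name a real holder would do.
VANISHED = "unknown:0"


def read_lock(path):
    """Return the holder recorded in the symlink lock at `path`.

    If the lock disappears between the failed create and this read,
    return a placeholder so the caller treats the lock as held and
    retries, instead of letting ENOENT propagate.
    """
    try:
        return os.readlink(path)
    except OSError as exc:
        if exc.errno == errno.ENOENT:
            return VANISHED
        raise  # any other error is still reported to the caller
```

With this shape, a caller that compares the returned holder against
its own identity simply sees "held by someone else" and goes back to
waiting for the lock.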

Secondly, unlinking on NFS is not atomic. The recommended way to go
about it is to 1. rename the file (which is atomic), and 2. unlink it.
Then you get the same guarantees you can get from a normal filesystem.
I've modified Mercurial to rename, then unlink, wherever it removes
lockfiles. That fixes the other race.
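
The rename-then-unlink step could look roughly like this. It is a
sketch under assumptions, not the modified Mercurial code:
release_lock is a hypothetical helper, the temporary-name scheme is
made up for illustration, and it uses Python 3 syntax.

```python
import os
import uuid


def release_lock(path):
    """Remove the lock at `path` by renaming it first, then unlinking.

    The rename moves the lock to a name no other client looks at, so
    the final unlink cannot race with someone testing the real lock
    path. Returns False if the lock was already gone.
    """
    # Hypothetical unique name: pid plus a random suffix.
    tmp = "%s.%d.%s" % (path, os.getpid(), uuid.uuid4().hex)
    try:
        os.rename(path, tmp)
    except FileNotFoundError:
        return False
    os.unlink(tmp)
    return True
```

The temporary name has to be unique per client, otherwise two clients
releasing at once could rename onto each other's file.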

I've run this on 2 clients, 8 threads on each, for about 4 hours now.
I haven't seen any corruptions. I do keep seeing a lot of these
though:

ERROR:root:working directory has unknown parent '8e16e3e8db02'!
Traceback (most recent call last):
  File "bombard.py", line 45, in bombard
    commands.commit(u, r, afn, message="lolol", user="lolol <lol at farm.org>", addremove=True)
  File "/home/jnoehr/env/lib/python2.6/site-packages/mercurial/commands.py", line 780, in commit
    node = cmdutil.commit(ui, repo, commitfunc, pats, opts)
  File "/home/jnoehr/env/lib/python2.6/site-packages/mercurial/cmdutil.py", line 1333, in commit
    return commitfunc(ui, repo, message, match(repo, pats, opts), opts)
  File "/home/jnoehr/env/lib/python2.6/site-packages/mercurial/commands.py", line 775, in commitfunc
    editor=e, extra=extra)
  File "/home/jnoehr/env/lib/python2.6/site-packages/mercurial/localrepo.py", line 873, in commit
    merge = len(wctx.parents()) > 1
  File "/home/jnoehr/env/lib/python2.6/site-packages/mercurial/context.py", line 120, in parents
    return self._parents
  File "/home/jnoehr/env/lib/python2.6/site-packages/mercurial/util.py", line 174, in __get__
    result = self.func(obj)
  File "/home/jnoehr/env/lib/python2.6/site-packages/mercurial/context.py", line 667, in _parents
    self._parents = [changectx(self._repo, x) for x in p]
  File "/home/jnoehr/env/lib/python2.6/site-packages/mercurial/context.py", line 27, in __init__
    self._node = self._repo.lookup(changeid)
  File "/home/jnoehr/env/lib/python2.6/site-packages/mercurial/localrepo.py", line 513, in lookup
    % short(key))
Abort: working directory has unknown parent '8e16e3e8db02'!

... however, they don't seem to corrupt anything.

I'm posting here because I'm not sure whether this is an actual bug
in Mercurial, or whether my fix is actually "good."

Any comments appreciated.



Jesper

