D7199: lock: fix race in lock-breaking code

valentin.gatienbaron (Valentin Gatien-Baron) phabricator at mercurial-scm.org
Sat Nov 2 01:51:27 UTC 2019


valentin.gatienbaron created this revision.
Herald added a subscriber: mercurial-devel.
Herald added a reviewer: hg-reviewers.

REVISION SUMMARY
  With low frequency, I see hg pulls fail with output like:
  abort: no such file or directory: .hg/store/lock
  
  I think what happens is, in lock.py, in:
  
    def _testlock(self, locker):
        if not self._lockshouldbebroken(locker):
            return locker
    
        # if locker dead, break lock.  must do this with another lock
        # held, or can race and break valid lock.
  
  1. MARK try: with lock(self.vfs, self.f + b'.break', timeout=0): self.vfs.unlink(self.f) except error.LockError: return locker
  
  if a lock is breakable on disk, and two hg processes concurrently get
  to MARK, a possible interleaving is: process1 finishes executing the
  function and then process2 finishes executing the function. If that
  happens, process2 will either get ENOENT in self.vfs.unlink (resulting
  in the spurious failure above), or break a valid lock and potentially
  cause repository corruption.
  
  The fix is simple enough: make sure the lock is breakable _inside_ the
  critical section, because only then can we know that no other process
  can invalidate our knowledge on the lock on disk.
  
  I don't think there are tests for this. I've tested this manually
  with:
  
  diff --git a/mercurial/lock.py b/mercurial/lock.py
  
  - a/mercurial/lock.py
  
  +++ b/mercurial/lock.py
  @@ -351,6 +351,8 @@ class lock(object):
  
    if not self._lockshouldbebroken(locker):
        return locker
  
  +        import random
  +        time.sleep(1. + random.random())
  
  1. if locker dead, break lock.  must do this with another lock
  2. held, or can race and break valid lock. try:
  
  @@ -358,6 +360,7 @@ class lock(object):
  
            self.vfs.unlink(self.f)
    except error.LockError:
        return locker
  
  +        time.sleep(1)
  
    def testlock(self):
        """return id of locker if lock is valid, else None.
  
  and I see this change of behavior before/after this commit:
  
  $ $hg init repo
   $ cd repo
   $ ln -s $HOSTNAME/effffffc:987654321 .hg/wlock
   $ touch a
   $ $hg commit -Am_ & $hg commit -Am _; wait
  -abort: No such file or directory: '/tmp/repo/.hg/wlock'
   adding a
  +warning: ignoring unknown working parent 679a8959a8ca!
  +nothing changed

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D7199

AFFECTED FILES
  mercurial/lock.py

CHANGE DETAILS

diff --git a/mercurial/lock.py b/mercurial/lock.py
--- a/mercurial/lock.py
+++ b/mercurial/lock.py
@@ -355,6 +355,9 @@
         # held, or can race and break valid lock.
         try:
             with lock(self.vfs, self.f + b'.break', timeout=0):
+                locker = self._readlock()
+                if not self._lockshouldbebroken(locker):
+                    return locker
                 self.vfs.unlink(self.f)
         except error.LockError:
             return locker



To: valentin.gatienbaron, #hg-reviewers
Cc: mercurial-devel


More information about the Mercurial-devel mailing list