Bug 4353 - dirstate can be corrupted by exceptions
Summary: dirstate can be corrupted by exceptions
Status: RESOLVED FIXED
Alias: None
Product: Mercurial
Classification: Unclassified
Component: Mercurial (show other bugs)
Version: 3.1.1
Hardware: PC Mac OS
: normal bug
Assignee: Bugzilla
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-09-03 18:50 UTC by Durham Goode
Modified: 2015-01-22 15:04 UTC (History)
3 users (show)

See Also:
Python Version: ---


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Durham Goode 2014-09-03 18:50 UTC
Our users have been experiencing increasingly frequent issues where the dirstate is no longer in sync with the manifest of the working copy.  The dirstate might contain files that the manifest doesn't, etc.

Currently, the dirstate is written by wlock.release(), which is called in a try/finally, so the dirstate is written even during an exception.  In code like the following, if you insert an exception it can result in a corrupt dirstate:

merge.py:update()

        if not partial:
            repo.setparents(fp1, fp2)
            raise Exception("THIS IS BAD")
            recordupdates(repo, actions, branchmerge)
            # update completed, clear state
            util.unlink(repo.join('updatestate'))

            if not branchmerge:
                repo.dirstate.setbranch(p2.branch())

        repo.dirstate.write()
    finally:
        wlock.release()

since the dirstate parents will be written to disk (as part of wlock.release()) but the dirstate entries won't have been changed.  For instance, hg debugdirstate will show files as 'n' which may not even exist anymore.


This is a particular terrible state for a user to get into because there is no real way for them to fix it without consulting a Mercurial expert.
Comment 1 Durham Goode 2014-09-03 18:51 UTC
My current idea on how to fix this is to only write the dirstate if there is no exception.  Perhaps by requiring calling 'wlock.commit()' or something similar before the finally.
Comment 2 Matt Mackall 2014-09-03 19:16 UTC
Dunno. The sequence here is:

 1) make a bunch of working copy changes
 2) make a bunch of dirstate changes
 3) write out dirstate

If we exit after 1 with or without writing, we still have problems. If we exit anywhere in this process, with or without writing out the dirstate, we're going to have a bad time to some degree. Even if we interleave 1/2/3 on a file by file basis (obvs slow), we're still going to be in a weird half-merged/updated state if interrupted. 

(And "rolling back" (1) is also more or less out of the question: there's obviously no such thing as a transaction on most filesystems and implementing our own is a special form of recursive, non-performant, ENOSPC hell.)

When you say "corrupt", what do you mean precisely? Unparsable, invalid, or "simply" not coherent with the working copy? Is the exception here KeyboardInterrupt or something else?

(This state should hopefully be showing up as an "interrupted update", yes?)
Comment 3 Durham Goode 2014-09-03 19:36 UTC
By 'corrupt' I mean, the dirstate contains, or is missing, files that don't exist in the dirstate's parent node's manifest.

1) If the dirstate contains files that don't exist in the manifest, this can result in files being permanently left on disk (which may break builds) since hg update will never delete them because the manifest merge never again calculates an action to delete the files.

2) If the dirstate is missing files from the manifest, it would result in those files showing up as '?' in hg status, despite them actually existing in the manifest.


Regarding your series of steps, the problem is not that the working copy contents is out of sync with the dirstate (as would happen if update actions are applied, then the dirstate fails to write).  That situation would be correctly shown in hg status. The problem is when the dirstate is internally inconsistent.  If it has a parent node that doesn't match it's contents, we get into a invalid state where hg status is flat out wrong.  Requiring a wlock.commit() style solution will mean the dirstate is always internally consistent.


Yes, the situation in the description leaves them in an 'interrupted update' state, but that doesn't appear to actually prevent them from running update again (which fixes the interrupted state, but doesn't fix the dirstate).
Comment 4 Matt Mackall 2014-09-03 20:08 UTC
Ok, so "simply" incoherent. The quotes are because this is definitely still a problem in my mind, just not something I'd technically classify as "corrupt".

I suppose I agree that failing to write the dirstate at all will look like just a big change that can be reverted/update -C'd and is thus significantly less bad.

We have lots of places where we do single changes to the dirstate (parents/add/remove/whatever) where the automatic write on dirty is fine, so I think we don't want to add a commit() to all those paths.

We also probably can't make the setparents last as having 'm' states when we're not in a merge will be weird. Instead, we might want to add some sort of start/end "transaction" for when we do a setparents AND (maybe a lot of) file changes.

Probably the easiest way to fix this is adding a dirstate._dontwriteyet var that's checked by .write(), incremented by .beginchange(), and decremented by .endchange(). Names subject to change.
Comment 5 Durham Goode 2014-09-04 14:21 UTC
I think dirstate.begin()/commit()/end() will be a little ugly.  It will end up looking like:

    try:
        wlock = repo.wlock()
        dirstate.begin()
        
        somefunc()

        dirstate.commit()
    finally:
        wlock.release()
        # dirstate.end has to be in the finally to decrement the nest counter.
        # It has to be after wlock.release so wlock.release doesn't perform it's write in the event of an exception.
        # This is a subtle ordering issue, which is likely to cause bugs.
        dirstate.end() 


    def wlock.release():
        # this already exists, and is needed for normal auto-commits
        self.dirstate.write()

    def dirstate.write():
        # Don't allow writes unless outside of a begin/end
        if self.dirstate._donotwrite == 0:
            write-to-disk

    def dirstate.commit():
        # Allow writes only in the top level begin/end
        if self.dirstate._donotwrite == 1:
            write-to-disk

I think tying it to the wlock would be cleaner and less bug prone.
Comment 6 Matt Mackall 2014-09-04 16:14 UTC
But that was not my proposal. Mine only has begin and end. So it looks like:

diff -r b2bcff93668e mercurial/dirstate.py
--- a/mercurial/dirstate.py	Tue Sep 02 12:11:36 2014 +0200
+++ b/mercurial/dirstate.py	Thu Sep 04 15:12:06 2014 -0500
@@ -44,6 +44,13 @@
         self._lastnormaltime = 0
         self._ui = ui
         self._filecache = {}
+        self._multistep = 0
+
+    def begin(self):
+        self._multistep += 1
+
+    def end(self):
+        self._multistep -= 1
 
     @propertycache
     def _map(self):
@@ -502,7 +509,7 @@
         self._dirty = True
 
     def write(self):
-        if not self._dirty:
+        if not self._dirty or self._multistep:
             return
 
         # enough 'delaywrite' prevents 'pack_dirstate' from dropping
diff -r b2bcff93668e mercurial/merge.py
--- a/mercurial/merge.py	Tue Sep 02 12:11:36 2014 +0200
+++ b/mercurial/merge.py	Thu Sep 04 15:12:06 2014 -0500
@@ -1134,6 +1134,7 @@
         stats = applyupdates(repo, actions, wc, p2, overwrite, labels=labels)
 
         if not partial:
+            repo.dirstate.begin()
             repo.setparents(fp1, fp2)
             recordupdates(repo, actions, branchmerge)
             # update completed, clear state
@@ -1141,6 +1142,7 @@
 
             if not branchmerge:
                 repo.dirstate.setbranch(p2.branch())
+            repo.dirstate.end()
     finally:
         wlock.release()
Comment 7 Durham Goode 2014-09-04 17:25 UTC
How does that handle exception recovery though?  We need finally's so the count isn't permanently screwed up when an exception is thrown, passes by a few begin/end's, and then is caught and handled.
Comment 8 Matt Mackall 2014-09-04 18:25 UTC
What is this "exception recovery" of which you speak? It sounds like black magic.

Realistically, if an exception is thrown, we can do one of two things:

- exit with an error
- discard the dirstate and reload from disk

..because the dirstate is in an indeterminate state and it's WAY more trouble to try to get it back in a well-defined state than to just reload it.

We might need to do the latter for the command server, I guess, perhaps something like this:

diff -r 1b66266170c6 mercurial/dirstate.py
--- a/mercurial/dirstate.py	Thu Sep 04 15:22:33 2014 -0500
+++ b/mercurial/dirstate.py	Thu Sep 04 17:19:52 2014 -0500
@@ -305,7 +305,7 @@
                 "_ignore"):
             if a in self.__dict__:
                 delattr(self, a)
-        self._lastnormaltime = 0
+        self._lastnormaltime = self._multistep = 0
         self._dirty = False
 
     def copy(self, source, dest):
@@ -509,7 +509,10 @@
         self._dirty = True
 
     def write(self):
-        if not self._dirty or self._multistep:
+        if self._multistep:
+            return False
+
+        if not self._dirty:
             return
 
         # enough 'delaywrite' prevents 'pack_dirstate' from dropping
diff -r 1b66266170c6 mercurial/localrepo.py
--- a/mercurial/localrepo.py	Thu Sep 04 15:22:33 2014 -0500
+++ b/mercurial/localrepo.py	Thu Sep 04 17:19:52 2014 -0500
@@ -1102,7 +1102,8 @@
             return l
 
         def unlock():
-            self.dirstate.write()
+            if self.dirstate.write() == False:
+                self.dirstate.invalidate()
             self._filecache['dirstate'].refresh()
 
         l = self._lock(self.vfs, "wlock", wait, unlock,
Comment 9 HG Bot 2014-09-11 13:15 UTC
Fixed by http://selenic.com/repo/hg/rev/12bc7f06fc41
Durham Goode <durham@fb.com>
dirstate: add begin/endparentchange to dirstate

It's possible for the dirstate to become incoherent (issue4353) if there is an
exception in the middle of the dirstate parent and entries being written (like
if the user ctrl+c's). This change adds begin/endparentchange which a future
patch will require to be set before changing the dirstate parent.  This will
allow us to prevent writing the dirstate in the event of an exception while
changing the parent.

(please test the fix)
Comment 10 HG Bot 2014-09-11 13:15 UTC
Fixed by http://selenic.com/repo/hg/rev/6f63c47cbb86
Durham Goode <durham@fb.com>
dirstate: wrap setparent calls with begin/endparentchange (issue4353)

This wraps all the locations of dirstate.setparent with the appropriate
begin/endparentchange calls. This will prevent exceptions during those calls
from causing incoherent dirstates (issue4353).

(please test the fix)
Comment 11 Matt Mackall 2015-01-22 15:04 UTC
Bulk testing -> fixed