[PATCH 1 of 9 V3] cmdutil: add the class to restore dirstate at unexpected failure easily

Pierre-Yves David pierre-yves.david at ens-lyon.org
Mon May 11 19:49:43 CDT 2015



On 05/11/2015 08:49 AM, FUJIWARA Katsunori wrote:
>
> At Fri, 08 May 2015 19:00:02 -0700,
> Pierre-Yves David wrote:
>>
>>
>>
>> On 05/06/2015 08:15 PM, FUJIWARA Katsunori wrote:
>>> # HG changeset patch
>>> # User FUJIWARA Katsunori <foozy at lares.dti.ne.jp>
>>> # Date 1430968030 -32400
>>> #      Thu May 07 12:07:10 2015 +0900
>>> # Node ID 449a46109f0c96bb8d82d4edf2c8855341558c2f
>>> # Parent  8174d27576a3fff28fb6f952a4c89d5c0b900214
>>> cmdutil: add the class to restore dirstate at unexpected failure easily
>>>
>>> Before this patch, after "dirstate.write()" execution, there is no way
>>> to restore dirstate to the original status before "dirstate.write()".
>>>
>>> In some code paths, "dirstate.invalidate()" is used as a kind of
>>> "restore .hg/dirstate to the original status". But it just avoids
>>> writing changes in memory out, and doesn't actually restore
>>> ".hg/dirstate" file. "dirstate.write()" prevents it from working as
>>> expected.
>>>
>>> To fix the issue that recent (in memory) dirstate isn't visible to
>>> external process (e.g. "precommit" hooks), "dirstate.write()" should
>>> be invoked before invocation of external process. But at the same
>>> time, ".hg/dirstate" should be restored to the status before
>>> "dirstate.write()" at unexpected failure in some cases.
>>>
>>> This patch adds the class "dirstateguard" to easily restore
>>> ".hg/dirstate" at unexpected failure. Typical usecase of it is:
>>>
>>>       # (1) build dirstate up
>>>       ....
>>>
>>>       # (2) write dirstate out, and backup ".hg/dirstate"
>>>       dsguard = dirstateguard(repo, 'scopename')
>>>       try:
>>>           # (3) execute somethig to do:
>>>           #     this may imply making some additional changes on dirstate
>>>           ....
>>>
>>>           # (4) unlink backup-ed dirstate file at the end of dsguard scope
>>>           dsguard.close()
>>>       finally:
>>>           # (5) if execution is aborted before "dsguard.close()",
>>>           #     ".hg/dirstate" is restored from the backup
>>>           dsguard.release()
>>
>> This "invocation" before external process, sounded really like the
>> "pending mechanism" used in transaction. Things are a bit different
>> since the dirstate also refers to content of the working copy, that have
>> been already flushed to disk. But as far as I understand, the dirstate
>> refers to parents that may be only contained in the the transaction.
>>
>> The real question here is: What should external writer see during the
>> process.
>>
>> If it make more sense for external writer to be able to have a peek at
>> the dirstate, we should write it.
>> If external writer cannot understand what is in the dirstate data, it
>> should not be written to disk until the transaction is completed, and
>> using a `.pending` mechanism instead.
>>
>>
>>> For this kind of issue, "extending transaction" approach (in
>>> https://titanpad.com/mercurial32-sprint) seems not to be suitable,
>>> because:
>>>
>>>     - transaction nesting occurs in some cases (e.g. "shelve => rebase"), and
>>>
>>>     - "dirstate" may be already modified since the beginning of OUTER
>>>       transaction scope, then
>>>
>>>     - dirstate should be backed up into the file other than
>>>       "dirstate.journal" at the beginning of INNER transaction scope, but
>>>
>>>     - such alternative backup files are useless for transaction itself,
>>>       and increases complication of its implementation
>>
>> Some part about the transaction confuses me.
>>
>>    - Yes, you are right, the dirstate changes may have a wider life spawn
>> than the transaction.
>>
>>    - Therefor doing backup there is not suitable (and is maybe never
>> useful) (your are still right there)
>>
>>    - However, for any changes started inside a transaction, we should
>> never write data in .hg/dirstate until the transaction is actually
>> committed (because it likely refers to unwritten data in the
>> transaction). This means that hooks/editors/etc would need to access a
>> .hg/distate.pending in that case.
>>
>>    - The transaction API allow for both "wait until transaction is closed
>> to flush that file" logic and "please write pending data for hooks"
>> logic. It looks like the dirstate handling should collaborate with the
>> transaction if one exist when dirstate manipulation are done. I'm not
>> sure how much it happen.
>
> Committing via "repo.commit()" may cause invoking editor process out
> of transaction scope, but flushing dirstate is still needed: for
> example, "hg commit -A" changes status of each unknown files before
> "repo.commit()".
>
> `.pending` mechanism of transaction itself can't be used directly in
> such cases.
>
> But on the other hand, editor process may be invoked in transaction
> scope, and (maybe) pending changelog and so on should become visible
> for external process: for example, "hg import -e" with multiple
> patches should cause such situation (BTW, rebasing multiple revisions
> creates a transaction for each revisions to be rebased).
>
> So, I'll try to follow the style of `.pending` mechanism of
> transaction for dirstate.

The way transaction works for that (roughly) is:

- write all files with a '.pending' suffix,
- set an environment variable HG_PENDING=<repo-root>
- Mercurial process during hooks detect the HG_PENDING vars and read the 
.pending file if they exists.

As you point out, there is case where the dirstate writing is not done 
during a transaction (eg: update). And in this case, we should -not- 
collaborate with the transaction and we should -not- use 
dirstate.pending file.

However, if we are in the middle of a transaction, the new dirstate will 
likely point to content of the transaction and we should rely on the 
transaction to generate a '.pending' file when needed (also meaning, we 
should teach dirstate to read them) and to flush data to disk when the 
the transaction is committed (see the file generator API, taking care of 
both).

(you can use `localrepo.currenttransaction()` to check if a transaction 
is open and if we jump onboard).

All that said, current dirstate implementation is not collaborating with 
transaction either. So your changes are an improvement. I'll take your 
series as is (with a couple of typo, and name fixed) and will hope for 
more follow up on this topic.

>> Another things that concerns me is "re-entrance", you code use a
>> distinct object for each scope, using its own backup and everything.
>> This seems simple enough, provide nesting and work probably well.
>> However your backup is based on the "name" provided in the code. This
>> mean that two nested calls with the same "name" will overwrite each
>> other. We should throw some unique identifier in there. `id(self)` would do.
>
> I don't have any reasons to insist using "name" for identifier. I'll
> revise around it. BTW, what about using both "name" and `id(self)` for
> trouble shooting/debugging ?

This sounds like a plan, I'll fix it in-flight.

>> This re-entrance business transition perfectly to my last question: It
>> seems like this superseed the beginparentchange/endparentchange system
>> previously in place. However, there is still multiple call to this
>> system. Could we, in theory replace them will with the new system? If
>> not, why?
>
> begin/end-parentchange system seems handy for small (a few lines)
> scope changes, because it doesn't need writing current dirstate out
> and backing it up.
>
> For example, current "heavy" dirstateguard doesn't seem suitable for
> "committablectx.markcommitted()".
>
>      def markcommitted(self, node):
>          self._repo.dirstate.beginparentchange()
>          for f in self.modified() + self.added():
>              self._repo.dirstate.normal(f)
>          for f in self.removed():
>              self._repo.dirstate.drop(f)
>          self._repo.dirstate.setparents(node)
>          self._repo.dirstate.endparentchange()
>
> For such lightweight use, we may have to add the flag to
> enable/disable "writing current dirstate out and backing it up" to
> dirstateguard: when backup is disabled, "dirstate.write()" may have to
> be blocked for safety.
>
> On the other hand, begin/end-parentchange system doesn't seem suitable
> for non-small scope or complicated changes, because it is difficult to
> ensure that "dirstate.write()" isn't implied in such case (like ones
> in mq, commands.import and cmdutil.tryimportone replaced by this
> series).
>
> begin/end-parentchange in this condition should be replaced by
> dirstateguard or so, as soon as possible.

I think we should have a single mechanism of people will use it wrong. 
We should also makes it mandatory (or at least issue devel-warning) to 
get people to use it.

Can we flag the old system for deletion and hunt down all it's current user?

>> Also, this old system was living on the dirstate object directly. would
>> it make sense to move the guardian business there?
>
> IMHO, yes, even though some more preparations may be needed to
> complete moving.

Could be done in a follow up.

>> Summary of questions/topics
>> ===========================
>>
>> - Do we need a pending mechanism ?
>> - It seems like it would make sense to attach the logic to the
>> transaction if one exists. Though?
>> - Possible "name" collision and necessity to make identifier unique.
>> - What's the fate of beginparentchange
>>
>> Beside the question aboves (and some typo I can fix in-flight), the
>> series looks like an improvement to me. I'll take it once the above
>> point are clarified.
>>
>>> +class dirstateguard(object):
>>> +    '''Restore dirstate at unexpected failure.
>>> +
>>> +    At the construction, this class does:
>>> +
>>> +    - write current ``repo.dirstate`` out, and
>>> +    - save ``.hg/dirstate`` into the backup file
>>> +
>>> +    This restores ``.hg/dirstate`` from backup file, if ``release()``
>>> +    is invoked before ``close()``.
>>> +
>>> +    This just removes the backup file at ``close()`` before ``release()``.
>>> +    '''
>>> +
>>> +    def __init__(self, repo, name):
>>> +        repo.dirstate.write()
>>> +        self.repo = repo
>>> +        self.name = 'dirstate.backup.%s' % name
>>
>> this "name" is actually a filename, would we rename that into 'filename'?
>> (could be a follow up)
>
> I'll revise so.

I'll apply that inflight

>
>>> +        repo.vfs.write(self.name, repo.vfs.tryread('dirstate'))
>>> +        self.active = True
>>> +        self.closed = False
>>
>> also: Could we have all the attributes marked private as private
>> (could be a follow up)
>
> I'll revise so.

I'll apply that inflight

-- 
Pierre-Yves David


More information about the Mercurial-devel mailing list