Proposal for detecting history rewriting on shared repos

Fri Feb 14 13:01:13 CST 2014

On 2/13/14, 10:37 AM, Pierre-Yves David wrote:
>
>
> On 02/12/2014 07:14 PM, Gregory Szorc wrote:
>> On 2/12/14, 4:24 PM, Pierre-Yves David wrote:
>>>
>>>
>>> On 02/12/2014 04:20 PM, Gregory Szorc wrote:
>>>> The share extension and workflow is very fragile. If rewriting
>>>> occurs on
>>>> the original repository, there's a good chance shared clones of that
>>>> repo will get corrupted. While there is a giant warning in the
>>>> output of
>>>> `hg help share` to warn you about this, Mercurial currently offers
>>>> little to no assistance to detect and recover from this.
>>>
>>> […]
>>>
>>>> Thoughts?
>>>
>>> The branch cache has logic to detect non-append-only operations on the
>>> view. The same kind of logic should be applicable here.
>>>
>>> I know that the current cache key generated by branchcache has some
>>> weaknesses in some corner cases. If you hit them, feel free to improve it.
>>
>> I didn't realize that code existed!
>>
>> It sounds like you are proposing storing a hash of some set of revlog
>> data (possibly the revs or nodes of the changelog) as the store ID. I
>> think this could work. You're essentially proposing a direct test vs an
>> indirect one. The indirect one, while faster, relies on code paths being
>> complete or else we miss updates.
>
> I am also advocating for reusing and improving existing logic instead of
> creating a new mechanism for every use case. I'm not certain it covers
> your use case, but it is worth having a look.
>
>> The current branch cache code is computing a hash over all filtered
>> revs. I /think/ that because the branch cache doesn't care about
>> filtered revs that it can get away with computing just the revs and not
>> nodes?
>
> Not sure I understand. We have multiple branch caches for different levels
> of filtering, so the cache key has to include information about which
> revs were excluded in the computation. This should not be relevant for
> your case.
>
>> For the share case, I /think/ we would need to hash nodes so history
>> rewriting that doesn't change rev count won't fall through a crack.
>
> The branch map cache reasons on nodes only (mostly) and is not
> affected by rev ordering (the very same content, but a different order of
> the nodes). Is the share extension affected by it? (I believe the question
> can be reduced to "does the working copy data store anything related to
> revs?")

That is the question this boils down to. If anything outside of 
.hg/store is caching revs instead of nodes, we can't hash revs and get 
reliable results. I don't know enough about these subsystems to answer 
that question.

Regardless, we should be able to make the store ID:

   <last tip rev> <last tip node> <hash>

On open, we compare the stored <rev> and <node> against the shared source. If
they differ, revlogs have been rewritten and we are corrupt. If they are the
same, we still need to recompute the hash to verify that everything before
that rev is unchanged, as non-ancestor revlog entries before it may have been
reordered. Or could we ignore non-ancestors? Would anything break if unrelated
revlog entries were rewritten? Again, I'm not sure.
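
To make the idea concrete, here is a rough sketch of the store ID and the
on-open check. This is only an illustration, not Mercurial code: the changelog
is modeled as a plain list of 20-byte nodes indexed by rev, the function names
compute_store_id and check_store are made up, and SHA-1 over the nodes in rev
order is an assumed choice of hash. It also assumes the check looks up the
recorded rev in the source (so plain appends still pass) and takes the strict
option of hashing every entry up to that rev, non-ancestors included.

```python
import hashlib

NULL_REV = -1
NULL_HEX = "0" * 40  # hex form of the all-zero null node


def _hash_nodes(nodes):
    """Hash the nodes in rev order, so a reordering changes the digest."""
    digest = hashlib.sha1()
    for node in nodes:
        digest.update(node)
    return digest.hexdigest()


def compute_store_id(nodes):
    """Build '<last tip rev> <last tip node> <hash>' for a list of nodes.

    `nodes` is a hypothetical stand-in for the changelog: a list of
    20-byte node ids where the list index is the rev number.
    """
    if not nodes:
        return "%d %s %s" % (NULL_REV, NULL_HEX, _hash_nodes([]))
    tip_rev = len(nodes) - 1
    return "%d %s %s" % (tip_rev, nodes[tip_rev].hex(), _hash_nodes(nodes))


def check_store(saved_id, nodes):
    """Return True if `nodes` still begins with the history in `saved_id`."""
    rev_text, node_hex, saved_hash = saved_id.split()
    rev = int(rev_text)
    if rev == NULL_REV:
        return True  # an empty history is a prefix of anything
    if rev >= len(nodes):
        return False  # the recorded tip was stripped
    if nodes[rev].hex() != node_hex:
        return False  # a different changeset now lives at that rev
    # Same rev and node: still verify that earlier entries were not reordered.
    return _hash_nodes(nodes[:rev + 1]) == saved_hash
```

With this layout, appending new changesets to the source keeps the check
passing, while stripping the tip, replacing it, or reordering earlier entries
makes it fail. Whether hashing non-ancestors is actually required is the open
question above.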