[PATCH] localrepo: introduce persistent caching of revset revision's branch names

Wed Oct 15 13:37:43 CDT 2014

On 10/15/2014 05:36 AM, Mads Kiilerich wrote:
> On 10/15/2014 01:47 PM, Pierre-Yves David wrote:
>>
>>
>> On 10/15/2014 04:40 AM, Mads Kiilerich wrote:
>>> On 10/15/2014 02:45 AM, Pierre-Yves David wrote:
>>>>
>>>>
>>>> On 10/14/2014 05:33 PM, Mads Kiilerich wrote:
>>>>> # HG changeset patch
>>>>> # User Mads Kiilerich <madski at unity3d.com>
>>>>> # Date 1413333190 -7200
>>>>> #      Wed Oct 15 02:33:10 2014 +0200
>>>>> # Node ID 85189122b49b31dddbcddb7c0925afa019dc4403
>>>>> # Parent  48c0b101a9de1fdbd638daa858da845cd05a6be7
>>>>> localrepo: introduce persistent caching of revset revision's branch
>>>>> names
>>>>>
>>>>> It is expensive to create a changectx and extract the branch name.
>>>>> That shows
>>>>> up when filtering on branches in revsets.
>>>>>
>>>>> To speed things up, cache the results on disk. To avoid using too
>>>>> much space,
>>>>> all branch names are only stored once and each revision references
>>>>> one of these
>>>>> names. To verify that the cache is valid, we also store the tip hash
>>>>> in the
>>>>> cache file.
>>>>
>>>> I cannot see what is your invalidation strategy here.
>>>
>>> ??? It is right there. Obviously, if the cache not is valid it is
>>> invalid. But I must assume you are looking for something else. What?
>>>
>>> Or to rephrase it: When writing the cache file, it writes the tip
>>> revision number (+1) and the hash of tip. When reading the file, if that
>>> revision still has the same hash (even if it no longer is tip), the
>>> cache is considered valid as far as it goes.
>>
>> Hum, standard blindness here, my bad… This cache strategy is
>> vulnerable to operation (or chain of operation) that change repository
>> content without altering the hash number of commit or its hash. We
>> should probably use something stronger.
>
> Yes.
>
> Something like a linearly chained hash of the hashes comes to mind. It
> would however be too slow to validate the whole chain when accessing the
> cache, so it would be vulnerable to the same problems unless it has good
> low level support for invalidation and recomputation. And then, why not
> just invalidate/truncate caches at that point.

I'm thinking about this for some time and I think that storing such 
chained chain of hash in the repo itself could do the trick. (probably 
for each main happen only file in the repo, changelog and obsstore). If 
we also had such data stored we could quickly check if the "combined 
hash" for revnum X match the "combined hash" stored in the hash.

We could extend this logic by storing revnum + "combined hash
  at various checkpoint in your branch cache file to allow quick partial 
truncating.

> Something vector clock or bloom filter could perhaps also be used. A
> simpler solution would probably be an ever-increasing counter that are
> incremented on commits/changegroup additions _and_ on strip (and amend).

Not super fan of the ever increasing counter because you haveto rely on 
all clients bumping it and therefor requires a new .hg/requires entry.

Another approach would be to use a hash of all heads somewhere. But it 
does allow forward update of the cache.

I think the "combined" hash approach is worth poking at.

>> I know that this is used for branchmap as well. So yes branchmap is
>> vulnerable too. But branchmap is less vulnerable because it does not
>> depends of the order of the revision in the repo (revnum) while you do
>> here. We should build something stronger here.
>
> The current branchmap validation scheme can just as easily be tricked if
> the two tipmost revs are unrelated, both are stripped, a new evil commit
> is made on some branch, and then the old tip is pulled back in.

Yes, the current branchmap cache is too fragile too (but a still a bit 
less fragile because it resist to reordering the exact same head)

> A stronger solution would be nice. When we get it we can apply the same
> solution everywhere. Until we get the "cannot break" solution, I think
> it is ok to extend the use of the existing imperfect but "doesn't break"
> solution.

I think the current solution is significantly weaker for you type of 
cash. I do not think it is wise to use it.

-- 
Pierre-Yves David