[PATCH] localrepo: introduce persistent caching of revset revision's branch names

Mads Kiilerich mads at kiilerich.com
Tue Oct 14 21:58:31 CDT 2014

On 10/15/2014 02:58 AM, Matt Mackall wrote:
> On Tue, 2014-10-14 at 17:51 -0700, Pierre-Yves David wrote:
>> On 10/14/2014 05:43 PM, Matt Mackall wrote:
>>> On Wed, 2014-10-15 at 02:33 +0200, Mads Kiilerich wrote:
>>>> # HG changeset patch
>>>> # User Mads Kiilerich <madski at unity3d.com>
>>>> # Date 1413333190 -7200
>>>> #      Wed Oct 15 02:33:10 2014 +0200
>>>> # Node ID 85189122b49b31dddbcddb7c0925afa019dc4403
>>>> # Parent  48c0b101a9de1fdbd638daa858da845cd05a6be7
>>>> localrepo: introduce persistent caching of revset revision's branch names
>>>> It is expensive to create a changectx and extract the branch name. That shows
>>>> up when filtering on branches in revsets.
>>>> To speed things up, cache the results on disk. To avoid using too much space,
>>>> all branch names are only stored once and each revision references one of these
>>>> names. To verify that the cache is valid, we also store the tip hash in the
>>>> cache file.
>>> If we're going to add such a cache, I think it needs to not need
>>> rebuilding across a strip.
>> I'm not sure I get you. Do you mean you want the cache to be permanent
>> (so using hash as key instead of rev?) Or do you want it to be
>> properly invalidated in case of strip (some kind of cache key) or any of
>> the previous or something else.
> If it takes a minute to build the cache, and takes a minute to rebuild
> it after strip.. then it needs rethinking.

I think the root cause of the problem is that it takes a minute to 
retrieve this information. The best solution would be if the data could 
be stored in a way where there was no need for this. That would be a 
different trade-off in the basic design of Mercurial and is out of scope.

Another root cause of the problem is that strip works against the basic 
append-only design. I dislike the intrusiveness of phases and obsoletion 
markers but I guess something like that is better than strip. I think it 
is fair enough that a real garbage collection invalidates all caches.

The cache I propose works efficiently with the normal append-only mode 
of operation. The size of the cache file is currently 4 bytes per 
revision in the repo (plus each branch name once). (Btw: this array 
is a perfect use case for blosc. The compressed size in bytes could 
probably in most cases be significantly smaller than the number of 
revisions.) The current design also works efficiently when operating 
in rev-land.
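
To make the size claim concrete, here is a rough sketch of such a 
layout. It is not the format used in the patch: the header order, the 
NUL-separated name table and the function names pack_cache and 
unpack_cache are all assumptions made for illustration. Only the three 
ingredients (tip hash, each branch name once, 4 bytes per revision) come 
from the description above.

```python
import struct

def pack_cache(tip_hash, branches):
    """Serialize branch names, one entry per revision (index == rev).

    Hypothetical layout: 20-byte tip hash, 4-byte length of the name
    table, NUL-separated branch names, then one 4-byte index per rev.
    """
    names = []          # each distinct branch name, stored once
    index = {}          # branch name -> position in names
    recs = bytearray()  # 4 bytes per revision
    for branch in branches:
        if branch not in index:
            index[branch] = len(names)
            names.append(branch)
        recs += struct.pack(">I", index[branch])
    names_blob = "\0".join(names).encode("utf-8")
    header = tip_hash + struct.pack(">I", len(names_blob))
    return bytes(header + names_blob + recs)

def unpack_cache(data, tip_hash):
    """Return the per-revision branch names, or None if the cache is stale."""
    if len(data) < 24 or data[:20] != tip_hash:
        return None  # tip changed (e.g. after a strip): rebuild from scratch
    (names_len,) = struct.unpack_from(">I", data, 20)
    names_blob = data[24:24 + names_len]
    names = names_blob.decode("utf-8").split("\0") if names_len else []
    recs = data[24 + names_len:]
    count = len(recs) // 4
    indexes = struct.unpack_from(">%dI" % count, recs)
    return [names[i] for i in indexes]
```

With a layout like this, looking up the branch of a revision is one 
array read, and appending new revisions only appends new records. Any 
change of tip that is not a plain append invalidates the whole file, 
which is the trade-off being discussed.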

The cache file _could_ also store the hash of each revision so it could 
reuse as much as possible after a strip. That would make it at least 5 
times bigger and less efficient. Would you prefer that? Or perhaps some 
"random" synchronization points could be a good trade-off?
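
As a sketch of what per-revision hashes would buy us (again only an 
illustration, with the hypothetical helper name reusable_prefix): after 
a strip, the cached hashes can be compared with the current changelog, 
and everything before the first mismatch can be kept.

```python
def reusable_prefix(cached_nodes, current_nodes):
    """Count how many leading revisions of the cache are still valid.

    cached_nodes:  node hashes stored in the cache, indexed by rev
    current_nodes: node hashes in the changelog now, indexed by rev
    Both are assumed to be plain sequences of bytes objects.
    """
    valid = 0
    for cached, current in zip(cached_nodes, current_nodes):
        if cached != current:
            break  # first revision renumbered or removed by the strip
        valid += 1
    return valid
```

Only the revisions from that point onwards would have to be recomputed, 
at the price of storing a hash next to every 4-byte entry.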

A "better" solution could be to maintain it at a lower level, 
integrating it closely with the changelog index. That would however be 
too intrusive, in my opinion.

But mainly: This is just a cache, built on demand. There is no kind of 
lock-in for using it. It can always be ripped out and replaced with 
something else, at no cost that would leave us in a worse position 
than if we had never had it.

It is annoying if a strip requires a rebuild that takes a minute. It is 
however a much bigger problem that a simple query for all changesets on 
a branch takes a minute every time. This patch is about addressing the 
bigger problem. It can always be improved, or replaced by something 
that also solves the smaller problems.

