[PATCH] localrepo: introduce persistent caching of revset revision's branch names

Gregory Szorc gregory.szorc at gmail.com
Tue Oct 14 23:53:09 CDT 2014


On 10/14/14 7:58 PM, Mads Kiilerich wrote:
> On 10/15/2014 02:58 AM, Matt Mackall wrote:
>> On Tue, 2014-10-14 at 17:51 -0700, Pierre-Yves David wrote:
>>> On 10/14/2014 05:43 PM, Matt Mackall wrote:
>>>> On Wed, 2014-10-15 at 02:33 +0200, Mads Kiilerich wrote:
>>>>> # HG changeset patch
>>>>> # User Mads Kiilerich <madski at unity3d.com>
>>>>> # Date 1413333190 -7200
>>>>> #      Wed Oct 15 02:33:10 2014 +0200
>>>>> # Node ID 85189122b49b31dddbcddb7c0925afa019dc4403
>>>>> # Parent  48c0b101a9de1fdbd638daa858da845cd05a6be7
>>>>> localrepo: introduce persistent caching of revset revision's branch
>>>>> names
>>>>>
>>>>> It is expensive to create a changectx and extract the branch name.
>>>>> That shows
>>>>> up when filtering on branches in revsets.
>>>>>
>>>>> To speed things up, cache the results on disk. To avoid using too
>>>>> much space,
>>>>> all branch names are only stored once and each revision references
>>>>> one of these
>>>>> names. To verify that the cache is valid, we also store the tip
>>>>> hash in the
>>>>> cache file.
>>>> If we're going to add such a cache, I think it needs to not need
>>>> rebuilding across a strip.
>>> I'm not sure I get you. Do you mean you want the cache to be permanent
>>> (so using hash as key instead of rev)? Or do you want it to be
>>> properly invalidated in case of strip (some kind of cache key)? Or any
>>> of the previous, or something else?
>> If it takes a minute to build the cache, and takes a minute to rebuild
>> it after a strip, then it needs rethinking.
>
> I think the root cause of the problem is that it takes a minute to
> retrieve this information. The best solution would be if the data could
> be stored in a way where there was no need for this. That would be a
> different trade-off in the basic design of Mercurial and is out of scope.
>
> Another root cause of the problem is that strip works against the basic
> append-only design. I dislike the intrusiveness of phases and obsoletion
> markers but I guess something like that is better than strip. I think it
> is fair enough that a real garbage collection invalidates all caches.
>
> This cache I propose works efficiently with the normal append-only mode
> of operation. The size of the cache file is currently 4 bytes per
> revision in the repo (plus each branch name once). (Btw: this array
> is a perfect use case for blosc. In most cases the compressed size in
> bytes could probably be significantly smaller than the number of
> revisions.) The current design also works efficiently when operating in
> rev-land.
>
> The cache file _could_ also store the hash of each revision so it could
> reuse as much as possible after a strip. That would make it at least 5
> times bigger and less efficient. Would you prefer that? Or perhaps some
> "random" synchronization points could be a good trade-off?
>
> A "better" solution could be to maintain it at a lower level,
> integrating it closely with the changelog index. That would however be
> too intrusive, in my opinion.
>
> But mainly: This is just a cache, built on demand. There is no kind of
> lock-in for using it. It can always be ripped out and replaced with
> something else without any other cost that would put us in a worse
> position than if we didn't have it.
>
> It is annoying if a strip requires a rebuild that takes a minute. It is
> however a much bigger problem that a simple query for all changesets on
> a branch takes a minute every time. This patch is about addressing the
> bigger problem. It can always be improved to / replaced by something
> that also solves the smaller problems.
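
For concreteness, the layout described in the quoted message (each branch
name stored once, one 4-byte reference per revision, plus a hash used for
validation) could look roughly like the sketch below. This is not the code
from the patch. The function names, the header fields, and the
node_of_rev callback are hypothetical, and the rule that the stored hash
is compared against the revision at the last cached position (so that
plain appends keep the cache usable) is an assumption.

```python
import struct

HASH_LEN = 20  # length in bytes of a SHA-1 node hash


def encode_cache(last_node, branches):
    """Serialize branch names, where branches[rev] is the name for rev.

    last_node is the hash of the last revision covered by the cache.
    """
    names = []   # each distinct branch name, stored once
    index = {}   # branch name -> position in names
    refs = []    # one small integer per revision
    for name in branches:
        if name not in index:
            index[name] = len(names)
            names.append(name)
        refs.append(index[name])
    table = "\0".join(names).encode("utf-8")
    header = last_node + struct.pack(">II", len(table), len(refs))
    body = struct.pack(">%dI" % len(refs), *refs)  # 4 bytes per revision
    return header + table + body


def decode_cache(data, node_of_rev):
    """Return the cached branch names, or None if the cache is stale.

    node_of_rev(rev) is a hypothetical callback returning the hash the
    repository currently has for rev, or None if rev does not exist.
    """
    cached_node = data[:HASH_LEN]
    table_len, nrevs = struct.unpack_from(">II", data, HASH_LEN)
    if nrevs and node_of_rev(nrevs - 1) != cached_node:
        return None  # history below the cached tip changed, e.g. a strip
    start = HASH_LEN + 8
    names = data[start:start + table_len].decode("utf-8").split("\0")
    refs = struct.unpack_from(">%dI" % nrevs, data, start + table_len)
    return [names[r] for r in refs]
```

With this layout, appending revisions leaves the existing entries valid,
while a strip that removes or replaces the last cached revision makes
decode_cache return None, which is exactly the full-rebuild case under
discussion.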

Append-only is a nice ideal, but it is just that: an ideal. Transaction
rollbacks are effectively strips, and they can happen when, e.g., a
server-side hook rejects a push. If that hook (or a hook that ran before
it) accesses branch data and causes a cache update, and the rollback
then invalidates that cache, the next unsuspecting user triggers a fresh
cache rebuild and experiences extreme latency (if the repo is moderately
sized).

We see this at Mozilla with rejected pushes causing branchcache 
rebuilds. On our Try repo, this leads to CPU exhaustion from multiple 
clients all triggering cache generation simultaneously (because there is 
no lock on the cache). This is why we spent time at the Summit coming up 
with a better way to add non-revlog files into transactions.
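
To illustrate the missing-lock problem only: the sketch below lets one
process build a cache file while the others wait for the result instead of
all regenerating it at once. It is not how Mercurial handles this, and it
is not the transaction-based approach mentioned above. The function name,
the ".lock" and ".tmp" suffixes, and the polling parameters are all
hypothetical.

```python
import os
import time


def load_or_build(cache_path, build, poll=0.05, timeout=5.0):
    """Return the cache bytes, letting one process at a time run build()."""
    lock_path = cache_path + ".lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(cache_path, "rb") as f:
                return f.read()
        except FileNotFoundError:
            pass
        try:
            # O_EXCL makes creation fail if another process holds the lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError("cache lock held too long: " + lock_path)
            time.sleep(poll)
            continue
        try:
            os.close(fd)
            if os.path.exists(cache_path):
                continue  # someone finished while we were acquiring the lock
            data = build()
            tmp_path = cache_path + ".tmp"
            with open(tmp_path, "wb") as f:
                f.write(data)
            os.replace(tmp_path, cache_path)  # publish the file atomically
            return data
        finally:
            os.unlink(lock_path)  # always release the lock we created
```

A real implementation would also need to deal with stale locks left
behind by a crashed process, which this sketch does not attempt.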

Whatever you do, please avoid full cache (re)populations wherever possible.


