Bug 4531 - Populating rbc branch cache is extremely slow (>1 minute)
Summary: Populating rbc branch cache is extremely slow (>1 minute)
Status: RESOLVED FIXED
Alias: None
Product: Mercurial
Classification: Unclassified
Component: Mercurial
Version: 3.3
Hardware: PC Mac OS
Importance: normal bug
Assignee: kiilerix
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-02-05 20:09 UTC by Durham Goode
Modified: 2015-03-10 01:00 UTC
CC: 5 users

See Also:
Python Version: ---


Attachments

Description Durham Goode 2015-02-05 20:09 UTC
The new rbc branch cache (http://selenic.com/hg/rev/cb99bacb9b4e) is extremely slow to populate on large repositories.  On the hg repo it takes 1.5 seconds, but on a repository with hundreds of thousands of commits it takes over a minute.  Additionally, the cache is not written to disk unless a write operation happens (like a commit).

This is a major regression and affects all newly cloned repositories when they do any operation that touches branch information.
Comment 1 kiilerix 2015-02-05 20:53 UTC
That is on a repo where for example (very simplified)
  hg id -r 'branch(default)'
would take the same amount of time?
Comment 2 Gregory Szorc 2015-02-05 20:58 UTC
Would adding rbc branch cache transfer to bundle2 solve your problems?
Comment 3 Durham Goode 2015-02-05 22:02 UTC
Yes, 'hg id -r branch(default)' takes a long time too without the cache.  Prior to the regression (introduced in 678f53865c68) that command took 0.34s.  Now it takes 58s.

Can we change the cache to actually just cache the requested data?  Instead of precomputing it for the entire repository?  Using bundle2 to copy it to clients just circumvents the problem that populating the cache doesn't scale.
Comment 4 kiilerix 2015-02-05 23:03 UTC
Sorry - that was a bad example that is just suitable for showing the worst case you describe. I was trying to assess the speed of your repo and compare the "over a minute" with other operations. But branch slowness is of course mainly an issue for those of us who use named branches and have queries where
  hg log -b stable -T. --time
is a realistic benchmark. Not so much of an issue if you only need the head of the default branch.

I agree that populating the repo initially has a notable cost. Repeated population because the operation is readonly is of course much worse and just bad. But are you sure that populating the cache doesn't scale, also for you? It scales linearly with repo size like cloning does and is approximately as expensive as a log -b. That is still cheap compared to the other operations that only happen once. Or does your custom backend avoid cloning and have much faster branch name lookup than normal changelogs?
Comment 5 Durham Goode 2015-02-05 23:29 UTC
Any operation that is O(size of repo) won't scale unfortunately. An operation that needs to read data from inside each commit will have to parse 700MB (and growing) of data, most of which isn't even in the page cache.  Is the cache ordered?  Can we just append new entries to it as we read them? At the very least, we could have a config flag for disabling this feature.



Regarding clones and log -b:

Normal clones don't scale either (computing the changegroup is too slow), so we do streaming clones, and that only scales because we don't download filelogs (and the per commit overhead of the changelog+manifest is tiny).

log -b is also slow, but at least we can optimize that to start returning results as soon as it finds them, so the user doesn't have to wait.
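The "ordered cache, append entries as we read them" idea in comment 5 can be sketched roughly as follows. This is a minimal illustration only; the class and callback names are hypothetical and this is not Mercurial's actual cache code.

```python
# Sketch of an append-on-read branch cache: entries are stored in
# revision order, so a reader can extend the cache lazily up to the
# revision it needs, instead of precomputing the whole repository.
# All names are hypothetical; this is not Mercurial's actual code.

class LazyBranchCache:
    def __init__(self, compute_branch):
        self._compute = compute_branch  # slow per-revision lookup
        self._entries = []              # cache, ordered by revision number

    def branch(self, rev):
        # Fill any gap up to `rev`, appending as we go, so repeated
        # queries never recompute earlier revisions.
        while len(self._entries) <= rev:
            self._entries.append(self._compute(len(self._entries)))
        return self._entries[rev]
```

Because the list is ordered by revision number, persisting it is a simple append of the new tail to the cache file, which is the property that makes incremental writes cheap.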
Comment 6 Matt Mackall 2015-02-06 13:01 UTC
All our other caches try to write the cache whenever possible. Not doing that here is silly.
Comment 7 kiilerix 2015-02-06 13:44 UTC
I could not find a way to do it from revsets that was not silly.
Comment 8 Durham Goode 2015-02-11 12:46 UTC
I sent some work-in-progress patches to Mads and Pierre-Yves that make this cache lazy.  I'll send them to the mailing list once I'm sure I've addressed all the issues they discussed in the original series.
Comment 9 kiilerix 2015-02-21 14:41 UTC
First:
If using database terms, the changelog branch (and changeset) information is pretty much a 'computed value'. It currently cannot be retrieved efficiently. Some "real" databases can build indices over computed values, but I decided to name the revision branch cache a 'cache' because it lives in the same layer as the branch head cache and is not always fully updated. Ideally, the changelog format itself would allow efficient retrieval of meta info and/or it would maintain relevant indices for the most common queries.

(In reply to comment #5)
> Any operation that is O(size of repo) won't scale unfortunately. An
> operation that needs to read data from inside each commit will have to parse
> 700MB (and growing) of data, most of which isn't even in the page cache.  Is
> the cache ordered?  Can we just append new entries to it as we read them? At
> the very least, we could have a config flag for disabling this feature.

Well ... just streaming the changelog like you do is just as much O(size of repo) ... but might have a constant factor that is different in your setup. In the normal case, these 700 MB have just been transferred and written - operations that seem significantly more expensive than reading them.

I will still argue that in all "real world" scenarios, populating the rbc after cloning
* is very cheap compared to the cloning itself. 
* is not more costly than common slow operations which it makes several times faster.

It might be relevant to transfer the rbc too when cloning a repo. I am not sure it is worth it, but that could be investigated.

In your special setup with a special storage format and very efficient cloning, you might not need the rbc, but I think everybody else does. It seems very reasonable for your extension to disable this functionality that is perhaps no longer relevant alongside it. I think the rbc in most cases adds so much value at a low worst-case cost that we don't want a configuration knob for this.

> Regarding clones and log -b:
> 
> Normal clones don't scale either (computing the changegroup is too slow), so
> we do streaming clones, and that only scales because we don't download
> filelogs (and the per commit overhead of the changelog+manifest is tiny).

"streaming clones" - like --uncompressed? (It would be nice to have that option renamed to --raw or --stream).
Comment 10 Durham Goode 2015-02-21 15:16 UTC
I'm confused. I sent patches to make it lazy so both of our needs should be met. Are you saying that's not sufficient?
Comment 11 Matt Mackall 2015-03-02 01:49 UTC
I'm not terribly happy with the patch on the list, but it's the only one I've got to go on here at release time. Please discuss patches on the list: not only do they get wider exposure, the discussion doesn't die when someone closes their browser tab.
Comment 12 HG Bot 2015-03-02 02:00 UTC
Fixed by http://selenic.com/repo/hg/rev/5b4ed033390b
Mads Kiilerich <madski@unity3d.com>
revisionbranchcache: fall back to slow path if starting readonly (issue4531)

Transitioning to Mercurial versions with revision branch cache could be slow as
long as all operations were readonly (revset queries) and the cache would be
populated but not written back.

Instead, fall back to using the consistently slow path when readonly and the
cache doesn't exist yet. That avoids the overhead of populating the cache
without writing it back.

If not readonly, it will still populate all missing entries initially. That
avoids repeated writing of the cache file with small updates, and it also makes
sure a fully populated cache is available for the readonly operations.

(please test the fix)
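The behaviour described in the commit message above can be sketched as the following decision logic. The function and parameter names are hypothetical stand-ins, not Mercurial's actual API; this only illustrates the readonly fallback.

```python
# Sketch of the issue4531 fix's behaviour, with assumed names: when the
# cache file is absent and the operation is readonly, skip cache
# population entirely and use the slow per-revision path; only a write
# operation populates (and can persist) the full cache.

def branchinfo(rev, cache_exists, readonly,
               slow_lookup, cached_lookup, populate_cache):
    if not cache_exists and readonly:
        # Readonly with no cache on disk: populating would be wasted
        # work, since we could not write it back. Take the slow path.
        return slow_lookup(rev)
    if not cache_exists:
        # A write operation populates all missing entries at once, so
        # later readonly operations find a fully populated cache.
        populate_cache()
    return cached_lookup(rev)
```

The trade-off matches comment 13: readonly operations stay consistently slow until the first write operation builds the cache, rather than paying the population cost repeatedly without ever persisting it.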
Comment 13 Durham Goode 2015-03-02 02:03 UTC
The patch on the list doesn't really fix this bug; it just delays it until a write operation.

I sent some patches to Pierre-Yves and Mads directly to make it actually incremental. I got their feedback and will send an updated version to the list this week hopefully.
Comment 14 Bugzilla 2015-03-10 01:00 UTC
Bug was set to TESTING for 7 days, resolving