The new rbc branch cache (http://selenic.com/hg/rev/cb99bacb9b4e) is extremely slow to populate on large repositories. On the hg repo it takes 1.5 seconds, but on a repository with hundreds of thousands of commits it takes over a minute. Additionally, the cache is not written to disk unless a write operation happens (like a commit). This is a major regression and affects every newly cloned repository as soon as it performs any operation that touches branch information.
Is that on a repo where, for example (very simplified), hg id -r 'branch(default)' would take the same amount of time?
Would adding rbc branch cache transfer to bundle2 solve your problems?
Yes, 'hg id -r branch(default)' takes a long time too without the cache. Prior to the regression (introduced in 678f53865c68) that command took 0.34s; now it takes 58s. Can we change the cache to cache only the requested data instead of precomputing it for the entire repository? Using bundle2 to copy it to clients just circumvents the real problem: populating the cache doesn't scale.
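To illustrate what "cache only the requested data" could mean, here is a minimal sketch. This is not Mercurial's actual internal API; the changelog-like object with a branchinfo(rev) method is an assumption for illustration. The point is that only revisions a query actually touches get parsed.

```python
class LazyBranchCache:
    """Hypothetical sketch: fill the revision->branch mapping on demand
    instead of precomputing it for the whole repository up front."""

    def __init__(self, changelog):
        # changelog is assumed to expose branchinfo(rev) -> (branch, closed),
        # the (expensive) per-revision lookup that reads the changelog entry.
        self._changelog = changelog
        self._cache = {}  # rev -> branch name, populated lazily

    def branch(self, rev):
        if rev not in self._cache:
            # Only pay the parsing cost for revisions actually requested.
            self._cache[rev] = self._changelog.branchinfo(rev)[0]
        return self._cache[rev]
```

A revset like branch(default) would still touch every revision in the worst case, but point queries (e.g. looking up the branch of a handful of heads) would no longer require a full repository scan.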
Sorry - that was a bad example; it just happens to show the worst case you describe. I was trying to assess the speed of your repo and compare the "over a minute" with other operations. Branch slowness is of course mainly an issue for those of us who use named branches and have queries where hg log -b stable -T. --time is a realistic benchmark; it is less of an issue if you only need the head of the default branch. I agree that populating the cache initially has a notable cost. Repeated population because the repo is readonly is of course much worse and just bad. But are you sure that populating the cache doesn't scale, even for you? It scales linearly with repo size like cloning does and is approximately as expensive as a log -b. That is still cheap compared to the other operations that only happen once. Or does your custom backend avoid cloning and have much faster branch name lookup than normal changelogs?
Any operation that is O(size of repo) won't scale unfortunately. An operation that needs to read data from inside each commit will have to parse 700MB (and growing) of data, most of which isn't even in the page cache. Is the cache ordered? Can we just append new entries to it as we read them? At the very least, we could have a config flag for disabling this feature. Regarding clones and log -b: Normal clones don't scale either (computing the changegroup is too slow), so we do streaming clones, and that only scales because we don't download filelogs (and the per commit overhead of the changelog+manifest is tiny). log -b is also slow, but at least we can optimize that to start returning results as soon as it finds them, so the user doesn't have to wait.
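For what it's worth, an ordered, append-only layout could look roughly like this. This is a sketch under assumptions, not the actual rbc file format: fixed-width records indexed by revision number, so new entries can be appended as they are computed and a missing entry simply means "not cached yet".

```python
import struct

# Hypothetical record layout: one 4-byte branch index per revision.
RECORD = struct.Struct(">I")

class AppendOnlyCache:
    """Fixed-width records keyed by revision number. Because record N
    lives at offset N * RECORD.size, entries must be appended in
    revision order, and lookup is a constant-time offset read."""

    def __init__(self):
        self._data = bytearray()

    def __len__(self):
        return len(self._data) // RECORD.size

    def append(self, rev, branchidx):
        assert rev == len(self), "records must be appended in revision order"
        self._data += RECORD.pack(branchidx)

    def get(self, rev):
        if rev >= len(self):
            return None  # not cached yet; caller falls back to the changelog
        (idx,) = RECORD.unpack_from(self._data, rev * RECORD.size)
        return idx
```

With such a layout, incrementally extending the cache after a pull is a pure append, and a readonly operation could persist whatever prefix it has computed without rewriting the whole file.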
All our other caches try to write the cache whenever possible. Not doing that here is silly.
I could not find a way to do it from revsets that was not silly.
I sent some work-in-progress patches to Mads and Pierre-Yves that make this cache lazy. I'll send them to the mailing list once I'm sure I've addressed all the issues they discussed in the original series.
First: in database terms, the changelog branch (and changeset) information is pretty much a 'computed value'. It currently cannot be retrieved efficiently. Some "real" databases can build indices on computed values, but I decided to call the revision branch cache a 'cache' because it lives in the same layer as the branch head cache and is not always fully updated. Ideally, the changelog format itself would allow efficient retrieval of meta info and/or maintain relevant indices for the most common queries.

(In reply to comment #5)
> Any operation that is O(size of repo) won't scale unfortunately. An
> operation that needs to read data from inside each commit will have to parse
> 700MB (and growing) of data, most of which isn't even in the page cache. Is
> the cache ordered? Can we just append new entries to it as we read them? At
> the very least, we could have a config flag for disabling this feature.

Well ... just streaming the changelog like you do is also O(size of repo) ... but might have a constant factor that is different in your setup. In the normal case, these 700 MB have just been transferred and written - operations that seem significantly more expensive than reading them.

I will still argue that in all "real world" scenarios, populating the rbc after cloning
* is very cheap compared to the cloning itself, and
* is not more costly than the common slow operations it makes several times faster.

It might be relevant to transfer the rbc too when cloning a repo. I am not sure it is worth it, but that could be investigated.

In your special setup with a special storage format and very efficient cloning, you might not need the rbc, but I think everybody else does. It would seem very reasonable for your extension to disable functionality that perhaps no longer is relevant with it. I think the rbc in most cases adds so much value at a low worst-case cost that we don't want a configuration knob for this.
> Regarding clones and log -b:
>
> Normal clones don't scale either (computing the changegroup is too slow), so
> we do streaming clones, and that only scales because we don't download
> filelogs (and the per commit overhead of the changelog+manifest is tiny).

"Streaming clones" - like --uncompressed? (It would be nice to have that option renamed to --raw or --stream.)
I'm confused. I sent patches to make it lazy so both of our needs should be met. Are you saying that's not sufficient?
I'm not terribly happy with the patch on the list, but it's the only one I've got to go on here at release time. Please discuss patches on the list: not only do they get wider exposure, the discussion doesn't die when someone closes their browser tab.
Fixed by http://selenic.com/repo/hg/rev/5b4ed033390b

Mads Kiilerich <madski@unity3d.com>
revisionbranchcache: fall back to slow path if starting readonly (issue4531)

Transitioning to Mercurial versions with the revision branch cache could be slow as long as all operations were readonly (revset queries): the cache would be populated but never written back.

Instead, fall back to using the consistently slow path when readonly and the cache doesn't exist yet. That avoids the overhead of populating the cache without writing it back.

If not readonly, it will still populate all missing entries initially. That avoids repeated writing of the cache file with small updates, and it also makes sure a fully populated cache is available for the readonly operations.

(please test the fix)
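The control flow of the fix can be sketched roughly like this. The function and repo method names here are hypothetical, not Mercurial's internal API; it only illustrates the decision described in the commit message.

```python
def branchinfo(repo, rev, readonly):
    """Sketch of the issue4531 fix: skip cache population when the
    result could never be written back anyway (hypothetical API)."""
    if readonly and not repo.cache_exists():
        # Readonly and no cache on disk: populating would be wasted work
        # since it cannot be persisted, so take the consistently slow
        # path once instead of paying the population cost repeatedly.
        return repo.slow_branchinfo(rev)
    # Otherwise populate all missing entries up front; when not readonly
    # the fully populated cache is written back for later operations.
    cache = repo.load_or_populate_cache()
    return cache[rev]
```

The trade-off: a readonly operation on a fresh clone stays slow (as before the rbc existed), but the first write operation populates and persists the cache in one go, after which readonly queries are fast.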
The patch on the list doesn't really fix this bug; it just delays it until a write operation. I sent some patches to Pierre-Yves and Mads directly that make it actually incremental. I got their feedback and will hopefully send an updated version to the list this week.
Bug was set to TESTING for 7 days, resolving