RFC: bitmap storage for hidden
Pierre-Yves David
pierre-yves.david at ens-lyon.org
Tue Sep 6 13:06:52 EDT 2016
On 08/31/2016 10:53 PM, Durham Goode wrote:
> One of the performance costs that affects every command is the
> computehidden function (which computes what commits are hidden based on
> a combination of obsmarkers, bookmarks, workingcopy, phases, and tags).
> At Facebook this function alone can add 300ms to every command a user
> runs (depending on a variety of factors).
Do you have the details on where that time is going? There might be
small steps to take in existing code to speed thing up before
introducing a new concept.
For example, We could have a similar bitmap to know if a changeset is
affected by obsolescence markers, that would cut the whole obsstore
parsing out + hash checking out of the equation with minimum impact. The
bitmap would also be significant more stable (you never stop being
affected unless the obsstore is stripped) which would make it easier to
handle.
> I've begun work on a storage structure for this information, so we don't
> need to recompute it during every read command. Before I go through the
> work to polish it enough to send it upstream, I want to run it by people
> for comments.
>
> The idea is basically:
>
> - Have a file that acts as a bitmap, where a changelog rev number maps
> to a particular bit, and if that bit is set then the node is considered
> hidden. computehidden will return a class that uses this to answer
> filteredrevs.__contains__ calls.
>
> - When something is changed that would affect visibility (bookmark edit,
> phase change, obsmarker creation, dirstate movement, tag edit), that
> code is responsible for calling hidden.updatevisibilty(repo,
> affectednodes) which reports which nodes are touched by the edit (like
> the old bookmark node and the new bookmark node), and updatevisibility()
> does the minimal calculation of visibility changes and propagates those
> changes to the node ancestors as required.
Could we move that at a level lower and use this for the whole filter
infrastructure? There is a lot of logic around incremental update and
sub-setting here.
The idea of having some even system to detect change and triggers
various processing seems useful but we should probably also tight it up
at a lower level than just high level operation. There is multiple other
place that could use this kind of data, having some kind of "journal of
transaction event" seems something useful. Having it a the transaction
level would also help a bit with the daunting idea of having 3rd party
code in extension play nice with it.
> - The file would be stored in .hg/cache/hidden, and would only be 122kb
> per million commits, so we'd just use the normal copy + atomic rename
> strategy for transactionality (i.e. transaction.addfilegenerator)
We cannot just use transaction here, visibility is affected by operation
done outside a transaction (eg: working copy parent update).
> - In theory the file would always be up-to-date, but we can put some
> metadata at the front of the file to allow verifying the repo hasn't
> changed significantly out from under it (recording tip, working copy
> parents, obsmarker file timestamp, etc). If the repo has changed, or if
> the bitmap doesn't yet exist, it can be trivially calculated using
> hidden.updatevisibility(repo.revs('not public()')) in a manner similar
> to how it works today.
Given that old client can touch repository, we cannot guarantee it will
always be up to date unless we bump the repository format (something
that will eventually happen "one day"). In the mean time we need some
cache invalidation key. Invalidation key for the whole hiddenset wont be
trivial. That make another argument to start with caching at simpler
point in the pipe.
> Notable issues:
>
> - A number of places treat filteredrevs as a set, and do things like
> 'myset - repo.filteredrevs'. With a bitmap this doesn't work, so we
> need to translate it to '(r for r in myset if r not in
> repo.filteredrevs)'. Which is probably a better O() anyway since it
> won't be affected by the size of filteredrevs.
I'm pretty sure Python have optimized enough set operations than the
complexity of the existing code is fine. You proposal will probably be
significantly slower. We should look into having efficient operation
with set instead.
> - filteredrevs is currently a frozen set. Editing the bitmap will have
> to be done in a way where the edits only become visible when the
> 'frozen' set is invalidated. So we maintain the existing behavior.
Moving to mutable object for this is a bit scary. They might escape in
various places and are used in various keys. We should probably stick to
immutable object for now and see if having mutable one is worth is as a
separate step.
> Follow up ideas:
>
> - Once this ships as a cache, it might be possible to progress it to
> not-a-cache, such that we could define visibility in other terms. For
> instance, we could allow hiding/reviving commits without obsolescence
> markers in non-evolve repos. This is a bit more of a long term thought,
> but might be worth brainstorming about.
I'm a bit confused by this part. The current code for hidden computable
is already independent from obsolescence and extensible. Did you checked
out the current code?
Cheers,
--
Pierre-Yves David
More information about the Mercurial-devel
mailing list