RFC: bitmap storage for hidden

Pierre-Yves David pierre-yves.david at ens-lyon.org
Tue Sep 6 13:06:52 EDT 2016



On 08/31/2016 10:53 PM, Durham Goode wrote:
> One of the performance costs that affects every command is the
> computehidden function (which computes what commits are hidden based on
> a combination of obsmarkers, bookmarks, workingcopy, phases, and tags).
> At Facebook this function alone can add 300ms to every command a user
> runs (depending on a variety of factors).

Do you have the details on where that time is going? There might be 
small steps to take in existing code to speed thing up before 
introducing a new concept.

For example, We could have a similar bitmap to know if a changeset is 
affected by obsolescence markers, that would cut the whole obsstore 
parsing out + hash checking out of the equation with minimum impact. The 
bitmap would also be significant more stable (you never stop being 
affected unless the obsstore is stripped) which would make it easier to 
handle.

> I've begun work on a storage structure for this information, so we don't
> need to recompute it during every read command.  Before I go through the
> work to polish it enough to send it upstream, I want to run it by people
> for comments.
>
> The idea is basically:
>
> - Have a file that acts as a bitmap, where a changelog rev number maps
> to a particular bit, and if that bit is set then the node is considered
> hidden.  computehidden will return a class that uses this to answer
> filteredrevs.__contains__ calls.
>
> - When something is changed that would affect visibility (bookmark edit,
> phase change, obsmarker creation, dirstate movement, tag edit), that
> code is responsible for calling hidden.updatevisibilty(repo,
> affectednodes) which reports which nodes are touched by the edit (like
> the old bookmark node and the new bookmark node), and updatevisibility()
> does the minimal calculation of visibility changes and propagates those
> changes to the node ancestors as required.

Could we move that at a level lower and use this for the whole filter 
infrastructure? There is a lot of logic around incremental update and 
sub-setting here.

The idea of having some even system to detect change and triggers 
various processing seems useful but we should probably also tight it up 
at a lower level than just high level operation. There is multiple other 
place that could use this kind of data, having some kind of "journal of 
transaction event" seems something useful. Having it a the transaction 
level would also help a bit with the daunting idea of having 3rd party 
code in extension play nice with it.

> - The file would be stored in .hg/cache/hidden, and would only be 122kb
> per million commits, so we'd just use the normal copy + atomic rename
> strategy for transactionality (i.e. transaction.addfilegenerator)

We cannot just use transaction here, visibility is affected by operation 
done outside a transaction (eg: working copy parent update).

> - In theory the file would always be up-to-date, but we can put some
> metadata at the front of the file to allow verifying the repo hasn't
> changed significantly out from under it (recording tip, working copy
> parents, obsmarker file timestamp, etc). If the repo has changed, or if
> the bitmap doesn't yet exist, it can be trivially calculated using
> hidden.updatevisibility(repo.revs('not public()')) in a manner similar
> to how it works today.

Given that old client can touch repository, we cannot guarantee it will 
always be up to date unless we bump the repository format (something 
that will eventually happen "one day"). In the mean time we need some 
cache invalidation key. Invalidation key for the whole hiddenset wont be 
trivial. That make another argument to start with caching at simpler 
point in the pipe.

> Notable issues:
>
> - A number of places treat filteredrevs as a set, and do things like
> 'myset - repo.filteredrevs'.  With a bitmap this doesn't work, so we
> need to translate it to '(r for r in myset if r not in
> repo.filteredrevs)'.  Which is probably a better O() anyway since it
> won't be affected by the size of filteredrevs.

I'm pretty sure Python have optimized enough set operations than the 
complexity of the existing code is fine. You proposal will probably be 
significantly slower. We should look into having efficient operation 
with set instead.

> - filteredrevs is currently a frozen set.  Editing the bitmap will have
> to be done in a way where the edits only become visible when the
> 'frozen' set is invalidated.  So we maintain the existing behavior.

Moving to mutable object for this is a bit scary. They might escape in 
various places and are used in various keys. We should probably stick to 
immutable object for now and see if having mutable one is worth is as a 
separate step.

> Follow up ideas:
>
> - Once this ships as a cache, it might be possible to progress it to
> not-a-cache, such that we could define visibility in other terms.  For
> instance, we could allow hiding/reviving commits without obsolescence
> markers in non-evolve repos.  This is a bit more of a long term thought,
> but might be worth brainstorming about.

I'm a bit confused by this part. The current code for hidden computable 
is already independent from obsolescence and extensible. Did you checked 
out the current code?

Cheers,

-- 
Pierre-Yves David


More information about the Mercurial-devel mailing list