RFC: bitmap storage for hidden

Fri Sep 9 13:25:27 EDT 2016

On 09/06/2016 11:58 PM, Durham Goode wrote:
>
>
> On 9/6/16 10:06 AM, Pierre-Yves David wrote:
>>
>>
>> On 08/31/2016 10:53 PM, Durham Goode wrote:
>>> One of the performance costs that affects every command is the
>>> computehidden function (which computes what commits are hidden based on
>>> a combination of obsmarkers, bookmarks, workingcopy, phases, and tags).
>>> At Facebook this function alone can add 300ms to every command a user
>>> runs (depending on a variety of factors).
>>
>> Do you have the details on where that time is going? There might be
>> small steps to take in existing code to speed thing up before
>> introducing a new concept.
> From 269ms total:
> - 125ms is in hideablerevs (80ms doing 'not public()', 29ms doing 'node
> in obsstore.successors' checks (maybe that include obstore parsing?))
> - 100ms is building the changelog prefix tree as we do node-to-rev lookups
> Not sure where the rest is going
>>
>> For example, We could have a similar bitmap to know if a changeset is
>> affected by obsolescence markers, that would cut the whole obsstore
>> parsing out + hash checking out of the equation with minimum impact.
>> The bitmap would also be significant more stable (you never stop being
>> affected unless the obsstore is stripped) which would make it easier
>> to handle.
> I think the inhibit extension means we'd have a similar kind of cache
> invalidation problem, but I guess the inhibit set is usually small
> enough to be easy to compute a hash for.

Yeah, inhibit is not designed nor expected to be active for a large 
number of changeset. In the worse possible case it affect all locally 
active draft created by the user which is supposed to be bounded.

> I'd have to think about how we'd use a obsmarker only bitmap cache. I
> think one issue is that we want to avoid reading any nodes from far back
> in the changelog (because that requires building the node prefix index
> Determining if a commit is hidden requires knowing the phase, which
> requires reading nodes all the way back in time to the phase root before
> the first node affected by obsmarkers.  So I don't think an obsmarker
> bitmap would help there.  We could of course also have a public-or-not
> bitmap, or have the obsmarker bitmap take into account public.
 >
> But I think we'd still need almost the same level of cache invalidation
> (you wouldn't care about bookmarks and the working copy, but you would
> have to check if the obsstore change or if phase roots change, which are
> the expensive ones).  I'd rather aim to remove 100% of the read-time
> calculations, since ideally we want our commands as close to 0ms as
> possible (with chg).

Having an obsmarker bitmap would help not having to read/use nodes. But 
you are right that it need to be backed up by another 'public' bitmap to 
remove the need to use these nodes on the phase side.

However that still looks like a better first step. Each of these two 
bitmaps will be tight to a single data type so computing a cache key, 
invalidating them and incremental update should be much simpler. In 
addition they also have the interesting property that bit tend to also 
flip in one direction which make transaction related work simpler 
(unless rare event like stripping obsstore or forced phases moved). With 
the current state we will need some cache invalidation mechanism in all 
cases (see previous email)

Such low levels caches will benefit other bits of the code and provide a 
wider speed boost.

> >> I've begun work on a storage structure for this information, so we don't
>>> need to recompute it during every read command.  Before I go through the
>>> work to polish it enough to send it upstream, I want to run it by people
>>> for comments.
>>>
>>> The idea is basically:
>>>
>>> - Have a file that acts as a bitmap, where a changelog rev number maps
>>> to a particular bit, and if that bit is set then the node is considered
>>> hidden.  computehidden will return a class that uses this to answer
>>> filteredrevs.__contains__ calls.
>>>
>>> - When something is changed that would affect visibility (bookmark edit,
>>> phase change, obsmarker creation, dirstate movement, tag edit), that
>>> code is responsible for calling hidden.updatevisibilty(repo,
>>> affectednodes) which reports which nodes are touched by the edit (like
>>> the old bookmark node and the new bookmark node), and updatevisibility()
>>> does the minimal calculation of visibility changes and propagates those
>>> changes to the node ancestors as required.
>>
>> Could we move that at a level lower and use this for the whole filter
>> infrastructure? There is a lot of logic around incremental update and
>> sub-setting here.
 >
> Maybe eventually.  I find that the stack-of-filter logic is very hard to
> reason about and work with, so I'd rather start at the top with the most
> common situations, then if the pattern proves useful, change the lower
> filters to use it as well afterwards

There is really no rocket science here. The filters defines different 
subset of nodes. These subset have the interesting property to be 
subsets of each other (eg: public < visible). The smaller subset change 
less often and usually only grow forward (no node removed). As a result 
the caches (eg: branch cache) related to the smaller subset are more 
stable and can be used by the super set as a base to incrementally 
recompute a version valid for the subset without starting over from scratch.

The system is mostly used for branchmap cache but could probably be 
useful elsewhere. In addition some of the cache invalidation related to 
filter are very simple, improving them is a low hanging fruit.

A lot of the filter are defined from phases information, so having the 
bitmap related to public would help all along the way there.

Trying to improve code around visible without looking into the filter 
system seems like a concerning oversight.

>> The idea of having some even system to detect change and triggers
>> various processing seems useful but we should probably also tight it
>> up at a lower level than just high level operation. There is multiple
>> other place that could use this kind of data, having some kind of
>> "journal of transaction event" seems something useful. Having it a the
>> transaction level would also help a bit with the daunting idea of
>> having 3rd party code in extension play nice with it.
>>
>>> - The file would be stored in .hg/cache/hidden, and would only be 122kb
>>> per million commits, so we'd just use the normal copy + atomic rename
>>> strategy for transactionality (i.e. transaction.addfilegenerator)
>>
>> We cannot just use transaction here, visibility is affected by
>> operation done outside a transaction (eg: working copy parent update).
 >
> I thought all working copy updates were done in transactions now? We can
> have writes that circumvent the transaction blow away the cache entirely
> for now, if they're rare.

No, working copy parent change will collaborate with the transaction if 
there is one. But it does not requires a transaction.

>>> - In theory the file would always be up-to-date, but we can put some
>>> metadata at the front of the file to allow verifying the repo hasn't
>>> changed significantly out from under it (recording tip, working copy
>>> parents, obsmarker file timestamp, etc). If the repo has changed, or if
>>> the bitmap doesn't yet exist, it can be trivially calculated using
>>> hidden.updatevisibility(repo.revs('not public()')) in a manner similar
>>> to how it works today.
>>
>> Given that old client can touch repository, we cannot guarantee it
>> will always be up to date unless we bump the repository format
>> (something that will eventually happen "one day"). In the mean time we
>> need some cache invalidation key. Invalidation key for the whole
>> hiddenset wont be trivial. That make another argument to start with
>> caching at simpler point in the pipe.
>>
>>> Notable issues:
>>>
>>> - A number of places treat filteredrevs as a set, and do things like
>>> 'myset - repo.filteredrevs'.  With a bitmap this doesn't work, so we
>>> need to translate it to '(r for r in myset if r not in
>>> repo.filteredrevs)'.  Which is probably a better O() anyway since it
>>> won't be affected by the size of filteredrevs.
>>
>> I'm pretty sure Python have optimized enough set operations than the
>> complexity of the existing code is fine. You proposal will probably be
>> significantly slower. We should look into having efficient operation
>> with set instead.
 >
> The runtime of the existing set operations is fine, but the runtime of
> building those sets to begin with is not.  We can't really have sets
> whose creation is O(size of repo), liked filteredrevs is.  So I'd rather
> make these calculations minimal to the commits they care about, even if
> it increases the constant time factor in the equation.
>>
>>> - filteredrevs is currently a frozen set. Editing the bitmap will have
>>> to be done in a way where the edits only become visible when the
>>> 'frozen' set is invalidated.  So we maintain the existing behavior.
>>
>> Moving to mutable object for this is a bit scary. They might escape in
>> various places and are used in various keys. We should probably stick
>> to immutable object for now and see if having mutable one is worth is
>> as a separate step.
>>
>>> Follow up ideas:
>>>
>>> - Once this ships as a cache, it might be possible to progress it to
>>> not-a-cache, such that we could define visibility in other terms.  For
>>> instance, we could allow hiding/reviving commits without obsolescence
>>> markers in non-evolve repos.  This is a bit more of a long term thought,
>>> but might be worth brainstorming about.
>>
>> I'm a bit confused by this part. The current code for hidden
>> computable is already independent from obsolescence and extensible.
>> Did you checked out the current code?
 >
> I mean, the only thing today that can cause a commit to be hidden is the
> creation of an obsolescence marker.  If we had hidden-ness stored
> elsewhere, we might be able to hide nodes for more arbitrary reasons.

I'm still confused, the way we store hidden-ness and the way we compute 
it are two different things. The current API definitely allow for other 
way to marks things hidden. Sean even successfully implemented 
experiment for hg-git a couple of month ago. So I don't see why this is 
an argument in the current discussion.

Cheers

-- 
Pierre-Yves David