RFC: bitmap storage for hidden

Tue Sep 6 17:58:19 EDT 2016

On 9/6/16 10:06 AM, Pierre-Yves David wrote:
>
>
> On 08/31/2016 10:53 PM, Durham Goode wrote:
>> One of the performance costs that affects every command is the
>> computehidden function (which computes what commits are hidden based on
>> a combination of obsmarkers, bookmarks, workingcopy, phases, and tags).
>> At Facebook this function alone can add 300ms to every command a user
>> runs (depending on a variety of factors).
>
> Do you have the details on where that time is going? There might be 
> small steps to take in existing code to speed thing up before 
> introducing a new concept.
 From 269ms total:
- 125ms is in hideablerevs (80ms doing 'not public()', 29ms doing 'node 
in obsstore.successors' checks (maybe that include obstore parsing?))
- 100ms is building the changelog prefix tree as we do node-to-rev lookups
Not sure where the rest is going
>
> For example, We could have a similar bitmap to know if a changeset is 
> affected by obsolescence markers, that would cut the whole obsstore 
> parsing out + hash checking out of the equation with minimum impact. 
> The bitmap would also be significant more stable (you never stop being 
> affected unless the obsstore is stripped) which would make it easier 
> to handle.
I think the inhibit extension means we'd have a similar kind of cache 
invalidation problem, but I guess the inhibit set is usually small 
enough to be easy to compute a hash for.

I'd have to think about how we'd use a obsmarker only bitmap cache. I 
think one issue is that we want to avoid reading any nodes from far back 
in the changelog (because that requires building the node prefix index  
Determining if a commit is hidden requires knowing the phase, which 
requires reading nodes all the way back in time to the phase root before 
the first node affected by obsmarkers.  So I don't think an obsmarker 
bitmap would help there.  We could of course also have a public-or-not 
bitmap, or have the obsmarker bitmap take into account public.  Or we 
could have the visibility set be lazily calculated.  So if you want to 
know if commit 'N - 10' is hidden or not, you only have to read 10 commits.

But I think we'd still need almost the same level of cache invalidation 
(you wouldn't care about bookmarks and the working copy, but you would 
have to check if the obsstore change or if phase roots change, which are 
the expensive ones).  I'd rather aim to remove 100% of the read-time 
calculations, since ideally we want our commands as close to 0ms as 
possible (with chg).
>
>> I've begun work on a storage structure for this information, so we don't
>> need to recompute it during every read command.  Before I go through the
>> work to polish it enough to send it upstream, I want to run it by people
>> for comments.
>>
>> The idea is basically:
>>
>> - Have a file that acts as a bitmap, where a changelog rev number maps
>> to a particular bit, and if that bit is set then the node is considered
>> hidden.  computehidden will return a class that uses this to answer
>> filteredrevs.__contains__ calls.
>>
>> - When something is changed that would affect visibility (bookmark edit,
>> phase change, obsmarker creation, dirstate movement, tag edit), that
>> code is responsible for calling hidden.updatevisibilty(repo,
>> affectednodes) which reports which nodes are touched by the edit (like
>> the old bookmark node and the new bookmark node), and updatevisibility()
>> does the minimal calculation of visibility changes and propagates those
>> changes to the node ancestors as required.
>
> Could we move that at a level lower and use this for the whole filter 
> infrastructure? There is a lot of logic around incremental update and 
> sub-setting here.
Maybe eventually.  I find that the stack-of-filter logic is very hard to 
reason about and work with, so I'd rather start at the top with the most 
common situations, then if the pattern proves useful, change the lower 
filters to use it as well afterwards
>
> The idea of having some even system to detect change and triggers 
> various processing seems useful but we should probably also tight it 
> up at a lower level than just high level operation. There is multiple 
> other place that could use this kind of data, having some kind of 
> "journal of transaction event" seems something useful. Having it a the 
> transaction level would also help a bit with the daunting idea of 
> having 3rd party code in extension play nice with it.
>
>> - The file would be stored in .hg/cache/hidden, and would only be 122kb
>> per million commits, so we'd just use the normal copy + atomic rename
>> strategy for transactionality (i.e. transaction.addfilegenerator)
>
> We cannot just use transaction here, visibility is affected by 
> operation done outside a transaction (eg: working copy parent update).
I thought all working copy updates were done in transactions now? We can 
have writes that circumvent the transaction blow away the cache entirely 
for now, if they're rare.
>
>> - In theory the file would always be up-to-date, but we can put some
>> metadata at the front of the file to allow verifying the repo hasn't
>> changed significantly out from under it (recording tip, working copy
>> parents, obsmarker file timestamp, etc). If the repo has changed, or if
>> the bitmap doesn't yet exist, it can be trivially calculated using
>> hidden.updatevisibility(repo.revs('not public()')) in a manner similar
>> to how it works today.
>
> Given that old client can touch repository, we cannot guarantee it 
> will always be up to date unless we bump the repository format 
> (something that will eventually happen "one day"). In the mean time we 
> need some cache invalidation key. Invalidation key for the whole 
> hiddenset wont be trivial. That make another argument to start with 
> caching at simpler point in the pipe.
>
>> Notable issues:
>>
>> - A number of places treat filteredrevs as a set, and do things like
>> 'myset - repo.filteredrevs'.  With a bitmap this doesn't work, so we
>> need to translate it to '(r for r in myset if r not in
>> repo.filteredrevs)'.  Which is probably a better O() anyway since it
>> won't be affected by the size of filteredrevs.
>
> I'm pretty sure Python have optimized enough set operations than the 
> complexity of the existing code is fine. You proposal will probably be 
> significantly slower. We should look into having efficient operation 
> with set instead.
The runtime of the existing set operations is fine, but the runtime of 
building those sets to begin with is not.  We can't really have sets 
whose creation is O(size of repo), liked filteredrevs is.  So I'd rather 
make these calculations minimal to the commits they care about, even if 
it increases the constant time factor in the equation.
>
>> - filteredrevs is currently a frozen set. Editing the bitmap will have
>> to be done in a way where the edits only become visible when the
>> 'frozen' set is invalidated.  So we maintain the existing behavior.
>
> Moving to mutable object for this is a bit scary. They might escape in 
> various places and are used in various keys. We should probably stick 
> to immutable object for now and see if having mutable one is worth is 
> as a separate step.
>
>> Follow up ideas:
>>
>> - Once this ships as a cache, it might be possible to progress it to
>> not-a-cache, such that we could define visibility in other terms.  For
>> instance, we could allow hiding/reviving commits without obsolescence
>> markers in non-evolve repos.  This is a bit more of a long term thought,
>> but might be worth brainstorming about.
>
> I'm a bit confused by this part. The current code for hidden 
> computable is already independent from obsolescence and extensible. 
> Did you checked out the current code?
I mean, the only thing today that can cause a commit to be hidden is the 
creation of an obsolescence marker.  If we had hidden-ness stored 
elsewhere, we might be able to hide nodes for more arbitrary reasons.