RFC: bitmap storage for hidden

Durham Goode durham at fb.com
Fri Sep 2 12:41:30 EDT 2016



On 9/2/16 9:14 AM, Gregory Szorc wrote:
>
>> On Aug 31, 2016, at 13:53, Durham Goode <durham at fb.com> wrote:
>>
>> One of the performance costs that affects every command is the computehidden function (which computes what commits are hidden based on a combination of obsmarkers, bookmarks, workingcopy, phases, and tags).  At Facebook this function alone can add 300ms to every command a user runs (depending on a variety of factors).
>>
>> I've begun work on a storage structure for this information, so we don't need to recompute it during every read command.  Before I go through the work to polish it enough to send it upstream, I want to run it by people for comments.
>>
>> The idea is basically:
>>
>> - Have a file that acts as a bitmap, where a changelog rev number maps to a particular bit, and if that bit is set then the node is considered hidden.  computehidden will return a class that uses this to answer filteredrevs.__contains__ calls.
>>
>> - When something is changed that would affect visibility (bookmark edit, phase change, obsmarker creation, dirstate movement, tag edit), that code is responsible for calling hidden.updatevisibilty(repo, affectednodes) which reports which nodes are touched by the edit (like the old bookmark node and the new bookmark node), and updatevisibility() does the minimal calculation of visibility changes and propagates those changes to the node ancestors as required.
>>
>> - The file would be stored in .hg/cache/hidden, and would only be 122kb per million commits, so we'd just use the normal copy + atomic rename strategy for transactionality (i.e. transaction.addfilegenerator)
>>
>> - In theory the file would always be up-to-date, but we can put some metadata at the front of the file to allow verifying the repo hasn't changed significantly out from under it (recording tip, working copy parents, obsmarker file timestamp, etc). If the repo has changed, or if the bitmap doesn't yet exist, it can be trivially calculated using hidden.updatevisibility(repo.revs('not public()')) in a manner similar to how it works today.
>>
>>
>> Notable issues:
>>
>> - A number of places treat filteredrevs as a set, and do things like 'myset - repo.filteredrevs'.  With a bitmap this doesn't work, so we need to translate it to '(r for r in myset if r not in repo.filteredrevs)'.  Which is probably a better O() anyway since it won't be affected by the size of filteredrevs.
>>
>> - filteredrevs is currently a frozen set.  Editing the bitmap will have to be done in a way where the edits only become visible when the 'frozen' set is invalidated.  So we maintain the existing behavior.
>>
>>
>> Follow up ideas:
>>
>> - Once this ships as a cache, it might be possible to progress it to not-a-cache, such that we could define visibility in other terms.  For instance, we could allow hiding/reviving commits without obsolescence markers in non-evolve repos.  This is a bit more of a long term thought, but might be worth brainstorming about.
> Why not ship it as part of the store from the beginning (without the extra features)? Cache validation is hard. Tying the updates to store locks/transactions seems easier and guarantees no updates are needed at read time (which has historically caused issues on multi process HTTP servers).
The cache will be tied to locks and transactions already.  So it 
effectively won't be a cache.  The only reason for treating it like a 
cache and having some sort of validation is so that we can detect issues 
where the bitmap becomes out of sync and we can fix them before we make 
this really not-a-cache.


More information about the Mercurial-devel mailing list