Linkrev perf

Durham Goode durham at fb.com
Tue Sep 27 19:29:42 EDT 2016



On 9/27/16 2:15 PM, Jun Wu wrote:
> Excerpts from Durham Goode's message of 2016-09-27 13:31:22 -0700:
>> On 9/27/16 8:24 AM, Jun Wu wrote:
>>> I changed the subject to reflect the actual topic.
>>>
>>> The map or whatever structure to replace one single linkrev to a list of
>>> "introrev"s (or flags indicating it is not unique) is just a hack.
>> Sure it's imperfect, but it won't require any BC breakages, and it will
>> likely fix the problem for the majority of use cases (by letting us only
>> scan the changelog .i for potential ancestor linknodes and avoid
>> scanning the changelog .d).
> Suppose the map is {fctx: [linkrev]}.
>
> First, the map should be as small as possible, so it shouldn't contain any
> fctx whose linkrev list contains only one revision - in that case the
> linkrev is already accurate.
>
> Then the problem becomes how to make sure the map is up-to-date. It's not
> easy imo. For example, say Alice has a map storing {fctx: [linkrev]} for
> len([linkrev]) > 1, and Bob also has such a map locally. When Bob pulls from
> Alice, it is possible that we need new entries that exist in neither
> Alice's nor Bob's map - how do you find those entries in an efficient
> way?
The map is just there to make the initial step of adjustlinkrev more 
likely to succeed.  When it scans the changelog.i, it knows of more 
nodes it can search for.  If it finds none of those, it falls back to 
the old scan-changelog.d behavior and fills the cache with the result.  
Commit, rebase, amend, etc. can also fill the cache as commits are 
added.
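
A rough Python sketch of how the cache could sit in front of the existing
slow path (the cache shape and the helper names below are hypothetical;
only the fallback mirrors what adjustlinkrev does today):

    # Hypothetical cache shape: {(path, filenode): [linkrev, ...]},
    # populated only when more than one candidate linkrev exists.
    def candidate_linkrevs(repo, path, filenode, cache):
        """Return the linkrevs worth checking first for (path, filenode)."""
        key = (path, filenode)
        if key in cache:
            return cache[key]
        # No cache entry: fall back to the single linkrev stored in the
        # filelog index.
        flog = repo.file(path)
        return [flog.linkrev(flog.rev(filenode))]

    def adjusted_linkrev(repo, path, filenode, ancestors, cache):
        """Pick the introducing rev among ancestors, filling the cache on a miss."""
        for lkr in candidate_linkrevs(repo, path, filenode, cache):
            if lkr in ancestors:
                return lkr
        # Slow path: walk changesets (reads changelog.d) to find the real
        # introduction, then remember it so the next lookup skips the walk.
        lkr = slow_scan_for_introrev(repo, path, filenode, ancestors)  # hypothetical
        cache.setdefault((path, filenode), []).append(lkr)
        return lkr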
>
>>> The long-term solution would be something different, ideally each fctx has
>>> one unique linkrev, and any filelog is a subgraph of the changelog.
>>>
>>> It seems we have to add some noise when creating a new filenode to address
>>> the problem. But hash change is a big BC. So one crazier idea is to extend
>>> the hash - we still have 16 bytes unused for nodeid everywhere. So it would
>>> be like:
>>>
>>>     - filenode would have extra N bytes for each revision, purely noise
>>>     - manifest would be the complex part - we need to record those N bytes for
>>>       each file, while avoiding a BC break or hash change.
>>>       The solution would be something like a separate manifest storing just
>>>       the noise for each file path, so the hash of the main manifest does
>>>       not change - thus staying compatible with old clients.
>>>       Manifest nodeid would be extended by N bytes as well.
>>>     - changelog nodeid would have a similar change - extra N bytes
>>>       representing the noise.
>> I don't fully understand.  You say the extra bytes don't affect the hash
>> of the main manifest.  If they don't affect the hash, how do they fix
>> the problem?  If they do affect the hash, where?
> It won't affect the first 20 bytes of the manifest hash. For example, a
> manifest entry is like:
>    
>    path: 20-byte hash, 8-byte noise
>    ^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^
>     |                    |
>     |                    +- only affect the extra 8 bytes of a manifest hash
>     +- only affect first 20 bytes of a manifest hash
>
> Manifest hash: 20-byte 8-byte
>                 ^^^^^^^ ^^^^^^
>                  |       |
>                  |       +- only read by new clients
>                  +-- compatible with old clients
>
> Similar to changelog hash - first 20 bytes are for old clients, while the
> extra 8 bytes are for new clients.
I'd have to think about this some more.  It sounds a little scary, and 
I'm not sure how it would work with the wire protocol (does the wire 
protocol also have 12 bytes of free space?).
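
To make the split concrete, here is a tiny illustrative Python sketch of
the "20 legacy bytes + extra noise bytes" layout described above (all names
here are made up; nothing like this exists in Mercurial today):

    import os

    NODELEN = 20    # the part old clients understand
    EXTRALEN = 8    # per-revision noise, read only by new clients

    def make_extended_node(classic_node):
        """Append random noise to a classic 20-byte node (hypothetical)."""
        assert len(classic_node) == NODELEN
        return classic_node + os.urandom(EXTRALEN)

    def split_extended_node(extnode):
        """Old clients keep using the first 20 bytes; new clients read both parts."""
        return extnode[:NODELEN], extnode[NODELEN:NODELEN + EXTRALEN]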
>>>     - Include the order of p1, p2 in the hash
>> The order is already p1-then-p2. Why does it need recording?
> Currently we hash sorted([p1, p2]), so if you swap p1 and p2 directly in
> revlog, "hg verify" still passes. Since the order does affect some behavior,
> like "hg diff", it's better to "include the order in the hash" so if you
> swap, "hg verify" will complain.
Ugh, I didn't realize they were sorted.
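
For reference, a simplified sketch of how the node hash works today (parents
are sorted before hashing, which is why a swap goes unnoticed), next to the
kind of order-preserving variant being proposed (p1, p2 and text are bytes):

    import hashlib

    def node_hash(text, p1, p2):
        """Simplified view of the current revlog hash: parents are sorted,
        so swapping p1 and p2 yields the same node."""
        a, b = sorted([p1, p2])
        s = hashlib.sha1(a)
        s.update(b)
        s.update(text)
        return s.digest()

    def ordered_node_hash(text, p1, p2):
        """Hypothetical order-preserving variant: hashing p1 then p2 as given
        means a parent swap changes the node, so "hg verify" would notice."""
        s = hashlib.sha1(p1)
        s.update(p2)
        s.update(text)
        return s.digest()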

