Linkrev perf

Durham Goode durham at fb.com
Tue Sep 27 16:31:22 EDT 2016


On 9/27/16 8:24 AM, Jun Wu wrote:
> I changed the subject to reflect the actual topic.
>
> The map or whatever structure to replace one single linkrev to a list of
> "introrev"s (or flags indicating it is not unique) is just a hack.
Sure it's imperfect, but it won't require any BC breakages, and it will 
likely fix the problem for the majority of use cases (by letting us only 
scan the changelog .i for potential ancestor linknodes and avoid 
scanning the changelog .d).
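The map idea being discussed can be sketched in a few lines. This is purely illustrative: `build_duplicate_map`, `introrev_candidates`, and the `(filenode, linknode)` pairs are hypothetical names, not real Mercurial APIs; assume the pairs come from some scan that records every changeset introducing a given filenode.

```python
# Hypothetical sketch of the {fctxnode: [linknode]} map. Only
# ambiguous filenodes (introduced by more than one changeset) are
# stored, keeping the map small as proposed in the thread.

def build_duplicate_map(intro_pairs):
    """intro_pairs: iterable of (filenode, linknode) tuples, one per
    changeset that introduces the filenode (an assumed input)."""
    seen = {}
    for filenode, linknode in intro_pairs:
        seen.setdefault(filenode, []).append(linknode)
    # Drop unambiguous entries: their stored linkrev is already correct.
    return {fn: lns for fn, lns in seen.items() if len(lns) > 1}

def introrev_candidates(dupmap, filenode, default_linknode):
    """If the filenode is not in the map, its single stored linkrev is
    correct and the expensive _adjustlinkrev walk can be skipped."""
    return dupmap.get(filenode, [default_linknode])
```

The open question raised above (keeping the map up to date) is untouched here; this only shows the fast-path lookup.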
>
> The long-term solution would be something different, ideally each fctx has
> one unique linkrev, and any filelog is a subgraph of the changelog.
>
> It seems we have to add some noise when creating a new filenode to address
> the problem. But hash change is a big BC. So one crazier idea is to extend
> the hash - we still have 16 bytes unused for nodeid everywhere. So it would
> be like:
>
>    - filenode would have extra N bytes for each revision, purely noise
>    - manifest would be the complex part - we need to record those N bytes for
>      each file, while we don't want to have BC or hash change.
>      The solution would be like to have a separated manifest just storing
>      those noises for each file path. So the hash of the main manifest does
>      not change - thus compatible with old clients.
>      Manifest nodeid would be extended by N bytes as well.
>    - changelog nodeid would have a similar change - extra N bytes
>      representing the noise.
I don't fully understand.  You say the extra bytes don't affect the hash 
of the main manifest.  If they don't affect the hash, how do they fix 
the problem?  If they do affect it, where does the change happen?
>
> That sounds ugly anyway. If we can make a big BC on the hash (just like
> treemanifest), I think it may be time to change more things, like:
I honestly don't think we'll be able to make a BC breaking hash change 
in upstream Mercurial.  Nor do I think the benefits outweigh the 
downsides for the community.
>
>    - Include the order of p1, p2 in the hash
The order is already p1-then-p2. Why does it need recording?
>    - Like git, hash file blobs without parentids, say it's blobid, so we can
>      reuse blobs for copies, reverts, and across repos.
>    - Generate some noise (like, uuid) for each changeset being created
>      (thanks Durham for this idea), in filelog, let nodeid = hash((blobid,
>      p1, p2, intro-changeset-uuid)). The uuid could be reused for the amend
>      case.
(this was Matt's idea.  I just regurgitated it)
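The hashing scheme sketched in that list can be written down concretely. Everything below is an assumption for illustration only: the function names, the choice of SHA-1, and the field layout are not Mercurial's actual implementation, just a literal reading of `nodeid = hash((blobid, p1, p2, intro-changeset-uuid))`.

```python
import hashlib
import uuid

def blobid(data):
    # Git-style content-only hash, so identical blobs can be shared
    # across copies, reverts, and repos (per the proposal above).
    return hashlib.sha1(data).digest()

def filenode(blob, p1, p2, intro_uuid):
    # Proposed filelog nodeid: content hash + parents + a per-changeset
    # UUID, making each fctx's introducing changeset unambiguous. The
    # UUID could be reused on amend so the filenode stays stable.
    h = hashlib.sha1()
    for part in (blobid(blob), p1, p2, intro_uuid):
        h.update(part)
    return h.digest()

# Each new changeset would mint its own noise:
intro = uuid.uuid4().bytes
```

Note how two changesets introducing byte-identical file revisions now get distinct filenodes (different UUIDs), which is exactly what removes the linkrev ambiguity.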
>
> Notes:
>
>    - The noise is a dig at my own "reduce-noise" attempt. I was planning to
>      move amend_source etc. to obsstore. But on second thought, I
>      believe we need some kind of noise for performance reasons in
>      single-file traversal.
>    - We could provide a command to hash the tree if we have the blob layer,
>      for users who want to check whether two commits have exactly the same
>      file contents.
>
> Excerpts from Pierre-Yves David's message of 2016-09-27 16:08:33 +0200:
>> On 09/24/2016 07:59 PM, Jun Wu wrote:
>>> By the way, I just encountered an _adjustlinkrev-related perf issue when
>>> working on fastannotate internally. introrev (_adjustlinkrev) can take
>>> seconds (and sadly it just returns ctx.linkrev() most of the time). Seconds
>>> are unacceptable compared with the annotate logic, which only takes tens of
>>> milliseconds.
>> Did you check that the ancestry context was properly inherited
>> from one check to the next? That helps with the cost of adjustlinkrev.
>>
>>> I think a stupid way to make it better is to have a map of {fctxnode:
>>> [linknode]}. To make the map small, it only stores entries where
>>> len([linknode]) > 1. The challenge would be how to keep the map up-to-date.
>> A smaller step would be to have a fast way to know if a fnode is used
>> more than once. We can skip linkrev adjustment for anything with only
>> one entry.
>>
>> A "multiple entries" flag has an interesting similarity with other
>> properties we are considering bitmaps for: "public" and "obs". It is a
>> "forward only" property, so we could reuse transaction-safe logic for
>> all of them.
>>
>>
>> Building a very basic version of this (eg: no cache invalidation, simple
>> json on disk) would be a good way to get some number and see how much it
>> actually improves performance.
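The "very basic version" suggested above (simple JSON on disk, no cache invalidation, no transaction safety) might look like the following. The file name, layout, and helper names are all assumptions for experimentation, not anything Mercurial ships.

```python
import json
import os
import tempfile

def save_duplicates(dupmap, path):
    # JSON keys must be strings, so callers would hex-encode binary
    # fnodes before storing them. No locking, no invalidation: this is
    # only meant to produce numbers, as suggested above.
    with open(path, "w") as f:
        json.dump(dupmap, f)

def load_duplicates(path):
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def needs_adjustment(dupmap, fnode_hex):
    # Skip the linkrev walk unless the fnode is known to be ambiguous.
    return fnode_hex in dupmap
```

Timing annotate (or log --follow) on a large repo with and without `needs_adjustment` short-circuiting the adjustment would give the numbers asked for.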


