Linkrev perf

Jun Wu quark at fb.com
Tue Sep 27 17:15:22 EDT 2016


Excerpts from Durham Goode's message of 2016-09-27 13:31:22 -0700:
> On 9/27/16 8:24 AM, Jun Wu wrote:
> > I changed the subject to reflect the actual topic.
> >
> > The map or whatever structure to replace one single linkrev to a list of
> > "introrev"s (or flags indicating it is not unique) is just a hack.
> Sure it's imperfect, but it won't require any BC breakages, and it will 
> likely fix the problem for the majority of use cases (by letting us only 
> scan the changelog .i for potential ancestor linknodes and avoid 
> scanning the changelog .d).

Suppose the map is {fctx: [linkrev]}.

First, the map should be as small as possible, so it shouldn't contain any
fctx whose linkrev list has only one revision - in that case, the linkrev is
already accurate.

Then the problem becomes how to keep the map up to date, which is not easy
imo. For example, say Alice has a map storing {fctx: [linkrev]} for entries
with len([linkrev]) > 1, and Bob also has such a map locally. When Bob pulls
from Alice, it is possible that we need new entries that exist in neither
Alice's map nor Bob's - how do you find those entries in an efficient way?
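The pull problem above can be shown with a toy model (not Mercurial's actual
data structures - `build_cache`, the `(rev, filenode)` pairs, and the repo
lists are all illustrative): a filenode that is unique in both Alice's and
Bob's repos becomes ambiguous only once their histories are combined, so
neither side's cache contains the entry the merged repo needs.

```python
def build_cache(changelog):
    """changelog: list of (rev, filenode) pairs.

    Return only the entries with more than one introducing revision,
    since those are the only ones worth storing in the map."""
    intro = {}
    for rev, filenode in changelog:
        intro.setdefault(filenode, []).append(rev)
    return {fn: revs for fn, revs in intro.items() if len(revs) > 1}

# Alice and Bob each introduce filenode "abc" exactly once, so neither
# local cache has an entry for it.
alice = [(0, "abc")]
bob = [(0, "abc")]
assert build_cache(alice) == {} and build_cache(bob) == {}

# After Bob pulls from Alice, "abc" has two introducing revisions: the
# merged repo needs a cache entry that existed on neither side.
merged = [(0, "abc"), (1, "abc")]
assert build_cache(merged) == {"abc": [0, 1]}
```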

> > The long-term solution would be something different, ideally each fctx has
> > one unique linkrev, and any filelog is a subgraph of the changelog.
> >
> > It seems we have to add some noise when creating a new filenode to address
> > the problem. But hash change is a big BC. So one crazier idea is to extend
> > the hash - we still have 16 bytes unused for nodeid everywhere. So it would
> > be like:
> >
> >    - filenode would have extra N bytes for each revision, purely noise
> >    - manifest would be the complex part - we need to record those N bytes for
> >      each file, while we don't want to have BC or hash change.
> >      The solution would be like to have a separated manifest just storing
> >      those noises for each file path. So the hash of the main manifest does
> >      not change - thus compatible with old clients.
> >      Manifest nodeid would be extended by N bytes as well.
> >    - changelog nodeid would have a similar change - extra N bytes
> >      representing the noise.
> I don't fully understand.  You say the extra bytes don't affect the hash 
> of the main manifest.  If it's not affecting the hash, how is it fixing 
> the problem?  If it does affect the hash, where does it affect it?

It won't affect the first 20 bytes of the manifest hash. For example, a
manifest entry looks like:
  
  path: 20-byte hash, 8-byte noise
  ^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^
   |                    |
   |                    +- only affect the extra 8 bytes of a manifest hash
   +- only affect first 20 bytes of a manifest hash

Manifest hash: 20-byte 8-byte
               ^^^^^^^ ^^^^^^
                |       |
                |       +- only read by new clients
                +-- compatible with old clients

Similar to changelog hash - first 20 bytes are for old clients, while the
extra 8 bytes are for new clients.
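The split layout above can be sketched in Python. This is a hedged toy
illustration of the proposal, not Mercurial code: the function names and the
choice of 8 noise bytes are assumptions, and the classic part is modeled on
the revlog-style sha1 over sorted parents plus text.

```python
import hashlib
import os

N = 8  # extra noise bytes proposed in the thread (illustrative choice)

def extended_nodeid(p1, p2, text):
    """Return a 28-byte nodeid: 20 classic bytes + N bytes of noise.

    The noise is NOT fed into the first 20 bytes, so old clients,
    which only read those 20 bytes, see an unchanged hash."""
    classic = hashlib.sha1(min(p1, p2) + max(p1, p2) + text).digest()
    noise = os.urandom(N)
    return classic + noise

def classic_part(nodeid):
    return nodeid[:20]   # all that old clients ever read

def noise_part(nodeid):
    return nodeid[20:]   # only read by new clients
```

Because the first 20 bytes are computed exactly as before, `classic_part`
stays verifiable by old clients, while new clients use `noise_part` to tell
otherwise-identical revisions apart.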

> > That sounds ugly anyway. If we can make a big BC on the hash (just like
> > treemanifest), I think it may be the time to change more things, like:
> I honestly don't think we'll be able to make a BC breaking hash change 
> in upstream Mercurial.  Nor do I think the benefits outweigh the 
> downsides for the community.

Umm... I was just wondering because treemanifest is a hash BC. Since we are
in the process of migrating to treemanifest, we still have a chance to make
other hash changes along with the migration. Although it seems unlikely to
happen.

> >    - Include the order of p1, p2 in the hash
> The order is already p1-then-p2. Why does it need recording?

Currently we hash sorted([p1, p2]), so if you swap p1 and p2 directly in the
revlog, "hg verify" still passes. Since the order does affect some behavior,
like "hg diff", it would be better to include the order in the hash, so that
after a swap "hg verify" will complain.
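The difference can be demonstrated in a few lines. `nodeid_sorted` models
the current scheme (roughly sha1(min(p1, p2) + max(p1, p2) + text)), while
`nodeid_ordered` is a hypothetical order-sensitive variant; both names are
illustrative, not Mercurial functions.

```python
import hashlib

def nodeid_sorted(text, p1, p2):
    # current scheme: parents are sorted before hashing, so the
    # hash is symmetric in p1 and p2
    return hashlib.sha1(min(p1, p2) + max(p1, p2) + text).digest()

def nodeid_ordered(text, p1, p2):
    # hypothetical order-sensitive scheme: a p1/p2 swap changes the hash
    return hashlib.sha1(p1 + p2 + text).digest()

p1, p2, text = b"\x01" * 20, b"\x02" * 20, b"hello"

# Swapping parents is invisible to the sorted hash...
assert nodeid_sorted(text, p1, p2) == nodeid_sorted(text, p2, p1)
# ...but detectable (e.g. by "hg verify") with the ordered one.
assert nodeid_ordered(text, p1, p2) != nodeid_ordered(text, p2, p1)
```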

> >    - Like git, hash file blobs without parentids, say it's blobid, so we can
> >      reuse blobs for copies, reverts, and across repos.
> >    - Generate some noise (like, uuid) for each changeset being created
> >      (thanks Durham for this idea), in filelog, let nodeid = hash((blobid,
> >      p1, p2, intro-changeset-uuid)). The uuid could be reused for the amend
> >      case.
> (this was Matt's idea.  I just regurgitated it)

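The blobid/uuid idea quoted above could look roughly like this. All names
here are hypothetical sketches of the proposal, not real Mercurial APIs:
the file content is hashed alone (git-style) so identical contents share
one blob across copies and reverts, and the filelog nodeid mixes in the
parents plus a per-changeset uuid so each introducing changeset gets a
distinct nodeid.

```python
import hashlib
import uuid

def blobid(content):
    # content-only hash: identical file contents share one blob,
    # reusable across copies, reverts, and even repos
    return hashlib.sha1(content).digest()

def filelog_nodeid(content, p1, p2, intro_uuid):
    # nodeid = hash((blobid, p1, p2, intro-changeset-uuid)), so the
    # same content introduced twice yields two distinct nodeids
    return hashlib.sha1(blobid(content) + p1 + p2 + intro_uuid).digest()

null = b"\x00" * 20

# Same content, same parents, but two different introducing changesets:
# one shared blob, two distinct nodeids - which removes the linkrev
# ambiguity this thread is about.
n1 = filelog_nodeid(b"data", null, null, uuid.uuid4().bytes)
n2 = filelog_nodeid(b"data", null, null, uuid.uuid4().bytes)
assert blobid(b"data") == blobid(b"data")
assert n1 != n2
```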

More information about the Mercurial-devel mailing list