Linkrev perf

Gregory Szorc gregory.szorc at gmail.com
Tue Sep 27 13:04:43 EDT 2016


On Tue, Sep 27, 2016 at 8:24 AM, Jun Wu <quark at fb.com> wrote:

> I changed the subject to reflect the actual topic.
>
> The map, or whatever structure replaces a single linkrev with a list of
> "introrev"s (or a flag indicating it is not unique), is just a hack.
>
> The long-term solution would be something different: ideally, each fctx
> has one unique linkrev, and any filelog is a subgraph of the changelog.
>
> It seems we have to add some noise when creating a new filenode to
> address the problem. But a hash change is a big BC. So one crazier idea
> is to extend the hash - we still have 12 bytes unused for the nodeid
> everywhere. So it would be like:
>
>   - filenode would have extra N bytes for each revision, purely noise
>   - manifest would be the complex part - we need to record those N bytes
>     for each file, while we don't want a BC or a hash change.
>     The solution would be to have a separate manifest storing just those
>     noise bytes for each file path, so the hash of the main manifest does
>     not change - thus staying compatible with old clients.
>     The manifest nodeid would be extended by N bytes as well.
>   - changelog nodeid would have a similar change - extra N bytes
>     representing the noise.
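>
> A rough sketch of the extended filenode, in Python (N = 8 is an
> arbitrary choice here, and all names are made up):
>
>   import hashlib
>   import os
>
>   N = 8  # extra noise bytes appended after the classic 20-byte nodeid
>
>   def extendedfilenode(text, p1, p2):
>       # the classic part stays sha1(min(p1, p2) + max(p1, p2) + text),
>       # so old clients keep seeing exactly the hashes they expect
>       s = hashlib.sha1()
>       s.update(min(p1, p2) + max(p1, p2))
>       s.update(text)
>       classic = s.digest()
>       noise = os.urandom(N)  # recorded in the separate noise manifest
>       return classic + noise  # 20 + N byte extended nodeid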
>
> That sounds ugly anyway. If we can make a big BC on the hash (just like
> treemanifest), I think it may be time to change more things, like:
>
>   - Include the order of p1, p2 in the hash
>   - Like git, hash file blobs without their parentids - call it a blobid
>     - so we can reuse blobs for copies, reverts, and across repos.
>   - Generate some noise (like a uuid) for each changeset being created
>     (thanks Durham for this idea); in the filelog, let nodeid =
>     hash((blobid, p1, p2, intro-changeset-uuid)). The uuid could be
>     reused for the amend case.
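>
> In rough Python, the filelog part could look like this (the blob framing
> is made up; only the shape of the hash matters):
>
>   import hashlib
>
>   def blobid(text):
>       # content-only hash, like a git blob: no parentids mixed in, so
>       # identical contents share one blob across copies and reverts
>       return hashlib.sha1(b"blob\0" + text).digest()
>
>   def filenodeid(text, p1, p2, introuuid):
>       # p1/p2 order is now significant, and the per-changeset uuid
>       # keeps the nodeid unique per introduction (reused on amend)
>       s = hashlib.sha1()
>       s.update(blobid(text))
>       s.update(p1 + p2)
>       s.update(introuuid)
>       return s.digest()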
>

As you noted, things get real ugly real quick. There are also implications
for changegroup transfer, which does hardcode 20 bytes for the hash.

I tend to think of linkrev as just another piece of revision-derived
metadata that could be "cached" better. When you view it through that lens,
you realize there are tons of other things we could also throw into a
generic "derived from revision" cache. This could even include parts of
changeset data, such as the author, branch, and files changed. Depending on
how the cache is implemented (say you used SQLite), this could result in
some significant performance wins for certain operations. For example, an
index of file changes in changesets would make `hg log --removed`
significantly faster. If you moved the changelog to something that wasn't a
revlog, it would also make shallow and narrow clones much easier to
implement, since out-of-order insertions would be allowed.
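
To make that concrete, here is a minimal sketch of a SQLite-backed index of
file changes (schema and location invented for illustration):

    import sqlite3

    db = sqlite3.connect('.hg/cache/derived.sqlite')
    db.execute('CREATE TABLE IF NOT EXISTS filechanges '
               '(rev INTEGER, path BLOB)')
    db.execute('CREATE INDEX IF NOT EXISTS filechanges_path '
               'ON filechanges (path)')

    def revstouching(path):
        # `hg log --removed FILE` becomes one indexed lookup instead of
        # a walk over every changeset in the repository
        return [r for (r,) in db.execute(
            'SELECT rev FROM filechanges WHERE path = ?', (path,))]

Rows in a table like this can also be inserted in any order, unlike entries
appended to a revlog, which is what makes the shallow and narrow cases
easier.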

Thinking even bigger picture, there seem to be enough potential
improvements across many areas to - and I don't say this lightly - consider
a new store format. Regardless of whether we decide to take that giant
leap, I think it is important to keep a list of "known problems" and
"things we would fix if we implemented things from scratch." Perhaps the
"internals" help pages could start documenting these things?


>
> Notes:
>
>   - The noise is a dig at my own "reduce-noise" attempt. I was planning
>     to move amend_source etc. to obsstore. But on second thought, I
>     believe we have to have some kind of noise for the performance of
>     single-file traversal.
>   - We could provide a command to hash the tree if we have the blob
>     layer, for users who want to check whether two commits have exactly
>     the same file contents.
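>
> For example, assuming a manifest that maps paths to blobids:
>
>   import hashlib
>
>   def treeid(manifest):
>       # equal for two commits iff they have exactly the same file
>       # contents, no matter how the history got there
>       s = hashlib.sha1()
>       for path, blobid in sorted(manifest.items()):
>           s.update(path + b"\0" + blobid + b"\n")
>       return s.hexdigest()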
>
> Excerpts from Pierre-Yves David's message of 2016-09-27 16:08:33 +0200:
> > On 09/24/2016 07:59 PM, Jun Wu wrote:
> > > By the way, I just encountered an _adjustlinkrev-related perf issue
> > > when working on fastannotate internally. introrev (_adjustlinkrev)
> > > can take seconds (and sadly it just returns ctx.linkrev() most of the
> > > time). Seconds are unacceptable compared with the annotate logic,
> > > which only takes tens of milliseconds.
> >
> > Did you check that the ancestry context was properly inherited from one
> > check to the next? This helps with the cost of _adjustlinkrev.
> >
> > > I think a stupid way to make it better is to have a map of
> > > {fctxnode: [linknode]}. To make the map small, it only stores entries
> > > where len([linknode]) > 1. The challenge would be how to keep the map
> > > up-to-date.
> >
> > A smaller step would be to have a fast way to know if a fnode is used
> > more than once. We can skip linkrev adjustment for anything with only
> > one entry.
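> >
> > A throwaway sketch of computing that flag for one path (the repo walk
> > is written naively, just to show the idea):
> >
> >   from collections import defaultdict
> >
> >   def ambiguousfnodes(repo, path):
> >       # a file node introduced by exactly one changeset has a
> >       # trustworthy linkrev; only the rest need adjustment
> >       count = defaultdict(int)
> >       for rev in repo:
> >           ctx = repo[rev]
> >           if path in ctx.files() and path in ctx:
> >               count[ctx.filenode(path)] += 1
> >       return {fnode for fnode, n in count.items() if n > 1}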
> >
> > A "multiple entry flag" have this interesting similarity with other
> > proper with are considering bitmap for: "public", "obs". It is a
> > "forward only" property and we could reuse transaction safe logic for
> > all of them.
> >
> >
> > Building a very basic version of this (eg: no cache invalidation,
> > simple json on disk) would be a good way to get some numbers and see
> > how much it actually improves performance.
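> >
> > As a strawman (file name and format invented), the on-disk part could
> > be as small as:
> >
> >   import json
> >
> >   def writelinkrevmap(repo, mapping):
> >       # mapping: hex fnode -> list of hex linknodes; only ambiguous
> >       # entries (more than one candidate) are worth storing
> >       data = {f: l for f, l in mapping.items() if len(l) > 1}
> >       repo.vfs.write('cache/linkrevmap.json',
> >                      json.dumps(data).encode('ascii'))
> >
> >   def readlinkrevmap(repo):
> >       try:
> >           return json.loads(repo.vfs.read('cache/linkrevmap.json'))
> >       except IOError:
> >           return {}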