[PATCH 3 of 3] rebase: use matcher to optimize manifestmerge

Mon Mar 20 04:14:04 EDT 2017

On Sun, 19 Mar 2017 12:00:58 -0700, Durham Goode wrote:
> # HG changeset patch
> # User Durham Goode <durham at fb.com>
> # Date 1489949694 25200
> #      Sun Mar 19 11:54:54 2017 -0700
> # Node ID 800c452bf1a44f9f817174c69443121f4ed4c3b8
> # Parent  d598e42fa629195ecf43f438b71603df9fb66d6d
> rebase: use matcher to optimize manifestmerge
> 
> The old merge code would call manifestmerge and calculate the complete diff
> between the source and the destination. In many cases, like rebase, the vast
> majority of differences between the source and destination are irrelevant
> because they are differences between the destination and the common ancestor
> only, and therefore don't affect the merge. Since most actions are 'keep', all
> the effort to compute them is wasted.
> 
> Instead, let's compute the difference between the source and the common ancestor
> and only perform the diff of those files against the merge destination. When
> using treemanifest, this lets us avoid loading almost the entire tree when
> rebasing from a very old ancestor. This speeds up rebase of an old stack of 27
> commits by 20x.

Looks generally good to me, but this needs more eyes.
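
For anyone skimming the thread, here is a rough sketch of the idea in the
commit message, with plain dicts standing in for manifests. The names diff()
and restricted_diff() are made up for illustration and are not Mercurial APIs;
real manifests and matchers behave differently in detail.

```python
# Hypothetical stand-ins: each "manifest" is a {path: node} dict, not a real
# Mercurial manifest object.

def diff(a, b):
    """Return {path: (a_node, b_node)} for every path whose entry differs."""
    return {p: (a.get(p), b.get(p))
            for p in set(a) | set(b)
            if a.get(p) != b.get(p)}

def restricted_diff(m1, m2, ma):
    """Diff m1 against m2, looking only at paths that m2 changed since ma."""
    relevant = set(diff(ma, m2))
    return {p: (m1.get(p), m2.get(p))
            for p in relevant
            if m1.get(p) != m2.get(p)}

ma = {'a': 1, 'b': 1, 'c': 1}          # common ancestor
m1 = {'a': 2, 'b': 1, 'c': 1, 'd': 1}  # destination: changed a, added d
m2 = {'a': 1, 'b': 2, 'c': 1}          # source: changed b only

full = diff(m1, m2)                    # touches a, b and d
narrow = restricted_diff(m1, m2, ma)   # touches b only
```

Here the full diff reports three paths, while the restricted one reports only
'b', the single path the source actually changed.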

> @@ -819,6 +819,27 @@ def manifestmerge(repo, wctx, p2, pa, br
>          if any(wctx.sub(s).dirty() for s in wctx.substate):
>              m1['.hgsubstate'] = modifiednodeid
>  
> +    # Don't use m2-vs-ma optimization if:
> +    # - ma is the same as m1 or m2, which we're just going to diff again later
> +    # - The matcher is set already, so we can't override it
> +    # - The caller specifically asks for a full diff, which is useful during bid
> +    #   merge.
> +    if (pa not in ([wctx, p2] + wctx.parents()) and
> +        matcher is None and not forcefulldiff):

Is this optimization also better for a normal merge, where m2 might be far
from m1?

> +        # Identify which files are relevant to the merge, so we can limit the
> +        # total m1-vs-m2 diff to just those files. This has significant
> +        # performance benefits in large repositories.
> +        relevantfiles = set(ma.diff(m2).keys())
> +
> +        # For copied and moved files, we need to add the source file too.
> +        for copykey, copyvalue in copy.iteritems():
> +            if copyvalue in relevantfiles:
> +                relevantfiles.add(copykey)
> +        for movedirkey in movewithdir.iterkeys():
> +            relevantfiles.add(movedirkey)
> +        matcher = matchmod.match(repo.root, '',
> +                                 ('path:%s' % p for p in relevantfiles))

Perhaps we could use scmutil.matchfiles() here. Also, patterns shouldn't be a
generator, since it may be evaluated as a boolean and a generator object is
always true, even when it yields nothing.
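
To illustrate the boolean concern in isolation, here is a minimal sketch.
has_patterns() is a hypothetical helper that only mimics a callee doing
"if patterns:"; it is not the actual match code.

```python
def has_patterns(patterns):
    # Hypothetical stand-in for a callee that tests "if patterns:" before
    # deciding how to build a matcher.
    return bool(patterns)

relevantfiles = set()  # nothing relevant to the merge

# A generator expression is an object, so it is truthy even when empty.
as_generator = ('path:%s' % p for p in relevantfiles)

# A list reflects its contents, so an empty one is falsy.
as_list = ['path:%s' % p for p in relevantfiles]

generator_result = has_patterns(as_generator)
list_result = has_patterns(as_list)
```

With an empty relevantfiles, the generator version claims there are patterns
and the list version does not, so building a list first avoids the surprise.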