[PATCH 1 of 3 v4] graft: support grafting across move/copy (issue4028)

Thu Aug 4 10:37:14 EDT 2016

On 08/04/2016 04:23 PM, Gábor STEFANIK wrote:
> -----Original Message-----
>> From: Pierre-Yves David [mailto:pierre-yves.david at ens-lyon.org]
>> Sent: Thursday, August 04, 2016 3:53 PM
>> To: Gábor STEFANIK <Gabor.STEFANIK at nng.com>; mercurial-
>> devel at mercurial-scm.org
>> Subject: Re: [PATCH 1 of 3 v4] graft: support grafting across move/copy
>> (issue4028)
>>
>>
>>
>> On 08/03/2016 10:23 PM, Gábor Stefanik wrote:
>>> # HG changeset patch
>>> # User Gábor Stefanik <gabor.stefanik at nng.com> # Date 1470255644 -7200
>>> #      Wed Aug 03 22:20:44 2016 +0200
>>> # Node ID aa1f05a3ecdc720ddfd62821784c30d1f1f698cf
>>> # Parent  73ff159923c1f05899c27238409ca398342d9ae0
>>> graft: support grafting across move/copy (issue4028)
>>
>> Thanks for make the effort to split your series. However, splitting the test
>> from the code change seems like a bad idea, so patch 1 and 2 should
>> probably goes back together.
>
> Will do.
>
>>
>> Splitting the update fix in patch 3 is great. But could you build a test case to
>> make sure we do not regress the behavior.
>
> There is an additional part 4 now, which updates some tests due to slightly different
> output from the update command (specifically, a "searching for copies" message),
> which also implicitly acts as a regression test.
>
>>
>> More preparatory changes can probably be extracted from patch 1 tool.
>
> The only one that could be extracted meaningfully is the _silent option.
> The other changes don't really make sense without each other.

lets extract silent then.

>> I'm also getting more familiar with the logic of this patch and this seems neat,
>> looking forward to have it done.
>>
>>> Graft performs a merge with a false common ancestor, which must be
>>> taken into account when tracking copies. Explicitly pass the real
>>> common ancestor in this case, and track copies between the real and false
>> common ancestors in reverse.
>>>
>>> With this change, when grafting a commit with a change to a file moved
>>> earlier on the graft's source branch, the change is merged as expected
>>> into the original
>>> (unmoved) file, rather than recreating it under its new name.
>>> It should also make it possible to eventually enable cross-branch
>>> updates with merge.
>>>
>>> v2: handle the case when target branch also has a rename
>>> v3: address review comments
>>> v4: move ancestry check to mergecopies, split tests to separate commit
>>>
>>> diff --git a/mercurial/copies.py b/mercurial/copies.py
>>> --- a/mercurial/copies.py
>>> +++ b/mercurial/copies.py
>>> @@ -8,6 +8,7 @@
>>>  from __future__ import absolute_import
>>>
>>>  import heapq
>>> +import operator
>>>
>>>  from . import (
>>>      node,
>>> @@ -231,7 +232,7 @@
>>>      return _chain(x, y, _backwardrenames(x, a),
>>>                    _forwardcopies(a, y, match=match))
>>>
>>> -def _computenonoverlap(repo, c1, c2, addedinm1, addedinm2):
>>> +def _computenonoverlap(repo, c1, c2, addedinm1, addedinm2,
>> silent=False):
>>>      """Computes, based on addedinm1 and addedinm2, the files exclusive
>> to c1
>>>      and c2. This is its own function so extensions can easily wrap this call
>>>      to see what files mergecopies is about to process.
>>> @@ -242,10 +243,10 @@
>>>      u1 = sorted(addedinm1 - addedinm2)
>>>      u2 = sorted(addedinm2 - addedinm1)
>>>
>>> -    if u1:
>>> +    if u1 and not silent:
>>>          repo.ui.debug("  unmatched files in local:\n   %s\n"
>>>                        % "\n   ".join(u1))
>>> -    if u2:
>>> +    if u2 and not silent:
>>>          repo.ui.debug("  unmatched files in other:\n   %s\n"
>>>                        % "\n   ".join(u2))
>>>      return u1, u2
>>
>> As pointed in a previous review, it would be nice to split this function change
>> in its own patches.
>
> Is it really necessary? Histedit doesn't support splitting patches, it would have to be done entirely by hand.

Actually, histedit does. You can use 'edit' and partially commit some of 
the change before using histedit --continue.

>>> @@ -321,6 +322,23 @@
>>>      if repo.ui.configbool('experimental', 'disablecopytrace'):
>>>          return {}, {}, {}, {}
>>>
>>> +    # In certain scenarios (e.g. graft, update or rename), ca can be
>> overridden
>>> +    # We still need to know a real common ancestor in this case
>>> +    wc = repo[None]
>>> +    graft = False
>>> +    cta = ca
>>> +    # ca.descendant(wc) and ca.descendant(ca) are False, work around that
>>> +    _c1 = c1.p1() if c1 == wc else c1
>>
>> To detect if a ctx is the working copy, you can use `c1.rev() is None`.
>> This seems simpler.
>
> Will do.
>
>>
>>> +    _c2 = c2.p1() if c2 == wc else c2
>>> +    if ca == wc: # the working copy can never be a common ancestor
>>> +        graft = True
>>
>> I'm a bit confused about the meaning of this case. Can you elaborate?
>
> The working copy can never be an ancestor of anything (except for itself
> if we treat "ancestor" and "descendant" inclusively). So, the only way for
> the working copy to be a common ancestor of c1 and c2 is if c1 == c2 == wc.
> But then we would bail out early due to c1 == c2, so if we got here with
> the working copy as ca, then it must be a fake common ancestor, so we need
> to use the graft logic.

So are we aware of any case where this happen? Or are we asserting this 
is not somethig we'll ever see. If so I would prever put a small assert 
and move on.

>>> +    if not (ca == _c1 or ca.descendant(_c1)):
>>> +        graft = True
>>> +    if not (ca == _c2 or ca.descendant(_c2)):
>>> +        graft = True
>>
>> We should probably turn these 'if' into 'elif' because once one is True
>> there is no need to seach any further. This would save use some
>> descendant call.
>
> Will do.
>
>>
>>> +    if graft:
>>> +        cta = _c1.ancestor(_c2)
>>
>> a graft is the only case where cta != ca right ? could we just compute
>> `_c1.ancestor(_c2)` and compare it to ca directly ? declaring a graft if
>> they do not match?
>
> There can be more than one MRCA of _c1 and _c2, which is exactly what happens during a bid merge.
> In that case, it's possible to have _c1.ancestor(_c2) return one MRCA, and have the other MRCA
> passed to us in ca. Also, ca could be a non-most-recent common ancestor, and we still wouldn't need
> to use the more expensive graft logic.

A yes, good point. Maybe add a small comment about that?

>> (I'm also tempted to says that upper layer probably already know if they
>> might be performing a graft of not and could pass that along for
>> performance reason. But let's stick to  your approach and see how it
>> perform)
>
> Upper layer knows when it's performing a graft proper, but not necessarily when it's performing a graft-like operation.
> As we have seen, graft is not the only way to trigger bug 4028 - it also occurs during update, rebase and unshelve.
>
> We can't trust our callers to know whether the merge/update they are trying to do is graft-like, unless we require
> all callers to perform the same tests that we have here, which would negate any performance gain.

Okay :)

>
>>
>>> +
>>>      limit = _findlimit(repo, c1.rev(), c2.rev())
>>>      if limit is None:
>>>          # no common ancestor, no copies
>>> @@ -330,28 +348,44 @@
>>>      m1 = c1.manifest()
>>>      m2 = c2.manifest()
>>>      ma = ca.manifest()
>>> +    mta = cta.manifest()
>>>
>>> +    # see checkcopies documentation below for these dicts
>>>      copy1, copy2, = {}, {}
>>> +    copyfrom, copyto = {}, {}
>>>      movewithdir1, movewithdir2 = {}, {}
>>>      fullcopy1, fullcopy2 = {}, {}
>>>      diverge = {}
>>>
>>>      # find interesting file sets from manifests
>>> -    addedinm1 = m1.filesnotin(ma)
>>> -    addedinm2 = m2.filesnotin(ma)
>>> -    u1, u2 = _computenonoverlap(repo, c1, c2, addedinm1, addedinm2)
>>> +    addedinm1 = m1.filesnotin(mta)
>>> +    addedinm2 = m2.filesnotin(mta)
>>> +    u1, u2 = _computenonoverlap(repo, c1, c2, addedinm1, addedinm2,
>> graft)
>>> +    if not graft:
>>> +        unmatched = u1 + u2
>>> +    else: # need to recompute this for directory move handling when
>> grafting
>>> +        unmatched = operator.add(*_computenonoverlap(repo, c1, c2,
>>> +                                 m1.filesnotin(ma), m2.filesnotin(ma), False))
>>> +
>>
>> So last time I pointed we should remove this operator.add. But you had a
>> comment about memory. can you elaborate?
>
> This way, computenonoverlap returns two lists, which operator.add can combine
> in-place. If we instead saved the results in two temporary lists, it would have to
> then duplicate them for the addition, so that a later write into the temporary list
> doesn't change the sum.
>
> That's assuming the Python interpreter supports move semantics, just like C++11.
> (I'm pretty sure it does.)

I'm pretty sure you are way too optimistic about the Python interpreter. 
I'm fairly certain it allocate two lists objects without doing anything 
smart.

>
>>
>>>      bothnew = sorted(addedinm1 & addedinm2)
>>>
>>>      for f in u1:
>>> -        checkcopies(c1, f, m1, m2, ca, limit, diverge, copy1, fullcopy1)
>>> +        checkcopies(c1, f, m1, m2, ca, cta, limit, diverge, copy1,
>>> +                    fullcopy1, copyfrom, copyto)
>>>
>>>      for f in u2:
>>> -        checkcopies(c2, f, m2, m1, ca, limit, diverge, copy2, fullcopy2)
>>> +        checkcopies(c2, f, m2, m1, ca, cta, limit, diverge, copy2,
>>> +                    fullcopy2, copyfrom, copyto)
>>>
>>>      copy = dict(copy1.items() + copy2.items())
>>>      movewithdir = dict(movewithdir1.items() + movewithdir2.items())
>>>      fullcopy = dict(fullcopy1.items() + fullcopy2.items())
>>>
>>> +    # combine partial copy paths discovered in the previous step
>>> +    for f in copyfrom:
>>> +        if f in copyto:
>>> +            copy[copyto[f]] = copyfrom[f]
>>> +
>>>      renamedelete = {}
>>>      renamedeleteset = set()
>>>      divergeset = set()
>>> @@ -369,10 +403,13 @@
>>>      if bothnew:
>>>          repo.ui.debug("  unmatched files new in both:\n   %s\n"
>>>                        % "\n   ".join(bothnew))
>>> -    bothdiverge, _copy, _fullcopy = {}, {}, {}
>>> +    bothdiverge = {}
>>> +    _copy, _fullcopy, _copyfrom, _copyto = {}, {}, {}, {} # dummy dicts
>>>      for f in bothnew:
>>> -        checkcopies(c1, f, m1, m2, ca, limit, bothdiverge, _copy, _fullcopy)
>>> -        checkcopies(c2, f, m2, m1, ca, limit, bothdiverge, _copy, _fullcopy)
>>> +        checkcopies(c1, f, m1, m2, ca, cta, limit, bothdiverge, _copy,
>>> +                    _fullcopy, _copyfrom, _copyto)
>>> +        checkcopies(c2, f, m2, m1, ca, cta, limit, bothdiverge, _copy,
>>> +                    _fullcopy, _copyfrom, _copyto)
>>>      for of, fl in bothdiverge.items():
>>>          if len(fl) == 2 and fl[0] == fl[1]:
>>>              copy[fl[0]] = of # not actually divergent, just matching renames
>>> @@ -438,7 +475,7 @@
>>>                        (d, dirmove[d]))
>>>
>>>      # check unaccounted nonoverlapping files against directory moves
>>> -    for f in u1 + u2:
>>> +    for f in unmatched:
>>>          if f not in fullcopy:
>>>              for d in dirmove:
>>>                  if f.startswith(d):
>>> @@ -452,7 +489,8 @@
>>>
>>>      return copy, movewithdir, diverge, renamedelete
>>>
>>> -def checkcopies(ctx, f, m1, m2, ca, limit, diverge, copy, fullcopy):
>>> +def checkcopies(ctx, f, m1, m2, ca, cta, limit, diverge, copy, fullcopy,
>>> +                copyfrom, copyto):
>>
>> Unrelated to your series, I'm noticing that checkcopies is only using
>> inside "mercurial/copies.py" we could probably name it _checkcopies in
>> the future to make this clearer.
>
> Not sure if it's not open for use in extensions.

We have no official API. no core extension uses it.

>
>>
>>>      """
>>>      check possible copies of f from m1 to m2
>>>
>>> @@ -460,14 +498,19 @@
>>>      f = the filename to check
>>>      m1 = the source manifest
>>>      m2 = the destination manifest
>>> -    ca = the changectx of the common ancestor
>>> +    ca = the changectx of the common ancestor, overridden on graft
>>
>> So, in the graft case, this is not a common ancestors, right? maybe we
>> could rename this to "base". (just a though, don't feel obligated to do
>> it, but if you want to this would make a great preparatory patch)
>
> I would prefer not to make the patch series more complex than it needs to be.

That's fine.

>
>>
>>> +    cta = topological common ancestor for graft-like scenarios
>>>      limit = the rev number to not search beyond
>>>      diverge = record all diverges in this dict
>>>      copy = record all non-divergent copies in this dict
>>>      fullcopy = record all copies in this dict
>>> +    copyfrom = source sides of partially known copy tracks
>>> +    copyto = destination sides of partially known copytracks
>>>      """
>>>
>>>      ma = ca.manifest()
>>> +    mta = cta.manifest()
>>> +    backwards = f in ma # graft common ancestor already contains the
>> rename
>>>      getfctx = _makegetfctx(ctx)
>>>
>>>      def _related(f1, f2, limit):
>>> @@ -513,20 +556,32 @@
>>>              continue
>>>          seen.add(of)
>>>
>>> -        fullcopy[f] = of # remember for dir rename detection
>>> +        # remember for dir rename detection
>>> +        if backwards:
>>> +            fullcopy[of] = f # grafting backwards through renames
>>> +        else:
>>> +            fullcopy[f] = of
>>>          if of not in m2:
>>>              continue # no match, keep looking
>>>          if m2[of] == ma.get(of):
>>>              break # no merge needed, quit early
>>>          c2 = getfctx(of, m2[of])
>>> -        cr = _related(oc, c2, ca.rev())
>>> +        cr = _related(oc, c2, cta.rev())
>>>          if cr and (of == f or of == c2.path()): # non-divergent
>>> -            copy[f] = of
>>> +            if backwards:
>>> +                copy[of] = f
>>> +            else:
>>> +                copy[f] = of
>>>              of = None
>>>              break
>>>
>>>      if of in ma:
>>>          diverge.setdefault(of, []).append(f)
>>> +    elif of in mta:
>>> +        if backwards:
>>> +            copyfrom[of] = f
>>> +        else:
>>> +            copyto[of] = f
>>
>> In the case where 'ca' is 'cta', this means we'll do the same (possibly
>> costly) membership test twice. As this is the common case, could we make
>> the code a bit smarter here?
>
> The alternative is "elif cta != ca and of in mta", which trades a membership test for a complex-object equality test, not sure which one is faster.

The object should easily compare in O(1) (with eventually a int 
comparison) the manifest membership test will get significant more 
costly as the manifest grow and complexify.

Cheers,

-- 
Pierre-Yves David