D6255: copies: calculate mergecopies() based on pathcopies()

martinvonz (Martin von Zweigbergk) phabricator at mercurial-scm.org
Tue Apr 16 16:31:54 EDT 2019


martinvonz added a comment.


  In https://phab.mercurial-scm.org/D6255#91015, @marmoute wrote:
  
  > I did a first path through it, the new code seems reasonable and easier 
  >  to read than the previous one. Some comments and questions below.
  
  
  The rest somehow didn't make it here, so I'll copy from the email (i.e. the below is from Pierre-Yves, not from me):
  
  I did a first path through it, the new code seems reasonable and easier 
  to read than the previous one. Some comments and questions below.
  
  On 4/16/19 7:19 PM, martinvonz (Martin von Zweigbergk) wrote:
  
  > martinvonz created this revision.
  >  Herald added subscribers: mercurial-devel, mjpieters.
  >  Herald added a reviewer: hg-reviewers.
  > 
  > REVISION SUMMARY
  >     When copies are stored in changesets, we need a changeset-centric
  >     version of mergecopies() just like we have a changeset-centric version
  >     of pathcopies(). I think the natural way of thinking about
  >     mergecopies() is in terms of pathcopies() from the base to each of the
  >     commits. So if we can rewrite mergecopies() based two such
  >     pathcopies() calls, we'll get the changeset-centric version for
  >     free. That's what this patch does.
  
  I like the approach, if I understand it correctly, we'll have less 
  duplicated code in the end.
  
  >     
  >     A nice bonus is that it ends up being a lot simpler. mergecopies() has
  >     accumulated a lot of technical debt over time. One good example is the
  >     code for dealing with grafts (the "partial/incomplete/dirty"
  >     stuff). Since pathcopies() already deals with backwards renames and
  >     ping-pong renames, we get that for free.
  >     
  >     I've run tests with hard-coded debug logging for "fullcopy" and while
  >     I haven't looked at every difference it produces, all the ones I have
  >     looked at seemed reasonable to me.
  
  How many difference did you had? can you share some example of them with 
  their explanation?
  
  on related topic, you seems to be fixing a case in 
  test-fastannotate-hg.t and test-annotate-hg.t that is probably worth 
  mentioning.
  
  >     One drawback of the rewritten code is that we may now make
  >     remotefilelog prefetch more files. We used to prefetch files that were
  >     unique to either side of the merge compared to the other. We now
  >     prefetch files that are unique to either sise of the merge compared to
  
  sise → side
  
  >     the base. This means that if you added the same file to each side, we
  >     would not prefetch it before, but we would now. Such cases are
  >     probably quite rare, but one likely scenario where they happen is when
  >     moving from a commit to its successor (or the other way around). The
  >     user will probably already have the files in the cache in such cases,
  >     so it's probably not a big deal.
  >     
  >     Some timings for calculating mergecopies between two revisions (all
  >     using the common ancestor as base):
  
  Which revisions did you pick?
  
  (for the record, the benchmark suite use 1daa622bbe42 and 76caed42cf7c)
  
  >     In the hg repo:
  >     4.8 4.8: 0.21s -> 0.21s
  >     4.0 4.8: 0.35s -> 0.63s
  >     
  >     In and old copy of the mozilla-unified repo:
  >     FIREFOX_BETA_60_BASE^ FIREFOX_BETA_60_BASE: 0.51s -> 0.60s
  >     FIREFOX_NIGHTLY_59_END FIREFOX_BETA_60_BASE: 2.1s -> 2.3s
  >     FIREFOX_BETA_59_END FIREFOX_BETA_60_BASE: 3.1s -> 3.3s
  >     FIREFOX_AURORA_50_BASE FIREFOX_BETA_60_BASE: 30s -> 35s
  >     
  >     So it's measurably slower in most cases. Note that merge copies are
  >     not calculated when updating with a clean working copy, which is
  >     probably the most common case. I therefore think the much simpler code
  >     is worth the slowdown.
  
  Do you know where the slowdown comes from ?
  
  > REPOSITORY
  >     https://phab.mercurial-scm.org/diffusion/HG/ Mercurial
  > 
  > REVISION DETAIL
  >     https://phab.mercurial-scm.org/D6255
  > 
  > AFFECTED FILES
  >     mercurial/copies.py
  >     tests/test-annotate.t
  >     tests/test-fastannotate-hg.t
  >     tests/test-graft.t
  >     tests/test-rename-merge2.t
  > 
  > CHANGE DETAILS
  > 
  > diff --git a/tests/test-rename-merge2.t b/tests/test-rename-merge2.t
  > 
  > - a/tests/test-rename-merge2.t +++ b/tests/test-rename-merge2.t @@ -433,6 +433,9 @@
  > 
  >      --------------
  >      test L:nc a b R:up b   W:       - 12 merge b no ancestor
  >      --------------
  >  +    all copies found (* = to merge, ! = divergent, % = renamed and deleted):
  >  +     src: 'a' -> dst: 'b'
  >  +    checking for directory renames
  >      resolving manifests
  >       branchmerge: True, force: False, partial: False
  >       ancestor: 924404dff337, local: 86a2aa42fc76+, remote: af30c7647fc7
  >  @@ -469,6 +472,9 @@
  >      --------------
  >      test L:up b   R:nm a b W:       - 13 merge b no ancestor
  >      --------------
  >  +    all copies found (* = to merge, ! = divergent, % = renamed and deleted):
  >  +     src: 'a' -> dst: 'b'
  >  +    checking for directory renames
  >      resolving manifests
  >       branchmerge: True, force: False, partial: False
  >       ancestor: 924404dff337, local: 59318016310c+, remote: bdb19105162a
  >  @@ -506,6 +512,9 @@
  >      --------------
  >      test L:nc a b R:up a b W:       - 14 merge b no ancestor
  >      --------------
  >  +    all copies found (* = to merge, ! = divergent, % = renamed and deleted):
  >  +     src: 'a' -> dst: 'b'
  >  +    checking for directory renames
  >      resolving manifests
  >       branchmerge: True, force: False, partial: False
  >       ancestor: 924404dff337, local: 86a2aa42fc76+, remote: 8dbce441892a
  >  @@ -543,6 +552,9 @@
  >      --------------
  >      test L:up b   R:nm a b W:       - 15 merge b no ancestor, remove a
  >      --------------
  >  +    all copies found (* = to merge, ! = divergent, % = renamed and deleted):
  >  +     src: 'a' -> dst: 'b'
  >  +    checking for directory renames
  >      resolving manifests
  >       branchmerge: True, force: False, partial: False
  >       ancestor: 924404dff337, local: 59318016310c+, remote: bdb19105162a
  >  @@ -580,6 +592,9 @@
  >      --------------
  >      test L:nc a b R:up a b W:       - 16 get a, merge b no ancestor
  >      --------------
  >  +    all copies found (* = to merge, ! = divergent, % = renamed and deleted):
  >  +     src: 'a' -> dst: 'b'
  >  +    checking for directory renames
  >      resolving manifests
  >       branchmerge: True, force: False, partial: False
  >       ancestor: 924404dff337, local: 86a2aa42fc76+, remote: 8dbce441892a
  >  @@ -617,6 +632,9 @@
  >      --------------
  >      test L:up a b R:nc a b W:       - 17 keep a, merge b no ancestor
  >      --------------
  >  +    all copies found (* = to merge, ! = divergent, % = renamed and deleted):
  >  +     src: 'a' -> dst: 'b'
  >  +    checking for directory renames
  >      resolving manifests
  >       branchmerge: True, force: False, partial: False
  >       ancestor: 924404dff337, local: 0b76e65c8289+, remote: 4ce40f5aca24
  >  @@ -653,6 +671,9 @@
  >      --------------
  >      test L:nm a b R:up a b W:       - 18 merge b no ancestor
  >      --------------
  >  +    all copies found (* = to merge, ! = divergent, % = renamed and deleted):
  >  +     src: 'a' -> dst: 'b'
  >  +    checking for directory renames
  >      resolving manifests
  >       branchmerge: True, force: False, partial: False
  >       ancestor: 924404dff337, local: 02963e448370+, remote: 8dbce441892a
  >  @@ -695,6 +716,9 @@
  >      --------------
  >      test L:up a b R:nm a b W:       - 19 merge b no ancestor, prompt remove a
  >      --------------
  >  +    all copies found (* = to merge, ! = divergent, % = renamed and deleted):
  >  +     src: 'a' -> dst: 'b'
  >  +    checking for directory renames
  >      resolving manifests
  >       branchmerge: True, force: False, partial: False
  >       ancestor: 924404dff337, local: 0b76e65c8289+, remote: bdb19105162a
  >  diff --git a/tests/test-graft.t b/tests/test-graft.t
  > 
  > - a/tests/test-graft.t +++ b/tests/test-graft.t @@ -75,6 +75,8 @@
  > 
  >    
  >      $ hg graft -r 2 --base 3
  >      grafting 2:5c095ad7e90f "2"
  >  +  note: possible conflict - c was deleted and renamed to:
  >  +   a
  >      note: graft of 2:5c095ad7e90f created no changes to commit
  >    
  >    Can't continue without starting:
  >  @@ -220,6 +222,9 @@
  >      committing changelog
  >      updating the branch cache
  >      grafting 5:97f8bfe72746 "5"
  >  +    all copies found (* = to merge, ! = divergent, % = renamed and deleted):
  >  +     src: 'c' -> dst: 'b'
  >  +    checking for directory renames
  >      resolving manifests
  >       branchmerge: True, force: True, partial: False
  >       ancestor: 4c60f11aa304, local: 6b9e5368ca4e+, remote: 97f8bfe72746
  >  @@ -233,6 +238,9 @@
  >      $ HGEDITOR=cat hg graft 4 3 --log --debug
  >      scanning for duplicate grafts
  >      grafting 4:9c233e8e184d "4"
  >  +    all copies found (* = to merge, ! = divergent, % = renamed and deleted):
  >  +     src: 'c' -> dst: 'b'
  >  +    checking for directory renames
  >      resolving manifests
  >       branchmerge: True, force: True, partial: False
  >       ancestor: 4c60f11aa304, local: 1905859650ec+, remote: 9c233e8e184d
  >  @@ -1129,7 +1137,6 @@
  >      grafting 2:f58c7e2b28fa "C0"
  >      merging f1e and f1b to f1e
  >      merging f2a and f2c to f2c
  >  -  merging f5b and f5a to f5a
  >    
  >    Test the cases A.1 (f4x) and A.7 (f3x).
  >    
  >  diff --git a/tests/test-fastannotate-hg.t b/tests/test-fastannotate-hg.t
  > 
  > - a/tests/test-fastannotate-hg.t +++ b/tests/test-fastannotate-hg.t @@ -273,37 +273,10 @@
  > 
  >      > EOF
  >      $ hg ci -mc -d '3 0'
  >      created new head
  >  -BROKEN: 'a' was copied to 'b' on both sides. We should not get a merge conflict here
  >      $ hg merge
  >      merging b
  >  -  warning: conflicts while merging b! (edit, then use 'hg resolve --mark')
  >  -  0 files updated, 0 files merged, 0 files removed, 1 files unresolved
  >  -  use 'hg resolve' to retry unresolved file merges or 'hg merge --abort' to abandon
  >  -  [1]
  >  -  $ cat b
  >  -  <<<<<<< working copy: b80e3e32f75a - test: c
  >  -  a
  >  -  z
  >  -  a
  >  -  ||||||| base
  >  -  =======
  >  -  a
  >  -  a
  >  -  a
  >  -  b4
  >  -  c
  >  -  b5
  >  -  >>>>>>> merge rev:    64afcdf8e29e - test: mergeb
  >  -  $ cat <<EOF > b
  >  -  > a
  >  -  > z
  >  -  > a
  >  -  > b4
  >  -  > c
  >  -  > b5
  >  -  > EOF
  >  -  $ hg resolve --mark -q
  >  -  $ rm b.orig
  >  +  0 files updated, 1 files merged, 0 files removed, 0 files unresolved
  >  +  (branch merge, don't forget to commit)
  >      $ echo d >> b
  >      $ hg ci -mmerge2 -d '4 0'
  >    
  >  diff --git a/tests/test-annotate.t b/tests/test-annotate.t
  > 
  > - a/tests/test-annotate.t +++ b/tests/test-annotate.t @@ -273,37 +273,10 @@
  > 
  >      > EOF
  >      $ hg ci -mc -d '3 0'
  >      created new head
  >  -BROKEN: 'a' was copied to 'b' on both sides. We should not get a merge conflict here
  >      $ hg merge
  >      merging b
  >  -  warning: conflicts while merging b! (edit, then use 'hg resolve --mark')
  >  -  0 files updated, 0 files merged, 0 files removed, 1 files unresolved
  >  -  use 'hg resolve' to retry unresolved file merges or 'hg merge --abort' to abandon
  >  -  [1]
  >  -  $ cat b
  >  -  <<<<<<< working copy: b80e3e32f75a - test: c
  >  -  a
  >  -  z
  >  -  a
  >  -  ||||||| base
  >  -  =======
  >  -  a
  >  -  a
  >  -  a
  >  -  b4
  >  -  c
  >  -  b5
  >  -  >>>>>>> merge rev:    64afcdf8e29e - test: mergeb
  >  -  $ cat <<EOF > b
  >  -  > a
  >  -  > z
  >  -  > a
  >  -  > b4
  >  -  > c
  >  -  > b5
  >  -  > EOF
  >  -  $ hg resolve --mark -q
  >  -  $ rm b.orig
  >  +  0 files updated, 1 files merged, 0 files removed, 0 files unresolved
  >  +  (branch merge, don't forget to commit)
  >      $ echo d >> b
  >      $ hg ci -mmerge2 -d '4 0'
  >    
  >  diff --git a/mercurial/copies.py b/mercurial/copies.py
  > 
  > - a/mercurial/copies.py +++ b/mercurial/copies.py @@ -373,57 +373,6 @@
  > 
  >    
  >        return u1, u2
  >    
  >  -def _makegetfctx(ctx):
  >  -    """return a 'getfctx' function suitable for _checkcopies usage
  > 
  > - -    We have to re-setup the function building 'filectx' for each -    '_checkcopies' to ensure the linkrev adjustment is properly setup for -    each. Linkrev adjustment is important to avoid bug in rename -    detection. Moreover, having a proper '_ancestrycontext' setup ensures -    the performance impact of this adjustment is kept limited. Without it, -    each file could do a full dag traversal making the time complexity of -    the operation explode (see issue4537). - -    This function exists here mostly to limit the impact on stable. Feel -    free to refactor on default. -    """ -    rev = ctx.rev() -    repo = ctx._repo -    ac = getattr(ctx, '_ancestrycontext', None) -    if ac is None: -        revs = [rev] -        if rev is None: -            revs = [p.rev() for p in ctx.parents()] -        ac = repo.changelog.ancestors(revs, inclusive=True) -        ctx._ancestrycontext = ac -    def makectx(f, n): -        if n in node.wdirfilenodeids:  # in a working context? -            if ctx.rev() is None: -                return ctx.filectx(f) -            return repo[None][f] -        fctx = repo.filectx(f, fileid=n) -        # setup only needed for filectx not create from a changectx -        fctx._ancestrycontext = ac -        fctx._descendantrev = rev -        return fctx -    return util.lrucachefunc(makectx) - -def _combinecopies(copyfrom, copyto, finalcopy, diverge, incompletediverge): -    """combine partial copy paths""" -    remainder = {} -    for f in copyfrom: -        if f in copyto: -            finalcopy[copyto[f]] = copyfrom[f] -            del copyto[f] -    for f in incompletediverge: -        assert f not in diverge -        ic = incompletediverge[f] -        if ic[0] in copyto: -            diverge[f] = [copyto[ic[0]], ic[1]] -        else: -            remainder[f] = ic -    return remainder -
  > 
  >    def mergecopies(repo, c1, c2, base):
  >        """
  >        Finds moves and copies between context c1 and c2 that are relevant for
  >  @@ -526,168 +475,93 @@
  >        This is pretty slow when a lot of changesets are involved but will track all
  >        the copies.
  >        """
  >  -    # In certain scenarios (e.g. graft, update or rebase), base can be
  >  -    # overridden We still need to know a real common ancestor in this case We
  >  -    # can't just compute _c1.ancestor(_c2) and compare it to ca, because there
  >  -    # can be multiple common ancestors, e.g. in case of bidmerge.  Because our
  >  -    # caller may not know if the revision passed in lieu of the CA is a genuine
  >  -    # common ancestor or not without explicitly checking it, it's better to
  >  -    # determine that here.
  >  -    #
  >  -    # base.isancestorof(wc) is False, work around that
  >  -    _c1 = c1.p1() if c1.rev() is None else c1
  >  -    _c2 = c2.p1() if c2.rev() is None else c2
  >  -    # an endpoint is "dirty" if it isn't a descendant of the merge base
  >  -    # if we have a dirty endpoint, we need to trigger graft logic, and also
  >  -    # keep track of which endpoint is dirty
  >  -    dirtyc1 = not base.isancestorof(_c1)
  >  -    dirtyc2 = not base.isancestorof(_c2)
  >  -    graft = dirtyc1 or dirtyc2
  >  -    tca = base
  >  -    if graft:
  >  -        tca = _c1.ancestor(_c2)
  > 
  > - -    limit = _findlimit(repo, c1, c2) -
  > 
  >        m1 = c1.manifest()
  >        m2 = c2.manifest()
  >        mb = base.manifest()
  >    
  >  -    # gather data from _checkcopies:
  >  -    # - diverge = record all diverges in this dict
  >  -    # - copy = record all non-divergent copies in this dict
  >  -    # - fullcopy = record all copies in this dict
  >  -    # - incomplete = record non-divergent partial copies here
  >  -    # - incompletediverge = record divergent partial copies here
  >  -    diverge = {} # divergence data is shared
  >  -    incompletediverge  = {}
  >  -    data1 = {'copy': {},
  >  -             'fullcopy': {},
  >  -             'incomplete': {},
  >  -             'diverge': diverge,
  >  -             'incompletediverge': incompletediverge,
  >  -            }
  >  -    data2 = {'copy': {},
  >  -             'fullcopy': {},
  >  -             'incomplete': {},
  >  -             'diverge': diverge,
  >  -             'incompletediverge': incompletediverge,
  >  -            }
  >  +    copies1 = pathcopies(base, c1)
  >  +    copies2 = pathcopies(base, c2)
  >  +
  >  +    inversecopies1 = {}
  >  +    inversecopies2 = {}
  >  +    for dst, src in copies1.items():
  >  +        inversecopies1.setdefault(src, []).append(dst)
  >  +    for dst, src in copies2.items():
  >  +        inversecopies2.setdefault(src, []).append(dst)
  >  +
  >  +    copy = {}
  >  +    diverge = {}
  >  +    renamedelete = {}
  >  +    allsources = set(inversecopies1) | set(inversecopies2)
  >  +    for src in allsources:
  >  +        dsts1 = inversecopies1.get(src)
  >  +        dsts2 = inversecopies2.get(src)
  >  +        if dsts1 and dsts2:
  >  +            # copied/renamed on both sides
  >  +            if src not in m1 and src not in m2:
  >  +                # renamed on both sides
  >  +                dsts1 = set(dsts1)
  >  +                dsts2 = set(dsts2)
  >  +                # If there's some overlap in the rename destinations, we
  >  +                # consider it not divergent. For example, if side 1 copies 'a'
  >  +                # to 'b' and 'c' and deletes 'a', and side 2 copies 'a' to 'c'
  >  +                # and 'd' and deletes 'a'.
  
  Formatting could help this comment a bit. something like
  
  side 1: a -> b -> c
  side 2: a   ->    c -> d
  
  > +                if dsts1 & dsts2:
  >  +                    for dst in (dsts1 & dsts2):
  >  +                        copy[dst] = src
  >  +                else:
  >  +                    diverge[src] = sorted(dsts1 | dsts2)
  >  +            elif src in m1 and src in m2:
  >  +                # copied on both sides
  >  +                dsts1 = set(dsts1)
  >  +                dsts2 = set(dsts2)
  >  +                for dst in (dsts1 & dsts2):
  >  +                    copy[dst] = src
  >  +            # TODO: Handle cases where it was renamed on one side and copied
  >  +            # on the other side
  
  Is this TODO a regression or some ported limitation.
  
  > +        elif dsts1:
  >  +            # copied/renamed only on side 1
  >  +            if src not in m2:
  >  +                # deleted on side 2
  >  +                if src not in m1:
  >  +                    # renamed on side 1, deleted on side 2
  >  +                    renamedelete[src] = dsts1
  >  +            elif m2[src] != mb[src]:
  >  +                # modified on side 2
  >  +                for dst in dsts1:
  >  +                    if dst not in m2:
  >  +                        # dst not added on side 2 (handle as regular
  >  +                        # "both created" case in manifestmerge in that case)
  
  Can you elaborate a bit on what this case means (and how we deal with 
  it) especially, what happens if dst is indeed in m2 ?
  
  > +                        copy[dst] = src
  >  +        elif dsts2:
  >  +            # copied/renamed only on side 2
  >  +            if src not in m1:
  >  +                # deleted on side 1
  >  +                if src not in m2:
  >  +                    # renamed on side 2, deleted on side 1
  >  +                    renamedelete[src] = dsts2
  >  +            elif m1[src] != mb[src]:
  >  +                # modified on side 1
  >  +                for dst in dsts2:
  >  +                    if dst not in m1:
  >  +                        # dst not added on side 1 (handle as regular
  >  +                        # "both created" case in manifestmerge in that case)
  >  +                        copy[dst] = src
  
  Since this seems the very same code as the previous clause, I wonder if 
  we could factor it out. This would help to prevent subtle bug when we 
  update it. (the answer might be "no because python is slow).
  
  > +    renamedeleteset = set()
  >  +    divergeset = set()
  >  +    for src, dsts in diverge.items():
  >  +        divergeset.update(dsts)
  >  +    for src, dsts in renamedelete.items():
  >  +        renamedeleteset.update(dsts)
  
  small nits: Is there any reason not to use diverge.values and 
  renamedelete.values here ?

REPOSITORY
  rHG Mercurial

REVISION DETAIL
  https://phab.mercurial-scm.org/D6255

To: martinvonz, #hg-reviewers
Cc: marmoute, mjpieters, mercurial-devel


More information about the Mercurial-devel mailing list