cset 4ebc8693ce72 - convert: add filename filtering and renaming support

Alexis S. L. Carvalho alexis at cecm.usp.br
Fri Aug 17 18:33:57 CDT 2007


(Long, long message ahead, sorry.  But at least it has lots of pictures.)

> changeset:   5016:4ebc8693ce72
> user:        Bryan O'Sullivan <bos at serpentine.com>
> date:        Thu Jul 26 13:34:36 2007 -0700
> files:       hgext/convert/__init__.py hgext/convert/hg.py
> description:
> convert: add filename filtering and renaming support

I've finally managed to take a deeper look at this changeset.

I don't really like it.

For starters, it doesn't have any docs or tests for --filemap, but
that should be easy enough to fix.


> diff -r cb100605a516 -r 4ebc8693ce72 hgext/convert/hg.py
> --- a/hgext/convert/hg.py     Thu Jul 26 13:34:36 2007 -0700
> +++ b/hgext/convert/hg.py     Thu Jul 26 13:34:36 2007 -0700
> @@ -59,7 +59,10 @@ class mercurial_sink(converter_sink):
>              pass
>  
>      def putcommit(self, files, parents, commit):
> -        seen = {}
> +        if not files:
> +            return hex(self.repo.changelog.tip())
> +
> +        seen = {hex(nullid): 1}
>          pl = []
>          for p in parents:
>              if p not in seen:

This is broken for regular conversions.  I've pushed a fix with a
detailed changelog -
http://hg.intevation.org/mercurial/crew/rev/33015dac5df5


And the big problem:  it doesn't handle merges correctly - it doesn't
even notice that merges are special.

Take the following revision graph (time flows from the left to the
right; the names next to each revision are the files added/modified by
that revision):

       ---- B ------
      /    foo      \
     /               E
    /               / \
  A --------- C ----   \
 foo\        bar        \
 bar \                   F ------- G
 baz  \                 /         foo
       -------- D ------          bar
               baz                baz


Now, if we want to split off a repo that contains only foo, what revisions
should we include?  We certainly want at least A, B and G since they
change foo directly.  OTOH, C and D are completely uninteresting, and so
the merges at E and F are also uninteresting since they don't touch foo
and they have nothing to merge in the restricted graph.

After some hand-waving we arrive at the following graph (notice that G
has a single parent: B):

  foo:  A ---- B ---- G

In the same way:

  bar:  A ---- C ---- G

  baz:  A ---- D ---- G

Things start to get more interesting when we want more than one file.

Say we want foo and bar.  Then we definitely want A, B, C and G since
they modify those files directly.  D is still uninteresting and so the
merge at F is also uninteresting.  But the merge at E suddenly became
interesting, even though it doesn't directly touch foo and bar.

"Proof" by contradiction of my last assertion: if we don't want E, then
what should the resulting graph that includes A, B, C and G look like?

- we could make G a merge:

  A ---- B ---- G
    \         /
     --- C ---

  But this has 2 problems: G wasn't a merge in the original graph, and we
  don't have any tree that includes both foo and bar after their
  changes in B and C, respectively.

- since we can't make G a merge, we could add a dummy merge:

  A ---- B ---- X ---- G
    \         /
     --- C ---

  But then it'd be better to just take E instead of coming up with some
  dummy commit message (and notice that if, instead of having 2 files,
  you had N files, you could need up to N-1 dummy merges)

And after some additional hand-waving, we arrive at

  foo bar: A ---- B ---- E ---- G
             \         /
              --- C ---

In the same way:

  foo baz: A ---- B ---- F ---- G
             \         /
              --- D ---

  bar baz: A ---- C ---- F ---- G
             \         /
              --- D ---

And if we want all 3 files, we get the same original graph (this may not
always be the case - e.g. if the original graph included some empty
revisions...)

                     ---- B ------
                    /             \
                   /               E
                  /               / \
  foo bar baz:  A --------- C ----   \
                  \                   \
                   \                   F ------- G
                    \                 /
                     -------- D ------
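
The hand-waving above can be written down as a small rule.  What follows
is only a sketch in plain Python, not code from the convert extension:
filter_graph, the (name, parents, files) tuples and the REVS list are
made up for illustration, and "files" is assumed to mean the files a
revision changed compared to all of its parents.  A revision is kept if
it touches a wanted file, or if it still has more than one parent once
its parents are mapped into the restricted graph and redundant ones
(ancestors of another parent) are dropped.

```python
def filter_graph(revs, wanted):
    """Return {kept rev: [kept parents]} for the restricted graph.

    revs is a list of (name, parents, files) in topological order.
    """
    mapped = {}      # original rev -> rev standing in for it (or None)
    parents_of = {}  # kept rev -> its parents in the restricted graph
    ancestors = {}   # kept rev -> set of its strict ancestors (kept revs)
    for name, parents, files in revs:
        # Translate each parent to its stand-in in the restricted graph.
        cand = []
        for p in parents:
            m = mapped[p]
            if m is not None and m not in cand:
                cand.append(m)
        # Drop parents that are ancestors of another parent.
        cand = [p for p in cand
                if not any(p in ancestors[q] for q in cand)]
        if wanted & set(files) or len(cand) > 1:
            # Interesting: touches a wanted file or is a real merge.
            mapped[name] = name
            parents_of[name] = cand
            anc = set(cand)
            for p in cand:
                anc |= ancestors[p]
            ancestors[name] = anc
        else:
            # Uninteresting: its single parent (if any) stands in for it.
            mapped[name] = cand[0] if cand else None
    return parents_of


# The example graph from this message.
REVS = [
    ("A", [], ["foo", "bar", "baz"]),
    ("B", ["A"], ["foo"]),
    ("C", ["A"], ["bar"]),
    ("D", ["A"], ["baz"]),
    ("E", ["B", "C"], []),
    ("F", ["E", "D"], []),
    ("G", ["F"], ["foo", "bar", "baz"]),
]

print(filter_graph(REVS, {"foo"}))
# {'A': [], 'B': ['A'], 'G': ['B']}
print(filter_graph(REVS, {"foo", "bar"}))
# {'A': [], 'B': ['A'], 'C': ['A'], 'E': ['B', 'C'], 'G': ['E']}
```

Note that the sketch needs the ancestry of every kept revision, which is
more than a plain old-id to new-id map records.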

Now, what are we getting with current hg convert?  Depends on the source
repo type.

(When I say a merge is "strange", I mean one of its parents is an
ancestor of the other parent, even though this wasn't the case in the
original repo)
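
To make "strange" concrete, here is a minimal check, assuming a
converted graph is given as a dict from revision to its list of parents.
The names strict_ancestors and is_strange_merge are hypothetical and
exist only for this illustration.

```python
def strict_ancestors(parents_of, rev):
    """All ancestors of rev, not including rev itself."""
    seen = set()
    stack = list(parents_of[rev])
    while stack:
        r = stack.pop()
        if r not in seen:
            seen.add(r)
            stack.extend(parents_of[r])
    return seen


def is_strange_merge(parents_of, rev):
    """True if rev has two parents and one is an ancestor of the other."""
    ps = parents_of[rev]
    if len(ps) != 2:
        return False
    p1, p2 = ps
    return (p1 in strict_ancestors(parents_of, p2)
            or p2 in strict_ancestors(parents_of, p1))


# A merge E whose parents are C and A, where A is already C's parent.
strange = {"A": [], "C": ["A"], "E": ["C", "A"], "G": ["E"]}
# A merge E whose parents B and C are unrelated siblings.
proper = {"A": [], "B": ["A"], "C": ["A"], "E": ["B", "C"], "G": ["E"]}

print(is_strange_merge(strange, "E"))  # True
print(is_strange_merge(proper, "E"))   # False
```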

For an hg repo:

  foo: A ---- B ---- G                      (correct)

  bar: A ---- C ---- E ---- G               (E shouldn't be there and is
         \         /                         a strange merge)
          ---------

  baz: A ---- D ---- F ---- G               (F shouldn't be there and is
         \         /                         a strange merge)
          ---------

  foo bar: A ---- B ---- E ---- G           (correct)
             \         /
              --- C ---

  foo baz: A ---- B ---- F ---- G           (correct)
             \         /
              --- D ---

  bar baz: A ---- C ---- E ---- F ---- G    (E shouldn't be there and is
           | \         /      /              a strange merge)
           |  ---------      /
            \               /
             ------ D ------

For a git repo:

  foo: A ---- B ---- E ---- F ---- G          \
       | \         /      /                    |
       |  ---------      /                     |
        \               /                      |
         ---------------                       |
                                               |  E and F shouldn't
  bar: A ---- C ---- E ---- F ---- G            > be there; they are
       | \         /      /                    |  always strange
       |  ---------      /                     |
        \               /                      |
         ---------------                       |
                                               |
  baz: A ---- D ---- F ----- G                 |
         \         /                           |
          ---------                           /

  foo bar: A ---- B ---- E ---- F ---- G      F shouldn't be there;
           | \         /      /               F is strange
           |  --- C ---      /
            \               /
             ---------------

  foo baz: A ---- B ---- E ---- F ---- G      \
           | \         /      /                |
           |  ---------      /                 |
            \               /                  |
             ---- D --------                   |
                                                > E shouldn't be there;
  bar baz: A ---- C ---- E ---- F ---- G       |  E is strange
           | \         /      /                |
           |  ---------      /                 |
            \               /                  |
             ---- D --------                  /


Both backends convert the repo correctly when all three files are
kept; I haven't tried anything with CVS or Subversion.

If you look carefully, you'll notice that the main problem is merge
revisions being included even though we don't want them.  This
actually points to a deeper problem:

Convert expects that getchanges will give it enough data to reconstruct
a revision based on its parents.  To get per-file history (mostly)
correct, the git backend returns all files that have changed compared to
any parent, while (I think) the hg backend gets away with comparing the
merge to its first parent (but I haven't looked enough at it yet).

OTOH, to decide if we're interested in a merge revision while doing file
filtering, we want the list of files that changed compared to *all*
parents.
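
The difference between "any parent" and "all parents" is a union versus
an intersection.  A rough sketch, assuming a manifest is just a dict
from file name to some content id (the helper names and the numbers are
invented, this is not what either backend actually calls):

```python
def changed_files(manifest, parent_manifest):
    """Files whose entry differs between a revision and one parent."""
    names = set(manifest) | set(parent_manifest)
    return {f for f in names if manifest.get(f) != parent_manifest.get(f)}


def changed_vs_any(manifest, parent_manifests):
    """Union: files that differ from at least one parent."""
    result = set()
    for pm in parent_manifests:
        result |= changed_files(manifest, pm)
    return result


def changed_vs_all(manifest, parent_manifests):
    """Intersection: files that differ from every parent."""
    # A root revision is compared against an empty manifest.
    diffs = [changed_files(manifest, pm) for pm in parent_manifests or [{}]]
    return set.intersection(*diffs)


# Merge E from the example graph: B changed foo, C changed bar.
B = {"foo": 2, "bar": 1, "baz": 1}
C = {"foo": 1, "bar": 2, "baz": 1}
E = {"foo": 2, "bar": 2, "baz": 1}

print(sorted(changed_vs_any(E, [B, C])))  # ['bar', 'foo']
print(sorted(changed_vs_all(E, [B, C])))  # []
```

The union is what a sink needs to rebuild E on top of one parent; the
empty intersection says E itself changed nothing, so whether to keep it
depends only on whether it still merges anything in the restricted
graph.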

We also need some data about the ancestry of the parents of a merge
revision, even if these parents had already been converted by a previous
run.  But this is not saved in the revision map file, and regular
conversions don't need these data.

So, I think some way to split a repo can be very useful and I think the
code in the convert extension should be used there.  But I really think
the convert _command_ is the wrong place for it.

Alexis

