How the Convert extension works

This page documents the implementation of the ConvertExtension as of f29b674cc221 (in main or crew).

If you are a user, you should read the ConvertExtension page instead. If you still intend to read this page, you should read ConvertExtension first, so that you have a full understanding of what the implementation is implementing.

This article is a draft.

Inputs

Every VCS that the ConvertExtension supports as input is represented by a subclass of the abstract class common.converter_source. The repository to convert is represented by an instance of a source. For example, when converting from Subversion, the input repository is represented by an instance of convert.subversion.svn_source, which is a subclass of converter_source.

The converter_source class takes three arguments:

Additionally, it has a property named encoding, which is 'utf-8' by default.

If a converter_source subclass finds no valid repository at the given path, it raises the exception common.NoRepo. The subclass may also raise NoRepo if some library that it depends on isn't available.

The Subversion source (convert.subversion.svn_source) uses “url”, not “path”, as the name of its second argument, because Subversion works with URLs rather than paths. However, it does support local file paths; it uses the geturl function in the same module to silently upgrade a path to a URL. Additionally, the url argument is not optional.

Abstract Methods

before and after

Subclasses can implement these to perform any necessary preparation and clean-up.

getheads

Returns the revision identifiers that exist in the source repository. For most Subversion repositories, the heads are the revision numbers of the latest revision on the trunk and the latest revision on every branch.

getchanges

Takes a revision identifier as its only argument, and returns either a tuple of collections identifying all the files affected by that commit, or common.SKIPREV.

The former collection is a sorted list of tuples. The former object in each tuple is the filename; the latter object is the identifier of that file as of that revision.

Many sources will put the same identifier into every tuple—specifically, the identifier for the requested revision. The Subversion and Mercurial sources both do this. However, some VCSs (such as Git) assign a different identifier to every file-revision intersection; the Git source returns these identifiers.

The latter collection is a mapping of filenames to other filenames. This collection expresses copies: the key filename is one from the list of tuples, and is the destination of the copy, whereas the value filename is the file that was copied (the source of the copy).

If a file was not copied in the commit in question, it is not included in the mapping. This means that the mapping can be, and usually is, empty.

Instead of the pair of collections, the getchanges method can return common.SKIPREV, if the source wants to indicate that the requested commit should not be converted.

gettags

Returns a dictionary in which each key is a tag name and each value is a revision identifier for the commit that was tagged.

Outputs

Output classes are called sinks. Every VCS that the ConvertExtension supports as output is represented by a subclass of the abstract class common.converter_sink.

The converter_sink class takes two arguments:

Unlike the converter_source class, the path argument here is not optional.

Additionally, it has a property named created_files, which is a list that holds the paths to files that the sink has created, so that the sink can unlink them if the conversion fails. The abstract class does not handle this clean-up; it leaves it to the subclasses.

Abstract methods

before and after

Subclasses can implement these to perform any necessary preparation and clean-up.

getheads

Returns the revision identifiers that already exist in the destination repository.

authorfile

A subclass can implement this to return the path to a file within the repository where an authormap file may be found.

revmapfile

Every subclass must implement this to return the path to a file within the repository where the converter should store its revision map (described below).

setbranch

Sets the branch that future commits will be on (like the hg branch command). Takes two arguments: the branch name, and a list of parent branches.

Each parent branch in the list is described by a tuple. The former element is the revision identifier in the destination repository for the parent commit; the latter element is the name of the parent branch.

putcommit

This is the method that enters a commit into the destination repository. Every subclass must implement it.

It takes five arguments:

According to the docstring for converter_sink.putcommit, the source object in the fifth argument is only guaranteed to have getfile and getmode methods. In practice, it is always a converter_source, so it will implement all of that class's required methods (although you shouldn't need any others).

The convert command

The command is implemented in the convert.convcmd sub-module. Only the most basic requirements for a Mercurial extension command are in convert.__init__; the convert function there tail-calls convert.convcmd.convert.

The convert function calls two subroutines in the same module, convertsink and convertsource, to obtain sink and source instances for the destination and source repositories. These functions iterate the mappings of VCS names to sink/source classes, trying each class in turn on the specified destination and source repositories.

The final step in the function is to create an instance of convcmd.converter, which is the class that actually performs the conversion.

Anatomy of convcmd.converter

The class takes five arguments, all required:

Additionally, it has five properties:

The main method of the class is convert, which the top-level convcmd.convert function calls to do the work.

The revision map

The revision map associates each commit in the source repository with a commit in the destination repository. This is the converter's record of which commits it has already copied. If the user runs the converter again, it reads the revision map back in, and uses it to resume the conversion rather than start it over from the beginning.

A source revision identifier's matching value is usually a destination revision identifier, but may instead be common.SKIPREV. This indicates not that the commit has already been converted, but that the converter should skip it.

What the preceding paragraphs boil down to is that the converter will not copy a commit if its source revision identifier is in the revision map at all, on the assumption that either a previous run copied it or the source didn't want the converter to copy it.

A revision map is an instance of common.mapfile, a subclass of dict that reads its pairs in from a file, and updates that file whenever another object adds, changes, or removes a pair. The converter uses a file inside the destination repository, whose pathname it obtains from the sink's revmapfile method.

In the file format, common.SKIPREV is represented by the word “SKIP” in all uppercase letters. The ConvertExtension implements this by defining common.SKIPREV to that string.

The splice map

The splice map enables the user to revise history, giving a commit one or two different parents from the parent(s) it has in the source repository. By adding lines to the splice map, the user can splice one series of commits in between two other commits, remove commits from the history (by connecting their antecedent and descendant directly together), or forge a merge (by adding a second parent to a commit).

The file format of the splice map is simple: each splice is a line, with two or three revision identifiers separated by spaces. The first one is from the source repository, and names the commit whose parents are to be edited. The second and optional third are from either the source or destination repository, and name the commits that will be the new parents.

Like the revision map, the splice map is an instance of common.mapfile. Unlike the revision map, the converter does not change the contents of the splice map.

Commit ordering

Order is significant, as revision identifiers in Mercurial are dependent on the order of the commits. (Mercurial defines a revision identifier as the hash of a number of pieces of data from the commit, one of which is the revision number of the commit's parent.)

By default, the ConvertExtension copies commits in topological order, aka ancestral order. As you might guess from the latter name, this means only that a commit is guaranteed to come before a commit that depends on it.

With the --datesort option, the ConvertExtension instead copies commits in the order in which they were originally committed in the source repository. As long as humans are not capable of time travel and the repository itself has not been tampered with, this chronological sort is also a valid topological sort.

Both sorts are performed by the converter.toposort method.

The conversion process

Conversion truly begins in the converter.convert method, although most of the real work is still done in other methods (not to mention other classes).

First, the converter must determine the commits to copy. It starts by getting the list of heads from the source repository (using converter_source.getheads); then, it uses converter.walktree to find all the ancestors of those heads.

walktree returns an object that maps each commit to a list of its parents. All the commits in this mapping are those that have not yet been copied to the destination repository; when it encounters a commit that is in the converter's revision map, it skips that commit without putting it into the mapping.

The convert method calls the toposort method with this mapping to put them in order (see the section describing commit ordering). toposort takes the mapping, iterates the keys (which are revision identifiers from the source repository), builds a new list containing them in the sorted order, and returns that list.

Now it's time to begin copying commits. For every commit identified in the list, the convert method calls converter.copy with that revision identifier (from the source repository).

The copy method starts out by calling self.source.getchanges, passing the revision identifier. It checks for two unusual cases:

If neither of those cases is true, then getchanges returned the usual pair of collections (described above), and the copy method proceeds.

Next, it assembles a list of parent branches, then calls self.dest.setbranch with the branch name and that list. (See the description of that method, which covers what the list contains.)

The copy method then looks up the revision identifier from the source repository in the splice map. If the look-up succeeds, it looks up the parents named by the splice map in the revision map; otherwise, it uses the parent revision identifiers from the list of parent branches (which are already revision identifiers from the destination repository).

Now, finally, the copy method calls the sink's putcommit method. It passes the list of files, the mapping of copies, the list of parent revision identifiers from the destination repository, the commit object, and the source object, and receives a revision identifier from the destination repository for the newly-entered commit.

The last things that the copy method does before returning are to tell the source it converted the commit (by calling self.source.converted with both revision identifiers), and to enter the revision identifiers into the revision map.

We arrive back in the convert method, at the end of the loop.

Having finished copying commits, the convert method updates the tags in the destination repository. It asks the source for its dictionary of tags, then filters each tagged commit's revision identifier through the revision map (skipping any commit that is not in the map or is mapped to SKIPREV).

It then passes this list of destination revision identifiers to the sink's puttags method, which creates all those tags in the destination repository in a single commit and returns the identifier for that commit. The convert method then maps the source revision identifier for the last converted commit to the destination revision identifier for the tags-update commit, “so we don't end up with extra tag heads”.

The very last step is to write out the author map into a file in the destination repository (whose pathname the converter gets from the sink's authorfile method). This file includes any author mapping that the user specified in a custom author-map file when running the convert command.

After this, the converter performs clean-up. It calls the after methods of the sink, then the source, and closes its revision-map file.

The conversion is done.


CategoryInternals

ConvertExtensionImplementation (last edited 2013-01-22 00:11:54 by rcl)