This is an English version of a German article that was published in the magazine OBJEKTspektrum by SIGS-DATACOM in January 2011. The authors are:
MartinGeisler: A core developer of Mercurial. Lives in Zurich and work for aragost Trifork ag as a Mercurial consultant helping others with Mercurial.
- Aaron Digulla: A Mercurial user and works as a Senior Java Developer at Avanon in Zurich.
- Klaus Bucka-Lassen: A Mercurial user and managing director of aragost Trifork ag. He lives in Switzerland where he is mainly works as a technology consultant, project leader, architecht and software developer.
Fast, scalable revision control
Abstract: Distributed revision control is now widely used in the industry and by many open source projects. This article takes a closer look at Mercurial, one of the fastest and most user friendly distributed revision control systems, and explains how it works and why you would want to use one.
Just like Subversion solved many of the problems with CVS, distributed revision control systems like Mercurial and Git solve many of the issues with Subversion. They are faster, more flexible, much better at handling branches and merges, and encourage a better style of development.
The main issue with Subversion and similar centralized revision control tools is that they fail to help users when they need it most. Instead of being an ally that provides a safety net for your experiments, centralized tools discourage experiments and many developers come to see them more as a burden than a help. A new wave of distributed version control systems changes these things drastically, and this article will show you how.
The core feature of a distributed revision control system (DVCS) is the distribution of history — instead of having just a single repository stored on a central server, you now have many redundant repositories. In fact, each developer has his own, private repository.
By moving the full project history out to the developers own machine, a lot of operations suddenly become very fast. When things are fast, people tend to use them more. This means that developers will want to make more and better commits with Mercurial. When a commit takes 0.1 seconds instead of 3 seconds (or more), then you will begin making more logical commits. Small commits that each contain one coherent change are easier to review. This will be elaborated below.
Mercurial is an open source and cross platform system used by large organizations such as Mozilla Foundation (Firefox, Thunderbird, ...) and Sun (OpenSolaris, Java, OpenOffice...) as well as many other companies and open source projects.
There are a number of important concepts that must be understood to work efficiently with Mercurial.
Working with a Single Repository
The first is the notion of your working copy. This is where all the files in your project live and this is where you edit them. When you make a commit, the changes made in the working copy are saved as a new changeset. Changesets are also often called revisions, just like in Subversion. The changesets make up the project history and live in the repository, which is always stored in the .hg/ directory at the root of the working copy. See the figure below for the relationship between local and remote operations:
The working copy reflect a given checked out revision, the so-called working copy parent revision. When you ask Mercurial to show the file statuses, your files in the working copy are compared to how they looked in the parent revision. You can ask Mercurial to update the working copy to reflect another revision. When you commit, the changeset you create is always a child revision that descends from the current working copy parent revision. This parent information is stored in each changeset so that it knows where it came from.
Working with Several Repositories
Repositories can be cloned and this makes a full, independent copy of all the history. This is how a developer gets access to the source code initially: he clones it from some well-known location such as a central server in a company, or a homepage for an open source project. Having made his own private clone, the developer can explore the history and make new changesets, all without talking to the server he got the clone from. When new changesets are made in the server's repository, he will pull them into his own clone. The symmetric operation is push: he uses that to push his new changesets back to the server. The two commands simply compares the two repositories and transfers missing changesets.
When changesets are copied from one repository to another, the parent information stored in each changeset is critical: it is used to insert the changesets at their right position in the history. The result is a graph of changesets as shown below.
When changesets pulled and pushed between repositories, small branches appear in the history. They represent parallel lines of development, e.g., if Alice created a new changeset A and Bob created a new changeset B, and if both A and B descends from changeset X, then after pulling B into her clone, Alice will see two branches starting at X. The changesets A and B have no children and such changesets are called heads. Two heads represent divergent lines of development and should normally be merged back together. Small branches are created all the time. Compared to older tools such as Subversion, Mercurial has a much more robust merge engine that works correctly in face of renamed files and similar corner cases.
The command line interface for Mercurial is called hg, named after the element mercury. Just like the svn program for Subversion, the hg program has a number of sub-commands. The most important commands are listed below. Notice how they match the commands from Subversion closely — Mercurial has been designed to make transition from Subversion as painless as possible.
Most important commands in Mercurial
Commands operating on two repositories:
hg clone: create a new repository on your local machine by copying another repository.
hg pull: retrieve new changesets from anther repository.
hg push: send new changesets to another repository.
Commands operating on local repository only:
hg status: show modified/added/deleted/unknown files in the working copy.
hg diff: show modifications made to files in the working copy.
hg commit: save working copy changes as a new changeset.
hg update: update working copy to a given revision (often used after pull).
hg log: show information about changesets
hg merge: combine two lines of development.
hg init: create a new empty repository on your local machine.
The need for keeping track of source code has existed ever since the first source code was written. There are therefore a long tradition of making revision control systems. The first generation of revision control systems worked on a single file basis only. The classic RCS program (which is simply short for "Revision Control System") is an example of such a system and dominated the scene in the 1980s.
The second generation of revision control systems came with CVS, short for Concurrent Versions System. CVS and its proprietary cousins handle multiple files and support network operations. CVS enable several developers to work concurrently on the files and then merge their outstanding changes when committing them to a central server. CVS became extremely widespread in the 1990s. In 2000, Subversion (SVN) was released as a "better CVS" and has since then replaced many CVS installations.
The two first generations relied on a single central server to store the history. A third generation of revision control systems is starting to remove this central server and instead offer a fully distributed workflow instead. These distributed version control systems started to appear in the beginning of this millennium, but the development really took hold in 2005 with the release of Mercurial and Git.
Git and Mercurial both has the same historic background. In 2005, the source code of the Linux Kernel was kept in a proprietary system called BitKeeper. The kernel developers received a free license to BitKeeper under the condition that they would not themselves work on competing products. However, this free license was revoked in the beginning of 2005, when a kernel developer made a third-party tool that would talk to the BitKeeper servers.
This unleashed an enormous activity in the Linux Kernel development community where a small army of very skillful programmers suddenly found themselves without a revision control system. Hackers as they were, they sat down and made their own. Linus Torvalds made a system he called Git and another kernel developer, Matt Mackall, called his system Mercurial, abbreviated "hg" after the chemical symbol for mercury. Today, the Linux kernel is kept in Linus' system with a Mercurial mirror.
The Problems with Traditional Tools
Traditional version control systems such as Subversion use a client-server architectures with a central server to store the revisions of a project. This has two negative consequences:
- The clients must contact the server over the network for almost all operations. Subversion support diff without talking to the server, but that is it. It only supports a very limited form of diffs: you can compare a file with the working copy base revision, but not with any other revision.Contacting a server over a network connection takes much longer than accessing data on a local disk and this slows down development. The natural reaction from developers is to use the revision control tool less. Instead of making small, consistent commits they make larger commits in order to avoid talking to the server. Large commits that mix different changes are much more difficult to review and this hurts the code quality.
- With a centralized system, every commit is published immediately for all to see. It is not uncommon for people to wait with a commit because they are afraid of breaking the build. This means that people instead sit on their uncommitted changes for days or weeks without committing. So instead of encouraging developers to track their progress step by step (including false starts), a centralized tool makes people not want to use it.Branches are supposed to help solve this problem: each developer gets his own branch where he can make all the mistakes he want without disturbing the rest of the team. However, branches are only interesting if you can merge them again! Subversion has very fragile merge support: try merging a branch with a renamed file that has also been modified on trunk. Depending on the version, Subversion will either say there is a conflict or it will silently(!) throw away the change. This bug has been there for years.The fact is that merging is a hard problem and tools like Subversion do not support it very well. The result is that people either avoid branches and sit on their code for a long time, or that they use branches but are burned when they try to merge them again.
Benefits of Distributed Version Control
In Mercurial, the history is stored redundantly on several servers. Each developer has local access to the full history and this makes it very fast to work with. When switching to Mercurial, people naturally tend to make much more fine-grained commits simply because it's easy and fast.
Also, when developers commit changes to their own local repositories, the repositories begin to diverge. The histories are brought back in sync with a merge — so by necessity, Mercurial has excellent support for merges. Branches are created and merged back together on a daily basis and become a natural part of the workflow.
Branches are an enormously useful concept and with a distributed system, developers become used to working with them. They can therefore use them efficiently on all scales: for a small bugfix or for tracking major lines of development.
Mercurial gives a better workflow for the individual developer or within a single team. But being distributed also makes it possible to improve workflows that involve several teams. A great example of this is off-shoring. With Subversion all teams but one must suffer from extra-long network round trips since there is only one server.
Mercurial allows you to put a hub at each site. The developers use this to get quick, local access to each others changes. The hubs are then synchronized periodically. You simply don't get this kind of flexibility with a centralized system.
Changes flow from one repository to another and it is straight-forward to impose conditions on when a change can move to another repository. This can be used for mandatory code reviews: developers push their changes to a holding-repository where they sit until a reviewer approves or rejects them. This reviewer could be a team-leader or it could be an automated continuous integration system. When a change passes review, it is pushed further along to a central repository. The other developers will then see it and they can then integrate it into their own repository.
The preceding sections looked at Mercurial from a very high-level view — this section will dive right into the nitty-gritty details of how Mercurial is implemented. Mercurial tracks three things:
- Changes to file content (file edits)
- Changes to the file tree (added and removed files)
- Information about who made the above changes
The history of each file under revision control is kept in a filelog stored under .hg/store/data/. For files with a large history, there are actually two files: a file.i which holds an index and a file.d which holds the revision data itself. While this may look innocent in itself, it turns out that filenames are a somewhat messy subject.
Mercurial repositories are often put on memory sticks and carried around to different computers with different operating systems. Different operating systems have different opinions on what constitutes a valid filename: Linux filesystems generally allow any byte except "\0" and "/", Windows filesystems are more restrictive and disallows characters such as ":" and "?" but also normal looking filenames such as "aux" and "nul" (they are DOS device files and still reserved today). Mercurial needs to ensure that files stored under the .hg folder are readable on any filesystem. The filenames are therefore encoded to a safe subset of ASCII.
While this mapping ensures that a repository can be copied around between different machines, it does not ensure that one can always make a successful checkout. In other words, there is nothing that prevents a Linux user from adding a file called aux. The history for this file will be stored in.hg/store/data/au~78.d, which is a legal filename on a Windows filesystem. So a Windows user can pull the changeset and browse the history using TortoiseHg, say, but he cannot use hg update to checkout the file. There is an extension for Mercurial called caseguard which can be enabled to catch such problems. It was originally designed to prevent the more simple problem of a Linux user adding files that differ only in the case of the filename.
The file tree is tracked in a file called the manifest. The manifest looks similar to the output of ls -R except that there is a hash-value stored for each file. The manifest changes as files are added and removed from the tree, and the hash values change when the files are modified. To reconstruct the project as it stood at a certain revision, one simply has to reconstruct the right revision of the manifest and then use this to lookup the right revisions in the referenced filelogs.
Finally, the who-did-what part is tracked in file called the changelog. When a new commit is made, metadata such as username, date, and the commit message is recorded in the changelog file. The changelog also points to a particular revision of the manifest in order to tie the change to a particular file tree.
The Revision Log
The basic building block in Mercurial is the revlog, which is short for "revision log". This is an append-only data structure that is used to implement both filelogs, the manifest and the changelog itself. A revlog stores a series of revisions and does so in a simple, yet efficient way. When a new revision is added — such as a new version of a file in case of a filelog — the delta (difference) between the last revision is computed and compressed. The compressed delta is appended to the revlog. However, simply appending deltas like this would make it increasingly expensive to reconstruct new revisions of a given file since one would have to read and process all the differences since the very first version. To prevent this, the revlog automatically stores a compressed snapshot of the file if it determines that the chain of deltas has grown too large. This ensures that the time needed to reconstruct any revision of the file does not grow when the number of revisions grow.
Furthermore, the revlog is append-only: when a new revision is added, the compressed delta or the compressed snapshot is simply appended to the file. The primary benefit of this is that any revision of a file can be reconstructed using a single seek to find the beginning of the delta chain (or snapshot) and a single read operation to read the needed deltas (or snapshot). By avoiding seeks, the data can be read very efficiently by a normal hard drive. A secondary benefit is that appending to files is inherently more safe than rewriting them. If the computer crashes while Mercurial is appending to one of its files, then it is easy to recover to a known good state: simply truncate each file to its original length (which was stored at the beginning of the transaction). The user can then re-do the commit or pull operation.
Mercurial is a multiuser system where several users may push changesets to a central server at the same time while other users are pulling changesets. Mercurial must ensure that this is done in an orderly fashion without people stepping on each other's toes.
As described above, the final file data is found by first visiting the changelog. This points to a manifest, which in turn points to the filelogs. When new changesets are added to a repository, the data is written in the opposite order: first the filelogs are extended, then the manifest is written and finally the changelog is updated. This can be done concurrently with readers since they wont notice or care about the new data until they see it referenced from the changelog. Reading can thus be done without taking any locks at all. This is rather important since read operations are by far the most common operation and also because the reader quite often cannot expect to be able to take a lock due to filesystem permissions (you should be able to clone a repository with nothing but read-access to the files in the repository). Multiple writers need to lock the repository to serialize their access to the files.
Each changeset in Mercurial is given a unique identifier. With a centralized system it is easy to use sequential integers as identifiers: the clients simply ask the server for the next number in the sequence. With a distributed system where the clients must be able to work in complete isolation, something more clever is needed. First of all, it is clear that a simple integer sequence won't be enough when identifiers must be globally unique — if Alice calls her changeset number 37, then Bob won't know about it when he is working in a train without an Internet connection.
One possibility would be to assign a completely random identifier to each changeset. Given a sufficiently large space of identifiers, the risk of a collision can be made arbitrarily small. For example, one could make the risk smaller than the chance of winning the lottery — not just winning a single time, but winning again and again, every day of your life!
There is a better alternative, though: using a cryptographic hash function. A cryptographically secure hash function H takes an input string of arbitrary length and maps it to an output of fixed length, say, 160 bit. To be secure, a hash function must be hard to invert, that is, given a hash value h, it must be extremely difficult to find an m, such that H(m) = h. It should also be hard to find two different messages that hash to the same value. Hash functions play a key role in modern cryptographic systems.
Armed with a cryptographically secure hash function (Mercurial uses the industry standard SHA-1 function and we plan to switch to a more secure hash function when SHA-1 is broken in the future), it is now possible to construct changeset identifiers (changeset hashes) in a completely systematic way and still avoid the problem of Alice and Bob assigning the same identifier to different changesets. The key trick is to use the content of the change itself together with the changeset hashes of the parent changesets as input to the hash function.
Concretely, each Mercurial changeset is identified by a 40 character hexadecimal hash value — though only 12 characters is shown in the interface by default since this is normally more than enough to differentiate changesets. The hash value is computed from these components:
- The manifest hash
- The username
- The current date
- The commit message
- Hash value of first parent changeset
- Hash value of second parent
The manifest hash is computed in a similar fashion by giving the filelog hashes as inputs to SHA-1. The filelog hashes themselves originate from the actual file content. This design ensures that a byte in a file affects the filelog hash, which in turn affects the manifest hash, which in turn affects the changeset hash. Put differently: because it is infeasible to find collisions for the hash function, the changeset hash fixes the manifest hash. This means that if you are given the changeset hash from a trusted source, then you can download everything else from an untrusted server and afterwards verify the integrity of the data. You know that it is infeasible to find two manifest hash values that gives the same changeset hash, so when you have retrieved one manifest hash from the untrusted server, then you can be sure it's the correct value when it gives the correct changeset hash. Similarly for the filelogs: if they hash to the correct manifest hash, then they must be correct with overwhelming probability.
Because the changeset hash include the hash values of the parent changesets, the entire history of the repository is included in the hash value! Given a trustworthy changeset hash, it can be verified that the parent hash values are trustworthy. The integrity of the entire history graph can be verified by walking backwards from the tip-most changeset to the root in this recursive fashion.
Using a cryptographic hash function to verify the history like this was pioneered by the Monotone distributed revision control system and has since then been used in both Mercurial and Git.
There is a growing community around Mercurial that continuously add new features and fix bugs. We will focus on two features below which will hopefully soon be included in Mercurial.
A Mercurial repository can grow large over time as changesets are added and as more and more files are put under version control. While the technological advances in harddisk sizes and bandwidth will be able to keep up with the expansion to some extend, it would often be nice if one could work on only a subset of the files or a subset of the history. The first feature is called "narrow clones" since they only clone a narrow subtree of the full project tree, and the second feature is called "shallow clones" since they only clone a shallow part of the full changeset graph.
Mercurial participates in Google's Summer of Code program. The program was started in 2005. It is a great initiative that aims to motivating university students to join an open source project of their choice and be paid to do so. The students receive $5,000 upon succesful completion of the project. When studying Computer Science, joining an open source project is a great way to gain real world experience in software development. However, there are still many students that miss out on this opportunity and the Summer of Code program aims to change this. Over 2,500 students have already successfully completed the program.
Three students will spend their summer implementing new features for Mercurial in this program. In particular, a student will work on implementing shallow clones. Shallow clones are often requested by organizations that have converted a very old CVS repository to Mercurial — say with 10 or 20 years of history. For their day to day development, they don't need access to the ancient versions of the code, and so a shallow clone would be a nice way to save on the bandwidth and disk space needed.
The way history is stored by Mercurial lends itself well to shallow clones. The history for each file is stored sequentially in a single revlog, and by collapsing the first N revisions of a revlog into a single snapshot, one can effectively exclude the first N revisions. This snapshot will naturally have the wrong SHA-1 hash value, so Mercurial will have to be taught to ignore this for shallow clones.
The main difficulty in implementing shallow clones is in the handling of merges. If the changeset graph is simply cut off at a certain depth below the tip, then you may lose important information needed for correct merges. This could be information about renamed files: if old.txt is renamed to new.txt in a changeset that is not part of the shallow clone, then Mercurial will see the two files as independent. Merging will therefore give a different result in a shallow clone compared to a full clone.
This is of course not acceptable, so the current proposal for shallow clones requires that there must be a single root for a shallow clone. This means that if you want to make a shallow clone that involves changesets on two different branches, then you must extend the shallow clone back to the first common ancestor changeset between the two braches. This not a very limiting requirement since most projects have regular points in time where all branches are merged into a single changeset, e.g., when a release is made.
When a shallow clone contains a single root changeset only, then merges will be identical to merges in full clones, which is what we want for consistency and to make the mental model simple for the users. When the summer is over, then it will be known if the student succeeded with the project.
Dealing with Big Files
One area where distributed version control systems have a disadvantage compared with centralized systems is in handling of very large files. Large files tend to be compressed already (think of images or movies), which means that they cannot be compressed any further. To make things worse, even a small change to a binary file can lead to a very different file on disk, so the deltas between each revision will also tend to be large.
When the entire history is stored on each client, a big file will keep taking up space everywhere and initial clones will take longer because they need to transport more data. In a centralized system, you can simply delete a big file again if you added it by mistake and it will then only be the server that has to keep a copy of the file. After the file has been deleted, a checkout will again be fast since the clients only retrieve the last revision from the server.
Greg Ward, a long time Python developer, has written an extension for Mercurial called bfiles. It combines the strengths of both centralized and distributed version control by storing the big files in a central location, while letting you work with all the smaller files like normal. When you retrieve a clone of your Mercurial repository, you only retrieve the small files plus a number of "stand-in files". These stand-in files are small files that reference the big files by the SHA-1 hash value of the big file. Using the stand-in files, Mercurial can now retrieve the needed big files. This saves you from downloading the entire history of each big file — you only get what you need. All the revisions of the big files are still stored on the server, but since storage space is cheap, it is no problem to add more disks to a central server. This design gives you the best of both worlds: you get the full history for your smaller files so that you can still search through them with hg grep and you can still compare them with hg diff, and you save disk space on the clients and bandwidth when making a clone.
The extension currently requires the use of a number of special commands such as hg bfupdate after a hg update to actually update the big files. When complete, the extension will allow full integration in the normal Mercurial workflow in order to make the use as seamless as possible.
Developers are smart people — if you give them a slow or tedious tool to work with, their natural reaction is to use it as little as possible. This means making bigger commits in order to not have the tool talk too often to the server. Mercurial solves this problem in the most elementary way possible: by distributing the server to all the clients. Making a commit is now a fast operation and with the history at hand, developers can begin using the history to their benefit when tracking down bugs. Being smart people, developers will see that it now pays off to make more fine-grained commits, each with a small logical unit of work.
Since Mercurial is a distributed tool, it must be good at merging branches. Developers will be merging daily and they will thus also become very familiar with it. When dealing with longer lived branches for new features, it is also important to merge often. Mercurial remembers what you have already merged and you can therefore do repeated merges of the main branch into your feature branch. This is the key to making merging manageable: do it often in order to keep the merge conflicts that you must deal with small. Mercurial is not magic, but it helps you as much as possible by keeping track of things and by ensuring that the conflicts you get are genuine.
There is a very friendly community around Mercurial. Everybody is welcome and encouraged to join the mailinglist or to come online in the #mercurial, our IRC channel at irc.freenode.net for a direct chat with the developers and other users.
The author would like to thank Greg Ward and Benoit Boissinot for their helpful comments.