Cloning via hardlinking/overlays (was Re: Clonning only from version and diff before pulls)

Stephen Darnell sdarnell at esmertec.com
Thu Aug 18 07:56:15 CDT 2005


> Andrew Thompson wrote:
> > Quoting Daniel Santa Cruz <byteshack at gmail.com>:
> > 
> > One of the features of Hg is cloning by hardlinking. If your
> > upstream is on the same filesystem as your working copy, you only
> > use space for your actual current working copy, and the .hg/ files
> > that have been changed by local commits.

Kevin Smith <yarcs at qualitycode.com> replied: 
> *And* if your operating system and file system support hardlinking.
> For MS Windows developers, that's rarely true.

I'm not sure that assertion is necessarily true.  I would say that
NTFS is becoming, if not already, the predominant filesystem, although
FAT32 has extended the life of FAT for main disks.

Half the problem for Python is that the win32 extensions do not appear
to be supported on the vanilla python.org Python (ActiveState is OK).
I would have thought that the core Python could include support for
os.link and a proper link count without too much difficulty.

I have some (probably slightly bit-rotted by now) changes that add
hardlink support for Windows (they handle failure, vanilla Python,
etc.), but there didn't seem to be much interest, and Matt would like
some more exposure before accepting the change. Anyone?
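
Roughly, the idea is something along these lines (an illustrative
sketch only, not the actual patch - the helper name and fallback
policy are invented here; it just shows os.link where the platform
provides it, win32file.CreateHardLink from the pywin32 extensions on
Windows, and a plain copy as a last resort; the "proper link count"
part is a separate matter and not shown):

    import os
    import shutil

    try:
        import win32file, pywintypes   # pywin32; missing on vanilla python.org builds
    except ImportError:
        win32file = None

    def link_or_copy(src, dst):
        # Hypothetical helper: hard link where possible, copy otherwise.
        if hasattr(os, 'link'):
            try:
                os.link(src, dst)
                return
            except OSError:
                pass                   # unsupported fs, cross-device link, ...
        if win32file is not None:
            try:
                win32file.CreateHardLink(dst, src)   # new name first, existing second
                return
            except pywintypes.error:
                pass
        shutil.copy2(src, dst)         # last resort: plain copy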

> The whole paradigm of mercurial (and other repo-inside-working-tree
> SCM tools) is that clones are cheap. When they are not, the tool
> becomes less effective. This has become a significant concern to me,
> because my MS Windows-using teammates will resist tools that make
> them second-class citizens.

When benchmarking the Windows version of hardlinking, I found that on
smallish repos the time is dominated by scanning the filesystem and
creating the directory entries, etc. - I'd expect Unixes to be better.
However, the space saving is obviously the big win.

Another solution that has a similar effect is to overlay rather than
link.  That is, if B is an overlay-clone of A, when looking in B for
a file X, it tries B/X first, then A/X.  I'm not sure if this would
make most file operations that much slower (working dir operations
would be unaffected).  But the time to create a clone would be
VERY quick.
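
To make the lookup rule concrete, here is an illustrative sketch (the
helper names and the .hg/overlay parent-pointer file are invented for
the example - nothing like this exists today):

    import os

    def read_overlay_parent(repo):
        # Hypothetical: return the path of the overlay parent, or None.
        marker = os.path.join(repo, '.hg', 'overlay')
        if os.path.exists(marker):
            return open(marker).read().strip()
        return None

    def overlay_read(repo, path):
        # Look for .hg/<path> in this repo first, then in its overlay
        # parent (reads only; writes would always go to this repo).
        candidate = os.path.join(repo, '.hg', path)
        if os.path.exists(candidate):
            return open(candidate, 'rb').read()
        parent = read_overlay_parent(repo)
        if parent is not None:
            return open(os.path.join(parent, '.hg', path), 'rb').read()
        raise IOError('%s not found in %s or its overlay parent'
                      % (path, repo))

Reads only pay an extra existence check when the file isn't in the
clone, and writes always land in the clone, which is why creating the
clone itself is almost free.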

This would work across all filesystems, local and remote, but has
one big drawback - you can't discard repos with a simple rm -rf
if they contain data needed by an overlay-clone.
Hardlinking has the benefit that all clones are equal, so sticking
one through a shredder does not affect the others.  Most people
seem to have a couple of main repos and a collection of related
work repos, so this may not be too awkward if used selectively.

Another drawback of the overlay idea is that it requires more care
in the code, although if done correctly it should only complicate a
few places (if hidden behind the repo abstraction).

A minor benefit is that the size overheads of clones would be
smaller still.

On a related note, if the working copy and the history were
separated (rather than kept in a .hg subdir), it would be less likely
that you would accidentally delete repo metadata, and it could
potentially allow more data sharing.  But I can certainly see the
convenience of a self-contained, discardable directory.

Regards,
 Stephen


