User metadata support

Sun Nov 19 13:54:51 CST 2006

Hi all,

I'm rather new to Mercurial, but I have used a lot of different SCMs before.

In many respects, Mercurial looks like the "right thing" to me.

I tried out Monotone some time ago, but it had just too many
shortcomings to be useful - most of which Mercurial managed to avoid.

Actually, there is only one big thing still missing (apart from symlink
support) which would make everyone happy: User metadata support.

By user metadata, I mean something like the "properties" of Subversion.

In essence, it's just a version-controlled key/value list associated
with each file.

You might think nobody actually needs such a thing?

Let's illustrate a few cases where such metadata would be highly useful:

* Additional permission bits. Currently, Mercurial supports the
executable bit right out of the box. Fine. But what if more permission
bits should be associated with a file, such as the sticky bit? Or if a
file should be created read-only? Or with a special POSIX ACL? If hooks
for checkin/checkout had access to file metadata, they could set the
appropriate bits on checkout as required, without any need to
integrate such features into the Mercurial core.

* Actually, even the executable bit would no longer need to be supported
directly by the Mercurial core: Hooks could take over that job too,
provided they have access to metadata items such as a property named
"hg:executable".

* Line-ending conversion. While I agree that line-ending conversions
should normally be performed based on heuristics because users tend to 
forget about setting special properties, there are exceptions. What if a 
file with extension .txt is a texture in some project subdirectory 
rather than a text file like in the rest of the project? If the 
autodetection heuristics for binary files fail, we'll be screwed as
soon as line-ending conversion is attempted on that file. Using a
property such as "hg:eol-style" set to "binary" would let a hook script 
override autodetection in such cases.

* Character set conversion. What if a single directory contains text
files in different character set encodings? Just think of text files on
a Windows machine which should also be editable on a UTF-8 Linux
workstation: On the Windows side, most files will be using the "ANSI"
character set (in fact WINDOWS-1252 because of the Euro symbol), but
some files intended to be used by the Console are instead represented 
using the "OEM" character set (IBM CP 437 or CP 850). Using the same 
conversion for all text files cannot work in this case. And they all 
share the same filename extension. It is necessary to override the 
conversions on a per-file basis. Metadata properties would allow the 
hook to also take care of this.

* Stream metadata. Machines like the Apple Macintosh can use different
streams in a file, the so-called "data fork" and "resource fork". Think
of it as a kind of sub-file. Same for NTFS which also supports streams.
Each stream of a file has the potential to contain data a user would 
like to be subject to version control. The "normal file contents" are 
just the contents of the default stream of each file. But the best is 
yet to come:

* Symlinks could be implemented using hooks having access to stream
metadata! In this case, a symlink would be treated as a stream with a
reserved name of a file which has no default stream (and thus no normal 
file data) at all. That means that no file will be created on checkout.
But the checkout hooks (which will be run for all streams, not
just for file data contents) can check the stream type and create a symlink.

* Any kind of additional information to be attached to files/streams,
such as MIME types etc. The hooks can make use of this information if
required.

* Directory attributes! Directories can also have metadata streams,
storing version-controlled metadata about the directories, such as
"hg:ignore".

* Tracking even empty directories and renaming or moving of directories. 
If we assign an (empty) dummy stream such as "hg:dir" to each directory, 
we can deduce the existence of a directory from the mere existence of 
that stream. Which means we won't need things like dummy ".keep" files any more.

You see, metadata support would in fact be exceptionally useful for
everyone, and would even make implementation of some things easier.

For instance, you could forget about the executable bit or symlinks in
the core Mercurial project, and delegate such problems to the hook scripts.
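
To make the idea of delegating such things to hooks a bit more concrete, here is a rough Python sketch of what a checkout hook could do if it were handed the properties of a file. Everything about the interface is an assumption made for illustration: the function name apply_properties, the dict of properties, and the property name "hg:readonly" are made up, and Mercurial currently passes no such information to hooks.

```python
import os
import stat


def apply_properties(path, props):
    """Apply version-controlled properties to an already checked-out file.

    `props` maps property names to values. The interface and the
    "hg:readonly" name are hypothetical; "hg:executable" is the
    property name suggested in this post.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if "hg:executable" in props:
        # Only the presence of the property matters, not its value.
        # Grant execute permission wherever read permission exists.
        mode |= (mode & 0o444) >> 2
    if "hg:readonly" in props:
        # Drop all write bits.
        mode &= ~0o222
    os.chmod(path, mode)
    return mode
```

A hook would call something like apply_properties("somedir/somefile", {"hg:executable": ""}) once per checked-out file. The permission handling shown here is POSIX-specific.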

Of course, metadata support should be implemented in a way that requires
the least changes to the existing implementation.

First, what is actually needed:

* Streams are versioned rather than files. It's pretty much the same
from the perspective of a revlog, but we have to add an additional
directory level below what is currently the leaf level.

For instance, instead of having a

.hg/data/somedir/somefile.d

we then could have a

.hg/data/somedir/somefile/hg_data.d

which means: This represents a stream named "hg:data" of the
version-controlled object "somedir/somefile". We say "object" here rather than
"file", because that object could be a symlink as well - depending on 
which stream properties it has.

In this case, it is a file, because it has a "hg:data" property (which 
contains the actual file contents).

But it could have additional properties as well:

.hg/data/somedir/somefile/hg_executable.d

could indicate the fact that file "somedir/somefile" has an additional 
property "hg:executable", which means its executable bit should be set 
on checkout.

"hg:executable" is also an example of a "switch"-style property: Its
mere existence indicates something; the actual revlog contents will
typically be an empty file because all that matters is whether this
property is there or not.

You can also see: Streams and Properties are pretty much the same in 
this model - from the viewpoint of the revlog they are just more files 
to be version-controlled.

It is only their *interpretation* as streams/properties which makes them
special.

Another example: In order to save a symlink instead of a file using the
same name as above, we could use a property-revlog like this:

.hg/data/somedir/somefile2/hg_symlink.d

which represents a stream with name "hg:symlink", and the contents of
the current revision of that revlog contain the symlink target.

And now how to store a "hg:ignore" property for directory
"somedir/somesubdir":

.hg/data/somedir/somesubdir/hg_ignore.d

So, the most important thing to be changed for implementing that feature 
is to add an additional subdirectory level at the leaves of the 
version-controlled directory tree that is omitted when checking out a 
revision, but available to the hooks.
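
The layout described above can be summed up in a few lines of Python. This is only a sketch: the function name stream_revlog_path is made up, the store root ".hg/data" is passed as a parameter, and only the simple case of a single colon in the stream name is handled.

```python
import posixpath


def stream_revlog_path(obj, stream="hg:data", store=".hg/data"):
    """Return the revlog data file for one stream of one object.

    The object path becomes a directory inside the store, and the
    stream name (colon replaced by an underscore) becomes the leaf.
    """
    return posixpath.join(store, obj, stream.replace(":", "_") + ".d")


print(stream_revlog_path("somedir/somefile"))
print(stream_revlog_path("somedir/somefile", "hg:executable"))
print(stream_revlog_path("somedir/somefile2", "hg:symlink"))
print(stream_revlog_path("somedir/somesubdir", "hg:ignore"))
```

On checkout, the working directory would only ever see the object path; the extra leaf level stays inside the store.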

In the output of "hg manifest" the streams could be displayed using the 
-v option, while the "hg:data" stream should be suppressed from output
in the normal case (because it is the default).

For instance,

$ hg manifest

hexstuff... somedir/somefile
hexstuff... somedir/somefile2 [hg:symlink]
hexstuff... somedir/somesubdir [hg:ignore]

$ hg manifest -v

hexstuff... somedir/somefile [hg:data]
hexstuff... somedir/somefile2 [hg:symlink]
hexstuff... somedir/somesubdir [hg:ignore]

Of course, streams could also be displayed in any other way;
it's just an example.
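
For what it's worth, the display rule sketched above is tiny. The following Python sketch assumes that the manifest is available as a list of (node, object, stream) tuples; that representation and the name format_manifest are assumptions, not anything Mercurial provides.

```python
def format_manifest(entries, verbose=False, default_stream="hg:data"):
    """Render manifest lines, hiding the default stream unless verbose."""
    lines = []
    for node, obj, stream in entries:
        if stream == default_stream and not verbose:
            lines.append(f"{node} {obj}")
        else:
            lines.append(f"{node} {obj} [{stream}]")
    return lines


entries = [
    ("hexstuff...", "somedir/somefile", "hg:data"),
    ("hexstuff...", "somedir/somefile2", "hg:symlink"),
    ("hexstuff...", "somedir/somesubdir", "hg:ignore"),
]
print("\n".join(format_manifest(entries)))
print("\n".join(format_manifest(entries, verbose=True)))
```

An object with several streams simply yields one line per stream in this sketch; how that should really look is left open.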

The internal manifest format would seem to need an update as well:

$ hg debugdata .hg/00manifest.d 0

somedir/somefile/hg_data<hexstuff>
somedir/somefile2/hg_symlink<hexstuff>
somedir/somesubdir/hg_ignore<hexstuff>

So, actually, the format does NOT need to be changed: the manifest simply
lists the *uninterpreted* contents of the .hg/data directory, including the
leaf revlogs, which always represent streams.

To be more precise: *All* the revlogs now represent streams! Because the 
actual files or directories or symlinks will be represented by 
subdirectories now which contain the stream revlogs. And whether such a 
directory will be interpreted as the name of a version-controlled file, 
directory, symlink, fifo, device file or something different depends 
solely on which stream revlogs exist in that directory.
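
As a rough illustration of that interpretation step, here is a Python sketch that groups stream revlog names by object and then decides what kind of object each one is. The function names, the precedence of "hg_symlink" over "hg_data", and the fallback to "directory" for everything else are assumptions; the input is the list of names from the manifest, already separated from the node ids.

```python
from collections import defaultdict


def group_streams(stream_paths):
    """Map each object path to the set of its stream revlog names."""
    objects = defaultdict(set)
    for path in stream_paths:
        # The last path component is the stream, the rest is the object.
        obj, _, stream_file = path.rpartition("/")
        objects[obj].add(stream_file)
    return dict(objects)


def classify(stream_files):
    """Guess the object type from the streams it has (simplified)."""
    if "hg_symlink" in stream_files:
        return "symlink"
    if "hg_data" in stream_files:
        return "file"
    # Assumption: anything without data or symlink streams is a directory.
    return "directory"


names = [
    "somedir/somefile/hg_data",
    "somedir/somefile2/hg_symlink",
    "somedir/somesubdir/hg_ignore",
]
for obj, streams in sorted(group_streams(names).items()):
    print(obj, classify(streams))
```

Fifos, device files and the like would need reserved stream names of their own, which are not defined here.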

And to make the best of properties, there should be a means of
*inheriting* them from the parent directory, possibly overriding them in
nested subdirectories. But that's worth its own thread I think.
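
Just to show that inheritance would not have to be complicated, here is one possible lookup rule as a Python sketch: take the property from the object itself if it is set there, otherwise from the nearest parent directory that sets it. The dict-based representation, the name resolve_property and the value "native" are assumptions for illustration only.

```python
import posixpath


def resolve_property(props_by_path, path, name):
    """Look up `name` on `path`, falling back to its parent directories.

    `props_by_path` maps repository-relative paths ("" is the root)
    to dicts of properties. Returns None if nothing sets the property.
    """
    current = path
    while True:
        props = props_by_path.get(current, {})
        if name in props:
            return props[name]
        parent = posixpath.dirname(current)
        if parent == current:
            # Reached the top without finding the property.
            return None
        current = parent


props = {
    "": {"hg:eol-style": "native"},
    "somedir/textures": {"hg:eol-style": "binary"},
}
print(resolve_property(props, "somedir/readme.txt", "hg:eol-style"))
print(resolve_property(props, "somedir/textures/wall.txt", "hg:eol-style"))
```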

Regarding stream names, it might be wise to enforce a naming policy to 
avoid name clashes with user-defined properties.

A suggestion of such a policy:

* All property/stream names optionally start with a namespace prefix, 
followed by a colon, and then an identifier. For instance, in "hg:data", 
"hg" is the name of the namespace, and "data" is the namespace-relative 
name of the stream.

* Namespace "hg" is reserved for Mercurial's "well-known" or specially 
interpreted streams. For instance, while "hg:executable" could be a 
user-defined property as well which is only of interest for user-defined 
hooks, "hg:data" is clearly of essential interest for the internal 
checkout and checkin functions of Mercurial.

* Namespace "urn" is reserved for property names which conform to the
URN syntax, i. e. globally unique and *persistent* identifiers.
(Persistence is also the big difference between a URL and a URN. URLs
cannot truly be considered persistent: Domains come into existence
and go away all the time.) For instance, there is a "urn:uuid" scheme
which allows creating URNs based on UUIDs for those who like this. But
numerous other schemes exist as well.

* Namespace "rdn" is reserved for the "reversed domain name" identifiers
which are so popular in Java (or Monotone). This allows property names
such as "rdn:com.sun.java/bigproject/specialstream". However, as stated
in the previous paragraph, DNS names might not be the best choice to
guarantee uniqueness of a name - at least not over time. URNs, in
contrast, will do (if an appropriate URN scheme is chosen, such as
"urn:uuid").

* All other names with or without a namespace prefix are free to be used 
by users in any way they like.

However, there is a problem here: The "urn" and "rdn" namespaces allow
slashes, colons and other characters that are better avoided in
filenames, especially under Windows.

So I suggest a simple name mapping strategy:

* We add a pseudo-namespace "b32" which encodes whatever follows it in 
BASE-32 encoding.

* That pseudo-namespace will only be recognized at the beginning of a
stream name and applies to everything that follows it.

* Colon characters are mapped to underscores.

Here are some examples of stream names and the mapped revlog file names 
which will represent them:

"hg:data" -> "hg_data.d"
"hg:symlink" -> "hg_symlink.d"
"plain" -> "plain.d"
"usernamespace:whatever" -> "usernamespace_whatever.d"
"funny_name" -> "b32_<base32stuff>.d"
"urn:uuid:11223344-5566-3353-aabbccddeeff" -> "b32_<base32stuff>.d"

In those examples, <base32stuff> is a placeholder for the BASE-32 
encoding of the string on the left side.
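
Here is a minimal Python sketch of such a mapping, including the way back. The rules above do not say exactly when the "b32" form has to be used, so the sketch makes an assumption: a name is mapped directly only if it consists of lowercase letters and digits with at most one colon; everything else (underscores, several colons, uppercase letters, a literal "b32:" prefix) gets BASE-32 encoded. Stripping the "=" padding and the function names are assumptions as well.

```python
import base64
import re

# Names that can be mapped directly: [namespace:]identifier,
# lowercase letters and digits only (an assumption, see above).
_SIMPLE = re.compile(r"(?:[a-z0-9]+:)?[a-z0-9]+")


def encode_stream_name(name):
    """Map a stream name to the file name of its revlog data file."""
    if _SIMPLE.fullmatch(name) and not name.startswith("b32:"):
        return name.replace(":", "_") + ".d"
    b32 = base64.b32encode(name.encode("utf-8")).decode("ascii")
    return "b32_" + b32.rstrip("=") + ".d"


def decode_stream_name(filename):
    """Inverse of encode_stream_name()."""
    stem = filename[: -len(".d")]
    if stem.startswith("b32_"):
        b32 = stem[len("b32_"):]
        b32 += "=" * (-len(b32) % 8)  # restore the stripped padding
        return base64.b32decode(b32).decode("utf-8")
    return stem.replace("_", ":")


for name in ("hg:data", "plain", "funny_name",
             "urn:uuid:11223344-5566-3353-aabbccddeeff"):
    print(name, "->", encode_stream_name(name))
```

Encoding "funny_name" rather than mapping it directly keeps the underscore unambiguous, since a plain underscore in a file name always stands for a colon.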

Why BASE-32 instead of BASE-64, one might ask?

Because BASE-32 does not use both upper- and lower case characters in
the encoding it generates, which eliminates problems with filesystems
that do not preserve letter case in file or directory names. (See the
RFC about BASE-32 encoding for more details.)
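
The difference is easy to demonstrate with Python's standard base64 module. The two byte strings below are picked by hand so that their BASE-64 encodings differ only in letter case; the helper name differ_only_by_case is made up.

```python
import base64


def differ_only_by_case(x, y):
    """True if two distinct strings would clash when case is ignored."""
    return x != y and x.lower() == y.lower()


a, b = b"\x00\x00\x00", b"\x68\x00\x00"

b64_a = base64.b64encode(a).decode("ascii")  # "AAAA"
b64_b = base64.b64encode(b).decode("ascii")  # "aAAA"
b32_a = base64.b32encode(a).decode("ascii")
b32_b = base64.b32encode(b).decode("ascii")

print(b64_a, b64_b, differ_only_by_case(b64_a, b64_b))
print(b32_a, b32_b, differ_only_by_case(b32_a, b32_b))
```

With BASE-64, the two names would refer to the same file on a filesystem that ignores letter case; with BASE-32 they stay distinct.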

Anyway, all the above is a mere suggestion to show how streams could be 
implemented; I'll be happy to keep it open to discussion.

But I would really be happy to see properties supported by Mercurial 
some day, which will also be the day I convert my SVK repositories into 
Mercurial!

Currently I cannot use Mercurial because I have lots of symlinks under 
version control in SVK.

I am using SVK because it is still the best distributed SCM I have 
encountered so far: It has (nearly) all the features of Subversion, but 
adds fully disconnected (off-line) operation.

SVK also has some disadvantages. The biggest is its lack of concise
documentation and its largely opaque operation.
It's very obscure. *And* written in Perl. ;-)

Mercurial clearly excels here: All basic data structures (i. e. the
revlog) are well defined and the interconnections between the components
of the data structures (revlog, nodeid, manifest, etc) are nicely
explained. This is how it should be.

In SVK I do not even fully understand the options and operation modes of 
its 3 merge commands... especially in disconnected or mirrored operation.

However, it works.

Somehow.

But I would really prefer Mercurial - if only it could support
properties like symlinks and character conversion attributes.

Greetings from Vienna,
Guenther