Note:

This page is primarily intended for developers of Mercurial.

'fncache2' repo format

This is a proposed evolution of the fncache format that addresses some of its shortcomings and complexity.

1. Some background: how fncache works, and what's tricky

The fncache format uses two encoding mechanisms to safely and portably handle metadata names.

Basic encoding is quite easy to implement efficiently, but hashed encoding is complex and difficult to understand.

It's not immediately easy to see from the Python implementation of the hashed encoder, but a major difficulty with the scheme is that it *must* be implemented using multiple passes, extra code paths, and copies, due to its data dependencies and differences from basic encoding.

  1. The SHA-1 hash must be computed over the dir-encoded name, which means that we have to dir-encode a path before we do anything else.
  2. We need a separate code path specially for dir-encoding.
  3. The lower case encoding scheme is different than the one used for basic encoding, so it too must have a separate implementation.
  4. We then aux-encode the lower-encoded text.
  5. Once that's done, there's an additional very complicated copy-and-fixup step (named "hashmangle" in the patch that implements this algorithm in C). This may truncate and tweak every path component, and it then has to do lots of further surgery to glue together all the parts together with the SHA-1 hash and the original suffix.

We have five passes over the name here: direncode, hash, lowerencode, auxencode, mangle. When implemented in C with an eye on efficiency, the final "mangle" step is particularly tricky.

2. A somewhat simpler hashing scheme

We retain the existing basic encoding scheme. For longer names:

  1. Compute the hash (presumably still SHA-1) of the original pathname, probably without its extension (otherwise ".i" and ".d" files will hash differently, and will be less likely to be laid out contiguously on disk.)
  2. Basic-encode the original pathname, using the encoding code we already have. Either stop or truncate at 200 bytes.
  3. Truncate each path component and fix up any dangling spaces or dots that arise.
  4. Append the hash to the end of the result of step 3.
  5. Tack the original extension (".i" or ".d") back onto the end.

This gives us three passes: hash, basic encode, fix up; and then 42 bytes of extra work to append the hash and extension.

3. An even simpler hashing scheme

_maxstorepathlen = 120
_winres3 = ('aux', 'con', 'prn', 'nul'
_winres4 = ('com', 'lpt')

_encchar = ("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
            "~!~#$%&'()~+,-~~0123456789~;~=~~"
            "@abcdefghijklmnopqrstuvwxyz[~]^_"
            "`abcdefghijklmnopqrstuvwxyz{~}~~"
            "~abcdefghijklmnopqrstuvwxyz{~}~~"
            "~!~#$%&'()~+,-~~0123456789~;~=~~"
            "@abcdefghijklmnopqrstuvwxyz[~]^_"
            "`abcdefghijklmnopqrstuvwxyz{~}~~")

def _foldencode(f): # preserves size
    f = ''.join([_encchar[ord(c)] for c in f])
    l = len(f)
    if l == 3 and f in _winres3:
        f = f[:2] + '~'
    if (l == 4 and f[3] <= '9' and f[3] >= '0'
               and f[:3] in _winres4):
        f = f[:3] + '~'
    return f

def _cutdirs(path):
    spaceleft = _maxstorepathlen - 45
    parts = []
    for s in path.split('/'):
        if len(s) > 8:
            s = s[:8]
        if parts:
            spaceleft -= 1 # for '/'
        spaceleft -= len(s)
        if spaceleft < 0:
            break
        parts.append(s)
    return '/'.join(map(_foldencode, parts))

def _hybridencode2(f):
    ef = _pathencode(f)
    if ef is None:
        f = f[5:] # cut off "data/"
        prefix, suffix = f[:-2], f[-2:]
        digest = _sha(prefix).hexdigest()
        ef = 'dh/' + _cutdirs(prefix) + digest + suffix
    return ef

The cutdirs function in C

static const Py_ssize_t maxstorepathlen = 120;

static const char encchar[256] =
        "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
        "~!~#$%&'()~+,-~~0123456789~;~=~~"
        "@abcdefghijklmnopqrstuvwxyz[~]^_"
        "`abcdefghijklmnopqrstuvwxyz{~}~~"
        "~abcdefghijklmnopqrstuvwxyz{~}~~"
        "~!~#$%&'()~+,-~~0123456789~;~=~~"
        "@abcdefghijklmnopqrstuvwxyz[~]^_"
        "`abcdefghijklmnopqrstuvwxyz{~}~~";

/* this encoding folds */
static inline char encodechar(char c)
{
        return encchar[0xff & c];
}

static Py_ssize_t _cutdirs(char *dest, Py_ssize_t destlen, size_t destsize,
                           const char *src, Py_ssize_t len)
{
        Py_ssize_t i = 0, spaceleft = maxstorepathlen - 45 + 1;
        char seg[8];
        int seglen = 0;
        uint32_t cmp;

        while (i < len && spaceleft > 0) {
                if (src[i] == '/' || src[i] == '\0') {
                        if (seglen != 0) {
                                if (seglen == 3) {
                                        cmp = seg[0] << 16 | seg[1] << 8 | seg[2];
                                        if (   cmp == 0x617578 /* aux */
                                            || cmp == 0x636f6e /* con */
                                            || cmp == 0x70726e /* prn */
                                            || cmp == 0x6e756c /* nul */)
                                                seg[2] = '~';
                                }
                                else if (seglen == 4 && seg[3] <= '9'
                                                     && seg[3] >= '0') {
                                        cmp = seg[0] << 16 | seg[1] << 8 | seg[2];
                                        if (   cmp == 0x636f6d /* com0..9 */
                                            || cmp == 0x6c7074 /* lpt0..9 */)
                                                seg[3] = '~';
                                }
                                memcopy(dest, &destlen, destsize, &seg, seglen);
                                seglen = 0;
                        }
                        charcopy(dest, &destlen, destsize, src[i++]);
                        spaceleft--;
                }
                else if (seglen == sizeof(seg)) {
                        i++;
                }
                else {
                        seg[seglen++] = encodechar(src[i++]);
                        spaceleft--;
                }
        }

        return destlen;
}

Patches: http://selenic.com/pipermail/mercurial-devel/2012-September/044718.html


CategoryDeveloper

fncache2RepoFormat (last edited 2012-10-04 14:22:08 by abuehl)