[PATCH] introduce filenamelog repository layout

Fri Jul 11 16:36:31 CDT 2008

On 11.07.2008 20:21, Matt Mackall wrote:
> On Fri, 2008-07-11 at 19:10 +0200, Adrian Buehlmann wrote:
>> # HG changeset patch
>> # User Adrian Buehlmann <adrian at cadifra.com>
>> # Date 1215795701 -7200
>> # Node ID 4c44bdd7f45f62a21feaab6e41a44dd8e8ec9151
>> # Parent  2134d6c09432e4e3dbee18d93ec9242a332f7cdc
>> introduce filenamelog repository layout
>>
>> * adds a new entry 'filenamelog' to .hg/requires for new repos
>> * writes new file .hg/store/filenamelog
> 
> What's the format?
> 

Very simple. Please read the code of the new filenamelog.py.

Entries are \n separated. Each entry consists of two \0 separated
paths. The first is essentially the name used by filelog
(see filelog.encodedir), escaped by filenamelog.fnlogencode to mask
things like zero bytes. The second is the first one
encoded by util.fnlogencode, i.e. the path of the filelog
as stored on disk.

The second one could be omitted as it can be calculated from
the first one. I haven't done so for two reasons:

* humans can look into filenamelog and see the encoded name
* streamclone is faster, as it doesn't need to call
  util.fnlogencode again

The drawback is that this needs a bit more disk space. But of course
that would be easy to change if you want.

See filenamelog.packentry in the code for the details.

Most relevant part for this question is:

+HEADER_PREFIX = 'Mercurial filenamelog'
+FORMAT_DEFAULT_VERSION = '1'
+FORMAT_DEFAULT_FLAGS = '0'    # unused
+
+def header():
+    parts = (HEADER_PREFIX, FORMAT_DEFAULT_VERSION,
+             FORMAT_DEFAULT_FLAGS, '\n')
+    return '\0'.join(parts)
+
+def checkheader(parts, logname):
+    if parts[0] != HEADER_PREFIX:
+        abort(_('invalid header'), logname)
+    if parts[1] != FORMAT_DEFAULT_VERSION:
+        abort(_('unsupported format'), logname)
+
+def fnlogencode(s):
+    '''escape \0 and \n using \x01'''
+    return (s.replace('\x01', '\x011')
+             .replace('\n',   '\x01n')
+             .replace('\x00', '\x010'))
+
+def fnlogdecode(s):
+    return (s.replace('\x010', '\x00')
+             .replace('\x01n', '\n'  )
+             .replace('\x011', '\x01'))
+
+def packentry(path, encodedpath):
+    return '\0'.join( (fnlogencode(path), encodedpath, '\n') )
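
To make the layout concrete, here is a small sketch in Python 3 (the
patch itself targets Python 2) that packs one entry with the functions
quoted above and splits it again. parseentry is a hypothetical helper
for illustration, not part of the patch, and the sample paths are made
up. Note that because '\n' is the last element handed to join, each
entry (like the header) ends in a \0 right before the newline.

```python
# fnlogencode, fnlogdecode and packentry are copied from the patch excerpt.
def fnlogencode(s):
    # escape \0 and \n using \x01
    return (s.replace('\x01', '\x011')
             .replace('\n', '\x01n')
             .replace('\x00', '\x010'))

def fnlogdecode(s):
    return (s.replace('\x010', '\x00')
             .replace('\x01n', '\n')
             .replace('\x011', '\x01'))

def packentry(path, encodedpath):
    return '\0'.join((fnlogencode(path), encodedpath, '\n'))

def parseentry(line):
    # Hypothetical reader: drop the newline, split on \0 and ignore the
    # empty trailing field left by the final \0.
    fields = line.rstrip('\n').split('\0')
    return fnlogdecode(fields[0]), fields[1]

# Made-up sample: a file name containing a newline.
entry = packentry('data/foo\nbar.i', 'df/foo~0abar.i')
print(repr(entry))
print(parseentry(entry))
```

The escaping keeps raw \n and \0 out of the first field, so splitting
on those two bytes is safe.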

>> * hash-encodes filenames with long paths (issue839)
> 
> What's the format?
>
>> * encodes Windows reserved filenames (issue793)
> 
> What's the format?

This can be seen in util.fnlogencode.

+_windows_reserved_filenames = '''con prn aux nul
+    com1 com2 com3 com4 com5 com6 com7 com8 com9
+    lpt1 lpt2 lpt3 lpt4 lpt5 lpt6 lpt7 lpt8 lpt9'''.split()
+def auxencode(path):
+    res = []
+    for n in path.split('/'):
+        if n:
+            base = n.split('.')[0]
+            if base and (base in _windows_reserved_filenames):
+                # encode third letter ('aux' -> 'au~78')
+                ec = "~%02x" % ord(n[2])
+                n = n[0:2] + ec + n[3:]
+        res.append(n)
+    return '/'.join(res)
+
+MAX_PATH_LEN_IN_HGSTORE = 120
+def fnlogencode(path):
+    if not path.startswith('data/'):
+        return path
+    ndpath = path[len('data/'):]
+    aep = auxencode(encodefilename(ndpath))
+    if len(aep) < MAX_PATH_LEN_IN_HGSTORE:
+        res = 'df/' + aep
+    else:
+        dirs = aep.split('/')
+        n = len(dirs)
+        hdir = ''
+        if n > 1:
+            shortdirs = [p[:8] for p in dirs[:min(n-1,8)]]
+            hdir = '/'.join(shortdirs) + '/'
+        root, ext = os.path.splitext(aep)
+        res = 'dh/' + hdir + sha.new(path).hexdigest() + ext
+    return res

(encodefilename is the current encoding function)

For the hashed names, I have taken the idea you presented earlier
on this list and adapted it a bit: take the first eight chars of each
of the first eight directory levels, then append the sha hash of the
full path.

Examples can be seen in tests/test-filenamelog.

For example, if filelog wants to store its *.i file as:

data/FIRST/SECOND/THIRD/FOURTH/FIFTH/SIXTH/SEVENTH/EIGHTH/NINETH/TENTH/ELEVENTH/LOREM.TXT.i

it is written to:

dh/_f_i_r_s/_s_e_c_o/_t_h_i_r/_f_o_u_r/_f_i_f_t/_s_i_x_t/_s_e_v_e/_e_i_g_h/213bfeabe713cd5571ac605bbc0cf5de4e682b43.i

because the other encoding would result in a path longer than MAX_PATH_LEN_IN_HGSTORE,
which I've defined as 120 (your fixed limit requirement of the parren encoding
switching idea - the hybrid scheme).
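
The hashed branch can be sketched as follows in Python 3. This is an
illustration under assumptions, not the patch code: encodefilename here
is a simplified stand-in that only handles uppercase ASCII letters and
'_', hashedname is a hypothetical name for the else-branch of
util.fnlogencode, auxencode is left out because no component of this
path is reserved, and hashlib.sha1 replaces the Python 2 sha module.
The sketch only shows the shape of the result; it makes no claim about
the digest value.

```python
import hashlib
import os

MAX_PATH_LEN_IN_HGSTORE = 120

def encodefilename(path):
    # Simplified stand-in (assumption): 'A' -> '_a', '_' -> '__',
    # everything else unchanged.
    out = []
    for c in path:
        if 'A' <= c <= 'Z':
            out.append('_' + c.lower())
        elif c == '_':
            out.append('__')
        else:
            out.append(c)
    return ''.join(out)

def hashedname(path):
    # Hypothetical helper mirroring the else-branch of util.fnlogencode.
    aep = encodefilename(path[len('data/'):])
    if len(aep) < MAX_PATH_LEN_IN_HGSTORE:
        raise ValueError('path is short enough to be stored unhashed')
    dirs = aep.split('/')
    n = len(dirs)
    hdir = ''
    if n > 1:
        # first eight chars of at most the first eight directory levels
        hdir = '/'.join(p[:8] for p in dirs[:min(n - 1, 8)]) + '/'
    ext = os.path.splitext(aep)[1]
    # the hash is taken over the full path, including 'data/'
    digest = hashlib.sha1(path.encode('ascii')).hexdigest()
    return 'dh/' + hdir + digest + ext

path = ('data/FIRST/SECOND/THIRD/FOURTH/FIFTH/SIXTH/SEVENTH/EIGHTH/'
        'NINETH/TENTH/ELEVENTH/LOREM.TXT.i')
stored = hashedname(path)
print(stored)
```

Only eight directory levels survive in shortened form; NINETH, TENTH,
ELEVENTH and the file name itself are represented by the hash alone.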

>> * aborts on Windows if repo store path length exceeds limit
>>
>> The filename encoding used is no longer reversible.
> 
> Both encodings? Why? Non-reversible encodings are encodings that can have
> collisions. So unless your encoding is cryptographically strong (in
> other words, makes collisions extremely unlikely), that's a problem.

No. There are no collisions. And for the hashed names, I've used
sha, as in your and Jesse's proposal.

The filename encoding for the files in .hg/store/df is
basically the same as the current encoding, with the addition of
what's done in util.auxencode (encode third letter: 'aux' -> 'au~78').

For example:

data/aux.bla/bla.aux/prn/PRN/lpt/com3/nul/coma/foo.NUL/normal.c.i

is encoded as:

df/au~78.bla/bla.aux/pr~6e/_p_r_n/lpt/co~6d3/nu~6c/coma/foo._n_u_l/normal.c.i
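
The example can be replayed with auxencode from the patch. In this
Python 3 sketch the input string is the result of applying the current
encoding by hand (uppercase letters become '_' plus the lowercase
letter), so only the new step is exercised.

```python
_windows_reserved_filenames = '''con prn aux nul
    com1 com2 com3 com4 com5 com6 com7 com8 com9
    lpt1 lpt2 lpt3 lpt4 lpt5 lpt6 lpt7 lpt8 lpt9'''.split()

def auxencode(path):
    # copied from the patch excerpt
    res = []
    for n in path.split('/'):
        if n:
            base = n.split('.')[0]
            if base and (base in _windows_reserved_filenames):
                # encode third letter ('aux' -> 'au~78')
                ec = "~%02x" % ord(n[2])
                n = n[0:2] + ec + n[3:]
        res.append(n)
    return '/'.join(res)

# 'PRN' and 'NUL' have already been rewritten by the current encoding.
current = ('aux.bla/bla.aux/prn/_p_r_n/lpt/com3/nul/coma/'
           'foo._n_u_l/normal.c.i')
stored = 'df/' + auxencode(current)
print(stored)
```

Only the part of a component before the first '.' is compared, which
is why 'aux.bla' is rewritten but 'bla.aux' is not. 'lpt' and 'coma'
are left alone because they are not in the reserved list.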

I just haven't provided a reverse encoding function for the non-hashed names,
because it is unneeded (the only part needing a reverse encoding would be
streamclone, and that needs the filenamelog anyway due to the hashed names,
so I can write all files to filenamelog and do away with the directory
walk entirely).

But the names in .hg/store/df are still reversible. I've written the
reverse encoding function in an earlier patch, but haven't included it
here because it is unneeded.
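
To illustrate why the df names stay reversible, here is a rough
Python 3 sketch of a decoder. It is not the function from the earlier
patch; dfdecode is a hypothetical name, and it assumes that the current
encoding escapes every literal '~' and '_' in a file name, so that any
'~' or '_' in a stored name starts an escape sequence.

```python
def dfdecode(stored):
    # Hypothetical reverse of the df encoding (assumption-laden sketch).
    if not stored.startswith('df/'):
        raise ValueError('not an unhashed store path')
    s = stored[len('df/'):]
    out = []
    i = 0
    while i < len(s):
        c = s[i]
        if c == '~':
            # '~xx' is a hex-escaped character, whether it came from
            # auxencode or from the current encoding
            out.append(chr(int(s[i + 1:i + 3], 16)))
            i += 3
        elif c == '_':
            # '__' is a literal underscore, '_x' is an uppercase 'X'
            nxt = s[i + 1]
            out.append('_' if nxt == '_' else nxt.upper())
            i += 2
        else:
            out.append(c)
            i += 1
    return 'data/' + ''.join(out)

stored = ('df/au~78.bla/bla.aux/pr~6e/_p_r_n/lpt/co~6d3/nu~6c/coma/'
          'foo._n_u_l/normal.c.i')
print(dfdecode(stored))
```

The third-letter escape added by auxencode decodes the same way as any
other '~xx' sequence, so no extra bookkeeping is needed.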

The names in .hg/store/dh are not reversible, due to the hashes.

>> Filelogs with full (unhashed) filenames are stored into '.hg/store/df',
> 
> Why change this name?

I needed to separate the namespace of the non-hashed filenames
from the hashed filenames.

A possible directory structure would have been

.hg/store/data/unhashed/...
.hg/store/data/hashed/...

Instead of wasting that much path length or inventing yet another masking
scheme for the hashed paths in order to store them into .hg/store/data, I
took the shorter and simpler radical change:

.hg/store/df/...
.hg/store/dh/...

to separate both worlds.

Another solution would have been

.hg/store/data/...
.hg/store/hdata/...

(or something similar). This would be misleading: people/tools
still expecting the old information in .hg/store/data would be
surprised not to find the hashed files in there.

So we can take another name for .hg/store/data too, because it
would not contain what's currently in .hg/store/data anyway.
It's something different anyway. If it's different, it can have
a different name.

>> This change depends on the fact that hg strip truncates filelog
>> files to zero length instead of deleting them. If strip should ever
>> start deleting empty filelogs, there will be duplicate entries
>> in the filenamelog if filelogs are recreated.