Possibly changing the path encoding format

Adrian Buehlmann adrian at cadifra.com
Fri Sep 21 13:16:12 CDT 2012

On 2012-09-21 01:41, Bryan O'Sullivan wrote:
> On Wed, Sep 12, 2012 at 1:11 PM, Bryan O'Sullivan <bos at serpentine.com> wrote:
>     I'd be happy to do the work to implement this, but this is one of
>     those rare cases where it's less work to describe the desired
>     behaviour in English than in code.
> I finally had time to write up a slightly modified version of this proposal:
> http://mercurial.selenic.com/wiki/fncache2RepoFormat
> I'll probably write a Python implementation tomorrow and see how it fares.

Bryan wrote on the wiki:
> == A somewhat simpler hashing scheme ==
> We retain the existing basic encoding scheme. For longer names:
>  1. Compute the hash (presumably still SHA-1) of the original pathname, probably without its extension (otherwise ".i" and ".d" files will hash differently, and will be less likely to be laid out contiguously on disk.)
>  1. Basic-encode the original pathname, using the encoding code we already have. Either stop or truncate at 200 bytes.

Encoding-wise, basic-encoding is - in theory - overkill for hashed paths.

For example, "the encoding code we already have" *includes* direncoding
- but that's unneeded.

A directory named foo.i can't possibly collide anyway with a file under
dh/, because the filename always has at least a length of 40, and the
shortened directories will be way shorter than that (or at least we can
easily make sure they are always shorter than 40).
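
A quick way to see the length argument, as a rough sketch: a SHA-1 hex
digest is 40 characters, so any filename that contains it is at least
that long. The DIR_LIMIT value and the sample path are assumptions for
illustration only.

```python
import hashlib

# Assumed cap on a shortened directory name; any value below 40 works.
DIR_LIMIT = 8

# The hex digest of SHA-1 is always 40 characters long.
digest = hashlib.sha1(b"data/foo.i/bar").hexdigest()
filename = digest + ".i"

# A directory named "foo.i", shortened to the assumed cap.
directory = "foo.i"[:DIR_LIMIT]

print(len(directory), len(filename))
```

Since the directory name is shorter than any possible filename, the two
can never be equal, whatever the directory ends in.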

Also, the direncoding doesn't survive directory-truncation, as the ".hg"
ending of an "xxx.i.hg" may be truncated away, thus producing the
unwanted ".i" ending again.
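
To illustrate, here is a minimal sketch. The direncode() below is a
simplified stand-in I made up for this mail, not the real function, and
the truncation limit of 5 is chosen only to trigger the problem.

```python
def direncode(component):
    # Simplified stand-in: append ".hg" to a directory name that ends in
    # ".i", ".d" or ".hg", so it cannot clash with a revlog file name.
    for suffix in (".i", ".d", ".hg"):
        if component.endswith(suffix):
            return component + ".hg"
    return component


def truncate(component, limit):
    # Plain truncation of a single path component.
    return component[:limit]


encoded = direncode("xxx.i")        # ".hg" gets appended
shortened = truncate(encoded, 5)    # and is cut off again here

print(encoded, shortened)
```

After truncation the component ends in ".i" again, so the direncoding
step bought us nothing.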

What's more, as I've already posted, there are possibly simpler ways to
encode directories named aux & friends.

The X -> _x encoding included in basic-encode is unneeded as well and
wastes path length bandwidth in the shortened dirs.
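
A rough sketch of the cost: encode_upper() is a hypothetical helper that
mimics only the uppercase and underscore part of the basic encoding, and
the limit of 8 characters per shortened dir is an assumption.

```python
def encode_upper(name):
    # Mimics only the case-folding part of the basic encoding:
    # "X" becomes "_x" and "_" becomes "__"; everything else is kept.
    out = []
    for ch in name:
        if ch == "_":
            out.append("__")
        elif "A" <= ch <= "Z":
            out.append("_" + ch.lower())
        else:
            out.append(ch)
    return "".join(out)


DIR_LIMIT = 8  # assumed length of a shortened directory name

plain = "README"
encoded = encode_upper(plain)

# How much of the original name survives the cut in each case.
print(plain[:DIR_LIMIT], encoded[:DIR_LIMIT])
```

An all-uppercase name doubles in length, so after the cut only half as
many of the original characters remain in the shortened dir.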

>  1. Truncate each path component and fix up any dangling spaces or dots that arise.
>  1. Append the hash to the end of the result of step 3.
>  1. Tack the original extension (".i" or ".d") back onto the end.
> This gives us three passes: hash, basic encode, fix up; and then 42 bytes of extra work to append the hash and extension.
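
For discussion, the five quoted steps could be sketched roughly like
this. Everything not stated on the wiki is an assumption on my side: the
function name, the per-component limit, replacing a dangling space or
dot with "_", and the identity function standing in for the real basic
encoding.

```python
import hashlib


def sketch_hashed_path(path, basic_encode=lambda p: p,
                       max_total=200, max_component=8):
    # Split off the two-character extension (".i" or ".d").
    stem, ext = path[:-2], path[-2:]
    if ext not in (".i", ".d"):
        raise ValueError("expected a path ending in .i or .d")

    # Step 1: hash the original pathname without its extension.
    digest = hashlib.sha1(stem.encode("utf-8")).hexdigest()

    # Step 2: basic-encode (stand-in by default), truncate at max_total.
    encoded = basic_encode(stem)[:max_total]

    # Step 3: truncate each component, fix up dangling spaces or dots.
    parts = []
    for comp in encoded.split("/"):
        comp = comp[:max_component]
        if comp and comp[-1] in " .":
            comp = comp[:-1] + "_"  # assumed fix-up rule
        parts.append(comp)

    # Steps 4 and 5: append the hash, then the original extension.
    return "/".join(parts) + digest + ext


index = sketch_hashed_path("data/Some Dir./file.i", max_component=5)
data = sketch_hashed_path("data/Some Dir./file.d", max_component=5)
print(index)
print(data)
```

Because the extension is left out of the hash, the ".i" and ".d" names
differ only in their last character, and the hash plus extension account
for the 42 trailing bytes.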
