Possibly changing the path encoding format

Sat Sep 15 14:27:51 CDT 2012

On 2012-09-12 20:49, Bryan O'Sullivan wrote:
> Compare this to the hashed encoding scheme. It's not immediately easy to
> see from the Python code, but a major difficulty with the hashed encoder
> is that it *must* be implemented using multiple passes, extra code
> paths, and copies, due to the data dependencies and variations over
> basic encoding.
> 
>    1. The SHA-1 hash must be computed over the dir-encoded name, which
>       means that we have to dir-encode a path before we do anything else.
>    2. We need a separate code path specially for dir-encoding.

I just had another look at the history of encodedir (which is the main
reason why I am writing this email).

It might be interesting to point out that it once (at the time the
fncache format was released in 2008) resided inside filelog.py, with
store.fncachestore knowing nothing about that encoding step.

Benoit then in May 2009 moved it into store.py with

  http://selenic.com/repo/hg/rev/810387f59696#l3.43

which then made the encodedir() call explicitly appear in the
hybridencode() function:

  http://selenic.com/repo/hg/file/810387f59696/mercurial/store.py#l126

It's correct that way, but it now perhaps looks a bit strange. Why would
anyone want to do the hashing with the direncoded path?

>    3. The lower case encoding scheme is different than the one used for
>       basic encoding, so it too must have a separate implementation.
>    4. We then aux-encode the lower-encoded text.

As already mentioned, I did the lowercasing in order not to waste
precious remaining path space on escapes like HELLO -> _h_e_l_l_o.
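
To make the space cost concrete, here is a minimal sketch in Python. The
helper escape_upper is hypothetical and only models the uppercase part of
the basic encoding (each uppercase letter becomes "_" plus its lowercase
form, and a literal "_" is doubled so the mapping stays reversible). It is
not the code from store.py.

```python
def escape_upper(name):
    """Hypothetical helper: escape uppercase letters the way the basic
    encoding is described above (HELLO -> _h_e_l_l_o)."""
    out = []
    for ch in name:
        if ch == "_":
            # Assumption: a literal underscore is doubled so it cannot be
            # confused with the escape prefix.
            out.append("__")
        elif "A" <= ch <= "Z":
            out.append("_" + ch.lower())
        else:
            out.append(ch)
    return "".join(out)


name = "HELLO"
escaped = escape_upper(name)  # reversible, but twice as long
lowered = name.lower()        # lossy, but no longer than the input

print(len(name), len(escaped), len(lowered))
```

The escaped form doubles the length of an all-uppercase component, while
plain lowercasing never grows it. Lowercasing loses information, which is
acceptable only because the hash carries the distinction.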

As the fncache format does the hashing on the direncoded but otherwise
original path (which preserves case), it can be safely combined with
lowercasing. The SHA-1 hash is distinctive enough. It doesn't matter if
files from a directory named "HELLO" and from another directory named
"hello" will land in the same dir under dh/. They will get different
hashes, as those directories have different names.
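
A small sketch of that argument, assuming only Python's hashlib. The
function name and the example paths are made up for illustration, and the
real encoder does much more than this (direncoding, aux-encoding,
truncation):

```python
import hashlib


def lowered_and_digest(path):
    """Hypothetical helper: return the lowercased path together with the
    SHA-1 hex digest of the original, case-preserved path."""
    digest = hashlib.sha1(path.encode("utf-8")).hexdigest()
    return path.lower(), digest


upper_low, upper_digest = lowered_and_digest("data/HELLO/file.i")
lower_low, lower_digest = lowered_and_digest("data/hello/file.i")

# The lowercased forms collide, so both files would share a directory...
print(upper_low == lower_low)
# ...but the digests come from the case-preserved paths and differ.
print(upper_digest == lower_digest)
```

Because the digest is taken before lowercasing, the two store names stay
distinct even though their directory prefixes are identical.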

The aux-encoding was needed for obvious reasons.

>    5. Once that's done, there's an additional very complicated
>       copy-and-fixup step (named "hashmangle" in my patches). This may
>       truncate and tweak every path component, and it then has to do
>       lots of further surgery to glue together all the parts together
>       with the SHA-1 hash and the original suffix.

The hashmangle step indeed really hurts with regard to complexity. The
complexity is in the Python code already.

When I came up with the surgery done there in 2008, I would never have
thought that anyone would ever want to implement that in C.

At least it took 4 years until that happened :-)

> We have five passes over the name here: direncode, hash, lowerencode,
> auxencode, mangle. It's conceivable that we could combine the last three
> passes into one using a suitably clever state machine, but that looks
> like a nightmarish prospect to me :-)