Possibly changing the path encoding format

Adrian Buehlmann adrian at cadifra.com
Wed Sep 12 14:28:47 CDT 2012

On 2012-09-12 20:49, Bryan O'Sullivan wrote:
> On Wed, Sep 12, 2012 at 7:35 AM, Adrian Buehlmann <adrian at cadifra.com
> <mailto:adrian at cadifra.com>> wrote:
>     I think this should be considerably simpler to translate into C code
>     than _auxencode.
> I'm not quite opposed to the idea of replacing the current hybrid
> encoding scheme with something new, especially given bug 3621. But let
> me outline some thoughts below.

Great. Thank you.

> Given my recent experience, I think that probably the best way to
> determine that an encoding scheme can be both efficient and
> comprehensible is to implement it in C.
> Perhaps accidentally, the current basic encoding scheme actually fulfils
> both criteria of efficiency and comprehensibility.
>     * It can be implemented with a simple state machine.
>     * We can identify a path that does /not/ need escaping in a single
>       pass, and in such cases we can allocate no extra memory and do no
>       copying.
>     * Even when a path must be escaped, we can do so with just a single
>       allocation and one more pass.
>     * In the no-escape case, no byte in a path needs to be inspected
>       more than once.
> (And conveniently enough, the current auxencode can be implemented as a
> variant of basic encoding by just driving the same function with a
> couple of different lookup tables. As a result, it costs just a few
> extra lines to implement. That's a big saving in complexity.)
> Compare this to the hashed encoding scheme. It's not immediately easy to
> see from the Python code, but a major difficulty with the hashed encoder
> is that it *must* be implemented using multiple passes, extra code
> paths, and copies, due to the data dependencies and variations over
> basic encoding.
>    1. The SHA-1 hash must be computed over the dir-encoded name, which
>       means that we have to dir-encode a path before we do anything else.
>    2. We need a separate code path specially for dir-encoding.
>    3. The lower case encoding scheme is different than the one used for
>       basic encoding, so it too must have a separate implementation.
>    4. We then aux-encode the lower-encoded text.
>    5. Once that's done, there's an additional very complicated
>       copy-and-fixup step (named "hashmangle" in my patches). This may
>       truncate and tweak every path component, and it then has to do
>       lots of further surgery to glue together all the parts together
>       with the SHA-1 hash and the original suffix.
> We have five passes over the name here: direncode, hash, lowerencode,
> auxencode, mangle. It's conceivable that we could combine the last three
> passes into one using a suitably clever state machine, but that looks
> like a nightmarish prospect to me :-)

I'm totally open to drop the current hashed encoder and replace it with
something better.

> So really, given the above, it shouldn't surprise you that I think that
> auxencode, considered in isolation, is not the problem at all. (Based on
> brief inspection, I also don't think that the proposed change would
> simplify the auxencode state machine by more than a small amount.)

I agree that my auxencode2 possibly wouldn't change much compared to
current auxencode. But you could drop the logic around where the periods

And yes, auxencode is not that much of a problem, but already that one
has needless complexities. It's been four years since we did that and I
now more clearly see what's unneeded. :-)

> Before I continue, I think that we could fix bug 3621 with just an hour
> or two of work on both the C and Python sides: just don't append the
> original suffix (this is where the unbounded length comes from), and
> call the new scheme fncache2. Boom! Done.

Sure. But it needs a new repo format for that. And that's what got me
started thinking.

> If you want to explore a simpler hashed encoding scheme, here's what I
> think it should be:
>    1. Compute the hash (presumably still SHA-1) of the original pathname.
>    2. Basic-encode the original pathname, using the scheme we already
>       have. Either stop or truncate at 210 bytes.
>    3. Truncate each path component and fix up any dangling spaces or
>       dots that arise.
>    4. Append the hash to the end of the result of step 3.
> This is quite simple. It avoids the need for separately dir-encoding and
> lower-encoding a path, along with much of the complexity of my
> hashmangle function. It has better data dependencies, too. Conceptually,
> it looks like it needs 3 passes over the pathname (hash, encode,
> truncate), which is already an improvement, but it could be implemented
> in just 2 if someone really cared (at a cost of another state machine
> and more code, and probably negligible benefit in performance).

Sounds good to me. I was actually hoping you would make a proposal in

More information about the Mercurial-devel mailing list