Possibly changing the path encoding format

Adrian Buehlmann adrian at cadifra.com
Wed Sep 12 14:28:47 CDT 2012


On 2012-09-12 20:49, Bryan O'Sullivan wrote:
> On Wed, Sep 12, 2012 at 7:35 AM, Adrian Buehlmann <adrian at cadifra.com
> <mailto:adrian at cadifra.com>> wrote:
> 
>     I think this should be considerably simpler to translate into C code
>     than _auxencode.
> 
> 
> I'm not at all opposed to the idea of replacing the current hybrid
> encoding scheme with something new, especially given bug 3621. But let
> me outline some thoughts below.

Great. Thank you.

> Given my recent experience, I think that probably the best way to
> determine that an encoding scheme can be both efficient and
> comprehensible is to implement it in C.
> 
> Perhaps accidentally, the current basic encoding scheme actually fulfils
> both criteria of efficiency and comprehensibility.
> 
>     * It can be implemented with a simple state machine.
>     * We can identify a path that does /not/ need escaping in a single
>       pass, and in such cases we can allocate no extra memory and do no
>       copying.
>     * Even when a path must be escaped, we can do so with just a single
>       allocation and one more pass.
>     * In the no-escape case, no byte in a path needs to be inspected
>       more than once.
> 
> (And conveniently enough, the current auxencode can be implemented as a
> variant of basic encoding by just driving the same function with a
> couple of different lookup tables. As a result, it costs just a few
> extra lines to implement. That's a big saving in complexity.)
> 
> Compare this to the hashed encoding scheme. It's not immediately easy to
> see from the Python code, but a major difficulty with the hashed encoder
> is that it *must* be implemented using multiple passes, extra code
> paths, and copies, due to the data dependencies and variations over
> basic encoding.
> 
>    1. The SHA-1 hash must be computed over the dir-encoded name, which
>       means that we have to dir-encode a path before we do anything else.
>    2. We need a separate code path specially for dir-encoding.
>    3. The lower-case encoding scheme is different from the one used for
>       basic encoding, so it too must have a separate implementation.
>    4. We then aux-encode the lower-encoded text.
>    5. Once that's done, there's an additional very complicated
>       copy-and-fixup step (named "hashmangle" in my patches). This may
>       truncate and tweak every path component, and it then has to do
>       lots of further surgery to glue all the parts together with the
>       SHA-1 hash and the original suffix.
> 
> We have five passes over the name here: direncode, hash, lowerencode,
> auxencode, mangle. It's conceivable that we could combine the last three
> passes into one using a suitably clever state machine, but that looks
> like a nightmarish prospect to me :-)
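
Spelled out as a chain, with trivial stand-ins for the real steps (the
helpers below are placeholders, not the actual encoders, only there to make
the five passes and their data dependencies concrete):

    import hashlib

    def direncode(p):
        return p                  # placeholder for the dir-encoding pass
    def lowerencode(p):
        return p.lower()          # placeholder for the separate lower-encoding
    def auxencode(p):
        return p                  # placeholder for the aux-encoding pass
    def hashmangle(p, digest):
        return p + digest         # placeholder for truncation and splicing

    def hashedencode(path):
        direncoded = direncode(path)                    # pass 1
        digest = hashlib.sha1(
            direncoded.encode('utf-8')).hexdigest()     # pass 2: hash the dir-encoded name
        lowered = lowerencode(direncoded)               # pass 3
        auxed = auxencode(lowered)                      # pass 4
        return hashmangle(auxed, digest)                # pass 5
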

I'm totally open to dropping the current hashed encoder and replacing it
with something better.

> So really, given the above, it shouldn't surprise you that I think that
> auxencode, considered in isolation, is not the problem at all. (Based on
> brief inspection, I also don't think that the proposed change would
> simplify the auxencode state machine by more than a small amount.)

I agree that my auxencode2 possibly wouldn't change much compared to the
current auxencode. But it would let us drop the logic around where the
periods are.

And yes, auxencode is not that much of a problem, but even that one has
needless complexities. It's been four years since we did that, and I now
see more clearly what's unneeded. :-)

> Before I continue, I think that we could fix bug 3621 with just an hour
> or two of work on both the C and Python sides: just don't append the
> original suffix (this is where the unbounded length comes from), and
> call the new scheme fncache2. Boom! Done.

Sure. But that needs a new repo format. And that's what got me started
thinking.

> If you want to explore a simpler hashed encoding scheme, here's what I
> think it should be:
> 
>    1. Compute the hash (presumably still SHA-1) of the original pathname.
>    2. Basic-encode the original pathname, using the scheme we already
>       have. Either stop or truncate at 210 bytes.
>    3. Truncate each path component and fix up any dangling spaces or
>       dots that arise.
>    4. Append the hash to the end of the result of step 3.
> 
> This is quite simple. It avoids the need for separately dir-encoding and
> lower-encoding a path, along with much of the complexity of my
> hashmangle function. It has better data dependencies, too. Conceptually,
> it looks like it needs 3 passes over the pathname (hash, encode,
> truncate), which is already an improvement, but it could be implemented
> in just 2 if someone really cared (at a cost of another state machine
> and more code, and probably negligible benefit in performance).

Sounds good to me. I was actually hoping you would make a proposal in
Python.
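
Something along these lines, perhaps? Untested sketch of the four steps;
basicencode() below is just an identity stand-in for the existing basic
encoder, and the per-component limit and the '_' fix-up are placeholders I
picked arbitrarily:

    import hashlib

    _MAXSTOREPATH = 210      # step 2: truncate the encoded name at 210 bytes
    _MAXCOMPONENT = 120      # placeholder per-component limit for step 3

    def basicencode(path):
        return path          # identity stand-in for the existing basic encoder

    def hashencode2(path):
        # step 1: hash the original pathname
        digest = hashlib.sha1(path.encode('utf-8')).hexdigest()
        # step 2: basic-encode the original pathname, truncate at 210 bytes
        encoded = basicencode(path)[:_MAXSTOREPATH]
        # step 3: truncate each component, fix up dangling spaces and dots
        parts = [c[:_MAXCOMPONENT].rstrip(' .') or '_'
                 for c in encoded.split('/')]
        # step 4: append the hash to the result of step 3
        return '/'.join(parts) + digest

No separate dir-encoding or lower-encoding pass, and no original suffix
glued back on, so the result length stays bounded.
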


