Possibly changing the path encoding format

Bryan O'Sullivan bos at serpentine.com
Wed Sep 12 13:49:42 CDT 2012


On Wed, Sep 12, 2012 at 7:35 AM, Adrian Buehlmann <adrian at cadifra.com>wrote:

> I think this should be considerably simpler to translate into C code
> than _auxencode.
>

I'm not quite opposed to the idea of replacing the current hybrid encoding
scheme with something new, especially given bug 3621. But let me outline
some thoughts below.

Given my recent experience, I think that probably the best way to determine
that an encoding scheme can be both efficient and comprehensible is to
implement it in C.

Perhaps accidentally, the current basic encoding scheme actually fulfils
both criteria of efficiency and comprehensibility.

   - It can be implemented with a simple state machine.
   - We can identify a path that does *not* need escaping in a single pass,
   and in such cases we can allocate no extra memory and do no copying.
   - Even when a path must be escaped, we can do so with just a single
   allocation and one more pass.
   - In the no-escape case, no byte in a path needs to be inspected more
   than once.

(And conveniently enough, the current auxencode can be implemented as a
variant of basic encoding by just driving the same function with a couple
of different lookup tables. As a result, it costs just a few extra lines to
implement. That's a big saving in complexity.)

Compare this to the hashed encoding scheme. It's not immediately easy to
see from the Python code, but a major difficulty with the hashed encoder is
that it *must* be implemented using multiple passes, extra code paths, and
copies, due to the data dependencies and variations over basic encoding.

   1. The SHA-1 hash must be computed over the dir-encoded name, which
   means that we have to dir-encode a path before we do anything else.
   2. We need a separate code path specially for dir-encoding.
   3. The lower case encoding scheme is different than the one used for
   basic encoding, so it too must have a separate implementation.
   4. We then aux-encode the lower-encoded text.
   5. Once that's done, there's an additional very complicated
   copy-and-fixup step (named "hashmangle" in my patches). This may truncate
   and tweak every path component, and it then has to do lots of further
   surgery to glue together all the parts together with the SHA-1 hash and the
   original suffix.

We have five passes over the name here: direncode, hash, lowerencode,
auxencode, mangle. It's conceivable that we could combine the last three
passes into one using a suitably clever state machine, but that looks like
a nightmarish prospect to me :-)

So really, given the above, it shouldn't surprise you that I think that
auxencode, considered in isolation, is not the problem at all. (Based on
brief inspection, I also don't think that the proposed change would
simplify the auxencode state machine by more than a small amount.)

Before I continue, I think that we could fix bug 3621 with just an hour or
two of work on both the C and Python sides: just don't append the original
suffix (this is where the unbounded length comes from), and call the new
scheme fncache2. Boom! Done.

If you want to explore a simpler hashed encoding scheme, here's what I
think it should be:

   1. Compute the hash (presumably still SHA-1) of the original pathname.
   2. Basic-encode the original pathname, using the scheme we already have.
   Either stop or truncate at 210 bytes.
   3. Truncate each path component and fix up any dangling spaces or dots
   that arise.
   4. Append the hash to the end of the result of step 3.

This is quite simple. It avoids the need for separately dir-encoding and
lower-encoding a path, along with much of the complexity of my hashmangle
function. It has better data dependencies, too. Conceptually, it looks like
it needs 3 passes over the pathname (hash, encode, truncate), which is
already an improvement, but it could be implemented in just 2 if someone
really cared (at a cost of another state machine and more code, and
probably negligible benefit in performance).
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://selenic.com/pipermail/mercurial-devel/attachments/20120912/7013a986/attachment.html>


More information about the Mercurial-devel mailing list