[PATCH alternative-version] pathencode: for long paths, strip first 5 chars, not first dir

Adrian Buehlmann adrian at cadifra.com
Fri May 8 17:22:25 CDT 2015


On 2015-05-08 23:31, Martin von Zweigbergk wrote:
> 
> 
> On Fri, May 8, 2015 at 2:13 PM Adrian Buehlmann <adrian at cadifra.com
> <mailto:adrian at cadifra.com>> wrote:
> 
>     On 2015-05-08 21:59, Martin von Zweigbergk wrote:
>     > # HG changeset patch
>     > # User Martin von Zweigbergk <martinvonz at google.com
>     <mailto:martinvonz at google.com>>
>     > # Date 1430953094 25200
>     > #      Wed May 06 15:58:14 2015 -0700
>     > # Node ID baa1fb9f49617b0d554c6c47af9838922baf9a35
>     > # Parent  8179af513aebf96c4902ba3e5e3cf710d49501e4
>     > pathencode: for long paths, strip first 5 chars, not first dir
>     >
>     > When encoding long paths, the pure Python code strips the first
>     > directory from the path, while the native code currently strips the
>     > first 5 characters. This discrepancy has not been a problem so far,
>     > since we have not stored anything in directories other than
>     > data/. However, we will soon be storing submanifest revlogs in
>     > metadata/, so the discrepancy will have to go [1]. Since file
>     > collisions are avoided by the hashing alone (which is done on the full
>     > unencoded path), it doesn't really matter whether we drop the first
>     > dir, the first 5 characters, or special-case non-data/. To avoid
>     > touching the C code, let's always strip the first 5 characters.
>     >
>     >  [1] Or maybe elsewhere, but the discrepancy is ugly either way.
>     >
>     > diff -r 8179af513aeb -r baa1fb9f4961 mercurial/store.py
>     > --- a/mercurial/store.py      Thu May 07 16:43:58 2015 -0700
>     > +++ b/mercurial/store.py      Wed May 06 15:58:14 2015 -0700
>     > @@ -187,7 +187,7 @@
>     >
>     >  def _hashencode(path, dotencode):
>     >      digest = _sha(path).hexdigest()
>     > -    le = lowerencode(path).split('/')[1:]
>     > +    le = lowerencode(path[5:]).split('/')
>     >      parts = _auxencode(le, dotencode)
>     >      basename = parts[-1]
>     >      _root, ext = os.path.splitext(basename)
>     > diff -r 8179af513aeb -r baa1fb9f4961 tests/test-hybridencode.py
>     > --- a/tests/test-hybridencode.py      Thu May 07 16:43:58 2015 -0700
>     > +++ b/tests/test-hybridencode.py      Wed May 06 15:58:14 2015 -0700
>     > @@ -460,3 +460,9 @@
>     >            'VWXYZ-1234567890-xxxxxxxxx-xxxxxxxxx-xxxxxxxx-xxxx'
>     >            'xxxxx-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwww'
>     >            'wwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww.i')
>     > +
>     > +print "paths outside data/ can be encoded"
>     > +show('metadata/dir/00manifest.i')
>     > +show('metadata/12345678/12345678/12345678/12345678/12345678/'
>     > +          '12345678/12345678/12345678/12345678/12345678/12345678/'
>     > +          '12345678/12345678/00manifest.i')
>     > diff -r 8179af513aeb -r baa1fb9f4961 tests/test-hybridencode.py.out
>     > --- a/tests/test-hybridencode.py.out  Thu May 07 16:43:58 2015 -0700
>     > +++ b/tests/test-hybridencode.py.out  Wed May 06 15:58:14 2015 -0700
>     > @@ -491,3 +491,10 @@
>     >  A =
>     'data/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345/-xxxxxxxxx-xxxxxxxxx-xxxxxxxxx-123456789-12.3456789-12345-ABCDEFGHIJKLMNOPRSTUVWXYZ-abcdefghjiklmnopqrstuvwxyz-ABCDEFGHIJKLMNOPRSTUVWXYZ-1234567890-xxxxxxxxx-xxxxxxxxx-xxxxxxxx-xxxxxxxxx-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww.i'
>     >  B =
>     'dh/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345/-xxxxx93352aa50377751d9e5ebdf52da1e6e69a6887a6.i'
>     >
>     > +paths outside data/ can be encoded
>     > +A = 'metadata/dir/00manifest.i'
>     > +B = 'metadata/dir/00manifest.i'
>     > +
>     > +A =
>     'metadata/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/00manifest.i'
>     > +B =
>     'dh/ata/12345678/12345678/12345678/12345678/12345678/12345678/12345678/00manife0a4da1f89aa2aa9eb0896eb451288419049781b4.i'
>     > +
> 
>     The current layout (using the current, 'fncache', repo format) for
>     regular revlogs under .hg/store/ is:
> 
>       data/      for non-path-hashed revlogs
>       dh/        for path-hashed-revlogs
> 
>     You - as I understand it - want to separately store submanifests under a
>     new directory named 'metadata', under 'store' (leaving aside for now
>     that I don't like the name 'metadata', because of its length).
> 
>     Don't you then also need a separate directory for paths of submanifests
>     that trigger the path-hash-encoding?
> 
>     If you store path-hashed submanifest revlogs under 'dh', together with
>     non-submanifest revlogs, won't that lead to potential path collisions
>     inside dh?
> 
> 
> I think not. That's what I tried to explain with the "collisions are
> avoided by the hashing alone" bit.
>  

I do believe you are wrong. The name "metadata" (or parts of it)
together with 00manifest.i (with the .i encoded) may appear as a part of
the users files in the working tree.

Hashing won't solve that collision.

> 
> 
>     The trailing 'ata' of 'metadata' doesn't look like a correct
>     discriminator to me, as that string may derive from the working tree as
>     well.
> 


More information about the Mercurial-devel mailing list