[PATCH alternative-version] pathencode: for long paths, strip first 5 chars, not first dir
Adrian Buehlmann
adrian at cadifra.com
Fri May 8 17:22:25 CDT 2015
On 2015-05-08 23:31, Martin von Zweigbergk wrote:
>
>
> On Fri, May 8, 2015 at 2:13 PM Adrian Buehlmann <adrian at cadifra.com
> <mailto:adrian at cadifra.com>> wrote:
>
> On 2015-05-08 21:59, Martin von Zweigbergk wrote:
> > # HG changeset patch
> > # User Martin von Zweigbergk <martinvonz at google.com
> <mailto:martinvonz at google.com>>
> > # Date 1430953094 25200
> > # Wed May 06 15:58:14 2015 -0700
> > # Node ID baa1fb9f49617b0d554c6c47af9838922baf9a35
> > # Parent 8179af513aebf96c4902ba3e5e3cf710d49501e4
> > pathencode: for long paths, strip first 5 chars, not first dir
> >
> > When encoding long paths, the pure Python code strips the first
> > directory from the path, while the native code currently strips the
> > first 5 characters. This discrepancy has not been a problem so far,
> > since we have not stored anything in directories other than
> > data/. However, we will soon be storing submanifest revlogs in
> > metadata/, so the discrepancy will have to go [1]. Since file
> > collisions are avoided by the hashing alone (which is done on the full
> > unencoded path), it doesn't really matter whether we drop the first
> > dir, the first 5 characters, or special-case non-data/. To avoid
> > touching the C code, let's always strip the first 5 characters.
> >
> > [1] Or maybe elsewhere, but the discrepancy is ugly either way.
> >
> > diff -r 8179af513aeb -r baa1fb9f4961 mercurial/store.py
> > --- a/mercurial/store.py Thu May 07 16:43:58 2015 -0700
> > +++ b/mercurial/store.py Wed May 06 15:58:14 2015 -0700
> > @@ -187,7 +187,7 @@
> >
> > def _hashencode(path, dotencode):
> > digest = _sha(path).hexdigest()
> > - le = lowerencode(path).split('/')[1:]
> > + le = lowerencode(path[5:]).split('/')
> > parts = _auxencode(le, dotencode)
> > basename = parts[-1]
> > _root, ext = os.path.splitext(basename)
> > diff -r 8179af513aeb -r baa1fb9f4961 tests/test-hybridencode.py
> > --- a/tests/test-hybridencode.py Thu May 07 16:43:58 2015 -0700
> > +++ b/tests/test-hybridencode.py Wed May 06 15:58:14 2015 -0700
> > @@ -460,3 +460,9 @@
> > 'VWXYZ-1234567890-xxxxxxxxx-xxxxxxxxx-xxxxxxxx-xxxx'
> > 'xxxxx-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwww'
> > 'wwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww.i')
> > +
> > +print "paths outside data/ can be encoded"
> > +show('metadata/dir/00manifest.i')
> > +show('metadata/12345678/12345678/12345678/12345678/12345678/'
> > + '12345678/12345678/12345678/12345678/12345678/12345678/'
> > + '12345678/12345678/00manifest.i')
> > diff -r 8179af513aeb -r baa1fb9f4961 tests/test-hybridencode.py.out
> > --- a/tests/test-hybridencode.py.out Thu May 07 16:43:58 2015 -0700
> > +++ b/tests/test-hybridencode.py.out Wed May 06 15:58:14 2015 -0700
> > @@ -491,3 +491,10 @@
> > A =
> 'data/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345/-xxxxxxxxx-xxxxxxxxx-xxxxxxxxx-123456789-12.3456789-12345-ABCDEFGHIJKLMNOPRSTUVWXYZ-abcdefghjiklmnopqrstuvwxyz-ABCDEFGHIJKLMNOPRSTUVWXYZ-1234567890-xxxxxxxxx-xxxxxxxxx-xxxxxxxx-xxxxxxxxx-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww.i'
> > B =
> 'dh/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345/-xxxxx93352aa50377751d9e5ebdf52da1e6e69a6887a6.i'
> >
> > +paths outside data/ can be encoded
> > +A = 'metadata/dir/00manifest.i'
> > +B = 'metadata/dir/00manifest.i'
> > +
> > +A =
> 'metadata/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/00manifest.i'
> > +B =
> 'dh/ata/12345678/12345678/12345678/12345678/12345678/12345678/12345678/00manife0a4da1f89aa2aa9eb0896eb451288419049781b4.i'
> > +
>
> The current layout (using the current, 'fncache', repo format) for
> regular revlogs under .hg/store/ is:
>
> data/ for non-path-hashed revlogs
> dh/ for path-hashed-revlogs
>
> You - as I understand it - want to separately store submanifests under a
> new directory named 'metadata', under 'store' (leaving aside for now
> that I don't like the name 'metadata', because of its length).
>
> Don't you then also need a separate directory for paths of submanifests
> that trigger the path-hash-encoding?
>
> If you store path-hashed submanifest revlogs under 'dh', together with
> non-submanifest revlogs, won't that lead to potential path collisions
> inside dh?
>
>
> I think not. That's what I tried to explain with the "collisions are
> avoided by the hashing alone" bit.
>
I do believe you are wrong. The name "metadata" (or parts of it)
together with 00manifest.i (with the .i encoded) may appear as a part of
the users files in the working tree.
Hashing won't solve that collision.
>
>
> The trailing 'ata' of 'metadata' doesn't look like a correct
> discriminator to me, as that string may derive from the working tree as
> well.
>
More information about the Mercurial-devel
mailing list