[PATCH alternative-version] pathencode: for long paths, strip first 5 chars, not first dir

Martin von Zweigbergk martinvonz at google.com
Fri May 8 16:31:57 CDT 2015


On Fri, May 8, 2015 at 2:13 PM Adrian Buehlmann <adrian at cadifra.com> wrote:

> On 2015-05-08 21:59, Martin von Zweigbergk wrote:
> > # HG changeset patch
> > # User Martin von Zweigbergk <martinvonz at google.com>
> > # Date 1430953094 25200
> > #      Wed May 06 15:58:14 2015 -0700
> > # Node ID baa1fb9f49617b0d554c6c47af9838922baf9a35
> > # Parent  8179af513aebf96c4902ba3e5e3cf710d49501e4
> > pathencode: for long paths, strip first 5 chars, not first dir
> >
> > When encoding long paths, the pure Python code strips the first
> > directory from the path, while the native code currently strips the
> > first 5 characters. This discrepancy has not been a problem so far,
> > since we have not stored anything in directories other than
> > data/. However, we will soon be storing submanifest revlogs in
> > metadata/, so the discrepancy will have to go [1]. Since file
> > collisions are avoided by the hashing alone (which is done on the full
> > unencoded path), it doesn't really matter whether we drop the first
> > dir, the first 5 characters, or special-case non-data/. To avoid
> > touching the C code, let's always strip the first 5 characters.
> >
> >  [1] Or maybe elsewhere, but the discrepancy is ugly either way.
> >
> > diff -r 8179af513aeb -r baa1fb9f4961 mercurial/store.py
> > --- a/mercurial/store.py      Thu May 07 16:43:58 2015 -0700
> > +++ b/mercurial/store.py      Wed May 06 15:58:14 2015 -0700
> > @@ -187,7 +187,7 @@
> >
> >  def _hashencode(path, dotencode):
> >      digest = _sha(path).hexdigest()
> > -    le = lowerencode(path).split('/')[1:]
> > +    le = lowerencode(path[5:]).split('/')
> >      parts = _auxencode(le, dotencode)
> >      basename = parts[-1]
> >      _root, ext = os.path.splitext(basename)
> > diff -r 8179af513aeb -r baa1fb9f4961 tests/test-hybridencode.py
> > --- a/tests/test-hybridencode.py      Thu May 07 16:43:58 2015 -0700
> > +++ b/tests/test-hybridencode.py      Wed May 06 15:58:14 2015 -0700
> > @@ -460,3 +460,9 @@
> >            'VWXYZ-1234567890-xxxxxxxxx-xxxxxxxxx-xxxxxxxx-xxxx'
> >            'xxxxx-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwww'
> >            'wwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww.i')
> > +
> > +print "paths outside data/ can be encoded"
> > +show('metadata/dir/00manifest.i')
> > +show('metadata/12345678/12345678/12345678/12345678/12345678/'
> > +          '12345678/12345678/12345678/12345678/12345678/12345678/'
> > +          '12345678/12345678/00manifest.i')
> > diff -r 8179af513aeb -r baa1fb9f4961 tests/test-hybridencode.py.out
> > --- a/tests/test-hybridencode.py.out  Thu May 07 16:43:58 2015 -0700
> > +++ b/tests/test-hybridencode.py.out  Wed May 06 15:58:14 2015 -0700
> > @@ -491,3 +491,10 @@
> >  A =
> 'data/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345/-xxxxxxxxx-xxxxxxxxx-xxxxxxxxx-123456789-12.3456789-12345-ABCDEFGHIJKLMNOPRSTUVWXYZ-abcdefghjiklmnopqrstuvwxyz-ABCDEFGHIJKLMNOPRSTUVWXYZ-1234567890-xxxxxxxxx-xxxxxxxxx-xxxxxxxx-xxxxxxxxx-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww.i'
> >  B =
> 'dh/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345/-xxxxx93352aa50377751d9e5ebdf52da1e6e69a6887a6.i'
> >
> > +paths outside data/ can be encoded
> > +A = 'metadata/dir/00manifest.i'
> > +B = 'metadata/dir/00manifest.i'
> > +
> > +A =
> 'metadata/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/00manifest.i'
> > +B =
> 'dh/ata/12345678/12345678/12345678/12345678/12345678/12345678/12345678/00manife0a4da1f89aa2aa9eb0896eb451288419049781b4.i'
> > +
>
> The current layout (using the current, 'fncache', repo format) for
> regular revlogs under .hg/store/ is:
>
>   data/      for non-path-hashed revlogs
>   dh/        for path-hashed-revlogs
>
> You - as I understand it - want to separately store submanifests under a
> new directory named 'metadata', under 'store' (leaving aside for now
> that I don't like the name 'metadata', because of its length).
>
> Don't you then also need a separate directory for paths of submanifests
> that trigger the path-hash-encoding?
>
> If you store path-hashed submanifest revlogs under 'dh', together with
> non-submanifest revlogs, won't that lead to potential path collisions
> inside dh?


I think not. That's what I tried to explain with the "collisions are
avoided by the hashing alone" bit.


>


> The trailing 'ata' of 'metadata' doesn't look like a correct
> discriminator to me, as that string may derive from the working tree as
> well.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://selenic.com/pipermail/mercurial-devel/attachments/20150508/13e38292/attachment.html>


More information about the Mercurial-devel mailing list