Windows long path experimenting report

Matt Mackall mpm at selenic.com
Fri Jun 20 13:56:17 CDT 2008


On Fri, 2008-06-20 at 14:50 +0100, Paul Moore wrote:
> On 20/06/2008, Peter Arrenbrecht <peter.arrenbrecht at gmail.com> wrote:
> > If `name` is a path, then this will not sort like the original
> > structure very well.
> 
> I was thinking of encoding each path element in turn. Actually, I just
> assumed that was what Mercurial currently did, and so didn't state it
> explicitly. Sorry.
> 
> > If it's just a component name, then we'll bump against the 260 (or so)
> > max path length limit again soon (260 / 42 ~= 6 = max folder nesting
> > depth). So at least if the component name is shorter than 42 chars, we
> > should drop the encoding and use the plain name. And we shall have to
> > disambiguate hashed and non-hashed names, so hashed names should maybe
> > always contain a special char and any plain name that contains said
> > char also gets hashed automatically, or something.
> 
> Fair enough - as I said, it wasn't a well thought out proposal. How
> about encoding elements as now, plus an initial underscore (to handle
> reserved names). If the resulting element name is over 42 characters,
> then take the first 10 plus a MD5 hash. Or just convert any over 32 to
> a hash - that breaks sorting completely, but only for rare cases.
> You'd need a flag character to say "hashed" - maybe use an 'x' rather
> than an underscore as prefix.

Again, this isn't about the length of individual path components at
all; it's about the overall path length.

Projects that have a balanced distribution of files generally won't run
into this problem. If we have filenames up to 20 characters, and 10
files or directories per directory, a balanced project can grow up to 13
levels deep (260 / 20), or 10^13 files. Expand that to 40-character
filenames, and we can still go 6 levels deep, or 10^6 files.
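
The arithmetic above can be checked with a short sketch. It assumes the
roughly 260-character limit quoted earlier in the thread and ignores
separators and any store prefix, so it is only a rough estimate. The
name balanced_capacity is made up for illustration.

```python
MAX_PATH = 260  # approximate Windows limit quoted earlier in the thread


def balanced_capacity(name_len, fanout=10, max_path=MAX_PATH):
    """Return (depth, files) for a perfectly balanced tree.

    Rough estimate only: path separators and any repository prefix
    are ignored.
    """
    depth = max_path // name_len      # how many components fit in the path
    return depth, fanout ** depth     # fanout entries at every level


for name_len in (20, 40):
    depth, files = balanced_capacity(name_len)
    print(f"{name_len}-char names: {depth} levels, about {files:.0e} files")
```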

But there are unbalanced projects today that have 10^5 - 10^6 files that
are running into the limit. They're pushing it off for now by turning
off all the escaping.

We could have an optional layout where everything used hashing. But
remember, hashing has an order of magnitude or so performance impact.
If large projects (the ones that are the most performance sensitive) are
forced to use hashing, they won't be happy. 

I've considered a split scheme where any pathname over some length gets
hashed, but again, it will mostly impact those large projects.

What's perhaps needed is a hybrid scheme. Something that takes:

A/Very/long/Filename/with/Many/components/that/just/keeps/going/and/going.txt

...and turns it into something like:

%00avelofiwimacoth/667f8f2c8402b51b51bae6987d2bf524cd4bfc85.i

(taking the first letter or two, lowercased, from each of the first 8
directory components, prepending an encoded null so that no files in
the regular encoding will collide, and using a 40-hex-digit hash as the
file name)
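
A minimal sketch of that transformation, to make the idea concrete. The
function name hybrid_name and its parameters are hypothetical. The mail
does not say what gets hashed, so SHA-1 of the full path is an
assumption here, and the digest will not match the example hash above.

```python
import hashlib


def hybrid_name(path, ncomponents=8, nchars=2):
    """Hypothetical sketch of the hybrid layout described above."""
    dirs = path.split("/")[:-1]  # directory components only
    # Abbreviate the first few directories to preserve some locality.
    prefix = "".join(d[:nchars].lower() for d in dirs[:ncomponents])
    # Assumption: SHA-1 of the full path; the mail leaves this open.
    digest = hashlib.sha1(path.encode("utf-8")).hexdigest()
    # "%00" is the encoded null that keeps these names apart from
    # names produced by the regular encoding.
    return "%00" + prefix + "/" + digest + ".i"


example = ("A/Very/long/Filename/with/Many/components/"
           "that/just/keeps/going/and/going.txt")
print(hybrid_name(example))
```

Two files in the same working directory get the same abbreviated
directory prefix, which is the locality property in question.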

In other words, something that's at least partially locality-preserving.
Files in the same directory will stay in the same directory, though
possibly intermixed with a smallish number of other files (hopefully a
small fraction of the total project size). Directories near each other
in the working directory may even end up near each other in the repo in
this layout too.

And we'd only use this when names encoded in the regular way went over
some threshold.
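
The threshold rule could be sketched as below. Everything here is a
stand-in: store_name, the threshold of 200, the identity-style regular
encoding, and the hashed fallback are assumptions for illustration, not
an existing Mercurial API.

```python
import hashlib


def store_name(path, regular_encode, threshold=200):
    """Use the regular encoding unless its result exceeds `threshold`."""
    regular = regular_encode(path)
    if len(regular) <= threshold:
        return regular  # the common case keeps the plain layout
    # Fallback: stand-in for the hashed layout; details are undecided.
    digest = hashlib.sha1(path.encode("utf-8")).hexdigest()
    return "%00/" + digest + ".i"


def encode(path):
    """Stand-in for the regular encoding: just append '.i'."""
    return path + ".i"


print(store_name("a/b.txt", encode))
print(store_name("/".join(["d" * 30] * 10), encode))
```

Only paths whose regular encoding is too long pay the hashing cost, so
projects that fit under the limit are unaffected.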

-- 
Mathematics is the supreme nostalgia of our time.


