Windows long path experiment report

Peter Arrenbrecht peter.arrenbrecht at gmail.com
Fri Jun 20 16:22:48 CDT 2008


On Fri, Jun 20, 2008 at 8:56 PM, Matt Mackall <mpm at selenic.com> wrote:
>
> On Fri, 2008-06-20 at 14:50 +0100, Paul Moore wrote:
>> On 20/06/2008, Peter Arrenbrecht <peter.arrenbrecht at gmail.com> wrote:
>> > If `name` is a path, then the result will not sort well relative
>> > to the original directory structure.
>>
>> I was thinking of encoding each path element in turn. Actually, I just
>> assumed that was what Mercurial currently did, and so didn't state it
>> explicitly. Sorry.
>>
>> > If it's just a component name, then we'll bump against the 260 (or so)
>> > max path length limit again soon (260 / 42 ~= 6 = max folder nesting
>> > depth). So at least if the component name is shorter than 42 chars, we
>> > should drop the encoding and use the plain name. And we shall have to
>> > disambiguate hashed and non-hashed names, so hashed names should maybe
>> > always contain a special char and any plain name that contains said
>> > char also gets hashed automatically, or something.
>>
>> Fair enough - as I said, it wasn't a well-thought-out proposal. How
>> about encoding elements as now, plus an initial underscore (to handle
>> reserved names)? If the resulting element name is over 42 characters,
>> then take the first 10 characters plus an MD5 hash. Or just convert
>> any name over 32 characters to a hash - that breaks sorting
>> completely, but only in rare cases. You'd need a flag character to
>> say "hashed" - maybe use an 'x' rather than an underscore as prefix.
>
> Again, it's not at all about the length of individual path components,
> it's the overall path length.
>
> Projects that have a balanced distribution of files generally won't run
> into this problem. If we have filenames up to 20 characters, and 10
> files or directories per directory, a balanced project can grow up to 13
> levels deep, or 10^13 files. Expand that to 40 character filenames, and
> we can still have 10^6 files.
>
> But there are unbalanced projects today that have 10^5 - 10^6 files that
> are running into the limit. They're pushing it off for now by turning
> off all the escaping.
>
> We could have an optional layout where everything used hashing. But
> remember, hashing has an order of magnitude or so performance impact.
> If large projects (the ones that are the most performance sensitive) are
> forced to use hashing, they won't be happy.
>
> I've considered a split scheme where any pathname over some length gets
> hashed, but again, it will mostly impact those large projects.
>
> What's perhaps needed is a hybrid scheme. Something that takes:
>
> A/Very/long/Filename/with/Many/components/that/just/keeps/going/and/going.txt
>
> ..and turns it into something like:
>
> %00avelofiwimacoth/667f8f2c8402b51b51bae6987d2bf524cd4bfc85.i
>
> (taking the first letter or two from the first 8 directory components,
> prepending an encoded null so that no files in the regular encoding will
> collide)
>
> In other words, something that's at least partially locality-preserving.
> Files in the same directory will stay in the same directory, though
> possibly intermixed with a smallish number of other files (hopefully a
> small fraction of the total project size). Directories near each other
> in the working directory may even end up near each other in the repo in
> this layout too.
>
> And we'd only use this when names encoded in the regular way went over
> some threshold.

Yes, that's what I've been thinking about too (see other messages in
this thread). I like your idea of still grouping files from one
directory together in a single directory. This should help on
filesystems that handle huge numbers of files in a single directory
badly (not that I expect to see repos with tons of files needing to
be hashed, but still).
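
To make sure I read the scheme right, here is a rough Python sketch
of such a hybrid encoder. The threshold, the two-letter prefixes, the
eight-component cut-off and the use of SHA-1 are all just guesses
extrapolated from your example, not a worked-out proposal:

    import hashlib

    MAX_ENCODED = 260 // 2  # assumed threshold, MAX_PATH/2 (see PS)

    def hybrid_encode(path):
        # Purely illustrative, not Mercurial's actual store encoding.
        if len(path) <= MAX_ENCODED:
            return path  # the regular escaping would apply here
        dirs = path.split('/')[:-1]
        # First letter or two of the first 8 directory components,
        # to keep the layout partially locality-preserving.
        prefix = ''.join(d[:2].lower() for d in dirs[:8])
        digest = hashlib.sha1(path.encode('utf-8')).hexdigest()
        # The leading encoded null keeps hashed names from ever
        # colliding with names produced by the regular encoding.
        return '%00' + prefix + '/' + digest + '.i'

With a low enough threshold, your example path would come out as
%00avelofiwimacoth/<sha1>.i, just like in your sketch.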

A key question is how to set the threshold. Will it be fixed across
all repos, or configurable per repo? If fixed, what should it be? If
configurable, we expose more complexity to users, but allow more
fine-tuning. And in that case, will the default be fixed across all
filesystems, so that it reasonably works on Windows (something like
260/2)? Or will Unix filesystems default to no maximum length, and
Windows ones default to 260/2?
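
(If defaults did vary, picking one could be as simple as the rough
sketch below; real code would presumably have to detect the actual
filesystem rather than just the OS:)

    import os

    def default_maxlength():
        # Hypothetical policy: MAX_PATH/2 on Windows, no limit
        # elsewhere. None means "never hash".
        return 260 // 2 if os.name == 'nt' else None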

I currently lean towards making it configurable per repo and having
ext3 et al. use no limit and NTFS/VFAT etc. something like 260/2.
This assumes that sharing repos across operating systems via the
filesystem is not a very typical scenario. But I might be wrong.
Another approach I find attractive is to leave it configurable but
set the default to 260/2 for all filesystems. That might lead to
fewer surprises for most users, at perhaps a slight performance cost
for some users on Unix. But they would still have the option to get
rid of this cost by, for example, setting a default argument for
clone in hgrc.
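
For example, assuming a hypothetical format.maxpathlen option, an
hgrc entry along these lines would do (the option name is invented,
but the [defaults] section and --config exist today):

    [defaults]
    clone = --config format.maxpathlen=0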

One aspect this scheme does not address is filesystems that seriously
limit the length of individual path components. I think this is
another argument for making such aspects of the repo layout
configurable: we could then simply introduce another config option
that starts hashing as soon as a single path component exceeds the
maximum size, and the idea that repos are not per se binary-compatible
without some tweaking would already be established.
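
A sketch of what that could look like, reusing Paul's 'x' marker idea
from above (again purely hypothetical, with the 42-character limit
taken from earlier in the thread):

    import hashlib

    def encode_component(name, maxlen=42):
        # Hypothetical: hash any single path component that is too
        # long for the filesystem. The 'x' prefix marks hashed names;
        # plain names starting with 'x' would have to be escaped
        # elsewhere to disambiguate.
        if len(name) <= maxlen:
            return name
        return 'x' + hashlib.sha1(name.encode('utf-8')).hexdigest()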

-parren

ps. Please refer to my other messages in this thread for the rationale
behind 260/2 = MAX_PATH/2.

