Windows long path experimenting report

Adrian Buehlmann adrian at cadifra.com
Thu Jun 26 05:22:20 CDT 2008


On 20.06.2008 23:49, Matt Mackall wrote:
> On Fri, 2008-06-20 at 23:22 +0200, Peter Arrenbrecht wrote:
>> On Fri, Jun 20, 2008 at 8:56 PM, Matt Mackall <mpm at selenic.com> wrote:
>>> On Fri, 2008-06-20 at 14:50 +0100, Paul Moore wrote:
>>>> On 20/06/2008, Peter Arrenbrecht <peter.arrenbrecht at gmail.com> wrote:
>>>>> If `name` is a path, then this will not sort like the original
>>>>> structure very well.
>>>> I was thinking of encoding each path element in turn. Actually, I just
>>>> assumed that was what Mercurial currently did, and so didn't state it
>>>> explicitly. Sorry.
>>>>
>>>>> If it's just a component name, then we'll bump against the 260 (or so)
>>>>> max path length limit again soon (260 / 42 ~= 6 = max folder nesting
>>>>> depth). So at least if the component name is shorter than 42 chars, we
>>>>> should drop the encoding and use the plain name. And we shall have to
>>>>> disambiguate hashed and non-hashed names, so hashed names should maybe
>>>>> always contain a special char and any plain name that contains said
>>>>> char also gets hashed automatically, or something.
>>>> Fair enough - as I said, it wasn't a well thought out proposal. How
>>>> about encoding elements as now, plus an initial underscore (to handle
>>>> reserved names). If the resulting element name is over 42 characters,
>>>> then take the first 10 plus an MD5 hash. Or just convert any over 32 to
>>>> a hash - that breaks sorting completely, but only for rare cases.
>>>> You'd need a flag character to say "hashed" - maybe use an 'x' rather
>>>> than an underscore as prefix.
>>> Again, it's not at all about the length of individual path components,
>>> it's the overall path length.
>>>
>>> Projects that have a balanced distribution of files generally won't run
>>> into this problem. If we have filenames up to 20 characters, and 10
>>> files or directories per directory, a balanced project can grow up to 13
>>> levels deep, or 10^13 files. Expand that to 40 character filenames, and
>>> we can still have 10^6 files.
>>>
>>> But there are unbalanced projects today that have 10^5 - 10^6 files that
>>> are running into the limit. They're pushing it off for now by turning
>>> off all the escaping.
>>>
>>> We could have an optional layout where everything used hashing. But
>>> remember, hashing has an order of magnitude or so performance impact.
>>> If large projects (the ones that are the most performance sensitive) are
>>> forced to use hashing, they won't be happy.
>>>
>>> I've considered a split scheme where any pathname over some length gets
>>> hashed, but again, it will mostly impact those large projects.
>>>
>>> What's perhaps needed is a hybrid scheme. Something that takes:
>>>
>>> A/Very/long/Filename/with/Many/components/that/just/keeps/going/and/going.txt
>>>
>>> ..and turns it into something like:
>>>
>>> %00avelofiwimacoth/667f8f2c8402b51b51bae6987d2bf524cd4bfc85.i
>>>
>>> (taking the first letter or two from the first 8 directory components,
>>> prepending an encoded null so that no files in the regular encoding will
>>> collide)
>>>
>>> In other words, something that's at least partially locality-preserving.
>>> Files in the same directory will stay in the same directory, though
>>> possibly intermixed with a smallish number of other files (hopefully a
>>> small fraction of the total project size). Directories near each other
>>> in the working directory may even end up near each other in the repo in
>>> this layout too.
>>>
>>> And we'd only use this when names encoded in the regular way went over
>>> some threshold.
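
To make the hybrid scheme above concrete, here is a minimal sketch in
Python. It is only an illustration under stated assumptions, not an agreed
design: the function name hybrid_store_name, the use of SHA-1 over the full
path, the identity "regular" encoding and the 150 default are all
hypothetical. The prefix rule (up to two lowercased characters from each of
the first 8 directory components, behind an encoded null) follows the
example given above.

```python
import hashlib

# Assumed threshold; the thread discusses 130, 150 and 200.
MAX_PLAIN_LEN = 150


def hybrid_store_name(path, max_len=MAX_PLAIN_LEN, ndirs=8, nchars=2):
    """Map a repo-relative file name to a store file name (sketch).

    Short names keep their regular form (assumed here to be the
    identity encoding, which is a simplification). Long names become
    '%00' + a short directory prefix + '/' + a hash of the full path.
    """
    if len(path) <= max_len:
        return path + ".i"

    # Directory components only; the last component is the file name.
    dirs = path.split("/")[:-1]

    # Up to `nchars` characters from each of the first `ndirs`
    # directories, lowercased, to keep some locality.
    prefix = "".join(d[:nchars].lower() for d in dirs[:ndirs])

    # Assumption: SHA-1 of the UTF-8 encoded full path (40 hex digits).
    digest = hashlib.sha1(path.encode("utf-8")).hexdigest()

    return "%00" + prefix + "/" + digest + ".i"
```

The example path from above is only 77 characters long, so it is hashed only
when a lower threshold is passed, for instance max_len=40. Files from the
same directory then share the "%00avelofiwimacoth" directory in the store.
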
>> Yes, that's what I've been thinking about too (see other messages in
>> this thread). I like your idea about trying to still group files from
>> one dir again into one dir. This should help on filesystems where tons
>> of files in a single dir is bad (not that I expect to see repos with
>> tons of files needing to be hashed, but still).
>>
>> A key question is how to set the threshold. Will it be fixed across
>> all repos, or configurable per repo?
> 
> Fixed. In practice, we should aim to have a single layout that can be
> shared by all platforms. Because people do in fact regularly share repos
> across platforms.
> 
>>  If fixed, what is it? If not, we
>> expose more complexity to users, but allow more fine-tuning. And if
>> the latter, will the default be fixed across all filesystems? Such
>> that it should reasonably work for Windows (like 260/2 or something)?
> 
> 130 is probably ok. I'd rather make it more like 200 though.

I think 150 might be a reasonable compromise.

Root paths for store\data on Windows, such as the 82 characters of

C:\Documents and Settings\adi\My Documents\hg-repos\tortoisehg-crew\.hg\store\data

are admittedly a bit silly when using Mercurial (it is better to use an
extra drive letter for the repos dir), but they are probably still quite
common.

82 + 200 = 282 would already exceed the magic 260-character barrier of
Windows.

Allowing 110 characters for the root path leaves 150 available before 260
is hit, which looks reasonable.
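
The arithmetic can be checked with a small Python sketch. The helper names
remaining_budget and fits are made up for this illustration, and the
calculation is simplified the same way as the numbers above: it ignores the
separator between the root and the name as well as the terminating NUL that
the Windows limit also has to accommodate.

```python
# The 260-character Windows path limit discussed in this thread.
WINDOWS_MAX_PATH = 260

# The example store root from above (82 characters).
store_root = (
    r"C:\Documents and Settings\adi\My Documents"
    r"\hg-repos\tortoisehg-crew\.hg\store\data"
)


def remaining_budget(root, limit=WINDOWS_MAX_PATH):
    """Characters left for the encoded name below `root` (simplified)."""
    return limit - len(root)


def fits(root, threshold, limit=WINDOWS_MAX_PATH):
    """True if a name of `threshold` characters still fits below `root`."""
    return len(root) + threshold <= limit


print(len(store_root))               # length of the example root
print(remaining_budget(store_root))  # what is left below it
print(fits(store_root, 200))         # threshold of 200
print(fits(store_root, 150))         # threshold of 150
```

With the example root, a threshold of 200 does not fit while 150 does, and a
root of up to 110 characters still works with 150.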

>> One aspect this scheme does not address is filesystems where
>> individual path components have seriously limited length.
> 
> Like? Even VFAT allows long names up to 255 bytes. I suppose HFS has a
> 31 character limit, but there's not much excuse for using that.

