Windows long path experimenting report

Peter Arrenbrecht peter.arrenbrecht at gmail.com
Sat Jun 21 01:55:19 CDT 2008


On 20.06.2008, at 23:49, Matt Mackall <mpm at selenic.com> wrote:

>
> On Fri, 2008-06-20 at 23:22 +0200, Peter Arrenbrecht wrote:
>> On Fri, Jun 20, 2008 at 8:56 PM, Matt Mackall <mpm at selenic.com> wrote:
>>>
>>> On Fri, 2008-06-20 at 14:50 +0100, Paul Moore wrote:
>>>> On 20/06/2008, Peter Arrenbrecht <peter.arrenbrecht at gmail.com> wrote:
>>>>> If `name` is a path, then this will not sort like the original
>>>>> structure very well.
>>>>
>>>> I was thinking of encoding each path element in turn. Actually, I just
>>>> assumed that was what Mercurial currently did, and so didn't state it
>>>> explicitly. Sorry.
>>>>
>>>>> If it's just a component name, then we'll bump against the 260 (or so)
>>>>> max path length limit again soon (260 / 42 ~= 6 = max folder nesting
>>>>> depth). So at least if the component name is shorter than 42 chars, we
>>>>> should drop the encoding and use the plain name. And we shall have to
>>>>> disambiguate hashed and non-hashed names, so hashed names should maybe
>>>>> always contain a special char and any plain name that contains said
>>>>> char also gets hashed automatically, or something.
>>>>
>>>> Fair enough - as I said, it wasn't a well thought out proposal. How
>>>> about encoding elements as now, plus an initial underscore (to handle
>>>> reserved names)? If the resulting element name is over 42 characters,
>>>> then take the first 10 plus an MD5 hash. Or just convert any over 32 to
>>>> a hash - that breaks sorting completely, but only for rare cases.
>>>> You'd need a flag character to say "hashed" - maybe use an 'x' rather
>>>> than an underscore as prefix.
>>>
>>> Again, it's not at all about the length of individual path components,
>>> it's the overall path length.
>>>
>>> Projects that have a balanced distribution of files generally won't run
>>> into this problem. If we have filenames up to 20 characters, and 10
>>> files or directories per directory, a balanced project can grow up to 13
>>> levels deep, or 10^13 files. Expand that to 40 character filenames, and
>>> we can still have 10^6 files.
>>>
>>> But there are unbalanced projects today that have 10^5 - 10^6 files that
>>> are running into the limit. They're pushing it off for now by turning
>>> off all the escaping.
>>>
>>> We could have an optional layout where everything used hashing. But
>>> remember, hashing has an order of magnitude or so performance impact.
>>> If large projects (the ones that are the most performance sensitive) are
>>> forced to use hashing, they won't be happy.
>>>
>>> I've considered a split scheme where any pathname over some length gets
>>> hashed, but again, it will mostly impact those large projects.
>>>
>>> What's perhaps needed is a hybrid scheme. Something that takes:
>>>
>>> A/Very/long/Filename/with/Many/components/that/just/keeps/going/and/going.txt
>>>
>>> ..and turns it into something like:
>>>
>>> %00avelofiwimacoth/667f8f2c8402b51b51bae6987d2bf524cd4bfc85.i
>>>
>>> (taking the first letter or two from the first 8 directory components,
>>> prepending an encoded null so that no files in the regular encoding will
>>> collide)
>>>
>>> In other words, something that's at least partially locality-preserving.
>>> Files in the same directory will stay in the same directory, though
>>> possibly intermixed with a smallish number of other files (hopefully a
>>> small fraction of the total project size). Directories near each other
>>> in the working directory may even end up near each other in the repo in
>>> this layout too.
>>>
>>> And we'd only use this when names encoded in the regular way went over
>>> some threshold.
>>
>> Yes, that's what I've been thinking about too (see other messages in
>> this thread). I like your idea of still grouping files from one dir
>> into one dir. This should help on filesystems where tons of files in
>> a single dir is bad (not that I expect to see repos with tons of
>> files needing to be hashed, but still).
>>
>> A key question is how to set the threshold. Will it be fixed across
>> all repos, or configurable per repo?
>
> Fixed. In practice, we should aim to have a single layout that can be
> shared by all platforms, because people do in fact regularly share
> repos across platforms.
>

Agreed. However, a suitable default across all filesystems would achieve
basically the same result while still leaving people in the know the
option to tune it. At the cost of a new obscure option, admittedly.
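For illustration, the hybrid scheme Matt sketches above could look roughly like the following. This is only a minimal sketch, not Mercurial's actual store encoding: the function name, the threshold constant, and the choice of SHA-1 (the 40-hex-digit example hash suggests it, but it is an assumption here) are all mine, and the regular element-by-element escaping is elided.

```python
import hashlib

# Hypothetical threshold; the thread debates values like 130 vs. 200.
MAX_PLAIN_LENGTH = 200

def hybrid_encode(path):
    """Sketch of the proposed hybrid layout (names are illustrative).

    Paths whose encoding stays under the threshold are left alone;
    longer ones get a locality-preserving prefix built from the leading
    directory components, plus a hash of the full path.
    """
    if len(path) <= MAX_PLAIN_LENGTH:
        return path  # regular escaping elided in this sketch

    components = path.split('/')
    # First letter or two of the first 8 directory components, so that
    # files from one directory land together in the store.
    prefix = ''.join(c[:2].lower() for c in components[:-1][:8])
    digest = hashlib.sha1(path.encode('utf-8')).hexdigest()
    # '%00' stands in for an encoded null byte, which no regularly
    # encoded name starts with, so plain and hashed names can't collide.
    return '%00' + prefix + '/' + digest + '.i'
```

Applied to a too-long path starting with `A/Very/long/Filename/with/Many/components/that/...`, this yields a name of the form `%00avelofiwimacoth/<40-hex-digit hash>.i`, matching the shape of the example above.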

>> If fixed, what is it? If not, we expose more complexity to users, but
>> allow more fine-tuning. And if the latter, will the default be fixed
>> across all filesystems? Such that it should reasonably work for
>> Windows (like 260/2 or something)?
>
> 130 is probably ok. I'd rather make it more like 200 though.
>

Hmm. Can you estimate how many files would end up hashed in the big
repos you mentioned?

>> One aspect this scheme does not address is filesystems where
>> individual path components have seriously limited length.
>
> Like? Even VFAT allows long names up to 255 bytes. I suppose HFS has a
> 31 character limit, but there's not much excuse for using that.
>
> -- 
> Mathematics is the supreme nostalgia of our time.
>


More information about the Mercurial-devel mailing list