Windows people: please help check idea for a new Mercurial repository layout

Peter Arrenbrecht peter.arrenbrecht at gmail.com
Mon Jun 16 06:58:35 CDT 2008


On Mon, Jun 16, 2008 at 9:46 AM, Adrian Buehlmann <adrian at cadifra.com> wrote:
> On 16.06.2008 02:59, Matt Mackall wrote:
>> On Sun, 2008-06-15 at 10:43 -0700, James Walker wrote:
>>> Matt Mackall wrote:
>>>
>>>> It's mostly a problem with path length, actually. Filenames can be 255
>>>> characters, but the total path length is limited to 260. Or something
>>>> like that. And if someone makes a repository with deeper and deeper
>>>> paths over time, most of the directory hierarchy will exist when we hit
>>>> the limit.
>>> Isn't this only quantitatively different from other OSes, not
>>> qualitatively different?  On the Mac, file names can be 255 Unicode
>>> characters, but MAX_PATH is 1024 (which I think means UTF-8 characters).
>>
>> Yes, and it's actually MAX_PATH that's the problem. People are creating
>> very deep **pathnames** that exceed Windows' pitiful limit of 260
>> characters.  Mercurial currently does everything with absolute paths,
>> adds its own .hg/store/data/ in, and escapes all the interesting
>> characters, making things worse here. So people may end up having an
>> effective MAX_PATH of something like 120, which is a pretty long name,
>> but not completely ridiculous.
>>
>> If you compare that to a Mac, people can easily create repo pathnames >
>> 512 bytes (past "stupidly long" and into "absurdly long"), and I'll have
>> absolutely no sympathy for anyone who runs into the 1k limit there.
>>
>> We know that NTFS[1] can actually handle paths that are 32K with \\?\
>> and probably will allow you to reach files with absolute paths > 260 by
>> chdir() + open() without \\?\.
>
> Extremely unlikely (to the part after the last "and"). Not because of NTFS but
> because of the higher software layers (Python library).
>
> And you will have to use the ...W functions of win32file for *all* disk
> access inside .hg, feeding every path as an absolute path with '\' path
> separators only (must include drive letter) in a Unicode string object
> prepended by "\\?\".
>
> This is doable, but certainly does not qualify as a "quick hack".
>
> So, PyWin32 will be an obligatory dependency (not really an issue,
> just to mention it).
>
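To make the path form described above concrete, here is a minimal
Python sketch. The helper name to_extended_path is hypothetical (it is
not part of Mercurial or PyWin32), and it only builds the string; it
does not call any win32file function.

```python
import ntpath


def to_extended_path(path):
    # Hypothetical helper: build the extended-length form of an
    # absolute Windows path. Requires a drive letter, as noted above.
    drive, rest = ntpath.splitdrive(path)
    if len(drive) != 2 or drive[1] != ":" or rest[:1] not in ("\\", "/"):
        raise ValueError("need an absolute path with a drive letter: %r" % path)
    # normpath turns '/' into '\' and collapses '.' and '..' segments,
    # so the result uses backslash separators only.
    return "\\\\?\\" + ntpath.normpath(path)


print(to_extended_path("C:/repo/.hg/store/data/foo.i"))
```

The strict check on the drive letter is a design choice for the sketch:
relative paths are rejected rather than resolved against the current
directory.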
>> 255 should be a comfortable limit for individual **filenames**. The
>> worst case is something like "日本国" which goes from 6 UTF-16 bytes or
>> 9 UTF-8 bytes to "~e6~97~a5~e6~9c~ac~e5~9b~bd.i" (29 bytes), an
>> expansion factor that limits such filenames to ~28 characters. As that's
>> enough for a haiku or two[2], I don't think that's a serious problem.
>
> I have to admit that it is a shame that the full power of NTFS
> is crippled behind such a lousy explorer.exe on Windows.
>
> However, I would appreciate if we could do that reserved name encoding despite
> long path [1] being able to write reserved names in theory.
>
> This would at least enable to solve that silly viral reserved name trap problem
> mixed platform projects today are facing with Mercurial.
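
The expansion factor quoted above for "日本国" can be reproduced with
a few lines of Python. This is a simplified sketch that assumes only
non-ASCII bytes get the ~xx treatment; escape_name is a hypothetical
name, and Mercurial's real store encoding handles more cases than
this.

```python
def escape_name(name):
    # Simplified assumption: keep ASCII bytes, write every other
    # UTF-8 byte as "~" followed by two lowercase hex digits.
    out = []
    for byte in name.encode("utf-8"):
        if byte < 0x80:
            out.append(chr(byte))
        else:
            out.append("~%02x" % byte)
    return "".join(out)


stored = escape_name("日本国") + ".i"
print(stored, len(stored))

# Each such character costs 9 bytes once escaped, so a 255-byte
# filename limit leaves room for about 255 // 9 of them.
print(255 // 9)
```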

This may be a rather unconventional idea, but I'll air it
nevertheless. Maybe it's going to trigger other ideas and eventually
lead somewhere.

How about simply skipping aux.i/.d et al. (store files whose names are
reserved on Windows) when encountering them on Windows? Every access
to the index/data of such files would then fail. As I see it, this
would have the following consequences:

a) You cannot clone such a repo over the wire. Good.
b) You can clone such a repo locally when the clone operation is just
linking/copying .hg/store based on a walk of .hg/store. Good.
c) You cannot update to a revision in which aux is alive. Acceptable
(see below)?
d) You can update to a revision in which aux is no longer present
(renamed, dropped). Good.
e) You can push csets from this repo as long as aux is not involved. Good.
f) You can pull from such a repo csets where aux is not involved. Good.
g) Verify would fail. Good.
h) The repo does not constitute a complete backup of the original
repo. Acceptable?

This would, of course, only help for mixed-environment repos where the
master is always hosted on Unix, and where either incompatibilities
have been eliminated in the working copy by renames, or else (c) is
relaxed so that hg will allow updating to a revision with missing
elements and simply skip them in the working copy as well.

The latter approach would be problematic if people attempted to merge
csets containing changes to such missing elements on Windows. However,
if we change only update to accept missing elements, and leave merge
refusing any merged element that is missing, then such merges would
automatically be forbidden.
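
The asymmetry between update and merge could be sketched like this.
All names here (MissingElementError, files_for_update, check_merge)
are hypothetical and only model the policy; they do not correspond to
Mercurial internals.

```python
class MissingElementError(Exception):
    """Raised when a merge touches a file that was skipped locally."""


def files_for_update(target_files, missing):
    # Relaxed update: write what we can and silently leave out the
    # elements that are missing from the local store.
    return [f for f in target_files if f not in missing]


def check_merge(merged_files, missing):
    # Strict merge: refuse as soon as one merged file is missing.
    blocked = sorted(set(merged_files) & set(missing))
    if blocked:
        raise MissingElementError(
            "cannot merge missing elements: " + ", ".join(blocked))


missing = {"aux"}
print(files_for_update(["aux", "README", "src/main.c"], missing))
check_merge(["README"], missing)   # fine, aux is not involved
```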

The same kind of thing might be applicable to case folding collisions
or even long paths as well.
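
For the case folding variant, the elements to skip would first have to
be detected. A minimal sketch, assuming that str.lower() is a good
enough stand-in for the filesystem's case folding (it is only an
approximation) and using a hypothetical function name:

```python
from collections import defaultdict


def case_collisions(paths):
    # Group paths that differ only in case; every group with more
    # than one member cannot coexist on a case-insensitive filesystem.
    groups = defaultdict(list)
    for p in paths:
        groups[p.lower()].append(p)
    return [sorted(g) for g in groups.values() if len(g) > 1]


print(case_collisions(["README", "Readme", "src/a.c"]))
```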

As I said, just an idea.
-parren


