Windows long path experimenting report

Fri Jun 20 09:51:13 CDT 2008

On Fri, Jun 20, 2008 at 3:50 PM, Paul Moore <p.f.moore at gmail.com> wrote:
> On 20/06/2008, Peter Arrenbrecht <peter.arrenbrecht at gmail.com> wrote:
>> If `name` is a path, then this will not sort like the original
>> structure very well.
>
> I was thinking of encoding each path element in turn. Actually, I just
> assumed that was what Mercurial currently did, and so didn't state it
> explicitly. Sorry.
>
>> If it's just a component name, then we'll bump against the 260 (or so)
>> max path length limit again soon (260 / 42 ~= 6 = max folder nesting
>> depth). So at least if the component name is shorter than 42 chars, we
>> should drop the encoding and use the plain name. And we shall have to
>> disambiguate hashed and non-hashed names, so hashed names should maybe
>> always contain a special char and any plain name that contains said
>> char also gets hashed automatically, or something.
>
> Fair enough - as I said, it wasn't a well thought out proposal. How
> about encoding elements as now, plus an initial underscore (to handle
> reserved names). If the resulting element name is over 42 characters,
> then take the first 10 plus an MD5 hash. Or just convert any over 32 to
> a hash - that breaks sorting completely, but only for rare cases.
> You'd need a flag character to say "hashed" - maybe use an 'x' rather
> than an underscore as prefix.
>
>> But still, it wouldn't really solve the problem I think.
>
> If we assume the 260-char limit is a hard limit (i.e., the \\?\
> approach isn't going anywhere) then there will always be potential
> issues. But no-one I know hits the limit in practice, with general
> filenames - and that's in a Windows environment where directories
> called "Monthly Summary Reports 2008"  and files called "Report To The
> Directors For June 2008.doc" are common!!!
>
> At the risk of repeating the "nobody needs more than 640K" mistake, I
> suspect the average repository won't be affected by this. And the
> number where the working dir is OK, but the repo isn't, will be even
> fewer. (If we can fix the repo encoding so it doesn't do pathological
> things like doubling the length of ALL_CAPS names).
>
> We can't avoid the fact that MAX_PATH differs between platforms. All
> we should be aiming for is to make it so that it's not noticeably
> worse for Mercurial repositories than for simple directories (in my
> view).

Good point. And I do like your proposal quite well. However, I see
some (likely theoretical) drawbacks:

	* Could degrade performance on Unix unnecessarily. Consider a dir
with tons of files starting with a longish common prefix.
	* Does not address short names that Hg's encoding expands a lot. A
path like A/A/.../A/A could still fit into the limit as a plain path,
but under your encoding it might not. That would violate the goal you
stated above.
	* Could incur many hashing operations (don't know if they are
expensive enough to be relevant here).
	* Requires splitting paths (again, don't know if this is relevant).
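
For reference, here is a rough Python sketch of the per-element scheme
as I read it. The function names, the 42-char threshold constant and
the simplified stand-in for Hg's present encoding (uppercase letter ->
_ + lowercase, _ -> __) are assumptions for illustration, not actual
Mercurial code:

```python
import hashlib

MAX_ELEMENT_LEN = 42  # per-element threshold from the quoted proposal


def encode_element(name):
    """Simplified, assumed stand-in for Hg's present encoding."""
    out = []
    for ch in name:
        if ch == "_":
            out.append("__")             # escape the escape character
        elif ch.isupper():
            out.append("_" + ch.lower())  # 'A' -> '_a'
        else:
            out.append(ch)
    return "".join(out)


def encode_path_per_element(path):
    """Encode each path element in turn; hash elements that get too long."""
    parts = []
    for element in path.split("/"):
        encoded = "_" + encode_element(element)  # '_' guards reserved names
        if len(encoded) > MAX_ELEMENT_LEN:
            digest = hashlib.md5(element.encode("utf-8")).hexdigest()
            # 'x' instead of '_' flags a hashed element
            encoded = "x" + encoded[1:11] + digest
        parts.append(encoded)
    return "/".join(parts)
```

With this stand-in, every element of A/A/.../A/A grows from one to
three characters, which is the expansion the second bullet is about.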

In the hope of further stimulating the discussion, I have another
proposal, even though it is not fully satisfactory to me at this
point. It follows Adrian's and Patrick's idea that maybe we should
accept the need for a different format and, thus, manual intervention
when filesystem-level interoperability is desired[1]. This could also
be applied to your proposal, meaning it would be optional to turn on
the encoding you propose.

Proposal:

We use Adrian's ubar encoding (basically _ + present encoding for each
path component) for sufficiently short paths, and otherwise fall back
on a hashing scheme. Of course, we need a quick way to define
"sufficiently short" that, for a given repo, is fixed across
platforms. So we just configure the repo with a fixed max path length
and store this in, for instance, .hg/maxpathlen. Then:

def encode(path):
	ubar = ubarencode(path)
	if len(ubar) > repo.maxstorepathlen:
		return hashencode(ubar) # or hashencode(path)
	return ubar

import hashlib

def hashencode(path):
	# Note: there is no _ before "hashed", so this cannot collide with
	# ubar-encoded names.
	return 'hashed/' + path[:10] + hashlib.md5(path.encode('utf-8')).hexdigest()
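
To play with this outside of Mercurial, a self-contained variant might
look as follows. This is only a sketch: ubarencode here is a
hypothetical stand-in that merely prefixes each component with _ (the
real one would also apply the present encoding), and the limit is
passed in as a parameter instead of being read from the repo.

```python
import hashlib


def ubarencode(path):
    # Hypothetical stand-in: only adds the '_' prefix per component.
    return "/".join("_" + component for component in path.split("/"))


def hashencode(path):
    # No '_' before 'hashed', so this cannot collide with ubar names.
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    return "hashed/" + path[:10] + digest


def encode(path, maxstorepathlen):
    ubar = ubarencode(path)
    if len(ubar) > maxstorepathlen:
        return hashencode(ubar)
    return ubar
```

Note that path[:10] can itself contain a slash, so a hashed name may
still end up in a subdirectory of hashed/. A real implementation would
probably want to replace or drop the slashes in that prefix.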

Windows would, by default, apply a limit of 260/2 = 130 (a heuristic -
see below). Linux would apply none, but in any case we should add an
option to `hg clone` that specifies it. We might also add a variant
such as `hg clone --sharable-by win,unix`, which computes an
appropriate max path length.
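
How such an option could compute the limit, as a sketch only. The
table of per-platform defaults and the function name are assumptions,
and `--sharable-by` is a proposed option, not an existing one:

```python
WINDOWS_MAX_PATH = 260

# Assumed defaults: half of MAX_PATH on Windows, no limit on Unix.
DEFAULT_LIMITS = {"win": WINDOWS_MAX_PATH // 2, "unix": None}


def max_store_path_len(platforms):
    """Return the strictest limit among the platforms, or None if unlimited."""
    finite = [DEFAULT_LIMITS[p] for p in platforms
              if DEFAULT_LIMITS[p] is not None]
    return min(finite) if finite else None
```

So `--sharable-by win,unix` would yield 130, and `--sharable-by unix`
would yield no limit at all.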

Problems:

It seems the max path length is really a limit on the total length of
the absolute path. So when configuring the max repo store path length,
one would really have to take into account the length of the path to
the repo itself. And that would limit the locations the repo could
then be copied to. Horrid.
This is why I currently propose a limit of 260/2 for Windows' default
as a compromise.
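
The arithmetic behind that compromise, as a small sketch. The function
names are made up for illustration, and the accounting is simplified:
it treats 260 as the usable total and counts one separator between the
repo location and the store path, ignoring details such as the
terminating NUL.

```python
WINDOWS_MAX_PATH = 260


def root_budget(maxstorepathlen, total_limit=WINDOWS_MAX_PATH):
    """Longest repo location that still leaves room for any store path."""
    return total_limit - maxstorepathlen - 1  # 1 for the separator


def fits(root, storepath, total_limit=WINDOWS_MAX_PATH):
    """True if root + separator + storepath stays within the total limit."""
    return len(root) + 1 + len(storepath) <= total_limit
```

With a store limit of 130, the repo itself can live at a location of up
to 129 characters under these assumptions. A larger store limit shrinks
that budget accordingly.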

Consequences:

Most repos remain trivially interchangeable between Windows and Unix,
with no speed penalty on either platform.

Some (those with long paths) will have to explicitly ensure
sharability at the cost of a likely minor speed penalty on Unix. To
ensure it, a simple clone of the repo with a suitable max path length
suffices.

We will have to hash only rarely (again, don't know if the expense of
hashing is really an issue).

[1]  We are, after all, discussing a repo format that is directly
sharable between Unix and Windows (and hopefully other platforms). I
don't think this is such a typical case. It arises when

	* sharing repos via shared filesystems (network shares, USB drives,
mounts, etc.),
	* migrating repos via plain file copy operations.

It is not a necessity when collaborating via the - I'm assuming -
prevalent push/pull and bundle exchange models.

-parren