[PATCH 0 of 9 RFC] manage filename normalization policy per repository

Matt Mackall mpm at selenic.com
Fri May 25 16:06:28 CDT 2012


On Sat, 2012-05-26 at 00:00 +0900, FUJIWARA Katsunori wrote:
> this patch series allows users to manage filename normalization policy
> per repository
> 
> this is just for the base of discussion, and tested a little: clone,
> bundle/unbundle, archive, diff, export/import.... simply.

What happens if:

a) a Mac user adds a file NFD(X)
b) that same user mentions that file in another file Y as NFD(X)
c) a Linux or Windows[1] -tool- tries to locate the file listed in Y but
Mercurial has helpfully transformed it to NFC(X) on check-out

Answer: neither Linux nor Windows will treat NFC(X) and NFD(X) as the
same file. And we won't renormalize the _contents_ of file Y, so
renormalizing the filename _introduces_ a mismatch. So it breaks. And
breaks here means "mysteriously stops compiling", "mysteriously gives
404s", "mysteriously crashes our mission-critical infrastructure".
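
To make the scenario above concrete, here is a minimal sketch in Python. It is not Mercurial code: the filename, the "include" syntax in Y, and the set standing in for the working directory are all made up for illustration. It only shows that a byte-exact lookup of NFD(X) fails once the file on disk is NFC(X).

```python
import unicodedata

# Hypothetical filename X: "é" as one code point (NFC) vs. "e" + combining acute (NFD).
name = "caf\u00e9.txt"
nfc_name = unicodedata.normalize("NFC", name)
nfd_name = unicodedata.normalize("NFD", name)

# Contents of Y as written on the Mac: it mentions X in NFD form.
# (The "include" syntax is an assumption, standing in for any inter-file reference.)
y_contents = "include " + nfd_name + "\n"

# Stand-in for the working directory after a renormalizing check-out: it holds NFC(X).
checked_out = {nfc_name}

# A tool that compares names exactly (make, a web server, a compiler)
# looks up whatever Y says, without any normalization of its own.
referenced = y_contents.split()[1]
found = referenced in checked_out
print(found)
```

The two names render identically but are different strings, so the lookup fails even though nothing in Y changed.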

Compare that with "user gets extremely annoyed by filenames he can read
and click on but can't type".[2]

This is another manifestation of the makefile problem: filenames
referred to inside other files MUST agree with what TOOLS see on the
filesystem for the tools to work.

Fundamentally, we can't force a Mac user to make Y reference NFC(X)
rather than NFD(X). Nor can we even detect it! So we can't prevent them
from making a non-portable commit. I'm afraid the best we can do is warn
Mac users that they're adding NFD files.
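
A warning of that kind could look roughly like the sketch below. The function name and message text are hypothetical, not an existing Mercurial hook or API, and the check is only "this name is not already NFC", which is a broader condition than "this name is NFD". Note that it looks at filenames only; it cannot see references to those names inside file contents, which is exactly the part we can't detect.

```python
import unicodedata

def nfd_warnings(filenames):
    """Return one warning string per name that is not already in NFC form.

    Hypothetical helper for illustration only; it inspects names,
    never the contents of the files.
    """
    warnings = []
    for name in filenames:
        if unicodedata.normalize("NFC", name) != name:
            warnings.append("warning: %r is not NFC-normalized" % name)
    return warnings

# Example: only the decomposed name triggers a warning.
added = ["cafe\u0301.txt", "caf\u00e9.txt", "plain.txt"]
messages = nfd_warnings(added)
```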

However, in the current scheme, a non-Mac user can always rename NFD(X)
to NFC(X) and fix up Y without introducing a commit that doesn't build.

Yes, NFD is a massively stupid annoyance to users. But your
renormalizing technique will break more than it fixes for any project
that contains non-ASCII inter-file references. And because we're an SCM
(and not a CMS), that's what we care about.

[1] assuming we get a UTF-8 mode working on Windows
[2] which is actually a generic Unicode problem, because in addition to
NFD, Unicode has tons of homoglyphs and duplicate characters, and the vast
majority of characters aren't even typable on any given keyboard.
-- 
Mathematics is the supreme nostalgia of our time.



