[PATCH 0 of 9 RFC] manage filename normalization policy per repository

Matt Mackall mpm at selenic.com
Wed Jun 6 22:45:28 CDT 2012


On Tue, 2012-06-05 at 18:17 +0900, FUJIWARA Katsunori wrote:
> At Mon, 04 Jun 2012 09:19:09 -0500,
> Matt Mackall wrote:
> > 
> > On Mon, 2012-06-04 at 19:07 +0900, FUJIWARA Katsunori wrote:
> 
> > > I assume that two UTF-8 changesets below:
> > > 
> > >   A. a changeset where every filename in it uses only ASCII chars
> > > 
> > >   B. a changeset where some filename in it uses non ASCII, but UTF-8
> > >      valid characters
> > > 
> > > To children of (B), I don't want to add any file whose name uses
> > > chars in an encoding other than UTF-8, but I may want to do so to
> > > children of (A): adding new files with non-ASCII chars in their names
> > > is a normal use case with current Mercurial.
> > > 
> > > If the parent of the working directory is (B), tools can assume that
> > > the filename encoding the user has in mind is UTF-8: tools like
> > > TortoiseHg, which are aware of the dirstate structure and invoke the
> > > Mercurial API directly in their own process, can detect it and pass
> > > filename strings in UTF-8 encoding.
> > 
> > I actually intend for A and B to operate the same: all new files are
> > UTF-8/ASCII. Thus, you transparently upgrade to the new mode when adding
> > non-ASCII files without having to think about it.
> 
> I cleaned up my current understanding:
> 
>   - when working directory is updated by the UTF-8 changeset, "hg"
>     should use Unicode file API

Yes.

>       - other related tools like TortoiseHg should use UTF-8 as
>         filename encoding in this case

Sort of.

If <third-party-tool> wants to match the filenames it passes to or gets
back from Mercurial against filenames on the disk for a UTF-8 changeset,
it is best for it to use UTF-8 for those names. But see below.

>       - so, "old tools", which always use system code page as filename
>         encoding, are not recommended for repositories having UTF-8
>         changesets already

We should probably try to avoid breaking them.

The best way to do that is probably to extend our case-folding logic to
cover this when in UTF-8 mode. So, we would treat a Latin1 command-line
argument 'á' as the same as UTF-8 'á' when the ANSI codepage is set to
Latin1, just like we treat 'A' the same as 'a'.
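
As a rough illustration of that idea (not how Mercurial's case-folding
code actually works), here is a minimal Python sketch. The helper name
fold_key, the "latin-1" default, and the use of str.lower() as the
folding step are all assumptions made for the example:

```python
def fold_key(name, ansi_codepage="latin-1"):
    """Hypothetical helper: turn a filename given as bytes into a key
    that compares equal for UTF-8 and ANSI-codepage spellings."""
    try:
        # Prefer UTF-8: if the bytes are valid UTF-8, take them as such.
        text = name.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise assume the bytes came from the ANSI codepage.
        text = name.decode(ansi_codepage)
    # Assumption: plain lower() stands in for real case folding.
    return text.lower()


latin1_arg = "\u00e1".encode("latin-1")  # 'á' as typed on a Latin1 console
utf8_name = "\u00e1".encode("utf-8")     # 'á' as stored in a UTF-8 changeset

# Both spellings fold to the same key, just as 'A' and 'a' do.
assert fold_key(latin1_arg) == fold_key(utf8_name)
assert fold_key(b"A") == fold_key(b"a")
```

Trying UTF-8 first and falling back to the codepage is only one possible
ordering; it relies on byte strings rarely being valid in both encodings.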

(There may be instances where this is ambiguous, but I think those cases
will be rare-to-nonexistent in practice. For instance, in Latin1, all
UTF-8 continuation bytes (0x80-0xbf) are either invalid or symbols, so
you're unlikely to get a UTF-8 filename that's meaningful in Latin1 or
vice-versa. Similarly with Shift-JIS.)
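
A quick way to see why the two rarely collide, as a Python sketch. The
filename "café.txt" and the helper is_valid_utf8 are made up for the
example, and it only covers the Latin1 case:

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


latin1_name = "caf\u00e9.txt".encode("latin-1")  # b'caf\xe9.txt'
utf8_name = "caf\u00e9.txt".encode("utf-8")      # b'caf\xc3\xa9.txt'

# A Latin1 accented letter is a lone high byte, which is not valid UTF-8.
print(is_valid_utf8(latin1_name))   # False
print(is_valid_utf8(utf8_name))     # True

# Read as Latin1, the UTF-8 name shows a symbol where the continuation
# byte 0xa9 was, so it is unlikely to be a name someone typed on purpose.
print(utf8_name.decode("latin-1"))  # cafÃ©.txt
```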

>       - on the other hand, "new hg" and tools are not recommended (at
>         least on Windows) for repositories having both "UTF-8
>         changeset (A)" and legacy changeset

No. Again, remember that I actually expect that Windows folks will move
to the new scheme by checking out a legacy changeset, changing the
filenames, and checking it back in. So I fully expect folks to have both
sorts of changesets.

>         because they may cause merging between UTF-8 changeset (B) as
>         children of (A) and legacy changeset unexpectedly
> 
>           - merged changeset can't be handled on Windows correctly

I disagree: I have every confidence we'll find a way to deal with it
sensibly. How that will work is simply unspecified today, and figuring
out the details is not a priority for me right now.

I normally don't say things like this but, please: stop thinking ahead
to merge. We're having enough trouble communicating about the initial
work without discussing the more complicated merge question.

>   - when working directory is updated by the legacy changeset, "hg"
>     should use byte file API
> 
>       - other tools should use system code page as filename encoding
>         because (*1)

Other tools should, as always and wherever possible, pass along the
bytes they get from the ANSI C APIs without trying to interpret them.
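
For illustration only, the "don't interpret" pattern looks roughly like
this in Python. The two helper names are hypothetical, and note that on
Windows recent Python versions encode bytes paths as UTF-8 rather than
the ANSI codepage, so this shows the pattern, not the exact Windows
behavior discussed here:

```python
import os
import tempfile
from typing import List


def list_names_as_bytes(directory: str) -> List[bytes]:
    """Hypothetical helper: list a directory and keep names as raw bytes."""
    # Passing a bytes path makes os.listdir return bytes names.
    return sorted(os.listdir(os.fsencode(directory)))


def read_by_raw_name(directory: str, raw_name: bytes) -> bytes:
    """Hypothetical helper: open a file using the bytes exactly as listed."""
    path = os.path.join(os.fsencode(directory), raw_name)
    with open(path, "rb") as fh:
        return fh.read()


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.txt"), "wb") as fh:
        fh.write(b"data")
    for raw in list_names_as_bytes(tmp):
        # The name is never decoded; it goes straight back to the OS.
        assert read_by_raw_name(tmp, raw) == b"data"
```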

>       - this allows "old tools" to keep working well, because this is
>         the same as how they cooperate with "hg" today

Yes.

-- 
Mathematics is the supreme nostalgia of our time.



