[PATCH 0 of 9 RFC] manage filename normalization policy per repository

FUJIWARA Katsunori foozy at lares.dti.ne.jp
Thu Jun 7 01:28:31 CDT 2012


At Wed, 06 Jun 2012 22:45:28 -0500,
Matt Mackall wrote:
> 
> On Tue, 2012-06-05 at 18:17 +0900, FUJIWARA Katsunori wrote:
> > At Mon, 04 Jun 2012 09:19:09 -0500,
> > Matt Mackall wrote:
> > > 
> > > On Mon, 2012-06-04 at 19:07 +0900, FUJIWARA Katsunori wrote:
> > 
> > > > I assume that two UTF-8 changesets below:
> > > > 
> > > >   A. a changeset where every filename in it uses only ASCII chars
> > > > 
> > > >   B. a changeset where some filename in it uses non ASCII, but UTF-8
> > > >      valid characters
> > > > 
> > > > To children of (B), I don't want to add any file of which name uses
> > > > chars in encoding other than UTF-8, but may want to do so to children
> > > > of (A): it is normal usecase of adding new files using non-ASCII chars
> > > > in their names with current Mercurial.
> > > > 
> > > > If the parent of working directory is (B), tools can assume that the
> > > > filename encoding in user mind is UTF-8: tools like TortoiseHg, which
> > > > aware of dirstate structure and invoke Mercurial API directly in own
> > > > process, can detect it and pass filename strings in UTF-8 encoding.
> > > 
> > > I actually intend for A and B to operate the same: all new files are
> > > UTF-8/ASCII. Thus, you transparently upgrade to the new mode when adding
> > > non-ASCII files without having to think about it.
> > 
> > I cleaned up my current understanding:
> > 
> >   - when working directory is updated by the UTF-8 changeset, "hg"
> >     should use Unicode file API
> 
> Yes.
> 
> >       - other related tools like TortoiseHg should use UTF-8 as
> >         filename encoding in this case
> 
> Sort of.
> 
> If <third-party-tool> wants to match filenames input/output by Mercurial
> with filenames on the disk for a UTF-8 changeset, it will be best to
> provide the names in UTF-8. But see below.
> 
> >       - so, "old tools", which always use system code page as filename
> >         encoding, are not recommended for repositories having UTF-8
> >         changesets already
> 
> We should probably try to avoid breaking them.
> 
> The best way to do that is probably to extend our case-folding logic to
> cover this when in UTF-8 mode. So, we would treat a Latin1 command-line
> argument 'á' as the same as UTF-8 'á' when the ANSI codepage is set to
> Latin1, just like we treat 'A' the same as 'a'.
> 
> (There may be instances where this is ambiguous, I think those cases
> will be rare-to-nonexistent in practice. For instance, in Latin1, all
> UTF-8 continuation bytes (0x80-0xbf) are invalid or symbols, so you're
> unlikely to get a UTF-8 filename that's meaningful in Latin1 or
> vice-versa. Similarly with Shift-JIS.)

I agree with guessing the UTF-8 byte sequence that corresponds to the
original one in the system code page, when "hg" is invoked from the
command line.
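
To make that guess concrete, here is a minimal sketch in Python. The
name "guess_utf8_name" and the "try UTF-8 first" order are only my
assumptions for illustration; this is not existing Mercurial code:

```python
def guess_utf8_name(arg: bytes, codepage: str) -> bytes:
    """Guess the UTF-8 form of a command-line filename argument.

    Hypothetical helper: 'codepage' stands for the system (ANSI) code
    page, e.g. "latin-1" or "cp932".
    """
    try:
        # Pure ASCII and well-formed UTF-8 are taken as they are.
        arg.decode("utf-8")
        return arg
    except UnicodeDecodeError:
        pass
    # Otherwise assume the bytes are in the system code page and
    # re-encode them as UTF-8.
    return arg.decode(codepage).encode("utf-8")


# Latin1 'á' (0xe1) and UTF-8 'á' (0xc3 0xa1) map to the same name.
assert guess_utf8_name(b"\xe1", "latin-1") == b"\xc3\xa1"
assert guess_utf8_name(b"\xc3\xa1", "latin-1") == b"\xc3\xa1"
```

The ambiguous case you mentioned is the one where the argument is
valid in both encodings; this sketch simply prefers UTF-8 there.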

But for GUIs like TortoiseHg, which invoke the Mercurial internal API
directly, encoding filenames into UTF-8 is their own responsibility.

Fortunately, TortoiseHg is always released with the latest Mercurial,
so we can ignore this case for it. But I'm not sure about the other
tools.

OK, I'll contact the maintainers of such tools to ask how they
cooperate with Mercurial, and post again about this if any of them
have problems.


> >       - in the other hand, "new hg" and tools are not recommended (at
> >         least on Windows) for repositories having both "UTF-8
> >         changeset (A)" and legacy changeset
> 
> No. Again, remember that I actually expect that Windows folks will move
> to the new scheme by checking out a legacy changeset, changing the
> filenames, and checking it back in. So I fully expect folks to have both
> sorts of changesets.

I am just worried about the case where some team members must (or
want to?) keep using the currently installed "old hg" for some reason:
because the policy of their organization restricts it, for example.

# even though an "installer not needing admin rights" may solve this
# example case, if they don't fear being fired for policy violation :-)

In this case, without any explicit renaming, adding files with
non-ASCII names by "new hg" will cause trouble in cooperation
between "new hg" users and "old hg" users, won't it?
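
A small Python sketch of the trouble I have in mind, assuming that
"new hg" stores the name in UTF-8 and that "old hg" only knows the
system code page (Latin1 here). Both helper names are hypothetical:

```python
def name_seen_by_old_tool(stored: bytes, codepage: str) -> str:
    """Decode stored filename bytes the way a code-page-only tool would."""
    return stored.decode(codepage, errors="replace")


def name_sent_by_old_tool(typed: str, codepage: str) -> bytes:
    """Encode a typed filename the way a code-page-only tool would."""
    return typed.encode(codepage)


# Filename 'á.txt' as stored by "new hg" (assumed to be UTF-8).
stored = "\u00e1.txt".encode("utf-8")

# "old hg" shows two Latin1 characters instead of 'á' ...
assert name_seen_by_old_tool(stored, "latin-1") == "\u00c3\u00a1.txt"

# ... and the bytes it sends for 'á.txt' do not match the stored name.
assert name_sent_by_old_tool("\u00e1.txt", "latin-1") != stored
```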

Otherwise, I agree with expecting that Windows folks will move to the
new scheme.

> >         because they may cause merging between UTF-8 changeset (B) as
> >         children of (A) and legacy changeset unexpectedly
> > 
> >           - merged changeset can't be handled on Windows correctly
> 
> I disagree: I have every confidence we'll find a way to deal with it
> sensibly. It's just unspecified how that will work today. Figuring out
> how that works in detail is not a priority to me right now.
> 
> I normally don't say things like this but, please: stop thinking ahead
> to merge. We're having enough trouble communicating about the initial
> work without discussing the more complicated merge question.

OK, for now, I'll start working without discussing merges any
further.

> >   - when working directory is updated by the legacy changeset, "hg"
> >     should use byte file API
> > 
> >       - other tools should use system code page as filename encoding
> >         because (*1)
> 
> Other tools should, as always, pass the bytes they get from the ANSI C
> APIs without trying to interpret them wherever possible.

I took TortoiseHg, which uses Unicode as its internal filename
representation, as a typical example of other tools, so I wrote "use
system code page".

Yes, as you described, tools using ANSI C APIs should pass the byte
sequence they get without trying to interpret it.
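
As a rough illustration in Python of why interpreting is harmful (the
function names are hypothetical, and the UTF-8 assumption inside the
second one is deliberately wrong for this input):

```python
def pass_through(name: bytes) -> bytes:
    """Hand the filename bytes on exactly as they were received."""
    return name


def interpret_and_reencode(name: bytes, assumed: str = "utf-8") -> bytes:
    """Decode with an assumed encoding and encode again (lossy if wrong)."""
    return name.decode(assumed, errors="replace").encode(assumed)


# Latin1 bytes for 'café.txt', as a byte-oriented API might return them.
raw = b"caf\xe9.txt"

# Passing the bytes through keeps the name intact.
assert pass_through(raw) == raw

# Interpreting them with the wrong encoding changes the name, so it
# no longer matches the file on disk.
assert interpret_and_reencode(raw) != raw
```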

> >       - this allows "old tools" to work with well, because this is as
> >         same as current co-operations
> 
> Yes.
> 
> -- 
> Mathematics is the supreme nostalgia of our time.
> 
> 
> 

----------------------------------------------------------------------
[FUJIWARA Katsunori]                             foozy at lares.dti.ne.jp

