[PATCH 0 of 9 RFC] manage filename normalization policy per repository

Mon Jun 4 09:19:09 CDT 2012

On Mon, 2012-06-04 at 19:07 +0900, FUJIWARA Katsunori wrote:
> At Sun, 03 Jun 2012 17:52:30 -0500,
> Matt Mackall wrote:
> > 
> > On Sat, 2012-06-02 at 23:36 +0900, FUJIWARA Katsunori wrote:
> 
> > > BTW, in the transition period, repositories using different encodings
> > > for filenames may exist on the same host: cp932 and utf-8, for example.
> > 
> > Huh? Please go read that page again, because I don't think you
> > understood it:
> > 
> > http://mercurial.selenic.com/wiki/WindowsUTF8Plan#Definitions
> > http://mercurial.selenic.com/wiki/WindowsUTF8Plan#Upgrading_to_UTF-8
> > 
> > I fully expect SINGLE repos to have different encodings in different
> > changesets. This is in fact what will allow us to upgrade them. There
> > will be no notion of "repository encoding".
> 
> Sorry, I used the term "repository encoding" to mean:
> 
>     if there are only legacy changesets in the repository, and users
>     assume that filenames use only one encoding, that encoding can
>     be regarded as the "repository encoding"
> 
> OK, I'll use just "UTF-8 changeset" and "legacy changeset".
> 
> 
> I assume the two kinds of UTF-8 changesets below:
> 
>   A. a changeset where every filename in it uses only ASCII chars
> 
>   B. a changeset where some filename in it uses non-ASCII but valid
>      UTF-8 characters
> 
> To children of (B), I don't want to add any file whose name uses
> chars in an encoding other than UTF-8, but I may want to do so to
> children of (A): adding new files with non-ASCII chars in their names
> is a normal use case with current Mercurial.
> 
> If the parent of the working directory is (B), tools can assume that
> the filename encoding the user has in mind is UTF-8: tools like
> TortoiseHg, which are aware of the dirstate structure and invoke the
> Mercurial API directly in their own process, can detect it and pass
> filename strings in UTF-8 encoding.

I actually intend for A and B to operate the same: all new files are
UTF-8/ASCII. Thus, you transparently upgrade to the new mode when adding
non-ASCII files without having to think about it.

> On the other hand, if the parent is (A), tools can't know what
> encoding the user wants to use for filenames: the user may have to use
> an encoding other than UTF-8 because of a repository management rule in
> the project, for example.

> Here, I want to confirm that:
> 
>     in the latter case (= children of (A)), the HGENCODING environment
>     variable should be consulted to decide the filename encoding.

No. HGENCODING has never had and will never have any relation to filenames.
Filenames are either recognizable as UTF-8/ASCII (for the purposes of
making Windows happy) or bytes in an unspecified legacy encoding that we
don't know or care about, just like file contents (everywhere else).
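
To make the distinction concrete, here is a minimal sketch of that
two-way split, written as standalone Python. It is not Mercurial code:
classify_filename is a hypothetical helper invented for illustration,
and it only inspects the raw bytes without ever consulting HGENCODING.
Note that this can only be a heuristic, since some legacy byte sequences
happen to also be valid UTF-8.

```python
def classify_filename(name: bytes) -> str:
    """Classify a raw filename as 'ascii', 'utf-8', or 'legacy'.

    Hypothetical helper for illustration only; not part of Mercurial.
    'legacy' means "bytes in some unknown encoding", nothing more.
    """
    try:
        name.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        name.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: treat as opaque bytes, like file contents.
        return "legacy"


if __name__ == "__main__":
    samples = [
        b"README.txt",
        "日本語.txt".encode("utf-8"),
        "日本語.txt".encode("cp932"),
    ]
    for raw in samples:
        print(repr(raw), classify_filename(raw))
```

The same name encoded in cp932 falls into the legacy bucket simply
because its bytes do not decode as UTF-8; nothing tries to guess which
legacy encoding was used.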

> Next. According to "WindowsUTF8Plan" wiki page:
> 
>     "Merge between UTF-8 and non-UTF-8 commits" could create
>     problems. We probably don't want to make merge aware of this
>     issue.
> 
> This is true for "UTF-8 changeset (B)" above and a legacy one, but not
> for (A) and a legacy one, isn't it?

Yes and no. The result of either may be a legacy changeset or a UTF-8
changeset, depending on rename history.

My advice: don't think about it yet.

-- 
Mathematics is the supreme nostalgia of our time.