[PATCH STABLE V2] i18n: fix case folding problem with problematic encodings

Thu Dec 1 10:55:11 CST 2011

On Thu, 2011-12-01 at 18:03 +0900, FUJIWARA Katsunori wrote:
> At Wed, 30 Nov 2011 12:35:55 -0600,
> Matt Mackall wrote:
> 
> > > Please confirm my understanding.
> > > 
> > > "use upper()" seems to consist of the actions below.
> > > 
> > >   1. use "upper()" (or NEW "encoding.upper()") for "posix.normcase()"
> > 
> > No. The only place "NTFS" and "POSIX" are related is in Microsoft's
> > dreams. Different filesystems -must- have different folding rules.
> > Please examine the link I gave you earlier:
> > 
> > http://www.selenic.com/hg/file/ad686c818e1c/mercurial/posix.py#l174
> > 
> > HFS+ does:
> > 
> > - LOWER case 
> > - Unicode NFD normalization
> > - percent-escaping
> > 
> > ..so we should mirror those rules on Mac when we detect case-folding.
> > But they'd be wrong elsewhere.
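
A minimal sketch of the three HFS+ rules listed above, written in Python 3 for illustration. The function name hfs_like_normcase is hypothetical, and this is not the code at the linked posix.py revision. It assumes paths arrive as bytes that are usually UTF-8, and that percent-escaping applies to bytes that fail to decode.

```python
import unicodedata


def hfs_like_normcase(path: bytes) -> bytes:
    """Fold a path roughly the way the HFS+ rules above describe (sketch)."""
    try:
        text = path.decode("utf-8")
    except UnicodeDecodeError:
        # surrogateescape maps each undecodable byte 0xNN to U+DC00+0xNN,
        # which lets us find those bytes again and percent-escape them.
        text = path.decode("utf-8", errors="surrogateescape")
        text = "".join(
            "%%%02X" % (ord(ch) - 0xDC00) if 0xDC80 <= ord(ch) <= 0xDCFF else ch
            for ch in text
        )
    # Decompose (NFD), then lowercase, then return to bytes.
    return unicodedata.normalize("NFD", text).lower().encode("utf-8")
```

With this folding, a precomposed "É" and a decomposed "E" plus combining acute compare equal, which is the behaviour wanted on Mac and wrong on other filesystems.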
> > 
> > (Note that this means that util.normcase is actually charged with
> > handling other forms of folding/mapping as well.)
> > 
> > On Linux, where we can have any one of HFS+, NTFS, VFAT, ISO9660, etc.
> > connected either natively, or via one of dozens of network filesystems,
> > we're going to have a really hard time figuring out the underlying
> > case-folding rules for a given path. Also note that the character set
> > used to mount a non-native filesystem may disagree with the user's
> > locale. For instance, NTFS can be mounted in a mode where filenames are
> > represented as UTF-8, but a given user uses Latin1, or vice-versa. The
> > conservative thing to do here is str.lower(). This will be good enough
> > for something like 99% of users: 90% don't use non-native filesystems,
> > and 90% of the rest won't encounter case-collisions of non-ASCII
> > characters. 
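
To illustrate why a plain byte-string lower() is the conservative choice, here is a small sketch. It uses Python 3 bytes as a stand-in for the byte strings discussed above (an assumption; the name conservative_fold is hypothetical). Only ASCII letters are folded, so the result does not depend on guessing whether a filename is UTF-8 or Latin1.

```python
def conservative_fold(name: bytes) -> bytes:
    # bytes.lower() maps only ASCII A-Z; bytes >= 0x80 pass through unchanged,
    # so no encoding has to be assumed for the non-ASCII part of the name.
    return name.lower()


utf8_name = "CAF\u00c9.txt".encode("utf-8")      # E-acute is b"\xc3\x89"
latin1_name = "CAF\u00c9.txt".encode("latin-1")  # E-acute is b"\xc9"

folded_utf8 = conservative_fold(utf8_name)
folded_latin1 = conservative_fold(latin1_name)
```

The cost is the one described above: two names that differ only in the case of a non-ASCII character are not recognized as colliding.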
> > 
> > Why does the lower vs upper thing matter at all? It mostly doesn't, but
> > there are a few cases where the upper/lower mapping is not 1:1, like
> > Turkish iİıI and Georgian (which has three alphabets, only one of which
> > has "lowercase"). But as long as we have to have filesystem-specific
> > folding, we ought to try to match the filesystem insofar as Python's
> > Unicode database allows us to easily.
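
A short illustration of the non-1:1 mapping, using Python 3's built-in Unicode case mappings (the helper name collide is hypothetical). Folding with upper() makes Turkish dotless "ı" collide with plain "i", while folding with lower() keeps them apart, so the two choices really do give different answers.

```python
def collide(a: str, b: str, fold) -> bool:
    """Return True if the two names are equal after applying fold."""
    return fold(a) == fold(b)


dotless_i = "\u0131"  # LATIN SMALL LETTER DOTLESS I

# upper() sends both the dotless i and plain "i" to "I".
upper_collides = collide(dotless_i, "i", str.upper)

# lower() leaves both unchanged, so they stay distinct.
lower_collides = collide(dotless_i, "i", str.lower)

# The round trip is lossy: dotless i -> "I" -> "i".
round_trip = dotless_i.upper().lower()
```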
> > 
> > >   2. switch from "lower()" (or "encoding.lower()") for filename case
> > >      folding to "util.normcase()"
> > > 
> > >      # this is for readability/maintainability
> > > 
> > >   3. upper-case the fixed strings that are compared against normcase-d
> > >      strings (or introduce a case-folding compare function?)
> > > 
> > > But "os.path.normcase()" of Windows native Python lowercases the given
> > > strings, so comparing against upper-cased strings seems likely to cause
> > > unexpected failures.
> > 
> > We should probably just ban os.path.normcase() from the Mercurial
> > codebase.
> 
> Thank you for the detailed explanation!
> 
> 
> As I understand it:
> 
>     "util.normcase()" should abstract case folding policy, so
>     normcase-ed result should not be expected to be either lower or
>     upper.
> 
> 
> Next, I categorize the places that lower- or upper-case strings in the
> current implementation.
> 
>   A. comparison between filenames (directly or indirectly)
> 
>      "util.normcase()" should be applied on them.

util.normcase should only be applied after we've determined that we're
on a case-insensitive filesystem. We've done a pretty good job of
restricting its usage to dirstate.py, which carefully caches all the
relevant bits with _foldmap and normalize.

You should spend a while understanding dirstate.normalize. It's not
enough to be case-insensitive; we also have to be case-preserving.
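
As a rough illustration of "case-insensitive but case-preserving" (this is not the actual dirstate code; the class FoldMap and its fields are hypothetical), a fold map can key on the folded name while remembering the spelling that was first recorded:

```python
class FoldMap:
    """Map folded names back to the spelling that was first recorded."""

    def __init__(self, fold=str.lower):
        # fold stands in for a filesystem-specific normcase function.
        self._fold = fold
        self._foldmap = {}

    def normalize(self, path: str) -> str:
        # Lookup is case-insensitive (by folded key), but the value
        # returned is the case-preserved original spelling.
        key = self._fold(path)
        return self._foldmap.setdefault(key, path)


fm = FoldMap()
first = fm.normalize("README.txt")   # recorded as-is
second = fm.normalize("readme.TXT")  # resolves to the recorded spelling
```

The folded form is only ever used as a lookup key; what is stored and reported is the name as the user first spelled it.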

-- 
Mathematics is the supreme nostalgia of our time.