[PATCH STABLE V2] i18n: fix case folding problem with problematic encodings

FUJIWARA Katsunori foozy at lares.dti.ne.jp
Thu Dec 1 03:03:09 CST 2011


At Wed, 30 Nov 2011 12:35:55 -0600,
Matt Mackall wrote:

> > Please confirm my understanding.
> > 
> > "use upper()" seems to consist of the actions below.
> > 
> >   1. use "upper()" (or NEW "encoding.upper()") for "posix.normcase()"
> 
> No. The only place "NTFS" and "POSIX" are related is in Microsoft's
> dreams. Different filesystems -must- have different folding rules.
> Please examine the link I gave you earlier:
> 
> http://www.selenic.com/hg/file/ad686c818e1c/mercurial/posix.py#l174
> 
> HFS+ does:
> 
> - LOWER case 
> - Unicode NFD normalization
> - percent-escaping
> 
> ..so we should mirror those rules on Mac when we detect case-folding.
> But they'd be wrong elsewhere.
> 
> (Note that this means that util.normcase is actually charged with
> handling other forms of folding/mapping as well.)
> 
> On Linux, where we can have any one of HFS+, NTFS, VFAT, ISO9660, etc.
> connected either natively, or via one of dozens of network filesystems,
> we're going to have a really hard time figuring out the underlying
> case-folding rules for a given path. Also note that the character set
> used to mount a non-native filesystem may disagree with the user's
> locale. For instance, NTFS can be mounted in a mode where filenames are
> represented as UTF-8, but a given user uses Latin1, or vice-versa. The
> conservative thing to do here is str.lower(). This will be good enough
> for something like 99% of users: 90% don't use non-native filesystems,
> and 90% of the rest won't encounter case-collisions of non-ASCII
> characters. 
> 
> Why does the lower vs upper thing matter at all? It mostly doesn't, but
> there are a few cases where the upper/lower mapping is not 1:1, like
> Turkish iİıI and Georgian (which has three alphabets, only one of which
> has "lowercase"). But as long as we have to have filesystem-specific
> folding, we ought to try to match the filesystem insofar as Python's
> Unicode database allows us to easily.
> 
> >   2. switch from "lower()" (or "encoding.lower()") for filename case
> >      folding to "util.normcase()"
> > 
> >      # this is for readability/maintainability
> > 
> >   3. upper-case the fixed strings that are compared against
> >      normcase-d strings (or introduce a case-folding-compare
> >      function?)
> > 
> > But "os.path.normcase()" in native Windows Python lowers the given
> > strings, so comparing against upper-ed strings seems likely to
> > cause unexpected failures.
> 
> We should probably just ban os.path.normcase() from the Mercurial
> codebase.

Thank you for the detailed explanation!


As I understand it:

    "util.normcase()" should abstract the case-folding policy, so
    callers should not assume that a normcase-ed result is either
    lower case or upper case.
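
A minimal sketch of this point, written in Python 3 for illustration
only. The names normcase_lower, normcase_upper, samename,
hassuffix_fragile and hassuffix are hypothetical stand-ins, not
Mercurial APIs. The sketch only shows that a caller which folds both
sides with the same function stays correct whichever direction the
policy folds, while a caller that hard-codes a lower-case literal
does not.

```python
# Hypothetical folding policies; the real util.normcase() is
# platform-specific and is not reproduced here.
def normcase_lower(name):
    return name.lower()

def normcase_upper(name):
    return name.upper()

def samename(a, b, normcase):
    """Compare two names under whatever policy normcase implements."""
    return normcase(a) == normcase(b)

def hassuffix_fragile(name, normcase):
    # Wrong: silently assumes that normcase() folds to lower case.
    return normcase(name).endswith('.orig')

def hassuffix(name, suffix, normcase):
    # Policy-neutral: fold the fixed string with the same function.
    return normcase(name).endswith(normcase(suffix))

for policy in (normcase_lower, normcase_upper):
    print(policy.__name__,
          samename('README.txt', 'ReadMe.TXT', policy),
          hassuffix_fragile('patch.ORIG', policy),
          hassuffix('patch.ORIG', '.orig', policy))
```

Only hassuffix_fragile changes its answer when the policy is
swapped, which is the failure mode to avoid.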


Next, I categorize the places in the current implementation that
lower or upper strings.

  A. comparison between filenames (directly or indirectly)

     "util.normcase()" should be applied to them.

     There is only one special case: "util.checkcase()" uses
     lower()/upper() to check whether the filesystem is
     case-sensitive.

     This function is called only with pure ASCII filenames (at least
     in the current implementation), so it can be excluded from the
     current discussion.
    

  B. comparison between a filename and something else
     (e.g.: keyword searching)

     "util.normcase()" should be applied to both sides, because:

       - there is no way to know whether "util.normcase()" folds
         character case to lower or to upper

       - fixed lower/upper-ing may be inconsistent with how the
         filesystem represents the filename

     As a consequence, "util.normcase()" is also applied to the
     changeset description and so on, for efficiency.


  C. comparison between part of a filename and a fixed string
     (e.g.: suffix check, reserved name check)

     These comparisons are limited to pure ASCII strings (at least
     in the current implementation), so they can be excluded from the
     current discussion.


  D. comparison between other things

     These can be excluded from the current discussion.
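
The special case noted under (A), probing case-sensitivity by
flipping the case of a name, can be sketched roughly as follows. This
is an illustration in Python 3, not the actual "util.checkcase()"
code; probe_case_sensitive is a hypothetical name. It assumes the
path names an existing file, and an ASCII basename keeps the
upper()/lower() flip unambiguous.

```python
import os
import tempfile

def probe_case_sensitive(path):
    """Return True if the filesystem holding path looks case-sensitive.

    path must name an existing file.
    """
    st = os.stat(path)
    dirname, base = os.path.split(path)
    flipped = base.upper()
    if flipped == base:
        flipped = base.lower()
    if flipped == base:
        # No cased characters to flip: nothing can be learned, so
        # conservatively report case-sensitive.
        return True
    try:
        st2 = os.stat(os.path.join(dirname, flipped))
    except OSError:
        # The flipped name does not exist, so case matters here.
        return True
    # If the flipped name reaches the same file, case is ignored.
    return not os.path.samestat(st, st2)

with tempfile.TemporaryDirectory() as tmp:
    probe = os.path.join(tmp, 'Probe')
    open(probe, 'w').close()
    print(probe_case_sensitive(probe))
```

The printed value depends on the filesystem that holds the temporary
directory.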


So, for case folding, we should:

  1. apply "util.normcase()" to categories (A) and (B)

  2. use upper-ing instead of "os.path.normcase()" on Windows
     (* because of the lower/upper one-way-mapping problem on NTFS)

  3. use upper-ing in "posix.normcase()" except on Mac OS
     (* a corner case for Cygwin on Windows NTFS)
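
The one-way-mapping problem can be illustrated with Python 3 str
objects, whose upper()/lower() follow Python's Unicode database. This
is only a sketch: collides is a hypothetical helper, and the result
says nothing about the folding table that NTFS or any other
filesystem actually uses. It shows that the choice between upper and
lower changes which distinct names are treated as equal.

```python
def collides(a, b, fold):
    """True if two distinct names become equal under fold."""
    return a != b and fold(a) == fold(b)

pairs = [
    ('i', '\u0131'),   # LATIN SMALL LETTER DOTLESS I
    ('ss', '\u00df'),  # LATIN SMALL LETTER SHARP S
    ('k', '\u212a'),   # KELVIN SIGN
]

for a, b in pairs:
    print(ascii(a), ascii(b),
          'upper:', collides(a, b, str.upper),
          'lower:', collides(a, b, str.lower))
```

The first two pairs collide only under upper(), and the last pair
collides only under lower(), so neither direction is a safe default
for every filesystem.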


Please confirm my understanding!

----------------------------------------------------------------------
[FUJIWARA Katsunori]                             foozy at lares.dti.ne.jp

