[PATCH STABLE V2] i18n: fix case folding problem with problematic encodings

Matt Mackall mpm at selenic.com
Fri Dec 2 12:25:20 CST 2011


On Sat, 2011-12-03 at 01:39 +0900, FUJIWARA Katsunori wrote:
> At Thu, 01 Dec 2011 10:55:11 -0600,
> Matt Mackall wrote:
> 
> > > As I understand it:
> > > 
> > >     "util.normcase()" should abstract case folding policy, so
> > >     normcase-ed result should not be expected to be either lower or
> > >     upper.
> > > 
> > > 
> > > Then, I categorize lower/upper-ing points in current implementation.
> > > 
> > >   A. compare between filenames (directly or indirectly)
> > > 
> > >      "util.normcase()" should be applied on them.
> > 
> > util.normcase should only be applied after we've determined that we're
> > on a case-insensitive filesystem. We've done a pretty good job of
> > restricting its usage to dirstate.py, which carefully caches all the
> > relevant bits with _foldmap and normalize.
> > 
> > You should spend a while understanding dirstate.normalize. It's not
> > enough to be case-insensitive, we also have to be case-preserving.
> 
> As I understand "dirstate._foldmap" functionality:
> 
>   - dirstate stores up "case-preserved" names
>     (* "logical name")
> 
> 
>   - "util.normcase()" should emulate case folding policy in target
>     case-insensitive filesystem
> 
>     case folding policy depends on filesystem implementation, so we
>     should not expect either lower-ed or upper-ed.
> 
> 
>   - "dirstate._foldmap" maps from "util.normcase()"-ed (= case folded)
>     names to "case-preserved" ones
> 
>       - if it is already tracked, this mapping gives original
>         case-preserved name (= "logical name")
> 
>       - otherwise, "case-preserved" name is given from filesystem
>         layer: case information is preserved in many case-insensitive
>         filesystem
> 
>         then, given name may be stored into dirstate, and become
>         "logical name"
> 
> 
> Almost all filename comparisons should be applied to "logical names"
> (where either lower or upper can be used), so "util.normcase()" can
> be limited to the "dirstate.normalize" family, as you described in
> the reply above.
> 
> Do I understand correctly ?
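
To make the mapping described above concrete, here is a minimal sketch
of a fold map. It is not Mercurial's dirstate code: the FoldMap class,
its normalize() method and the ondisk parameter are hypothetical names
for illustration, and str.lower stands in for whatever folding
util.normcase() performs on a given platform.

```python
class FoldMap:
    """Hypothetical sketch: map case-folded names to case-preserved ones."""

    def __init__(self, tracked, normcase=str.lower):
        # normcase is an assumption: a stand-in for the platform folding.
        self._normcase = normcase
        self._map = {normcase(name): name for name in tracked}

    def normalize(self, name, ondisk=None):
        key = self._normcase(name)
        if key in self._map:
            # Already known: return the case-preserved "logical name".
            return self._map[key]
        # Unknown: prefer the spelling reported by the filesystem layer
        # (passed in as ondisk here), otherwise keep the caller's spelling.
        preserved = ondisk if ondisk is not None else name
        self._map[key] = preserved
        return preserved


fm = FoldMap(["README.txt"])
tracked_hit = fm.normalize("readme.TXT")
new_file = fm.normalize("new.c", ondisk="New.C")
later_hit = fm.normalize("NEW.C")
```

The point of the sketch is that folding is used only as a lookup key;
the name that comes back out always keeps its original case.
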
> 
> 
> Then, I picked up some filename comparing points where I have less
> confidence in my inference whether "util.normcase()" is needed or not.
> 
> # especially (3) and (4) !!
> 
>   1. encoding.lower() in merge._checkcollision():

This should probably use util.normcase. If we check out a changeset
containing files whose names differ only in using the NFC or the NFD
form of 'á' on a Mac, then that ought to break. But encoding.lower()
won't notice the problem.
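
A small sketch of why plain lowercasing misses this case. It uses
Python 3 str and the standard unicodedata module rather than
Mercurial's byte-string helpers, and the two helper functions are
hypothetical names for illustration. It assumes a filesystem that
treats the two normalization forms as the same name.

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "\u00e1")  # one code point, U+00E1
nfd = unicodedata.normalize("NFD", "\u00e1")  # 'a' followed by U+0301


def lower_collides(a, b):
    # Lowercasing alone: compares code points after case mapping.
    return a.lower() == b.lower()


def normalized_collides(a, b):
    # Assumption: fold to one normalization form before lowercasing.
    def fold(s):
        return unicodedata.normalize("NFD", s).lower()
    return fold(a) == fold(b)


missed = lower_collides(nfc + ".txt", nfd + ".txt")
caught = normalized_collides(nfc + ".txt", nfd + ".txt")
```

Here missed is False although both names render identically, while the
check that normalizes first reports the collision.
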

>   2. encoding.lower() in scmutil.casecollisionauditor:
> 
>      this checks collision between "logical name"s, so
>      "util.normcase()" is not needed

Indeed. Because we don't know the case-folding scheme of all possible
target systems, we should use something generic. util.normcase of course
is local to our current platform. The ideal answer is something that
actually checks (if maccollision() or windowscollision()) but I don't
think it's really worth the trouble.
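
For illustration, a generic check of this kind might look like the
sketch below. find_case_collisions is a hypothetical helper, not the
casecollisionauditor code, and it assumes that simple lowercasing is an
acceptable platform-neutral approximation of folding.

```python
def find_case_collisions(names):
    """Return (first_seen, later) pairs that differ only by case."""
    seen = {}
    collisions = []
    for name in names:
        key = name.lower()  # generic folding, not tied to one platform
        if key in seen and seen[key] != name:
            collisions.append((seen[key], name))
        else:
            seen[key] = name
    return collisions


found = find_case_collisions(["a.txt", "A.TXT", "b.txt"])
```

A platform-specific variant would swap the key function, which is the
extra machinery that probably isn't worth the trouble.
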

>   3. os.path.normcase() in scmutil.pathauditor.__call__():

This one is tricky, as it has slightly mixed goals. First and foremost,
it's trying to ensure that paths are safe on the local system. For that,
no case normalization is needed or wanted. os.lstat() will work just
fine. Note that this actually will do the wrong thing on a Mac using
case-sensitive HFSX, because os.path.normcase always folds on Macs.

But the auditor is also trying to ensure that you can't check in a set
of files that would be unsafe if checked out on another system. I'm not
really sure if that's needed, but it'd take some very careful checking
to be sure.

>   4. os.path.normcase() in util.fspath():
> 
>      these "os.path.normcase()"-ed strings are used to invoke
>      "os.lstat()" or "os.listdir()".

Note that os.path.normcase() is a useless no-op on Linux: it never
changes case. That's right for native filesystems, but wrong for NTFS,
VFAT, and HFS+. So we should be using util.normcase (which always
lowers) iff we're on a case-insensitive filesystem. fspath is only used
in dirstate._normalize so that's easy enough.
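
The platform dependence is easy to see by calling the POSIX and
Windows flavours of normcase directly; both modules can be imported on
any platform in Python 3. The fold_if_insensitive helper is a
hypothetical stand-in for "util.normcase only on case-insensitive
filesystems", not existing Mercurial code.

```python
import ntpath
import posixpath

name = "Docs/ReadMe.TXT"

posix_result = posixpath.normcase(name)  # what os.path.normcase does on Linux
nt_result = ntpath.normcase(name)        # what os.path.normcase does on Windows


def fold_if_insensitive(path, case_insensitive):
    # Assumption: lowercasing approximates the folding we want, and the
    # caller has already probed whether the filesystem ignores case.
    return path.lower() if case_insensitive else path


folded = fold_if_insensitive(name, True)
untouched = fold_if_insensitive(name, False)
```

The POSIX variant returns the name unchanged whatever filesystem is
mounted underneath, which is why the decision has to come from a
filesystem probe rather than from os.path.
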

>      so, just lower/upper-ed name may cause problems on some
>      case-insensitive filesystems, but os.path.normcase() is not
>      suitable either.
> 
>      "util.fspath()" is used only on case-insensitive filesystem, so
>      "util.normcase()" may be reasonable. but "scmutil.pathauditor" is
>      used also on case-sensitive filesystem.
> 
> 
> ----------------------------------------------------------------------
> [FUJIWARA Katsunori]                             foozy at lares.dti.ne.jp


-- 
Mathematics is the supreme nostalgia of our time.



