[PATCH 2 of 2 STABLE] i18n: fix case folding problem in some problematic encodings

Fri Nov 25 11:21:13 CST 2011

On 11/25/2011 04:20 PM, FUJIWARA Katsunori wrote:
>>> --- a/mercurial/encoding.py	Thu Nov 24 14:59:26 2011 +0900
>>> +++ b/mercurial/encoding.py	Thu Nov 24 15:05:37 2011 +0900
>>> @@ -171,3 +171,5 @@
>>>            return lu.encode(encoding)
>>>        except UnicodeError:
>>>            return s.lower() # we don't know how to fold this except in ASCII
>>> +    except LookupError, k:
>>> +        raise error.Abort("%s, please check your locale settings" % k)
>>
>> We have hint=... for hints.
>
> This re-raising code is as same as others in encoding, and these have
> no hint. Would it need hint ? and what hint message is appropriate, if
> so ?

"please check your locale settings" is good hint.

Fixing other aborts could be a separate patch. It is not important, but 
adding more old style aborts is not pretty.

>> But wouldn't the windows normcase have the same need for using
>> encoding.lower as the posix one?
>
> Users using windows native HG with problematic encoding would enable
> win32mbcs extension: win32mbcs wraps "windows.normcase()" to convert
> string from byte sequence to unicode. So, "windows.normcase()" can
> apply lower() on it safely.

So all that is needed is that core Mercurial should use some normcase 
function everywhere and win32mbcs should monkeypatch it?

> In other hand, HG on CYGWIN runs as POSIX one,

As far as I know Mercurial on cygwin only works by coincidence. 
Mercurial has many assumptions that cygwin and posix on windows 
violates. I think that before we start reasoning about cygwin corner 
cases we should figure out to which extend Mercurial on cygwin is 
supported - and whether it should be as windows or as posix.

I would recommend using msys instead and I don't think we should try to 
support cygwin at all.

(Cygwin might however behave a lot like for example when FAT or SMB with 
the same encodings is used on linux.)

> so byte sequnces in
> problematic encoding are passed to posix.normcase(). So, I think it
> should use "encoding.lower()" instead of "str.lower()".
>
>
>> Perhaps normcase should move to util and be a wrapper around a platform
>> specific helper function (that just should normalize slashes).
>
> IMHO, Matt seems to work around that by 6eff984d8e76, b2fd4746414a and
> so on, and to use "normcase()" only for case insensitive filesystem.

Yes, util.normcase has an unfortunate name. I think Matt proposed 
calling it 'foldcase' instead.

(AFAICS it is tied to the file system (and perhaps by definition and our 
approximation to HGENCODING) but not related to the platform at all. It 
seems to me like it really should be something else and live somewhere 
else.)

>>> diff -r 8b3cfad9c073 -r ccae15e928bf mercurial/scmutil.py
>>> --- a/mercurial/scmutil.py	Thu Nov 24 14:59:26 2011 +0900
>>> +++ b/mercurial/scmutil.py	Thu Nov 24 15:05:37 2011 +0900
>>> @@ -86,8 +86,11 @@
>>>            # AIX ignores "/" at end of path, others raise EISDIR.
>>>            if util.endswithsep(path):
>>>                raise util.Abort(_("path ends in directory separator: %s") % path)
>>> -        normpath = os.path.normcase(path)
>>> +        normpath = util.normcase(path)
>>>            parts = util.splitpath(normpath)
>>> +        # below 'lower()'s are for comparison only with ASCII strings,
>>> +        # so encoding.lower() is not needed
>>> +        # even if encoding of 'path' is problematic one
>>
>> Wouldn't it be better to do the "obviously correct" thing instead of
>> adding comments describing a clever optimization?
>
> I chose commenting, because HG seems to focus on efficiency. But I do
> not have any objection to using "encoding.lower()" instead of
> "str.lower()"

Yes, Mercurial focus on performance where it matters - and also focus on 
readable code instead of a comments.

However, in this case you changed os.path.normcase (which is a noop on 
posix, also on case-insensitive file systems) to util.normcase (which 
always on all platforms and all filesystems will lowercase), so there is 
no need for further lowering.

>>> +    # encoding.fromlocal() on 'name' is not needed because:
>>> +    #
>>> +    #   - Windows ('\\' as path separator) + problematic encoding
>>> +    #     should enable win32mbcs which wraps this function for
>>> +    #     unicode-nize of 'name'
>>> +    #
>>> +    #   - other posix environment (e.g.: cygwin) uses '/' as path
>>> +    #     separator, so re.findall() does not break byte sequence,
>>> +    #     even though it uses problematic encoding
>>
>> FWIW: I don't understand that comment.
>
> I wanted to explain about safeness about "re.findall()", which may
> break byte sequence with '\\' in problematic encoding, at this line,
> for future maintenance.

In my opinion the whole point with the win32mbcs extension is that core 
Mercurial never has to consider these "problematic" encodings.

/Mads