[PATCH 0 of 5] Fix to handle MBCS filename correctly

Sun Jan 6 12:26:11 UTC 2008

These patches make hg handle MBCS filenames correctly on Windows
environments that use a problematic encoding (such as shift_jis).

This text describes the problem, explains the implementation, and
proposes recording the filename encoding in the repository (if
acceptable).

By the way, I do not think this implementation is a very good
solution, because it is implemented by hooking (decorating) os.path.*
functions.  I chose this approach after a test implementation that
replaced all the os.path.* calls: those path operations are used in
many places, and new ones would be added in the future without
considering this issue.

Are these patches acceptable?

Problem:

Windows uses the '\' (0x5c) character as the path separator.  Some
encodings (shift_jis, big5, etc.) also use this byte as the second
byte of a multi-byte character.  OTOH, hg holds filenames as encoded
strings (without any conversion, as its policy) and applies path
string operations such as xxx.split(os.sep) and
os.path.basename(path) to them.  As a result, hg gets invalid path
strings for path names containing 0x5c, and file operations fail.
This means hg cannot manage those files at all.
I think this is not critical for most projects, but it is really
damaging for some projects (if they use hg).
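
The failure is easy to reproduce outside hg.  The following is only a
minimal sketch, written for Python 3 byte strings rather than the
Python 2 str that hg uses, and it calls ntpath directly so it runs on
any platform.  The file name is an arbitrary example.

```python
import ntpath

# "表.txt": in Shift_JIS the character U+8868 encodes to the bytes 0x95 0x5c.
name = "\u8868.txt"
encoded = name.encode("shift_jis")

# A Windows-style path held as an encoded byte string.
path = b"dir\\" + encoded

# Byte-wise operations take the trailing 0x5c for a path separator.
parts = path.split(b"\\")
base = ntpath.basename(path)

# The same operations on the decoded text keep the file name intact.
text_base = ntpath.basename(path.decode("shift_jis"))

print(encoded, parts, base, text_base)
```

The byte-wise results lose the first character of the file name, while
the decoded variant returns the whole name.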

Solution:

There are two steps:
 1) (first 4 patches)
    Remove or replace code that uses os.sep directly, so that it goes
    through existing or new functions.
    For example:
       s.replace('\\', '/') => util.normpath(s)  ... use existing function
       s.split(os.sep) => util.splitpath(s)      ... use new function
       s.endswith(os.sep) => util.endswithsep(s) ... use new function
       do not use rfindall(os.sep)               ... change code

 2) (last patch)
    Introduce a wrapper (decorator) function that calls the original
    function with its argument(s) decoded to unicode and re-encodes
    the return value.  Apply this wrapper to the functions that hit
    the issue.  I inspected ntpath.py to list which functions are
    affected, and wrapped them together with the new functions added
    in the patches above.
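
The point of step 1 is to leave a single place per operation that
step 2 can wrap later.  As a rough sketch only: the names splitpath
and endswithsep come from the list above, but the bodies, the SEP
constant and the explicit sep parameter are assumptions made for
illustration, not the code in the patches.

```python
# Stand-in for os.sep on Windows, so the sketch runs on any platform.
SEP = "\\"


def splitpath(path, sep=SEP):
    """Split a path into components (assumed body: a plain split)."""
    return path.split(sep)


def endswithsep(path, sep=SEP):
    """Tell whether a path ends with the separator (assumed body)."""
    return path.endswith(sep)


# Call sites then use the helpers instead of touching os.sep themselves.
components = splitpath("src\\hg\\util.py")
trailing = endswithsep("src\\hg\\")
```

With call sites written this way, only the helpers need to become
MBCS-aware, not every caller.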

This wrapping is activated on Windows only.  The unicode conversion
is done only when a problematic encoding is in use (judged by
whether util._encoding is in _problematic_encoding).  The wrapper also
checks that the given argument string is actually encoded in that
encoding before converting it; otherwise it simply calls the
original function.  The check is simply whether decoding and then
re-encoding the string yields the original string.
I think this mechanism keeps the original behaviour as far as possible.
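
A minimal sketch of such a wrapper follows.  It is written for
Python 3 (bytes and str) instead of Python 2 (str and unicode), it
leaves out the Windows-only activation, and ENCODING, PROBLEMATIC,
mbcs_safe and the helper names are hypothetical stand-ins for
illustration, not the identifiers used in the patch.

```python
import functools
import ntpath

ENCODING = "shift_jis"               # stand-in for util._encoding
PROBLEMATIC = {"shift_jis", "big5"}  # stand-in for the problematic list


def _decode_if_encoded(arg, encoding):
    """Return arg decoded, or None if it does not round-trip."""
    if not isinstance(arg, bytes):
        return None
    try:
        text = arg.decode(encoding)
        if text.encode(encoding) != arg:
            return None
    except UnicodeError:
        return None
    return text


def _reencode(value, encoding):
    """Encode str results (also inside tuples and lists) back to bytes."""
    if isinstance(value, str):
        return value.encode(encoding)
    if isinstance(value, (tuple, list)):
        return type(value)(_reencode(v, encoding) for v in value)
    return value


def mbcs_safe(func, encoding=ENCODING):
    """Wrap func so that it operates on decoded text when that is safe."""
    @functools.wraps(func)
    def wrapper(*args):
        if encoding not in PROBLEMATIC:
            return func(*args)
        decoded = [_decode_if_encoded(a, encoding) for a in args]
        if any(d is None for d in decoded):
            return func(*args)  # keep the original behaviour
        return _reencode(func(*decoded), encoding)
    return wrapper


basename = mbcs_safe(ntpath.basename)
split = mbcs_safe(ntpath.split)

path = "dir\\\u8868.txt".encode("shift_jis")  # b"dir\\\x95\\.txt"
print(basename(path), split(path))
```

Arguments that fail the round-trip check go to the wrapped function
unchanged, which is what keeps the original behaviour for everything
else.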

Currently, Shift_JIS and BIG5 (and variant/alias names of those) are
listed in '_problematic_encodings'.

As a limitation, this mechanism takes effect only when util._encoding
is a problematic encoding.  This means the patch does not fully
resolve the issue for a repository cloned over the network.  The
limitation comes from the repository not carrying any information
about the encoding of its filenames.

Proposal:

What about recording the filename encoding in the repository?

In current hg, a filename is treated as just a byte sequence that
depends on the environment where it was committed.  So once the
repository is cloned, a remote user cannot know the encoding.  That
user may succeed in extracting the files and compiling them, but he
will have trouble committing non-ASCII filenames: he needs to know
the original encoding, and even if he knows it, he has to change his
environment to match the original encoding before committing.

As Matt said before, automatic filename conversion might not be
good, but I think we at least need to be able to know the filename
encoding of the repository.

From the point of view of this patch (the 0x5c issue), if we know the
encoding of the filenames in the repository, the limitation described
above might be resolved.

It may also be useful for extensions like 'converter' that want to
know the encoding information of a foreign repository
(e.g. in the case of hg -> svn conversion).