[PATCH 0 of 9 RFC] manage filename normalization policy per repository

FUJIWARA Katsunori foozy at lares.dti.ne.jp
Sun Jun 3 06:35:45 CDT 2012


At Sat, 02 Jun 2012 23:36:15 +0900,
FUJIWARA Katsunori wrote:
> 
> At Fri, 01 Jun 2012 13:56:17 -0500,
> Matt Mackall wrote:
> > 
> > On Fri, 2012-06-01 at 18:20 +0900, FUJIWARA Katsunori wrote:
> > > > So.. can we focus on Windows UTF-8 support first?
> > > 
> > > We, some developers in Japan, start for "Windows UTF-8 support" !
> > 
> > Ok, in case you didn't see it, I wrote an outline of my idea here:
> > 
> > http://mercurial.selenic.com/wiki/WindowsUTF8Plan
> 
> Thank you for creating page !
> 
> In addition to what you described in WindowsUTF8Plan, I think some
> process interaction parts should be fixed:
> 
>   - receiving arguments in Unicode:
> 
>     when all managed filenames are encoded in UTF-8, some characters
>     can't be re-encoded into system code page (e.g.: NFD characters on
>     cp932 environment).
> 
>     command prompt can tab-complete and pass them to invoked commands,
>     but invokees should use "GetCommandLineW" API to receive them
>     without any data loss: fixutf8 extension does so.
> 
>         https://bitbucket.org/stefanrusek/hg-fixutf8/src/baf283ab9f92/win32helper.py#cl-103

According to a Redmine committer, CRuby still uses the ANSI API rather
than the Unicode API to invoke processes, so Redmine uses an additional
extension to receive arguments that are not valid in the system code
page as URL-encoded strings.

# https://bitbucket.org/redmine/redmine-trunk/src/b7d23f87921e/lib/redmine/scm/adapters/mercurial/redminehelper.py

Other plugins or wrappers that invoke the "hg" command may have the
same problem: their frameworks or interpreters prevent them from using
the Unicode process invocation API.

Even in such cases, to pass arguments that are not valid in the system
code page successfully and easily, we should add a global option to
specify that arguments are escaped in some form, shouldn't we ?

  - add a global option "--escaped-args", or some similar name

  - use "string-escape" (which is cheaper than URL encoding, isn't it ?)

  - steps to process arguments are:

      1. first, check "--encoding" with _earlygetopt() in _dispatch()

      2. determine the encoding from "--encoding", HGENCODING or the
         preferred encoding

      3. check "--escaped-args" with _earlygetopt()

         3.1 if specified, replace "req.args" with the
             "decode('string-escape')"-ed ones,

         3.2 otherwise (and on Windows):

             3.2.1 get arguments in Unicode by "GetCommandLineW()"

             3.2.2 replace "req.args" with the
                   "encode(encoding.encoding)"-ed ones

             3.2.3 abort when encoding fails, that is, when some chars
                   are not valid in the specified encoding

                   or use "encode(encoding.encoding, 'replace')" for
                   backward compatibility ?

                   IMHO, many users on Windows seem to be unaware of
                   encoding details, so aborting is better: it reduces
                   issue reports like "why do my operations for
                   xxx<chars out of encoding>xxx fail ?".

      4. get "--cwd", "--repository" and so on from processed "req.args"

           .....

      5. "--encoding" and "--escaped-args" are checked in the same way
         as "--config"/"--cwd": abort with "may not be abbreviated!"

         should the "may not be abbreviated!" check for "--encoding"
         be applied only on Windows, for backward compatibility ?
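
A minimal sketch of steps 1 to 3 above, not Mercurial's actual code.
It assumes Python 3, where the "string-escape" codec no longer exists,
so "unicode_escape" combined with latin-1 is used as a stand-in.
"Abort", "early_getopt()" and "process_args()" are hypothetical names,
and the "unicode_args" parameter only models what "GetCommandLineW()"
would return on Windows.

```python
import locale
import os


class Abort(Exception):
    """Hypothetical stand-in for Mercurial's abort error."""


def string_unescape(text):
    """Turn backslash escapes back into raw bytes.

    Python 3 stand-in for Python 2's decode('string-escape').
    """
    return text.encode("ascii").decode("unicode_escape").encode("latin-1")


def early_getopt(name, args):
    """Simplified, hypothetical lookup of the value that follows `name`."""
    for i, arg in enumerate(args[:-1]):
        if arg == name:
            return args[i + 1]
    return None


def process_args(args, unicode_args=None, environ=None):
    """Return (encoding, arguments as bytes), following steps 1 to 3."""
    environ = os.environ if environ is None else environ

    # steps 1 and 2: "--encoding", then HGENCODING, then preferred encoding
    enc = (early_getopt("--encoding", args)
           or environ.get("HGENCODING")
           or locale.getpreferredencoding())

    # step 3.1: the invoker escaped the arguments
    if "--escaped-args" in args:
        rest = [a for a in args if a != "--escaped-args"]
        return enc, [string_unescape(a) for a in rest]

    # step 3.2: re-encode the Unicode arguments, aborting on failure
    source = args if unicode_args is None else unicode_args
    try:
        return enc, [a.encode(enc) for a in source]
    except UnicodeEncodeError as exc:
        raise Abort("argument not valid in %s: %r" % (enc, exc.object))
```

With "--escaped-args", the invoker can pass only ASCII characters on
the command line, so the ANSI process invocation API is enough on its
side.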


>   - passing arguments in Unicode:
> 
>     in some cases, hg invokes external commands with filenames.
> 
>     current implementation uses "subprocess" python library to invoke
>     external commands, but it can't pass Unicode strings to invokees.
> 
>         http://bugs.python.org/issue1759845
> 
>     to pass characters, which are valid in UTF-8 but not in system
>     code page, we should use "CreateProcessW" explicitly on Windows.
> 
>     for example, according to checking around 'util.system()'
>     invocations:
> 
>       - external diff:
>           passes non-ascii filenames, if diff target is only one file.
> 
>       - external merge:
>           passes one non-ascii filename, because the file in working
>           directory is used as one of merge files.
> 
> # of course, use byte API, if any of target strings are not valid in UTF-8

On the invoking side, what about introducing an intermediate argument
conversion command (call it "argdecode" below) instead of using the
Unicode process invocation API explicitly ?

  - on Windows, if the specified arguments contain any chars not valid
    in the system code page (or always ?):

    1. encode command line (including path to command) with UTF-8 and
       "string-escape"

       what should we do, if command line can't be encoded into UTF-8?
       this means that there are some chars not valid in both system
       code page and UTF-8: is aborting reasonable ?

    2. invoke "argdecode" with encoded command line

    3. in "argdecode", decode arguments and invoke target command by
       Unicode process creation API

  - otherwise, use the ANSI process invocation API (= same as the
    current implementation)

This can also be used by invokers of the "hg" command, instead of the
"--escaped-args" option described before, if "argdecode" is bundled
with the official Mercurial distribution.
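
The "argdecode" idea could look roughly like the sketch below. It
assumes Python 3, whose "subprocess" module accepts Unicode arguments,
and again uses "unicode_escape" with latin-1 as a stand-in for
"string-escape". All function names, and the command name "argdecode"
itself, are hypothetical.

```python
import subprocess


def string_escape(data):
    """Escape raw bytes into printable ASCII text."""
    return data.decode("latin-1").encode("unicode_escape").decode("ascii")


def string_unescape(text):
    """Reverse string_escape(): ASCII text back into raw bytes."""
    return text.encode("ascii").decode("unicode_escape").encode("latin-1")


def needs_argdecode(argv, codepage):
    """True if any argument has chars not valid in the given code page."""
    for arg in argv:
        try:
            arg.encode(codepage)
        except UnicodeEncodeError:
            return True
    return False


def wrap_for_argdecode(argv, argdecode="argdecode"):
    """Steps 1 and 2: build the ASCII-only command line for "argdecode"."""
    return [argdecode] + [string_escape(a.encode("utf-8")) for a in argv]


def unwrap_in_argdecode(escaped_argv):
    """Step 3, inside "argdecode": recover the original arguments."""
    return [string_unescape(a).decode("utf-8") for a in escaped_argv]


def argdecode_main(escaped_argv):
    """Invoke the target command with the decoded arguments."""
    return subprocess.call(unwrap_in_argdecode(escaped_argv))
```

The invoker would call "wrap_for_argdecode()" only when
"needs_argdecode()" is true, and keep the current code path otherwise.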


BTW, configurations for merge-tools (extdiff, too ?) should have a new
property to indicate the "utf-8 arguments acceptable" capability, like
"binary", shouldn't they ?


> BTW, in transition period, repositories using different encodings for
> filenames may exist in same host: cp932 and utf-8, for example.
> 
> In such cases, both HGENCODING env and system code page can't be used
> to specify encoding for filenames of each repositories.
> 
> So, GUI tools like TortoiseHg managing multiple repositories want to
> know encoding of them in some way. Are there any ideas to solve it
> without breaking backward compatibility ?

After posting that mail, I came to think the following:

    Having an "encoding" property for each repository on the GUI or
    other management side solves this problem.

    Managing the encoding of each repository has always been the
    responsibility of "users of hg". From this point of view, it is
    reasonable that GUIs/IDE plugins, which are "users of hg" in a
    wide sense, manage it.

    Even if hg provided a way to detect the encoding of each
    repository, GUIs/plugins would have to be modified to use it.

    So, having an "encoding" property is not much more difficult than
    that.
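
As a rough illustration of such a property on the management side, a
GUI or plugin could keep a table of encodings and set HGENCODING for
each "hg" invocation. The table contents, the paths and the
"hg_command()" helper are hypothetical.

```python
import os

# hypothetical per-repository settings kept by a GUI or IDE plugin
REPO_ENCODINGS = {
    "/repos/legacy": "cp932",
    "/repos/new": "utf-8",
}


def hg_command(repo, args, encodings=REPO_ENCODINGS, base_env=None):
    """Return (argv, env) for running hg in `repo` with its own encoding."""
    env = dict(os.environ if base_env is None else base_env)
    encoding = encodings.get(repo)
    if encoding is not None:
        env["HGENCODING"] = encoding  # overrides any host-wide setting
    return ["hg", "--repository", repo] + list(args), env
```

The result can be passed to "subprocess.call(argv, env=env)", so the
host-wide HGENCODING and the system code page no longer matter for
that repository.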


BTW, if multiple repositories served by "hg serve" or hgweb.cgi use
different encodings from each other, sharing a single
"encoding.encoding" will cause problems.

I don't know much about the process/thread model of each HTTPD
implementation. If the processing of each HTTP request is always fully
isolated, please tell me so !

Are there any ways to solve this problem ? For example, using "thread
local data" to manage the encoding ?

# according to a TortoiseHg committer, the current TortoiseHg
# implementation also has the same problem, because it invokes the
# Mercurial API in the same Python process
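
A small sketch of the "thread local data" idea, assuming each request
is handled in its own thread. It is not Mercurial's "encoding" module:
"set_encoding()", "get_encoding()", "to_local()" and "serve()" are
hypothetical, simplified names.

```python
import threading

_local = threading.local()
DEFAULT_ENCODING = "utf-8"  # assumed process-wide fallback


def set_encoding(name):
    """Set the encoding for the current thread only."""
    _local.encoding = name


def get_encoding():
    """Return the current thread's encoding, or the fallback."""
    return getattr(_local, "encoding", DEFAULT_ENCODING)


def to_local(data):
    """Decode UTF-8 bytes and re-encode them for the current thread."""
    return data.decode("utf-8").encode(get_encoding())


def serve(encoding, barrier, results):
    """Model one request handler working on one repository."""
    set_encoding(encoding)
    barrier.wait()  # both threads have set their encoding by now
    results[encoding] = to_local("\u3042".encode("utf-8"))


results = {}
barrier = threading.Barrier(2)
threads = [threading.Thread(target=serve, args=(enc, barrier, results))
           for enc in ("cp932", "utf-8")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each thread sees only the encoding it set itself, even though both run
in the same Python process. This does not help if one thread serves
several repositories in turn without resetting the value.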

----------------------------------------------------------------------
[FUJIWARA Katsunori]                             foozy at lares.dti.ne.jp

