[PATCH 1 of 8 RFC] vfs: replace invocation of file APIs of os module by ones via vfs

Adrian Buehlmann adrian at cadifra.com
Sun Jun 17 04:48:52 CDT 2012


On 2012-06-17 10:27, FUJIWARA Katsunori wrote:
> 
> At Sat, 16 Jun 2012 23:34:47 +0200,
> Adrian Buehlmann wrote:
> 
>> Some further, perhaps stupid and wild ideas:
>>
>> For the openers (e.g. scmutil.opener) I think we might have to put a unicode
>> string into base (see scmutil.py):
>>
>> 199:    def __init__(self, base, audit=True):
>> 200:        self.base = base
>>
>> For the store openers, the path parameter on __call__
>>
>> 218:    def __call__(self, path, mode="r", text=False, atomictemp=False):
>>
>> would then be plain ASCII strings, as the filenames in the store are
>> all encoded already, using ASCII characters only.
>>
>> Then the join function
>>
>> 293:    def join(self, path):
>> 293:        return os.path.join(self.base, path)
>>
>> needs to return a unicode string, which is formed by using the "base"
>> unicode string and joining it with the ASCII path.
>>
>> join() is used in __call__() to form the final, complete path f
>>
>> 224:        f = self.join(path)
>>
>> which needs to be a unicode string as well (on Windows, of course).
>>
>> We then need a unicode version of util.posixfile
>>
>> 261:        fp = util.posixfile(f, mode)
>>
>> Which takes the unicode filename f.
>>
>> So we would then also need a unicode version of posixfile for Windows in
>> osutil.c, line 410.
>>
>> The store openers need to be unicode-aware because of the base.
>>
>> base is somewhere under the repo root. Which in turn can have funny characters
>> (e.g. Japanese).
>>
>> I think this has to be done unconditionally, if we want to support repo
>> roots with funny paths.
>>
>> Likewise, the base of wopeners needs to be a unicode string as well, for
>> the same reasons.
>>
>> But there, we ideally most likely want to have the path parameter on
>> __call__ in UTF-8, or some other encoding (e.g. latin1 or whatever?),
>> depending on some other conditions (the switching as per Matt's ideas).
> 
> For example, I can create files named as below via Python Unicode file
> API even on Japanese Windows using cp932 as system code page:
> 
>   - u'\u00c0'
>   - u'\u30cf\u309a' (NFD-ed u'\u30d1', which is valid in cp932)
> 
> But I can't access them via the Python ANSI file API, because such Unicode
> characters have no corresponding characters in cp932.
> 
> # "os.listdir('.')" returns mangled names for them
> 
> So, I think that there are two kinds of "funny" paths:
> 
>   (A) using only chars valid in system code page
>   (B) using also chars not valid in system code page
> 
> If repo root path is (A):
> 
>   - root path (A) can be encoded to valid byte sequence in system code
>     page, and
> 
>   - encoded "root path (A)" and the path in workdir in any encoding
>     can be joined as byte sequence
> 
> So we can also access the target files correctly via the ANSI file API: we
> can switch between the ANSI and Unicode file APIs, according to the
> conditions suggested by Matt.
> 
> On the other hand, if repo root path is (B):
> 
>   - root path (B) can't be encoded to valid byte sequence in system
>     code page, so we should use Unicode file API to access files under
>     such directory, but
> 
>   - "root path (B)" and "legacy" path (not encoded in UTF8) can't be
>     joined as Unicode without any information about encoding of
>     "legacy" path
> 
> So, we can't access target "legacy" files in this case !
> 
> 
> Paths to subrepos in the workdir, or manual renaming by users, may
> also cause this problem, even if we restrict repo root paths to (A) at
> creation by clone or init.
> 
> 
> It also seems to be a problem that "valid in system code page" is not a
> portable concept across environments: paths that are valid on Japanese
> Windows may not be valid on other systems.
> 
> 
> But sorry, I know only a little about the Windows native API, so please
> tell me if there is a good API to resolve this problem!

Sorry, I'm at a loss here. I don't understand what you are trying to achieve.

I have trouble understanding why you want to access the files and
directories with, for example, "os.listdir('.')". That makes Python use
the ...A Windows API functions ("A" for ANSI).

I think if you want to be able to deal with all kinds of "funny"
characters in paths on Windows, then my impression is that there is
probably no way around using the ...W Windows API functions ("W" is
probably meant for "wide", but in the MSDN docs they label it as "Unicode").
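
To make that concrete with the two example names quoted above: roughly
speaking, a name can only go through the ...A functions if it can be encoded
in the system code page. Here is a rough sketch, assuming cp932 as the code
page (as in the example). The helper name fits_code_page is made up for
illustration; it is not part of Mercurial or Python.

```python
import unicodedata


def fits_code_page(name, codepage="cp932"):
    """Return True if every character of name can be encoded in codepage.

    Hypothetical helper, for illustration only.
    """
    try:
        name.encode(codepage)
    except UnicodeEncodeError:
        return False
    return True


composed = u"\u30d1"                                  # one katakana code point
decomposed = unicodedata.normalize("NFD", composed)   # base char + combining mark
latin = u"\u00c0"                                     # Latin capital A with grave

# Print each name together with whether cp932 can represent it.
for name in (composed, decomposed, latin):
    print(repr(name), fits_code_page(name))
```

Names for which this returns False are the ones I would expect to be
reachable only through the ...W functions.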

So I think you need to use "os.listdir(u'.')" (note the "u"). Passing the
unicode u'.' makes Python use the ...W Windows API functions.
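
A small sketch of that, just to show the calling convention (the helper name
list_names is made up for illustration): pass a unicode directory name and
you get unicode names back. It is written to run on both Python 2 and
Python 3. On Python 3 every str is already unicode, so the distinction only
matters on Python 2.

```python
import os

text_type = type(u"")   # unicode on Python 2, str on Python 3


def list_names(dirpath=u"."):
    """List a directory, insisting on a unicode path argument.

    Hypothetical helper, for illustration only.
    """
    if not isinstance(dirpath, text_type):
        raise TypeError("pass a unicode path, e.g. u'.'")
    return os.listdir(dirpath)


# Each entry comes back as a unicode string, not as a byte string.
for name in list_names(u"."):
    print(repr(name))
```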

And please follow what Matt is saying. Not my ideas. I'm just
commenting, throwing (possibly stupid) thoughts in here. The design
decisions are made by Matt.

