[PATCH 1 of 8 RFC] vfs: replace invocation of file APIs of os module by ones via vfs

Sat Jun 16 12:19:10 CDT 2012

On 2012-06-16 16:24, FUJIWARA Katsunori wrote:
> 
> At Sat, 16 Jun 2012 11:07:50 +0200,
> Adrian Buehlmann wrote:
>>
>> On 2012-06-15 20:00, FUJIWARA Katsunori wrote:
>>>
>>> At Fri, 15 Jun 2012 10:31:45 -0500,
>>> Matt Mackall wrote:
>>>>
>>>> On Fri, 2012-06-15 at 23:45 +0900, FUJIWARA Katsunori wrote:
>>>>> # HG changeset patch
>>>>> # User FUJIWARA Katsunori <foozy at lares.dti.ne.jp>
>>>>> # Date 1339768793 -32400
>>>>> # Node ID a14b63be9a04e7fac445fea69bbaf840ca3f4063
>>>>> # Parent  622aa57a90b1d1f09b3204458b087de12ce2de82
>>>>> vfs: replace invocation of file APIs of os module by ones via vfs
>>>>
>>>> You seem to have missed the importance of step 1:
>>>>
>>>> "Rename opener to vfs"
>>>> http://mercurial.selenic.com/wiki/WindowsUTF8Plan#Steps
>>>>
>>>> The whole point of this exercise is to have one (or just a few) central
>>>> objects we route all our file operations through that are attached to
>>>> repository objects so that repository objects can easily switch their
>>>> modes as needed. Conveniently, we have something very much like that
>>>> already: it's called an opener. It's even beginning to grow some of
>>>> these sorts of methods.
>>>>
>>>> In particular, we'll want to take one of these objects, wopener, and
>>>> switch it to UTF-8 mode.. while leaving the other two in native mode.
>>>> Which means we need to be calling methods at a high enough level that we
>>>> know which part of the repository we're operating on...
>>>
>>> Sorry, I misunderstood, because I was also thinking of "create a
>>> filesystem abstraction object in util.py" from "Abstracting filesystem
>>> API for UTF-8 support on Windows", which you posted to devel-ml.
>>>
>>> # http://www.selenic.com/pipermail/mercurial-devel/2011-December/036385.html
>>
>> Have you guys put some thought into how to deal with repo root paths
>> that contain "wide" characters already?
>>
>> Something along the lines of:
>>
>>   C:\Users\AdrianBühlmann\repos\myrepo
>>
>> but with a username using Japanese characters, or something like that?
>>
>> For example, TortoiseHg users may be interested in exploring repos at
>> such funny locations and TortoiseHg will then surely try to create
>> mercurial repo objects for such repos. So we may already have such
>> problems with the repo root (mercurial.localrepo.localrepository.root).
>>
>> Don't we then also have to use the wide Windows API functions for the
>> store side (mercurial/store.py) when accessing repos at such funny
>> locations?
>>
>> Also interesting seem to be other paths, like config files or paths to
>> merge tools. Interesting paths can also originate from registry keys
>> (see 133a7922a900).
>>
>> The working dir side of things (wopener) certainly needs special
>> treatment (Windows "wide character" paths <-> UTF-8 paths internally in
>> Mercurial). But is that enough?
> 
> I also worry about repo root paths.
> 
> Please let me use notations below to explain my understanding:
> 
>   - UTF8(A): utf-8 safe, and including only chars acceptable also for
>              system code page
> 
>   - UTF8(U): utf-8 safe, but including chars unacceptable for system
>              code page
> 
>   - legacy: non utf-8 safe byte sequence
> 
> 
> Combinations of components of the path to "target" (managed files in
> workdir or data files in store) are:
> 
>   A. <UTF8(A)>/<UTF8(A)>/target
> 
>      in this case, both ANSI API and Unicode API can access target,
>      because "<UTF8(A)>" is accessible for both.
> 
>   B. any combination of <UTF8(A)> and <UTF8(U)>
>      - <UTF8(A)>/<UTF8(U)>/target
>      - <UTF8(U)>/<UTF8(A)>/target
>      - <UTF8(U)>/<UTF8(U)>/target
> 
>      in this case, only Unicode API can access target, because ANSI
>      API can't handle "<UTF8(U)>" part.
> 
>   C. <UTF8(A)>/<legacy>/target
> 
>      in this case, only ANSI API can access target, because Unicode
>      API can't handle "<legacy>" part
> 
>   D. <UTF8(U)>/<legacy>/target
> 
>      in this case, no API can access target, because:
> 
>        - ANSI API can't handle "<UTF8(U)>" part
>        - Unicode API can't handle "<legacy>" part
> 
> For performance of the store-side opener and switchability of the
> workdir-side opener, I think that repo root paths should be restricted
> to case (A), but I don't know whether such a restriction is reasonable.
> 
> In addition, if repo root paths are restricted in this way, subrepo
> locations should be restricted too. But the "system code page" can't be
> determined on platforms other than Windows, and it also differs from
> one Windows environment to another.
> 
> What should we do ?
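
One possible reading of the A-D table quoted above, as a rough Python 3
sketch. The names classify_component and usable_apis are made up for
illustration, and cp1252 merely stands in for "the system code page";
none of this is Mercurial code.

```python
def classify_component(raw, codepage="cp1252"):
    """Classify one path component (bytes) using the notation above.

    `codepage` is an assumed stand-in for the system code page.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 at all.
        return "legacy"
    try:
        text.encode(codepage)
    except UnicodeEncodeError:
        # Valid UTF-8, but has chars the code page cannot represent.
        return "UTF8(U)"
    return "UTF8(A)"


def usable_apis(components, codepage="cp1252"):
    """Return which API families can reach a target under `components`."""
    kinds = {classify_component(c, codepage) for c in components}
    apis = {"ANSI", "Unicode"}
    if "UTF8(U)" in kinds:
        apis.discard("ANSI")      # cases B and D
    if "legacy" in kinds:
        apis.discard("Unicode")   # cases C and D
    return apis


ansi_ok = "B\u00fchlmann".encode("utf-8")   # representable in cp1252
wide_only = "\u65e5\u672c".encode("utf-8")  # not representable in cp1252
legacy = b"\x93\xfa\x96\x7b"                # not valid UTF-8

print(sorted(usable_apis([ansi_ok, ansi_ok])))
print(sorted(usable_apis([ansi_ok, wide_only])))
print(sorted(usable_apis([ansi_ok, legacy])))
print(sorted(usable_apis([wide_only, legacy])))
```

The last call yields an empty set, which is case D: a path no single API
can reach.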

I don't understand what you are saying there, but I already said
elsewhere that I largely lack an understanding about all those issues,
so perhaps, this was it already from my side. I can't help you there.

The best I can possibly do is ask stupid questions. And if they
become too stupid, I'll shut up.

I played a bit with my AdrianBühlmann user I made on my Windows 7 box.

  >>> os.getcwdu()
  u'C:\\Users'
  >>> os.listdir(u'.')
  [u'adi', u'adi-p', u'All Users', u'B\xfchlmannAdrian', u'Default',
   u'Default User', u'desktop.ini', u'Public']
  >>> os.listdir(u'B\xfchlmannAdrian/repos')
  [u'myrepo']

So my dumb impression is I might be able to access all those paths if I
trigger Python to use the wide API functions and then deal with those
"unicode" strings (whatever that is exactly).
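
A small self-contained variant of that experiment, written for Python 3
(where the u'' strings are simply str). It only shows that os.listdir
answers in the type it was asked in: text in, text out; bytes in, bytes
out. The helper name list_both_ways and the temporary directory are made
up for the example, and it says nothing about which Windows API Python
ends up calling.

```python
import os
import tempfile


def list_both_ways(directory):
    """List `directory` once with a text path and once with a bytes path."""
    as_text = sorted(os.listdir(directory))
    as_bytes = sorted(os.listdir(os.fsencode(directory)))
    return as_text, as_bytes


with tempfile.TemporaryDirectory() as tmp:
    # Create one entry with a non-ASCII name, as in the session above.
    os.mkdir(os.path.join(tmp, "B\xfchlmannAdrian"))
    text_names, byte_names = list_both_ways(tmp)

print(text_names)   # list of str
print(byte_names)   # list of bytes, in the filesystem encoding
```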

My understanding is that Mercurial wants to know the part of the path
that's under root, i.e. the path with root stripped off (I think that's
exactly the domain of that scmutil.canonpath function).

But my dumb uninformed impression is this whole path juggling needs to
be done with those unicode strings. So, for example, scmutil.canonpath
would have to operate on unicode strings. And perhaps a higher layer
would then convert the relative path to UTF-8 so the higher levels of
mercurial can be shielded from those unicode paths. But I don't see how
anything else can work but using the wide APIs for file system accesses,
which includes the store, as the root may be a "unicode" path already.
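
To make that concrete, here is a rough Python 3 sketch of what I have in
mind: do the root stripping on unicode strings, and only then hand a
UTF-8 relative path to the higher layers. The function name
canonical_utf8 is hypothetical; this is not scmutil.canonpath and it
ignores everything else that function has to care about. It uses ntpath
so that the Windows path rules apply on any platform.

```python
import ntpath


def canonical_utf8(root, path):
    """Return `path` relative to `root` as UTF-8 bytes with '/' separators.

    Both arguments are unicode (str) Windows paths. Hypothetical helper.
    """
    # ntpath.relpath raises ValueError if the drives differ.
    rel = ntpath.relpath(path, root)
    if rel == ntpath.pardir or rel.startswith(ntpath.pardir + ntpath.sep):
        raise ValueError("%r is not under root %r" % (path, root))
    # Only at this boundary does the unicode path become UTF-8 bytes.
    return rel.replace(ntpath.sep, "/").encode("utf-8")


root = "C:\\Users\\B\xfchlmannAdrian\\repos\\myrepo"
print(canonical_utf8(root, root + "\\src\\caf\xe9.txt"))
```

Everything below that boundary (including the store, since root itself
may not fit the code page) would still have to use the unicode path.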