Questions regarding WindowsUTF8 plan

Tue Jun 10 15:45:16 CDT 2014

On Mon, 2014-06-09 at 23:40 +0000, Chinmay Joshi wrote:
> Hello FUJIWARA Katsunori,
> 
> I am currently working on the WindowsUTF8 plan under GSoC. I learnt that you 
> contributed to the WindowsUTF8 plan. I really thank you for your feedback on my 
> patches until now. I understand you have a better idea of this plan, and 
> I have a few queries for which I expect some help from you (or 
> anyone else). Any help will be highly appreciated.
> 
> The one question is regarding the u16vfs class for Windows. Some discussion has 
> taken place on the #mercurial IRC channel. This class is supposed to be derived 
> from vfs and should use "wide APIs internally" and give UTF-8 results in 
> case of a UTF-8 changeset. What I understand from this is using Python's APIs
> with unicode objects, which use the Windows wide APIs, to give UTF-8 results.
> Another solution raised was using Windows-specific win32 APIs. This would
> need a lot of work to match Python's current implementation of the filesystem
> functions used in the vfs class.

Use Python's APIs whenever possible. Please don't call it u16vfs
(because it should never be passed a UTF-16 string). I probably wrote
that on the wiki, but a better name would be utf8vfs. Methods in utf8vfs
should generally follow a model like this:

import os

def listdir(self, path):
    # take a utf-8 encoded byte string, convert it to a unicode() object
    upath = path.decode("utf-8")

    # pass the unicode object to a Python API, which will check the
    # class of the argument and internally use Windows 'wide string'
    # methods to do filesystem operations and return unicode() objects
    # in its result
    uresult = os.listdir(upath)

    # this function gives back a list of unicode() filenames
    # convert the results back to bytestrings in UTF-8
    result = [u.encode('utf-8') for u in uresult]
    return result

Crucially, Mercurial code outside this vfs class should _never_ see a
UTF-16 encoded bytestring OR a unicode() object, nor should it be doing
any of its own encode/decode.
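
To make the boundary rule concrete, here is a rough sketch of how a few
more methods might follow the same model. It is an illustration only, not
the real Mercurial vfs API: the constructor, the _upath helper, and the
exists/write methods are hypothetical names. The point is that decoding
and encoding happen only inside the class, so callers deal purely in
UTF-8 byte strings.

```python
import os


class utf8vfs(object):
    """Sketch only: every path going in or out is a UTF-8 byte string."""

    def __init__(self, base):
        # base is a UTF-8 byte string, like every other path here
        self.base = base.decode("utf-8")

    def _upath(self, path):
        # hypothetical helper: decode once at the boundary; the
        # resulting unicode object never leaves this class
        return os.path.join(self.base, path.decode("utf-8"))

    def exists(self, path):
        return os.path.exists(self._upath(path))

    def write(self, path, data):
        # file contents are raw bytes and are not re-encoded
        with open(self._upath(path), "wb") as fp:
            fp.write(data)

    def listdir(self, path=b""):
        # os.listdir returns unicode names when given a unicode path;
        # encode them so callers only ever see UTF-8 byte strings
        return [u.encode("utf-8") for u in os.listdir(self._upath(path))]
```

A caller would construct it with a byte-string base directory and pass
byte-string names to every method, never a unicode() object.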

> One more issue was raised in today's meet up which is about not passing 
> Unicode objects to any Mercurial APIs (
> http://mercurial.selenic.com/wiki/EncodingStrategy#Unicode_strings).
> 
> As per discussion with mpm on irc, a concern is that people will want to 
> convert their existing non-ASCII repositories to UTF-8. This will not work 
> if previous commits remain unchanged.

You're not reading WindowsUTF8Plan carefully enough; see section 5.1.
Conversion will not rewrite all of history. It converts only the branch
head(s), by renaming the non-UTF-8 files and making a new commit. When
you check out the new commit, Mercurial will say "ah, these files are
all UTF-8, switch to using utf8vfs". But if you switch to an old commit,
it'll keep working the way it currently does.
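
As a loose illustration of the "these files are all UTF-8" decision,
the sketch below checks whether every filename in a changeset decodes
as UTF-8 and picks a vfs label accordingly. This is an assumption about
how such a check could look, not what the plan specifies: all_utf8 and
choose_vfs are hypothetical names, and a pure byte-level check cannot
tell legacy-encoded names that happen to be valid UTF-8 (plain ASCII,
for instance) from real UTF-8 ones.

```python
def all_utf8(filenames):
    """Return True if every byte-string filename decodes as UTF-8."""
    for name in filenames:
        try:
            name.decode("utf-8")
        except UnicodeDecodeError:
            return False
    return True


def choose_vfs(filenames):
    # hypothetical selector: the returned strings are labels only,
    # standing in for the UTF-8 vfs and the existing byte-oriented vfs
    return "utf8vfs" if all_utf8(filenames) else "vfs"
```

An old commit containing a Latin-1 name such as b"caf\xe9.txt" would
fall through to the existing vfs, matching the behaviour described
above for checking out old history.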

-- 
Mathematics is the supreme nostalgia of our time.