Note: This page is primarily intended for developers of Mercurial.

Windows UTF-8 Plan

A plan to make Mercurial on Windows interoperate with UTF-8 elsewhere.

1. Overview

According to EncodingStrategy, Mercurial generally avoids managing the encoding of filenames. This works well on Linux and Mac, where UTF-8 is now a well-supported default, but less well on Windows.

To maximize interoperability while preserving as much backwards compatibility as possible, we should recognize manifests that are in UTF-8 and switch to a Unicode filesystem mode on Windows. This is referred to as the "hybrid strategy" in EncodingStrategy. Under this plan, all internal filename handling is done in UTF-8, and names are converted to and from UTF-16 at a VFS abstraction layer.
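The conversion at the VFS layer can be sketched roughly as follows. This is only an illustration, not Mercurial's actual vfs class: the class name `utf8vfs` and its methods are hypothetical, and the sketch assumes a Python where passing a Unicode string to `open()` and the `os` functions makes them use the platform's native Unicode API (the wide UTF-16 calls on Windows).

```python
import os
import tempfile


class utf8vfs(object):
    """Hypothetical sketch: callers pass UTF-8 byte strings, the OS sees Unicode."""

    def __init__(self, base):
        # base is a Unicode string naming the root directory
        self.base = base

    def _native(self, name):
        # UTF-8 bytes (internal form) -> Unicode path (filesystem form)
        return os.path.join(self.base, name.decode("utf-8"))

    def open(self, name, mode="rb"):
        return open(self._native(name), mode)

    def exists(self, name):
        return os.path.exists(self._native(name))

    def listdir(self):
        # Unicode names from the filesystem -> UTF-8 bytes for internal use
        return sorted(n.encode("utf-8") for n in os.listdir(self.base))


def demo():
    # Write a file with a non-ASCII name through the UTF-8 interface.
    root = tempfile.mkdtemp()
    vfs = utf8vfs(root)
    name = u"\u65e5\u672c\u8a9e.txt".encode("utf-8")
    with vfs.open(name, "wb") as fp:
        fp.write(b"data")
    return vfs.exists(name), vfs.listdir() == [name]
```

The point of the design is that only `_native()` and `listdir()` know about the conversion; everything above the vfs object keeps handling plain UTF-8 byte strings.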

2. Definitions

3. Steps

4. Python interface

Most of Python's APIs accept Unicode objects and, when given them, use Windows' wide (UTF-16) APIs and return Unicode results. The biggest exception is os.getcwd(): it takes no arguments, so it cannot infer the desired result type from its input, and it needs to be replaced by os.getcwdu().
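As a rough illustration of this rule, the sketch below assumes Python 2 semantics for os.getcwdu() and falls back to os.getcwd() on Pythons that lack getcwdu (Python 3, where getcwd() already returns Unicode). The helper names `getcwd_unicode` and `list_unicode` are made up for this example.

```python
import os


def getcwd_unicode():
    # Python 2 has os.getcwdu(); Python 3 removed it because os.getcwd()
    # already returns a Unicode string there.
    getcwdu = getattr(os, "getcwdu", os.getcwd)
    return getcwdu()


def list_unicode(path):
    # A Unicode argument makes os.listdir() return Unicode names.
    return os.listdir(path)
```

Calling `list_unicode(u".")` returns Unicode names because the argument is Unicode; `getcwd_unicode()` needs the explicit replacement because there is no argument to carry that information.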

5. Issues

5.1. Upgrading to UTF-8

Repositories that have all-ASCII filenames will work without change.

Repositories with legacy filenames can be converted by renaming files to UTF-8 and committing. This may require a Linux machine or a clever utility.
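One possible shape for such a utility is sketched below. It is an assumption-laden illustration, not an existing Mercurial tool: the function names are hypothetical, the legacy encoding must be supplied by the user, and it assumes a platform such as Linux where filenames are plain byte strings.

```python
import os


def to_utf8_name(raw, legacy_encoding):
    """Return the UTF-8 form of a filename given as raw bytes."""
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8 (this includes pure ASCII)
    except UnicodeDecodeError:
        return raw.decode(legacy_encoding).encode("utf-8")


def reencode_names(dirpath, legacy_encoding):
    """Rename entries of dirpath (a bytes path) to UTF-8; return the renames."""
    renamed = []
    for old in os.listdir(dirpath):  # bytes path in -> bytes names out
        new = to_utf8_name(old, legacy_encoding)
        if new != old:
            os.rename(os.path.join(dirpath, old), os.path.join(dirpath, new))
            renamed.append((old, new))
    return renamed
```

The check for "already UTF-8" is a heuristic: a legacy name whose bytes happen to form valid UTF-8 would be left untouched, so the list of renames should be reviewed before committing.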

5.2. Console will still be legacy

The console is still restricted to a legacy charset, and Mercurial will continue to avoid transcoding when dealing with the console. Thus, UTF-8 names will be output as UTF-8 byte strings and will appear as mojibake unless code page 65001 (UTF-8) is used. This is identical to the current situation when working with UTF-8 changesets, except that the filenames will be readable on disk.
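The effect can be reproduced in a few lines. The sketch below simulates a console by decoding the raw bytes with a chosen code page; cp1252 is used purely as an example of a legacy charset, and the helper name `as_seen_on_console` is made up.

```python
def as_seen_on_console(name_utf8, console_encoding):
    # The console interprets the raw bytes using its own code page;
    # undecodable bytes are shown as U+FFFD here for illustration.
    return name_utf8.decode(console_encoding, "replace")


name = u"\u00e9t\u00e9.txt".encode("utf-8")   # b'\xc3\xa9t\xc3\xa9.txt'
garbled = as_seen_on_console(name, "cp1252")  # u'\xc3\xa9t\xc3\xa9.txt', i.e. mojibake
intact = as_seen_on_console(name, "utf-8")    # code page 65001 is UTF-8
```

Each two-byte UTF-8 sequence is rendered as two unrelated legacy characters, which is why the name is garbled on a legacy console but intact when the console uses UTF-8.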

Applications like TortoiseHg will be able to deal with this issue.

5.3. Merge between UTF-8 and non-UTF-8 commits

This could create problems. We probably don't want to make merge aware of this issue.

6. Status of progress

6.1. Current step

The "Add methods for all basic filesystem operations to vfs object" and "Update all users to use vfs methods" steps are currently in progress and are being done incrementally; see also Matt's most recent mention of this.

Where needed, the "Replace usage of non-basic methods" step should be done before each piece of "Update all users to use vfs methods" work.

6.2. Status of each file

|| filename || file API || os.path.join ||
|| mercurial/bookmarks.py || (./) || (./) ||
|| mercurial/bundlerepo.py || <!> || (./) ||
|| mercurial/changegroup.py || <!> || (./) ||
|| mercurial/changelog.py || (./) || (./) ||
|| mercurial/context.py || (./) || (./) *1 ||
|| mercurial/hg.py || {X} || {X} ||
|| mercurial/localrepo.py || (./) || {X} *2 ||
|| mercurial/lock.py || (./) || (./) ||
|| mercurial/patch.py || {X} || {X} ||
|| mercurial/repair.py || (./) || (./) ||
|| mercurial/statichttprepo.py || (./) || (./) ||
|| mercurial/store.py || (./) || (./) ||
|| mercurial/transaction.py || {X} *3 || (./) ||

Some other files have also been changed for WindowsUTF8Plan, but only partially (e.g. hgext/shelve.py and mercurial/commands.py).

6.3. Current API of vfs

|| LEGACY function || vfs function || note ||
|| builtin open()/file() || open() || "vfs.open(name)" should be used in newly added code instead of "vfs(name)" ||
|| util.posixfile() || open() || ||
|| os.chmod() || chmod() || ||
|| os.path.exists() || exists() || this shouldn't be used for files in working directory ||
|| util.fstat() || fstat() || this shouldn't be used for files in working directory, because this implies os.stat() ||
|| os.path.isdir() || isdir() || lstat() should be used for multiple examinations ||
|| os.path.isfile() || isfile() || lstat() should be used for multiple examinations ||
|| os.path.islink() || islink() || lstat() should be used for multiple examinations ||
|| os.path.lexists() || lexists() || ||
|| os.lstat() || lstat() || ||
|| util.makedir() || makedir() || this can take "notindexed" argument ||
|| util.makedirs() || makedirs() || this can create directory recursively ||
|| util.makelock() || makelock() || ||
|| os.mkdir() || mkdir() || ||
|| tempfile.mkstemp() || mkstemp() || ||
|| osutil.listdir() || readdir() || API for os.listdir() is not yet provided ||
|| util.readlock() || readlock() || ||
|| util.rename() || rename() || ||
|| os.readlink() || readlink() || ||
|| util.setflags() || setflags() || ||
|| os.stat() || stat() || lstat() should be used for files in working directory ||
|| os.symlink() || symlink() || ||
|| util.unlink() || unlink() || ||
|| util.unlinkpath() || unlinkpath() || ||
|| os.utime() || utime() || ||
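The recurring note "lstat() should be used for multiple examinations" can be illustrated with the standard library alone. The sketch below uses plain os.lstat() rather than the vfs object, and the helper name `classify` is hypothetical; the same pattern would presumably apply to the vfs lstat() method.

```python
import os
import stat
import tempfile


def classify(path):
    """Return (isdir, isfile, islink) for path using a single lstat() call."""
    mode = os.lstat(path).st_mode
    return stat.S_ISDIR(mode), stat.S_ISREG(mode), stat.S_ISLNK(mode)


def demo():
    # Classify a fresh directory and a regular file inside it.
    root = tempfile.mkdtemp()
    path = os.path.join(root, "f")
    with open(path, "wb") as fp:
        fp.write(b"x")
    return classify(root), classify(path)
```

One lstat() call replaces three separate filesystem queries. Note that lstat() does not follow symbolic links, whereas os.path.isdir() and os.path.isfile() do, so the results differ for links.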

7. See also


CategoryNewFeatures

WindowsUTF8Plan (last edited 2014-06-22 17:19:45 by ChinmayJoshi)