Initial support of Unicode filenames

Victor Stinner victor.stinner at haypocalc.com
Wed Nov 2 19:29:24 CDT 2011


On Saturday 29 October 2011 at 00:58:46, you wrote:
> I'm afraid I've already vetoed about a dozen variants of this suggestion
> over the years. For starters, it is not backward-compatible with
> existing Windows users.

It looks like the main concern is not to fail on undecodable filenames on UNIX. 
Python 3 solves this issue with PEP 383: the surrogateescape error handler. 
Properties of this error handler:

 * os.fsdecode(fn) never fails (for any filesystem encoding)
 * os.fsencode(os.fsdecode(fn)) == fn
 * os.fsdecode(fn) may contain surrogate characters (which are not printable)

(os.fsdecode decodes a bytes filename from the filesystem encoding, and 
os.fsencode encodes a Unicode filename to the filesystem encoding; both are 
new functions in Python 3.2.)

http://www.python.org/dev/peps/pep-0383/
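
To make the properties listed above concrete, here is a minimal Python 3 
sketch. It calls the codec directly with an explicit encoding (UTF-8 is an 
assumption chosen for the example) instead of os.fsdecode/os.fsencode, so the 
result does not depend on the locale of the machine running it. The helper 
name decode_filename is hypothetical.

```python
def decode_filename(raw, encoding="utf-8"):
    """Decode a bytes filename; undecodable bytes become lone surrogates."""
    return raw.decode(encoding, "surrogateescape")


def encode_filename(name, encoding="utf-8"):
    """Encode a filename back; escaped surrogates become the original bytes."""
    return name.encode(encoding, "surrogateescape")


# 0xE9 is "e acute" in Latin-1, but it is not valid UTF-8 on its own.
raw = b"caf\xe9.txt"

name = decode_filename(raw)              # does not raise
has_surrogate = "\udce9" in name         # byte 0xE9 is escaped as U+DCE9
same_bytes = encode_filename(name) == raw
```

On POSIX systems, os.fsdecode and os.fsencode apply the same error handler 
with the filesystem encoding, which is where the round-trip property comes 
from.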

The codecs module API allows us to register our own error handler. The attached 
script is a proof of concept demonstrating that it is possible to implement 
it in Python 2. The ASCII and UTF-8 encoders of Python 2 have some limitations, 
so these encodings need special cases when encoding a filename.

 * Filenames from the filesystem and the command line would be decoded from the 
filesystem encoding using this error handler
 * Filenames would be stored as UTF-8 using this error handler
 * Filenames would be encoded to the filesystem encoding using this error 
handler when accessing the filesystem
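
The three steps above could look roughly like this. It is a Python 3 sketch 
using the built-in surrogateescape handler; the four function names are 
hypothetical, and edge cases (for example escaped bytes that happen to form 
valid UTF-8 once stored) are not handled.

```python
STORAGE_ENCODING = "utf-8"
HANDLER = "surrogateescape"


def from_filesystem(raw, fs_encoding):
    """Step 1: decode a bytes filename coming from the OS or command line."""
    return raw.decode(fs_encoding, HANDLER)


def to_storage(name):
    """Step 2: serialize the filename as UTF-8 for storage."""
    return name.encode(STORAGE_ENCODING, HANDLER)


def from_storage(stored):
    """Read a stored filename back into a Unicode string."""
    return stored.decode(STORAGE_ENCODING, HANDLER)


def to_filesystem(name, fs_encoding):
    """Step 3: encode the filename before accessing the filesystem."""
    return name.encode(fs_encoding, HANDLER)


# An undecodable name on a UTF-8 filesystem survives the whole cycle.
raw = b"caf\xe9.txt"
name = from_filesystem(raw, "utf-8")
stored = to_storage(name)
restored = to_filesystem(from_storage(stored), "utf-8")
```

Here restored is byte-for-byte identical to raw, so the file can still be 
opened even though its name is not valid UTF-8.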

The error handler can be implemented in C for speed (and it is already 
implemented in C in Python 3).

--

If we store filenames as UTF-8, you would be able to share a repository on a 
USB key between two Windows setups using different ANSI code pages (e.g. cp1252 
and cp932). You would also be able to use the full Unicode range on Windows, 
not only a small subset (the ANSI code page). For example, cp1252 contains at 
most 256 code points, versus 1,114,112 for Unicode 6.0.
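
A small Python 3 sketch of the code page problem, assuming a Japanese 
filename as the example. The helper name representable is hypothetical. It 
shows that a name created under cp932 cannot be expressed in cp1252, while 
the UTF-8 form carries it without loss.

```python
def representable(name, codepage):
    """Return True if the filename can be encoded in the given code page."""
    try:
        name.encode(codepage)
    except UnicodeEncodeError:
        return False
    return True


# A filename created on a machine using cp932 (Japanese ANSI code page).
name = "\u30c6\u30b9\u30c8.txt"   # katakana "tesuto" followed by ".txt"

ok_cp932 = representable(name, "cp932")     # native code page
ok_cp1252 = representable(name, "cp1252")   # Western European code page

# Stored as UTF-8, the name is independent of either code page.
stored = name.encode("utf-8")
back_on_cp932 = stored.decode("utf-8").encode("cp932")
```

With the Unicode (wide character) Windows API, the cp1252 machine could 
still use this name, because no conversion to the ANSI code page is needed.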

Well, I don't think I need to list all the advantages of manipulating 
filenames as Unicode.

Victor
-------------- next part --------------
A non-text attachment was scrubbed...
Name: surrogateescape.py
Type: text/x-python
Size: 3360 bytes
Desc: not available
URL: <http://selenic.com/pipermail/mercurial-devel/attachments/20111103/c504fa69/attachment.py>

