Initial support of Unicode filenames
Victor Stinner
victor.stinner at haypocalc.com
Wed Nov 2 19:29:24 CDT 2011
On Saturday 29 October 2011 00:58:46, you wrote:
> I'm afraid I've already vetoed about a dozen variants of this suggestion
> over the years. For starters, it is not backward-compatible with
> existing Windows users.
It looks like the main concern is not failing on undecodable filenames on UNIX.
Python 3 has PEP 383 to solve this issue: the surrogateescape
error handler. Properties of this error handler:
* os.fsdecode(fn) never fails (for any filesystem encoding)
* os.fsencode(os.fsdecode(fn)) == fn
* os.fsdecode(fn) may contain surrogate characters (which are not printable)
(os.fsdecode decodes a bytes filename from the filesystem encoding, and os.fsencode
encodes a Unicode filename to the filesystem encoding; these are new functions
in Python 3.2.)
http://www.python.org/dev/peps/pep-0383/
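The three properties above can be checked directly in Python 3. The following is only a small sketch: it passes the codec and the error handler explicitly, assuming UTF-8 as the filesystem encoding, so that the result does not depend on the local locale. On UNIX, os.fsdecode and os.fsencode apply the same handler with the real filesystem encoding. The sample filename is made up.

```python
# Sketch: the PEP 383 properties, with UTF-8 assumed as the filesystem encoding.
raw = b"caf\xe9.txt"  # hypothetical filename: Latin-1 bytes, invalid as UTF-8

# Decoding never fails: the undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
roundtrip = name.encode("utf-8", "surrogateescape")

# The decoded name is not printable as-is: a strict encoder rejects the surrogate.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```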
The codecs module API allows us to register our own error handler. The attached
script is a proof of concept demonstrating that it is possible to implement
it in Python 2. The ASCII and UTF-8 encoders of Python 2 have some limitations,
so these encodings need special cases when encoding a filename.
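To illustrate the registration mechanism, here is a rough sketch of such a handler. It is not the attached script: it is written for Python 3 so that it runs on a current interpreter, and it leaves out the Python 2 special cases for ASCII and UTF-8 mentioned above. The handler name "sketch_escape" and the function name are made up for this example.

```python
import codecs

def escape_bytes(exc):
    """Hypothetical handler mimicking surrogateescape (Python 3 only)."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # Map each undecodable byte 0xNN to the lone surrogate U+DCNN.
        return "".join(chr(0xDC00 + b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        out = bytearray()
        for ch in exc.object[exc.start:exc.end]:
            code = ord(ch)
            if not 0xDC80 <= code <= 0xDCFF:
                raise exc  # a genuinely unencodable character
            # Map the surrogate U+DCNN back to the original byte 0xNN.
            out.append(code - 0xDC00)
        return bytes(out), exc.end
    raise exc

# codecs.register_error makes the handler usable by name in decode()/encode().
codecs.register_error("sketch_escape", escape_bytes)

raw = b"a\xffb"
text = raw.decode("ascii", "sketch_escape")
back = text.encode("ascii", "sketch_escape")
```

Once registered, the handler is selected by name exactly like the built-in "strict" or "replace" handlers.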
* Filenames from the filesystem and the command line would be decoded from the
filesystem encoding using this error handler
* Filenames would be stored as UTF-8 using this error handler
* Filenames would be encoded to the filesystem encoding using this error
handler when accessing the filesystem
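The three steps listed above could look roughly like this. It is a sketch in Python 3 using the built-in surrogateescape handler; the function names are hypothetical and are not Mercurial APIs, and the filesystem encoding is passed in explicitly instead of being read from the system.

```python
HANDLER = "surrogateescape"

def from_filesystem(raw, fs_encoding):
    """Decode a bytes filename coming from the filesystem or command line."""
    return raw.decode(fs_encoding, HANDLER)

def to_storage(name):
    """Encode a Unicode filename for storage as UTF-8."""
    return name.encode("utf-8", HANDLER)

def from_storage(stored):
    """Decode a stored UTF-8 filename back to Unicode."""
    return stored.decode("utf-8", HANDLER)

def to_filesystem(name, fs_encoding):
    """Encode a Unicode filename for a filesystem call."""
    return name.encode(fs_encoding, HANDLER)

# A name read on a cp1252 system, stored, then written out on a UTF-8 system.
stored = to_storage(from_filesystem(b"caf\xe9.txt", "cp1252"))
on_utf8 = to_filesystem(from_storage(stored), "utf-8")

# An undecodable name on a UTF-8 system still round-trips byte for byte.
odd = b"bad\xff.txt"
odd_back = to_filesystem(
    from_storage(to_storage(from_filesystem(odd, "utf-8"))), "utf-8"
)
```

Note that an undecodable name is stored as its original bytes, so the stored form is only valid UTF-8 when the name was decodable in the first place.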
The error handler can be implemented in C for speed (and it is already
implemented in C in Python 3).
--
If we store filenames as UTF-8, you would be able to share a repository on a
USB key between two Windows setups using different ANSI code pages (e.g. cp1252
and cp932). You would also be able to use the full Unicode range on Windows,
not only a small subset (the ANSI code page). For example, cp1252 contains 256
code points, whereas Unicode 6.0 code points go up to 1,114,111 (U+10FFFF).
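The code page problem can be seen with a few lines of Python 3. This is just an illustration with a made-up Japanese filename: it fits in cp932 and in UTF-8, but not in cp1252, and its cp932 bytes read as cp1252 give a different name.

```python
import sys

name = "\u30c6\u30b9\u30c8.txt"  # hypothetical Japanese filename (katakana "tesuto")

cp932_bytes = name.encode("cp932")  # representable in the Japanese code page
utf8_bytes = name.encode("utf-8")   # always representable in UTF-8

# The Western code page cannot express the name at all.
try:
    name.encode("cp1252")
    fits_cp1252 = True
except UnicodeEncodeError:
    fits_cp1252 = False

# Reading the cp932 bytes as cp1252 yields a different, garbled name.
garbled = cp932_bytes.decode("cp1252")

# Highest code point Python 3 can represent: U+10FFFF.
highest = sys.maxunicode
```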
Well, I don't think I need to list all the advantages of manipulating filenames
as Unicode.
Victor
-------------- next part --------------
A non-text attachment was scrubbed...
Name: surrogateescape.py
Type: text/x-python
Size: 3360 bytes
Desc: not available
URL: <http://selenic.com/pipermail/mercurial-devel/attachments/20111103/c504fa69/attachment.py>