Add a Unicode mode, but keep the bytes mode
Victor Stinner
victor.stinner at haypocalc.com
Fri Nov 4 07:47:14 CDT 2011
Hi,
Summary: because we cannot solve all issues with a single data type (bytes or
Unicode), I propose offering two mutually exclusive modes: bytes and Unicode.
People who need full Unicode support can choose the new Unicode mode, whereas
existing repositories will continue to work as before. This email lists the
limitations of each data type (repository "kind").
--
Thanks to the recent discussions, I now have a better idea of the issues
related to "Unicode filenames" (store filenames as UTF-8 and use a Unicode type
in Python). All issues listed below concern non-ASCII filenames. If you only
use ASCII filenames (which is the most common case), you don't have to worry
about these issues :-)
There are two main use cases:
A) Portable project used on any platform (Windows, Linux and Mac OS X) shared
by a lot of people, tools compatible with Unicode filenames
B) Project specific to a platform shared by a small group, typically only on
UNIX, "legacy" tools (incompatible with Unicode filenames)
Unicode has to be used for (A), and bytes have to be used for (B). So I
propose adding a new "Unicode" mode to Mercurial.
--
It will be possible to convert a repository between the two modes under the
following conditions:
* Unicode->bytes requires an encoding able to encode all filenames. E.g. you
cannot convert to Latin1 if a filename contains a Japanese character.
* bytes->Unicode requires an encoding able to decode all filenames. E.g. if
filenames were created on a latin1 system, you cannot convert the repository
from UTF-8 (you will get Unicode decode errors).
If the same computer is used to create and convert the repository, it will
work in both cases (the locale encoding will be used).
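To illustrate the two conditions above, here is a minimal sketch in Python 3.
The helper names (unencodable, undecodable) and the sample filenames are made
up for this example; they are not part of Mercurial.

```python
def unencodable(filenames, encoding):
    """Unicode->bytes check: return the names that `encoding` cannot encode."""
    failures = []
    for name in filenames:
        try:
            name.encode(encoding)
        except UnicodeEncodeError:
            failures.append(name)
    return failures


def undecodable(raw_filenames, encoding):
    """bytes->Unicode check: return the names that `encoding` cannot decode."""
    failures = []
    for raw in raw_filenames:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            failures.append(raw)
    return failures


# Hypothetical repository content, for illustration only.
unicode_names = ["README", "caf\u00e9.txt", "\u65e5\u672c\u8a9e.txt"]
byte_names = [b"README", "caf\u00e9.txt".encode("latin-1")]

# The Japanese filename blocks a conversion to Latin1.
print(unencodable(unicode_names, "latin-1"))
# The latin1 filename blocks a conversion from UTF-8.
print(undecodable(byte_names, "utf-8"))
```

A conversion would only be safe when both lists are empty for the chosen
encoding.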
You will have to use the same mode as everyone else on your project. You
cannot use bytes while others use Unicode. The mode has to be chosen when you
create a new repository, or the repository has to be converted only once, when
everybody agrees (and after some tests).
There is no reason to convert from bytes to Unicode if you don't use
non-ASCII filenames. You may want to move to Unicode if you have mojibake
issues or if you need to support the full Unicode range on Windows.
The default kind will be bytes until enough third-party tools are compatible
with Unicode (e.g. make).
--
Each repository mode has limitations:
* Unicode: you cannot check out a repository on UNIX if your locale encoding
is unable to encode all filenames. E.g. if your locale encoding is ASCII on
Linux, you cannot clone a repository containing non-ASCII filenames.
* bytes: you don't have access to the full Unicode range on Windows
* bytes: mojibake issues (filenames not displayed correctly) depending on your
locale
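The mojibake limitation of the bytes mode can be reproduced in a few lines of
Python 3. This is only a sketch: the filename is an invented example, and
errors="replace" stands in for whatever a terminal or file manager does with
undecodable bytes.

```python
name = "caf\u00e9.txt"

# Filename created on Windows with the cp1252 ANSI code page.
stored_cp1252 = name.encode("cp1252")              # b"caf\xe9.txt"
# The same bytes shown on a Linux system using UTF-8.
shown_on_utf8 = stored_cp1252.decode("utf-8", errors="replace")

# The reverse: created on a UTF-8 system, shown with cp1252.
stored_utf8 = name.encode("utf-8")                 # b"caf\xc3\xa9.txt"
shown_on_cp1252 = stored_utf8.decode("cp1252")

print(ascii(shown_on_utf8))
print(ascii(shown_on_cp1252))
```

In both directions the bytes are unchanged; only the interpretation differs,
which is why storing the encoding (or storing Unicode) removes the problem.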
--
Summary of all issues related to filenames:
"Makefile": if the content of a file refers to a filename stored as bytes,
that reference cannot be "transcoded" between two computers.
=> continue to use the bytes mode (until you solve these issues?)
Mojibake: filenames are currently stored as bytes without the encoding
information. If a filename was created on Windows with the cp1252 ANSI code
page or on Linux with the latin1 locale encoding, it is not displayed correctly
on a Windows or Linux system using a different code page/locale encoding (e.g.
UTF-8 on Linux).
=> convert your repository to Unicode
If filenames are stored as Unicode and your locale encoding cannot encode them,
you cannot check out the repository.
=> continue to use the bytes mode, change your locale encoding or rename files
Mac OS X normalizes filenames to a variant of the decomposed form (NFD) when
the filesystem is HFS+.
=> Unicode filenames will be normalized
--
Now some technical details.
In the Python source code, it is not a good idea to have two versions of each
function, one to process bytes filenames, one to process Unicode filenames. I
suggest always using the Unicode type because:
- we can store any bytes in Unicode using the ASCII encoding and the
surrogateescape error handler (PEP 383)
- you don't need to store the encoding of a Unicode string, because the
charset is known (it's the Universal Character Set of Unicode)
- in Python 3, it's more practical to manipulate Unicode than bytes
We could instead use bytes everywhere by encoding Unicode to UTF-8, but that
would be more error-prone because you have to be very careful not to
concatenate two byte strings of different encodings.
So non-ASCII bytes will be stored in memory as surrogates in the U+DC80-U+DCFF
range, whereas ASCII characters will be stored as Unicode characters
(U+0000-U+007F range). On the disk, the filenames will be stored as bytes.
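A short Python 3 sketch of this in-memory representation, using the ASCII
codec with the surrogateescape error handler from PEP 383. The filename is an
invented example (latin1 bytes whose encoding is assumed to be unknown).

```python
raw = b"caf\xe9.txt"  # bytes as read from disk, encoding unknown

# ASCII bytes become U+0000-U+007F; byte 0xE9 becomes the surrogate U+DCE9.
text = raw.decode("ascii", errors="surrogateescape")
print(ascii(text))

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("ascii", errors="surrogateescape")
print(restored == raw)

# A strict codec refuses lone surrogates, so such a string cannot be
# written out as UTF-8 by accident.
try:
    text.encode("utf-8")
    strict_utf8_ok = True
except UnicodeEncodeError:
    strict_utf8_ok = False
```

The round trip is lossless, which is what allows a single Unicode code path to
carry filenames of the bytes mode.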
A global flag (maybe something like "unicode" in .hg/requires?) would indicate
if we use bytes or Unicode.
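As a rough sketch of how such a flag could be read, assuming the requirement
is literally named "unicode" (only a suggestion at this point) and that the
requires file lists one requirement per line. The function name
repository_mode is hypothetical.

```python
def repository_mode(requires_text):
    """Return "unicode" if the hypothetical flag is listed, else "bytes"."""
    requirements = {
        line.strip() for line in requires_text.splitlines() if line.strip()
    }
    return "unicode" if "unicode" in requirements else "bytes"


# Sample file contents, for illustration only.
print(repository_mode("revlogv1\nstore\nfncache\n"))
print(repository_mode("revlogv1\nstore\nunicode\n"))
```

An older client that does not know the new requirement would refuse to open
the repository, which avoids mixing the two modes by accident.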
(Unicode mode) Filenames will be normalized to NFC when the content of a
directory is listed or when you pass a filename on the command line. So a
checkout will pass filenames normalized to NFC to the kernel. On Linux, Windows
and Mac OS X, keyboards produce precomposed characters (NFC), so it's better to
use NFC. On Mac OS X, the kernel normalizes filenames to its variant of NFD.
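The difference between the two forms can be seen with the standard unicodedata
module. This is only a sketch: the standard NFD used here is an approximation
of what HFS+ does (Mac OS X uses its own variant), and the filename is an
invented example.

```python
import unicodedata

typed = "caf\u00e9.txt"  # precomposed form, as produced by a keyboard

# Roughly what a directory listing on HFS+ would return: "e" + U+0301.
listed = unicodedata.normalize("NFD", typed)
print(typed == listed)          # same name for the user, different strings
print(len(typed), len(listed))

# Normalizing both sides to NFC makes the comparison reliable again.
same_after_nfc = (
    unicodedata.normalize("NFC", typed)
    == unicodedata.normalize("NFC", listed)
)
print(same_after_nfc)
```

Without this normalization, a filename typed on the command line would not
match the name returned by the filesystem on Mac OS X.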
Victor
More information about the Mercurial-devel
mailing list