Add a Unicode mode, but keep the bytes mode

Fri Nov 4 07:47:14 CDT 2011

Hi,

Summary: because we cannot solve all issues with a single data type (bytes or 
Unicode), I propose to offer two mutually exclusive modes: bytes and Unicode. 
People who need full Unicode support can choose the new Unicode mode, whereas 
existing repositories will continue to work as before. This email lists the 
limitations of each data type (repository "kind").

--

Thanks to the recent discussions, I now have a better idea of the issues 
related to "Unicode filenames" (store filenames as UTF-8 and use a Unicode type 
in Python). All issues listed below concern non-ASCII filenames. If you only 
use ASCII filenames (which is the most common case), you don't have to worry 
about these issues :-)

There are two main use cases:

 A) Portable project used on any platform (Windows, Linux and Mac OS X), shared 
by a lot of people, with tools compatible with Unicode filenames
 B) Project specific to a platform, shared by a small group, typically only on 
UNIX, with "legacy" tools (incompatible with Unicode filenames)

Unicode has to be used for (A), and bytes have to be used for (B). So I 
propose to add a new "Unicode" mode to Mercurial.

--

It will be possible to convert a repository between the two modes under the 
following conditions:

 * Unicode->bytes requires an encoding able to encode all filenames. E.g. you 
cannot convert to Latin1 if a filename contains a Japanese character.

 * bytes->Unicode requires an encoding able to decode all filenames. E.g. if 
filenames were created on a Latin1 system, you cannot convert the repository 
by decoding from UTF-8 (you will get Unicode decode errors).

If the same computer is used to create and convert the repository, it will 
work in both cases (the locale encoding will be used).
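
The two conversion conditions above can be checked before converting anything. 
The following is only a minimal sketch, assuming Python 3; the helper names 
can_encode_all and can_decode_all are hypothetical and are not part of 
Mercurial.

```python
def can_encode_all(filenames, encoding):
    """Return True if every Unicode filename can be encoded with `encoding`."""
    try:
        for name in filenames:
            name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def can_decode_all(raw_filenames, encoding):
    """Return True if every bytes filename can be decoded with `encoding`."""
    try:
        for raw in raw_filenames:
            raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Unicode->bytes: Latin1 cannot represent Japanese characters.
unicode_names = ["caf\u00e9.txt", "\u65e5\u672c\u8a9e.txt"]
latin1_ok = can_encode_all(unicode_names, "latin-1")
utf8_ok = can_encode_all(unicode_names, "utf-8")

# bytes->Unicode: a name created on a Latin1 system is not valid UTF-8.
raw_names = ["caf\u00e9.txt".encode("latin-1")]
from_utf8_ok = can_decode_all(raw_names, "utf-8")
from_latin1_ok = can_decode_all(raw_names, "latin-1")
```

Note that a successful decode does not prove the encoding was the right one: 
Latin1 can decode any byte sequence, so the check only catches the cases where 
conversion is impossible.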

You will have to use the same mode as everyone else on your project. You 
cannot use bytes while others use Unicode. The mode has to be chosen when you 
create a new repository, or the repository has to be converted only once, when 
everybody agrees (and after some tests).

There is no reason to convert from bytes to Unicode if you don't manipulate 
non-ASCII filenames. You may want to move to Unicode if you have mojibake 
issues or if you need to support the full Unicode range on Windows.

The default kind will be bytes until enough third-party tools are compatible 
with Unicode (e.g. make).

--

Each repository mode has limitations:

 * Unicode: you cannot check out a repository on UNIX if your locale encoding 
is unable to encode all filenames. E.g. if your locale encoding is ASCII on 
Linux, you cannot clone a repository containing non-ASCII filenames.

 * bytes: you don't have access to the full Unicode range on Windows

 * bytes: mojibake issues (filenames not displayed correctly) depending on your 
locale

--

Summary of all issues related to filenames.

"Makefile": if a file contains a filename stored as bytes, you cannot "transcode" 
filenames between two computers.
=> continue to use the bytes mode (until you solve these issues?)

Mojibake: filenames are currently stored as bytes without the encoding 
information. If a filename was created on Windows with the cp1252 ANSI code 
page or on Linux with the Latin1 locale encoding, it is not displayed correctly 
on Windows or Linux using a different code page/locale encoding (e.g. UTF-8 on 
Linux).
=> convert your repository to Unicode
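
To make the mojibake case concrete, here is a small illustration, assuming 
Python 3. The filename is an invented example; the point is only that the same 
bytes give different text depending on the encoding the reader guesses.

```python
# A filename created on a Windows machine using the cp1252 ANSI code page.
original = "r\u00e9sum\u00e9.txt"
stored = original.encode("cp1252")  # bytes stored without encoding information

# A Linux machine with a UTF-8 locale has to guess the encoding. Strict
# decoding would fail, and decoding with replacement loses the accents.
shown_on_utf8 = stored.decode("utf-8", "replace")

# The reverse case: a name created with UTF-8, displayed using cp1252.
stored_utf8 = original.encode("utf-8")
shown_on_cp1252 = stored_utf8.decode("cp1252")
```

In both directions the bytes are intact, but the displayed name no longer 
matches what the author typed.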

If filenames are stored as Unicode and your locale encoding cannot encode them, 
you cannot check out the repository.
=> continue to use the bytes mode, change your locale encoding or rename files

Mac OS X normalizes filenames to a variant of the decomposed form (NFD) when 
the filesystem is HFS+.
=> Unicode filenames will be normalized

--

Now some technical details.

In the Python source code, it is not a good idea to have two versions of each 
function, one to process bytes filenames, one to process Unicode filenames. I 
suggest always using the Unicode type because:

 - we can store any bytes in Unicode using the ASCII encoding and the 
surrogateescape error handler (PEP 383)
 - you don't need to store the encoding of a Unicode string, because the 
charset is known (it's the Universal Character Set of Unicode)
 - in Python 3, it's more practical to manipulate Unicode than bytes

We might use bytes by encoding Unicode to UTF-8, but it would be more 
error-prone because you have to be very careful not to concatenate two byte 
strings of different encodings.

So non-ASCII bytes will be stored in memory as surrogates in the U+DC80-U+DCFF 
range, whereas ASCII characters will be stored as Unicode characters 
(U+0000-U+007F range). On the disk, the filenames will be stored as bytes.
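
As an illustration of the PEP 383 round trip described above, here is a short 
sketch assuming Python 3 (where the surrogateescape error handler is 
available). The filename is an invented example.

```python
raw = b"caf\xe9.txt"  # bytes from the disk, encoding unknown

# Decode: ASCII bytes map to U+0000-U+007F, and each non-ASCII byte 0xNN
# becomes the lone surrogate U+DCNN.
name = raw.decode("ascii", "surrogateescape")

# Encode with the same codec and error handler to get the exact bytes back.
roundtrip = name.encode("ascii", "surrogateescape")

# A strict encoder rejects lone surrogates, so an escaped name cannot be
# written out as UTF-8 by accident.
try:
    name.encode("utf-8")
    strict_utf8_ok = True
except UnicodeEncodeError:
    strict_utf8_ok = False
```

The in-memory value is a Unicode string, so a single code path can handle it, 
while the on-disk bytes are preserved unchanged.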

A global flag (maybe something like "unicode" in .hg/requires?) would indicate 
whether we use bytes or Unicode.
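
A sketch of how such a flag might be read, assuming Python 3. The "unicode" 
requirement name is only the tentative suggestion above, and filename_mode is 
a hypothetical helper, not an existing Mercurial function. It takes the text 
of a requires file, which lists one requirement per line.

```python
def filename_mode(requires_text):
    """Return "unicode" if the hypothetical flag is listed, else "bytes"."""
    requirements = {line.strip() for line in requires_text.splitlines()}
    # "unicode" is an assumed requirement name taken from the proposal.
    return "unicode" if "unicode" in requirements else "bytes"


# Example contents of a requires file, without and with the assumed flag.
legacy_mode = filename_mode("revlogv1\nstore\nfncache\n")
new_mode = filename_mode("revlogv1\nstore\nfncache\nunicode\n")
```

Because an older client refuses a repository whose requirements it does not 
know, a flag of this kind would also keep bytes-only clients from opening a 
Unicode repository.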

(Unicode mode) Filenames will be normalized to NFC when a directory content is 
listed or when you pass a filename on the command line. So a checkout will pass 
filenames normalized to NFC to the kernel. On Linux, Windows and Mac OS X, 
keyboards produce precomposed characters (NFC), so it's better to use NFC. On 
Mac OS X, the kernel normalizes the filenames to its variant of NFD.
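
The normalization step could look like the following sketch, assuming Python 3 
and the standard unicodedata module. normalize_filename is a hypothetical 
helper name. The decomposed name below uses standard NFD, which is close to 
but not identical to the variant HFS+ uses.

```python
import unicodedata


def normalize_filename(name):
    """Normalize a Unicode filename to NFC (precomposed form)."""
    return unicodedata.normalize("NFC", name)


# The same visible name in two forms.
typed_name = "caf\u00e9.txt"    # precomposed, as typed on a keyboard
listed_name = "cafe\u0301.txt"  # decomposed: "e" + COMBINING ACUTE ACCENT

# Without normalization the two strings differ, so a lookup would fail.
equal_before = (typed_name == listed_name)

# After normalizing both sides to NFC they compare equal.
equal_after = normalize_filename(typed_name) == normalize_filename(listed_name)
```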

Victor