Add a Unicode mode, but keep the bytes mode

Victor Stinner victor.stinner at haypocalc.com
Sat Nov 5 11:10:22 CDT 2011


On Saturday 5 November 2011 11:57:23, Laurens Holst wrote:
> On 5-11-2011 0:48, Victor Stinner wrote:
> > On Friday 4 November 2011 18:20:28, Andrey wrote:
> >> Great work.
> >> 
> >> On Friday, November 4, 2011 1:47:14 PM UTC+1, Victor Stinner wrote:
> >>> The default kind will be bytes until enough third-party tools are
> >>> compatible
> >>> with Unicode (e.g. make).
> 
> In other words: never.

Windows has supported Unicode since Windows 95 (and non-BMP characters since
Windows 2000), but many Windows programs still use the ANSI (bytes) API (e.g.
Mercurial ;-)).

On Mac OS X, the kernel processes filenames as UTF-8, and most programs use
UTF-8 indirectly and so are Unicode compliant.

On UNIX, it really depends on the locale encoding. There are still some old
systems using an encoding other than UTF-8, but all new systems use UTF-8 and
so, like Mac OS X, are Unicode compliant. But even if the system uses the UTF-8
encoding, you may get mojibake if the encoding of a USB key is not correctly
detected, or if you unpack an old archive (e.g. a TAR archive stores filenames
as bytes; if you created your archive on a latin1 system, you must have a
latin1 locale encoding to unpack it correctly).
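
As a rough illustration of the archive case, here is a minimal Python 3 sketch.
The filename is made up, and the raw bytes only stand in for what an archive
would store; no real TAR file is involved.

```python
# A filename as typed on a latin1 system ("café.txt"), kept as raw bytes
# the way an archive format that stores filenames as bytes would keep it.
name = "caf\u00e9.txt"
stored = name.encode("latin-1")            # b'caf\xe9.txt'

# On a UTF-8 system, those bytes are not valid UTF-8.
try:
    stored.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

# The opposite mistake (UTF-8 bytes read as latin1) does not fail,
# it silently produces mojibake instead.
mojibake = name.encode("utf-8").decode("latin-1")
print(decodable, repr(mojibake))
```

With a latin1 locale encoding the stored bytes decode back to the original
name, which is why the locale has to match the system that created the archive.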

So it *is* possible to have a fully Unicode compliant system today... if your
system is well configured, if you are careful, and if you don't have to handle
old content. There are many conditions, but it is possible ;-) And slowly it
is becoming easier to have such a system.

> I think a better point to switch the default would be when systems with
> non-Unicode encoding are a thing of the past.

As with any new feature, it is better to wait for user feedback to improve the
feature and maybe fix bugs before enabling it by default.

It would be too soon to enable it by default right away, because users will
continue to use old Mercurial versions for some time (just as some people are
still using Python 2.4 even though Python 2.7 and 3.2 have been released) and
the new Unicode mode is not fully backward compatible.

> Or to switch it right now,
> and have a little more sensible fallback behaviour on non-Unicode
> systems than ‘you can’t update at all’.

The corner case is not "hg pull -u" but "hg push" (old repository => new 
repository):

create on computer A (new Mercurial)

 * create a new Unicode repository
 * add content with non-ASCII filenames

work on computer B (old Mercurial)

 * clone the repository
 * add a new file with a non-ASCII filename
 * hg ci
 * hg push

On second thought, "hg push" is only a problem if you added new files with
names not decodable from UTF-8. It "works" if your locale encoding is UTF-8 or
if the filename is pure ASCII.

Mac OS X and most Linux distros use a UTF-8 locale encoding, but Windows does
not. So on Windows, with an old Mercurial, you will be limited to ASCII when
you add new files.
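
The condition above ("decodable from UTF-8") can be sketched in Python 3 as
follows. The helper name is_utf8_decodable and the sample filenames are
hypothetical; this is not Mercurial code, only a way to see which names an old
client could safely add.

```python
def is_utf8_decodable(raw_name: bytes) -> bool:
    """Return True if the raw filename bytes are valid UTF-8."""
    try:
        raw_name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Pure ASCII is always valid UTF-8.
ascii_name = b"README.txt"
# "résumé.txt" as written by a system with a UTF-8 locale encoding.
utf8_name = "r\u00e9sum\u00e9.txt".encode("utf-8")
# The same name as written through the cp1252 ANSI code page on Windows.
cp1252_name = "r\u00e9sum\u00e9.txt".encode("cp1252")

for raw in (ascii_name, utf8_name, cp1252_name):
    print(raw, is_utf8_decodable(raw))
```

Only the cp1252 name is rejected, which matches the ASCII-only limitation on
Windows with an old Mercurial.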

> Wouldn’t the behaviour for old client versions be identical to when the
> repository were created on an UTF-8 system? That is, check out fine on
> an UTF-8 system, and get the usual garbling of non-ASCII characters on a
> Latin-1 system.

Yes.

> Another possibility might be to add two configuration options, one
> describes the repository encoding and one the target encoding. Without
> these set, Mercurial is encoding-agnostic (current behaviour), when you
> set the repository encoding it automatically recodes filenames to your
> local system’s encoding, or to the target encoding (if set). I think
> this is similar to what the eol extension does.

It is not so different from my "Unicode mode", and so it has the same
constraints and limitations, except that it has an important advantage: it
helps to have a smoother transition (backward compatibility) if you work in a
homogeneous environment (e.g. only Windows with the cp1252 ANSI code page).
Python embeds most common encodings (e.g. most Windows code pages), so it can
work.
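
A minimal sketch of that recoding step in Python 3, assuming a repository
encoding and a target encoding are both known. The function name
recode_filename and its parameters are hypothetical and do not correspond to
any existing Mercurial option.

```python
def recode_filename(raw: bytes, repo_encoding: str, target_encoding: str) -> bytes:
    """Decode a stored filename and re-encode it for the local system."""
    text = raw.decode(repo_encoding)
    return text.encode(target_encoding)


# "été.txt" stored in a repository declared as cp1252, checked out on UTF-8.
stored = "\u00e9t\u00e9.txt".encode("cp1252")
local = recode_filename(stored, "cp1252", "utf-8")
print(stored, local)

# The limitation: a name that the target encoding cannot represent fails.
cjk = "\u65e5\u672c.txt".encode("utf-8")
try:
    recode_filename(cjk, "utf-8", "cp1252")
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)
```

The second case is the kind of constraint shared with the Unicode mode: the
recoding only works when every filename is representable in the target
encoding.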

Being able to use latin1 (instead of UTF-8) would also help with the corner
case, because every byte string is decodable from latin1.
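
This property is easy to check in Python 3 using the standard "latin-1" codec,
which maps each of the 256 byte values to one code point:

```python
# Every possible byte value, in one byte string.
every_byte = bytes(range(256))

# latin1 decodes it without error and encodes back to the same bytes.
as_text = every_byte.decode("latin-1")
round_trip = as_text.encode("latin-1")

# UTF-8 rejects the same input (bytes such as 0x80 are invalid on their own).
try:
    every_byte.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(len(as_text), round_trip == every_byte, utf8_ok)
```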

It also avoids having to actually convert the content of a repository: if the
"new" encoding is already able to decode all filenames, you don't have to
transcode filenames, and hashes are unchanged.
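
To see why the hashes stay the same, here is a simplified Python 3 sketch. It
hashes a single filename with SHA-1 as a stand-in; how Mercurial actually
computes its hashes is not shown, and the digest helper is hypothetical.

```python
import hashlib


def digest(raw: bytes) -> str:
    """Hash raw filename bytes (a stand-in for hashing stored data)."""
    return hashlib.sha1(raw).hexdigest()


stored = b"caf\xe9.txt"                     # bytes already in the repository

# Declaring the encoding as latin1 leaves the stored bytes untouched.
relabelled = stored.decode("latin-1").encode("latin-1")
# Converting the repository to UTF-8 rewrites the bytes.
transcoded = stored.decode("latin-1").encode("utf-8")

unchanged = digest(stored) == digest(relabelled)
changed = digest(stored) != digest(transcoded)
print(unchanged, changed)
```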

I like your idea :-)

Victor

