Unicode Windows API, Was: Concerns about using Python's ctypes library on Windows

Sat Jul 30 14:03:07 CDT 2011

Adrian Buehlmann wrote, On 07/29/2011 09:48 PM:
> On 2011-07-29 18:37, Andrei Polushin wrote:
>> 1. What if Mercurial will to switch to Unicode Windows APIs?
> This won't happen. Ask Matt why. (But it wouldn't be a problem anyway)

I'm not so sure that it won't happen. There is a problem with cross 
platform unicode and it should be fixed.

Some things obviously won't change. Mercurial will remain backward 
compatible and will keep storing and working with encoded filenames 
exactly like they are represented by traditional unix systems: as a 
sequence of bytes without \0 and with / only used as separator between 
path elements. That works just fine on and between systems that use the 
same encoding, and with the UTF-8 encoding unicode file names are no 
problem. All major platforms - except Windows - now use UTF-8 and it is 
the de facto standard encoding in Mercurial. (Most experienced 
developers are however smart enough to restrict themselves to (a subset 
of) 7-bit ASCII.)
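
To illustrate why the stored bytes only travel well between systems that 
agree on the encoding, here is a small Python sketch. The helper name 
stored_name and the sample file name are made up for the example; this is 
not Mercurial code.

```python
# Illustration only: the same file name gives different byte sequences
# under different encodings, so stored bytes are only portable between
# systems that use the same encoding.

def stored_name(name: str, encoding: str) -> bytes:
    """Encode a file name the way a system using `encoding` would store it.

    Hypothetical helper, not part of Mercurial.
    """
    raw = name.encode(encoding)
    # The stored form is a byte sequence without NUL, with '/' as the
    # only separator between path elements.
    if b"\0" in raw:
        raise ValueError("file names may not contain NUL")
    return raw

name = "docs/r\u00e9sum\u00e9.txt"
utf8_bytes = stored_name(name, "utf-8")
cp1252_bytes = stored_name(name, "cp1252")

# A name stored as UTF-8 and read back on a cp1252 system is garbled.
garbled = utf8_bytes.decode("cp1252")
```

Pure ASCII names encode to the same bytes in both cases, which is why 
restricting names to 7-bit ASCII sidesteps the problem.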

Non-Windows platforms generally work fine and are not going to change, 
so some kind of Windows-specific solution/hack is needed. It could be 
argued that it is unfair to define the problem in such a way that the 
burden is put on Windows, but that is how it is, and that is how we have 
to look at it if we want to be constructive and improve the situation.

This is where I think Mercurial on Windows could and should grow 
_optional_ support for using unicode Windows APIs. The problem on 
Windows is that Mercurial doesn't use UTF-8 APIs - both because Windows 
uses UTF-16 instead of UTF-8 for its unicode APIs, and because Mercurial 
uses the 8-bit API. Instead we could use the UTF-16 API and convert 
to/from UTF-8 at the API level.
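
A rough sketch of what "convert at the API level" could mean, in Python. 
The function names utf8_to_wide and wide_to_utf8 are hypothetical, and 
the actual call into a wide Windows function is left out; only the 
conversion at the boundary is shown.

```python
# Sketch under assumptions: paths are kept as UTF-8 bytes internally and
# are converted to UTF-16 only right before calling a wide Windows API.

def utf8_to_wide(path: bytes) -> bytes:
    """UTF-8 internal path -> UTF-16-LE bytes for a wide API call."""
    # Raises UnicodeDecodeError if the stored name is not valid UTF-8,
    # which is exactly the case for names from old non-UTF-8 repos.
    return path.decode("utf-8").encode("utf-16-le")

def wide_to_utf8(wide: bytes) -> bytes:
    """UTF-16-LE name returned by a wide API -> UTF-8 internal path."""
    return wide.decode("utf-16-le").encode("utf-8")

internal = "src/\u00e6\u00f8\u00e5.txt".encode("utf-8")
wide = utf8_to_wide(internal)
roundtrip = wide_to_utf8(wide)
```

The round trip is lossless for valid UTF-8 names. Names that are not 
valid UTF-8 fail to decode, so a real implementation would need a policy 
for those.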

I think we should acknowledge, support and utilize that local and 
"unknown" encodings all are converging towards UTF-8 for most users. It 
happens automatically on other platforms, but some extra attention and 
hacks are required on Windows.

Note that:

* It will be a bit tricky to introduce unicode on Windows in such a way 
that existing repos keep working the old quirky way while new repos 
use a more cross-platform UTF-8 approach.

* Console input/output of unicode on Windows seems to be a lost cause no 
matter what we do. Users will have to accept some garbage and use 
wildcards (or file://...%xx...) to specify unicode filenames. 
Re-encoding might not be an option for a general solution to console 
output, but keeping everything in UTF-8 until the low-level write 
function where it can be re-encoded (with loss?) to whatever encoding 
the user wants seems to be one of the least broken solutions.

* References to filenames from file content (such as make files or build 
systems) can in principle break if files on Windows no longer are 
created with unreadable UTF-8 in their names. Many build systems are 
however unicode aware, so while the build file might be encoded in UTF-8 
the build system on Windows will reference the file using the UTF-16 
encoded name.

* The fixutf8 extension already implements this UTF-8 approach and seems 
to have some happy users. It could perhaps be promoted to a standard 
extension or it could be used as inspiration for a new and better and 
more complete and maintainable implementation.

* Case-folding and unicode normalization could be considered a separate 
(and almost solved) problem.

* This discussion is mostly about file system access in the working 
directory. All (?) files in .hg have names in plain ASCII and thus do 
not have encoding issues, but they do have other requirements for file 
system semantics.
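
For the console output point above (keep everything in UTF-8 until the 
low-level write, then re-encode with possible loss), a minimal Python 
sketch could look like this. write_console is a made-up name, and cp1252 
just stands in for whatever encoding the user's console uses.

```python
import io

def write_console(stream, data: bytes, encoding: str) -> None:
    """Re-encode UTF-8 output to the console encoding at the last moment.

    Hypothetical helper. Characters the target encoding cannot
    represent are replaced with '?', so the conversion is lossy.
    """
    text = data.decode("utf-8", "replace")
    stream.write(text.encode(encoding, "replace"))

# A BytesIO stands in for the real console stream.
buf = io.BytesIO()
write_console(buf, "caf\u00e9 \u4e2d.txt\n".encode("utf-8"), "cp1252")
```

Here the accented letter survives because cp1252 can represent it, while 
the CJK character is written as "?". That is the kind of garbage users 
would have to accept.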

/Mads