Improving file name encoding support on Windows

Manuel Jacob me at
Tue Aug 6 22:14:54 EDT 2019

Recently I sent a mail to the Mercurial user mailing list to ask how 
to get file name encoding interoperability between Linux and Windows.  
It seems that this is currently not easily achievable.

I’ve seen the Windows UTF-8 plan in the Wiki, but it doesn’t seem to be 
implemented, so I decided to revisit the topic and discuss it on this 
mailing list.

Disclaimer: I’m not an expert on Windows, and certainly not an expert on 
Windows filesystem APIs.

First, let me share my interpretation of how file name encodings are 
handled on different levels.

Unix-derived systems use bytes as the native type for file names.

Windows uses Unicode (this mostly means "UTF-16") as the native type for 
file names.  Windows provides a subset of the filesystem API accepting 
bytes (for some weird reason it’s called "ANSI APIs"), but in my 
understanding it’s provided mostly for backward compatibility and its 
use is discouraged.

It has recently become possible to set the code page to UTF-8 on 
Windows 10, but that setting is off by default, marked as "Beta" and 
apparently causes bugs. [1]  Therefore, for this mail, I’ll ignore that 
feature.

On Windows, Python 2 and Python 3 up to Python 3.5 called the bytes API 
iff a Python standard library filesystem function was called with bytes, 
although passing bytes was deprecated in Python 3.3. [2]  Python 3.6 and 
later always call the Unicode filesystem API, using UTF-8 to convert 
between bytes and Unicode. [3]
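
To illustrate the conversion described above, here is a minimal sketch 
using os.fsdecode() and os.fsencode(), which apply the same filesystem 
encoding that the standard library uses for bytes paths.  The file name 
is made up for the example, and the encoding printed depends on the 
platform and Python version, so I don’t show an expected value.

```python
import os
import sys

# The codec Python uses to convert between bytes and str file names.
# Its value depends on the platform, locale and Python version.
fs_encoding = sys.getfilesystemencoding()
print(fs_encoding)

# A hypothetical file name as Mercurial might store it (UTF-8 bytes).
stored = "résumé.txt".encode("utf-8")

# Convert to the str form that a Unicode-native API expects, and back.
as_text = os.fsdecode(stored)
round_tripped = os.fsencode(as_text)
```

The round trip returns the original bytes, but as_text is only the 
intended name if the filesystem encoding matches the encoding the bytes 
were stored in.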

Mercurial stores file names as bytes.  Because the filesystem API on 
Linux is bytes-native, that strategy works fine there.  On Windows, 
passing bytes to the filesystem API has some problems:

- Only the characters in the currently active code page can be used.  I 
think that the current strategy limits the set of usable characters for 
the majority of users, but I didn’t check.
- Filename encoding interoperability between Windows and Unix-derived 
systems is reduced because most users on Unix-derived systems have UTF-8 
set as the system locale, but most Windows users don’t have UTF-8 set as 
the code page.
- Python 3.6 changed the behavior when passing bytes to the filesystem 
functions in the standard library. [3]
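
The first two problems can be reproduced with plain codec calls, 
without touching the filesystem.  This is only a sketch: cp1252 and 
cp932 stand in for whatever code page happens to be active, and 
fits_code_page() is a helper I made up for the example.

```python
def fits_code_page(name, code_page):
    """Return True if every character of name exists in code_page."""
    try:
        name.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False

# A Japanese name cannot be represented in a Western European code page.
print(fits_code_page("日本語.txt", "cp1252"))
print(fits_code_page("日本語.txt", "cp932"))

# Bytes written as UTF-8 (typical on Linux) and read back through a
# code page (typical on Windows) silently turn into a different name.
utf8_bytes = "é.txt".encode("utf-8")
misread = utf8_bytes.decode("cp1252")
print(misread)
```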

Instead, I think that only Unicode strings should be passed to the 
standard library filesystem functions.  This means that the bytes stored 
by Mercurial need to be decoded using some codec, and file names 
returned from the standard library filesystem functions need to be 
encoded using some codec.
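
A minimal sketch of that boundary, assuming UTF-8 as the codec: 
to_fs() and from_fs() are hypothetical helper names, not existing 
Mercurial functions.  Only str is ever handed to the standard library, 
and names coming back are encoded before they reach the rest of the 
code.

```python
import os
import tempfile

def to_fs(stored, encoding="utf-8"):
    """Decode a stored (bytes) file name for the filesystem API."""
    return stored.decode(encoding)

def from_fs(name, encoding="utf-8"):
    """Encode a name returned by the filesystem API for storage."""
    return name.encode(encoding)

# A hypothetical stored file name.
stored = "日本語.txt".encode("utf-8")

with tempfile.TemporaryDirectory() as root:
    # Create the file through the str-based API only.
    path = os.path.join(root, to_fs(stored))
    with open(path, "w"):
        pass
    # Listing with a str argument returns str names; encode them back.
    listed = [from_fs(name) for name in os.listdir(root)]
```

Note that some filesystems normalize Unicode names, so the bytes coming 
back are not guaranteed to equal the stored bytes in every case.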

How should that encoding be chosen?  I see a few possibilities:

- Always use UTF-8.  This is what will happen on Python 3.6 or later if 
we do nothing.  It has the problem that existing non-ASCII file names 
will break on Windows.
- Use UTF-8 if all / some(?) file names in the manifest are in UTF-8 
(this includes ASCII).  This is known as "hybrid strategy" or "Windows 
UTF-8 Plan" in the Wiki.  I’m not sure whether this can be detected 
- Use the encoding defined in a global setting.  People who care about 
interoperability with other systems can set it to whatever the other 
systems are using.
- Store the encoding in the repository (either per manifest or per 
file).  I think that this doesn’t solve the backward compatibility 
problem.
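
For the hybrid option, one conceivable check is to try a strict UTF-8 
decode of every name.  This is a sketch of my reading of the idea, not 
what the Wiki page specifies; pick_encoding() and its fallback argument 
are made up, and cp1252 again stands in for the active code page.

```python
def is_utf8(name):
    """Return True if the bytes are well-formed UTF-8 (ASCII included)."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def pick_encoding(manifest_names, fallback="cp1252"):
    """Choose UTF-8 only if every file name in the manifest decodes."""
    if all(is_utf8(name) for name in manifest_names):
        return "utf-8"
    return fallback

utf8_repo = [b"README", "日本語.txt".encode("utf-8")]
legacy_repo = [b"README", "é.txt".encode("cp1252")]
print(pick_encoding(utf8_repo))
print(pick_encoding(legacy_repo))
```

The check is a heuristic: some byte sequences are valid in both UTF-8 
and a legacy code page, so a name written under a code page can be 
misclassified as UTF-8.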

For my specific use case, always using UTF-8 would be enough.  My 
suggestion would be to add a global setting.  At first, the default 
would be 'mbcs' (which means: use the active code page) with the 
'replace' error handler.  This would be equivalent to the current 
behavior on Python up to version 3.5, but with the advantage that the 
behavior is consistent on all Python versions (even Python 3.6 and 
later).  It would give core code and in-tree and out-of-tree extensions 
time to migrate to the new behavior.  People interested in 
interoperability could set this setting to 'utf-8'.  If someone invests 
the time to implement the hybrid mode, the default could be changed to 
that at some later point.
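
To show what the proposed default would mean in practice, here is a 
sketch of the 'replace' error handler.  The 'mbcs' codec exists only on 
Windows, so cp1252 is used as a stand-in for the active code page; 
encode_name() and decode_name() are hypothetical names, and the 
encoding argument plays the role of the global setting.

```python
def encode_name(name, encoding="cp1252", errors="replace"):
    """Encode a file name returned by the filesystem API."""
    return name.encode(encoding, errors)

def decode_name(raw, encoding="cp1252", errors="replace"):
    """Decode a stored file name before passing it to the filesystem API."""
    return raw.decode(encoding, errors)

# Characters outside the code page are replaced with '?', which is lossy.
lossy = encode_name("日本.txt")
print(lossy)

# Bytes that are undefined in the code page decode to U+FFFD.
print(decode_name(b"a\x81"))

# With the setting switched to 'utf-8', the same name survives intact.
lossless = encode_name("日本.txt", encoding="utf-8")
print(decode_name(lossless, encoding="utf-8"))
```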

