Improving file name encoding support on Windows
me at manueljacob.de
Tue Aug 6 22:14:54 EDT 2019
Recently I sent a mail to the Mercurial user mailing list asking how to
achieve filename encoding interoperability between Linux and Windows.
It seems that this is currently not easily achievable.
I’ve seen the Windows UTF-8 plan in the wiki, but it doesn’t seem to be
implemented, so I decided to revisit the topic and discuss it on this
list.
Disclaimer: I’m not an expert on Windows, and certainly not an expert on
Windows filesystem APIs.
First, let me share my interpretation of how file name encodings are
handled on different levels.
Unix-derived systems use bytes as the native type for file names.
Windows uses Unicode (this mostly means "UTF-16") as the native type for
file names. Windows provides a subset of the filesystem API accepting
bytes (for some weird reason it’s called "ANSI APIs"), but in my
understanding it’s provided mostly for backward compatibility and its
use is discouraged.
Recently it became possible to set the code page to UTF-8 on Windows
10, but that setting is off by default, marked as "Beta" and apparently
causes bugs. Therefore, for this mail, I’ll ignore that feature.
On Windows, Python 2 and Python 3 up to Python 3.5 called the bytes API
if and only if a Python standard library filesystem function was called
with bytes, although passing bytes was deprecated as of Python 3.3.
Python 3.6 and later always call the Unicode filesystem API, using
UTF-8 to convert between bytes and Unicode.
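
To make the conversion step concrete, here is a small sketch using the
standard library helpers os.fsencode(), os.fsdecode() and
sys.getfilesystemencoding(). The helper name describe_roundtrip is made
up for this mail. The reported encoding depends on the platform and
Python version; on Windows with Python 3.6 or later I would expect
'utf-8' unless the legacy mode has been enabled.

```python
import os
import sys


def describe_roundtrip(name):
    """Return (filesystem encoding, encoded bytes, decoded str) for name."""
    encoding = sys.getfilesystemencoding()
    encoded = os.fsencode(name)      # str -> bytes, using the filesystem encoding
    decoded = os.fsdecode(encoded)   # bytes -> str, the inverse conversion
    return encoding, encoded, decoded


# Output varies by platform, so no expected value is shown here.
print(describe_roundtrip("example.txt"))
```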
Mercurial stores file names as bytes. Because on Linux, the filesystem
API is bytes-native, that strategy works fine. On Windows, passing
bytes to the filesystem API has some problems:
- Only the characters in the currently active code page can be used. I
think that the current strategy limits the set of usable characters for
the majority of users, but I didn’t check.
- Filename encoding interoperability between Windows and Unix-derived
systems is reduced because most users on Unix-derived systems have UTF-8
set as the system locale, but most Windows users don’t have UTF-8 set as
the code page.
- Python 3.6 changed the behavior when passing bytes to the filesystem
functions in the standard library: the bytes are now decoded as UTF-8
instead of being interpreted in the active code page.
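
The first two problems can be illustrated without touching the
filesystem. The sketch below assumes cp1252 as the active Windows code
page, which is only an example; the file names are made up as well.

```python
name = "Grüße.txt"

utf8_bytes = name.encode("utf-8")      # what a UTF-8 Linux system would store
cp1252_bytes = name.encode("cp1252")   # what a cp1252 Windows system would store


def try_decode(data, encoding):
    """Decode data, returning None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# The same name is stored as different bytes on the two systems.
same_bytes = utf8_bytes == cp1252_bytes

# Bytes committed on Windows are not valid UTF-8 ...
windows_name_on_linux = try_decode(cp1252_bytes, "utf-8")

# ... and bytes committed on Linux decode on Windows, but to a wrong name.
linux_name_on_windows = try_decode(utf8_bytes, "cp1252")

# Characters outside the code page cannot be represented at all.
try:
    "日本語.txt".encode("cp1252")
    representable = True
except UnicodeEncodeError:
    representable = False
```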
Instead, I think that only Unicode strings should be passed to the
standard library filesystem functions. This means that the bytes stored
by Mercurial need to be decoded using some codec, and file names
returned from the standard library filesystem functions need to be
encoded using some codec.
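
A minimal sketch of that idea, wrapping os.listdir(). The function name
listdir_bytes and its parameters are hypothetical and not existing
Mercurial API; how the encoding gets chosen is left open here.

```python
import os


def listdir_bytes(stored_path, encoding, errors="strict"):
    """List a directory given as bytes, returning the entries as bytes.

    Only Unicode strings are passed to and received from the standard
    library; the conversion to and from bytes happens at the boundary.
    """
    path = stored_path.decode(encoding, errors)
    return [entry.encode(encoding, errors) for entry in os.listdir(path)]


# Example call (not executed here): listdir_bytes(b"some/dir", "utf-8")
```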
How should that encoding be chosen? I see a few possibilities:
- Always use UTF-8. This is what will happen on Python 3.6 or later if
we do nothing. It has the problem that existing non-ASCII file names
will break on Windows.
- Use UTF-8 if all / some(?) file names in the manifest are in UTF-8
(this includes ASCII). This is known as "hybrid strategy" or "Windows
UTF-8 Plan" in the Wiki. I’m not sure whether this can be detected
- Use an encoding defined in a global setting. People who care about
interoperability with other systems can set it to whatever the other
systems are using.
- Store the encoding in the repository (either per manifest or per file). I
think that this doesn’t solve the backward compatibility problem.
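
For the hybrid option, one possible detection rule is "use UTF-8 only if
every file name in the manifest is valid UTF-8". The sketch below is my
own guess at such a rule, not what the wiki page specifies; the function
names and the cp1252 fallback are assumptions. Note that the check is
only a heuristic: a name written in a legacy code page can happen to be
valid UTF-8 as well.

```python
def is_utf8(name):
    """Return True if the byte string is valid UTF-8 (ASCII included)."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def pick_encoding(manifest_names, legacy_encoding):
    """Choose UTF-8 if all names are valid UTF-8, else the legacy encoding."""
    if all(is_utf8(name) for name in manifest_names):
        return "utf-8"
    return legacy_encoding


# A manifest with only ASCII and UTF-8 names, and one with a cp1252 name.
utf8_manifest = [b"README", "Grüße.txt".encode("utf-8")]
mixed_manifest = [b"README", "Grüße.txt".encode("cp1252")]

print(pick_encoding(utf8_manifest, "cp1252"))
print(pick_encoding(mixed_manifest, "cp1252"))
```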
For my specific use case, always using UTF-8 would be enough. My
suggestion would be to add a global setting. At first, the default
would be 'mbcs' (which means: use the active code page) with the
'replace' error handler. This would be equivalent to the current
behavior on Python up to version 3.5, but with the advantage that the
behavior is consistent on all Python versions (even Python 3.6 and
later). It would give core code and in-tree and out-of-tree extensions
time to migrate to the new behavior. People interested in
interoperability could set it to 'utf-8'. If someone invests
the time to implement the hybrid mode, the default could be changed to
that at some later point.
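
A rough sketch of how such a setting could be applied. The setting keys
"encoding" and "errors" and all function names are hypothetical. The
'mbcs' codec only exists on Windows, so the sketch falls back to 'utf-8'
elsewhere purely to stay runnable; that fallback is not part of the
proposal.

```python
import sys

# Proposed default: the active code page on Windows. The non-Windows
# fallback exists only so that this sketch runs on other systems.
DEFAULT_ENCODING = "mbcs" if sys.platform == "win32" else "utf-8"
DEFAULT_ERRORS = "replace"


def filename_codec(settings):
    """Return (encoding, errors) from a dict of settings, with defaults."""
    encoding = settings.get("encoding", DEFAULT_ENCODING)
    errors = settings.get("errors", DEFAULT_ERRORS)
    return encoding, errors


def decode_name(stored, settings):
    """Convert bytes stored by Mercurial to a str for the filesystem API."""
    encoding, errors = filename_codec(settings)
    return stored.decode(encoding, errors)


def encode_name(name, settings):
    """Convert a str returned by the filesystem API back to bytes."""
    encoding, errors = filename_codec(settings)
    return name.encode(encoding, errors)


# Someone interested in interoperability would set the encoding to UTF-8.
interop = {"encoding": "utf-8"}
print(encode_name("Grüße.txt", interop))
```

With the 'replace' error handler, names that cannot be converted are
altered rather than causing an exception, so the conversion is not
guaranteed to round-trip.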