How encoding works in the Mercurial codebase.
This page is intended for developers.
There are three types of string used in Mercurial:
- byte string in unknown encoding (tracked data)
- byte string in local encoding (messages, user input)
- byte string in UTF-8 encoding (repository metadata)
This page describes which type of string to expect where, both on disk and in the code, and which functions manipulate each type.
2. Unknown byte strings
The following are explicitly treated as binary data in an unknown encoding:
- file contents
- file names
These items should be treated as binary data and preserved losslessly wherever possible. Generally speaking, it is impossible to reliably and uniquely identify file type and encoding, thus Mercurial does not attempt to distinguish 'binary' files from 'text' files when storing them and instead aims to always preserve them exactly.
Similarly, for historical reasons, non-ASCII filenames are not necessarily portable from Unix to Windows, and Mercurial does not attempt to 'solve' this problem with transcoding either.
In general, do not attempt to transcode such data to Unicode and back in Mercurial; doing so will result in data loss.
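The data loss mentioned above can be illustrated with a minimal sketch in plain Python 3. It does not use any Mercurial API; the helper name lossy_roundtrip and the sample bytes are hypothetical and chosen only for illustration.

```python
def lossy_roundtrip(data, encoding="utf-8"):
    """Decode bytes to Unicode and encode them back (hypothetical helper).

    Invalid sequences are replaced during decoding, so the result may
    differ from the input.
    """
    return data.decode(encoding, "replace").encode(encoding)


# File contents are arbitrary bytes; this sample is not valid UTF-8.
original = b"caf\xe9 \xff\xfe binary"
roundtripped = lossy_roundtrip(original)

# The invalid bytes were replaced, so the original cannot be recovered.
print(roundtripped == original)
```

Decoding with strict error handling would raise UnicodeDecodeError instead of silently altering the data; neither outcome is acceptable for tracked file contents.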
3. UTF-8 strings
UTF-8 strings are used to store most repository metadata. Unlike repository contents, repository metadata is 'owned and managed' by Mercurial and can be made to conform to its rules. In particular, this includes:
- commit messages stored in the changelog
- user names
The following files are stored in UTF-8:
Most of these are converted to and from local strings in the relevant I/O functions, so that internally the above items are always represented in the local encoding. This restricts UTF-8-aware code to the smallest footprint possible so that the bulk of the code does not need to keep track of what encoding a string is in.
The primary exception to this rule is branch names, which must be preserved as UTF-8 between being read from the dirstate and written to the changelog to avoid transcoding lossage.
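The lossage that motivates this exception can be sketched as follows. This is a simplified illustration in plain Python 3, not Mercurial's code: to_local and from_local are hypothetical stand-ins for the conversion functions, and latin-1 is an assumed local encoding.

```python
def to_local(utf8_bytes, local_encoding):
    """UTF-8 -> local encoding, replacing characters that cannot be represented."""
    return utf8_bytes.decode("utf-8").encode(local_encoding, "replace")


def from_local(local_bytes, local_encoding):
    """Local encoding -> UTF-8."""
    return local_bytes.decode(local_encoding).encode("utf-8")


# A Cyrillic branch name, stored as UTF-8 (written with escapes for portability).
branch_utf8 = "\u0432\u0435\u0442\u043a\u0430".encode("utf-8")

# Converting to a local encoding that lacks these characters loses them...
branch_local = to_local(branch_utf8, "latin-1")

# ...and converting back cannot restore the original name.
branch_back = from_local(branch_local, "latin-1")
print(branch_local, branch_back == branch_utf8)
```

If the round-tripped value were written to the changelog, the branch name would be recorded as a different byte sequence than the one that was read, which is why the UTF-8 form has to be carried through unchanged.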
4. Local strings
Strings not mentioned above are generally assumed to be in the local charset encoding. This includes:
- command line arguments
- configuration files like .hgrc
- prompt input
- commit messages
All user input in the form of command line arguments, configuration files, etc. is assumed to be in the local encoding.
4.1. Internal messages
All internal messages are written in ASCII, which is assumed to be a subset of the local encoding. Where localized string data is available, these strings are translated to the local encoding via gettext.
5. Mixing output
Mercurial frequently mixes output of all three varieties. For instance, the output of 'hg log -p' will contain internal strings in local encoding to mark fields, UTF-8 metadata, and file contents in an unknown encoding. These are managed as follows:
- UTF-8 data is converted to local encoding at the earliest opportunity, generally at read time
- internal ASCII strings are translated to local encoding via gettext() or passed unmodified
- data in unknown encoding (file contents and filenames) are treated as already being in the local encoding for I/O purposes
- resulting strings are combined with typical string formatting and I/O operations
- raw binary output is used with no additional transcoding
Thus, the vast bulk of string operations in Mercurial are done as if they were operating on local strings.
As an example, a patch containing UTF-8 characters may not be entirely human-readable when viewed on a non-UTF-8 terminal, but the generated patch will be correct in the sense that a standard patch tool will be able to apply it and get the right UTF-8 characters in the result. Similarly, 'hg cat' of a binary file will output an exact copy of the binary file, regardless of the current encoding.
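The way the three varieties of string are combined can be sketched roughly as follows. This is an illustrative approximation in plain Python 3, not Mercurial's output code: render_entry, to_local and LOCAL_ENCODING are hypothetical names, and latin-1 is an assumed local encoding.

```python
import io

LOCAL_ENCODING = "latin-1"  # assumed local encoding for this sketch


def to_local(utf8_bytes):
    """Convert UTF-8 metadata to the local encoding (hypothetical stand-in)."""
    return utf8_bytes.decode("utf-8").encode(LOCAL_ENCODING, "replace")


def render_entry(out, user_utf8, file_contents):
    """Write one log-like entry mixing all three kinds of byte string."""
    # Internal ASCII label: passed through unmodified (no translation here).
    label = b"user: "
    # UTF-8 metadata: converted to the local encoding before use.
    user_local = to_local(user_utf8)
    # Combined with ordinary byte string operations.
    out.write(label + user_local + b"\n")
    # File contents in an unknown encoding: written as raw bytes.
    out.write(file_contents)


buf = io.BytesIO()
render_entry(buf, "Andr\u00e9".encode("utf-8"), b"\x00\xff raw\n")
print(buf.getvalue())
```

Note that the file contents reach the output byte for byte, whatever the local encoding is, while only the metadata is transcoded.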
The encoding module defines the following functions:
- fromlocal(): converts a string from the local encoding to UTF-8 for storage, validating it in the process
- tolocal(): converts a string stored as UTF-8 to the local encoding, replacing unknown glyphs
- colwidth(): calculates the width of a local string in terminal columns
Also, encoding.encoding specifies Mercurial's idea of what the current encoding is.
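To make the roles of these functions concrete, here is a simplified sketch in plain Python 3. These are not the real implementations: the names carry a _sketch suffix to mark them as hypothetical, ENCODING merely plays the part of encoding.encoding, and the column width rule (two columns for East Asian wide and fullwidth characters) is an assumption of this sketch.

```python
import unicodedata

ENCODING = "latin-1"  # stands in for encoding.encoding in this sketch


def fromlocal_sketch(s, encoding=ENCODING):
    """Local bytes -> UTF-8 bytes; strict decoding rejects invalid input."""
    return s.decode(encoding, "strict").encode("utf-8")


def tolocal_sketch(s, encoding=ENCODING):
    """UTF-8 bytes -> local bytes, replacing unrepresentable characters."""
    return s.decode("utf-8").encode(encoding, "replace")


def colwidth_sketch(s, encoding=ENCODING):
    """Terminal columns needed to display a local byte string."""
    text = s.decode(encoding, "replace")
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


name_utf8 = "Andr\u00e9".encode("utf-8")
name_local = tolocal_sketch(name_utf8)
print(name_local, fromlocal_sketch(name_local) == name_utf8)
print(colwidth_sketch("\u65e5\u672c".encode("utf-8"), "utf-8"))
```

Callers outside such a module would only ever see byte strings going in and coming out; the intermediate Unicode values stay inside the helper functions.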
7. Unicode strings
Python Unicode objects are only used in the implementation of the above functions and are carefully avoided elsewhere. Do not pass Unicode objects to any Mercurial APIs. Due to Python 2's misguided automatic Unicode to byte string conversion, which assumes ASCII, Unicode objects are likely to work in testing, but break as soon as they contain a non-ASCII character.
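The failure mode can be reproduced with a small sketch. It is written for Python 3, where the conversion has to be spelled out, so the hypothetical helper implicit_convert performs explicitly what Python 2 did behind the scenes using its default ASCII codec.

```python
def implicit_convert(text):
    """Mimic Python 2's implicit unicode -> byte string conversion (ASCII)."""
    return text.encode("ascii")


# ASCII-only values convert fine, so simple tests pass.
print(implicit_convert("default"))

# The first non-ASCII character triggers an error.
try:
    implicit_convert("caf\u00e9")
except UnicodeEncodeError as exc:
    print("conversion failed:", exc.reason)
```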
8. Historical note
Early versions of Mercurial made no effort to transcode metadata, so the tolocal() function has some fallbacks to allow guessing the encoding of strings that don't appear to be Unicode.
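The kind of fallback described here might look roughly like the sketch below. It is only an illustration in plain Python 3: decode_with_fallback is a hypothetical name, and the choice of latin-1 as the fallback encoding is an assumption of this example.

```python
def decode_with_fallback(s, fallback="latin-1"):
    """Decode metadata as UTF-8, guessing a legacy encoding if that fails."""
    try:
        return s.decode("utf-8")
    except UnicodeDecodeError:
        # Old metadata that was never transcoded: guess the fallback encoding.
        return s.decode(fallback)


modern = decode_with_fallback(b"Andr\xc3\xa9")  # valid UTF-8
legacy = decode_with_fallback(b"Andr\xe9")      # not UTF-8, decoded via fallback
print(modern == legacy)
```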