Encoding Strategy

How encoding works in the Mercurial codebase.

/!\ This page is intended for developers.

1. Overview

There are three types of string used in Mercurial: byte strings in an unknown encoding, UTF-8 strings, and local strings.

This page sorts out which type of string can be expected where on disk and in the code and what functions manipulate them.

2. Platform issues

2.1. Linux and Unix

2.2. Mac OS X

2.3. Windows

2.4. Web and XML

2.5. Mercurial assumptions

3. Problems and constraints

3.1. The encoding tracking problem

There are a number of problems with detecting and tracking encoding of files:

Therefore, we do not attempt to convert file contents to match user locales and instead preserve files intact.

3.2. The "makefile" problem

On Unix, if a file such as a script or makefile refers to another filename, it must do so using an encoding that matches the filename encoding. For instance, if a filename is encoded in Latin1, a makefile must also encode that filename in Latin1. Otherwise, a compiler will fail to find the referenced file.

Therefore, we cannot change filename encoding to match the locale of different users, as common tools will fail.
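As a small illustration of why this matters, the sketch below encodes one filename in two encodings and compares the resulting bytes. The filename is hypothetical and chosen only for this example; the point is that Unix matches filenames byte for byte, so a reference stored in one encoding does not match a name stored in the other.

```python
# Hypothetical filename, used only for illustration.
name = "caf\u00e9.c"

latin1_bytes = name.encode("latin-1")
utf8_bytes = name.encode("utf-8")

# The same visible name is two different byte sequences on disk.
print(latin1_bytes)  # b'caf\xe9.c'
print(utf8_bytes)    # b'caf\xc3\xa9.c'

# A makefile holding the UTF-8 bytes will not match a file whose name
# was stored as Latin1 bytes, because the comparison is on raw bytes.
print(latin1_bytes == utf8_bytes)  # False
```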

4. Unknown byte strings

The following are explicitly treated as binary data in an unknown encoding:

These items should be treated as binary data and preserved losslessly wherever possible. Generally speaking, it is impossible to reliably and uniquely identify file type and encoding, thus Mercurial does not attempt to distinguish 'binary' files from 'text' files when storing them and instead aims to always preserve them exactly.

Similarly, for historical reasons, non-ASCII filenames are not necessarily portable from Unix to Windows, and Mercurial does not attempt to 'solve' this problem with transcoding either.

In general, do not attempt to transcode such data to Unicode and back in Mercurial; doing so will result in data loss.
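The data loss can be shown with a minimal sketch in plain Python, independent of any Mercurial code. The sample bytes are made up for the example, and the guess of UTF-8 with replacement of undecodable bytes is an assumption about how a naive transcoder might behave.

```python
# Bytes of unknown origin: not valid UTF-8.
original = b"na\xefve \xff\xfe data"

# Guessing UTF-8 and replacing undecodable bytes destroys information.
as_text = original.decode("utf-8", errors="replace")
round_tripped = as_text.encode("utf-8")
print(round_tripped == original)  # False: the original bytes are gone

# Treating the data as opaque bytes is always lossless.
preserved = bytes(original)
print(preserved == original)  # True
```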

5. UTF-8 strings

UTF-8 strings are used to store most repository metadata. Unlike repository contents, repository metadata is 'owned and managed' by Mercurial and can be made to conform to its rules. In particular, this includes:

The following files are stored in UTF-8:

These are converted to and from local strings in the relevant I/O functions, so that internally the above items are always represented in the local encoding. This restricts UTF-8-aware code to the smallest footprint possible so that the bulk of the code does not need to keep track of what encoding a string is in.
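The boundary conversion described above might look roughly like the following sketch. The helper names to_local and from_local and the fixed LOCAL_ENCODING are assumptions made for this example; they are not Mercurial's actual implementation.

```python
LOCAL_ENCODING = "latin-1"  # assumption: stands in for the user's locale


def to_local(utf8_bytes: bytes) -> bytes:
    """Convert stored UTF-8 metadata to the local encoding (may be lossy)."""
    text = utf8_bytes.decode("utf-8")
    return text.encode(LOCAL_ENCODING, errors="replace")


def from_local(local_bytes: bytes) -> bytes:
    """Convert a local-encoding string to UTF-8 for storage."""
    return local_bytes.decode(LOCAL_ENCODING).encode("utf-8")


stored = "Ren\u00e9".encode("utf-8")   # as written in the repository
internal = to_local(stored)            # what the rest of the code sees
print(internal)                        # b'Ren\xe9'
print(from_local(internal) == stored)  # True
```

Only these two helpers need to know about UTF-8; everything between them handles plain local byte strings.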

6. Local strings

Strings not mentioned above are generally assumed to be in the local charset encoding. This includes:

All user input in the form of command line arguments, configuration files, etc. is assumed to be in the local encoding.

6.1. Internal messages

All internal messages are written in ASCII, which is assumed to be a subset of the local encoding. Where localized string data is available, these strings are translated to the local encoding via gettext.

7. Mixing output

Mercurial frequently mixes output of all three varieties. For instance, the output of 'hg log -p' will contain internal strings in local encoding to mark fields, UTF-8 metadata, and file contents in an unknown encoding. These are managed as follows:

Thus, the vast bulk of string operations in Mercurial are done as if they were operating on local strings.

As an example, a patch containing UTF-8 characters may not be entirely human-readable when viewed on a non-UTF-8 terminal, but the generated patch will be correct in the sense that a standard patch tool will be able to apply it and get the right UTF-8 characters in the result. Similarly, 'hg cat' of a binary file will output an exact copy of the binary file, regardless of the current encoding.

8. Functions

The encoding module defines the following functions:

Also, encoding.encoding specifies Mercurial's idea of what the current encoding is.

9. Round-trip conversion

Some data, such as branch names, are stored locally as UTF-8, read in for processing, then stored in the repository history as UTF-8 again.

This presents difficulties, as we either need to make sure the dozens of places that handle branch names do so in UTF-8 or we need to avoid conversion loss when converting from the local encoding back to UTF-8. In Mercurial post-1.7, this is facilitated by the encoding.localstr class returned by tolocal which caches the original UTF-8 version of a string alongside its local encoding. The fromlocal function can retrieve this string if it's available, which allows lossless round-trip conversion.

/!\ String operations (e.g. strip()) on localstr objects will lose the cached UTF-8 data.
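The caching idea, and the way string operations drop the cache, can be sketched as follows. This is a simplified illustration and not the real encoding.localstr: the class LocalStr, the helpers to_local and from_local, the ASCII local encoding, and the sample branch line are all assumptions made for the example.

```python
LOCAL_ENCODING = "ascii"  # assumption: a locale that cannot represent the name


class LocalStr(bytes):
    """Local-encoding bytes that remember the UTF-8 string they came from."""

    def __new__(cls, utf8: bytes, local: bytes):
        self = super().__new__(cls, local)
        self._utf8 = utf8  # cached original, used for lossless round trips
        return self


def to_local(utf8: bytes) -> bytes:
    text = utf8.decode("utf-8")
    try:
        return text.encode(LOCAL_ENCODING)  # lossless, no cache needed
    except UnicodeEncodeError:
        lossy = text.encode(LOCAL_ENCODING, errors="replace")
        return LocalStr(utf8, lossy)


def from_local(local: bytes) -> bytes:
    if isinstance(local, LocalStr):
        return local._utf8  # retrieve the cached UTF-8 version
    return local.decode(LOCAL_ENCODING).encode("utf-8")


# Hypothetical branch name as read from a file, trailing newline included.
branch_line = "caf\u00e9\n".encode("utf-8")
local = to_local(branch_line)
print(local)                             # b'caf?\n'
print(from_local(local) == branch_line)  # True: the cache makes this lossless

# bytes methods return plain bytes, so the cached UTF-8 data is lost.
stripped = local.strip()
print(type(stripped) is bytes)           # True
print(from_local(stripped))              # b'caf?'
```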

10. Unicode strings

Python Unicode objects are only used in the implementation of the above functions and are carefully avoided elsewhere. Do not pass Unicode objects to any Mercurial APIs. Due to Python's misguided automatic Unicode to byte string conversion, Unicode objects are likely to work in testing, but break as soon as they encounter a non-ASCII character.
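The failure mode can be reproduced explicitly. Python 2 performed this conversion silently with the ASCII codec; the sketch below, written for Python 3, spells the same step out in a hypothetical helper so that both the passing and the failing case are visible.

```python
def implicit_py2_coercion(text: str) -> bytes:
    """Mimic the conversion Python 2 applied silently: encode as ASCII."""
    return text.encode("ascii")


# ASCII-only input works, which is why such bugs survive testing.
print(implicit_py2_coercion("default"))  # b'default'

# The first non-ASCII character triggers an error.
try:
    implicit_py2_coercion("caf\u00e9")
except UnicodeEncodeError as exc:
    print("fails on non-ASCII input:", exc.reason)
```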

11. Filename strategy compatibility matrices

This section discusses different strategies of filename storage and their failure modes. The rows indicate the filenames and contents stored in a repo (Latin1 means "some filenames with Latin1 characters, with file contents also encoded in Latin1"), while the columns indicate the client operating system and configuration (read Windows Latin1 as codepage 1252; we ignore the differences here for simplicity).

11.1. Key

11.2. Mercurial <= 2.0 strategy:

Current versions of Mercurial read and write filenames "as-is" with no attempt to adapt to local encoding or use wide character interfaces.

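||          || Unix ASCII || Unix Latin1 || Unix UTF-8 || Mac UTF-8 || Windows Latin1 || Windows ShiftJIS || Windows UTF-8* ||
|| ASCII    || (./)       || (./)        || (./)       || (./)      || (./)           || (./)             || (./)           ||
|| Latin1   || R          || (./)        || R          || R         || (./)           || RC?              || RC?            ||
|| ShiftJIS || R          || R           || R          || R         || RC?            || (./)             || RC?            ||
|| UTF-8    || R          || R           || (./)       || (./)      || RC?            || RC?              || (./)           ||
|| UTF-8/16 || RB?        || RB?         || RB?        || RB?       || RC?            || RC?              || (./)           ||
|| Various  || R          || R           || R          || R         || R              || R                || R              ||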

11.3. "Transcode everything to/from Unicode and use Windows Unicode API" strategy:

Some other SCMs (SVN, Bazaar) attempt to transcode all filenames to/from Unicode internally. As file contents are not transcoded, files committed with Latin1 contents are checked out with Latin1 contents.

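||          || Unix ASCII || Unix Latin1 || Unix UTF-8 || Mac UTF-8 || Windows Latin1 || Windows ShiftJIS || Windows UTF-8* ||
|| ASCII    || (./)       || (./)        || (./)       || (./)      || (./)           || (./)             || (./)           ||
|| Latin1   || {X}        || (./)        || B?         || B?        || (./)           || C?               || C?             ||
|| ShiftJIS || {X}        || {X}         || B?         || B?        || C?             || (./)             || C?             ||
|| UTF-8    || {X}        || <!> B?      || (./)       || (./)      || (./)           || (./)             || (./)           ||
|| UTF-8/16 || {X}        || <!> B?      || B?         || B?        || (./)           || (./)             || (./)           ||
|| Various  || X-(        || X-(         || X-(        || X-(       || X-(            || X-(              || X-(            ||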

11.4. Future hybrid strategy:

A proposed future version of Mercurial would use Windows Unicode APIs whenever UTF-8 filenames were stored in a repo:

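||          || Unix ASCII || Unix Latin1 || Unix UTF-8 || Mac UTF-8 || Windows Latin1 || Windows ShiftJIS || Windows UTF-8* ||
|| ASCII    || (./)       || (./)        || (./)       || (./)      || (./)           || (./)             || (./)           ||
|| Latin1   || R          || (./)        || R          || R         || (./)           || RC?              || RC?            ||
|| ShiftJIS || R          || R           || R          || R         || RC?            || (./)             || RC?            ||
|| UTF-8    || R          || R           || (./)       || (./)      || (./)           || (./)             || (./)           ||
|| UTF-8/16 || RB?        || RB?         || RB?        || RB?       || (./)           || (./)             || (./)           ||
|| Various  || R          || R           || R          || R         || R              || R                || R              ||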

11.5. Observations

12. Historical note

Early versions of Mercurial made no effort to transcode metadata, so the tolocal() function has some fallbacks to allow guessing the encoding of strings that don't appear to be Unicode.

