Differences between revisions 1 and 14 (spanning 13 versions)
Revision 1 as of 2010-11-23 07:45:30
Size: 5226
Editor: mpm
Comment:
Revision 14 as of 2011-11-15 07:52:53
Size: 13129
Editor: mpm
Comment:
== Platform issues ==
=== Linux and Unix ===
 * kernel and native filesystems are encoding-transparent
 * filesystem APIs are UTF-8-compatible, but accept arbitrary encodings
 * multiple encodings can exist for file names on the same system
 * UTF-8 is the de facto standard console and text file encoding on modern systems, though other encodings are still common
 * common tools like '`make(1)`' are designed to assume file encoding and contents match, so UTF-8 filenames in files will be used to find UTF-8 filenames on disk

=== Mac OS X ===
 * kernel is encoding-transparent
 * native HFS+ filesystem encodes filenames in a non-standard normalized form of UTF-8
 * filesystem uses a Unicode-aware case-folding algorithm by default
 * kernel I/O interfaces accept UTF-8 filenames
 * resulting filenames on disk may not match (strcmp) with the names supplied at creation time due to normalization
 * Most tools assume file encoding and contents match
 * UTF-8 is the de facto standard console and text file encoding
 * Legacy tools may use !MacRoman

=== Windows ===
 * kernel has several different incompatible 8-bit encoding regimes:
  * default encoding used in the GUI
  * default encoding used in the filesystem
  * default (legacy) encoding used in the console (cmd.exe)
 * kernel has a mix of byte-width and wide character APIs
 * kernel and console environment have basically no support for UTF-8 filename I/O or character display
 * in some circumstances, kernel will accept non-ASCII filenames, list them as different names,
  . and fail to open the file under the original name
 * filesystem uses a highly obscure Unicode-aware case-folding algorithm by default
 * Many tools attempt to do transcoding of file contents from the local encoding to UTF-16 before passing it off to the filesystem
 * UTF-16 text files are occasionally found
 * Multi-byte encodings like Shift-JIS cause trouble here because the "\" byte (0x5C) can also appear as the second byte of a two-byte character, making it ambiguous as a path separator

=== Web and XML ===
 * URLs are encoded in ASCII with %-escaping to ISO-Latin, according to RFC1738
 * translation of URLs to filesystem paths is webserver-dependent
 * HTML defaults to ISO-Latin, but may contain encoding specifiers or Unicode entities
 * XML assumes a subset of UTF-8

=== Mercurial assumptions ===
 * non-ASCII filenames are not reliably portable between systems in general
 * the "makefile issue" (matching of file content and name encoding) means that in general, we must attempt to preserve filename encoding
 * On Windows, we prefer the 8-bit encoding of the GUI environment to that of the console to be compatible with typical editors

== Problems and constraints ==
=== The encoding tracking problem ===
There are a number of problems with detecting and tracking encoding of files:

 * With the exception of UTF-8 and ASCII, encodings cannot be reliably detected
 * Files may not be in a known or valid encoding, or may be using multiple encodings
 * Experience has shown that users will not correctly configure encodings until after a problem is created in permanent history
 * Conversion to and from Unicode may not perfectly preserve file contents
 * Many projects may intentionally contain files in different encodings, so localizing encoding will be unhelpful

Therefore, we do not attempt to convert file contents to match user locales and instead preserve files intact.

=== The "makefile" problem ===
On Unix, if a file such as a script or makefile refers to another filename, it must do so using an encoding that matches the filename encoding. For instance, if a filename is encoded in Latin1, a makefile must also encode that filename in Latin1. Otherwise, a compiler will fail to find the referenced file.

Therefore, we cannot change filename encoding to match the locale of different users, as common tools will fail.

The following are explicitly treated as binary data in an unknown encoding:
These items should be treated as binary data and preserved losslessly wherever possible. Generally speaking, it is impossible to reliably and uniquely identify file type and encoding, thus Mercurial does not attempt to distinguish 'binary' files from 'text' files when storing them and instead aims to always preserve them exactly.

Most of these are converted to and from local strings in the relevant I/O functions, so that internally the above items are always represented in the local encoding. This restricts UTF-8-aware code to the smallest footprint possible so that the bulk of the code does not need to keep track of what encoding a string is in.

The primary exception to this rule is branch names, which must be preserved as UTF-8 between being read from the dirstate and written to the changelog to avoid transcoding lossage.
 * .hg/bookmarks

These are converted to and from local strings in the relevant I/O functions, so that internally the above items are always represented in the local encoding. This restricts UTF-8-aware code to the smallest footprint possible so that the bulk of the code does not need to keep track of what encoding a string is in.
 * `colwidth()`: calculates the width of a local string in terminal columns
== Round-trip conversion ==
Some data, such as branch names, are stored locally as UTF-8, read in for processing, then stored in the repository history as UTF-8 again.

This presents difficulties, as we either need to make sure the dozens of places that handle branch names do so in UTF-8 or we need to avoid conversion loss when converting from the local encoding back to UTF-8. In Mercurial post-1.7, this is facilitated by the `encoding.localstr` class returned by `tolocal` which caches the original UTF-8 version of a string alongside its local encoding. The `fromlocal` function can retrieve this string if it's available, which allows lossless round-trip conversion.

/!\ String operations (e.g. `strip()`) on `localstr` objects return plain strings and so lose the cached UTF-8 data.
== Filename strategy compatibility matrices ==
=== Key ===
 * Unix = Linux and other traditional Unix-like systems
 * UTF-8/16 = UTF-8 file names with UTF-16 contents
 * Various = multiple, unknown, or meaningless encodings
 * (./) = fully interoperable
 * <!> = fails for some filenames
 * {X} = fails checkout
 * X-( = can't even check-in
 * R = human readability issues (aka mojibake)
 * B? = "make problem" with native byte-oriented tools
 * C? = "make problem" with native character-oriented tools
 * * = Windows has limited support for UTF-8 (CP65001)

=== Mercurial <= 2.0 strategy: ===
Current versions of Mercurial read and write filenames "as-is" with no attempt to adapt to encoding or use wide character interfaces.
||<tablewidth="200px"> ||Unix ASCII ||Unix Latin1 ||Unix ShiftJIS ||Unix UTF-8 ||Mac UTF-8 ||Windows Latin1 ||Windows ShiftJIS ||Windows UTF-8* ||
||ASCII || (./) || (./) || (./) || (./) || (./) || (./) || (./) || (./) ||
||Latin1 ||R || (./) ||R ||R ||R || (./) ||RC? ||RC? ||
||ShiftJIS ||R ||R || (./) ||R ||R ||RC? || (./) ||RC? ||
||UTF-8 ||R ||R ||R || (./) || (./) ||RC? ||RC? || (./) ||
||UTF-8/16 ||RB? ||RB? ||RB? ||RB? ||RB? ||RC? ||RC? || (./) ||
||Various ||R ||R ||R ||R ||R ||R ||R ||R ||




=== "Transcode everything to/from Unicode and use Windows Unicode API" strategy: ===
Some other SCMs (SVN, Bazaar) attempt to transcode all filenames to/from Unicode internally.
||<tablewidth="200px"> ||Unix ASCII ||Unix Latin1 ||Unix ShiftJIS ||Unix UTF-8 ||Mac UTF-8 ||Windows Latin1 ||Windows ShiftJIS ||Windows UTF-8* ||
||ASCII || (./) || (./) || (./) || (./) || (./) || (./) || (./) || (./) ||
||Latin1 || {X} || (./) || {X} ||B? ||B? || (./) ||C? ||C? ||
||ShiftJIS || {X} || {X} || (./) ||B? ||B? ||C? || (./) ||C? ||
||UTF-8 || {X} || <!> B? || <!> B? || (./) || (./) || (./) || (./) || (./) ||
||UTF-8/16 || {X} || <!> B? || <!> B? ||B? ||B? || (./) || (./) || (./) ||
||Various || X-( || X-( || X-( || X-( || X-( || X-( || X-( || X-( ||




=== Future hybrid strategy: ===
A proposed future version of Mercurial would use Windows Unicode APIs whenever UTF-8 filenames were stored in a repo:
|| ||Unix ASCII ||Unix Latin1 ||Unix ShiftJIS ||Unix UTF-8 ||Mac UTF-8 ||Windows Latin1 ||Windows ShiftJIS ||Windows UTF-8* ||
||ASCII || (./) || (./) || (./) || (./) || (./) || (./) || (./) || (./) ||
||Latin1 ||R || (./) ||R ||R ||R || (./) ||RC? ||RC? ||
||ShiftJIS ||R ||R || (./) ||R ||R ||RC? || (./) ||RC? ||
||UTF-8 ||R ||R ||R || (./) || (./) || (./) || (./) || (./) ||
||UTF-8/16 ||RB? ||RB? ||RB? ||RB? ||RB? || (./) || (./) || (./) ||
||Various ||R ||R ||R ||R ||R ||R ||R ||R ||




=== Observations ===
 * ASCII is the only perfectly cross-platform strategy
 * Mercurial strategy almost always results in a successful checkout
 * Mercurial strategy avoids makefile problem well on Unix-like systems
 * "Transcode" strategy trades a few successes on Windows for lots of failed checkouts elsewhere
 * "Transcode" strategy can't handle "various" at all
 * "Transcode" strategy sometimes trades readability problems (easy to ignore) for "makefile problems" (break the build)
 * "Transcode" strategy trades some "makefile problems" for others
 * Overall, "transcode" strategy is less robust and Unix-hostile
 * Hybrid strategy combines upside of "transcode strategy" without introducing new failure modes.
 * Hybrid with UTF-8 is nearly completely cross-platform

Encoding Strategy

How encoding works in the Mercurial codebase.

/!\ This page is intended for developers.

1. Overview

There are three types of string used in Mercurial:

  • byte string in unknown encoding (tracked data)
  • byte string in local encoding (messages, user input)
  • byte string in UTF-8 encoding (repository metadata)

This page sorts out which type of string can be expected where on disk and in the code and what functions manipulate them.

2. Platform issues

2.1. Linux and Unix

  • kernel and native filesystems are encoding-transparent
  • filesystem APIs are UTF-8-compatible, but accept arbitrary encodings
  • multiple encodings can exist for file names on the same system
  • UTF-8 is the de facto standard console and text file encoding on modern systems, though other encodings are still common
  • common tools like 'make(1)' are designed to assume file encoding and contents match, so UTF-8 filenames in files will be used to find UTF-8 filenames on disk

2.2. Mac OS X

  • kernel is encoding-transparent
  • native HFS+ filesystem encodes filenames in a non-standard normalized form of UTF-8
  • filesystem uses a Unicode-aware case-folding algorithm by default
  • kernel I/O interfaces accept UTF-8 filenames
  • resulting filenames on disk may not match (strcmp) with the names supplied at creation time due to normalization
  • Most tools assume file encoding and contents match
  • UTF-8 is the de facto standard console and text file encoding
  • Legacy tools may use MacRoman
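
The normalization mismatch described above can be sketched in a few lines. This is only an illustration: it uses standard Unicode NFD from Python's unicodedata module as a stand-in, whereas HFS+ applies its own variant of decomposition, and the file name is a made-up example.

```python
import unicodedata

# "café.txt" with "é" as a single precomposed code point (NFC form).
name_nfc = "caf\u00e9.txt"

# Canonical decomposition: "e" followed by a combining acute accent.
# Assumption: standard NFD approximates what a decomposing filesystem stores.
name_nfd = unicodedata.normalize("NFD", name_nfc)

supplied = name_nfc.encode("utf-8")  # bytes passed in at creation time
stored = name_nfd.encode("utf-8")    # bytes a directory listing may return

print(supplied)
print(stored)
print(supplied == stored)  # a strcmp-style byte comparison does not match
```

Both byte strings display as the same name, which is why the mismatch tends to surface only when code compares names byte for byte.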

2.3. Windows

  • kernel has several different incompatible 8-bit encoding regimes:
    • default encoding used in the GUI
    • default encoding used in the filesystem
    • default (legacy) encoding used in the console (cmd.exe)
  • kernel has a mix of byte-width and wide character APIs
  • kernel and console environment have basically no support for UTF-8 filename I/O or character display
  • in some circumstances, kernel will accept non-ASCII filenames, list them as different names, and fail to open the file under the original name
  • filesystem uses a highly obscure Unicode-aware case-folding algorithm by default
  • Many tools attempt to do transcoding of file contents from the local encoding to UTF-16 before passing it off to the filesystem
  • UTF-16 text files are occasionally found
  • Multi-byte encodings like Shift-JIS cause trouble here because the "\" byte (0x5C) can also appear as the second byte of a two-byte character, making it ambiguous as a path separator
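
The Shift-JIS backslash ambiguity can be reproduced with Python's built-in codec. The file name below is a made-up example; the point is only that byte-oriented code splitting on "\" cuts a character in half.

```python
# U+8868 encodes in Shift-JIS as the two bytes 0x95 0x5C,
# and 0x5C is also the ASCII backslash.
text = "\u8868.txt"
encoded = text.encode("shift_jis")

# Byte-oriented code that treats 0x5C as a path separator sees two components.
parts = encoded.split(b"\\")

print(encoded)
print(parts)
```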

2.4. Web and XML

  • URLs are encoded in ASCII with %-escaping to ISO-Latin, according to RFC1738
  • translation of URLs to filesystem paths is webserver-dependent
  • HTML defaults to ISO-Latin, but may contain encoding specifiers or Unicode entities
  • XML assumes a subset of UTF-8
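
As a small illustration of why URL-to-path translation depends on the server, the same name can be %-escaped from different underlying encodings. The sketch uses the standard urllib.parse module; the name is a made-up example.

```python
from urllib.parse import quote, unquote_to_bytes

name = "caf\u00e9"

# The same characters escape differently depending on the byte encoding chosen.
latin1_url = quote(name, encoding="latin-1")
utf8_url = quote(name, encoding="utf-8")

print(latin1_url)
print(utf8_url)

# Unescaping only recovers bytes; the URL does not say which encoding they use.
print(unquote_to_bytes(latin1_url))
print(unquote_to_bytes(utf8_url))
```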

2.5. Mercurial assumptions

  • non-ASCII filenames are not reliably portable between systems in general
  • the "makefile issue" (matching of file content and name encoding) means that in general, we must attempt to preserve filename encoding
  • On Windows, we prefer the 8-bit encoding of the GUI environment to that of the console to be compatible with typical editors

3. Problems and constraints

3.1. The encoding tracking problem

There are a number of problems with detecting and tracking encoding of files:

  • With the exception of UTF-8 and ASCII, encodings cannot be reliably detected
  • Files may not be in a known or valid encoding, or may be using multiple encodings
  • Experience has shown that users will not correctly configure encodings until after a problem is created in permanent history
  • Conversion to and from Unicode may not perfectly preserve file contents
  • Many projects may intentionally contain files in different encodings, so localizing encoding will be unhelpful

Therefore, we do not attempt to convert file contents to match user locales and instead preserve files intact.
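
Two of the points above can be seen directly in a short sketch: the same bytes decode cleanly under more than one encoding, and a round trip through Unicode can alter data that is not valid in the assumed encoding. The helper name plausible_encodings is hypothetical and is not part of Mercurial.

```python
def plausible_encodings(data, candidates):
    """Return the candidate encodings under which data decodes without error."""
    ok = []
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        ok.append(enc)
    return ok

# The bytes 0x83 0x5C are one katakana character in Shift-JIS, but they are
# also two valid Latin-1 characters, so decoding alone cannot tell them apart.
data = b"\x83\x5c"
guesses = plausible_encodings(data, ["utf-8", "shift_jis", "latin-1"])
print(guesses)

# Round-tripping bytes that are invalid in the assumed encoding changes them.
raw = b"\xff"
round_tripped = raw.decode("utf-8", "replace").encode("utf-8")
print(round_tripped == raw)
```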

3.2. The "makefile" problem

On Unix, if a file such as a script or makefile refers to another filename, it must do so using an encoding that matches the filename encoding. For instance, if a filename is encoded in Latin1, a makefile must also encode that filename in Latin1. Otherwise, a compiler will fail to find the referenced file.

Therefore, we cannot change filename encoding to match the locale of different users, as common tools will fail.
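
The failure mode can be modelled without touching a real filesystem. In this sketch a set of byte strings stands in for the directory, and can_open is a hypothetical stand-in for the byte-for-byte lookup a Unix kernel performs.

```python
filename = "r\u00e9sum\u00e9.c"

# The name as stored on disk by a Latin1 user.
on_disk = {filename.encode("latin-1")}

def can_open(name_bytes):
    """Byte-for-byte lookup, as an encoding-transparent kernel would do."""
    return name_bytes in on_disk

# A makefile saved in Latin1 refers to the file with matching bytes.
reference_latin1 = filename.encode("latin-1")
# The same makefile transcoded to UTF-8 refers to different bytes.
reference_utf8 = filename.encode("utf-8")

print(can_open(reference_latin1))
print(can_open(reference_utf8))
```

Transcoding either the file name or the makefile contents alone breaks the match, which is why both are preserved as-is.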

4. Unknown byte strings

The following are explicitly treated as binary data in an unknown encoding:

  • file contents
  • file names

These items should be treated as binary data and preserved losslessly wherever possible. Generally speaking, it is impossible to reliably and uniquely identify file type and encoding, thus Mercurial does not attempt to distinguish 'binary' files from 'text' files when storing them and instead aims to always preserve them exactly.

Similarly, for historical reasons, non-ASCII filenames are not necessarily portable from Unix to Windows, and Mercurial does not attempt to 'solve' this problem with transcoding either.

In general, do not attempt to transcode such data to Unicode and back in Mercurial; doing so will result in data loss.

5. UTF-8 strings

UTF-8 strings are used to store most repository metadata. Unlike repository contents, repository metadata is 'owned and managed' by Mercurial and can be made to conform to its rules. In particular, this includes:

  • commit messages stored in the changelog
  • user names
  • tags
  • branches

The following files are stored in UTF-8:

  • .hgtags
  • .hg/branch
  • .hg/branchheads.cache
  • .hg/tags.cache
  • .hg/bookmarks

These are converted to and from local strings in the relevant I/O functions, so that internally the above items are always represented in the local encoding. This restricts UTF-8-aware code to the smallest footprint possible so that the bulk of the code does not need to keep track of what encoding a string is in.

6. Local strings

Strings not mentioned above are generally assumed to be in the local charset encoding. This includes:

  • command line arguments
  • configuration files like .hgrc
  • prompt input
  • commit message
  • .hg/localtags

All user input in the form of command line arguments, configuration files, etc. is assumed to be in the local encoding.

6.1. Internal messages

All internal messages are written in ASCII, which is assumed to be a subset of the local encoding. Where localized string data is available, these strings are translated to the local encoding via gettext.

7. Mixing output

Mercurial frequently mixes output of all three varieties. For instance, the output of 'hg log -p' will contain internal strings in local encoding to mark fields, UTF-8 metadata, and file contents in an unknown encoding. These are managed as follows:

  • UTF-8 data is converted to local encoding at the earliest opportunity, generally at read time
  • internal ASCII strings are translated to local encoding via gettext() or passed unmodified
  • data in unknown encoding (file contents and filenames) are treated as already being in the local encoding for I/O purposes
  • resulting strings are combined with typical string formatting and I/O operations
  • raw binary output is used with no additional transcoding

Thus, the vast bulk of string operations in Mercurial are done as if they were operating on local strings.

As an example, attempts to view a patch containing UTF-8 characters on a non-UTF-8 terminal may not be entirely human-readable, but the generated patch will be correct in the sense that a standard patch tool will be able to apply it and get the right UTF-8 characters in the result. Similarly, 'hg cat' of a binary file will output an exact copy of the binary file, regardless of the current encoding.
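
The mixing rules above can be sketched as follows. This is not Mercurial's output code: the local encoding, the field label, and the tolocal_sketch helper are assumptions chosen for illustration.

```python
import io

LOCAL_ENCODING = "latin-1"  # assumption: the user's local encoding

def tolocal_sketch(utf8_bytes):
    """Convert UTF-8 metadata to the local encoding, replacing unknown glyphs."""
    return utf8_bytes.decode("utf-8").encode(LOCAL_ENCODING, "replace")

label = b"user: "                                        # internal ASCII string
user = tolocal_sketch("Andr\u00e9".encode("utf-8"))     # UTF-8 metadata, converted at read time
contents = b"\x95\\ arbitrary bytes\n"                  # tracked data, unknown encoding

# Everything is combined as byte strings and written with no further transcoding.
out = io.BytesIO()
out.write(label + user + b"\n")
out.write(contents)

print(out.getvalue())
```

Only the metadata is converted; the tracked bytes pass through untouched, so a patch tool reading the output still sees the original data.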

8. Functions

The encoding module defines the following functions:

  • fromlocal(): validates a string in the local encoding and converts it to UTF-8 for storage
  • tolocal(): converts a string stored as UTF-8 to the local encoding, replacing unknown glyphs
  • colwidth(): calculates the width of a local string in terminal columns

Also, encoding.encoding specifies Mercurial's idea of what the current encoding is.
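
A simplified sketch of what these three functions do is shown below. It is not the code in Mercurial's encoding module: the enc parameter, the module-level default, and the column-width rule (wide and fullwidth East Asian characters count as two columns) are assumptions made for illustration.

```python
import unicodedata

encoding = "latin-1"  # stand-in for encoding.encoding

def fromlocal(s, enc=None):
    """Local bytes -> UTF-8 bytes; strict decoding rejects invalid input."""
    return s.decode(enc or encoding).encode("utf-8")

def tolocal(s, enc=None):
    """UTF-8 bytes -> local bytes; glyphs missing locally become '?'."""
    return s.decode("utf-8").encode(enc or encoding, "replace")

def colwidth(s, enc=None):
    """Terminal columns occupied by a local string."""
    text = s.decode(enc or encoding, "replace")
    return sum(2 if unicodedata.east_asian_width(c) in "WF" else 1 for c in text)

print(fromlocal(b"caf\xe9"))
print(tolocal("\u65e5\u672c".encode("utf-8")))
print(colwidth("\u65e5\u672c".encode("shift_jis"), enc="shift_jis"))
```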

9. Round-trip conversion

Some data, such as branch names, are stored locally as UTF-8, read in for processing, then stored in the repository history as UTF-8 again.

This presents difficulties, as we either need to make sure the dozens of places that handle branch names do so in UTF-8 or we need to avoid conversion loss when converting from the local encoding back to UTF-8. In Mercurial post-1.7, this is facilitated by the encoding.localstr class returned by tolocal which caches the original UTF-8 version of a string alongside its local encoding. The fromlocal function can retrieve this string if it's available, which allows lossless round-trip conversion.

/!\ String operations (e.g. strip()) on localstr objects return plain strings and so lose the cached UTF-8 data.
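
The caching idea and the strip() pitfall can be sketched with a bytes subclass. This is a simplified model, not the actual encoding.localstr class: the _utf8 attribute name, the ASCII local encoding, and the function bodies are assumptions.

```python
LOCAL_ENCODING = "ascii"  # assumption: a local encoding that cannot represent "é"

class localstr(bytes):
    """Local-encoding bytes that remember the UTF-8 string they came from."""
    def __new__(cls, utf8, local):
        self = super().__new__(cls, local)
        self._utf8 = utf8
        return self

def tolocal(utf8):
    local = utf8.decode("utf-8").encode(LOCAL_ENCODING, "replace")
    return localstr(utf8, local)

def fromlocal(s):
    if isinstance(s, localstr):
        return s._utf8  # lossless: reuse the cached original
    return s.decode(LOCAL_ENCODING).encode("utf-8")

original = "caf\u00e9 ".encode("utf-8")
local = tolocal(original)

print(fromlocal(local) == original)  # the round trip is exact

stripped = local.strip()             # bytes methods return plain bytes
print(type(stripped) is bytes)
print(fromlocal(stripped))           # the cache is gone, so "?" is stored
```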

10. Unicode strings

Python Unicode objects are only used in the implementation of the above functions and are carefully avoided elsewhere. Do not pass Unicode objects to any Mercurial APIs. Due to Python's misguided automatic Unicode to byte string conversion, Unicode objects are likely to work in testing, but break as soon as they encounter a non-ASCII character.
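
The failure pattern can be approximated in Python 3, which no longer converts implicitly. Python 2 used the ASCII codec for its automatic conversion, so the hypothetical helper below makes that step explicit.

```python
def implicit_py2_conversion(text):
    """Mimic Python 2's automatic Unicode-to-bytes conversion (ASCII codec)."""
    return text.encode("ascii")

# ASCII-only test data passes without complaint.
print(implicit_py2_conversion("default"))

# The first non-ASCII character raises an error.
try:
    implicit_py2_conversion("caf\u00e9")
    failed = False
except UnicodeEncodeError:
    failed = True
print(failed)
```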

11. Filename strategy compatibility matrices

11.1. Key

  • Unix = Linux and other traditional Unix-like systems
  • UTF-8/16 = UTF-8 file names with UTF-16 contents
  • Various = multiple, unknown, or meaningless encodings
  • (./) = fully interoperable
  • <!> = fails for some filenames
  • {X} = fails checkout
  • X-( = can't even check-in
  • R = human readability issues (aka mojibake)
  • B? = "make problem" with native byte-oriented tools
  • C? = "make problem" with native character-oriented tools
  • * = Windows has limited support for UTF-8 (CP65001)

11.2. Mercurial <= 2.0 strategy:

Current versions of Mercurial read and write filenames "as-is" with no attempt to adapt to encoding or use wide character interfaces.

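|| ||Unix ASCII ||Unix Latin1 ||Unix ShiftJIS ||Unix UTF-8 ||Mac UTF-8 ||Windows Latin1 ||Windows ShiftJIS ||Windows UTF-8* ||
||ASCII || (./) || (./) || (./) || (./) || (./) || (./) || (./) || (./) ||
||Latin1 ||R || (./) ||R ||R ||R || (./) ||RC? ||RC? ||
||ShiftJIS ||R ||R || (./) ||R ||R ||RC? || (./) ||RC? ||
||UTF-8 ||R ||R ||R || (./) || (./) ||RC? ||RC? || (./) ||
||UTF-8/16 ||RB? ||RB? ||RB? ||RB? ||RB? ||RC? ||RC? || (./) ||
||Various ||R ||R ||R ||R ||R ||R ||R ||R ||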

11.3. "Transcode everything to/from Unicode and use Windows Unicode API" strategy:

Some other SCMs (SVN, Bazaar) attempt to transcode all filenames to/from Unicode internally.

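|| ||Unix ASCII ||Unix Latin1 ||Unix ShiftJIS ||Unix UTF-8 ||Mac UTF-8 ||Windows Latin1 ||Windows ShiftJIS ||Windows UTF-8* ||
||ASCII || (./) || (./) || (./) || (./) || (./) || (./) || (./) || (./) ||
||Latin1 || {X} || (./) || {X} ||B? ||B? || (./) ||C? ||C? ||
||ShiftJIS || {X} || {X} || (./) ||B? ||B? ||C? || (./) ||C? ||
||UTF-8 || {X} || <!> B? || <!> B? || (./) || (./) || (./) || (./) || (./) ||
||UTF-8/16 || {X} || <!> B? || <!> B? ||B? ||B? || (./) || (./) || (./) ||
||Various || X-( || X-( || X-( || X-( || X-( || X-( || X-( || X-( ||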

11.4. Future hybrid strategy:

A proposed future version of Mercurial would use Windows Unicode APIs whenever UTF-8 filenames were stored in a repo:

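|| ||Unix ASCII ||Unix Latin1 ||Unix ShiftJIS ||Unix UTF-8 ||Mac UTF-8 ||Windows Latin1 ||Windows ShiftJIS ||Windows UTF-8* ||
||ASCII || (./) || (./) || (./) || (./) || (./) || (./) || (./) || (./) ||
||Latin1 ||R || (./) ||R ||R ||R || (./) ||RC? ||RC? ||
||ShiftJIS ||R ||R || (./) ||R ||R ||RC? || (./) ||RC? ||
||UTF-8 ||R ||R ||R || (./) || (./) || (./) || (./) || (./) ||
||UTF-8/16 ||RB? ||RB? ||RB? ||RB? ||RB? || (./) || (./) || (./) ||
||Various ||R ||R ||R ||R ||R ||R ||R ||R ||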

11.5. Observations

  • ASCII is the only perfectly cross-platform strategy
  • Mercurial strategy almost always results in a successful checkout
  • Mercurial strategy avoids makefile problem well on Unix-like systems
  • "Transcode" strategy trades a few successes on Windows for lots of failed checkouts elsewhere
  • "Transcode" strategy can't handle "various" at all
  • "Transcode" strategy sometimes trades readability problems (easy to ignore) for "makefile problems" (break the build)
  • "Transcode" strategy trades some "makefile problems" for others
  • Overall, "transcode" strategy is less robust and Unix-hostile
  • Hybrid strategy combines upside of "transcode strategy" without introducing new failure modes.
  • Hybrid with UTF-8 is nearly completely cross-platform

12. Historical note

Early versions of Mercurial made no effort to transcode metadata, so the tolocal() function has some fallbacks to allow guessing the encoding of strings that don't appear to be Unicode.
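
One way such a fallback could be structured is sketched below. The function name, the order of attempts, and the choice of ISO-8859-1 as the fallback are assumptions for illustration and may not match the real tolocal().

```python
def tolocal_with_fallback(s, local="latin-1", fallback="iso-8859-1"):
    """Convert stored metadata to the local encoding, guessing if it is not UTF-8."""
    for source in ("utf-8", fallback):
        try:
            return s.decode(source).encode(local, "replace")
        except UnicodeDecodeError:
            continue
    # Last resort: keep what can be decoded and replace the rest.
    return s.decode("utf-8", "replace").encode(local, "replace")

# Metadata written correctly as UTF-8.
print(tolocal_with_fallback("caf\u00e9".encode("utf-8")))

# Old metadata written in Latin1 without transcoding, read on a UTF-8 system.
print(tolocal_with_fallback(b"caf\xe9", local="utf-8"))
```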


CategoryInternals

EncodingStrategy (last edited 2012-12-03 16:20:41 by mpm)