Typical Windows systems operate with two character sets. On typical US systems, for example, most applications use a character set called cp1252 that is close to Latin-1. At the same time, the console ("DOS box") uses the legacy PC character set called cp437.
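
To make the mismatch concrete, here is a minimal sketch using Python's standard cp1252 and cp437 codecs. The sample character and the variable names are chosen purely for illustration. It shows that the same character gets a different byte in each character set, and that bytes written under one set decode without error, but wrongly, under the other.

```python
# Illustration only: the sample character and variable names are hypothetical.
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

as_cp1252 = text.encode("cp1252")  # byte 0xE9 in the GUI ("ANSI") charset
as_cp437 = text.encode("cp437")    # byte 0x82 in the console ("OEM") charset

# Decoding with the wrong charset raises no error; it silently yields a
# different character, which is why the problem is easy to miss.
gui_bytes_on_console = as_cp1252.decode("cp437")   # GREEK CAPITAL LETTER THETA
console_bytes_in_gui = as_cp437.decode("cp1252")   # SINGLE LOW-9 QUOTATION MARK
```

Because both decodings succeed, a program cannot tell from the bytes alone which character set was intended.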

This makes things rather complicated for a command-line application like Mercurial. Mercurial performs locale conversion on data such as user names and commit messages. This data can come from the command line, from files like .hgrc, or from local editors, and it can be written to files or displayed on the console.

In the case of data taken from files, we should generally assume that the contents are in the cp1252 charset. But this may not always be correct because a user may have used a native console-based editor to create the file.

The command line presents a similar problem. While the command line is typically typed in a console and is thus in cp437, it may actually have come from a batch file written in Notepad using cp1252. Or it may come from another program spawning hg to do its work, such as a graphical IDE or an importer where cp1252 is native.

Even environment variables like HGUSER are problematic, as they may have been set in the registry editor or with Notepad rather than on the command line.

Finally, consider output. When output goes directly to the console, it's usually possible to determine the character set to use (though in some situations, there will be no codepage associated with a console!). However, redirection makes things confusing again. Consider "hg log | more" or "hg log > file". These cases are indistinguishable from Mercurial's point of view. For the "| more" case, we'd like to have cp437 so that non-ASCII characters are displayed correctly. But for the "> file" case, we'd probably want cp1252 so that tools like Notepad will get the right results.
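
The output decision described above can be sketched roughly as follows. This is not Mercurial's implementation: the function names and the cp1252 fallback are assumptions made for illustration. The sketch relies on the Win32 call GetConsoleOutputCP (reached through ctypes, and only present on Windows), which returns 0 when no codepage can be obtained.

```python
import ctypes
import sys


def console_codepage():
    """Return the console output codepage as a codec name, or None."""
    windll = getattr(ctypes, "windll", None)  # ctypes.windll exists only on Windows
    if windll is None:
        return None
    cp = windll.kernel32.GetConsoleOutputCP()  # 0 means no codepage is available
    return "cp%d" % cp if cp else None


def guess_output_encoding(stream, fallback="cp1252"):
    """Hypothetical helper: pick an encoding for text written to `stream`.

    `fallback` is an assumed default for a typical US system.
    """
    if stream.isatty():
        # Writing straight to a console: prefer its codepage when known.
        return console_codepage() or fallback
    # Redirected output: "| more" and "> file" both end up here, so one
    # choice has to serve both cases.
    return fallback


encoding = guess_output_encoding(sys.stdout)
```

Note that the redirected branch cannot separate the "| more" case from the "> file" case, so whichever fallback is chosen will be wrong for one of them.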

So what should a program like Mercurial do? The best options are:

This leaves us with the question of how to deal with character set compatibility problems. Here are a few possibilities:

Notes


CategoryWindows

CharacterEncodingOnWindows (last edited 2011-07-31 23:25:03 by nat3)