Much like the situation on Windows, character encoding on OS X is rather broken. The system-wide character encoding (typically MacRoman) is not reflected in LC_CTYPE (typically "C"), so UNIX apps running inside Terminal often have a different notion of which character set to use than native apps do.

If we honor LC_CTYPE, we'll end up with Mercurial not properly encoding non-ASCII characters from things like TextEdit by default. Because working with native apps should be the default, Mercurial ignores LC_CTYPE and just uses MacRoman. OS X users who want to work in a more UNIX-like fashion can set HGENCODING.
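
The policy described above can be sketched roughly as follows. This is only an illustration of the decision order, not Mercurial's actual code: the function name pick_encoding and its parameters are made up for this example, and the final fallback to "ascii" is an assumption.

```python
import locale


def pick_encoding(environ, platform):
    """Return the encoding name to use for local text (illustrative only)."""
    # An explicit HGENCODING always wins.
    override = environ.get("HGENCODING")
    if override:
        return override
    # On OS X, ignore LC_CTYPE and assume the encoding native apps use.
    if platform == "darwin":
        return "mac-roman"
    # Elsewhere, trust whatever the locale reports.
    return locale.getpreferredencoding() or "ascii"


print(pick_encoding({"HGENCODING": "utf-8"}, "darwin"))
print(pick_encoding({}, "darwin"))
```

In real use the arguments would be os.environ and sys.platform; they are passed in here only so the behavior is easy to try out.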

(Brendan) This can produce surprising results for Terminal users who don't know LANG et al. are going to be ignored. If commit information is provided in UTF-8, it will be mangled (interpreted as MacRoman, then converted to UTF-8) when stored, and then reverse-mangled when displayed. The result is garbage in the changelog that is not detectable by the user. It might be better to respect LANG and friends if they are not the default, since that implies the user set them and expects them to be used.
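
The round trip Brendan describes can be reproduced in a few lines of Python 3. This is a sketch of the byte-level effect only; the sample text "café" and the variable names are chosen for illustration, and the real conversion happens inside Mercurial.

```python
# Bytes as a UTF-8 Terminal would deliver them for the text "café".
typed = "caf\u00e9".encode("utf-8")

# The bytes are misread as MacRoman: each UTF-8 byte becomes one character,
# so "é" (0xC3 0xA9) turns into "√©".
mangled = typed.decode("mac_roman")

# The misread text is what gets stored, as UTF-8, in the changelog.
stored = mangled.encode("utf-8")

# On display the stored text is converted back to MacRoman bytes, which
# happen to be the original UTF-8 bytes, so Terminal shows "café" again.
displayed = stored.decode("utf-8").encode("mac_roman")

print(ascii(mangled))      # escaped form of what the changelog contains
print(displayed == typed)  # the damage is invisible from Terminal
```

Because the display step exactly undoes the storage step, the user never sees the mangled form unless the changelog is read with a tool that treats it as UTF-8.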

The real problem is that Python's locale.getpreferredencoding() returns mac-roman regardless of the UNIX locale.

Note: I don't see this behavior on 10.5.3 with the default Python. Previous versions still need to be checked. (Lee)

lcantey$ python -c "import locale; print locale.getpreferredencoding()"

Alexis mentioned that bzr works around it with a trick like this:

import sys

if sys.platform == 'darwin':
    # Mask the platform so locale is imported as on a generic UNIX system.
    sys.platform = 'generic'
    import locale
    sys.platform = 'darwin'

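Mercurial's demandimport module defers the real import until the module is first used, so by then sys.platform would already have been reset. We'd probably need to add a call such as locale.getpreferredencoding() before resetting sys.platform, to force the import while the platform is still masked.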
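
A variant of the trick that would survive a lazy importer might look like the sketch below. It assumes, as the bzr workaround implies, that locale picks its behavior from sys.platform when it is first loaded. The helper name load_generic_locale is hypothetical, and this has not been checked against demandimport itself. If locale has already been imported elsewhere in the process, masking the platform changes nothing.

```python
import sys


def load_generic_locale():
    """Import locale while sys.platform is masked (hypothetical helper)."""
    real_platform = sys.platform
    sys.platform = 'generic'
    try:
        import locale
        # Use the module right away so a lazy importer has to load it now,
        # while the platform is still masked.
        locale.getpreferredencoding()
    finally:
        # Restore the real platform even if the import or the call fails.
        sys.platform = real_platform
    return locale


if sys.platform == 'darwin':
    locale = load_generic_locale()
```

The try/finally is there so that other code never sees the fake platform value if something goes wrong during the import.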


Character_Encoding_On_OSX (last edited 2009-05-19 19:31:05 by localhost)