Diff for "Character_Encoding_On_OSX"

Differences between revisions 1 and 2

Much like the [:Character Encoding On Windows:situation on Windows], character encoding on OS X is rather broken. The systemwide character encoding (typically MacRoman) is not reflected in LC_CTYPE (typically "C") so UNIX apps running inside Terminal often have a different notion of what character set to use than native apps.

If we honor LC_CTYPE, we'll end up with Mercurial not properly encoding non-ASCII characters from things like TextEdit by default. Because working with native apps should be the default, Mercurial ignores LC_CTYPE and just uses MacRoman. OS X users who want to work in a more UNIX-like fashion can set HGENCODING.

(Brendan) This can produce surprising results for Terminal users who don't know LANG et al. are going to be ignored. If commit information is provided in UTF-8, it will be mangled (interpreted as MacRoman, then converted to UTF-8) when stored, and then reverse-mangled when displayed. The result is garbage in the changelog that is not detectable by the user. It might be better to respect LANG and friends if they are not the default, since that implies the user set them and expects them to be used.
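
The mangling described above can be reproduced in a few lines. This is only an illustrative sketch in modern Python, not Mercurial's actual code: the helper names `mangle` and `unmangle` are hypothetical, and the sample string is an assumption chosen for the demonstration.

```python
# Illustrative sketch only: mangle() and unmangle() are hypothetical helpers,
# not part of Mercurial.

def mangle(text):
    """Simulate storing UTF-8 input that is wrongly assumed to be MacRoman."""
    utf8_bytes = text.encode("utf-8")          # what a UTF-8 terminal supplies
    misread = utf8_bytes.decode("mac_roman")   # interpreted as MacRoman
    return misread.encode("utf-8")             # converted to UTF-8 for storage


def unmangle(stored):
    """Simulate display: the stored UTF-8 is converted back to MacRoman bytes."""
    macroman_bytes = stored.decode("utf-8").encode("mac_roman")
    return macroman_bytes.decode("utf-8")      # the terminal reads them as UTF-8


original = "caf\u00e9"                          # "cafe" with an acute accent
stored = mangle(original)

# The stored form is not the UTF-8 encoding of the original text ...
assert stored != original.encode("utf-8")
# ... but display reverses the damage, so the user never sees the problem.
assert unmangle(stored) == original
```

Because the round trip is lossless, the output in the terminal looks correct while the stored changelog entry holds different text, which is why the user cannot detect the garbage.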

The problem is really Python's locale.getpreferredencoding() returning mac-roman regardless of the UNIX locale. Alexis mentioned that bzr works around it with a trick like this:

import sys

if sys.platform == 'darwin':
    # Hide the Mac platform while locale is first imported
    sys.platform = 'generic'
    import locale
    sys.platform = 'darwin'

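In the presence of Mercurial's demandimport module, which defers imports until first use, we'd probably need to add a call such as locale.getpreferredencoding() before resetting sys.platform, so that locale is actually loaded while the platform is still overridden.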
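
The ordering described above could look roughly like the following. This is a hedged sketch, not what Mercurial or bzr actually ship: the function name `import_locale_as_generic` is hypothetical, and on current Python versions the platform override probably has no effect on locale at all, so the sketch only illustrates the order of operations.

```python
import sys


def import_locale_as_generic():
    """Import locale while sys.platform is temporarily overridden on OS X.

    Hypothetical helper for illustration only.
    """
    real_platform = sys.platform
    if real_platform == 'darwin':
        sys.platform = 'generic'
    try:
        import locale
        # Use the module immediately, so that a lazy importer has to load it
        # now, while the override is still in place.
        locale.getpreferredencoding()
    finally:
        # Always restore the real platform, even if the import fails.
        sys.platform = real_platform
    return locale


locale_module = import_locale_as_generic()
```

The try/finally is a design choice of this sketch: it ensures that sys.platform is never left as 'generic' for the rest of the program.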

- Revision 1 as of 2007-01-12 16:22:23
  Size: 647
  Editor: mpm
  Comment:
+ Revision 2 as of 2007-06-11 17:15:56
  Size: 1629
  Editor: BrendanCully
  Comment:
 Line 2:
-The systemwide character encoding (typically MacRoman) is not reflected in LC_CTYPE (typically "C") so UNIX apps running inside Terminal often have a different notion of what character set to use than native apps.
+The systemwide character encoding (typically !MacRoman) is not reflected in LC_CTYPE (typically "C") so UNIX apps running inside Terminal often have a different notion of what character set to use than native apps.
 Line 4:
-If we honor LC_CTYPE, we'll end up with Mercurial not properly encoding non-ASCII characters from things like TextEdit by default. Because working with native apps should be the default, Mercurial ignores LC_CTYPE and just uses MacRoman. OS X users who want to work in a more UNIX-like fashion can set HGENCODING.
+If we honor LC_CTYPE, we'll end up with Mercurial not properly encoding non-ASCII characters from things like !TextEdit by default. Because working with native apps should be the default, Mercurial ignores LC_CTYPE and just uses !MacRoman. OS X users who want to work in a more UNIX-like fashion can set HGENCODING.

(Brendan) This can produce surprising results for Terminal users who don't know LANG et al. are going to be ignored. If commit information is provided in UTF-8, it will be mangled (interpreted as !MacRoman, then converted to UTF-8) when stored, and then reverse-mangled when displayed. The result is garbage in the changelog that is not detectable by the user. It might be better to respect LANG and friends ''if they are not the default'', since that implies the user set them and expects them to be used.

The problem is really Python's {{{locale.getpreferredencoding()}}} returning {{{mac-roman}}} regardless of the UNIX locale. Alexis mentioned that bzr works around it with a trick like this:

{{{
import sys

if sys.platform == 'darwin':
    # Hide the Mac platform while locale is first imported
    sys.platform = 'generic'
    import locale
    sys.platform = 'darwin'
}}}

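In the presence of Mercurial's demandimport module, which defers imports until first use, we'd probably need to add a call such as {{{locale.getpreferredencoding()}}} before resetting sys.platform, so that locale is actually loaded while the platform is still overridden.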