[PATCH] highlight: fixes garbled text in non-UTF-8 environment

Christian Ebert blacktrash at gmx.net
Sat Sep 5 04:21:44 CDT 2009


* Yuya Nishihara on Wednesday, September 02, 2009 at 11:16:03 -0000
> # HG changeset patch
> # User Yuya Nishihara <yuya at tcha.org>
> # Date 1251527055 -32400
> # Node ID 54e7217e12558be85f8ae410f1a168b58b966bae
> # Parent  37042e8b3b342b2e380d8be3e3f7692584c92d33
> highlight: fixes garbled text in non-UTF-8 environment
> 
> This patch treats all files inside the repository as encoded in the
> locale's encoding when pygmentizing them.
> 
> We can assume that most files are written in the locale's encoding,
> but the current implementation treats them as UTF-8.
> So there's no way to specify the encoding of files.
> 
> Current implementation, db7557359636 (issue1341):
> 1. Convert original `text`, which is treated as UTF-8, to locale's encoding.
>   `encoding.tolocal()` is the method to convert from internal UTF-8 to local.
>   If original `text` is not UTF-8, e.g. Japanese EUC-JP, some characters
>   become garbled here.

So why did iso-8859-1 content not become garbled? Probably
because it was in fallbackencoding.
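
To illustrate what I mean, here is a simplified model of how I understand
the conversion to behave: try strict UTF-8, then the fallback encoding,
then give up and replace. This is a sketch, not Mercurial's code;
`tolocal_sketch` and its parameters are made-up names, and the default
fallback of ISO-8859-1 is my assumption.

```python
def tolocal_sketch(raw, local_encoding, fallback_encoding="ISO-8859-1"):
    """Simplified, hypothetical model of an internal-to-local conversion."""
    # Try strict UTF-8 first, then the (assumed) fallback encoding.
    for candidate in ("UTF-8", fallback_encoding):
        try:
            decoded = raw.decode(candidate)
        except UnicodeDecodeError:
            continue
        return decoded.encode(local_encoding, "replace")
    # Last resort: lossy UTF-8 decoding.
    return raw.decode("UTF-8", "replace").encode(local_encoding, "replace")


# "cafe" with an acute accent, stored as ISO-8859-1 bytes.
latin1_raw = "caf\u00e9".encode("ISO-8859-1")
# Three Japanese characters, stored as EUC-JP bytes.
eucjp_raw = "\u65e5\u672c\u8a9e".encode("EUC-JP")

latin1_result = tolocal_sketch(latin1_raw, "ISO-8859-1")
eucjp_result = tolocal_sketch(eucjp_raw, "EUC-JP")

# The ISO-8859-1 bytes are rescued by the fallback and round-trip intact.
print(latin1_result == latin1_raw)
# The EUC-JP bytes are misread as ISO-8859-1 and no longer match.
print(eucjp_result == eucjp_raw)
```

Under this model the fallback rescues ISO-8859-1 content by accident,
while EUC-JP bytes are silently misinterpreted before they ever reach the
highlighter.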

> 2. pygmentize, with no UnicodeDecodeError.
> 
> This patch:
> 1. Convert original `text`, which is treated as locale's encoding, to unicode.
>   Pygments prefers a unicode object to a raw str. [1]_
>   If original `text` is not encoded by locale's encoding, some characters
>   become garbled here.
> 2. pygmentize, also with no UnicodeDecodeError :)
> 3. Convert unicode back to a raw str, encoded in the locale's encoding.
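
As a rough sketch of how I read these three steps (this is not the code
from the patch): `highlight_in_locale` and `fake_highlighter` are
hypothetical names, the highlighter is a stand-in for the Pygments call,
and the use of "replace" error handling is my assumption.

```python
def highlight_in_locale(raw, local_encoding, highlighter):
    """Hypothetical round trip: locale bytes -> unicode -> locale bytes."""
    # Step 1: treat the file content as locale-encoded and decode it.
    text = raw.decode(local_encoding, "replace")
    # Step 2: hand unicode text to the highlighter.
    marked_up = highlighter(text)
    # Step 3: encode the result back into the locale's encoding.
    return marked_up.encode(local_encoding, "replace")


def fake_highlighter(text):
    # Stand-in for the real highlighting step; it only wraps the text.
    return "<span>" + text + "</span>"


raw = "\u65e5\u672c\u8a9e".encode("EUC-JP")
result = highlight_in_locale(raw, "EUC-JP", fake_highlighter)
print(result == b"<span>" + raw + b"</span>")
```

When the file really is in the locale's encoding, the content bytes come
back unchanged; only the markup is added around them.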

Have you checked whether this still highlights the text in
question? With this patch I lose all highlighting!

I don't know why exactly. Have to investigate. There are so many
places where encoding can be set:

- hgrc files
- environment
- [web].encoding
- hgwebdir.cgi

etc. Short of experimenting, I don't even know which takes
precedence. E.g. I just discovered that setting [web].encoding to
something like iso-8859-1 causes a traceback (not because of your
patch), whereas ascii doesn't (it just garbles the text).

The test should probably use not a .txt file (which won't be
highlighted anyway) but a file whose type is recognized by its
extension (and which may contain non-ASCII characters).
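
Something along these lines could generate such a file. It is only a
sketch: `write_fixture` and the file name `sample.py` are hypothetical,
and EUC-JP is picked merely because it is the encoding from your example.

```python
import os
import tempfile


def write_fixture(directory, encoding="EUC-JP"):
    """Write a small .py file whose comment holds non-ASCII characters."""
    path = os.path.join(directory, "sample.py")
    source = (
        "# -*- coding: %s -*-\n" % encoding.lower()
        + "# \u65e5\u672c\u8a9e\n"
        + "print(2)\n"
    )
    # Write raw bytes so the on-disk encoding is exactly the one requested.
    with open(path, "wb") as handle:
        handle.write(source.encode(encoding))
    return path


fixture_dir = tempfile.mkdtemp()
fixture_path = write_fixture(fixture_dir)
with open(fixture_path, "rb") as handle:
    fixture_bytes = handle.read()
```

The .py extension lets the file type be recognized by name, and the
comment bytes are not valid UTF-8, so a wrong encoding assumption should
show up in the output.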

c
-- 
\black\trash movie    _C O W B O Y_  _C A N O E_  _C O M A_
                     Ein deutscher Western/A German Western
Next show: 18 September 2009 --->>    http://goldkante.org/
         --->> http://www.blacktrash.org/underdogma/ccc.php

