[PATCH] highlight: pass hgweb.encoding to lexers and formatter

Tue Dec 11 17:07:26 CST 2007

On Tue, Dec 11, 2007 at 11:40:25PM +0100, Christian Ebert wrote:
> * Matt Mackall on Tuesday, December 11, 2007 at 15:50:00 -0600
> > On Tue, Dec 11, 2007 at 10:21:23PM +0100, Christian Ebert wrote:
> >> The following is needed to avoid a nasty backtrace when a file
> >> contains non-ascii characters.
> >> 
> >> Should perhaps be tested in non-utf locale; also I am not
> >> entirely sure if the lexers should get passed util._encoding.
> >> Anyway this gave consistent results re encoding with highlight
> >> turned on and off.
> > 
> > Ugh. Apps should assume that regardless of what encoding they're in,
> > someone's going to throw them a byte that can't be decoded. If it was
> > throwing an exception when it was assuming ASCII, it will still throw
> > exceptions when you try to pass off Latin-1 as UTF-8 or whatever. So
> > this fix is insufficient.
> 
> No doubt. I am rather confused by the pygments docs (input charset
> iso-8859-1 is assumed???) too, see below.
> 
> > Odds are good that pygments is hopelessly infected with Unicode
> > braindamage, so I somehow doubt there -is- a good fix.
> 
> Frankly, I just tried "to make it work" for my machine. But
> perhaps someone more savvy with pygments has an idea; or can make
> something coherent out of the docs. I quote the relevant section
> for reference:
... 
> Since Pygments 0.6, all lexers use unicode strings internally. Because of that
> you might encounter the occasional `UnicodeDecodeError` if you pass strings with the
> wrong encoding.

Yeah, that's the brain damage I was talking about. 

> The best way is to pass Pygments unicode objects. In that case you can't get
> unexpected output.

And that's a bit of a strong statement. Anyway, this is probably the
best route: simply decode the strings yourself (using util.tolocal),
which will replace characters that can't be handled with something
vaguely appropriate instead of spewing chunks.
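
A minimal sketch of the "decode it yourself" idea, written for Python 3
rather than the Python 2 of this thread. It does not call Mercurial's
util.tolocal; `decode_lossy` is a hypothetical helper that relies only on
the standard `bytes.decode` with `errors="replace"`, and the default
encoding of UTF-8 is an assumption:

```python
def decode_lossy(data, encoding="utf-8"):
    """Decode bytes to text without ever raising UnicodeDecodeError.

    Hypothetical helper, not part of Mercurial or Pygments. Bytes that
    are invalid in `encoding` become U+FFFD (the replacement character)
    instead of triggering an exception.
    """
    return data.decode(encoding, errors="replace")


# Valid UTF-8 passes through unchanged.
good = decode_lossy(b"caf\xc3\xa9")

# A stray Latin-1 byte is replaced rather than causing a backtrace.
bad = decode_lossy(b"caf\xe9!")
```

The result is a text object that can be handed to a lexer; the cost is
that undecodable bytes are lost in the rendered output.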

> The formatters now send Unicode objects to the stream if you don't set the
> output encoding. You can do so by passing the formatters an `encoding` option:
> 
>     from pygments.formatters import HtmlFormatter
>     f = HtmlFormatter(encoding='utf-8')

And you'll want to make sure the output encoding is set sensibly too.
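
As a rough illustration of setting both ends explicitly, the sketch below
decodes the input leniently and gives the formatter the same output
encoding. It assumes Python 3 and an installed Pygments; `highlight_bytes`
is a hypothetical name, and PythonLexer and UTF-8 are arbitrary choices
made for the example:

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import PythonLexer


def highlight_bytes(data, encoding="utf-8"):
    """Highlight raw file bytes and return encoded HTML bytes.

    Hypothetical helper: the input is decoded with errors="replace" so
    bad bytes cannot raise, and the formatter is told which encoding to
    emit so the result is bytes rather than a text object.
    """
    text = data.decode(encoding, errors="replace")
    formatter = HtmlFormatter(encoding=encoding)
    return highlight(text, PythonLexer(), formatter)


# The \xe9 byte is not valid UTF-8 here; it is replaced, not fatal.
html = highlight_bytes(b"print('caf\xe9')\n")
```

With `encoding` set on the formatter, `highlight` returns bytes in that
encoding, so the caller knows exactly what to declare in the HTTP
response.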

Of course, this will all make a bloody mess when you have a repo
(or even a single file!) containing multiple character sets. But at
least it won't die.

-- 
Mathematics is the supreme nostalgia of our time.