[PATCH 2 of 3] templater: replace jsonescape in main json templater (issue4926)

Yuya Nishihara yuya at tcha.org
Thu Jan 14 07:12:46 CST 2016


On Wed, 13 Jan 2016 10:51:06 -0600, Matt Mackall wrote:
> On Wed, 2016-01-13 at 22:01 +0900, Yuya Nishihara wrote:
> > On Tue, 12 Jan 2016 11:01:06 -0600, Matt Mackall wrote:
> > > # HG changeset patch
> > > # User Matt Mackall <mpm at selenic.com>
> > > # Date 1452542432 21600
> > > #      Mon Jan 11 14:00:32 2016 -0600
> > > # Node ID 35d049d7e5a2dec87318ce8042844f56e107cf83
> > > # Parent  544d391bd3b42b96975a3521b73c25223db930b0
> > > templater: replace jsonescape in main json templater (issue4926)
> > > 
> > > This version differs in a couple ways:
> > > 
> > > - it skips optional escaping of codepoints > U+007f
> > > - it thus handles emoji correctly (JSON requires UTF-16 surrogates)
> > > - but it may run afoul of silly Unicode linebreaks if exec'd in js
> > > - it uses UTF-8b to round-trip undecodeable bytes
> > 
> > We can't do that because JSON output can be embedded in non-UTF-8 HTML,
> > where only 7bit ASCII is allowed,
> 
> Example scenarios, please.

HGENCODING=utf-8
export HGENCODING

hg init a
cd a
touch foo
hg ci -Am "$(python -c 'print u"\xc0".encode("utf-8")')"
hg serve --encoding iso-8859-1

Then open http://localhost:8000/graph/tip in a browser.
(Our real-world case uses --encoding Shift_JIS and Japanese characters.)

Before this patch there was no mojibake, because "À" was escaped to "\u00c0".
With this patch, "À" is lost as follows:

  u"À" -> "\xc0" (iso-8859-1) -> "\xed\xb3\x80" (utf8b)
  -> "\xed\xb3\x80" (iso-8859-1)
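
The same byte path can be reproduced outside hgweb. This is only a sketch in
Python 3: it emulates UTF-8b with the standard "surrogateescape" and
"surrogatepass" error handlers rather than calling Mercurial's toutf8b(), and
the variable names are made up for illustration.

```python
# Sketch only: emulates the UTF-8b mapping with Python 3 error handlers,
# it does not call Mercurial's encoding.toutf8b().

# The commit message as hgweb holds it with --encoding iso-8859-1.
local_bytes = "\u00c0".encode("iso-8859-1")            # b"\xc0"

# b"\xc0" is not valid UTF-8, so it is treated as an undecodable byte
# and mapped to the lone surrogate U+DCC0.
escaped = local_bytes.decode("utf-8", "surrogateescape")

# Serializing that surrogate as UTF-8 gives the three bytes from the mail.
utf8b_bytes = escaped.encode("utf-8", "surrogatepass")

# The page is declared as iso-8859-1, so the browser decodes those three
# bytes as three separate Latin-1 characters instead of one "À".
seen_by_browser = utf8b_bytes.decode("iso-8859-1")

print(repr(local_bytes), repr(utf8b_bytes), repr(seen_by_browser))
```

The point is that the input was valid local-encoding text, but it is handled
as if it were broken UTF-8.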

> There's no configuration of hgweb that won't potentially display non-ASCII if it
> exists in files. If you commit Unicode "á" to a file and fire up
> "HGENCODING=ascii hg serve", you'll get mojibake in the browser by design (and
> the correct bytes verbatim if you select raw mode). So I'm not sure what you
> mean by "allowed". I guess we could get into trouble if we expand JSON directly
> into some in-page Javascript when the page metadata marks it as non-UTF8.

JSON data can be embedded in a non-UTF-8 page as long as it is represented in
ASCII and the page encoding is ASCII-compatible.
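
A small Python 3 sketch of that claim, using the standard json module (whose
default ensure_ascii=True escapes every non-ASCII character). The helper name
is hypothetical, and "cp932" is used here as Python's codec for the Windows
variant of Shift_JIS.

```python
import json


def roundtrip_through_page(text, page_encoding):
    """Embed ASCII-only JSON in a page and read it back as the browser would."""
    # json.dumps() escapes non-ASCII as \uXXXX by default (ensure_ascii=True).
    payload = json.dumps(text)
    # The payload is pure ASCII, so these bytes are identical in every
    # ASCII-compatible page encoding.
    page_bytes = payload.encode("ascii")
    # The browser decodes the page with the declared encoding, then parses.
    return json.loads(page_bytes.decode(page_encoding))


for enc in ("iso-8859-1", "cp932", "utf-8"):
    for sample in ("\u00c0", "\u65e5\u672c\u8a9e"):
        assert roundtrip_through_page(sample, enc) == sample
```

Whatever encoding the page declares, the text survives because only 7-bit
bytes ever appear in the page.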

> >  and JSON input (i.e. template string)
> > is a local-encoding text in general.
> 
> encoding.jsonescape (indirectly) knows about localstr objects, and thus recovers
> the original UTF-8 text to encode if it exists.

Yes, but localstr is mostly lost in the templater, and toutf8b() treats its
input as raw bytes, not as local-encoding text.

> > I have patch series to fix the issue4926, but I found my patch seems to have
> > the emoji issue right now.
> 
> Whatever we do, we need to kill the second implementation of jsonescape in the
> templater.

Sure. My series will:

 1. add an option to encoding.jsonescape() to escape all non-ASCII characters
 2. add a "|utf8" template filter to explicitly convert localstr|str to UTF-8
 3. change "|json" to take its input as utf8b bytes (BC)
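
To illustrate what step 1 has to get right, including the emoji case, here is
a rough Python 3 sketch of escaping every non-ASCII character. It is not the
code from the series and not Mercurial's encoding.jsonescape(); the function
name is invented, and it works on already-decoded text rather than on utf8b
bytes.

```python
def escape_non_ascii(text):
    """Return the body of a JSON string containing only 7-bit ASCII."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in '"\\':
            # JSON metacharacters need a backslash.
            out.append("\\" + ch)
        elif cp < 0x20 or cp == 0x7F:
            # Control characters are escaped numerically.
            out.append("\\u%04x" % cp)
        elif cp < 0x7F:
            # Printable ASCII passes through unchanged.
            out.append(ch)
        elif cp <= 0xFFFF:
            # BMP characters fit in a single \uXXXX escape.
            out.append("\\u%04x" % cp)
        else:
            # Characters above U+FFFF (e.g. emoji) must be written as a
            # UTF-16 surrogate pair in JSON.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out.append("\\u%04x\\u%04x" % (high, low))
    return "".join(out)


print(escape_non_ascii("\u00c0"))        # \u00c0
print(escape_non_ascii("\U0001F600"))    # \ud83d\ude00
```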

