[PATCH 2 of 3] templater: replace jsonescape in main json templater (issue4926)
Yuya Nishihara
yuya at tcha.org
Thu Jan 14 07:12:46 CST 2016
On Wed, 13 Jan 2016 10:51:06 -0600, Matt Mackall wrote:
> On Wed, 2016-01-13 at 22:01 +0900, Yuya Nishihara wrote:
> > On Tue, 12 Jan 2016 11:01:06 -0600, Matt Mackall wrote:
> > > # HG changeset patch
> > > # User Matt Mackall <mpm at selenic.com>
> > > # Date 1452542432 21600
> > > # Mon Jan 11 14:00:32 2016 -0600
> > > # Node ID 35d049d7e5a2dec87318ce8042844f56e107cf83
> > > # Parent 544d391bd3b42b96975a3521b73c25223db930b0
> > > templater: replace jsonescape in main json templater (issue4926)
> > >
> > > This version differs in a couple ways:
> > >
> > > - it skips optional escaping of codepoints > U+007f
> > > - it thus handles emoji correctly (JSON requires UTF-16 surrogates)
> > > - but it may run afoul of silly Unicode linebreaks if exec'd in js
> > > - it uses UTF-8b to round-trip undecodeable bytes
> >
> > We can't do that because JSON output can be embedded in non-UTF-8 HTML,
> > where only 7bit ASCII is allowed,
>
> Example scenarios, please.
HGENCODING=utf-8
export HGENCODING
hg init a
cd a
touch foo
hg ci -Am "$(python -c 'print u"\xc0".encode("utf-8")')"
hg serve --encoding iso-8859-1
Then access http://localhost:8000/graph/tip .
(Our real-world case uses --encoding Shift_JIS and Japanese characters.)
Before this patch, there was no mojibake because "À" is escaped to "\u00c0".
With this patch, "À" is lost as follows:
u"À" -> "\xc0" (iso-8859-1) -> "\xed\xb3\x80" (utf8b)
-> "\xed\xb3\x80" (iso-8859-1)
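The byte chain above can be reproduced outside Mercurial. This is only a sketch in Python 3: the helper name to_utf8b is hypothetical and approximates the UTF-8b mapping with the standard surrogateescape/surrogatepass error handlers. It is not Mercurial's actual implementation.

```python
def to_utf8b(raw):
    """Approximate UTF-8b: keep valid UTF-8, map each undecodable
    byte 0xNN to U+DCNN and emit that code point as three bytes."""
    text = raw.decode("utf-8", "surrogateescape")
    return text.encode("utf-8", "surrogatepass")


commit_text = "\u00c0"                           # the character "À"
local_bytes = commit_text.encode("iso-8859-1")   # what hgweb holds with --encoding iso-8859-1
utf8b_bytes = to_utf8b(local_bytes)              # 0xc0 alone is not valid UTF-8
browser_view = utf8b_bytes.decode("iso-8859-1")  # how an iso-8859-1 page interprets those bytes

print(repr(local_bytes))
print(repr(utf8b_bytes))
print(repr(browser_view))
```

The last value no longer contains "À", which is the loss described above.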
> There's no configuration of hgweb that won't potentially display non-ASCII if it
> exists in files. If you commit Unicode "á" to a file and fire up
> "HGENCODING=ascii hg serve", you'll get mojibake in the browser by design (and
> the correct bytes verbatim if you select raw mode). So I'm not sure what you
> mean by "allowed". I guess we could get into trouble if we expand JSON directly
> into some in-page Javascript when the page metadata marks it as non-UTF8.
JSON data can be embedded in a non-UTF-8 page as long as it is represented in
ASCII and the page encoding is ASCII-compatible.
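To illustrate that claim, here is a small sketch using Python 3's standard json module rather than Mercurial code. The "desc" key and the list of page encodings are just assumptions for the example.

```python
import json

# ensure_ascii=True writes every non-ASCII character as a \uXXXX escape,
# so the serialized payload contains 7-bit ASCII only.
payload = json.dumps({"desc": "\u00c0"}, ensure_ascii=True)
wire_bytes = payload.encode("ascii")

# Any ASCII-compatible page encoding decodes these bytes identically,
# so the JSON parser recovers the original character either way.
recovered = {}
for page_encoding in ("iso-8859-1", "utf-8"):
    recovered[page_encoding] = json.loads(wire_bytes.decode(page_encoding))["desc"]

print(payload)
print(recovered)
```

Because the payload never carries a byte above 0x7f, the page encoding cannot corrupt it.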
> > and JSON input (i.e. template string)
> > is a local-encoding text in general.
>
> encoding.jsonescape (indirectly) knows about localstr objects, and thus recovers
> the original UTF-8 text to encode if it exists.
Yes, but the localstr is mostly lost in the templater, and toutf8b() treats its
input as bytes, not as local-encoding text.
> > I have patch series to fix the issue4926, but I found my patch seems to have
> > the emoji issue right now.
>
> Whatever we do, we need to kill the second implementation of jsonescape in the
> templater.
Sure. My series will:
1. add an option to encoding.jsonescape() to escape all non-ASCII characters
2. add a "|utf8" template filter to explicitly convert localstr|str to UTF-8
3. change "|json" to take its input as utf8b bytes (BC)
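As a rough idea of what escaping all non-ASCII characters involves, including the emoji case mentioned earlier in the thread, here is a sketch in Python 3. The function escape_non_ascii is hypothetical and is not the code in my series; it also skips the quote, backslash and control-character escaping that a real JSON escaper needs.

```python
import json


def escape_non_ascii(text):
    """Return text with every non-ASCII code point written as \\uXXXX."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp < 0x10000:
            out.append("\\u%04x" % cp)
        else:
            # JSON has no escape for code points above U+FFFF, so they
            # must be written as a UTF-16 surrogate pair.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out.append("\\u%04x\\u%04x" % (high, low))
    return "".join(out)


escaped_latin = escape_non_ascii("\u00c0")
escaped_emoji = escape_non_ascii("\U0001F600")

# A JSON parser joins the surrogate pair back into one code point.
roundtrip = json.loads('"' + escaped_emoji + '"')
print(escaped_latin, escaped_emoji)
```

The output stays 7-bit ASCII, so it survives any ASCII-compatible page encoding.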