[PATCH 4 of 4] changelog: lazy decode user (API)

Matt Mackall mpm at selenic.com
Tue Mar 1 14:22:20 EST 2016

On Sat, 2016-02-27 at 23:27 -0800, Gregory Szorc wrote:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc at gmail.com>
> # Date 1456641258 28800
> #      Sat Feb 27 22:34:18 2016 -0800
> # Node ID ee98b780730118e8a8948396507633a0460c154e
> # Parent  8427442ba08dd8dc324ea9e1fd30f65c89b2b753
> changelog: lazy decode user (API)
> This appears to show a similar speedup as the previous patch.

These two scare me (and are against our encoding conventions).

I like the idea of being lazy here and I've definitely seen the hit for this in
profiles, but I worry that this will leak utf-8 data to users that are expecting
local strings and we won't discover the problem until some end user runs it on a
non-utf-8 system months down the road.
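
A minimal illustration of that failure mode, written as standalone Python 3
rather than Mercurial code. The user name and the latin-1 "local" encoding are
assumptions picked for the example; the changelog stores such fields as UTF-8
bytes.

```python
# Assumption for illustration: a user field as stored in the changelog (UTF-8 bytes).
stored_user = "André".encode("utf-8")          # b'Andr\xc3\xa9'

# Assumption: the end user's system encoding is latin-1, not UTF-8.
local_encoding = "latin-1"

# Leak: the stored bytes are handed out as if they were already local bytes.
# A latin-1 terminal then renders the two UTF-8 bytes of "é" as two characters.
leaked = stored_user.decode(local_encoding)

# Convention: decode from UTF-8 once, then re-encode to the local encoding.
local_user = stored_user.decode("utf-8").encode(local_encoding)

print(repr(leaked))       # mojibake on a latin-1 system
print(repr(local_user))   # bytes a latin-1 caller can display correctly
```

On a UTF-8 system both paths display the same text, which is why a leak like
this can go unnoticed until someone runs on a different encoding.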

Because these sorts of encoding confusions are very hard to keep track of in a
weakly-typed system, our rule has always been: limit the system's exposure to
the secondary types as far as possible. That is why ALL changelog
encoding/decoding is handled today in just a couple of functions in
changelog.py, and we mostly don't have to think about it.

If we want to go further down the lazy road, I can imagine shimming in a
lightweight class (still in changelog.py) to replace the return tuple. It can be
initialized with the raw changeset data and have accessor methods or members to
unpack/decode the pieces. Then we'll still be lazy without subtly changing the
types of the legacy API that we're still using all over the place.
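
A rough sketch of what such a class might look like, in standalone Python 3.
Everything here is hypothetical: the class name, the `to_local` helper (a
stand-in for the real conversion in Mercurial's encoding module), the
simplified entry layout (manifest, user, date, files, blank line, description)
and the latin-1 default are assumptions for illustration only.

```python
def to_local(raw, local_encoding):
    """Stand-in helper: convert stored UTF-8 bytes to local-encoding bytes."""
    return raw.decode("utf-8").encode(local_encoding, "replace")


class LazyChangeset:
    """Hypothetical lazy view over one raw changelog entry.

    Nothing is split or decoded until an accessor is first used, and every
    text field is returned in the local encoding, so callers never see the
    stored UTF-8 bytes.
    """

    def __init__(self, raw, local_encoding="latin-1"):
        self._raw = raw
        self._local_encoding = local_encoding
        self._header = None   # header lines, split on first access
        self._user = None     # decoded user, cached on first access

    def _headerlines(self):
        if self._header is None:
            header, _sep, _desc = self._raw.partition(b"\n\n")
            self._header = header.split(b"\n")
        return self._header

    @property
    def manifest(self):
        # Hex node of the manifest; plain ASCII, so no decoding is needed.
        return self._headerlines()[0]

    @property
    def user(self):
        if self._user is None:
            self._user = to_local(self._headerlines()[1], self._local_encoding)
        return self._user

    @property
    def description(self):
        _header, _sep, desc = self._raw.partition(b"\n\n")
        return to_local(desc, self._local_encoding)


# Usage with a made-up entry: only the fields that are touched get decoded.
raw_entry = (b"abc123\n"
             b"Andr\xc3\xa9 <andre@example.com>\n"
             b"0 0\n"
             b"file.txt\n"
             b"\n"
             b"fix caf\xc3\xa9 menu")
cs = LazyChangeset(raw_entry)
print(cs.user)
```

Because the conversion still lives in one place, callers keep getting local
strings and only pay for the fields they actually read.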

Or, we can leave changelog.read() alone and add another entry point that
contexts can use for lazy parsing.

Mathematics is the supreme nostalgia of our time.
