Storage format for remotenames.

Gregory Szorc gregory.szorc at gmail.com
Thu Nov 9 01:59:10 EST 2017



> On Nov 8, 2017, at 06:00, Yuya Nishihara <yuya at tcha.org> wrote:
> 
>> On Tue, 7 Nov 2017 09:58:04 -0800, Durham Goode wrote:
>> I wish we had some easily reusable serializer/deserializer instead of 
>> having to reinvent these every time.  What's our reasoning for not using 
>> json? I forget. If there are some weird characters, like control 
>> characters or something, that break json, I'd say we just use json and 
>> prevent users from creating bookmarks and paths with those names.
> 
> Just about json. Using json (in Mercurial) is bad because it's so easy to
> spill out unicode objects without noticing: all tests pass (because the
> whole data is ascii), and then we get a nice UnicodeError in production.

This issue can be prevented with diligent coding (and possibly a custom wrapper that converts unicode to bytes). Python 3 would also uncover the implicit type coercion, since it refuses to mix bytes and str silently.
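
As a rough illustration of the kind of wrapper I mean, here is a minimal sketch. It assumes Python 3 and UTF-8, and the names `_to_bytes` and `loads_bytes` are made up for this example; nothing like this exists in Mercurial under these names.

```python
import json


def _to_bytes(value, encoding="utf-8"):
    """Recursively convert every str in decoded JSON data to bytes."""
    if isinstance(value, str):
        return value.encode(encoding)
    if isinstance(value, list):
        return [_to_bytes(item, encoding) for item in value]
    if isinstance(value, dict):
        # JSON object keys are always strings, so they are converted too.
        return {
            _to_bytes(key, encoding): _to_bytes(item, encoding)
            for key, item in value.items()
        }
    # Numbers, booleans and None pass through unchanged.
    return value


def loads_bytes(data):
    """Hypothetical json.loads wrapper that never hands back str objects."""
    return _to_bytes(json.loads(data))
```

Callers then only ever see bytes, so a stray text object cannot leak into the rest of the code. This does nothing for the lossy-conversion concern, though: it only works when the stored data is valid in the chosen encoding.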

> 
> Another concern is that encoding conversion can be lossy even if it goes
> with no error. There are n:m mappings between unicode and legacy encodings.
> For example, we have three major Shift_JISes in Japan, and the Microsoft one
> allocates multiple code points for one character for "compatibility" reasons.

This is the bigger problem. JSON doesn’t do a good job of preserving byte sequences unless the strings are valid UTF-8. The most common way to robustly round trip arbitrary byte sequences through JSON is to encode string fields into a form that never needs escaping in JSON. Base64 is common.
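
A minimal sketch of the Base64 approach, assuming Python 3 and a flat record of byte-string fields. The helper names `dump_record` and `load_record` are invented for this example and are not an existing API.

```python
import base64
import json


def dump_record(fields):
    """Serialize a dict of str -> bytes to JSON, Base64-encoding each value."""
    encoded = {
        key: base64.b64encode(value).decode("ascii")
        for key, value in fields.items()
    }
    return json.dumps(encoded, sort_keys=True)


def load_record(text):
    """Inverse of dump_record: decode each Base64 value back to raw bytes."""
    return {
        key: base64.b64decode(value)
        for key, value in json.loads(text).items()
    }


# The author bytes below are Shift_JIS and are not valid UTF-8.
record = {"author": b"\x93\xfa\x96\x7b <user@example.com>"}
assert load_record(dump_record(record)) == record
```

The bytes come back exactly as stored because no encoding conversion ever happens. The cost is that the JSON is no longer human readable and grows by roughly a third.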

Avoiding code points that need to be escaped in JSON seems reasonable for some use cases. For things like storing the author field in obs markers, it is not.

I’d just as soon we vendor and use a binary serialization format like Protobuf, Thrift, Cap’n Proto, MessagePack, Avro, etc. Bonus points if Rust’s serde crate can parse it with zero-copy deserialization.

