[PATCH 2 of 6] changelog: branchinfo is a low level function - it should return UTF-8

Matt Mackall mpm at selenic.com
Tue Jan 6 14:45:31 CST 2015


On Tue, 2015-01-06 at 17:11 +0100, Mads Kiilerich wrote:
> On 12/22/2014 08:51 PM, Matt Mackall wrote:
> > On Sun, 2014-12-14 at 19:34 +0100, Mads Kiilerich wrote:
> >> # HG changeset patch
> >> # User Mads Kiilerich <madski at unity3d.com>
> >> # Date 1418581678 -3600
> >> #      Sun Dec 14 19:27:58 2014 +0100
> >> # Node ID 4f99561d776947cc3911b645c485be1bc2cf0c1f
> >> # Parent  d35c8e8e2c7f980853fc4da0f0e251f8ca197d43
> >> changelog: branchinfo is a low level function - it should return UTF-8
> >>
> >> This will make it possible to persist the utf8 value in caches, without having
> >> to decode it from the local encoding and risk case folding issues.
> >>
> >> Stuff is renamed to make it more obvious that the type is different.
> >>
> >> This refactoring seems to expose an existing problem in branchmap with local
> >> encoding being used for computing the branchmap cache.
> > You know that there is localstr magic here, right?
> >
> > http://mercurial.selenic.com/wiki/EncodingStrategy#Round-trip_conversion
> 
> I was aware of the magic but not to which extent it prevented any actual 
> problem here. I agree that it seems good enough to avoid problems in the 
> current branchmap implementation.
> 
> The localstr magic also seems fragile and a bit inconsistent:
> 
>  >>> u = encoding.localstr('u', 'l')
>  >>> U = encoding.localstr('U', 'l')

The only caller of the constructor is encoding.tolocal(utf8), which
doesn't allow this to happen.

> The localstr magic is fine for interaction with the UI layer, but it 
> seems like a layering violation and obviously wrong to use it for 
> internal processing that use local encoding for both input and output.

We only use localstr for stuff that's coming from UTF-8. Like branch
names, which are always stored on disk in UTF-8, but the user always
specifies in the local encoding.

The evil here is a) not everyone uses UTF-8 yet and b) UTF-8 can't be
round-tripped to other encodings. Localstr gives the illusion that it
can and lets us work in the most convenient encoding with a bare minimum
number of conversion points.

Localstr is such a huge design win that it works without people even
knowing it's there. Compare that to the usual model of encoding issues
that people don't even think about it until their code explodes on the
first encounter with a non-ASCII character in production. If I were
creating a language, some form of the idea would be in the core library.

> Converting the branch name of every single revision to local encoding 
> when we don't need it also seems unfortunate when we want to optimize 
> everything that is looping over every single revision.

Are you proposing to make the conversion lazy? We could totally do
that.</smiley>

-- 
Mathematics is the supreme nostalgia of our time.




More information about the Mercurial-devel mailing list