[PATCH 2 of 3] revset: transcode revsets to UTF-8

Tue Nov 16 02:57:01 CST 2010

Matt Mackall <mpm at selenic.com> writes:

> On Tue, 2010-11-16 at 00:43 +0100, Martin Geisler wrote:
>
>> Well, it works just the same for commit:
>> 
>>   % echo >> a.txt && hg commit -m bøb
>>   transaction abort!
>>   rollback completed
>>   abort: decoding near 'bøb': 'ascii' codec can't decode byte 0xf8 in
>>   position 1: ordinal not in range(128)!
>> 
>>   % echo >> a.txt && LC_ALL=en_US.UTF-8 hg commit -m bøb
>>   transaction abort!
>>   rollback completed
>>   abort: decoding near 'bøb': 'utf8' codec can't decode byte 0xf8 in
>>   position 1: invalid start byte!
>> 
>>   % echo >> a.txt && LC_ALL=en_US.ISO8859-1 hg commit -m bøb
>> 
>> This has always seemed quite right to me: we take the bytes given by
>> the user and decode them using his locale. If we cannot do this, then
>> we abort and give the user a chance to fix things.
>
> Uh, yes? The above matches precisely with this:
>
>> > which can be summed up as "restrict encoding-aware code to the
>> > smallest set possible". If revset can't look up non-ASCII branch
>> > names in a Latin1 locale, then that means that branch lookup is
>> > broken, not that revsets needs to become encoding-aware.

I thought you meant that it should be possible to look up non-ASCII
branch names in an ASCII locale. That was what I tried to illustrate
above: if my locale is X and I enter a commit message in an incompatible
encoding Y, then I get an error.

That is how I expect it to work and I thought Dan's patch made it work
the same for revsets so that 'branch(bøb)' raises an error when you are
in an ASCII locale.
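
To illustrate the behaviour I expect, here is a minimal Python 3 sketch
(not Mercurial code; the helper name decode_with_locale is made up for
this example). It decodes the bytes a Latin-1 terminal would send for
'bøb' under three locale encodings and reports a failure instead of
guessing:

```python
# Bytes a Latin-1 terminal sends for 'bøb' (ø is 0xf8 in ISO-8859-1).
raw = b"b\xf8b"


def decode_with_locale(data, encoding):
    """Decode user-supplied bytes with the given locale encoding.

    Returns the decoded text, or an 'abort: ...' message if the bytes
    are not valid in that encoding. (Hypothetical helper.)
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "abort: %s" % exc


# ASCII and UTF-8 both reject the byte 0xf8; ISO-8859-1 accepts it.
for enc in ("ascii", "utf-8", "iso8859-1"):
    print(enc, "->", decode_with_locale(raw, enc))
```

The same check applied to a revset argument such as 'branch(bøb)' would
fail in an ASCII locale in the same way.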

> In particular, the only piece of code that gives a damn about
> transcoding the commit message is this ONE LINE right here:
>
> http://www.selenic.com/hg/file/cc4e13c92dfa/mercurial/changelog.py#l215
>
> (Ok, there's a matching line on 177 for reading commits.)
>
> Compare this to alternately transcoding in every single path where we
> can receive a commit message from the user (import, mq, commit, commit
> -m, rebase, etc.) and reversing it everywhere we show one (hgweb, log,
> export, summary, etc.) and then think about this again:
>
> "restrict encoding-aware code to the smallest set possible"

Yes, of course -- of course I agree that there should be only a few
places that are responsible for decoding the user's bytes.
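
As a rough sketch of that idea in Python 3: assume everything is stored
as UTF-8 and only two functions know about the locale. The names
from_local and to_local and the INTERNAL constant are hypothetical,
chosen for this example rather than taken from changelog.py:

```python
INTERNAL = "utf-8"  # assumption: the encoding used for stored data


def from_local(data, local_encoding):
    """Transcode user input from the local encoding to the internal one.

    Raises ValueError if the bytes are not valid in the local encoding.
    """
    try:
        return data.decode(local_encoding).encode(INTERNAL)
    except UnicodeDecodeError as exc:
        raise ValueError("decoding near %r: %s" % (data, exc))


def to_local(data, local_encoding):
    """Transcode stored data for display in the local encoding.

    Characters the local encoding cannot represent become '?'.
    """
    return data.decode(INTERNAL).encode(local_encoding, "replace")


stored = from_local(b"b\xf8b", "iso8859-1")  # input boundary
shown = to_local(stored, "ascii")            # output boundary
```

Everything between those two calls can treat the data as opaque bytes,
which is what keeps the rest of the code encoding-agnostic.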

> The inverse of this statement is:
>
> "be encoding-agnostic wherever possible"
>
> (By the way, this reminds me of something I recently spotted with
> sys.setdefaultencoding("undefined"):
>
> http://www.selenic.com/hg/file/cc4e13c92dfa/mercurial/minirst.py#l26
>
> The substs table always consists (as it should) of non-Unicode ASCII
> strings that get promoted to Unicode, so the transcoding is
> unnecessary. If transcoding -were- necessary, this code would break,
> because the default encoding for Unicode promotion is ASCII. Ergo,
> this code is over-engineered.)
>
>> > Related: how should lookup work for names that can't be represented
>> > in the local charset? Answer: if hg branches shows "caf?"
>> > rather than "café", then I should be able to "hg up caf?".
>> 
>> That sounds bad to me -- the immediate question that arises is what
>> to do if there is a branch named 'caf?' with a "real" question mark?
>
> Bah. You're being a purist.

I just want to start by making things correct... it feels wrong to me
that we would start guessing what the user really meant.
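
The ambiguity can be shown with a small Python 3 sketch. It assumes the
lossy display form is produced by replacing unrepresentable characters
with '?'; display_name and candidates are hypothetical helpers, not
Mercurial functions:

```python
def display_name(name, local_encoding="ascii"):
    """Return the name as it would be shown in the given locale,
    with unrepresentable characters replaced by '?'."""
    return name.encode(local_encoding, "replace").decode(local_encoding)


def candidates(typed, names, local_encoding="ascii"):
    """Return every real name whose display form equals what was typed."""
    return [n for n in names if display_name(n, local_encoding) == typed]


# Two distinct branch names: one with U+00E9, one with a literal '?'.
names = ["caf\u00e9", "caf?"]
matches = candidates("caf?", names)
print(matches)
```

In an ASCII locale both names match the typed 'caf?', so a lookup by
display form has to pick one of them.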

> The intersection of users using ? (Q) and non-ASCII names (U) is going
> to be negligible, because both sets will be pretty small. And the
> number of collisions those users experience is going to be vanishingly
> small (C). The utility of checking out non-ASCII branchnames is larger
> by definition: U > Q and U > C.

I'm not sure what these inequalities are supposed to tell me.

> Here's how tag currently works if there's a collision:
>
>   $ hg tag café
>   $ hg tag caf\?
>   $ hg tags
>   tip                                2:0ab081bedf6a
>   caf?                               1:7756a54706b8
>   café                               0:170468a5c0e1
>   $ LC_CTYPE=C hg tags
>   tip                                2:0ab081bedf6a
>   caf?                               0:170468a5c0e1
>   $ LC_CTYPE=C hg log -r 'caf?'
>   changeset:   0:170468a5c0e1
>   tag:         caf?
>   user:        Matt Mackall <mpm at selenic.com>
>   date:        Mon Nov 15 18:10:59 2010 -0600
>   summary:     0
>
> That's not ideal: we should probably list caf? twice. But again, no
> one's going to encounter this in real life. What's important is that
> we can check out such changesets.

Yes, I agree we should output 'caf?' twice -- I'm surprised the tag is
hidden like that.

As for checking out the tags: if we show both tags, then the user can
always use the changeset hash to refer to the changeset in question.
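
A small Python 3 sketch of the difference, using the two tags and hashes
from the listing above. It is only an illustration of how an entry can
disappear when names are keyed by their localized form; I am not
claiming this is what the tags code actually does, and localize is a
hypothetical helper:

```python
def localize(name, local_encoding="ascii"):
    """Lossy display form: unrepresentable characters become '?'."""
    return name.encode(local_encoding, "replace").decode(local_encoding)


# (tag name, changeset hash) pairs taken from the example above.
tags = [("caf?", "7756a54706b8"), ("caf\u00e9", "170468a5c0e1")]

# Keyed by display name: the two tags collapse into a single entry.
by_name = {localize(name): node for name, node in tags}

# Kept as a list: both tags are shown, each next to its own hash.
listing = [(localize(name), node) for name, node in tags]

for shown, node in listing:
    print("%-10s %s" % (shown, node))
```

With the list form the user sees 'caf?' twice and can pick the right
changeset by its hash.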

Tag and branch names are metadata and we allow the full Unicode spectrum
there. That also implies that there will be tags that cannot be
displayed on all systems due to lack of fonts or because of locale
settings. I think that's fine since the number of users who end up in
these corner cases is vanishingly small, as you note above.

-- 
Martin Geisler

aragost Trifork
Professional Mercurial support
http://mercurial.aragost.com/kick-start/