[PATCH 2 of 3] revset: transcode revsets to UTF-8

Matt Mackall mpm at selenic.com
Mon Nov 15 18:28:15 CST 2010


On Tue, 2010-11-16 at 00:43 +0100, Martin Geisler wrote:
> Matt Mackall <mpm at selenic.com> writes:
> 
> > On Fri, 2010-11-12 at 17:40 +0100, Dan Villiom Podlaski Christiansen
> > wrote:
> >> # HG changeset patch
> >> # User Dan Villiom Podlaski Christiansen <danchr at gmail.com>
> >> # Date 1289579971 -3600
> >> # Node ID 0fa148bcfe7f0755236e4b9d0034c5cc7ac4771d
> >> # Parent  bdf95be4ea789a13d088f0955ffcd072590a1eb6
> >> revset: transcode revsets to UTF-8.
> >> 
> >> This allows updating to a branch with non-ASCII names in non-UTF-8
> >> locales.
> >
> > This doesn't quite mesh with our encoding philosophy,
> 
> Well, it works just the same for commit:
> 
>   % echo >> a.txt && hg commit -m bøb
>   transaction abort!
>   rollback completed
>   abort: decoding near 'bøb': 'ascii' codec can't decode byte 0xf8 in
>   position 1: ordinal not in range(128)!
> 
>   % echo >> a.txt && LC_ALL=en_US.UTF-8 hg commit -m bøb
>   transaction abort!
>   rollback completed
>   abort: decoding near 'bøb': 'utf8' codec can't decode byte 0xf8 in
>   position 1: invalid start byte!
> 
>   % echo >> a.txt && LC_ALL=en_US.ISO8859-1 hg commit -m bøb
> 
> This has always seemed quite right to me: we take the bytes given by the
> user and decode them using his locale. If we cannot do this, then we
> abort and give the user a chance to fix things.

Uh, yes? The above matches precisely with this:

> > which can be summed up as "restrict encoding-aware code to the
> > smallest set possible". If revset can't look up non-ASCII branch names
> > in a Latin1 locale, then that means that branch lookup is broken, not
> > that revsets needs to become encoding-aware.

In particular, the only piece of code that gives a damn about
transcoding the commit message is this ONE LINE right here:

http://www.selenic.com/hg/file/cc4e13c92dfa/mercurial/changelog.py#l215

(Ok, there's a matching line on 177 for reading commits.)

Compare this to the alternative: transcoding in every single path where we
can receive a commit message from the user (import, mq, commit, commit
-m, rebase, etc.) and reversing it everywhere we show one (hgweb, log,
export, summary, etc.) and then think about this again:

"restrict encoding-aware code to the smallest set possible"

The inverse of this statement is:

"be encoding-agnostic wherever possible"
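
To make the single-boundary idea concrete, here is a minimal sketch. The names fromlocal, tolocal and ToyChangelog are illustrative assumptions for this example, not a copy of what changelog.py does. Only the storage layer knows about encodings; every caller (commit, import, log, and so on) just hands bytes in and gets bytes back.

```python
# Sketch only: fromlocal, tolocal and ToyChangelog are hypothetical names
# used for illustration, not Mercurial's actual implementation.

def fromlocal(data: bytes, encoding: str) -> bytes:
    """Convert bytes in the user's locale encoding to UTF-8, strictly."""
    try:
        return data.decode(encoding).encode("utf-8")
    except UnicodeDecodeError as err:
        # Abort and let the user fix things, as in the commit examples.
        raise ValueError("decoding near %r: %s" % (data, err))


def tolocal(data: bytes, encoding: str) -> bytes:
    """Convert stored UTF-8 bytes to the locale encoding.

    Characters the locale cannot represent are replaced with '?'.
    """
    return data.decode("utf-8").encode(encoding, "replace")


class ToyChangelog:
    """The only encoding-aware piece; callers pass bytes through untouched."""

    def __init__(self, encoding: str):
        self.encoding = encoding
        self._descs = []  # commit messages, always stored as UTF-8

    def add(self, desc: bytes) -> int:
        # The one place where writes are transcoded.
        self._descs.append(fromlocal(desc, self.encoding))
        return len(self._descs) - 1

    def read(self, rev: int) -> bytes:
        # The matching place where reads are transcoded back.
        return tolocal(self._descs[rev], self.encoding)
```

With the Latin-1 byte string b"b\xf8b" from the commit examples, add() succeeds for a "latin-1" changelog and raises for "ascii" and "utf-8", without any of the callers having to care.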

(By the way, this reminds me of something I recently spotted with
sys.setdefaultencoding("undefined"):

http://www.selenic.com/hg/file/cc4e13c92dfa/mercurial/minirst.py#l26

The substs table always consists (as it should) of non-Unicode ASCII
strings that get promoted to Unicode, so the transcoding is unnecessary.
If transcoding -were- necessary, this code would break, because the
default encoding for Unicode promotion is ASCII. Ergo, this code is
over-engineered.)
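
A rough illustration of the promotion argument, written for Python 3, where the implicit step has to be spelled out. The promote helper and the substs entries below are assumptions made up for this sketch; they are not the contents of minirst.py.

```python
# Sketch only: promote() emulates the implicit str -> unicode promotion
# described above (strict ASCII); the table entries are hypothetical.

def promote(raw: bytes) -> str:
    """Decode a byte string the way implicit promotion would: strict ASCII."""
    return raw.decode("ascii")


# Hypothetical substitution table made of plain ASCII byte strings.
substs = [(b"``", b'"'), (b"|version|", b"1.0")]

text = "see ``hg help``"
for old, new in substs:
    # Pure-ASCII entries promote cleanly, so no explicit transcoding
    # step is needed before using them on Unicode text.
    text = text.replace(promote(old), promote(new))

try:
    # A non-ASCII entry would fail at promotion time regardless.
    promote("caf\u00e9".encode("utf-8"))
    non_ascii_ok = True
except UnicodeDecodeError:
    non_ascii_ok = False
```

Here text ends up as 'see "hg help"' and non_ascii_ok is False: ASCII entries need no help, and non-ASCII ones would break either way.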

> > Related: how should lookup work for names that can't be represented in
> > the local charset? Answer: if hg branches shows "caf?" rather
> > than "café", then I should be able to "hg up caf?".
> 
> That sounds bad to me -- the immediate question that arises is what to
> do if there is a branch named 'caf?' with a "real" question mark?

Bah. You're being a purist. The intersection of users using ? (Q) and
non-ASCII names (U) is going to be negligible, because both sets will be
pretty small. And the number of collisions those users experience is
going to be vanishingly small (C). The utility of checking out non-ASCII
branchnames is larger by definition: U > Q and U > C.

Here's how tag currently works if there's a collision:

  $ hg tag café
  $ hg tag caf\?
  $ hg tags
  tip                                2:0ab081bedf6a
  caf?                               1:7756a54706b8
  café                               0:170468a5c0e1
  $ LC_CTYPE=C hg tags
  tip                                2:0ab081bedf6a
  caf?                               0:170468a5c0e1
  $ LC_CTYPE=C hg log -r 'caf?'
  changeset:   0:170468a5c0e1
  tag:         caf?
  user:        Matt Mackall <mpm at selenic.com>
  date:        Mon Nov 15 18:10:59 2010 -0600
  summary:     0

That's not ideal: we should probably list caf? twice. But again, no
one's going to encounter this in real life. What's important is that we
can check out such changesets.
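
A minimal sketch of that lookup rule: a name matches if it equals what the user would see in their locale. The function names and the in-memory tag table are assumptions for illustration only, not how tag lookup is actually implemented. Unlike the transcript above, this version returns every match, which is roughly what listing caf? twice would imply.

```python
# Sketch only: tolocal() and lookup() are hypothetical helpers; names are
# assumed to be stored as UTF-8 bytes mapped to revision numbers.

def tolocal(name_utf8: bytes, encoding: str) -> bytes:
    """Render a stored UTF-8 name in the locale, using '?' where needed."""
    return name_utf8.decode("utf-8").encode(encoding, "replace")


def lookup(tags: dict, wanted: bytes, encoding: str) -> list:
    """Return all revisions whose displayed name equals `wanted`."""
    matches = []
    for name, rev in tags.items():
        # Compare against the name as the user would see it listed.
        if tolocal(name, encoding) == wanted:
            matches.append(rev)
    return sorted(matches)


# The two tags from the example above.
tags = {"caf\u00e9".encode("utf-8"): 0, b"caf?": 1}

in_c_locale = lookup(tags, b"caf?", "ascii")
in_utf8_literal = lookup(tags, b"caf?", "utf-8")
in_utf8_accented = lookup(tags, "caf\u00e9".encode("utf-8"), "utf-8")
```

In an ASCII locale both tags display as caf?, so in_c_locale is [0, 1]; in a UTF-8 locale the two names stay distinct and each lookup returns a single revision.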

-- 
Mathematics is the supreme nostalgia of our time.



