[PATCH 2 of 3] revset: transcode revsets to UTF-8

Martin Geisler mg at lazybytes.net
Mon Nov 15 17:43:35 CST 2010


Matt Mackall <mpm at selenic.com> writes:

> On Fri, 2010-11-12 at 17:40 +0100, Dan Villiom Podlaski Christiansen
> wrote:
>> # HG changeset patch
>> # User Dan Villiom Podlaski Christiansen <danchr at gmail.com>
>> # Date 1289579971 -3600
>> # Node ID 0fa148bcfe7f0755236e4b9d0034c5cc7ac4771d
>> # Parent  bdf95be4ea789a13d088f0955ffcd072590a1eb6
>> revset: transcode revsets to UTF-8.
>> 
>> This allows updating to a branch with non-ASCII names in non-UTF-8
>> locales.
>
> This doesn't quite mesh with our encoding philosophy,

Well, it works just the same for commit:

  % echo >> a.txt && hg commit -m bøb
  transaction abort!
  rollback completed
  abort: decoding near 'bøb': 'ascii' codec can't decode byte 0xf8 in
  position 1: ordinal not in range(128)!

  % echo >> a.txt && LC_ALL=en_US.UTF-8 hg commit -m bøb
  transaction abort!
  rollback completed
  abort: decoding near 'bøb': 'utf8' codec can't decode byte 0xf8 in
  position 1: invalid start byte!

  % echo >> a.txt && LC_ALL=en_US.ISO8859-1 hg commit -m bøb

This has always seemed quite right to me: we take the bytes given by the
user and decode them using his locale. If we cannot do this, then we
abort and give the user a chance to fix things.
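
A minimal sketch of that rule in Python, for illustration only. This is
not Mercurial's actual code: the names Abort and to_utf8 are
hypothetical, and the encoding is passed in explicitly instead of being
read from the environment.

```python
class Abort(Exception):
    """Hypothetical stand-in for a user-facing abort."""


def to_utf8(raw: bytes, encoding: str) -> bytes:
    """Decode user-supplied bytes with the locale's encoding and
    re-encode them as UTF-8. Abort if the bytes are not valid in
    that encoding, so the user gets a chance to fix things."""
    try:
        return raw.decode(encoding).encode("utf-8")
    except UnicodeDecodeError as err:
        raise Abort("decoding near %r: %s" % (raw, err))


# 'bøb' as typed in an ISO-8859-1 terminal: ø is the single byte 0xf8.
message = b"b\xf8b"

for enc in ("ascii", "utf-8", "iso8859-1"):
    try:
        print(enc, "->", to_utf8(message, enc))
    except Abort as exc:
        print(enc, "-> abort:", exc)
```

Only the ISO-8859-1 case succeeds, which matches the three commit
attempts above: the byte 0xf8 is neither ASCII nor a valid UTF-8 start
byte.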

> which can be summed up as "restrict encoding-aware code to the
> smallest set possible". If revset can't look up non-ASCII branch names
> in a Latin1 locale, then that means that branch lookup is broken, not
> that revsets need to become encoding-aware.
>
> Related: how should lookup work for names that can't be represented in
> the local charset? Answer: if hg branches shows "caf?" rather
> than "café", then I should be able to "hg up caf?".

That sounds bad to me -- the immediate question that arises is what to
do if there is a branch named 'caf?' with a "real" question mark?

I think the current behavior is fine: we make a best effort when
converting the metadata into the user's local encoding, and we degrade
gracefully by letting Python substitute characters outside of the
encoding with '?'.
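
To make the ambiguity concrete, here is a small Python sketch of such a
lossy conversion. It assumes names are stored as UTF-8 and uses a
hypothetical helper, to_local, that is not Mercurial's actual function.

```python
def to_local(name_utf8: bytes, encoding: str) -> bytes:
    """Convert a UTF-8 name to the local encoding for display.
    Characters the encoding cannot represent become '?'
    (Python's "replace" error handler)."""
    return name_utf8.decode("utf-8").encode(encoding, "replace")


cafe = "café".encode("utf-8")   # branch with a non-ASCII name
literal = b"caf?"               # branch with a real question mark

# In a Latin-1 locale the two names stay distinct.
assert to_local(cafe, "iso8859-1") != to_local(literal, "iso8859-1")

# In an ASCII locale both display identically, so looking up the
# displayed form 'caf?' could not tell the two branches apart.
assert to_local(cafe, "ascii") == to_local(literal, "ascii") == b"caf?"
```

The substitution is fine for display, but it is not reversible, which
is why using the displayed form as a lookup key worries me.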

-- 
Martin Geisler

Mercurial links: http://mercurial.ch/