[PATCH 02 of 10] localrepo: bytes for errors

Thu May 12 00:47:25 EDT 2016

On Thu, 2016-05-12 at 01:23 +0000, timeless wrote:
> # HG changeset patch
> # User timeless <timeless at mozdev.org>
> # Date 1461035348 0
> #      Tue Apr 19 03:09:08 2016 +0000
> # Node ID 0669c0d7de92c4ef5207272ad17d1126338aa985
> # Parent  c3476399e7a64ea562a2748d010aff57a884481f
> # Available At bb://timeless/mercurial-crew
> #              hg pull bb://timeless/mercurial-crew -r 0669c0d7de92
> localrepo: bytes for errors
> 
> diff -r c3476399e7a6 -r 0669c0d7de92 mercurial/localrepo.py
> --- a/mercurial/localrepo.py	Tue Apr 19 14:34:11 2016 +0000
> +++ b/mercurial/localrepo.py	Tue Apr 19 03:09:08 2016 +0000
> @@ -294,9 +294,9 @@
>                          b' dummy changelog to prevent using the old repo layout'
>                      )
>              else:
> -                raise error.RepoError(_("repository %s not found") % path)
> +                raise error.RepoError(_("repository %s not found") % path.encode('utf-8'))

This violates our encoding strategy:

https://www.mercurial-scm.org/wiki/EncodingStrategy

(In particular, very very little outside of encoding.py should use the letters
"utf" outside of comments.)

It's also very, very broken.

First, path comes from the environment. And in the Unix environment, it can be
basically any sequence of bytes not containing null. It can be in any encoding,
multiple encodings[1], or even no encoding[2]. Unix doesn't know or care and
therefore Mercurial doesn't know or care either. Which means it can't
meaningfully "encode a path" to UTF-8. Never mind that it's going to immediately
print the string in a locale that might not be UTF-8 either.
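
To make the first point concrete, here is a minimal sketch (written for
Python 3, where bytes and text are distinct types). The file names and the
is_valid_utf8 helper are hypothetical, made up for illustration; nothing here
is Mercurial code.

```python
# Two hypothetical file names, as the raw bytes a Unix kernel would store.
latin1_name = "café".encode("latin-1")   # b'caf\xe9'
utf8_name = "café".encode("utf-8")       # b'caf\xc3\xa9'

# One path can mix both, e.g. directories created under different locales.
mixed_path = b"/music/" + latin1_name + b"/" + utf8_name


def is_valid_utf8(raw):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Only the second name is UTF-8; the mixed path as a whole is not.
results = {
    "latin1_name": is_valid_utf8(latin1_name),
    "utf8_name": is_valid_utf8(utf8_name),
    "mixed_path": is_valid_utf8(mixed_path),
}
```

Since the path is just bytes of unknown provenance, there is no set of
characters to encode in the first place; the only safe operation is to pass the
bytes through untouched.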

Second, str.encode(x) is a method that shouldn't exist because encoding is a
transformation from characters to bytes and a str is already bytes. But worse
than that, it's also broken, because it's effectively
str.decode('ascii').encode(x). So it'll usually be a no-op in naive tests and
then explode horribly (with a UnicodeDecodeError) when it first encounters
non-ASCII bytes in the real world. It's basically a trap, and it's probably the
third-worst Unicode design mistake in Python[3].
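
A rough sketch of that trap follows. It is written for Python 3, where bytes
has no encode() method, so the Python 2 behavior is emulated by a hypothetical
helper, py2_str_encode, that spells out the implicit ASCII decode. The example
paths are made up.

```python
def py2_str_encode(raw, encoding):
    """Emulate Python 2's str.encode(): implicit ASCII decode, then encode."""
    return raw.decode("ascii").encode(encoding)


# The naive test: a pure-ASCII path round-trips unchanged, so the call
# looks harmless.
ascii_path = b"/home/user/repo"
ascii_result = py2_str_encode(ascii_path, "utf-8")

# The real world: a path that is already valid UTF-8 bytes still fails,
# because the hidden decode step only accepts ASCII.
non_ascii_path = b"/home/user/caf\xc3\xa9"
failure = None
try:
    py2_str_encode(non_ascii_path, "utf-8")
except UnicodeDecodeError as exc:
    failure = exc
```

Note that the failure is a *decode* error raised from a call named encode,
which is a large part of why this is so confusing to debug.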

Also, be aware that I'm going to have a very dim view of bulk "" -> b"" patches.
There are 100x more bare strings than xrange calls, and we've already been
there.

[1] You should see my MP3 collection of German industrial music ripped circa
2002.
[2] Back in the days of university computer systems with tiny quotas, you could
often cheat by storing *binaries* in filenames or symlinks.
[3] The second is automatic casting between bytes and unicode.
-- 
Mathematics is the supreme nostalgia of our time.