[PATCH 0 of 6] Improve readability of non-ascii hg emails (issue814)

Mon Jul 14 11:53:36 CDT 2008

On Mon, 2008-07-14 at 12:09 +0100, Christian Ebert wrote:

> >> Patches must be kept independent of conventions between sender
> >> and recipient. They are sent in ascii, utf-8, or as fake ascii
> >> (current behaviour; see also TODO). utf-8 is safe to detect.
> > 
> > Ok, so that means when we send an inline patch, we'll send the
> > description in the same form as the patch, possibly promoting the patch?
> > Alright, so that suggests this table:
> > 
> >                       inline patch
> > description    ascii        utf-8        other
> > ascii          ascii        utf-8        fake-ascii
> > utf-8          utf-8        utf-8        ??
> > 
> > So if someone checks in a file with latin-1 (aka other) and latin-1
> > description (converted to utf-8), what happens? Do we call the message
> > ascii? Do we transcode our utf-8 back to ascii? Or do we put the utf-8
> > description in the message body and still call it ascii? 
>
> >> 2. Mail parts that do not contain patches
> >> 
> >> Introduce new [email] sendcharsets config (default:
> >> util._encoding). us-ascii is always implied and tried first.
> > 
> > Does this interact with the above? Really, inline patches is the
> > interesting piece of this puzzle.
> 
> Indeed. You're going right for the interesting hairy stuff.

That's because the rest of it is practically trivial. Everything but the
patch itself is in a known encoding (utf-8) and it's a simple matter of
programming to put it in an email. 
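
A minimal sketch of that easy part, assuming the text is held as utf-8
bytes and that sendcharsets is an ordered list of candidate charsets.
The helper names pick_charset and make_part are hypothetical and are
not Mercurial's API; the ("utf-8",) default is also an assumption (the
proposal above defaults to util._encoding).

```python
from email.mime.text import MIMEText


def pick_charset(utf8_bytes, sendcharsets=("utf-8",)):
    """Return (charset, encoded_bytes) for the first charset that fits.

    us-ascii is always tried first, then each entry of sendcharsets
    (a hypothetical stand-in for the proposed [email] sendcharsets).
    """
    text = utf8_bytes.decode("utf-8")
    for charset in ("us-ascii",) + tuple(sendcharsets):
        try:
            return charset, text.encode(charset)
        except UnicodeEncodeError:
            continue
    # Nothing in the list could represent the text: keep it as utf-8.
    return "utf-8", utf8_bytes


def make_part(utf8_bytes, sendcharsets=("utf-8",)):
    """Build a text/plain MIME part labelled with the chosen charset."""
    charset, _ = pick_charset(utf8_bytes, sendcharsets)
    return MIMEText(utf8_bytes.decode("utf-8"), "plain", charset)
```

This only covers parts whose encoding is known. It says nothing about
the patch body, which is the hard case discussed next.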

When we look at inline patches, we have to answer some hard questions.
And hopefully our answers to the easy questions haven't painted us into
a corner. So let's look back at my table:

> >                       inline patch
> > description    ascii        utf-8        other
> > ascii          ascii        utf-8        fake-ascii
> > utf-8          utf-8        utf-8        ??

What should happen in the corner? First, let's note that there are a
bunch of things that shouldn't happen. If we're sending a message and we
pick latin1 as the encoding because the author's name had a ü in it, we
shouldn't try to encode the patch as latin1. The patch may in fact be
koi8, the receiving user may in fact be using utf-8, and his mailer may
helpfully save the patch in utf-8, at which point the content is now
very wrong. Second, we probably can't do fake-utf-8 because mailers are
quite likely to do the wrong thing or choke. Can we do better than
fake-ascii? Probably not. Should we transcode the description text from
utf-8? Maybe?
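
The table can be written down as a lookup, which makes the open corner
explicit. This is only a sketch: classify, TABLE and message_form are
hypothetical names, and the detection step leans on the premise quoted
above that utf-8 is safe to detect. A non-utf-8 file whose bytes happen
to be valid utf-8 would be misclassified.

```python
def classify(data):
    """Classify raw bytes as 'ascii', 'utf-8', or 'other'."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"


# (description form, patch form) -> form of the message.
# None marks the undecided "??" corner of the table.
TABLE = {
    ("ascii", "ascii"): "ascii",
    ("ascii", "utf-8"): "utf-8",
    ("ascii", "other"): "fake-ascii",
    ("utf-8", "ascii"): "utf-8",
    ("utf-8", "utf-8"): "utf-8",
    ("utf-8", "other"): None,
}


def message_form(description, patch):
    """Look up the message form; description is assumed to be utf-8 bytes."""
    return TABLE[(classify(description), classify(patch))]
```

For example, a utf-8 description with a latin1 patch returns None,
which is exactly the case still under discussion.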

> I might be wrong but eg. trying util._encoding for an 8bit
> patch that's not utf-8 is an optimistic assumption. The patch
> might have been made with a different 8bit _encoding. Or the
> "mixed" case above.

Yep, we should definitely not assume anything about the contents of a
patch. In fact, guessing utf-8 may also be trouble. Consider: I have a
file called utf-8-example.txt. I send you a patch to add it to your
repo. Your mailer is set to use latin1 by default. If hg mail marks the
text as utf-8, your mailer may helpfully transcode it to latin1 and then
we'll later discover that utf-8-example.txt is actually in latin1 on
your machine. Fail. On the other hand, if we lie and claim it's binary,
it'll be much harder to read.
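
Both failure modes can be reproduced with a toy model of a "helpful"
mailer. The function mailer_save is hypothetical: it assumes the mailer
decodes the payload with the declared charset and writes it back out in
its own local charset.

```python
def mailer_save(payload, declared, local):
    """Model a mailer that decodes with the declared charset
    and saves the result in its local charset."""
    return payload.decode(declared).encode(local)


# Case 1: a utf-8 file, correctly labelled, saved by a latin1 mailer.
utf8_file = "Grüße\n".encode("utf-8")
saved = mailer_save(utf8_file, "utf-8", "latin-1")
case1_changed = saved != utf8_file  # the bytes on disk are now latin1

# Case 2: a koi8 patch mislabelled as latin1, saved by a utf-8 mailer.
koi8_patch = "Привет\n".encode("koi8-r")
mangled = mailer_save(koi8_patch, "latin-1", "utf-8")
case2_changed = mangled != koi8_patch  # the bytes are neither koi8 nor meaningful
```

In both cases the saved bytes differ from what the sender had, so the
patch no longer applies to the file content it was made from.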

-- 
Mathematics is the supreme nostalgia of our time.