fastimport: handling UTF-8

Wed May 6 11:31:25 CDT 2009

On Wed, May 06, 2009 at 12:17:57PM -0400, Greg Ward wrote:
> On Tue, May 5, 2009 at 10:28 PM, Matt Mackall <mpm at selenic.com> wrote:
> > On Tue, May 05, 2009 at 10:09:48PM -0400, Greg Ward wrote:
> >> So I implemented a vile hack that I *thought* was correct: ensure the
> >> values coming from the fastimport objects are encoded UTF-8:
> >>
> >>     files = [f.encode("utf-8") for f in commit_handler.filelist()]
> >
> > You don't want to do that. Filenames are bytes on Unix, and in git,
> > and in Mercurial. If you're -getting- the filenames as Unicode
> > strings, you've already lost.
> 
> Ahhhh.  Right, of course.  Luckily, I control the upstream library
> that is spitting out Unicode strings from the fastimport stream, so I
> can fix that.  (I extracted the reusable bits of bzr-fastimport and am
> in the process of retrofitting hg-fastimport to use that library
> rather than a stale fork of parts of bzr-fastimport.  The real fun
> will be convincing the Bazaar guys to do a similar retrofit. ;-)
> 
> >>     text = cmd.message.encode("utf-8")
> >>     user = user.encode("utf-8")
> >
> > What this says is 'take a byte string in the local encoding and recode
> > it as UTF-8'. I suspect what you want is more along the lines of 'take
> > a string in encoding X and recode it as Y':
> 
> Actually, cmd.message is a Unicode string too.  I *think* that is
> defensible, since git-fast-import(1) says
> 
>     Commit messages are free-form and are not
>     interpreted by Git. Currently they must be encoded in UTF-8, as
>     fast-import does not permit other encodings to be specified.
> 
> That is, the fastimport parser library decodes UTF-8 to Unicode, and
> then hg-fastimport will have to reencode to UTF-8 to pass into the
> guts of Mercurial.  Inefficient, but I think it's correct.
> Considering that everything else in the fastimport stream is "just
> bytes", I should probably fix the library so it does no decoding and
> simply returns byte strings.  That should be more efficient.
> 
> Author and committer names are similar, although git-fast-import(1) is
> wishy-washy and just says
> 
>     Note that <name> is free-form and may contain
>     any sequence of bytes, except LT and LF. It is typically UTF-8
>     encoded.
> 
> So perhaps I should just treat them as bytes.

Mercurial really does want UTF-8 for commit description and author
name. We won't choke on other things on the output side but it won't
look pretty.
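
A minimal sketch of that policy in Python 3, assuming the parser hands
back byte strings throughout. The helper name to_utf8_bytes and the
latin-1 fallback are hypothetical, not part of hg-fastimport or
Mercurial: filenames pass through untouched, while the description and
author are kept as-is when they are already valid UTF-8 and recoded
otherwise.

```python
def to_utf8_bytes(raw, fallback="latin-1"):
    """Return raw unchanged if it is valid UTF-8, else recode it.

    'fallback' is an assumed source encoding for non-UTF-8 input;
    the fastimport stream itself does not say what it should be.
    """
    try:
        raw.decode("utf-8")  # validation only; the result is discarded
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback).encode("utf-8")


# Filenames are bytes: never decode or re-encode them.
filename = b"docs/caf\xe9.txt"

# Description and author should reach Mercurial as UTF-8.
text = to_utf8_bytes(b"Fix caf\xc3\xa9 menu")             # already UTF-8
user = to_utf8_bytes(b"Ren\xe9 <rene@example.invalid>")   # latin-1 input
```

Guessing a fallback encoding is a judgment call; rejecting non-UTF-8
input outright would be an equally reasonable choice.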

-- 
Mathematics is the supreme nostalgia of our time.