fastimport: handling UTF-8

Greg Ward greg-hg at gerg.ca
Wed May 6 11:17:57 CDT 2009


On Tue, May 5, 2009 at 10:28 PM, Matt Mackall <mpm at selenic.com> wrote:
> On Tue, May 05, 2009 at 10:09:48PM -0400, Greg Ward wrote:
>> So I implemented a vile hack that I *thought* was correct: ensure the
>> values coming from the fastimport objects are encoded UTF-8:
>>
>>     files = [f.encode("utf-8") for f in commit_handler.filelist()]
>
> You don't want to do that. Filenames are bytes on Unix, and in git,
> and in Mercurial. If you're -getting- the filenames as Unicode
> strings, you've already lost.

Ahhhh.  Right, of course.  Luckily, I control the upstream library
that is spitting out Unicode strings from the fastimport stream, so I
can fix that.  (I extracted the reusable bits of bzr-fastimport and am
in the process of retrofitting hg-fastimport to use that library
rather than a stale fork of parts of bzr-fastimport.  The real fun
will be convincing the Bazaar guys to do a similar retrofit. ;-)
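
To make the filename point concrete, here is a minimal sketch (modern
Python 3, not the hg-fastimport code) of why going through Unicode is
lossy for filenames: a name that is perfectly valid as bytes on Unix
need not be valid UTF-8.  The helper name `roundtrips_through_unicode`
and the sample filenames are hypothetical, chosen only for illustration.

```python
def roundtrips_through_unicode(name: bytes, encoding: str = "utf-8") -> bool:
    """Return True if decoding and re-encoding gives back the same bytes."""
    try:
        return name.decode(encoding).encode(encoding) == name
    except UnicodeDecodeError:
        # The bytes are not valid in this encoding, so the original
        # filename cannot be represented as a Unicode string at all.
        return False


# The same visible name, stored on disk in two different encodings.
utf8_name = "caf\u00e9.txt".encode("utf-8")      # b'caf\xc3\xa9.txt'
latin1_name = "caf\u00e9.txt".encode("latin-1")  # b'caf\xe9.txt'

print(roundtrips_through_unicode(utf8_name))    # True
print(roundtrips_through_unicode(latin1_name))  # False
```

Keeping the name as bytes from the stream all the way into the
repository sidesteps the question of which encoding it was in.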

>>     text = cmd.message.encode("utf-8")
>>     user = user.encode("utf-8")
>
> What this says is 'take a byte string in the local encoding and recode
> it as UTF-8'. I suspect what you want is more along the lines of 'take
> a string in encoding X and recode it as Y':

Actually, cmd.message is a Unicode string too.  I *think* that is
defensible, since git-fast-import(1) says

    Commit messages are free-form and are not
    interpreted by Git. Currently they must be encoded in UTF-8, as
    fast-import does not permit other encodings to be specified.

That is, the fastimport parser library decodes UTF-8 to Unicode, and
then hg-fastimport has to re-encode to UTF-8 to pass the message into
the guts of Mercurial.  Inefficient, but I think it's correct.

Still, considering that everything else in the fastimport stream is
"just bytes", I should probably fix the library so it does no decoding
and simply returns byte strings.  That would avoid the round trip.
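
As a rough sketch of what "no decoding" could look like, the function
below reads one `data <count>` block from a fast-import stream and
returns the payload as a byte string.  It assumes the exact-byte-count
form of the `data` command described in git-fast-import(1) and ignores
the delimited form; `read_data_block` is a hypothetical name, not the
API of the actual parser library, and the code is modern Python 3.

```python
import io


def read_data_block(stream) -> bytes:
    """Read a 'data <count>' command and return the raw payload bytes."""
    header = stream.readline()
    if not header.startswith(b"data "):
        raise ValueError("expected a 'data <count>' line, got %r" % header)
    # <count> is the payload length in bytes, not in characters.
    count = int(header[len(b"data "):].strip())
    payload = stream.read(count)
    if len(payload) != count:
        raise ValueError("stream ended before %d bytes were read" % count)
    return payload  # deliberately left undecoded


# A commit message with non-ASCII characters, as it would sit in the stream.
message = "Fix caf\u00e9 handling".encode("utf-8")
stream = io.BytesIO(b"data %d\n" % len(message) + message + b"\n")

raw = read_data_block(stream)
# For valid UTF-8, decoding and re-encoding is a no-op on the bytes,
# which is why skipping both steps gives the same result.
print(raw == message, raw.decode("utf-8").encode("utf-8") == raw)
```

Because the count is given in bytes, working on a binary stream also
avoids any mismatch between character counts and byte counts.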

Author and committer names are similar, although git-fast-import(1) is
wishy-washy and just says

    Note that <name> is free-form and may contain
    any sequence of bytes, except LT and LF. It is typically UTF-8
    encoded.

So perhaps I should just treat them as bytes.
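
A sketch of treating them as bytes, assuming the `author`/`committer`
line layout from git-fast-import(1) (optional name, then `<email>`,
then the date).  The helper `parse_ident` and the sample line are
hypothetical; this is modern Python 3, not the code in hg-fastimport.

```python
def parse_ident(line: bytes):
    """Split an author/committer line into (keyword, name, email, when).

    Everything stays a byte string; no encoding is assumed for the name.
    """
    keyword, _, rest = line.rstrip(b"\n").partition(b" ")
    # The name may not contain '<', so the first '<' starts the email.
    name, lt, tail = rest.partition(b"<")
    email, gt, when = tail.partition(b">")
    if not lt or not gt:
        raise ValueError("malformed ident line: %r" % line)
    return keyword, name.rstrip(b" "), email, when.lstrip(b" ")


# A name stored as Latin-1 rather than UTF-8 survives untouched.
line = b"author J\xf6rg <joerg@example.com> 1241626677 -0500\n"
print(parse_ident(line))
```

Whether and how to recode the name for display can then be decided
separately, at the point where the target encoding is actually known.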

Thanks!

Greg


