fastimport: handling UTF-8
Greg Ward
greg-hg at gerg.ca
Wed May 6 11:17:57 CDT 2009
On Tue, May 5, 2009 at 10:28 PM, Matt Mackall <mpm at selenic.com> wrote:
> On Tue, May 05, 2009 at 10:09:48PM -0400, Greg Ward wrote:
>> So I implemented a vile hack that I *thought* was correct: ensure the
>> values coming from the fastimport objects are encoded UTF-8:
>>
>> files = [f.encode("utf-8") for f in commit_handler.filelist()]
>
> You don't want to do that. Filenames are bytes on Unix, and in git,
> and in Mercurial. If you're -getting- the filenames as Unicode
> strings, you've already lost.
Ahhhh. Right, of course. Luckily, I control the upstream library
that is spitting out Unicode strings from the fastimport stream, so I
can fix that. (I extracted the reusable bits of bzr-fastimport and am
in the process of retrofitting hg-fastimport to use that library
rather than a stale fork of parts of bzr-fastimport. The real fun
will be convincing the Bazaar guys to do a similar retrofit. ;-)
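To make the filename point concrete, here is a minimal sketch in Python 3. It is not code from hg-fastimport; the filename and the helper via_unicode() are made up for illustration. A name that is legal on disk but not valid UTF-8 cannot survive a trip through a Unicode string, whereas leaving the bytes alone always works.

```python
def via_unicode(raw: bytes) -> bytes:
    """Hypothetical helper: decode as UTF-8, then encode again."""
    return raw.decode("utf-8").encode("utf-8")

# Latin-1 encoded name: a perfectly legal Unix filename, but not UTF-8.
latin1_name = b"r\xe9sum\xe9.txt"

try:
    via_unicode(latin1_name)
    survived = True
except UnicodeDecodeError:
    survived = False  # the decode step rejects the byte 0xE9

# Treating the name as opaque bytes needs no conversion at all.
files = [latin1_name]
```

The round trip only succeeds when the name happens to be valid UTF-8, which nothing on a Unix filesystem guarantees.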
>> text = cmd.message.encode("utf-8")
>> user = user.encode("utf-8")
>
> What this says is 'take a byte string in the local encoding and recode
> it as UTF-8'. I suspect what you want is more along the lines of 'take
> a string in encoding X and recode it as Y':
Actually, cmd.message is a Unicode string too. I *think* that is
defensible, since git-fast-import(1) says
    Commit messages are free-form and are not
    interpreted by Git. Currently they must be encoded in UTF-8, as
    fast-import does not permit other encodings to be specified.
That is, the fastimport parser library decodes UTF-8 to Unicode, and
then hg-fastimport has to re-encode to UTF-8 before passing the
message into the guts of Mercurial. Inefficient, but I think it's correct.
Considering that everything else in the fastimport stream is "just
bytes", I should probably fix the library so it does no decoding and
simply returns byte strings. That should be more efficient.
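A rough sketch of what "no decoding" could look like for the exact-byte-count form of the data command ("data <count>" followed by that many raw bytes), written in Python 3. The function name read_data() is an assumption for illustration, not the library's actual API, and the delimited "data <<delim" form is deliberately not handled.

```python
import io

def read_data(stream):
    """Read one 'data <count>' command and return its payload as bytes."""
    header = stream.readline()
    if not header.startswith(b"data "):
        raise ValueError("expected a data command, got %r" % header)
    # The count is a number of bytes, not of characters.
    count = int(header[len(b"data "):].strip())
    payload = stream.read(count)
    if len(payload) != count:
        raise ValueError("stream ended inside the data payload")
    return payload  # never decoded: the caller gets the bytes as sent

# A UTF-8 commit message goes in and comes out byte-for-byte identical.
message = "caf\u00e9 fix\n".encode("utf-8")
stream = io.BytesIO(b"data %d\n" % len(message) + message)
assert read_data(stream) == message
```

Since Mercurial wants UTF-8 bytes for the message anyway, a parser along these lines would let the payload be handed over unchanged.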
Author and committer names are similar, although git-fast-import(1) is
wishy-washy and just says
    Note that <name> is free-form and may contain
    any sequence of bytes, except LT and LF. It is typically UTF-8
    encoded.
So perhaps I should just treat them as bytes.
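As a sketch of the bytes-only approach for author and committer lines, assuming the documented layout of an optional name, then "<email>", then the timestamp (Python 3; parse_ident() is a hypothetical name, not a function in any existing library):

```python
def parse_ident(line: bytes):
    """Split an author/committer line into bytes fields, decoding nothing."""
    keyword, _, rest = line.rstrip(b"\n").partition(b" ")
    # <name> cannot contain LT, so the first "<" starts the email.
    name, _, rest = rest.partition(b"<")
    email, _, when = rest.partition(b">")
    return keyword, name.rstrip(b" "), email, when.strip()

# A Latin-1 encoded name is not valid UTF-8, yet it passes through intact.
line = b"author Andr\xe9 <andre@example.com> 1241626677 -0500\n"
keyword, name, email, when = parse_ident(line)
assert name == b"Andr\xe9"
```

Nothing here ever needs to know the encoding of the name, which fits the "any sequence of bytes" wording.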
Thanks!
Greg