fastimport: handling UTF-8

Matt Mackall mpm at selenic.com
Tue May 5 21:28:05 CDT 2009


On Tue, May 05, 2009 at 10:09:48PM -0400, Greg Ward wrote:
> I have been working on hg-fastimport lately, hopefully for the better.
> Just tried it out on a real-life fastimport dump (generated by
> cvs2git on our real-life CVS repository), and it bombed the first time
> it hit a non-ASCII character in the stream.
> 
> So I implemented a vile hack that I *thought* was correct: ensure the
> values coming from the fastimport objects are encoded as UTF-8:
> 
>     files = [f.encode("utf-8") for f in commit_handler.filelist()]

You don't want to do that. Filenames are bytes on Unix, and in git,
and in Mercurial. If you're -getting- the filenames as Unicode
strings, you've already lost.

>     text = cmd.message.encode("utf-8")
>     user = user.encode("utf-8")

If those are byte strings, what this says is 'decode with Python's
default codec (normally ASCII), then recode as UTF-8'. I suspect what
you want is more along the lines of 'take a string in encoding X and
recode it as Y':

s.decode(X).encode(Y)

You might need to do something more clever like:

try:
    s.decode('utf-8')  # is it already valid UTF-8?
except UnicodeDecodeError:
    # ok, assume latin-1 and recode
    s = s.decode('latin1').encode('utf-8')
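
Wrapped up as a function, that fallback might look roughly like the
sketch below. It assumes Python 3 bytes objects, and the name to_utf8
is made up for illustration; it is not part of hg-fastimport or
Mercurial.

```python
def to_utf8(raw, fallback="latin-1"):
    """Return raw as UTF-8 bytes, recoding from fallback if needed.

    Hypothetical helper: to_utf8 is an illustrative name only.
    """
    try:
        raw.decode("utf-8")  # already valid UTF-8? then leave it alone
        return raw
    except UnicodeDecodeError:
        # Not valid UTF-8: treat it as the fallback encoding and recode.
        return raw.decode(fallback).encode("utf-8")


utf8_name = to_utf8(b"Jean-Fran\xc3\xa7ois")    # UTF-8 input, unchanged
latin1_name = to_utf8(b"Jean-Fran\xe7ois")      # latin-1 input, recoded
```

Note that latin-1 can decode any byte sequence, so the fallback never
raises; it is only a guess, and data in some other legacy encoding
would be silently mis-recoded.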
 
> and then pass those strings to localrepo.rawcommit():
> 
>     node = self.repo.rawcommit(
>         files=files, text=text, user=user, date=date)
> 
> That works when I run my test script normally, since my
> LANG=en_CA.utf-8.  Luckily, I'm using Mercurial's own run-tests.py to
> run my hg-fastimport tests, and it of course unsets LANG... which
> means mercurial.encoding.fromlocal() blows up:
> 
> Traceback (most recent call last):
>   File "/home/greg/src/hg-crew/mercurial/dispatch.py", line 43, in _runcatch
>     return _dispatch(ui, args)
> [...]
>   File "/home/greg/src/hg-crew/mercurial/changelog.py", line 208, in add
>     user, desc = encoding.fromlocal(user), encoding.fromlocal(desc)
>   File "/home/greg/src/hg-crew/mercurial/encoding.py", line 63, in fromlocal
>     raise error.Abort("decoding near '%s': %s!" % (sub, inst))
> Abort: decoding near 'Jean-François <jf@': 'ascii' codec can't decode
> byte 0xc3 in position 9: ordinal not in range(128)!
> 
> Argh.  What is the right thing to do here?  About the only thing I am
> certain of is that fastimport dumps must be UTF-8, because it says so
> right here in the man page.  And, oh yeah, I found the wiki page
> ChangelogEncodingPlan that says Hg will use UTF-8 internally for
> changeset metadata: good.  But now it appears that fromlocal() is
> being influenced by LANG, rather than by the spec for fastimport.
> Should I find a way to influence fromlocal() for the special case of
> reading a fastimport dump?

Yes. Grep for encoding.encoding (or util._encoding) in the convert
extension.
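
One way to do that override is to set the module-level variable for
the duration of the import and restore it afterwards. The sketch
below assumes that encoding.encoding is what fromlocal() consults; the
helper name forced_encoding is hypothetical, and a stand-in object
replaces mercurial.encoding so the example runs without Mercurial
installed.

```python
import contextlib
import types


@contextlib.contextmanager
def forced_encoding(module, name):
    """Temporarily set module.encoding to name, then restore it.

    Hypothetical helper; 'module' is expected to expose a plain
    'encoding' attribute, as mercurial.encoding is assumed to.
    """
    saved = module.encoding
    module.encoding = name
    try:
        yield
    finally:
        # Restore the old value even if the import fails.
        module.encoding = saved


# Stand-in for mercurial.encoding, starting as if LANG were unset.
fake_encoding = types.SimpleNamespace(encoding="ascii")

with forced_encoding(fake_encoding, "utf-8"):
    inside = fake_encoding.encoding   # value seen while importing
after = fake_encoding.encoding        # value once the block exits
```

In real use the first argument would be the mercurial.encoding module,
and the body of the with block would be the code that reads the
fastimport stream and commits.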

-- 
Mathematics is the supreme nostalgia of our time.

