fastimport: handling UTF-8
Matt Mackall
mpm at selenic.com
Tue May 5 21:28:05 CDT 2009
On Tue, May 05, 2009 at 10:09:48PM -0400, Greg Ward wrote:
> I have been working on hg-fastimport lately, hopefully for the better.
> Just tried it out on a real-life fastimport dump (generated by
> cvs2git on our real-life CVS repository), and it bombed the first time
> it hit a non-ASCII character in the stream.
>
> So I implemented a vile hack that I *thought* was correct: ensure the
> values coming from the fastimport objects are encoded UTF-8:
>
>   files = [f.encode("utf-8") for f in commit_handler.filelist()]
You don't want to do that. Filenames are bytes on Unix, and in git,
and in Mercurial. If you're -getting- the filenames as Unicode
strings, you've already lost.
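
To make that concrete, here is a rough sketch in Python 3 (where bytes and str are separate types). The filename and the keep_filename helper are made up for illustration and are not part of hg-fastimport. It only shows why a name has to pass through as opaque bytes: guessing a codec and recoding yields a different name.

```python
# The same visible name has different byte representations per encoding.
name_latin1 = "caf\u00e9.txt".encode("latin-1")  # b'caf\xe9.txt'
name_utf8 = "caf\u00e9.txt".encode("utf-8")      # b'caf\xc3\xa9.txt'


def keep_filename(raw: bytes) -> bytes:
    """Hypothetical helper: hand a filename through as opaque bytes."""
    return raw


# A round trip through text with a wrongly guessed codec changes the bytes,
# so the file would be recorded under a different name.
mangled = name_utf8.decode("latin-1").encode("utf-8")

print(keep_filename(name_latin1) == name_latin1)  # bytes are untouched
print(mangled == name_utf8)                       # recoding altered the name
```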
>   text = cmd.message.encode("utf-8")
>   user = user.encode("utf-8")
What this says is 'take a byte string in the local encoding and recode
it as UTF-8'. I suspect what you want is more along the lines of 'take
a string in encoding X and recode it as Y':

    s.decode(X).encode(Y)

You might need to do something more clever like:

    try:
        s.decode('utf-8')  # is it already valid UTF-8?
    except UnicodeDecodeError:
        # no: assume latin-1 and recode to UTF-8
        s = s.decode('latin1').encode('utf-8')

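
The same idea as a self-contained Python 3 sketch, for anyone who wants to try it. The function names recode and to_utf8 are invented here, and latin-1 as the fallback is only an assumption about what a CVS-era repository is likely to contain.

```python
def recode(raw: bytes, src: str, dst: str) -> bytes:
    """Interpret raw as text in encoding src and re-encode it as dst."""
    return raw.decode(src).encode(dst)


def to_utf8(raw: bytes, fallback: str = "latin-1") -> bytes:
    """Return raw unchanged if it is valid UTF-8, else recode from fallback."""
    try:
        raw.decode("utf-8")  # validity check only; result is discarded
        return raw
    except UnicodeDecodeError:
        return recode(raw, fallback, "utf-8")


already_utf8 = "Jean-Fran\u00e7ois".encode("utf-8")
legacy = "Jean-Fran\u00e7ois".encode("latin-1")

print(to_utf8(already_utf8) == already_utf8)
print(to_utf8(legacy) == already_utf8)
```

Note that latin-1 maps every byte to some character, so the fallback never raises. If the input was really in some other encoding, you get wrong characters rather than an error.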
> and then pass those strings to localrepo.rawcommit():
>
>   node = self.repo.rawcommit(
>       files=files, text=text, user=user, date=date)
>
> That works when I run my test script normally, since my
> LANG=en_CA.utf-8. Luckily, I'm using Mercurial's own run-tests.py to
> run my hg-fastimport tests, and it of course unsets LANG... which
> means mercurial.encoding.fromlocal() blows up:
>
> Traceback (most recent call last):
>   File "/home/greg/src/hg-crew/mercurial/dispatch.py", line 43, in _runcatch
>     return _dispatch(ui, args)
>   [...]
>   File "/home/greg/src/hg-crew/mercurial/changelog.py", line 208, in add
>     user, desc = encoding.fromlocal(user), encoding.fromlocal(desc)
>   File "/home/greg/src/hg-crew/mercurial/encoding.py", line 63, in fromlocal
>     raise error.Abort("decoding near '%s': %s!" % (sub, inst))
> Abort: decoding near 'Jean-François <jf@': 'ascii' codec can't decode
> byte 0xc3 in position 9: ordinal not in range(128)!
>
> Argh. What is the right thing to do here? About the only thing I am
> certain of is that fastimport dumps must be UTF-8, because it says so
> right here in the man page. And, oh yeah, I found the wiki page
> ChangelogEncodingPlan that says Hg will use UTF-8 internally for
> changeset metadata: good. But now it appears that fromlocal() is
> being influenced by LANG, rather than by the spec for fastimport.
> Should I find a way to influence fromlocal() for the special case of
> reading a fastimport dump?
Yes. Grep for encoding.encoding (or util._encoding) in convert.
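
Roughly, the pattern is: save the current value, force UTF-8 while the dump is processed, and restore it afterwards. A minimal Python 3 sketch follows, assuming encoding.encoding is a plain module-level string. The forced_encoding helper is hypothetical, and a stand-in object is used in place of the real mercurial.encoding module so the snippet runs without Mercurial installed.

```python
import contextlib
import types


@contextlib.contextmanager
def forced_encoding(module, value="UTF-8", attr="encoding"):
    """Hypothetical helper: temporarily set module.<attr> to value."""
    saved = getattr(module, attr)
    setattr(module, attr, value)
    try:
        yield
    finally:
        # Always restore, even if the import aborts part way through.
        setattr(module, attr, saved)


# Stand-in for mercurial.encoding (assumption: the real module exposes the
# active local encoding as a module-level string named "encoding").
fake_encoding = types.SimpleNamespace(encoding="ascii")

with forced_encoding(fake_encoding):
    inside = fake_encoding.encoding  # value seen while reading the dump
after = fake_encoding.encoding       # original value is back

print(inside, after)
```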
--
Mathematics is the supreme nostalgia of our time.