UTF-16 in Mercurial

Sat Mar 6 13:32:32 CST 2010

> -----Original Message-----
> From: mercurial-devel-bounces at selenic.com [mailto:mercurial-devel-
> bounces at selenic.com] On Behalf Of Benoît Allard
> Sent: Tuesday, March 02, 2010 9:04 AM
> To: Mercurial Devel
> Subject: UTF-16 in Mercurial
> 
> Hi there,
> 
> I've been experimenting on Windows with some UTF-16 (so called UNICODE
> under Windows) config files (registry export to be complete) and the
> attached -very- little extension that tries to make UTF-16 (or UTF-32)
> seen as text (not binary).

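While UTF-16 can be considered text in that you can meaningfully make and
view diffs, in practice most programs, especially those written in C/C++,
define "text" as "anything not containing a null byte". UTF-16 encodes
every ASCII character with a zero byte in it, so those programs treat
such files as binary.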
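
To illustrate the point, here is a minimal sketch in Python of that kind
of check. The helper name looks_binary is made up for this example and is
not meant to be Mercurial's actual function; it only assumes the common
"contains a null byte" heuristic described above.

```python
def looks_binary(data: bytes) -> bool:
    """Assumed heuristic: data is 'binary' if it contains a NUL byte."""
    return b"\0" in data


sample = "key=value\n"

# The same text in two encodings.
as_utf8 = sample.encode("utf-8")
as_utf16 = sample.encode("utf-16")  # BOM followed by two bytes per character

print(looks_binary(as_utf8))
print(looks_binary(as_utf16))
```

The UTF-8 bytes pass as text, while the UTF-16 bytes are flagged as
binary because each ASCII character carries a zero byte.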

> 
> It has the drawback of generating non consistent patches: the body of
> the patch being in the encoding of the file, and the metadata (@@, +++,
> ...) being in ANSI.

I would suggest converting the UTF-16 files to UTF-8 before giving them to
patch, merge, or diff. Since the annotations are plain ASCII, and ASCII is
a valid subset of UTF-8, they don't have to be modified in any way.

Here is my idea:

Diff(s1, s2):
	If s1 is UTF-16 or UTF-32:
		s1 = convert-to-utf8(s1)
	If s2 is UTF-16 or UTF-32:
		s2 = convert-to-utf8(s2)
	return calculate-diff(s1, s2)
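
The Diff step could be sketched in Python roughly as follows. This is
only an illustration under some assumptions: the encoding is detected
from a byte order mark (so BOM-less UTF-16 would not be caught), the
names to_utf8 and diff are invented for the example, and the standard
library's difflib stands in for Mercurial's own diff code.

```python
import codecs
import difflib

# UTF-32 must be checked before UTF-16, because the UTF-32-LE BOM
# starts with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def to_utf8(data):
    """Return data re-encoded as UTF-8 if it starts with a UTF-16/32 BOM."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # The "utf-16" and "utf-32" codecs consume the BOM and use it
            # to choose the byte order.
            return data.decode(encoding).encode("utf-8")
    return data


def diff(s1, s2):
    """Unified diff of two byte strings, computed on their UTF-8 form."""
    a = to_utf8(s1).splitlines(keepends=True)
    b = to_utf8(s2).splitlines(keepends=True)
    return list(
        difflib.diff_bytes(difflib.unified_diff, a, b, b"a/file", b"b/file")
    )


old = "a=1\nb=2\n".encode("utf-16")
new = "a=1\nb=3\n".encode("utf-16")
for line in diff(old, new):
    print(line)
```

Both the body and the annotations of the resulting patch are then free
of null bytes, so the patch is consistent from top to bottom.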

Patch(file, diff):
	s = read(file)
	If file is UTF-16 or UTF-32:
		s = convert-to-utf8(s)
	s' = calculate-patch(s, diff)
	If file is UTF-16 or UTF-32:
		s' = convert-from-utf8(s')
	write(file, s')
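
A rough Python sketch of the Patch round trip, working on the file
contents as bytes. As before this is an assumption-laden illustration:
detection relies on a byte order mark, the names sniff and patch_bytes
are invented, and apply_patch is a hypothetical callable standing in for
whatever actually applies the diff to UTF-8 data. The original BOM and
byte order are kept so the file comes back in the encoding it had.

```python
import codecs

# UTF-32 first: its little-endian BOM begins with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff(data):
    """Return (bom, encoding) for UTF-16/32 data, or (b"", None)."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return bom, encoding
    return b"", None


def patch_bytes(data, apply_patch):
    """Apply apply_patch (hypothetical, UTF-8 in and out) to data."""
    bom, encoding = sniff(data)
    if encoding is None:
        # Not UTF-16/32: patch the bytes as they are.
        return apply_patch(data)
    # Strip the BOM, convert to UTF-8, patch, then convert back.
    as_utf8 = data[len(bom):].decode(encoding).encode("utf-8")
    patched = apply_patch(as_utf8)
    return bom + patched.decode("utf-8").encode(encoding)


# Stand-in for a real patch: change one value.
def fake_patch(s):
    return s.replace(b"b=2", b"b=3")


original = codecs.BOM_UTF16_LE + "a=1\nb=2\n".encode("utf-16-le")
print(patch_bytes(original, fake_patch))
```

Reading the file before the call and writing the result afterwards gives
the Patch(file, diff) behaviour outlined above.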

~Anton