[thg-dev] file / diff: xml file false positive for binary

Matt Mackall mpm at selenic.com
Wed May 11 10:17:23 CDT 2011


On Wed, 2011-05-11 at 14:27 +0200, Dominik Psenner wrote:
> > -----Original Message-----
> > From: Sune Foldager [mailto:cryo at cyanite.org]
> > Sent: Wednesday, May 11, 2011 2:06 PM
> > To: Dominik Psenner
> > Cc: thg-dev at googlegroups.com; 'Mercurial Developers'
> > Subject: Re: [thg-dev] file / diff: xml file false positive for binary
> > 
> > On 2011-05-11 13:38, Dominik Psenner wrote:
> > >> -----Original Message-----
> > >> From: thg-dev at googlegroups.com [mailto:thg-dev at googlegroups.com] On Behalf Of Steve Borho
> > >> Sent: Tuesday, May 10, 2011 4:50 PM
> > >> To: thg-dev at googlegroups.com
> > >> Subject: Re: [thg-dev] file / diff: xml file false positive for binary
> > >>
> > >> On Tue, May 10, 2011 at 2:48 AM, Dominik Psenner <dpsenner at gmail.com>
> > >> wrote:
> > >> > Hi,
> > >> >
> > >> > I stumbled upon a possible regression for the workbench revision
> > >> > details tab. Some XML files are recognized as binary and therefore
> > >> > the diff is not shown in workbench, but instead the message "File or
> > >> > diffs not displayed: File is binary.". OTOH kdiff detects it perfectly
> > >> > fine and shows the diff. The file encoding is UTF-16LE and the line
> > >> > end style is DOS.
> > >>
> > >> This is not a regression.  Mercurial has always considered UTF16 files
> > >> as binary.
> > >
> > >Good to know. :-) I'm taking this follow-up question to mercurial-devel.
> > >
> > >Is there an extension around that adds the functionality of handling
> > >files with unicode encoding?
> > 
> > Not as far as I know, but might I suggest converting those files to XML's
> > native format, which is UTF-8, instead? :).
> 
> And what if one attempts to translate our application to klingon? :-)

Klingon is not in Unicode:

http://higbee.cots.net/Holtej/klingon/faq.htm#2.17
http://www.evertype.com/standards/csur/index.html

But UTF-8 can encode any code point encoded by UTF-16 just fine,
including the F8D0-F8FF empire that Klingon has staked out in the
Unicode Private Use Area. Code points in this range will take 3 bytes
each.
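
The byte counts are easy to check. A minimal Python sketch follows; the
sample strings and the looks_binary helper are made up for illustration,
and the "contains a NUL byte" test is only an assumption about how a
simple binary check might behave, not something stated in this thread:

```python
# Hypothetical sample: first and last code points of the Klingon PUA range.
klingon = "\uf8d0\uf8ff"

utf8 = klingon.encode("utf-8")
utf16 = klingon.encode("utf-16-le")

print(len(utf8))   # total bytes in UTF-8 (3 per code point)
print(len(utf16))  # total bytes in UTF-16 (2 per code point)


def looks_binary(data: bytes) -> bool:
    """Assumed heuristic: data containing a NUL byte is treated as binary."""
    return b"\0" in data


# Hypothetical XML snippet with DOS line endings.
xml = '<?xml version="1.0"?>\r\n<root/>\r\n'

# In UTF-16 every ASCII character carries a 0x00 byte, so the check fires.
print(looks_binary(xml.encode("utf-16-le")))
# The same text in UTF-8 contains no NUL bytes.
print(looks_binary(xml.encode("utf-8")))
```

Under that assumption, the same XML file is flagged as binary in UTF-16
and passes as text once converted to UTF-8.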

In fact, as originally designed in 1992, UTF-8 could encode code points
up to 31 bits wide (0 - 7FFFFFFF), but was later restricted to match the
much smaller range (0 - 10FFFF) supported by UTF-16.
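
To make the difference in range concrete, here is a rough Python sketch.
The function name utf8_length_original is hypothetical; it models the
original scheme, which used sequences of up to six bytes and topped out
at 7FFFFFFF, and then shows that Python itself enforces the modern limit:

```python
def utf8_length_original(cp: int) -> int:
    """Bytes needed for cp under the original 1-to-6-byte UTF-8 scheme."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    # Largest code point representable with 1, 2, ... 6 bytes.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for nbytes, limit in enumerate(limits, start=1):
        if cp <= limit:
            return nbytes
    raise ValueError("code point does not fit in 31 bits")


print(utf8_length_original(0xF8D0))      # a Klingon PUA code point
print(utf8_length_original(0x10FFFF))    # top of the range UTF-16 can reach
print(utf8_length_original(0x7FFFFFFF))  # top of the original design

# Python only accepts code points up to 10FFFF.
try:
    chr(0x110000)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Anything above 10FFFF needed five or six bytes in the old scheme; those
sequence lengths are no longer valid UTF-8.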

-- 
Mathematics is the supreme nostalgia of our time.



