[PATCH] Don't consider UTF-16 and UTF-32 files as binary (issue1975) (version 2)

Benoît Allard benoit at aeteurope.nl
Fri Feb 12 03:52:23 CST 2010


Mads Kiilerich wrote:
> On 02/08/2010 04:03 PM, Dirkjan Ochtman wrote:
>> On Mon, Feb 8, 2010 at 15:43, Ollivier Robert<roberto at keltia.net>  wrote:
>>> Hmmm, technically, you don't need a BOM in UTF-8 so checking for it 
>>> seems wrong to me.
>>
>> I disagree. We want UTF-8 to not be treated as binary, so we want to
>> check for any BOMs people might include, even if it's optional for
>> UTF-8.
> 
> Sure. But "valid" UTF-8 does not contain any zero bytes and will thus 
> never be considered binary anyway.
> 
> So the question is just how invalid UTF-8 files should be handled.

My reasoning is that detecting UTF-8 from its BOM is quicker than 
scanning for '\0' anyway. So even if both tests would give the same 
result, let's optimize for the cases where we can.
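
As a rough illustration of that ordering, here is a minimal sketch. The 
helper name looks_binary and the list of BOMs are assumptions made for 
illustration; this is not the code from the patch.

```python
import codecs

# Hypothetical helper, not Mercurial's actual implementation.
# Byte order marks that identify a file as Unicode text.
_BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
)


def looks_binary(data):
    """Return True if the byte string should be treated as binary."""
    # Cheap check first: a known BOM means text, whatever follows.
    # This is what keeps UTF-16 and UTF-32 (full of NUL bytes) out of
    # the binary category.
    if data.startswith(_BOMS):
        return False
    # Fallback: scan for a NUL byte, which valid UTF-8 text without a
    # BOM should not contain.
    return b"\0" in data
```

The prefix test only looks at the first few bytes, while the NUL scan 
may have to read all of the data before it can answer.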

Are there any remaining blockers for this patch?

Regards,
Benoit

