[PATCH] Don't consider UTF-16 and UTF-32 files as binary (issue1975) (version 2)

Fri Feb 12 03:52:23 CST 2010

Mads Kiilerich wrote:
> On 02/08/2010 04:03 PM, Dirkjan Ochtman wrote:
>> On Mon, Feb 8, 2010 at 15:43, Ollivier Robert<roberto at keltia.net>  wrote:
>>> Hmmm, technically, you don't need a BOM in UTF-8 so checking for it 
>>> seems wrong to me.
>>
>> I disagree. We want UTF-8 to not be treated as binary, so we want to
>> check for any BOMs people might include, even if it's optional for
>> UTF-8.
> 
> Sure. But "valid" UTF-8 does not contain any zero bytes and will thus 
> never be considered binary anyway.
> 
> So the question is just how invalid UTF-8 files should be handled.

My reasoning is that detecting UTF-8 via the BOM is quicker than
scanning for '\0' anyway. So even if both tests would give the same
result, let's optimize for the cases where we can.
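
To illustrate the ordering argued for above, here is a minimal sketch,
not the code from the patch itself. The names hasbom and isbinary are
hypothetical and chosen only for this example; the BOM constants come
from Python's standard codecs module.

```python
import codecs

# Hypothetical helpers for illustration only; not taken from the patch.
_BOMS = (
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
)

def hasbom(data):
    """Return True if data starts with a UTF-8, UTF-16 or UTF-32 BOM."""
    return data.startswith(_BOMS)

def isbinary(data):
    """Treat data as binary only if it has no BOM and contains a NUL byte."""
    # Cheap check first: only the first few bytes are compared.
    if hasbom(data):
        return False
    # Fallback: scan the data for a NUL byte.
    return b'\0' in data
```

With this ordering, a file that carries a BOM is classified as text
without scanning its contents, while a file without one still goes
through the '\0' search. Note that in this sketch UTF-16 or UTF-32
data without a BOM would still be treated as binary.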

Any remaining blockers for this patch?

Regards,
Benoit