Don't treat UTF-16, UTF-32 and SCSU files as binary (issue1975)

Thu Feb 4 02:03:55 CST 2010

Hi Mads,

Thanks for looking into this !

Mads Kiilerich wrote:
> Ben wrote, On 02/03/2010 08:55 PM:
>> # HG changeset patch
>> # User Benoit Allard<benoit at aeteurope.nl>
>> # Date 1265226631 -3600
>> # Branch BOM

That should disapear.

>> # Node ID 71d7c5ba3cbdaba214e56f50fd289f0861cb0bdc
>> # Parent  36417ba4d772be768a1014fe45d0584e82e1701e
>> Don't treat UTF-16, UTF-32 and SCSU files as binary (issue1975)
>>    
> 
> I doubt this really solves the problem. As djc said in msg11414:
>> We'd have to audit util.binary() users, I guess, but it might make 
>> sense to
>> make an exception for things with a BOM in some of the cases.
> 
> You will have to convince us that the changed result from util.binary 
> makes sense in all cases. What happens when the encoded output gets 
> mixed up with ascii? Won't for example diff which splits on \n bytes do 
> the wrong thing?

I made some investigations that way:
* GNU diff behaves the same as Mecurial without this patch
* GNU patch accepts the patch generated by Mercurial with this patch

> 
> I'm afraid a proper solution to that issue requires more invasive 
> changes - including massive use of unicode and error prone character 
> conversions. That will probably be an uphill battle.
> 

With the remarks above I'm not sure that's necessary, please mention 
broken usecase I should look into.

>> diff --git a/mercurial/util.py b/mercurial/util.py
>> --- a/mercurial/util.py
>> +++ b/mercurial/util.py
>> @@ -210,9 +210,16 @@
>>               return fn(s, cmd[len(name):].lstrip())
>>       return pipefilter(s, cmd)
>>
>> +BOMs = ["\xfe\xff", "\xff\xfe", "\0\0\xfe\xff", "\xff\xfe\0\0", 
>> "\x0e\xfe\xff"]
>>    
> 
> Is there a reason codecs.BOM* isn't used? That would - for free - 
> document what these BOMs are.

Sure, will do ! I was not aware of those one.

> 
> I think the constant name should be lowercase.

Even for those acronyms ?

> 
>>   def binary(s):
>>       """return true if a string is binary data"""
>> -    return bool(s and '\0' in s)
>> +    if s:
>> +        for BOM in BOMs:
>>    
> 
> Variables should be lowercase.
> 
>> +            if s[:len(BOM)] == BOM:
>>    
> 
> Use s.startswith(BOM) instead - that is simpler and faster.

Thanks for the hint.

> 
>> +                return False
>> +        return '\0' in s
>> +    return False
>>
>>   def increasingchunks(source, min=1024, max=65536):
>>       '''return no less than min bytes per chunk while data remains,
>>    
> 
> /Mads
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel at selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
>