Don't treat UTF-16, UTF-32 and SCSU files as binary (issue1975)
Benoît Allard
benoit at aeteurope.nl
Thu Feb 4 02:03:55 CST 2010
Hi Mads,
Thanks for looking into this !
Mads Kiilerich wrote:
> Ben wrote, On 02/03/2010 08:55 PM:
>> # HG changeset patch
>> # User Benoit Allard<benoit at aeteurope.nl>
>> # Date 1265226631 -3600
>> # Branch BOM
That should disapear.
>> # Node ID 71d7c5ba3cbdaba214e56f50fd289f0861cb0bdc
>> # Parent 36417ba4d772be768a1014fe45d0584e82e1701e
>> Don't treat UTF-16, UTF-32 and SCSU files as binary (issue1975)
>>
>
> I doubt this really solves the problem. As djc said in msg11414:
>> We'd have to audit util.binary() users, I guess, but it might make
>> sense to
>> make an exception for things with a BOM in some of the cases.
>
> You will have to convince us that the changed result from util.binary
> makes sense in all cases. What happens when the encoded output gets
> mixed up with ascii? Won't for example diff which splits on \n bytes do
> the wrong thing?
I made some investigations that way:
* GNU diff behaves the same as Mecurial without this patch
* GNU patch accepts the patch generated by Mercurial with this patch
>
> I'm afraid a proper solution to that issue requires more invasive
> changes - including massive use of unicode and error prone character
> conversions. That will probably be an uphill battle.
>
With the remarks above I'm not sure that's necessary, please mention
broken usecase I should look into.
>> diff --git a/mercurial/util.py b/mercurial/util.py
>> --- a/mercurial/util.py
>> +++ b/mercurial/util.py
>> @@ -210,9 +210,16 @@
>> return fn(s, cmd[len(name):].lstrip())
>> return pipefilter(s, cmd)
>>
>> +BOMs = ["\xfe\xff", "\xff\xfe", "\0\0\xfe\xff", "\xff\xfe\0\0",
>> "\x0e\xfe\xff"]
>>
>
> Is there a reason codecs.BOM* isn't used? That would - for free -
> document what these BOMs are.
Sure, will do ! I was not aware of those one.
>
> I think the constant name should be lowercase.
Even for those acronyms ?
>
>> def binary(s):
>> """return true if a string is binary data"""
>> - return bool(s and '\0' in s)
>> + if s:
>> + for BOM in BOMs:
>>
>
> Variables should be lowercase.
>
>> + if s[:len(BOM)] == BOM:
>>
>
> Use s.startswith(BOM) instead - that is simpler and faster.
Thanks for the hint.
>
>> + return False
>> + return '\0' in s
>> + return False
>>
>> def increasingchunks(source, min=1024, max=65536):
>> '''return no less than min bytes per chunk while data remains,
>>
>
> /Mads
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel at selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
>
More information about the Mercurial-devel
mailing list