Don't treat UTF-16, UTF-32 and SCSU files as binary (issue1975)

Mads Kiilerich mads at kiilerich.com
Wed Feb 3 17:31:36 CST 2010


Ben wrote, On 02/03/2010 08:55 PM:
> # HG changeset patch
> # User Benoit Allard<benoit at aeteurope.nl>
> # Date 1265226631 -3600
> # Branch BOM
> # Node ID 71d7c5ba3cbdaba214e56f50fd289f0861cb0bdc
> # Parent  36417ba4d772be768a1014fe45d0584e82e1701e
> Don't treat UTF-16, UTF-32 and SCSU files as binary (issue1975)
>    

I doubt this really solves the problem. As djc said in msg11414:
> We'd have to audit util.binary() users, I guess, but it might make sense to
> make an exception for things with a BOM in some of the cases.

You will have to convince us that the changed result from util.binary 
makes sense in all cases. What happens when the encoded output gets 
mixed up with ascii? Won't for example diff which splits on \n bytes do 
the wrong thing?

I'm afraid a proper solution to that issue requires more invasive 
changes - including massive use of unicode and error prone character 
conversions. That will probably be an uphill battle.

> diff --git a/mercurial/util.py b/mercurial/util.py
> --- a/mercurial/util.py
> +++ b/mercurial/util.py
> @@ -210,9 +210,16 @@
>               return fn(s, cmd[len(name):].lstrip())
>       return pipefilter(s, cmd)
>
> +BOMs = ["\xfe\xff", "\xff\xfe", "\0\0\xfe\xff", "\xff\xfe\0\0", "\x0e\xfe\xff"]
>    

Is there a reason codecs.BOM* isn't used? That would - for free - 
document what these BOMs are.

I think the constant name should be lowercase.

>   def binary(s):
>       """return true if a string is binary data"""
> -    return bool(s and '\0' in s)
> +    if s:
> +        for BOM in BOMs:
>    

Variables should be lowercase.

> +            if s[:len(BOM)] == BOM:
>    

Use s.startswith(BOM) instead - that is simpler and faster.

> +                return False
> +        return '\0' in s
> +    return False
>
>   def increasingchunks(source, min=1024, max=65536):
>       '''return no less than min bytes per chunk while data remains,
>    

/Mads


More information about the Mercurial-devel mailing list