[issue2162] BOM (byte order mark) support for Mercurial.ini

Tue Apr 27 10:04:59 CDT 2010

Alexander Belchenko wrote, On 04/27/2010 04:15 PM:
> Yuya Nishihara wrote:
>> New submission from Yuya Nishihara <yuya at tcha.org>:
>>
>> Some text editors, like Notepad.exe, insert BOM (byte order mark) 
>> silently if you save Mercurial.ini as UTF-8.
>>
>> IMHO, they shouldn't insert BOM for UTF-8, but it's really hard to 
>> debug because BOM isn't visible. So it seems reasonable to 
>> skip/recognize BOM before reading Mercurial.ini.
>
> I was under the impression that UTF-8 might have an optional BOM marker, 
> and Python even has this constant defined:
>
> In [1]: import codecs
>
> In [2]: codecs.BOM
> codecs.BOM          codecs.BOM_BE       codecs.BOM_UTF32
> codecs.BOM32_BE     codecs.BOM_LE       codecs.BOM_UTF32_BE
> codecs.BOM32_LE     codecs.BOM_UTF16    codecs.BOM_UTF32_LE
> codecs.BOM64_BE     codecs.BOM_UTF16_BE codecs.BOM_UTF8
> codecs.BOM64_LE     codecs.BOM_UTF16_LE
>
> In [2]: codecs.BOM_UTF8
> Out[2]: '\xef\xbb\xbf'
>
> So why do you say it "shouldn't"?

Because it is optional, has no benefit, and is "never" used?

Mercurial is not particularly encoding-aware but very 
encoding-transparent. Encoding Mercurial.ini in any ASCII superset is 
fine, and BOMs could probably be removed or ignored when parsed, but in 
that case the BOM should probably be prepended to all value strings too 
... and that would cause other strange issues.

FWIW I'm -0 on special handling of BOM - but a strip on the config file 
content before parsing should do no harm.
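
A minimal sketch of such a strip, assuming the config file content is 
available as raw bytes before parsing. The helper name strip_utf8_bom is 
hypothetical and not part of Mercurial; only codecs.BOM_UTF8 comes from 
the Python standard library.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present.

    Hypothetical helper for illustration; not Mercurial's actual code.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Example: content as Notepad.exe might save it (BOM + ordinary ini text).
raw = codecs.BOM_UTF8 + b"[ui]\nusername = Yuya\n"
cleaned = strip_utf8_bom(raw)
```

Only a BOM at the very start of the content is removed; the rest of the 
bytes pass through untouched, which keeps the encoding-transparent 
behaviour described above.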

Perhaps we could warn if any non-7-bit characters are found before the 
first # or =?
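
Roughly, such a check could look like the sketch below. It assumes the 
file content is raw bytes; the function name has_leading_non_ascii is 
hypothetical, and how the warning would actually be reported is left 
open.

```python
def has_leading_non_ascii(data: bytes) -> bool:
    """Return True if a byte >= 0x80 occurs before the first '#' or '='.

    Hypothetical helper for illustration; not Mercurial's actual code.
    """
    for byte in data:
        if byte in b"#=":
            # Reached a comment or an assignment: stop looking.
            return False
        if byte >= 0x80:
            # Non-7-bit byte (e.g. part of a BOM) before any '#' or '='.
            return True
    return False
```

A BOM-prefixed file would trigger the check, while a file whose only 
non-ASCII bytes sit in a value after the first = would not.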

/Mads