[issue2162] BOM (byte order mark) support for Mercurial.ini

Wed Apr 28 05:45:58 CDT 2010

On 27 Apr 2010, at 16:15 , Alexander Belchenko wrote:
> 
> I was under the impression that UTF-8 might have an optional BOM, and Python even has this constant defined:
> 
> In [1]: import codecs
> 
> In [2]: codecs.BOM
> codecs.BOM          codecs.BOM_BE       codecs.BOM_UTF32
> codecs.BOM32_BE     codecs.BOM_LE       codecs.BOM_UTF32_BE
> codecs.BOM32_LE     codecs.BOM_UTF16    codecs.BOM_UTF32_LE
> codecs.BOM64_BE     codecs.BOM_UTF16_BE codecs.BOM_UTF8
> codecs.BOM64_LE     codecs.BOM_UTF16_LE
> 
> In [2]: codecs.BOM_UTF8
> Out[2]: '\xef\xbb\xbf'
> 
> So, why do you say it "shouldn't"?

Well, first, since UTF-8 has no "customizable" byte order, "byte order mark" is a misnomer to start with. Second, while it's allowed, a byte order mark in a UTF-8 document is *not recommended* by the official Unicode standard:

> Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.
(Unicode Standard 5.0, chapter 2)

and it's generally a pain in the ass.
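
For what it's worth, tolerating the signature when reading a file is cheap. The following is only a sketch in present-day Python 3 (not how Mercurial actually parses its config); the helper name strip_utf8_bom and the sample ini content are made up for illustration:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature, if one is present."""
    # Hypothetical helper, not part of Mercurial or the standard library.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Made-up ini content, as an editor that writes a UTF-8 signature might save it.
raw = codecs.BOM_UTF8 + b"[ui]\nusername = Alice\n"

# Byte-level approach: drop the three signature bytes before parsing.
cleaned = strip_utf8_bom(raw)

# Codec-level approach: "utf-8-sig" skips the signature while decoding
# and behaves like plain "utf-8" when no signature is present.
decoded = raw.decode("utf-8-sig")

# Plain "utf-8" keeps the signature as a leading U+FEFF character,
# which is what confuses parsers expecting "[" as the first character.
undecorated = raw.decode("utf-8")
```

Either approach leaves BOM-less files untouched, so it stays compatible with the files everybody else writes.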

On 28 Apr 2010, at 11:02 , Sune Foldager wrote:
> On 28-04-2010 09:29, Alexander Belchenko wrote:
>>>>> So, why do you say it "shouldn't"?
>>>> Because it is optional, has no benefit, and is "never" used?
>>> I heard it can be used for detection of character encoding,
>>> but it seems silly to lose ASCII compatibility just for such a reason.
>>> UTF-8 does exist for ASCII transparency.
>> I don't understand what "ASCII transparency" means here. When somebody talks
>> about "ASCII" seriously, to me it sounds the same as pretending we're
>> living in a flat world which stands on the back of a big turtle.
> Welcome to the UNIX world, where many people are scared of anything non-ASCII due to compatibility with ancient programs ;-)

There's also the issue that no two systems use the same encoding (let alone use it consistently), and even if you get two systems to agree on a (hopefully Unicode-based) encoding, they will probably disagree on something else, making all your earlier efforts pointless. For instance, OS X uses NFD for file names, whereas most Linux systems use NFC. This means a file name which displays fine might not be selectable via the console (and potentially via other APIs): if the file was transferred from OS X to Linux, the NFC name you type will not match the on-disk NFD name.
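
To make the NFC/NFD mismatch concrete, here is a small Python 3 sketch. The file name and the same_name helper are invented for the example; it only compares strings and does not touch a real file system:

```python
import unicodedata

# The same visible name in the two normalization forms.
name_nfc = unicodedata.normalize("NFC", "caf\u00e9.txt")  # "é" as one code point
name_nfd = unicodedata.normalize("NFD", name_nfc)         # "e" + combining acute

# They render identically but are different strings and different bytes,
# so a byte-wise file name lookup with one will not find the other.
bytes_nfc = name_nfc.encode("utf-8")
bytes_nfd = name_nfd.encode("utf-8")
names_differ = name_nfc != name_nfd


def same_name(a: str, b: str) -> bool:
    """Compare two names after bringing both to one normalization form."""
    # Hypothetical helper; normalizing to NFC here is an arbitrary choice.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Comparing after normalizing both sides (same_name above) is the usual workaround, but it only helps in code you control, not in a shell doing a plain byte comparison.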