UTF-8 Byte order marks inserted by hg merge

Tue Jul 1 01:57:51 CDT 2008

On 01.07.2008 06:50, Brian Wallis wrote:
> On 30/06/2008, at 10:50 PM, Adrian Buehlmann wrote:
>> Known use case with notepad:
>>
>> Create a file containing the following byte sequence: c2 a9 0d 0a
>>
>> Open that file with Notepad and add an additional empty line by
>> hitting return (i.e. another 0d 0a). Then save. The file now starts
>> with the BOM.
>>
>>> The file is not UTF-8 in particular, it is simple ASCII, 7 bit clean,
>>> an XML file.
>> And it doesn't happen to start with
>>
>> <?xml version="1.0" encoding="utf-8" ?>
> 
> Yes, it turns out that the files in question did start with that. At
> first I thought there were more, but that was a different problem; in
> the end three XML files (Eclipse .classpath) were affected.
> 
> The merge tool was the culprit. Our Windows users are using Beyond
> Compare version 3 beta, and it is adding the BOMs to the files.
> 
> As this is a problem that will not go away and the BOM in a UTF-8 file
> is just noise, I am going to use a slightly modified version of the
> win32text extension to detect and filter out these three bytes
> (ef bb bf). This will ensure that our repository never has them
> committed to it. I will use the dumbencode/dumbdecode filters so I can
> control exactly which files are converted, rather than trusting it
> never to touch a binary file. I will also put a forbid hook on our
> main repository to disallow files with BOMs from being pushed.
> 
> Thanks for the help on resolving this.

Thanks for the detailed report on your interesting solution. It
might be helpful for other users in the future too.
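
For anyone who wants to try something similar, here is a rough sketch
of the byte-level check and filter described above. It is not the
modified win32text code (which was not posted) and it does not use any
Mercurial API; the names has_bom and strip_bom are made up for
illustration. It only assumes that the BOM in question is the three
bytes ef bb bf at the very start of the file.

```python
import codecs

# The UTF-8 byte order mark: the three bytes ef bb bf.
UTF8_BOM = codecs.BOM_UTF8


def has_bom(data: bytes) -> bool:
    """Return True if the data starts with a UTF-8 BOM.

    Hypothetical helper: a check like this could back a hook that
    rejects files with a BOM.
    """
    return data.startswith(UTF8_BOM)


def strip_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM, if there is one.

    Hypothetical helper: only a BOM at offset 0 is removed; the same
    three bytes anywhere else in the data are left alone.
    """
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data


if __name__ == "__main__":
    sample = UTF8_BOM + b'<?xml version="1.0" encoding="utf-8" ?>\r\n'
    print(has_bom(sample))
    print(strip_bom(sample))
```

Applying such a filter only to an explicit list of file patterns, as
described above, avoids touching a binary file that happens to start
with the same three bytes.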

It's useful to know that Beyond Compare may cause such BOM insertions.
Maybe the tool can be configured not to do that. If it can't, it might
be worth sending a bug report to the maintainers.