UTF-8 Byte order marks inserted by hg merge

Adrian Buehlmann adrian at cadifra.com
Mon Jun 30 07:50:25 CDT 2008


On 30.06.2008 14:04, Brian Wallis wrote:
> On 30/06/2008, at 7:21 PM, Adrian Buehlmann wrote:
> 
>> On 30.06.2008 10:26, Brian Wallis wrote:
>>> We have a user on Linux (Suse 10.3) running Mercurial 1.0.1 and
>>> another on Windows Vista running TortoiseHg 0.4, each of whom was
>>> working on some changes on a branch. When it came time to merge, the
>>> user on Windows pulled the changes from the other repository and
>>> merged the two heads. The merged result seemed to be slightly
>>> corrupted in that there were three extra characters added to the front
>>> of a few files. These were (in hex) EF BB BF which are the byte order
>>> marker for UTF-8.
>> This shouldn't have anything to do with Mercurial. I bet this
>> was notepad.
> 
> I'm not sure it was but I will check with the developer tomorrow.
> 
> I have tried to reproduce it by editing the file with notepad but it  
> leaves it alone, no BOM inserted. Do you know if there are particular  
> circumstances in which notepad would do this? (or some other windows  
> utility).

A known case where Notepad does this:

Create a file containing the following byte sequence: c2 a9 0d 0a
(that is, the UTF-8 encoding of the copyright sign followed by CR LF).

Open that file with Notepad and add an additional empty line by hitting
return (i.e. another 0d 0a). Then save. The file now starts with the BOM.
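
If you want to try this yourself, here is a minimal Python sketch that
writes those four bytes and checks whether a file starts with EF BB BF.
The file name bom_test.txt and both function names are just made up for
illustration; run it once, do the Notepad edit by hand, then call
starts_with_utf8_bom() on the file again.

```python
import codecs
import os
import tempfile

# The four bytes from above: U+00A9 encoded as UTF-8 (c2 a9), then CR LF.
SAMPLE = b"\xc2\xa9\r\n"


def write_sample(path):
    """Write the test bytes in binary mode so no newline translation occurs."""
    with open(path, "wb") as f:
        f.write(SAMPLE)


def starts_with_utf8_bom(path):
    """Return True if the file begins with the bytes EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


if __name__ == "__main__":
    # Hypothetical location; any writable path will do.
    path = os.path.join(tempfile.gettempdir(), "bom_test.txt")
    write_sample(path)
    print(path, "starts with BOM:", starts_with_utf8_bom(path))
```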

> The file is not UTF-8 in particular, it is simple ASCII, 7 bit clean,  
> an XML file.

And it doesn't happen to start with

<?xml version="1.0" encoding="utf-8" ?>

?

Well, then it's not that use case. But Mercurial really doesn't insert
BOMs. It must be an editor or a merge tool.
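
To narrow down which tool did it, it may help to list exactly which files
in the working copy start with EF BB BF. A rough Python sketch follows;
the function names are my own invention, and strip_bom() rewrites the file
in place, so only use it on files you have committed or backed up.

```python
import codecs
import os


def files_with_bom(root):
    """Yield paths under root whose first three bytes are EF BB BF."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into Mercurial's own metadata directory.
        if ".hg" in dirnames:
            dirnames.remove(".hg")
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                if f.read(3) == codecs.BOM_UTF8:
                    yield path


def strip_bom(path):
    """Rewrite path without a leading UTF-8 BOM; return True if one was removed."""
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    with open(path, "wb") as f:
        f.write(data[len(codecs.BOM_UTF8):])
    return True


if __name__ == "__main__":
    # Only lists the affected files; nothing is modified here.
    for path in files_with_bom("."):
        print(path)
```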

But it would be a pretty silly tool that inserts BOMs into a plain
ASCII file. I would expect this to happen on explicit user request only.
For example, Notepad++ can save the file in format "UTF-8", which
adds a BOM (there is also an option "UTF-8 without BOM").
Of course, if a user does that, you will end up with a BOM in the file.

The editor of Visual Studio (tested version 2005) can do that too, if you
tell it to do so. Menu File, entry "Advanced save options..." brings up a
dialog with a drop down "Encoding" on top, which has for example entries
"Unicode (UTF-8 with signature) - Codepage 65001" and
"Unicode (UTF-8 without signature) - Codepage 65001".
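
For what it's worth, the difference between the "with signature" and
"without signature" variants is only those three leading bytes. Python's
"utf-8-sig" codec shows this nicely; the sketch below is only an
illustration (the encode_xml() helper is hypothetical and has nothing to
do with how the editors are implemented).

```python
import codecs


def encode_xml(text, with_signature):
    """Encode text as UTF-8, optionally prefixed with the BOM (signature)."""
    # "utf-8-sig" prepends EF BB BF when encoding; plain "utf-8" does not.
    return text.encode("utf-8-sig" if with_signature else "utf-8")


if __name__ == "__main__":
    text = '<?xml version="1.0" encoding="utf-8" ?>\r\n'
    signed = encode_xml(text, with_signature=True)
    plain = encode_xml(text, with_signature=False)

    # The two encodings differ only by the three-byte prefix.
    print(signed == codecs.BOM_UTF8 + plain)

    # Decoding with "utf-8-sig" drops a leading BOM if one is present,
    # while plain "utf-8" keeps it as the character U+FEFF.
    print(signed.decode("utf-8-sig") == text)
    print(signed.decode("utf-8").startswith("\ufeff"))
```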



