UTF-8 Byte order marks inserted by hg merge
Adrian Buehlmann
adrian at cadifra.com
Mon Jun 30 07:50:25 CDT 2008
On 30.06.2008 14:04, Brian Wallis wrote:
> On 30/06/2008, at 7:21 PM, Adrian Buehlmann wrote:
>
>> On 30.06.2008 10:26, Brian Wallis wrote:
>>> We have a user on Linux (Suse 10.3) running Mercurial 1.0.1 and
>>> another on Windows Vista running TortoiseHg 0.4, each of whom was
>>> working on some changes on a branch. When it came time to merge, the
>>> user on Windows pulled the changes from the other repository and
>>> merged the two heads. The merged result seemed to be slightly
>>> corrupted in that there were three extra characters added to the
>>> front of a few files. These were (in hex) EF BB BF, which is the
>>> byte order mark for UTF-8.
>> This shouldn't have anything to do with Mercurial. I bet this
>> was notepad.
>
> I'm not sure it was but I will check with the developer tomorrow.
>
> I have tried to reproduce it by editing the file with notepad but it
> leaves it alone, no BOM inserted. Do you know if there are particular
> circumstances in which notepad (or some other Windows utility) would
> do this?
Known use case with notepad:
Create a file containing the following byte sequence: c2 a9 0d 0a
(the UTF-8 encoding of the copyright sign, followed by CR LF).
Open that file with notepad and add an additional empty line by hitting
Return (i.e. another 0d 0a). Then save. The file now starts with the BOM.
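If you want to try this yourself, here is a minimal Python sketch for creating such a test file and for checking afterwards whether a file starts with the UTF-8 BOM. The function names and the file name are just made up for illustration; only the byte values come from the description above.

```python
import codecs
import os
import tempfile

# The four bytes from the description above:
# c2 a9 is U+00A9 (copyright sign) in UTF-8, 0d 0a is CR LF.
SAMPLE = bytes.fromhex("c2a90d0a")


def write_sample(path):
    """Write the sample bytes in binary mode, so no newline translation happens."""
    with open(path, "wb") as f:
        f.write(SAMPLE)


def starts_with_utf8_bom(path):
    """Return True if the first three bytes of the file are EF BB BF."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8
```

Call write_sample() on some path (for example "bom-test.txt" in a temp directory), do the notepad edit on that file, and then call starts_with_utf8_bom() on it to see whether the editor added the three bytes.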
> The file is not UTF-8 in particular, it is simple ASCII, 7 bit clean,
> an XML file.
And it doesn't happen to start with
<?xml version="1.0" encoding="utf-8" ?>
?
Well, then it's not that use case. But Mercurial really doesn't insert
BOMs. It must be an editor or a merge tool.
But it would be a pretty silly tool that inserts BOMs into a plain
ASCII file. I would expect this to happen on explicit user request only.
For example, Notepad++ can save the file in format "UTF-8", which
adds a BOM (there is also an option "UTF-8 without BOM").
Of course, if a user does that, you will end up with a BOM in the file.
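To illustrate the difference between the two save formats at the byte level, here is a small Python sketch. It uses Python's standard "utf-8-sig" codec (UTF-8 with BOM) and plain "utf-8" (without BOM); the sample text and the helper name strip_utf8_bom are only assumptions for the example, not something any of the tools mentioned here provide.

```python
import codecs

# Sample content: pure ASCII, like the XML file in question.
text = '<?xml version="1.0" encoding="utf-8" ?>\r\n'

with_bom = text.encode("utf-8-sig")   # prepends EF BB BF
without_bom = text.encode("utf-8")    # identical to the ASCII bytes


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

For plain ASCII content the two encodings differ only in the three leading bytes, so removing them from an affected file gives back the original content.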
The editor of Visual Studio (tested with version 2005) can do that too, if
you tell it to do so. Menu File, entry "Advanced save options..." brings up
a dialog with a drop-down "Encoding" on top, which has for example the entries
"Unicode (UTF-8 with signature) - Codepage 65001" and
"Unicode (UTF-8 without signature) - Codepage 65001".