Line ending translation extension

Stephen J. Turnbull stephen at xemacs.org
Fri Sep 11 23:56:41 CDT 2009


Martin Geisler writes:

 > > No, internal format should always be the same (probably LF).
 > 
 > I disagree -- I want a repository that looks normal even without using
 > the extension.

Well, *you* (singular) can have that, but *we* (plural) can't.  The
repo-internal format (which may vary across files but is fixed for any
given file) will necessarily look abnormal on some platforms; that's
the problem we're trying to solve.

 > That means that files which are specified as CRLF format
 > in .hgeol should be stored in CRLF format in the repository.

This is the ISO 2022 approach to encoding (allow everybody to have
their own encoding, and use in-band information to indicate what it is
so clients -- here "clients" means developer tools in general, not hg!
-- can convert as they need it).  It doesn't work there, and I
don't think it will work here.

The problem is that this simply perpetuates the notion that CRLF and
LF are somehow essentially different.  For some purposes they are, but
for the VCS they are not.  The VCS should be internally consistent
about the EOL convention (I suspect even across repos), and only
distinguish on checkout.
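
To make "consistent internally, distinguish only on checkout" concrete, here is a minimal Python sketch. The function names and the choice of LF as the internal form are assumptions made for illustration; this is not the extension's code, and lone CR line endings are deliberately ignored.

```python
import os


def to_internal(data: bytes) -> bytes:
    """Normalize CRLF to LF, the assumed repo-internal convention."""
    return data.replace(b"\r\n", b"\n")


def to_working(data: bytes, eol: bytes = os.linesep.encode()) -> bytes:
    """Convert data to the EOL wanted in the working copy.

    The input is normalized first, so already-converted data is not
    turned into CR CR LF by a second pass.
    """
    return to_internal(data).replace(b"\n", eol)
```

With this split, only to_working() needs to know anything about the platform; everything stored passes through to_internal() first.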

In general, I would argue that the basic distinction is binary
vs. text, and we almost always want text files checked out with the
native EOL convention for the convenience of humans with less-capable
editors.  So even for files like a Visual Studio project file,
checking it out as 'native' almost always makes sense, since on
Windows it gets the mandatory EOL, whereas on Unix the CRLF EOL is not
mandatory because the project file is useless.

I guess that the default should be <Windows text file> = native.  This
works well because all text files will default to native on checkout,
and thus on Windows they'll all have the right EOL *even if created on
Unix and the user doesn't remember to specify "<file> = windows" in
.hgeol*.  Users who actually need CRLF on non-Windows platforms (eg,
users who build a Windows version using Wine or administer a Samba
server on Unix) would have a separate "cross-platform" branch where
the .hgeol specifies CRLF.  I'm not sure how this would work for
people who prefer LF even though they work on Windows (hi, Paul!).
Probably they have to do

[hgeol]
native = unix
**/Project = windows        # or whatever the naming convention is
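
A rough Python sketch of how such rules might be resolved for one file follows. Everything in it is hypothetical: the function name, the rule representation, and the use of fnmatch (whose "*" also matches "/") rather than Mercurial's own pattern matching. First matching pattern wins; otherwise the file falls back to whatever "native" is set to.

```python
from fnmatch import fnmatchcase

# Assumed mapping from the style names used above to byte sequences.
EOLS = {"unix": b"\n", "windows": b"\r\n"}


def checkout_eol(path: str, native: str, patterns: list) -> bytes:
    """Return the EOL to use when checking out `path`.

    `patterns` is a list of (glob, style) pairs, tried in order.
    """
    for pattern, style in patterns:
        if fnmatchcase(path, pattern):
            return EOLS[style]
    return EOLS[native]


rules = [("**/Project", "windows")]
project_eol = checkout_eol("src/Project", "unix", rules)
readme_eol = checkout_eol("README.txt", "unix", rules)
```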

I wrote "guess" here because it's possible that on Unix most people
now use editors that handle EOLs properly (ie, detect the file's
convention and maintain it).  In that case <Windows text file> = CRLF
might be better, but it would still require user intervention in the
case that somebody created a Windows-specific file on Unix.  (In the
Emacs community this happens commonly, although most such files are
simply the platform-specific part of a new feature, and the tools
involved don't care, accepting LF or CRLF as line separators.)
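
For reference, "detect the file's convention and maintain it" can be sketched in a few lines of Python. The names are made up for illustration, and the majority-vote heuristic is an assumption, not a description of what any particular editor does.

```python
def detect_eol(data: bytes) -> bytes:
    """Guess a file's EOL convention by majority vote; ties go to LF."""
    crlf = data.count(b"\r\n")
    bare_lf = data.count(b"\n") - crlf
    return b"\r\n" if crlf > bare_lf else b"\n"


def save_preserving_eol(original: bytes, edited: bytes) -> bytes:
    """Write `edited` back using the convention found in `original`."""
    eol = detect_eol(original)
    return edited.replace(b"\r\n", b"\n").replace(b"\n", eol)
```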

 > That makes most sense to me and it means that this extension will have
 > minimal impact on an existing repository. If you created the repository
 > on Windows, then just configure the extension with '[repository] native
 > = windows' in .hgeol

I would argue against this approach, just based on a bad feeling I get
from decades of wrestling with a language that has *five* commonly
used encodings, typically mixing two or more of them in the same
document.  If there is to be configuration of repo-internal EOL at
all, it *must* be a parameter that applies repository-wide to text
files.  As such, it's a reasonable way to handle "platform-native"
repositories.

But if you do this, call that 'internal', not 'native'.  The user of
the extension doesn't care what the internal format is, but these tags
are only meaningful in the context of using the extension.  To users,
'native' is what the platforms in front of them use, not the encrypted
form internal to the repository.  On Unix, "native = windows" is an
oxymoron.

 > and commit that file only. If all files had to be in LF format in
 > the repository you would have to make a huge commit where you
 > change every file when you introduce the .hgeol file.

This is the argument that the anti-Unicode faction made, with somewhat
more justification.  They were wrong; everybody is moving toward Unicode
now, even if it involves conversion of large corpuses of text.  You
could probably actually change the internal representation of every
file almost transparently (even transparently if you keep an
equivalence table of the files' digests across the mass conversion);
only Mercurial would need to know.
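
The equivalence table could look something like the Python sketch below. It is only an illustration under stated assumptions: plain SHA-1 over file content stands in for the digest, whereas real Mercurial hashes cover more than the content, and the function names are invented.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content digest used as the table key (plain SHA-1, an assumption)."""
    return hashlib.sha1(data).hexdigest()


def build_equivalence(files: dict) -> dict:
    """Map each file's pre-conversion digest to its post-conversion digest.

    `files` maps file names to their current contents; conversion here
    means normalizing CRLF to LF.
    """
    table = {}
    for old in files.values():
        new = old.replace(b"\r\n", b"\n")
        table[digest(old)] = digest(new)
    return table
```

A lookup through such a table would let old and new digests be treated as naming the same file content, which is what would keep the mass conversion out of the user's sight.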

Paul Moore wrote:

 > > [...] Don't use a repo-internal format with U+2028 as the line ending
 > > as no-one would consider the results "sane" in that case :-))

IMHO, dead wrong.  Users of the extension will consider the results
eminently sane, and non-users won't be able to get any work done.
Isn't that what we want, assuming the extension is competent, and
trivial to install?

The only problem with this is that a lot of things that (say) work by
design on Windows and by accident on Unix with LF-internal will show
up immediately as bugs, meaning that maintainers of the extension have
more work to do in the short run.  It might require more attention to
UI design as well, but I'm out of time to think about that.

