Note:

This page is no longer relevant but is kept for historical purposes.

Note:

This page is primarily intended for developers of Mercurial.

EOL Translation Plan

<!> This page is historical, see the EolExtension.

Status: draft

A plan for improved handling of EOL translation.

The problem

Different platforms have different conventions for representation of end-of-line in text files. A common feature request for Mercurial is good support for getting native line endings in text files when changesets are shared between different platforms. The internal storage format for Mercurial is bytes, and core Mercurial doesn't care much about what is encoded in the bytes.

Requirements

The main requirement is simply to somehow get

What follows is an attempt to describe some requests, requirements, and challenges to be considered. Some are obligatory to some people, some are considered nice-to-have, and some people would consider the implementation of some of them misfeatures. What is listed here should be essential to the problem, not tied to a specific solution, and should not just enumerate examples of bad solutions to avoid:

The win32text extension

Mercurial already comes with the Win32TextExtension, but it has a number of shortcomings:

Subversion

Subversion has built-in support for line ending conversion. We have a test case that documents how Subversion handles some cases of inconsistencies. It turns out that

Git

Git also has a lot of options for crlf and whitespace.

Design of the eol extension

Note: This design is work-in-progress together with the corresponding implementation at http://bitbucket.org/mg/hg-eol/.

We will keep configuration in a version controlled file called .hgeol in the repository root. It declares which files the extension should convert. The file could look like this:

[patterns]
Windows.txt = CRLF
Unix.txt = LF
test/mixed.txt = BIN
**.txt = native
**.py = native
**.proj = CRLF

[repository]
native = CRLF

The [patterns] section defines glob patterns for conversions - see hg help patterns. The first pattern wins, so more specific patterns should be put first. In the above example, the test/mixed.txt file is considered BIN since the test/mixed.txt rule matches before the **.txt rule.
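
The first-match-wins rule can be sketched in a few lines of Python. This is only an illustration: the names glob_to_regex, eol_style, and RULES are hypothetical, and the glob translation is a simplified stand-in for Mercurial's real pattern matching (see hg help patterns), which supports more syntax than is handled here.

```python
import re


def glob_to_regex(pattern):
    """Translate a simplified glob into a compiled regex (illustrative only).

    '**' matches any characters including '/', '*' matches within one
    path component, and '?' matches a single non-'/' character.
    """
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")


def eol_style(path, rules):
    """Return the style of the first rule whose pattern matches path."""
    for pattern, style in rules:
        if glob_to_regex(pattern).match(path):
            return style
    return None  # file not mentioned: no treatment


# The [patterns] section from the example above, in file order.
RULES = [
    ("Windows.txt", "CRLF"),
    ("Unix.txt", "LF"),
    ("test/mixed.txt", "BIN"),
    ("**.txt", "native"),
    ("**.py", "native"),
    ("**.proj", "CRLF"),
]
```

With these rules, eol_style("test/mixed.txt", RULES) gives "BIN" because the specific rule is listed before the **.txt rule, while eol_style("docs/readme.txt", RULES) falls through to "native".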

This solution has been designed to be minimally invasive, so that repositories without the extension will behave as correctly as possible.

Working directory format

Files declared as CRLF or LF are always checked out in that format, files declared as native are converted to the operating system's native format, and files not mentioned (or declared as BIN) receive no treatment.

Repository format

Files declared as LF, CRLF, or BIN are stored as-is in the repository. Files declared as native are stored in a configurable repository-native format which defaults to LF.

The repository-native format can optionally be configured in .hgeol in the [repository] section.
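
The two directions of conversion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the extension's code: the function names are hypothetical, the platform default is guessed from os.linesep, and lone CR characters are left untouched.

```python
import os

EOLS = {"LF": b"\n", "CRLF": b"\r\n"}


def normalize(data, eol):
    """Rewrite every LF or CRLF line ending in data (bytes) to eol."""
    lf = data.replace(b"\r\n", b"\n")
    return lf if eol == b"\n" else lf.replace(b"\n", b"\r\n")


def to_working(data, style, os_native=None):
    """Convert repository content to its working directory form."""
    if os_native is None:
        # Assumption: derive the platform default from os.linesep.
        os_native = "CRLF" if os.linesep == "\r\n" else "LF"
    if style == "native":
        return normalize(data, EOLS[os_native])
    if style in EOLS:
        return normalize(data, EOLS[style])
    return data  # BIN or not mentioned: no treatment


def to_repository(data, style, repo_native="LF"):
    """Convert working directory content to its repository form."""
    if style == "native":
        return normalize(data, EOLS[repo_native])
    return data  # LF, CRLF, and BIN are stored as-is
```

For example, to_working(b"a\nb\n", "native", os_native="CRLF") yields CRLF line endings, and to_repository of that result with the default repo_native gives back the LF form.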

Detailed behavior

The extension will change the behavior of core commands as follows.

hg update

The extension should read the .hgeol file from the target revision. So

hg update -r 100

will read .hgeol from revision 100 and ensure that files have EOLs according to the rules from that revision after the update.

If the working copy is dirty, the following should happen:

The idea is that this should let one move changes around "like normal" by basically ignoring the EOL rules while doing so.

hg commit

Files are checked to ensure correct EOLs. If .hgeol is changed, the EOLs in the working directory can get out of sync with the .hgeol file. This should make the commit abort with a message:

abort: EOL mis-match in Windows.txt: has LF, but should have CRLF
(run "hg eolupdate" to update files)

The hg eolupdate command will rewrite files in the working directory to match the .hgeol file. After that, the commit will succeed as normal and include the rewritten files. This means that updates to .hgeol are made in lock-step with the corresponding file changes. That way things are kept nicely synchronized in the repository.

The eolupdate command will make it easy to clearly separate content changes and EOL style changes. We can try to guide our users into not mixing those changes together in a single commit, by letting hg eolupdate fail if the repository has uncommitted changes, to specifically avoid updating EOLs in a "content" changeset.
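
The commit-time check could look roughly like the sketch below. The helper names find_eols and check_eol are hypothetical, and the sketch assumes that a native rule has already been resolved to LF or CRLF before the check runs.

```python
def find_eols(data):
    """Return the set of line-ending styles present in data (bytes)."""
    found = set()
    if b"\r\n" in data:
        found.add("CRLF")
    # Any LF left after removing all CRLF pairs is a bare LF.
    if b"\n" in data.replace(b"\r\n", b""):
        found.add("LF")
    return found


def check_eol(name, data, expected):
    """Return an abort message, or None if data only uses expected EOLs."""
    wrong = sorted(find_eols(data) - {expected})
    if not wrong:
        return None
    return "abort: EOL mis-match in %s: has %s, but should have %s" % (
        name, wrong[0], expected)
```

Calling check_eol("Windows.txt", b"one\ntwo\n", "CRLF") produces the first line of the abort message shown above, while a file that already uses CRLF throughout returns None and the commit proceeds.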

hg add

No checking is done at that time; the check is done when the files are committed.

hg diff

File content is normalized to the repository form before the diff is computed. The diff is then presented using the working copy form. TODO: the diff is currently shown based on the repository form of files.
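
The effect of normalizing before diffing can be illustrated with Python's difflib. This is a sketch under the assumption that the repository form is LF; the function names are hypothetical, and it does not show how the result would be presented in the working copy form.

```python
import difflib


def to_lf(text):
    """Normalize CRLF line endings to LF (the assumed repository form)."""
    return text.replace("\r\n", "\n")


def eol_insensitive_diff(repo_text, working_text):
    """Diff two versions after normalizing both to the repository form."""
    a = to_lf(repo_text).splitlines(keepends=True)
    b = to_lf(working_text).splitlines(keepends=True)
    return list(difflib.unified_diff(a, b, "a/file", "b/file"))
```

A working copy that differs from the repository only in its line endings then produces an empty diff, so only real content changes show up.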

Discussion of the eol extension

Format naming

Regardless of the implementation details, we are aware that we will need to pick unambiguous names for our various components. For some, native does not stand out as a self-explanatory name, but it does make sense to those exposed to Subversion's svn:eol-style property setting, which inspired this mechanism in the first place.

A naming policy centered on storage might be clearer to end users: storeasis and storeaslf, for example, already describe the behavior on commit. Depending on the implementation, it might be interesting to specify the behaviors on commit and on update separately: "storeasis, getaslf", "storeaslf, getasis", or "storeascrlf, converttolocal" are too long, but are self-explanatory.

Some suggested that Mercurial should not use CRLF or LF in its names, and should instead use 'Windows' and 'Unix', respectively. One convention can be chosen, or aliases can be used.

Instead of defining the repository format for native in a separate section, it could be part of the format specification, such as native/CRLF, native/LF, or native/auto. The native/auto setting means that files are checked out with native EOLs in the working copy, but otherwise preserved in the repository. So a CRLF file will remain in CRLF format in the repository, but be checked out in LF format on Unix and in CRLF format on Windows.

RAW could perhaps be a better name than BIN.

Content filtering hooks

The eol extension uses the generic encode/decode filters, just like win32text does. The filters thus cannot be used for anything else, and eol gets some extra complexity in order to work with that interface. Perhaps the extension should build on something other than the current encode/decode filters.

The keyword extension solves a similar problem - perhaps some code can be reused, or perhaps there is a common need for better hooks for content filtering?

Another but very similar problem is conversion between character encodings - for example between UTF-8 with or without BOM, UTF-16 and other multi-byte formats, and 8-bit encodings such as the ISO 8859 variants and the most common Windows code pages. Yet another example could be automatic coding style conversions. Perhaps all these problems have so much in common that they all could be solved at once?

Mercurial should be careful not to lose any information, so it would be nice if a warning was given before any lossy filter was applied. For example, a pure conversion from CRLF to LF isn't lossy, but normalization of a file with inconsistent line endings is. Perhaps core Mercurial could recognize a .hgfilter file which could look like this:

[eol]
**.py = native/lf

[keywords]
src/**.py =

[encoding]
**.txt = native/utf-8

Extensions could handle a section, and core Mercurial could warn about any unhandled sections. That would help ensure that users have the right extensions enabled. This functionality could also be used to ensure that certain (commit) hooks are enabled in all working clones. We note that some kinds of filters just ensure an invariant (CRLF or LF (or RAW)) and thus can be applied several times, for example both on checkout and on commit, and as a possible extra fix-up step to ensure the invariant both in the working directory and in the repository. Inconsistency is thus easily fixed. Other kinds of filters convert between different formats without reaching a fixed point, so the filters must be each other's inverse (and probably only partial) and applied exactly once. Inconsistencies with this kind of filter are hard to clean up.
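
The difference between the two kinds of filters can be made concrete with a small sketch. All names here are hypothetical, and the UTF-8/UTF-16 pair is only an assumed example of a conversion filter, not something the eol extension provides.

```python
def to_lf(data):
    """Invariant-enforcing filter: every line ending becomes LF."""
    return data.replace(b"\r\n", b"\n")


def to_crlf(data):
    """Invariant-enforcing filter: every line ending becomes CRLF."""
    return to_lf(data).replace(b"\n", b"\r\n")


def utf8_to_utf16(data):
    """Conversion filter: only correct when applied exactly once."""
    return data.decode("utf-8").encode("utf-16-le")


def utf16_to_utf8(data):
    """Inverse of utf8_to_utf16."""
    return data.decode("utf-16-le").encode("utf-8")


def is_idempotent(filt, sample):
    """Check whether applying filt twice equals applying it once."""
    once = filt(sample)
    return filt(once) == once
```

Here is_idempotent(to_crlf, b"a\r\nb\n") holds, so the EOL filter can safely be re-run as a fix-up step. The encoding filter fails the same check on b"hi" and only round-trips through its inverse. The EOL filter is also lossy on inconsistent input: to_lf maps both b"a\r\nb\n" and b"a\nb\n" to the same bytes.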

TODO

Extension Help Text

This is the module help text. It has been put here for easy editing and to collect all information on this page:

This extension allows you to manage what kind of line endings (CRLF or
LF) are used in the repository and in the local working directory.

The extension reads its configuration from a versioned ``.hgeol``
configuration file every time you run an ``hg`` command.  ``.hgeol`` has
similar syntax to regular Mercurial configuration files.  It uses two
sections, ``[patterns]`` and ``[repository]``.

Use ``[patterns]`` to specify the encodings to use by file pattern in
the working directory.  The available encodings are ``LF``, ``CRLF``,
and ``BIN``.  Additionally, ``native`` is an alias for the platform's
default encoding: ``LF`` on Unix (including Mac OS X) and ``CRLF`` on
Windows.  Note that ``BIN`` (do nothing to line endings) is Mercurial's
default behaviour; it's only needed so that later, more specific
patterns can override earlier, more general patterns.

You can override the default interpretation of ``native`` by configuring
``eol.native``.  Set it to ``LF`` or ``CRLF``.

The repository representation of newlines in files configured as
``native`` can be specified in the ``[repository]`` section in
``.hgeol``. The default is LF, meaning that on Windows, files configured
as ``native`` (CRLF) will be converted to LF on commit.

Example versioned ``.hgeol`` file::

  [patterns]
  **.py = native
  **.vcproj = CRLF
  **.txt = native
  Makefile = LF
  **.jpg = BIN

  [repository]
  native = LF

Example ``.hgrc`` (or ``Mercurial.ini``) section::

  [eol]
  native = CRLF

See 'hg help patterns' for more information about the glob patterns
used.
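
As a rough illustration of the .hgeol syntax described in the help text, the sketch below reads such a file with Python's standard configparser module. This is an assumption made for illustration: Mercurial has its own configuration parser, and the function name parse_hgeol is hypothetical.

```python
import configparser

HGEOL = """\
[patterns]
**.py = native
**.vcproj = CRLF
**.txt = native
Makefile = LF
**.jpg = BIN

[repository]
native = LF
"""


def parse_hgeol(text):
    """Return (ordered list of (pattern, style), repository-native format)."""
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep the case of file patterns
    parser.read_string(text)
    patterns = []
    if parser.has_section("patterns"):
        patterns = list(parser.items("patterns"))  # file order is preserved
    # The repository-native format defaults to LF when not configured.
    repo_native = parser.get("repository", "native", fallback="LF")
    return patterns, repo_native
```

The pattern list keeps the order in which rules appear in the file, which matters if rules are matched in order.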


CategoryDeveloper

EOLTranslationPlan (last edited 2012-10-25 20:31:35 by mpm)