Line ending translation extension

Stephen J. Turnbull stephen at xemacs.org
Mon Sep 7 04:10:51 CDT 2009


"Martin v. Löwis" writes:

 > > And the design very likely has to change, to deal with the
 > > decentralized, content-oriented behavior of DVCSes.  It seems to me
 > > that your question about hg diff acknowledges that the proposed design
 > > is incomplete in this sense.
 > 
 > I don't see how this has anything to do with the decentralized behavior.
 > Can you please be more specific?

In a centralized VCS, doing a diff means checking out one or more
versions of each file.  The main cost is network transport.  So
there's no reason to look for a fast path locally AFAICS, and no reason
to have different ways to check out.  So both versions will have been
treated by the same checkout filters and the diff is valid and
efficient.

In a distributed VCS, however, there may be a fast path.  For example,
in git you can first check for equality with the SHA1.  This is not
useful if you are comparing a checked-out file that has been filtered
with a committed version, and thus users who configure a non-identity
filter will get worse performance than those with identity
filters.  Again, if you are comparing two checked-in
versions, it may make sense to never actually write the file, and
simply compare repo contents of one blob with another.  This will be
incorrect if you compare repo contents of a version with the filtered
working version, so that's a bug we need to check for.
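
To make the fast-path hazard concrete, here is a minimal Python sketch.
The blob hash (SHA-1 over a "blob <size>\0" header plus the content) is
how git names blobs; the to_repo filter and the sample data are
assumptions made up for illustration, not code taken from git or
Mercurial.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    """Return the SHA-1 that git uses to name a blob with this content."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def to_repo(data: bytes) -> bytes:
    """Hypothetical checkin filter: CRLF in the working file becomes LF."""
    return data.replace(b"\r\n", b"\n")


repo_blob = b"line one\nline two\n"         # what the repository stores
working_file = b"line one\r\nline two\r\n"  # same text after a CRLF checkout

# Naive fast path: hash the working file as-is and compare with the repo.
# This reports a change even though nothing was edited.
naive_equal = git_blob_sha1(working_file) == git_blob_sha1(repo_blob)

# Correct path: run the checkin filter first, then compare hashes.
filtered_equal = git_blob_sha1(to_repo(working_file)) == git_blob_sha1(repo_blob)

print(naive_equal, filtered_equal)
```

The point is only that any shortcut comparing repository contents
against working-directory contents has to know whether a filter sits
between the two.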

I suspect there are other, similar issues that have to do with
optimizations based on direct repository access.  I don't *know* that
git does that kind of stuff (and I know almost nothing about the
internals of hg), but it's certainly possible and needs to be checked.

 > >  > What he is concerned about is that he may have to maintain the
 > >  > extension if he just barely touches it. I can understand that.
 > >  > The hard problem is a social one, not a technical.
 > > 
 > > Indeed it's social, but it's not about maintaining the extension.  For
 > > that, he can "just say no" at any time.
 > 
 > No, he can't - because then the extension will become unmaintained.
 > So the situation would be just like win32ext: it is there, it is
 > unmaintained, and it doesn't quite work.

This is no different from the current situation, where refusal to work
leaves us with win32ext, except that the not quite working
unmaintained extension might be a marked improvement over win32ext.

 > >  > and it doesn't work.
 > > 
 > > I can't see how it could if it "guesses" and isn't maintained.  Such
 > > guessing needs to be tweaked continuously to approximate accuracy.
 > > 
 > > Still, I'd like to see URLs to descriptions of *exactly* how it fails
 > > to work in practical situations.
 > 
 > "it" being your method B - there are no URLs discussing it, since
 > it isn't implemented yet.

No, "it" being win32ext.

 > This method (treating project files as binary) can't work since it
 > allows people to introduce mixed eol styles into the file, which
 > would break the tools that want to process the files.

See?  I told you you knew more about the problem than me.

 > So start investigating things, and contributing code, to ease that
 > nervousness.

I'm trying to investigate!  That's why I'm asking for URLs!  I don't
even know where to find the Print Options menu on Vista, and my normal
tools on Linux and Mac OS X handle CRLF and LF transparently.  *There
are no problems visible where I live, and there never will be.*

If you want me to move over to Windows, and see for myself, first you
get to wait 6 weeks while I requisition a copy, then you get to wait
two weeks while I install it into Xen, then you get to wait an
indeterminate amount of time while I learn enough about Windows to get
some idea of what the problems are.

Alternatively, the problems could be described more concretely than
"win32ext is unsatisfactory" by someone who has experienced them.  My
field of expertise is encodings, not Windows and not Mercurial.

 > I'd be happy if somebody who you would accept as expert comes up
 > with a specification (as long as it supports Python source files
 > to show up in CRLF on disk on Windows).
 > 
 > Notice that the strategy "let's follow svn" is *not* ad-hoc. It
 > gives a very clear guideline on how this feature should behave.
 > Whether that will be useful in practice remains to be seen, but
 > the strong indication is that it has worked well for Subversion,
 > and did work (in a more limited form) for CVS before.

It *is* ad hoc to the extent that the versions I have seen specify
commands where certain transformations must take place.  But the svn client
will *never* operate by direct access to the repository contents, and
therefore as long as every file checked out from the server is
transformed before other operations are conducted, operations are
correct and tolerably efficient.  This is not true for git and other
DVCSes.  They *do* have online access to the repository contents, and
AFAIK they use it.

Content-orientation also changes the requirements.  For example, git
needs to compute the SHA1 to do an add, requiring a transform.  Surely
Subversion does no transforms for svn add!  I suspect Mercurial
sometimes compares the size of a file in the repo to the size of a working
file: to do that correctly requires transforming the file.  Which
commands might do that?  I don't know, but I bet there are a lot more
of them than need to worry about EOLs in Subversion.
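
A sketch of the size-comparison case, again in Python.  This is not
Mercurial's code (I only suspect it does something like this); the
function names and the filter are hypothetical and only show why a
size check has to see the transformed file.

```python
def to_repo(data: bytes) -> bytes:
    """Hypothetical checkin filter: CRLF in the working file becomes LF."""
    return data.replace(b"\r\n", b"\n")


def size_differs(working: bytes, repo_size: int, apply_filter: bool) -> bool:
    """Cheap 'is it modified?' test based only on file size.

    True means the sizes differ, so the file is treated as modified.
    """
    candidate = to_repo(working) if apply_filter else working
    return len(candidate) != repo_size


repo_content = b"a\nb\n"        # stored with LF, 4 bytes
working = b"a\r\nb\r\n"         # checked out with CRLF, 6 bytes

# Without the filter the unedited file looks modified.
spurious = size_differs(working, len(repo_content), apply_filter=False)

# With the filter applied first, the sizes agree.
correct = size_differs(working, len(repo_content), apply_filter=True)

print(spurious, correct)
```

Note that applying the filter means reading the whole file, which
removes most of the benefit of a size-only check.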

 > FWIW, git supports the crlf attribute, which is very similar to
 > the proposed feature.

No, it's very similar to win32ext, with a little bit of additional
safety.  Quoting from git-config(1) (emphasis added):

    core.autocrlf
        If true, makes git convert CRLF at the end of lines in text
        files to LF when reading from the filesystem, and convert in
        reverse when writing to the filesystem. The variable can be
        set to input, in which case the conversion happens only while
        reading from the filesystem but files are written out with LF
        at the end of lines. CURRENTLY, WHICH PATHS TO CONSIDER "TEXT"
        (I.E. BE SUBJECTED TO THE AUTOCRLF MECHANISM) IS DECIDED
        PURELY BASED ON THE CONTENTS.

    core.safecrlf
        If true, makes git check if converting CRLF as controlled by
        core.autocrlf is reversible. Git will verify if a command
        modifies a file in the work tree either directly or
        indirectly. For example, committing a file followed by
        checking out the same file should yield the original file in
        the work tree. If this is not the case for the current setting
        of core.autocrlf, git will reject the file. The variable can
        be set to "warn", in which case git will only warn about an
        irreversible conversion but continue the operation.

        CRLF conversion bears a slight chance of corrupting data.
        autocrlf=true will convert CRLF to LF during commit and LF to
        CRLF during checkout. A FILE THAT CONTAINS A MIXTURE OF LF AND
        CRLF BEFORE THE COMMIT CANNOT BE RECREATED BY GIT.
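
The last sentence quoted above can be checked with a few lines of
Python.  This is a simplified model of the round trip that
core.safecrlf is described as verifying, not git's implementation;
the function names are made up for the example.

```python
def commit_filter(data: bytes) -> bytes:
    """Model of autocrlf=true on commit: CRLF becomes LF."""
    return data.replace(b"\r\n", b"\n")


def checkout_filter(data: bytes) -> bytes:
    """Model of autocrlf=true on checkout: LF becomes CRLF.

    Assumes the input is repository content, i.e. contains no CRLF.
    """
    return data.replace(b"\n", b"\r\n")


def is_reversible(working: bytes) -> bool:
    """True if commit followed by checkout reproduces the working file."""
    return checkout_filter(commit_filter(working)) == working


pure_crlf = b"one\r\ntwo\r\n"
mixed = b"one\r\ntwo\n"

print(is_reversible(pure_crlf))
print(is_reversible(mixed))
```

The mixed file comes back with CRLF on every line, so the original
bytes are lost, which is exactly the case the man page warns about.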

 > Bazaar has the *same* feature since 1.14:
 > http://doc.bazaar-vcs.org/bzr.dev/en/user-reference/bzr_man.html#end-of-line-conversion

IIRC[1], it's considered mostly unsatisfactory by the Windows users
(who nonetheless love bzr).  In fact, bazaar at canonical.com is quite
excited by the prospect that Mercurial might actually do something
about the EOL problem so that they might get a working implementation
too.

 > So I don't buy the argument that DVCSs are different and don't
 > need the feature.

I didn't make that argument.  My argument is that the feature needs a
different implementation, and probably broader coverage of commands,
for a DVCS than for a CVCS.


Footnotes: 
[1]  Windows users of bzr have a lot of complaints, just as they do
for git or Mercurial.  It's possible I've confused eol-conversion with
a related feature, such as "versioned rules" (i.e., "mandating
conversions"), and they're happy with eol-conversion itself, but not
the requirement that each user configure it for themselves.



