Bug 4291 - highlight omits line(s) when a file contains form feed characters
Status: RESOLVED FIXED
Alias: None
Product: Mercurial
Classification: Unclassified
Component: hgweb
Version: 3.0.1
Hardware: All All
Importance: normal bug
Assignee: Bugzilla
Reported: 2014-06-26 16:51 UTC by Augie Fackler
Modified: 2015-01-22 15:04 UTC

Description Augie Fackler 2014-06-26 16:51 UTC
http://hg.python.org/cpython/file/3a1db0d2747e/Lib/email/mime/nonmultipart.py is missing the last line of the file in the web view, but the raw view is cromulent.
Comment 1 Matt Mackall 2014-06-26 17:17 UTC
Confirmed with the highlight extension (works fine without).
Comment 2 Nikolaj Sjujskij 2014-06-29 15:06 UTC
Pygments 1.6 (pygmentize) highlights that file correctly, but the highlight extension (locally) still leaves the last line out.
Comment 3 Augie Fackler 2014-12-16 14:59 UTC
Progress! The file in question contains a form feed (!), which splits differently depending on the unicode-ness of the file. pygments is working on the unicoded version, and so it gets one extra line compared to our earlier splitlines() call.

In [3]: 'foo\x0cbar'.decode('utf-8').splitlines()
Out[3]: [u'foo', u'bar']

In [4]: 'foo\x0cbar'.splitlines()
Out[4]: ['foo\x0cbar']

What this means is that the generators are of differing length, and we never consume the last line of the file.
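
A minimal sketch of that mismatch, written for Python 3, where bytes stands in for Python 2's str and str stands in for Python 2's unicode. The sample data and the use of zip() to pair the two sequences are assumptions for illustration, not the actual hgweb code:

```python
# Python 3 rendering of the session above: bytes behaves like Python 2's
# str here, and str behaves like Python 2's unicode.
raw = b"first\nsecond\x0cstill second\nlast\n"

# Splitting the raw bytes: the form feed is NOT a line break.
byte_lines = raw.splitlines()

# Splitting the decoded text: the form feed IS a line break.
text_lines = raw.decode("utf-8").splitlines()

# Pairing the two sequences stops at the shorter one, so the final
# entry of the longer (decoded) sequence is never consumed.
paired = list(zip(byte_lines, text_lines))

print(len(byte_lines), len(text_lines), len(paired))
```

The decoded side has one more entry than the byte side, so whatever walks both in lockstep runs out before reaching the last decoded line.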

I'm not entirely sure what the fix should be.
Comment 4 Augie Fackler 2014-12-16 18:11 UTC
I've determined that with the current line-numbering system, we can't do any better than skipping highlighting for files that would break, so I've got patches ready to mail for that. I'll mail them when the queue is acceptably shallow.
Comment 5 Matt Mackall 2014-12-17 12:15 UTC
If we disable pygments for files with "\f", we'll just get a different weird bug report. Probably best to filter the character in the pygments extension.
Comment 6 Augie Fackler 2014-12-17 12:18 UTC
Filter it how?

I'd rather avoid duplicating Python's logic about what codepoints are linebreaking in Unicode and then hiding them from pygments somehow.

IMO showing the whole file is a strict improvement, even if it breaks highlighting for some timeframe.
Comment 7 Matt Mackall 2014-12-17 13:26 UTC
I'd suggest something like this:

    # str.splitlines() != unicode.splitlines() because "reasons"
    for c in "\x0c\x1c\x1d\x1e":
        if c in text:
            text = text.replace(c, '')
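
As a rough check that a filter like this brings the two line counts back into agreement, here is a self-contained sketch for Python 3. The function name strip_extra_breaks and the sample data are hypothetical; only the four characters come from the loop above:

```python
# The four characters from the suggestion above; everything else in this
# sketch (names, sample data) is made up for illustration.
EXTRA_BREAKS = "\x0c\x1c\x1d\x1e"


def strip_extra_breaks(text):
    """Drop codepoints that only the Unicode splitlines() treats as breaks."""
    for c in EXTRA_BREAKS:
        if c in text:
            text = text.replace(c, "")
    return text


raw = b"first\nsecond\x0cstill second\nthird\x1cpart\nlast\n"
filtered = strip_extra_breaks(raw.decode("utf-8"))

# After filtering, the decoded text splits into as many lines as the
# raw bytes do, so line numbers and highlighted lines pair up again.
print(len(raw.splitlines()), len(filtered.splitlines()))
```

The trade-off is that the filtered characters no longer appear in the highlighted view, which seems acceptable for control characters that have no visible rendering anyway.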

Not sure what you mean by "timeframe". Python has a bug on this that was closed back in 2010 (in fact, they made it worse and didn't fix the docs!) so they're not likely to fix it on their end. Can't even guess if the Pygments people would care.

http://bugs.python.org/issue7643
Comment 8 HG Bot 2014-12-19 15:46 UTC
Fixed by http://selenic.com/repo/hg/rev/7b8ff3fd11d3
Matt Mackall <mpm@selenic.com>
highlight: ignore Unicode's extra linebreaks (issue4291)

Unicode and Python's unicode.splitlines() treat several extra legacy
ASCII codepoints as linebreaks, even though the vast bulk of computing
and Python's own str.splitlines() do not. Rather than introduce line
numbering confusion, we filter them out when highlighting.
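
To see which ASCII codepoints are affected on a given interpreter, a small probe like the following can help. It is a Python 3 sketch (str versus bytes rather than Python 2's unicode versus str), and the function name is hypothetical:

```python
def extra_ascii_breaks():
    """Return ASCII codepoints that split text strings but not byte strings."""
    found = []
    for code in range(128):
        sample = "a" + chr(code) + "b"
        as_text = len(sample.splitlines())
        as_bytes = len(sample.encode("ascii").splitlines())
        # "\n" and "\r" split both forms, so they are not reported.
        if as_text != as_bytes:
            found.append(code)
    return found


print([hex(c) for c in extra_ascii_breaks()])
```

On Python 3 the result also includes "\x0b" (vertical tab) in addition to the four characters in the filter quoted earlier in this bug, so anyone porting the filter should probably re-run a probe like this rather than assume the list is fixed.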

(please test the fix)
Comment 9 Matt Mackall 2015-01-22 15:04 UTC
Bulk testing -> fixed