4291 – highlight omits line(s) when a file contains form feed characters

Status:	RESOLVED FIXED

Alias:	None

Product:	Mercurial
Classification:	Unclassified
Component:	hgweb
Version:	3.0.1
Hardware:	All All

Importance:	normal bug
Assignee:	Bugzilla

Reported:	2014-06-26 16:51 UTC by Augie Fackler
Modified:	2015-01-22 15:04 UTC
CC List:	3 users

Description Augie Fackler 2014-06-26 16:51 UTC

http://hg.python.org/cpython/file/3a1db0d2747e/Lib/email/mime/nonmultipart.py is missing the last line of the file in the web view, but the raw view is cromulent.

Comment 1 Matt Mackall 2014-06-26 17:17 UTC

Confirmed with the highlight extension enabled (works fine without).

Comment 2 Nikolaj Sjujskij 2014-06-29 15:06 UTC

Pygments 1.6 (pygmentize) highlights that file correctly, but the highlight extension (locally) still leaves the last line out.

Comment 3 Augie Fackler 2014-12-16 14:59 UTC

Progress! The file in question contains a form feed (!), which splits differently depending on the unicode-ness of the file. pygments is working on the unicoded version, and so it gets one extra line compared to our earlier splitlines() call.

In [3]: 'foo\x0cbar'.decode('utf-8').splitlines()
Out[3]: [u'foo', u'bar']

In [4]: 'foo\x0cbar'.splitlines()
Out[4]: ['foo\x0cbar']

What this means is that the generators are of differing length, and we never consume the last line of the file.
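
For readers on Python 3, the same mismatch can be reproduced with bytes versus str. The following is only a sketch of the effect described above, not the hgweb code: `pair_lines` is a hypothetical stand-in for whatever pairs the plain lines with the highlighted ones, and the sample content is made up.

```python
# Python 3 sketch: bytes and str disagree about form feed as a line boundary.
raw = b"first\nsecond\x0cthird\n"

byte_lines = raw.splitlines()                  # breaks on \n, \r, \r\n only
text_lines = raw.decode("utf-8").splitlines()  # also breaks on \x0c


def pair_lines(plain, highlighted):
    """Hypothetical helper: walk both line sequences in lockstep.

    zip() stops at the shorter input, so any surplus line on the
    longer side is silently dropped.
    """
    return list(zip(plain, highlighted))


pairs = pair_lines(byte_lines, text_lines)
print(len(byte_lines), len(text_lines), len(pairs))
```

Because the two sequences differ in length, lockstep iteration stops early and the trailing highlighted line is never emitted.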

I'm not entirely sure what the fix should be.

Comment 4 Augie Fackler 2014-12-16 18:11 UTC

I've determined that with the current line-numbering system, we can't do any better than just not highlight files that'd break, so I've got patches ready to mail for that. I'll mail them when the queue is acceptably shallow.

Comment 5 Matt Mackall 2014-12-17 12:15 UTC

If we disable pygments for files with "\f", we'll just get a different weird bug report. Probably best to filter the character in the pygments extension.

Comment 6 Augie Fackler 2014-12-17 12:18 UTC

Filter it how?

I'd rather avoid duplicating Python's logic about what codepoints are linebreaking in Unicode and then hiding them from pygments somehow.

IMO showing the whole file is a strict improvement, even if it breaks highlighting for some timeframe.
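
One possible way to avoid hand-maintaining such a list is to ask the interpreter which ASCII characters split differently. This is a Python 3 sketch only; `ascii_unicode_only_linebreaks` is a hypothetical name and nothing like it is claimed to exist in Mercurial or Pygments.

```python
def ascii_unicode_only_linebreaks():
    """Return the ASCII characters that str.splitlines() treats as line
    boundaries but bytes.splitlines() does not (Python 3 semantics)."""
    found = []
    for cp in range(128):
        probe = "a" + chr(cp) + "b"
        # Compare how many pieces each type produces for the same content.
        as_text = len(probe.splitlines())
        as_bytes = len(probe.encode("ascii").splitlines())
        if as_text != as_bytes:
            found.append(chr(cp))
    return found


extra = ascii_unicode_only_linebreaks()
print([hex(ord(c)) for c in extra])
```

The probe only covers ASCII; non-ASCII boundaries such as U+2028 would need a wider range and an explicit encoding choice.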

Comment 7 Matt Mackall 2014-12-17 13:26 UTC

I'd suggest something like this:

    # str.splitlines() != unicode.splitlines() because "reasons"
    for c in "\x0c\x1c\x1d\x1e":
        if c in text:
            text = text.replace(c, '')

Not sure what you mean by "timeframe". Python has a bug on this that was closed back in 2010 (in fact, they made it worse and didn't fix the docs!) so they're not likely to fix it on their end. Can't even guess if the Pygments people would care.

http://bugs.python.org/issue7643

Comment 8 HG Bot 2014-12-19 15:46 UTC

Fixed by http://selenic.com/repo/hg/rev/7b8ff3fd11d3
Matt Mackall <mpm@selenic.com>
highlight: ignore Unicode's extra linebreaks (issue4291)

Unicode and Python's unicode.splitlines() treat several extra legacy
ASCII codepoints as linebreaks, even though the vast bulk of computing
and Python's own str.splitlines() do not. Rather than introduce line
numbering confusion, we filter them out when highlighting.

(please test the fix)
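
The filtering idea from the commit message can be sketched as follows. This is an illustration in Python 3, not the code from the changeset above: `EXTRA_LINEBREAKS` and `strip_extra_linebreaks` are hypothetical names, and the character list is just the four codepoints suggested in comment 7.

```python
# Assumption: only the four characters named in comment 7 (FF, FS, GS, RS).
EXTRA_LINEBREAKS = "\x0c\x1c\x1d\x1e"


def strip_extra_linebreaks(text):
    """Drop control characters that str.splitlines() counts as line
    boundaries, so the highlighted text keeps the same line count as
    the byte-oriented split."""
    return text.translate({ord(c): None for c in EXTRA_LINEBREAKS})


raw = b"first\nsecond\x0cthird\n"
filtered = strip_extra_linebreaks(raw.decode("utf-8"))

# After filtering, both splits agree on the number of lines.
print(len(raw.splitlines()), len(filtered.splitlines()))
```

The trade-off is that the rendered line loses the filtered character, and on Python 3 other boundaries such as "\x0b" would still need handling.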

Comment 9 Matt Mackall 2015-01-22 15:04 UTC

Bulk testing -> fixed