http://hg.python.org/cpython/file/3a1db0d2747e/Lib/email/mime/nonmultipart.py is missing the last line of the file in the web view, but the raw view is cromulent.
Confirmed with the highlight extension enabled (works fine without it).
Pygments 1.6 (pygmentize) highlights that file correctly, but the highlight extension (locally) still leaves the last line out.
Progress! The file in question contains a form feed (!), which splits differently depending on the unicode-ness of the file. pygments is working on the unicoded version, and so it gets one extra line compared to our earlier splitlines() call.

    In [3]: 'foo\x0cbar'.decode('utf-8').splitlines()
    Out[3]: [u'foo', u'bar']

    In [4]: 'foo\x0cbar'.splitlines()
    Out[4]: ['foo\x0cbar']

What this means is that the generators are of differing length, and we never consume the last line of the file. I'm not entirely sure what the fix should be.
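For anyone reproducing this on Python 3, here is a rough sketch of the same mismatch. It assumes bytes stands in for Python 2's str and str for unicode; the sample data and the zip() pairing are illustrative only, not the extension's actual code.

```python
# Python 3 sketch: bytes plays the role of Python 2's str,
# and str plays the role of Python 2's unicode.
raw = b"first\x0csecond\nthird\n"

# A form feed is not a line boundary for bytes...
byte_lines = raw.splitlines()

# ...but it is one for decoded text, so this list is one entry longer.
text_lines = raw.decode("utf-8").splitlines()

# Hypothetical pairing of "file lines" with "highlighted lines".
# zip() stops at the shorter sequence, so the final highlighted
# line is never consumed.
paired = list(zip(byte_lines, text_lines))

print(len(byte_lines), len(text_lines), len(paired))
print(paired[-1])
```

The last pair ends at "second", so "third" never reaches the output, which matches the symptom of the final line going missing.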
I've determined that with the current line-numbering system, we can't do any better than just not highlight files that'd break, so I've got patches ready to mail for that. I'll mail them when the queue is acceptably shallow.
If we disable pygments for files with "\f", we'll just get a different weird bug report. Probably best to filter the character in the pygments extension.
Filter it how? I'd rather avoid duplicating Python's logic about what codepoints are linebreaking in Unicode and then hiding them from pygments somehow. IMO showing the whole file is a strict improvement, even if it breaks highlighting for some timeframe.
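One way to avoid hand-copying the list of codepoints is to ask the interpreter which ones it splits on. The sketch below assumes Python 3 (str in place of unicode, bytes in place of str), so its result may differ from the Python 2 behaviour discussed here; the function name is made up for illustration.

```python
import sys


def extra_line_boundaries():
    """Return codepoints that split decoded text but not the UTF-8 bytes."""
    found = []
    for cp in range(sys.maxunicode + 1):
        sample = "a" + chr(cp) + "b"
        # "surrogatepass" lets lone surrogates be encoded for the comparison.
        encoded = sample.encode("utf-8", "surrogatepass")
        if len(sample.splitlines()) > 1 and len(encoded.splitlines()) == 1:
            found.append(cp)
    return found


extras = extra_line_boundaries()
print([hex(cp) for cp in extras])
```

This scans every codepoint, so it takes a moment to run; it is meant as a one-off check rather than something to call per request.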
I'd suggest something like this:

    # str.splitlines() != unicode.splitlines() because "reasons"
    for c in "\x0c\x1c\x1d\x1e":
        if c in text:
            text = text.replace(c, '')

Not sure what you mean by "timeframe". Python has a bug on this that was closed back in 2010 (in fact, they made it worse and didn't fix the docs!) so they're not likely to fix it on their end. Can't even guess if the Pygments people would care.

http://bugs.python.org/issue7643
Fixed by http://selenic.com/repo/hg/rev/7b8ff3fd11d3
Matt Mackall <mpm@selenic.com>

    highlight: ignore Unicode's extra linebreaks (issue4291)

    Unicode and Python's unicode.splitlines() treat several extra legacy
    ASCII codepoints as linebreaks, even though the vast bulk of computing
    and Python's own str.splitlines() do not. Rather than introduce line
    numbering confusion, we filter them out when highlighting.

(please test the fix)
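To sanity-check the filtering idea outside of hg, something like the following Python 3 sketch may help. It is not the code from the changeset: the helper name and sample data are hypothetical, and it only covers the four codepoints named earlier in this thread.

```python
# The four legacy ASCII codepoints mentioned in this thread.
EXTRA_BREAKS = b"\x0c\x1c\x1d\x1e"


def strip_extra_linebreaks(data: bytes) -> bytes:
    """Delete the codepoints that only decoded text treats as line breaks."""
    return data.translate(None, EXTRA_BREAKS)


raw = b"one\x0ctwo\nthree\x1cfour\n"
cleaned = strip_extra_linebreaks(raw)

byte_count = len(raw.splitlines())
text_count_before = len(raw.decode("utf-8").splitlines())
text_count_after = len(cleaned.decode("utf-8").splitlines())

print(byte_count, text_count_before, text_count_after)
```

After filtering, the decoded text has the same number of lines as the raw bytes, so line numbers and highlighted lines pair up again. Python 3's str.splitlines() also splits on a few other characters (for example U+2028 and U+2029), which this sketch does not handle.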
Bulk testing -> fixed