[PATCH 3 of 3] highlight: add option to prevent content-only based fallback

Anton Shestakov engored at ya.ru
Thu Oct 15 01:29:22 CDT 2015


15.10.2015, 09:25, "Gregory Szorc" <gregory.szorc at gmail.com>:
> # HG changeset patch
> # User Gregory Szorc <gregory.szorc at gmail.com>
> # Date 1444872136 25200
> # Wed Oct 14 18:22:16 2015 -0700
> # Node ID a55c6e623cb63e6ac2e4f074aff8b767ab8fc50e
> # Parent bf9868e78cdfa8acb4a9a035bc21d49260043f5c
> highlight: add option to prevent content-only based fallback

LGTM.

> When Mozilla enabled Pygments on hg.mozilla.org, we got a lot of weirdly
> colorized files. Upon further investigation, the highlight extension
> is first attempting a filename+content based match then falling back to a
> purely content-driven detection mode in Pygments. Sounds good in theory.
>
> Unfortunately, Pygments' content-driven detection establishes no minimum
> threshold for returning a lexer. Furthermore, the detection code for
> a number of languages is very liberal. For example, ActionScript 3 will
> return a confidence of 0.3 (out of 1.0) if the first 1k of the file
> we pass in matches the regex "\w+\s*:\s*\w"! Python matches on
> "import ". It's no coincidence that a number of our extension-less files
> were getting highlighted improperly.

It's a shame that Pygments doesn't allow configuring a minimum confidence level in guess_lexer, which could (again, in theory) be a better option than disabling purely content-based guessing altogether. But yeah, PythonLexer.analyse_text() does return 100% confidence if there's an 'import ' anywhere in the first 1000 bytes. Wow.
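
To illustrate what a minimum confidence level could look like, here is a rough sketch. It does not use Pygments at all: the two analyser functions are stand-ins that only mimic the heuristics described above, and guess_by_content() and its min_confidence parameter are hypothetical names, not part of any real API.

```python
import re

# Stand-in analysers that mimic the heuristics described above.
# These are NOT Pygments' actual code; names and details are assumptions.
def analyse_actionscript3(text):
    # 0.3 confidence if the text starts with something like "name : x"
    return 0.3 if re.match(r"\w+\s*:\s*\w", text) else 0.0

def analyse_python(text):
    # Full confidence if 'import ' appears in the first 1000 characters
    return 1.0 if "import " in text[:1000] else 0.0

ANALYSERS = {
    "ActionScript 3": analyse_actionscript3,
    "Python": analyse_python,
}

def guess_by_content(text, min_confidence=0.0):
    """Return the best-scoring language name, or None if nothing
    scores above zero or the best score is below min_confidence."""
    best_name, best_score = None, 0.0
    for name, analyse in ANALYSERS.items():
        score = analyse(text)
        if score > best_score:
            best_name, best_score = name, score
    if best_name is None or best_score < min_confidence:
        return None
    return best_name

# No threshold: a plain "key: value" file is guessed as ActionScript 3.
print(guess_by_content("key: value\n"))
# With a threshold of 0.5 the weak 0.3 match is rejected.
print(guess_by_content("key: value\n", min_confidence=0.5))
# A threshold does not help when the analyser itself is overconfident.
print(guess_by_content("please import the data\n", min_confidence=0.5))
```

The last call shows why a threshold alone would not be enough: an analyser that reports full confidence on a weak signal passes any cutoff, so turning off the content-only fallback is still the more reliable fix.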

