run-tests.py refactor

Thu Apr 24 16:42:37 CDT 2014

I'm working on a significant refactor of run-tests.py. My work can be
found in the indygreg/run-tests-refactor bookmark of
https://bitbucket.org/indygreg/hg. I plan to send the patches to this
list once the 3.0 window is behind us. But I wanted to give people a
heads up in case there are comments before the patch bombing.

There are two overarching goals to this project:

1) Make the Mercurial tests more embeddable and consumable from external
testing frameworks.

2) Make the tests faster.

As a side-effect of both, I believe I've cleaned up run-tests.py
significantly and made it easier to understand and hack on.

First, as someone who writes a lot of 3rd party extensions and hooks, I
find it difficult to integrate run-tests.py into my testing processes.
Specifically, run-tests.py doesn't play nice with existing Python
testing tools, such as unittest and nose. Every time I set up a new
testing environment, I have to reinvent the wheel or copy some code.
Things are fragile and I dare say the barrier to entry is high enough
that it discourages testing. Testing should be encouraged by making it
turnkey.

To facilitate easier testing, my patch series converts run-tests.py to
use the Python standard library's unittest package for declaring and
running tests. Individual test cases are unittest.TestCase instances.
There is a custom unittest.TestSuite that knows how to run a collection
of Mercurial tests. There is a custom test runner (in the mold of
unittest.TextTestRunner) that prints results that should be identical
to what run-tests.py outputs today. The goal is that external testers
can instantiate these custom
classes and easily plug them into existing testing tools, thus lowering
the barrier to testing.

Refactoring run-tests.py to work in a unittest world had the beneficial
side-effect of forcing the code to become more... robust. Reliance on
global variables has been nearly eliminated. The code for parsing and
executing .t tests now lives in a single class and is easily importable.
(This may make https://bitbucket.org/brodie/cram unnecessary.) Things
like temporary directories are managed via unittest primitives such as
setUp() and tearDown(). Code for handling failures is streamlined, etc.
IMO it's a long overdue cleanup.
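
For readers less familiar with unittest fixtures, a minimal sketch of
managing a per-test temporary directory with setUp() and tearDown()
follows. TempDirTestCase and the "hgtests." prefix are illustrative
assumptions, not names taken from the patch series:

```python
import os
import shutil
import tempfile
import unittest


class TempDirTestCase(unittest.TestCase):
    """Hypothetical base class giving each test its own scratch directory."""

    def setUp(self):
        # Create the directory and make it the working directory.
        self.testtmp = tempfile.mkdtemp(prefix="hgtests.")
        self._origcwd = os.getcwd()
        os.chdir(self.testtmp)

    def tearDown(self):
        # Runs whether the test passed or failed, as long as setUp succeeded.
        os.chdir(self._origcwd)
        shutil.rmtree(self.testtmp, ignore_errors=True)

    def runTest(self):
        with open("a.txt", "w") as fh:
            fh.write("hello\n")
        self.assertTrue(os.path.exists(os.path.join(self.testtmp, "a.txt")))


case = TempDirTestCase()
result = unittest.TestResult()
case.run(result)
```

The cleanup lives in one place instead of being threaded through
every failure path by hand.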

The execution time of the Mercurial tests is a common complaint. I did
some profiling and determined that the tests were spending an awful lot
of time on the overhead of invoking the "hg" process. By executing tests
in shells, we incur the process start-up and repository
"re-association" costs for every invocation of "hg". The overhead is
significant.
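
A crude way to get a feel for this kind of overhead is sketched below.
It only times starting a bare Python interpreter versus calling a
function in the current process; it is not the profiling described
above, and a real "hg" invocation also pays for module imports and
opening the repository. time_spawn and time_inprocess are made-up
helper names:

```python
import subprocess
import sys
import time


def time_spawn(n):
    """Average wall time to start a fresh interpreter that does nothing."""
    start = time.perf_counter()
    for _ in range(n):
        subprocess.check_call([sys.executable, "-c", "pass"])
    return (time.perf_counter() - start) / n


def time_inprocess(n):
    """Average wall time to call a trivial function in this process."""
    def noop():
        pass

    start = time.perf_counter()
    for _ in range(n):
        noop()
    return (time.perf_counter() - start) / n


spawn = time_spawn(5)
inproc = time_inprocess(5)
print("spawn: %.4fs  in-process: %.7fs" % (spawn, inproc))
```

The absolute numbers depend heavily on the machine; the point is only
the gap between the two.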

I added an experimental "pysh" mode that parses shell commands into
Python functions. If a .t file consists of only shell commands that can
be inlined to Python, the test executes in pure Python. For "hg" calls,
it creates a new mercurial.dispatch.request and calls
mercurial.dispatch.dispatch().
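
To give a flavor of the approach, here is a deliberately tiny sketch.
It is not the pysh code from the branch: compile_line, compile_test,
BUILTINS and the "hg" callback are hypothetical names, and only
"echo" and "mkdir" are handled. A real hook would build a
mercurial.dispatch.request from the arguments and hand it to
mercurial.dispatch.dispatch(); that wiring is left out so the sketch
runs without Mercurial installed.

```python
import os
import shlex

# Shell metacharacters this sketch refuses to handle. The check is on the
# raw line, so even quoted occurrences cause a fallback (conservative).
UNSUPPORTED = set("|&;<>$`*?(){}")


def sh_mkdir(args, out):
    for path in args:
        os.mkdir(path)
    return 0


def sh_echo(args, out):
    out.append(" ".join(args) + "\n")
    return 0


BUILTINS = {"mkdir": sh_mkdir, "echo": sh_echo}


def compile_line(line, hg=None):
    """Return a callable for one shell line, or None if it can't be inlined."""
    if UNSUPPORTED & set(line):
        return None
    try:
        argv = shlex.split(line)
    except ValueError:  # e.g. unbalanced quotes
        return None
    if not argv:
        return None
    name, args = argv[0], argv[1:]
    if name == "hg" and hg is not None:
        # hg is an assumed callback standing in for the dispatch() call.
        return lambda out: hg(args, out)
    if name in BUILTINS:
        func = BUILTINS[name]
        return lambda out: func(args, out)
    return None


def compile_test(lines, hg=None):
    """Compile every line, or return None so the caller falls back to a shell."""
    funcs = [compile_line(line, hg) for line in lines]
    return None if None in funcs else funcs


calls = []


def fake_hg(args, out):
    # Records the arguments instead of running Mercurial.
    calls.append(args)
    return 0


out = []
funcs = compile_test(["echo hello", "hg init repo"], hg=fake_hg)
for func in funcs:
    func(out)
```

The all-or-nothing return from compile_test mirrors the rule above: a
test runs in pure Python only if every command in it can be inlined.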

The results of inlining .t tests into Python are *very* encouraging.
test-bisect2.t (the largest test in terms of file size that can be
inlined) drops from ~14s wall to ~1.8s wall (time ./run-tests.py -l
--pysh test-bisect2.t)! Even a small test like test-resolve.t drops from
~1.6s to ~0.7s. That's the good news.

The bad news is that only ~54 of the ~425 existing .t tests can be
executed in pure Python. And, even with inlining, total wall time
execution for the entire test suite only dropped by 20-30s (i7-2600K,
4 cores + 4 hyperthreads, -j8). We can't inline more tests because
they do things with the shell that can't yet be parsed into Python
functions.

This brings us to an interesting crossroads. I've identified the
per-process overhead of commands inside tests (notably hg invocations)
as a significant factor contributing to slow test execution. I can entertain
the argument that invoking hg from the same process multiple times
instead of from isolated processes does taint the effectiveness of the
test (we're not measuring real world conditions any more). But, I think
hooking in at mercurial.dispatch.dispatch() - effectively what "hg" does
- isn't a significant departure. And, since it buys us a massive
speedup, it's hard to ignore that benefit. While mpm and crew may insist on
running tests in shell mode for official acceptance testing, developers
would greatly benefit from the "99% accurate" pure Python mode (I think).

Assuming there is buy-in to executing tests in pure Python, complex
shell commands will continue to undermine the efficiency of the test
suite. We have a number of options here:

1) Start rewriting .t tests to the subset of shell we can convert to Python
2) Support more shell primitives in Python (this becomes hard fast and I
don't like parsing shell for various reasons)
3a) Establish separate, independent sections in .t tests and allow mixed
mode execution
3b) Split .t tests into multiple files
4) Establish a new test syntax for denoting Python commands. e.g. "%
mkdir foo" would be "execute this Python function with arguments." With
this approach, we could convert Python "commands" into shell and execute
in shell mode. I think that's easier than parsing shell.
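
Option 4 can be sketched in a few lines. Everything here is
hypothetical (the COMMANDS registry, parse, to_shell and run_python
are invented for illustration, and the leading indentation that .t
files use is ignored); the point is only that going from a
structured "%" line to shell is a simple rendering step, whereas the
reverse requires parsing shell:

```python
import os
import shlex


def py_mkdir(*paths):
    for path in paths:
        os.mkdir(path)


# Hypothetical registry: command name -> Python function.
COMMANDS = {"mkdir": py_mkdir}


def parse(line):
    """Split a '% name arg...' line into (name, args); None for other lines."""
    if not line.startswith("% "):
        return None
    argv = shlex.split(line[2:])
    if not argv or argv[0] not in COMMANDS:
        raise ValueError("unknown command: %r" % line)
    return argv[0], argv[1:]


def to_shell(line):
    """Render a '%' line as a '$' shell line; pass other lines through."""
    parsed = parse(line)
    if parsed is None:
        return line
    name, args = parsed
    return "$ " + " ".join(shlex.quote(arg) for arg in [name] + args)


def run_python(line):
    """Execute a '%' line directly as a Python function call."""
    name, args = parse(line)
    COMMANDS[name](*args)


print(to_shell("% mkdir foo"))
```

With something like this, the same test file could be executed in
pure Python via run_python or rendered with to_shell and fed to a
real shell.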

These solutions all require a significant amount of effort. And, since
my patch series so far has focused on maintaining backwards
compatibility, I didn't want to start down a potentially dead-end project.

You can make the argument that instead of inlining tests into pure
Python we should be making hg process invocation faster. I agree that
would be a worthwhile effort. However, no matter how efficient you make
hg process invocation, it will still be slower than reusing an existing
Python process. Thus, inline Python tests will always be faster than
shell tests. Given the number of Mercurial tests, I can't imagine that
difference ever being less than 20 seconds, so it will always be relevant
to developers wanting to quickly iterate. I therefore argue that
investment in pure Python tests is not misplaced.

Anyway, I just wanted to give people context before the massive patch
bomb arrives. I hope you find this work beneficial. Hopefully it can be
used to power a 5x faster test suite in the not-so-distant future.

Gregory