Development of a performance tracking tool for Mercurial

David Douard david.douard at logilab.fr
Wed Mar 30 03:52:37 EDT 2016


On 03/30/2016 12:55 AM, Martin von Zweigbergk wrote:
> 
> 
> On Tue, Mar 29, 2016 at 3:21 PM Pierre-Yves David <pierre-yves.david at ens-lyon.org <mailto:pierre-yves.david at ens-lyon.org>> wrote:
> 
> 
> 
>     On 03/22/2016 01:54 AM, David Douard wrote:
>     > Hi everyone,
>     >
>     > we (Philippe and a bit of myself, at Logilab) are beginning to work on a
>     > performance tracking system for Mercurial.
>     >
>     > The need for such a tool has been expressed by Pierre-Yves who managed to
>     > get the project financed by fb.
>     >
>     > We've started (mostly Philippe did) by studying several solutions
>     > to start from.
>     >
>     > Below is a "quick" report on what we did for now.
>     >
>     > There is an html version of this document on
>     >
>     >    https://hg.logilab.org/review/hgperf/raw-file/tip/docs/tools.html
> 
>     Thanks for posting this, I've inlined the document in my reply for
>     easier commenting.
> 
>     > #####################################################
>     > Mercurial performance regression detection and report
>     > #####################################################
>     >
>     > Objectives
>     > ==========
>     >
>     > The Mercurial code changes fast, and we must detect and prevent performance
>     > regressions as soon as possible.
>     >
>     > * Automatic execution of performance tests on a given Mercurial revision
>     > * Store the performance results in a database
>     > * Expose the performance results in a web application (with graphs, reports, dashboards etc.)
>     > * Provide some regression detection alarms with email notifications
>     >
>     > Metrics
>     > ~~~~~~~
>     >
>     > We already have code that produces performance metrics:
>     >
>     > * Commands from the perf extension in contrib/perf.py
>     > * Revset performance tests in contrib/revsetbenchmarks.py
>     > * Unit test execution time
>     > * Annotated portions of unit test execution time
> 
> 
> I don't think we have many unit tests, and I don't think it's worth timing them anyway. The numbers would change too easily due to simple refactorings (e.g. moving code into or out of the tested function). Perhaps you mean the tests run by run-tests.py. They're more relevant but would of course change whenever we add or remove tests (that may not be so bad).
>  
> 

We are thinking of tracking the execution time of each test (meaning each test run via
run-tests.py, yes, but individually, not the suite as a whole).
But we also plan to detect modifications of the tests themselves, so that we do not
send a regression notification when the test that got slower (in terms of execution
time) was itself modified in the offending changeset.
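
To make the idea concrete, a rough sketch (a hypothetical helper, not code we have
written yet) of how we could check with plain "hg status --change" whether a test
file was touched by the suspect changeset before emitting a notification:

    import subprocess

    def test_touched_by(rev, test_path, repo="."):
        """Return True if e.g. 'tests/test-revert.t' was modified in `rev`."""
        out = subprocess.check_output(
            ["hg", "status", "--change", rev, "-R", repo])
        touched = [line.split(None, 1)[1]
                   for line in out.decode().splitlines() if line.strip()]
        return test_path in touched

A timing change on a test for which test_touched_by() returns True would then be
reported as "test modified" rather than as a performance regression.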

> 
>     Note that we don't have official annotations (and logically no timings for
>     them). But the phrasing seems fixed on the wiki page, so I'm mostly
>     talking to third-party readers here.
> 
>     > These metrics will be used (after some refactoring of some of the tools that
>     > produce them) as performance metrics, but we may need more, written
>     > specifically for the purpose of performance regression detection.
> 
> 
> Many of the regressions I have seen fixed have been related to performance on large repos, so we would probably want to have some tests like that. Perhaps tests can be done on e.g. the Firefox repo. That will of course be very slow and you will not want to run the performance tests on your own machine very often.
>  

Yes. In fact, the step after choosing and deploying a perf tracking tool is to select
a large Mercurial repository (Firefox being the obvious candidate) as a reference
repository to run a series of benchmarks against; see the rough sketch below.
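
To give an idea, the kind of driver we have in mind (a sketch only: the repository
path is made up, and the exact list of perf commands from contrib/perf.py is still
to be chosen):

    import subprocess

    HG = "./hg"                      # hg built from the revision under test
    REF_REPO = "/srv/bench/firefox"  # hypothetical reference repository

    for cmd in ("perfstatus", "perfheads", "perftags"):
        out = subprocess.check_output(
            [HG, "--config", "extensions.perf=contrib/perf.py",
             "-R", REF_REPO, cmd])
        print("%s: %s" % (cmd, out.decode().strip()))

The collected numbers would then be pushed to whatever storage the chosen tool uses.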

> 
>     >
>     >
>     > Expected Results
>     > ~~~~~~~~~~~~~~~~
>     >
>     > Expected results are still to be discussed. For now, we aim at having a simple tool
>     > to track performance regressions on the two branches of the main Mercurial repository
>     > (stable and default).
>     >
>     > However, there are some open questions for mid-term objectives:
>     >
>     > - What revisions of the Mercurial source code should we run the performance
>     > regression tool on? (public changesets on the main branch only? Which branches? ...)
> 
>     Let's focus on public changesets for now.
> 
>     > - How do we manage the non-linear structure of a Mercurial history?
> 
>     That's a fun question. The Mercurial repository is mostly linear as long
>     as only one branch is concerned. However:
> 
>       - We don't (and have no reason to) enforce it,
>       - the big picture with multiple branches is still non-linear.
> 
> 
> Unless we foresee it changing soon, I'd still vote for having just two graphs: one for each branch. But perhaps also one for the "committed" repo's default branch.

Yes, I agree, let's focus on this simple situation.

>  
> 
> 
>     > - What kind of aggregations / comparisons do we want to be able to do? Should these
>     > be available through a "query language" or can they be hard-coded in the
>     > performance regression tool?
> 
>     I think we can start with whatever is simplest. But possible
>     evolution in this area is probably one of the criteria for picking a tool.
> 
>     > Existing tools
>     > ==============
> 
>     It would be nice to boil that down to a list of criteria (e.g. runs
>     locally, handles branches, regression algorithm, setup cost, storage
>     format, etc.) and put all of them in a big table on the wiki page. That
>     would help compare them to each other and pick a winner.
> 
>     >
>     > Airspeed velocity
>     > ~~~~~~~~~~~~~~~~~
>     >
>     > - http://asv.readthedocs.org/
>     > - used by the http://www.astropy.org/ project and inspired by https://github.com/pydata/vbench
>     > - Code: https://github.com/spacetelescope/asv
>     > - Presentation (2014): https://www.youtube.com/watch?v=OsxJ5O6h8s0
>     > - Python, Javascript (http://www.flotcharts.org/)
>     >
>     >
>     > This tool aims at benchmarking Python packages over their lifetime.
>     > It is mainly a command line tool, ``asv``, that runs a series of benchmarks (described
>     > in a JSON configuration file) and produces a static HTML/JS report.
>     >
>     > When running a benchmark suite, ``asv`` takes care of cloning/pulling the source repository
>     > into a virtualenv and running the configured tasks in that virtualenv.
>     >
>     > The results of each benchmark execution are stored in a "database" (consisting of
>     > JSON files). This database is used to produce evolution plots of the time required
>     > to run a test (or of any metric; out of the box, asv supports 4 types of benchmarks:
>     > timing, memory, peak memory and tracking), and to run the regression detection algorithms.
>     >
>     > One key feature of this tool is that it is very easy for every developer to use it in
>     > their own development environment. For example, it provides an ``asv compare`` command
>     > to compare the results of any two revisions.
>     >
>     > However, asv will require some work to fit our needs:
>     >
>     > - The main drawback of asv is that it is designed with the commit date as the X axis.
>     > We must adapt the code of asv to properly handle the "non-linearity" related to
>     > dates (see https://github.com/spacetelescope/asv/issues/390)
>     > - Tags are displayed in the graphs as secondary X-axis labels and are tied to the commit
>     > date of the tag; these should be displayed as annotations on the dots instead.
>     >
>     >
>     > :Pros:
>     >
>     > - Complete and covers most of our needs (and more)
>     > - Handles Mercurial repositories
>     > - Generates a static website with dashboards and interactive graphs
>     > - Detects regressions, implements step detection algorithms: http://asv.readthedocs.org/en/latest/dev.html#module-asv.step_detect
>     > - Parametrized benchmarks
>     > - Can collect metrics from multiple machines
>     > - Shows tags on the graph, links to commits
>     > - Framework to write timing, memory or custom benchmarks
>     > - Facilities to run benchmarks (run against a revset, compute only missing values, etc.)
>     > - Can easily be used on the developer side as well (before submitting patches)
>     > - Seems easily extensible through a plugin system
>     >
>     > :Cons:
>     >
>     > - No email notifications
>     > - Needs to plot the graphs by revision number instead of commit date
>     > - The per-branch graphs need to be fixed for Mercurial
> 
>     This one seems pretty solid and I like the idea of being able to run it
>     locally.
> 
>     The dashboard seems a bit too simple to me, and I'm a bit worried here;
>     the branch part is another unknown.
>
>     How hard would it be to implement a notification system on top of that?
> 
>     > Example: https://hg.logilab.org/review/hgperf/raw-file/1e6b03b9407c/index.html (built with a patched asv that works around the commit date and branch issues)
> 
> 
> I haven't read through the description of all the tools, but I liked this demo. It seems useful to me.
>  

Glad to hear that.
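
For the curious, here is roughly what one of our asv benchmarks could look like
(a sketch under the assumption that we shell out to the hg under test rather than
import it; asv times functions whose names start with "time_", and the environment
variable used here is made up):

    # benchmarks/benchmarks.py
    import os
    import subprocess

    REF_REPO = os.environ.get("HGPERF_REPO", ".")  # hypothetical knob

    def _hg(*args):
        # run an hg command against the reference repository, discarding output
        with open(os.devnull, "w") as devnull:
            subprocess.check_call(("hg", "-R", REF_REPO) + args,
                                  stdout=devnull)

    def time_status():
        _hg("status")

    def time_log_last_100():
        _hg("log", "-l", "100")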

> 
>     >
>     > Codespeed
>     > ~~~~~~~~~
>     >
>     >
>     > - Code: https://github.com/tobami/codespeed
>     > - Python (django), Javascript
>     > - Used by pypy and twisted, example http://speed.pypy.org/
>     > - Web application to store benchmark results, providing graphs and basic reporting
>     >
>     >
>     > This tool is a Python (Django) web application that can collect
>     > benchmark results, store them in an SQL database and analyze them. It
>     > provides multiple views of the results (graphs, grids, reports) and can
>     > generate a feed of notifications (regressions or improvements) on the
>     > home page.
>     >
>     > A few things need to be set up to make it work: a "project" (a VCS
>     > repository), an "executable" (particular compilation options of the
>     > project), an "environment" (the context in which tests are executed:
>     > CPU, OS) and a "benchmark" (which has a unit and can be cross-project
>     > or own-project). A "result" is then the value of a benchmark run in an
>     > environment, on an executable, produced from a particular
>     > revision of the project.
>     >
>     > The trend computation is a comparison between a result and the
>     > average of the three previous results, which produces a lot of false
>     > positives, and the key feature of cross-project comparison seems useless
>     > for us.
>     >
>     >
>     > :Pros:
>     >
>     > - Nice UI with colors and trends (red, green)
>     > - Useful for comparative benchmarks (e.g. pypy vs cpython)
>     > - Generates notifications automatically (global improvement/regression or per benchmark)
>     > - Integration with Mercurial repositories (shows commit contents, links, etc.)
>     >
>     > :Cons:
>     >
>     > - Poor regression detection algorithm (lots of false improvement/regression alarms)
>     > - No email notifications
>     > - Needs a lot of setup
> 
>     I have mixed feelings about that one: it has a pretty solid set of features,
>     including some we'll really need soon (comparison of implementations). It
>     also has some solid names behind it.
>
>     But it might be a bit over-complicated, and the regression tracking seems
>     pretty bad compared to the others.
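
For what it's worth, feeding results to codespeed is basically one HTTP POST per
measurement; something along these lines (field names as I understand them from the
codespeed README, URL and values made up):

    import urllib
    import urllib2

    data = {
        "commitid": "3b1f2c8d9e0a",        # changeset hash (made up)
        "branch": "default",
        "project": "Mercurial",
        "executable": "hg",
        "benchmark": "perfstatus",
        "environment": "bench-machine-1",  # must be declared in codespeed first
        "result_value": 0.42,              # seconds
    }
    urllib2.urlopen("http://speed.example.org/result/add/",
                    urllib.urlencode(data))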
> 
>     > Skia perf
>     > ~~~~~~~~~
>     >
>     > - https://perf.skia.org
>     > - Used to benchmark the skia graphic library
>     > - Nice UI https://perf.skia.org/alerts/
>     > - Code: https://github.com/google/skia-buildbot/tree/master/perf
>     > - Design: https://raw.githubusercontent.com/google/skia-buildbot/master/perf/DESIGN.md
>     > - Json format: https://raw.githubusercontent.com/google/skia-buildbot/master/perf/FORMAT.md
>     > - Go, Mysql, InfluxDB, Google Compute Engine, Javascript (https://www.polymer-project.org/, https://d3js.org/)
>     >
>     >
>     > Skia perf (https://perf.skia.org) is an interactive dashboard to display Skia (graphic library)
>     > performance data across multiple devices and GPUs. It provides a
>     > powerful interface to build custom graphs of performance data.
>     >
>     > The tool can detect regressions using a least-squares fitting method and produces a dashboard
>     > of regressions that can be annotated by logged-in users.
>     >
>     > It is written in Go and JavaScript, is based on git, and relies on a complex stack including
>     > Google Compute Engine, so it cannot be used as is without huge adaptations.
>     >
>     >
>     > :Pros:
>     >
>     > - Detects regressions using a least-squares fitting method
>     > - Interface to set the status of a detected regression (ignore, bug)
>     > - Link to the commit which introduced the regression
>     > - Interface to build a custom graph from multiple metrics
>     > - Handle notifications
>     >
>     > :Cons:
>     >
>     > - Slow interface (eats browser memory)
>     > - Complex stack
>     > - Requires GCE
>     > - Not usable as is
>     > - Doesn't handle Mercurial repositories
> 
>     That one seems to have a pretty neat set of advanced features. But it
>     seems like the amount of work to adapt it to our needs is too large for
>     it to be considered seriously.
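
As an aside, the least-squares idea itself is simple enough that we could reuse it
whatever tool we pick; a generic illustration (not Skia's actual code):

    def best_step(values):
        """Index and error reduction of the best split into two flat segments."""
        def sse(xs):
            # sum of squared errors around the mean of the segment
            mean = sum(xs) / float(len(xs))
            return sum((x - mean) ** 2 for x in xs)
        total = sse(values)
        best = min(range(1, len(values)),
                   key=lambda i: sse(values[:i]) + sse(values[i:]))
        return best, total - (sse(values[:best]) + sse(values[best:]))

    # a series with a jump after the 5th measurement:
    print(best_step([1.0, 1.1, 0.9, 1.0, 1.1, 1.6, 1.5, 1.7, 1.6]))

A large error reduction at the best split point hints at a performance step
(regression or improvement) at the corresponding revision.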
> 
>     >
>     > AreWeFastYet
>     > ~~~~~~~~~~~~
>     >
>     > - http://arewefastyet.com/
>     > - tracking performance of JavaScript engines
>     > - Code: https://github.com/h4writer/arewefastyet
>     > - Python, Php, MySQL, Javascript (http://www.flotcharts.org/).
>     > - All in one application
>     >
>     > AreWeFastYet (http://arewefastyet.com/) is a tool that checks out the code of popular
>     > JavaScript engines, runs some benchmark suites against them (Octane, Kraken, SunSpider...),
>     > stores the results in a MySQL database and exposes them in a web application that displays
>     > comparative graphs and regression reports.
>     >
>     > AWFY is written in Python, PHP and JavaScript and requires a MySQL database. The
>     > regression detection algorithm is based on a local average comparison, and
>     > many things (builder machines, etc.) are hardcoded; it is specific to its purpose.
>     >
>     >
>     >
>     > :Pros:
>     >
>     > - Handles Mercurial repositories
>     > - Shows the regression commit range http://arewefastyet.com/regressions/#/regression/1796301
>     >
>     > :Cons:
>     >
>     > - Not usable as is
>     > - Unclear and custom regression detection algorithm that might not work in our case
> 
>     This does not seem like the droids we are looking for (too specialised)
> 
>     >
>     >
>     > EzBench
>     > ~~~~~~~
>     >
>     > - Code: https://cgit.freedesktop.org/ezbench
>     > - Used to benchmark graphics-related patches on the Linux kernel.
>     > - Slides: https://fosdem.org/2016/schedule/event/ezbench/attachments/slides/1168/export/events/attachments/ezbench/slides/1168/fosdem16_martin_peres_ezbench.pdf
>     > - Shell scripts
>     >
>     > EzBench (https://cgit.freedesktop.org/ezbench) is a collection of tools to benchmark
>     > graphics-related patchsets on the Linux kernel. It runs the benchmark suite on a particular
>     > commit and stores the results as CSV files. It has tools to read the results and generate static
>     > HTML reports. It can also automate the bisect process to find the commit that introduced a
>     > regression. It is written in shell and Python and is highly coupled to its purpose.
>     >
>     > :Pros:
>     >
>     > - Generates reports
>     > - Bisects performance changes automatically and confirms a detected regression by reproducing it
>     > - Tips for reducing variance; captures all benchmark data (hardware, libraries, versions)
>     >
>     > :Cons:
>     >
>     > - Not usable as is
>     > - Doesn't handle Mercurial repositories
> 
>     It is unclear to me what makes it not usable as is (besides the lack of
>     Mercurial support?)
> 
>     > TimeSeries oriented Databases
>     > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>     >
>     > The reporting interface will provide reports (of course), graphs and dashboards so
>     > it's tempting to use tools like Grafana_, InfluxDB_, Graphite_ that already provide such
>     > features. Some of them (like Beacon_) can even provide notifications based on
>     > rules.
>     >
>     > But they are all based on *time* series only, and that doesn't really fit our needs
>     > because our problem is not linear with respect to dates, and it may become really
>     > tricky to use them to collect metrics on drafts or to handle merge changesets properly.
>     >
>     > Choosing such a time-series-oriented database would most probably prove to be a
>     > poor choice, due to its structural inability to handle the model of a repository.
> 
>     I have not thought too much about it yet, but I think I agree here. We
>     probably want our primary view to be indexed on "revision number" or
>     something similar.
> 
>     > CI Builders
>     > ===========
> 
>     Is there any reason not to go with buildbot here?
> 
>     > Using an existing CI builder will remove the hard part of running performance
>     > tests on build machines and triggering builds automatically on SCM changes.
>     >
>     >
>     > Buildbot
>     > ~~~~~~~~
>     >
>     > The Mercurial project already uses buildbot: http://buildbot.mercurial-scm.org/
>     >
>     > Buildbot seems very extensible and configurable through Python
>     > code/configuration and has a read-only HTTP/JSON API.
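
For illustration, a perf builder in the buildbot master configuration could look
roughly like this (a sketch: the step list, repository URL and driver script name
are placeholders):

    from buildbot.process.factory import BuildFactory
    from buildbot.steps.source.mercurial import Mercurial
    from buildbot.steps.shell import ShellCommand

    perf_factory = BuildFactory()
    # check out the revision to benchmark
    perf_factory.addStep(Mercurial(
        repourl="https://www.mercurial-scm.org/repo/hg",
        mode="full", branchType="inrepo"))
    # run the (hypothetical) benchmark driver and upload its results
    perf_factory.addStep(ShellCommand(
        name="run-benchmarks",
        command=["python", "contrib/hgperf-run.py"]))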
>     >
>     >
>     > Jenkins
>     > ~~~~~~~
>     >
>     > https://jenkins-ci.org has a lot of features, including smart selection of the build
>     > slave (e.g. one concurrent build per slave), parametrized builds (e.g. by revision
>     > hash) and gathering of artifacts (e.g. performance results). All features are also
>     > available through a REST API.
>     >
>     > Jenkins (and its plugins) are written in Java.
>     >
>     >
>     > .. _Grafana: http://grafana.org/
>     > .. _InfluxDB: https://influxdata.com/
>     > .. _Graphite: http://graphite.wikidot.com/
>     > .. _Beacon: https://github.com/klen/graphite-beacon
> 
> 
>     --
>     Pierre-Yves David
>     _______________________________________________
>     Mercurial-devel mailing list
>     Mercurial-devel at mercurial-scm.org <mailto:Mercurial-devel at mercurial-scm.org>
>     https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
> 


-- 

David DOUARD		 LOGILAB
Directeur du département Outils & Systèmes

+33 1 45 32 03 12	 david.douard at logilab.fr
       	                 http://www.logilab.fr/id/david.douard

Formations - http://www.logilab.fr/formations
Développements - http://www.logilab.fr/services
Gestion de connaissances - http://www.cubicweb.org/