Statistical reports for Mercurial repos

Mon Apr 5 16:54:32 CDT 2010

As of now, there are several solutions for generating different reports
for Hg repos:

- activity extension (http://labs.freehackers.org/projects/hgactivity):
  generates an activity chart for a repository by convolving the number
  of commits or changed lines; uses matplotlib for output.

- churn extension: generates per-user reports of changesets or changed
  lines, can group by date; uses textual output.

- chart extension (http://www.bitbucket.org/mg/hgchart/): generates
  changes per user, line count, function count and file size graphs
  plotted along revision count axis; flexible configuration using YAML
  file; uses matplotlib for output.

- chart extension
  (http://www.bitbucket.org/Ry4an/hg-chart-extension/wiki/Home):
  produces graphs of commit or changed lines rate grouped by date; uses
  Google Charts API (pygooglechart) for output.

Some time ago I wrote another one, in which I tried to combine several
existing solutions into one in a flexible way. The code is available at
http://sphinx.net.ru/hg/hgstats/. It processes a stream of repository
changesets using a combination of different filters, piping the output
of each into the next. I've written several filters that produce
different meaningful reports. It's possible to combine reports for
different repos on one graph. Some usage examples with results are
available at http://sphinx.net.ru/stats-test/. User docs are scarce, but
most of the stuff is documented in the code.
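
To illustrate the idea of piping filters after each other, here is a
minimal sketch using Python generators. It is not the actual hgstats
code: the changeset records, the filter names (since, lines_per_day) and
the pipeline() helper are all hypothetical and only show the general
shape of such a design.

```python
from collections import Counter
from datetime import date

# Hypothetical changeset records; a real tool would read them from a repo.
CHANGESETS = [
    {"rev": 0, "user": "alice", "date": date(2010, 3, 1), "lines": 120},
    {"rev": 1, "user": "bob", "date": date(2010, 3, 1), "lines": 15},
    {"rev": 2, "user": "alice", "date": date(2010, 3, 2), "lines": 40},
    {"rev": 3, "user": "bob", "date": date(2010, 3, 4), "lines": 7},
]


def since(stream, start):
    """Filter: drop changesets committed before `start`."""
    for cs in stream:
        if cs["date"] >= start:
            yield cs


def lines_per_day(stream):
    """Aggregating filter: yield (date, changed lines) pairs sorted by date."""
    totals = Counter()
    for cs in stream:
        totals[cs["date"]] += cs["lines"]
    for day in sorted(totals):
        yield day, totals[day]


def pipeline(stream, *filters):
    """Chain filters so that each one consumes the output of the previous."""
    for f in filters:
        stream = f(stream)
    return stream


report = list(pipeline(
    iter(CHANGESETS),
    lambda s: since(s, date(2010, 3, 2)),
    lines_per_day,
))
```

Because every filter only takes and returns an iterable, new filters can
be added or reordered without touching the existing ones.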

hgstats.py, a command-line tool, is currently the only frontend. It
supports plain text output (usable with gnuplot) as well as Google
Charts.
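
As a rough sketch of what gnuplot-friendly plain text output can look
like, the snippet below writes one whitespace-separated "x y" row per
line as soon as each row is available. The write_rows() and
cumulative_lines() names and the sample numbers are made up for this
example and do not come from hgstats.py.

```python
import io
import sys


def write_rows(rows, out=sys.stdout):
    """Write (x, y) pairs one per line; return the number of rows written."""
    count = 0
    for x, y in rows:
        out.write(f"{x} {y}\n")
        count += 1
    return count


def cumulative_lines(changes):
    """Hypothetical stats source: yield (rev, running total of lines)."""
    total = 0
    for rev, lines in enumerate(changes):
        total += lines
        yield rev, total


# Write into a buffer here; a command-line tool would use sys.stdout.
buf = io.StringIO()
written = write_rows(cumulative_lines([1, 2, 3, 1]), buf)
```

If the output is saved to a file such as stats.dat, gnuplot can draw it
with: plot "stats.dat" using 1:2 with lines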

Planned features:

- web control panel built into hgwebdir which would provide interface to
  filters and their settings;

- new filters to allow splitting/filtering by committers, skipping
  commits before/after given dates, and running external programs for
  every revision. More code metrics may be added (like "bugs introduced
  within this period" etc.);

- to process large repositories efficiently, it should be possible to
  incrementally update repository stats stored in a database (should be
  easy if we have a min/max revision switch), so that this data can be
  visualized or processed further quickly. Currently, when the text
  output backend is used, the whole list of stats data is never fully
  built in memory but gets written to stdout incrementally, so we don't
  eat up much memory on large repositories;

- for the web version, a new output backend based on Canvas with some
  rich features. This can also be implemented as a visualization tool
  for data stored in a database.
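
The incremental update mentioned in the list above could work roughly
like this. The sketch uses Python's sqlite3 module; the table layout and
the open_stats(), last_rev() and update() helpers are assumptions made
for illustration, not part of the existing code.

```python
import sqlite3


def open_stats(path=":memory:"):
    """Open (or create) the stats database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stats "
        "(rev INTEGER PRIMARY KEY, lines INTEGER)"
    )
    return conn


def last_rev(conn):
    """Return the highest stored revision, or -1 if the table is empty."""
    row = conn.execute("SELECT MAX(rev) FROM stats").fetchone()
    return -1 if row[0] is None else row[0]


def update(conn, changesets):
    """Store only revisions newer than the stored ones; return how many."""
    start = last_rev(conn) + 1
    new = [(rev, lines) for rev, lines in changesets if rev >= start]
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT INTO stats (rev, lines) VALUES (?, ?)", new)
    return len(new)


conn = open_stats()
history = [(0, 120), (1, 15), (2, 40)]   # hypothetical (rev, lines) data
first = update(conn, history)            # initial import
history.append((3, 7))                   # one new commit arrives
second = update(conn, history)           # only the new revision is stored
total = conn.execute("SELECT SUM(lines) FROM stats").fetchone()[0]
```

With a min revision switch, the value of last_rev() + 1 could be passed
to the changeset reader directly, so old revisions would not have to be
walked at all.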

Is anyone interested in mentoring such a project for the upcoming GSoC?

Any feedback on my code and this project is really welcome and will be
appreciated.
-- 
Happy Hacking.

http://sphinx.net.ru
む