[PATCH 1 of 2] templates: add {lasttag} and {lasttagdistance} keywords

Gilles Moris gilles.moris at free.fr
Fri Aug 28 16:32:34 CDT 2009


On Fri August 28 2009 21:17:16 Greg Ward wrote:
> On Fri, Aug 28, 2009 at 10:18 AM, Mads Kiilerich<mads at kiilerich.com> wrote:
> >> I really hope there is a way to do this that efficiently supports
> >> large repositories.  I think it's a great idea and I'd like to see it
> >> go in, but not if it uses O(N) time and memory (N being the number of
> >> changesets in the whole repo).
> >>
> >
> > If your repo has N changesets then the latest tag might be N parent-links
> > away, so O(N) is a real upper bound which will be reached in some cases.
> 
> Granted: so it has to have O(N) runtime in the worst case.  But as
> coded, it uses O(N) memory in *every* case.  And I think it also uses
> O(N) runtime in every case (I didn't see a break or return in that
> loop over all changesets).
> 
> As I said recently, repositories with 100,000+ changesets are reality
> *today*, and they are not going to get any smaller as time goes by.
> And one important component of optimization is picking algorithms that
> scale well and don't *need* to be optimized for large datasets.
> 

Guess what is my test repo: netbeans 6.7 133k+ cset.

Let's start with the time:
I have a pretty decent processor (C2D 3Ghz, but in a VirtualBox), but this is
not a high end system.
Time for hg tip --template '{lasttag}' : around 4 sec
Now guess, time for hg id -n: around 3sec, so I am about 25% slower. Fair
enough, but that's not bad considering I am doing a topological search.
Yes, everything is in the disk cache, and possibly identify could be improved.

But this brings us to memory consumption: I quickly measured to 80 MB for this
133K+ repo.
That's a lot, but how much size is your working directory: between 1 and 2 GB,
may be even compiled. Don't tell that the package maintainer has less than 1 GB
of RAM and a pretty decent CPU to deal with that repo.

Optimization is a difficult task: I have already spent some time on several
optimization projects, and you never know which approach is going to be the best.
Moreover, you can optimize for memory or for speed but rarely both, it is
always a trade-off.

As to use an algorithm that search for a single rev (like the the one I use in
the nearest extension), you can count the time in hours if somebody use 'hg log'.
You say that usually they will not use it, but actually they *will*. I have made
enough customer support to know they usually use our software in a way we don't
expect. Anything else than a map for a template is broken from the start.

Last, did you just tried it on your 108k repo ?
May be will find that it is not so bad and I would be interested to get your
timings.

> >> One final question, which is unclear from the comments and the code:
> >> are you going by topological nearness or revision-count nearness?  I
> >> think topological distance is the only sane way to do this, so I
> >> really hope that's what you're doing.  The comments and docstrings
> >> should be clearer.
> >>
> >
> > Neither; it uses the nearest tag with the latest date, and the distance is
> > the longest topological distance to that. Matt preferred 'most recent' in
> > http://www.selenic.com/pipermail/mercurial-devel/2009-July/013866.html ,
> > interpreted the way that the latest tag in the example is b.
> 
> Oh yeah, I forgot about that.  You guys clearly thought long and hard
> about it at the time, so I have nothing further to add.  (Except that
> possibly the comments could explain the motivation for the current
> algorithm, which is non-trivial.)
> 

Yes, I will try to answer your feedback and improve the doc.

Regards.
Gilles.

> Greg
> 

PS: I realized couple days ago that my "fast" nearest extension do not
return the right distance. So fast but broken. I will have to figure out why.
Probably because mixing heuristic and topology search.



More information about the Mercurial-devel mailing list