[patch] syntax:plain for .hgignore

Jonathan S. Shapiro shap at eros-os.com
Tue Sep 11 09:27:35 CDT 2007


On Tue, 2007-09-11 at 12:23 +0200, Guido Ostkamp wrote:
> I've done it, but the results don't differ much (I ran the commands
> several times). All results are now with Python 2.5.1:

Thank you. It is good to compare apples to apples.

Your numbers strongly suggest that the disk overheads are the same in
all cases. The "empty repo" comparisons suggest that the RE execution
time is indeed the major culprit. This is still extremely surprising to
me, but I accept your numbers. Based on your numbers, I agree that
precompiling and saving the RE will not help very much.

I would still like to understand why the RE is so slow, but I definitely
understand why you feel that the performance issue is important here. I
have two suspicions about the RE matching performance: we are either
getting eaten by backtracking or by string manipulation.

I can look into this, but not today.
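
When I do, the first thing I would try is a rough micro-benchmark along
these lines (purely illustrative, not part of any patch; the path shape
and counts are invented) comparing one big literal alternation RE,
built from paths that share a long prefix, against a plain set lookup:

    # Hypothetical micro-benchmark: is a big literal alternation RE
    # paying for trying (and re-scanning the shared prefix of) each
    # alternative in turn?
    import re, time

    # Paths sharing a long prefix and differing only at the end,
    # mimicking the .hgignore shape discussed in this thread.
    paths = ['foo/bar/baz/bletch/file%d.c' % i for i in range(2000)]
    pattern = re.compile('(?:%s)$' % '|'.join(re.escape(p) for p in paths))
    path_set = set(paths)
    probe = 'foo/bar/baz/bletch/file1999.c'  # near-worst case: last alternative

    t = time.time()
    for _ in range(10000):
        pattern.match(probe)
    print('alternation RE: %.3fs' % (time.time() - t))

    t = time.time()
    for _ in range(10000):
        probe in path_set
    print('set lookup:     %.3fs' % (time.time() - t))

If the RE loop is orders of magnitude slower, the alternative-by-
alternative scanning (rather than disk or hashing) is the likely
culprit.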

> > Without seeing your .hgignore file, I can only speculate, but here is my
> > suspicion about it: (1) the .hgignore entries have many long common
> > prefixes, such as:
> >
> >    foo/bar/baz/bletch/file1.c
> >    foo/bar/baz/bletch/file2.c
> 
> For the 'syntax: plain'-style patterns, we have the following
> distribution of entries (all with full pathname):

Unfortunately those stats do not tell me what I was trying to find out.
Let me ask it differently: do those longer paths mainly differ only
near the end? I suspect the answer is yes (it is true of our tree as
well).

> > Ultimately, however, you still have not addressed my main concern with
> > this, which is that we are going to have to undo this when
> > include/exclude is done. Even if the benefit here is real, it does not
> > seem worthwhile if we are just going to have to throw it out in the next
> > release. I would rather try to find a way to make the RE mechanism
> > faster here.
> 
> I don't understand why include/exclude stuff should be an issue here.
> If you introduce that, you will have to process the patterns 
> separately one after the other because order matters, correct?

Initially we believed that this would be the case, but it turns out
not to be. Using Python's RE syntax, it is possible to build a single
(admittedly perverse) RE that handles everything in one pass.

That may not turn out to be the most efficient approach, but it is
possible. It is also possible to build a sequence of REs. If you do
that, it is not quite as bad as one RE per entry: any uninterrupted run
of includes can be collapsed into a single RE, and likewise any
uninterrupted run of excludes.
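
To sketch roughly what that collapsing might look like (hypothetical
illustration only, not Mercurial's implementation; it assumes
order-sensitive, last-match-wins semantics, and the names are made up
for the example):

    import re
    from itertools import groupby

    def collapse_runs(rules):
        # rules: a list of (kind, re_source) pairs in .hgignore order,
        # where kind is 'include' or 'exclude'.
        collapsed = []
        for kind, run in groupby(rules, key=lambda r: r[0]):
            joined = '|'.join('(?:%s)' % src for _, src in run)
            collapsed.append((kind, re.compile(joined)))
        return collapsed

    def excluded(path, collapsed):
        # Assumed semantics, for illustration only: later rules
        # override earlier ones, so the last matching run decides.
        result = False
        for kind, rx in collapsed:
            if rx.match(path):
                result = (kind == 'exclude')
        return result

    rules = [('exclude', r'build/.*'), ('exclude', r'.*\.o$'),
             ('include', r'build/keepme\.o$')]
    runs = collapse_runs(rules)
    print(excluded('build/keepme.o', runs))  # -> False: the include run wins

The point is just that the number of REs you apply is bounded by the
number of include/exclude transitions, not by the number of entries.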

> In this case you will no longer have one large check but a lot
> of small checks which you need to go through.
> 
> Why would you want to not support a faster algorithm for such a
> check?

Because I don't think the implementation will match your
assumptions. :-)
-- 
Jonathan S. Shapiro
Managing Director
The EROS Group, LLC
www.coyotos.org, www.eros-os.org


