[PATCH 1 of 2 resend] keyword: compile regexes on demand

Mads Kiilerich mads at kiilerich.com
Thu Nov 4 11:38:10 CDT 2010


On 11/04/2010 04:55 PM, Martin Geisler wrote:
> Christian Ebert<blacktrash at gmx.net>  writes:
>
>> # HG changeset patch
>> # User Christian Ebert<blacktrash at gmx.net>
>> # Date 1288791461 -3600
>> # Node ID 2ce1ff53e29f4b775ed550c13beb42da3942523e
>> # Parent  0e0a52bd58f941c00b2a1d57f23676fa486e58c3
>> keyword: compile regexes on demand
>
> Are you sure this is faster? I tried to see how long the old code took
> and here it's very fast:
>
> % python -m timeit \
>    -s "import re" \
>    -s "escaped = 'RCSfile|Author|Header|Source|Date|RCSFile|Id|Revision'" \
>    "kw = re.compile(r'\$(%s)\$' % escaped)" \
>    "kwexp = re.compile(r'\$(%s): [^$\n\r]*? \$' % escaped)"
> 100000 loops, best of 3: 2.52 usec per loop

Beware of the caching of compiled expressions inside the re module:

   $ python -m timeit \
   >   -s "import re" \
   >   -s "escaped = 'RCS|Aut|Hea|Sou|Dat|RCS|Id|Rev'" \
   >   "kw = re.compile(r'\$(%s)\$' % escaped)" \
   >   "kwexp = re.compile(r'\$(%s): [^$\n\r]*? \$' % escaped)"
   100000 loops, best of 3: 8.16 usec per loop

   $ python -m timeit \
   >   -s "import re" \
   >   -s "escaped = 'RCS|Aut|Hea|Sou|Dat|RCS|Id|Rev'" \
   >   "re.purge()"
   1000000 loops, best of 3: 1.17 usec per loop

   $ python -m timeit \
   >   -s "import re" \
   >   -s "escaped = 'RCS|Aut|Hea|Sou|Dat|RCS|Id|Rev'" \
   >   "re.purge()" \
   >   "kw = re.compile(r'\$(%s)\$' % escaped)" \
   >   "kwexp = re.compile(r'\$(%s): [^$\n\r]*? \$' % escaped)"
   1000 loops, best of 3: 1.93 msec per loop

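With the cache purged, the two compilations take about 1.9 msec per
loop, compared to about 8 usec when the re module's cache is hit. That
difference might justify the change.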
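
The same comparison can be run as a script instead of from the shell. This is only a rough sketch: the helper names (compile_both, compile_both_uncached, best_per_call) are made up for illustration and are not part of the keyword extension, and the absolute numbers will vary with machine and Python version.

```python
import re
import timeit

# Keyword alternation taken from the quoted timeit run above.
ESCAPED = 'RCSfile|Author|Header|Source|Date|RCSFile|Id|Revision'


def compile_both():
    """Compile both keyword regexes; after the first call these hit re's cache."""
    kw = re.compile(r'\$(%s)\$' % ESCAPED)
    kwexp = re.compile(r'\$(%s): [^$\n\r]*? \$' % ESCAPED)
    return kw, kwexp


def compile_both_uncached():
    """Clear re's internal cache first, forcing a real compilation."""
    re.purge()
    return compile_both()


def best_per_call(func, number=200, repeat=3):
    """Return the best observed time per call, in seconds."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number


cached = best_per_call(compile_both)
uncached = best_per_call(compile_both_uncached)
print('cached:   %.2f usec per call' % (cached * 1e6))
print('uncached: %.2f usec per call' % (uncached * 1e6))
```

The uncached figure still includes the cost of re.purge() itself, which the second measurement above shows to be small compared to the compilation.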

I'm not sure whether we should rely on the re cache or always
pre-compile everywhere, but unnecessary compilation is unfortunate. A
new general util function or pattern would perhaps be nice.
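
One possible shape for such a helper, purely as an illustration: a small wrapper that postpones re.compile until the pattern is first used and then keeps the compiled object itself, so it does not depend on the re cache. The name LazyRegex is hypothetical; nothing like it is claimed to exist in Mercurial's util module.

```python
import re


class LazyRegex(object):
    """Hypothetical helper: compile a pattern on first use, then reuse it."""

    def __init__(self, pattern, flags=0):
        self._pattern = pattern
        self._flags = flags
        self._compiled = None  # filled in on first use

    def _get(self):
        # Compile at most once, no matter how often the regex is used.
        if self._compiled is None:
            self._compiled = re.compile(self._pattern, self._flags)
        return self._compiled

    def __getattr__(self, name):
        # Only called for attributes not found on the wrapper itself.
        # Private names are never forwarded, which also avoids recursion
        # if the instance was created without running __init__.
        if name.startswith('_'):
            raise AttributeError(name)
        return getattr(self._get(), name)


# Creating the wrapper costs nothing; compilation happens at search().
kw = LazyRegex(r'\$(Id|Date|Revision)\$')
match = kw.search('version $Id$ of the file')
print(match.group(1))
```

Code that never touches the regex never pays for compiling it, while code that does use it sees the usual search/match/sub methods of a compiled pattern.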

/Mads

