[PATCH 1 of 5 V2] parsers: add a function to efficiently lowercase ASCII strings

Thu Oct 16 12:35:08 CDT 2014

At Sat, 4 Oct 2014 12:01:31 -0700,
Siddharth Agarwal wrote:
> 
> # HG changeset patch
> # User Siddharth Agarwal <sid0 at fb.com>
> # Date 1412386959 25200
> #      Fri Oct 03 18:42:39 2014 -0700
> # Node ID d9942c85d5f97c240038bbf302005d47e41c9018
> # Parent  2805d23e1f885ff183b900e4407d3c31758d5b59
> parsers: add a function to efficiently lowercase ASCII strings

Unfortunately, this change broke pure Python build by cyclic
dependency (util => i18n => encoding => parsers => util).

I just posted the series to fix this problem.

    http://selenic.com/pipermail/mercurial-devel/2014-October/062754.html

Could you confirm that my series doesn't cause performance regression
in similar condition ?

> We need a way to efficiently lowercase ASCII strings. For example, 'hg status'
> needs to build up the fold map -- a map from a canonical case (for OS X,
> lowercase) to the actual case of each file and directory in the dirstate.
> 
> The current way we do that is to try decoding to ASCII and then calling
> lower() on the string, labeled 'orig' below:
> 
>     str.decode('ascii')
>     return str.lower()
> 
> This is pretty inefficient, and it turns out we can do much better.
> 
> I also tested out a condition-based approach, labeled 'cond' below:
> 
>     (c >= 'A' && c <= 'Z') ? (c + ('a' - 'A')) : c
> 
> 'cond' turned out to be slower in all cases. A 256-byte lookup table with
> invalid values for everything past 127 performed similarly, but this was less
> verbose.
> 
> On OS X 10.9 with LLVM version 6.0 (clang-600.0.51), the asciilower function
> was run against two corpuses.
> 
> Corpus 1 (list of files from real-world repo, > 100k files):
> orig:   wall 0.428567 comb 0.430000 user 0.430000 sys 0.000000 (best of 24)
> cond:   wall 0.077204 comb 0.070000 user 0.070000 sys 0.000000 (best of 100)
> lookup: wall 0.060714 comb 0.060000 user 0.060000 sys 0.000000 (best of 100)
> 
> Corpus 2 (mozilla-central, 113k files):
> orig:   wall 0.238406 comb 0.240000 user 0.240000 sys 0.000000 (best of 42)
> cond:   wall 0.040779 comb 0.040000 user 0.040000 sys 0.000000 (best of 100)
> lookup: wall 0.037623 comb 0.040000 user 0.040000 sys 0.000000 (best of 100)
> 
> On a Linux server-class machine with GCC 4.4.6 20120305 (Red Hat 4.4.6-4):
> 
> Corpus 1 (real-world repo, > 100k files):
> orig:   wall 0.260899 comb 0.260000 user 0.260000 sys 0.000000 (best of 38)
> cond:   wall 0.054818 comb 0.060000 user 0.060000 sys 0.000000 (best of 100)
> lookup: wall 0.048489 comb 0.050000 user 0.050000 sys 0.000000 (best of 100)
> 
> Corpus 2 (mozilla-central, 113k files):
> orig:   wall 0.153082 comb 0.150000 user 0.150000 sys 0.000000 (best of 65)
> cond:   wall 0.031007 comb 0.040000 user 0.040000 sys 0.000000 (best of 100)
> lookup: wall 0.028793 comb 0.030000 user 0.030000 sys 0.000000 (best of 100)
> 
> SSE instructions might help even more, but I didn't experiment with those.
> 
> diff --git a/mercurial/encoding.py b/mercurial/encoding.py
> --- a/mercurial/encoding.py
> +++ b/mercurial/encoding.py
> @@ -5,7 +5,7 @@
>  # This software may be used and distributed according to the terms of the
>  # GNU General Public License version 2 or any later version.
>  
> -import error
> +import error, parsers
>  import unicodedata, locale, os
>  
>  def _getpreferredencoding():
> @@ -258,6 +258,20 @@
>              return concat(usub.encode(encoding))
>      return ellipsis # no enough room for multi-column characters
>  
> +def asciilower(s):
> +    '''convert a string to lowercase if ASCII
> +
> +    Raises UnicodeDecodeError if non-ASCII characters are found.'''
> +    s.decode('ascii')
> +    return s.lower()
> +
> +# this would ideally be util.safehasattr, but making encoding depend on util is
> +# a layering violation
> +try:
> +    asciilower = parsers.asciilower
> +except AttributeError:
> +    pass
> +
>  def lower(s):
>      "best-effort encoding-aware case-folding of local string s"
>      try:
> diff --git a/mercurial/parsers.c b/mercurial/parsers.c
> --- a/mercurial/parsers.c
> +++ b/mercurial/parsers.c
> @@ -35,6 +35,27 @@
>  	-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
>  };
>  
> +static char lowertable[128] = {
> +	'\x00', '\x01', '\x02', '\x03', '\x04', '\x05', '\x06', '\x07',
> +	'\x08', '\x09', '\x0a', '\x0b', '\x0c', '\x0d', '\x0e', '\x0f',
> +	'\x10', '\x11', '\x12', '\x13', '\x14', '\x15', '\x16', '\x17',
> +	'\x18', '\x19', '\x1a', '\x1b', '\x1c', '\x1d', '\x1e', '\x1f',
> +	'\x20', '\x21', '\x22', '\x23', '\x24', '\x25', '\x26', '\x27',
> +	'\x28', '\x29', '\x2a', '\x2b', '\x2c', '\x2d', '\x2e', '\x2f',
> +	'\x30', '\x31', '\x32', '\x33', '\x34', '\x35', '\x36', '\x37',
> +	'\x38', '\x39', '\x3a', '\x3b', '\x3c', '\x3d', '\x3e', '\x3f',
> +	'\x40',
> +	        '\x61', '\x62', '\x63', '\x64', '\x65', '\x66', '\x67', /* A-G */
> +	'\x68', '\x69', '\x6a', '\x6b', '\x6c', '\x6d', '\x6e', '\x6f', /* H-O */
> +	'\x70', '\x71', '\x72', '\x73', '\x74', '\x75', '\x76', '\x77', /* P-W */
> +	'\x78', '\x79', '\x7a',                                         /* X-Z */
> +	                        '\x5b', '\x5c', '\x5d', '\x5e', '\x5f',
> +	'\x60', '\x61', '\x62', '\x63', '\x64', '\x65', '\x66', '\x67',
> +	'\x68', '\x69', '\x6a', '\x6b', '\x6c', '\x6d', '\x6e', '\x6f',
> +	'\x70', '\x71', '\x72', '\x73', '\x74', '\x75', '\x76', '\x77',
> +	'\x78', '\x79', '\x7a', '\x7b', '\x7c', '\x7d', '\x7e', '\x7f'
> +};
> +
>  static inline int hexdigit(const char *p, Py_ssize_t off)
>  {
>  	int8_t val = hextable[(unsigned char)p[off]];
> @@ -72,6 +93,39 @@
>  	return ret;
>  }
>  
> +static PyObject *asciilower(PyObject *self, PyObject *args)
> +{
> +	char *str, *newstr;
> +	int i, len;
> +	PyObject *newobj = NULL;
> +
> +	if (!PyArg_ParseTuple(args, "s#", &str, &len))
> +		goto quit;
> +
> +	newobj = PyBytes_FromStringAndSize(NULL, len);
> +	if (!newobj)
> +		goto quit;
> +
> +	newstr = PyBytes_AS_STRING(newobj);
> +
> +	for (i = 0; i < len; i++) {
> +		char c = str[i];
> +		if (c & 0x80) {
> +			PyObject *err = PyUnicodeDecodeError_Create(
> +				"ascii", str, len, i, (i + 1),
> +				"unexpected code byte");
> +			PyErr_SetObject(PyExc_UnicodeDecodeError, err);
> +			goto quit;
> +		}
> +		newstr[i] = lowertable[(unsigned char)c];
> +	}
> +
> +	return newobj;
> +quit:
> +	Py_XDECREF(newobj);
> +	return NULL;
> +}
> +
>  /*
>   * This code assumes that a manifest is stitched together with newline
>   * ('\n') characters.
> @@ -2165,6 +2219,7 @@
>  	{"parse_manifest", parse_manifest, METH_VARARGS, "parse a manifest\n"},
>  	{"parse_dirstate", parse_dirstate, METH_VARARGS, "parse a dirstate\n"},
>  	{"parse_index2", parse_index2, METH_VARARGS, "parse a revlog index\n"},
> +	{"asciilower", asciilower, METH_VARARGS, "lowercase if ASCII\n"},
>  	{"encodedir", encodedir, METH_VARARGS, "encodedir a path\n"},
>  	{"pathencode", pathencode, METH_VARARGS, "fncache-encode a path\n"},
>  	{"lowerencode", lowerencode, METH_VARARGS, "lower-encode a path\n"},
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel at selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
> 

----------------------------------------------------------------------
[FUJIWARA Katsunori]                             foozy at lares.dti.ne.jp