[PATCH 1 of 5 V2] parsers: add a function to efficiently lowercase ASCII strings
FUJIWARA Katsunori
foozy at lares.dti.ne.jp
Thu Oct 16 12:35:08 CDT 2014
At Sat, 4 Oct 2014 12:01:31 -0700,
Siddharth Agarwal wrote:
>
> # HG changeset patch
> # User Siddharth Agarwal <sid0 at fb.com>
> # Date 1412386959 25200
> # Fri Oct 03 18:42:39 2014 -0700
> # Node ID d9942c85d5f97c240038bbf302005d47e41c9018
> # Parent 2805d23e1f885ff183b900e4407d3c31758d5b59
> parsers: add a function to efficiently lowercase ASCII strings
Unfortunately, this change broke pure Python build by cyclic
dependency (util => i18n => encoding => parsers => util).
I just posted the series to fix this problem.
http://selenic.com/pipermail/mercurial-devel/2014-October/062754.html
Could you confirm that my series doesn't cause performance regression
in similar condition ?
> We need a way to efficiently lowercase ASCII strings. For example, 'hg status'
> needs to build up the fold map -- a map from a canonical case (for OS X,
> lowercase) to the actual case of each file and directory in the dirstate.
>
> The current way we do that is to try decoding to ASCII and then calling
> lower() on the string, labeled 'orig' below:
>
> str.decode('ascii')
> return str.lower()
>
> This is pretty inefficient, and it turns out we can do much better.
>
> I also tested out a condition-based approach, labeled 'cond' below:
>
> (c >= 'A' && c <= 'Z') ? (c + ('a' - 'A')) : c
>
> 'cond' turned out to be slower in all cases. A 256-byte lookup table with
> invalid values for everything past 127 performed similarly, but this was less
> verbose.
>
> On OS X 10.9 with LLVM version 6.0 (clang-600.0.51), the asciilower function
> was run against two corpuses.
>
> Corpus 1 (list of files from real-world repo, > 100k files):
> orig: wall 0.428567 comb 0.430000 user 0.430000 sys 0.000000 (best of 24)
> cond: wall 0.077204 comb 0.070000 user 0.070000 sys 0.000000 (best of 100)
> lookup: wall 0.060714 comb 0.060000 user 0.060000 sys 0.000000 (best of 100)
>
> Corpus 2 (mozilla-central, 113k files):
> orig: wall 0.238406 comb 0.240000 user 0.240000 sys 0.000000 (best of 42)
> cond: wall 0.040779 comb 0.040000 user 0.040000 sys 0.000000 (best of 100)
> lookup: wall 0.037623 comb 0.040000 user 0.040000 sys 0.000000 (best of 100)
>
> On a Linux server-class machine with GCC 4.4.6 20120305 (Red Hat 4.4.6-4):
>
> Corpus 1 (real-world repo, > 100k files):
> orig: wall 0.260899 comb 0.260000 user 0.260000 sys 0.000000 (best of 38)
> cond: wall 0.054818 comb 0.060000 user 0.060000 sys 0.000000 (best of 100)
> lookup: wall 0.048489 comb 0.050000 user 0.050000 sys 0.000000 (best of 100)
>
> Corpus 2 (mozilla-central, 113k files):
> orig: wall 0.153082 comb 0.150000 user 0.150000 sys 0.000000 (best of 65)
> cond: wall 0.031007 comb 0.040000 user 0.040000 sys 0.000000 (best of 100)
> lookup: wall 0.028793 comb 0.030000 user 0.030000 sys 0.000000 (best of 100)
>
> SSE instructions might help even more, but I didn't experiment with those.
>
> diff --git a/mercurial/encoding.py b/mercurial/encoding.py
> --- a/mercurial/encoding.py
> +++ b/mercurial/encoding.py
> @@ -5,7 +5,7 @@
> # This software may be used and distributed according to the terms of the
> # GNU General Public License version 2 or any later version.
>
> -import error
> +import error, parsers
> import unicodedata, locale, os
>
> def _getpreferredencoding():
> @@ -258,6 +258,20 @@
> return concat(usub.encode(encoding))
> return ellipsis # no enough room for multi-column characters
>
> +def asciilower(s):
> + '''convert a string to lowercase if ASCII
> +
> + Raises UnicodeDecodeError if non-ASCII characters are found.'''
> + s.decode('ascii')
> + return s.lower()
> +
> +# this would ideally be util.safehasattr, but making encoding depend on util is
> +# a layering violation
> +try:
> + asciilower = parsers.asciilower
> +except AttributeError:
> + pass
> +
> def lower(s):
> "best-effort encoding-aware case-folding of local string s"
> try:
> diff --git a/mercurial/parsers.c b/mercurial/parsers.c
> --- a/mercurial/parsers.c
> +++ b/mercurial/parsers.c
> @@ -35,6 +35,27 @@
> -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
> };
>
> +static char lowertable[128] = {
> + '\x00', '\x01', '\x02', '\x03', '\x04', '\x05', '\x06', '\x07',
> + '\x08', '\x09', '\x0a', '\x0b', '\x0c', '\x0d', '\x0e', '\x0f',
> + '\x10', '\x11', '\x12', '\x13', '\x14', '\x15', '\x16', '\x17',
> + '\x18', '\x19', '\x1a', '\x1b', '\x1c', '\x1d', '\x1e', '\x1f',
> + '\x20', '\x21', '\x22', '\x23', '\x24', '\x25', '\x26', '\x27',
> + '\x28', '\x29', '\x2a', '\x2b', '\x2c', '\x2d', '\x2e', '\x2f',
> + '\x30', '\x31', '\x32', '\x33', '\x34', '\x35', '\x36', '\x37',
> + '\x38', '\x39', '\x3a', '\x3b', '\x3c', '\x3d', '\x3e', '\x3f',
> + '\x40',
> + '\x61', '\x62', '\x63', '\x64', '\x65', '\x66', '\x67', /* A-G */
> + '\x68', '\x69', '\x6a', '\x6b', '\x6c', '\x6d', '\x6e', '\x6f', /* H-O */
> + '\x70', '\x71', '\x72', '\x73', '\x74', '\x75', '\x76', '\x77', /* P-W */
> + '\x78', '\x79', '\x7a', /* X-Z */
> + '\x5b', '\x5c', '\x5d', '\x5e', '\x5f',
> + '\x60', '\x61', '\x62', '\x63', '\x64', '\x65', '\x66', '\x67',
> + '\x68', '\x69', '\x6a', '\x6b', '\x6c', '\x6d', '\x6e', '\x6f',
> + '\x70', '\x71', '\x72', '\x73', '\x74', '\x75', '\x76', '\x77',
> + '\x78', '\x79', '\x7a', '\x7b', '\x7c', '\x7d', '\x7e', '\x7f'
> +};
> +
> static inline int hexdigit(const char *p, Py_ssize_t off)
> {
> int8_t val = hextable[(unsigned char)p[off]];
> @@ -72,6 +93,39 @@
> return ret;
> }
>
> +static PyObject *asciilower(PyObject *self, PyObject *args)
> +{
> + char *str, *newstr;
> + int i, len;
> + PyObject *newobj = NULL;
> +
> + if (!PyArg_ParseTuple(args, "s#", &str, &len))
> + goto quit;
> +
> + newobj = PyBytes_FromStringAndSize(NULL, len);
> + if (!newobj)
> + goto quit;
> +
> + newstr = PyBytes_AS_STRING(newobj);
> +
> + for (i = 0; i < len; i++) {
> + char c = str[i];
> + if (c & 0x80) {
> + PyObject *err = PyUnicodeDecodeError_Create(
> + "ascii", str, len, i, (i + 1),
> + "unexpected code byte");
> + PyErr_SetObject(PyExc_UnicodeDecodeError, err);
> + goto quit;
> + }
> + newstr[i] = lowertable[(unsigned char)c];
> + }
> +
> + return newobj;
> +quit:
> + Py_XDECREF(newobj);
> + return NULL;
> +}
> +
> /*
> * This code assumes that a manifest is stitched together with newline
> * ('\n') characters.
> @@ -2165,6 +2219,7 @@
> {"parse_manifest", parse_manifest, METH_VARARGS, "parse a manifest\n"},
> {"parse_dirstate", parse_dirstate, METH_VARARGS, "parse a dirstate\n"},
> {"parse_index2", parse_index2, METH_VARARGS, "parse a revlog index\n"},
> + {"asciilower", asciilower, METH_VARARGS, "lowercase if ASCII\n"},
> {"encodedir", encodedir, METH_VARARGS, "encodedir a path\n"},
> {"pathencode", pathencode, METH_VARARGS, "fncache-encode a path\n"},
> {"lowerencode", lowerencode, METH_VARARGS, "lower-encode a path\n"},
> _______________________________________________
> Mercurial-devel mailing list
> Mercurial-devel at selenic.com
> http://selenic.com/mailman/listinfo/mercurial-devel
>
----------------------------------------------------------------------
[FUJIWARA Katsunori] foozy at lares.dti.ne.jp
More information about the Mercurial-devel
mailing list