[PATCH 09 of 10 lazy-changelog-parse] changelog: avoid slicing raw data until needed

Sun Mar 6 19:13:07 EST 2016

Yuya, et al:

You'll note from the revset numbers in this series that revset execution
time increases proportionally to the number of revset functions that access
changelog revision data. e.g. "author(x) or author(y)" is roughly 2x slower
than "author(x)." We're apparently decoding changelog revisions for each
revset function that accesses the data.

There is definitely a revset optimization to avoid the redundant changelog
reading. I'm no expert on the revset code and am not sure what the most
appropriate layer to implement such an optimization would be. Unlike
template keywords, we don't have a cache passed to the revset functions. We
/could/ implement an aggressive changectx cache as a context manager and
have it active during revset execution. We could potentially also work
something into the revset layer itself. I'm capable of implementing the
former pretty easily. But I'm not sure it is the best layer. I'd appreciate
if someone who understood the revset code better could weigh in on whether
it is an appropriate location for more optimal behavior here.

On Sun, Mar 6, 2016 at 3:58 PM, Gregory Szorc <gregory.szorc at gmail.com>
wrote:

> # HG changeset patch
> # User Gregory Szorc <gregory.szorc at gmail.com>
> # Date 1457307620 28800
> #      Sun Mar 06 15:40:20 2016 -0800
> # Node ID d1a19a190175f7fd17a37a2a97e5969e8cd8b9a4
> # Parent  6f9345c743c024c50c634f6cb54d2dd892dc2792
> changelog: avoid slicing raw data until needed
>
> Before, we were slicing the original raw text and storing individual
> variables with values corresponding to each field. This is avoidable
> overhead.
>
> With this patch, we store the offsets of the fields at construction
> time and perform the slice when a property is accessed.
>
> This appears to show a very marginal performance win on its own and
> the gains are so small as to not be worth reporting. However, this
> patch marks the end of our parsing refactor, so it is worth reporting
> the gains from the entire series:
>
> author(mpm)
> 0.896565
> 0.795987  89%
>
> desc(bug)
> 0.887169
> 0.803438  90%
>
> date(2015)
> 0.878797
> 0.773961  88%
>
> extra(rebase_source)
> 0.865446
> 0.761603  88%
>
> author(mpm) or author(greg)
> 1.801832
> 1.576025  87%
>
> author(mpm) or desc(bug)
> 1.812438
> 1.593335  88%
>
> date(2015) or branch(default)
> 0.968276
> 0.875270  90%
>
> author(mpm) or desc(bug) or date(2015) or extra(rebase_source)
> 3.656193
> 3.183104  87%
>
> Pretty consistent speed-up across the board for any revset accessing
> changelog revision data. Not bad!
>
> It's also worth noting that PyPy appears to experience a similar to
> marginally greater speed-up as well!
>
> According to statprof, revsets accessing changelog revision data
> are now clearly dominated by zlib decompression (16-17% of execution
> time). Surprisingly, it appears the most expensive part of revision
> parsing are the various text.index() calls to search for newlines!
> These appear to cumulatively add up to 5+% of execution time. I
> reckon implementing the parsing in C would make things marginally
> faster.
>
> If accessing larger strings (such as the commit message),
> encoding.tolocal() is the most expensive procedure outside of
> decompression.
>
> diff --git a/mercurial/changelog.py b/mercurial/changelog.py
> --- a/mercurial/changelog.py
> +++ b/mercurial/changelog.py
> @@ -146,21 +146,18 @@ class changelogrevision(object):
>      """Holds results of a parsed changelog revision.
>
>      Changelog revisions consist of multiple pieces of data, including
>      the manifest node, user, and date. This object exposes a view into
>      the parsed object.
>      """
>
>      __slots__ = (
> -        '_rawdateextra',
> -        '_rawdesc',
> -        '_rawfiles',
> -        '_rawmanifest',
> -        '_rawuser',
> +        '_offsets',
> +        '_text',
>      )
>
>      def __new__(cls, text):
>          if not text:
>              return _changelogrevision(
>                  manifest=nullid,
>                  user='',
>                  date=(0, 0),
> @@ -180,51 +177,51 @@ class changelogrevision(object):
>          #                 : extra is metadata, encoded and separated by
> '\0'
>          #                 : older versions ignore it
>          # files\n\n       : files modified by the cset, no \n or \r
> allowed
>          # (.*)            : comment (free text, ideally utf-8)
>          #
>          # changelog v0 doesn't use extra
>
>          nl1 = text.index('\n')
> -        self._rawmanifest = text[0:nl1]
> -
>          nl2 = text.index('\n', nl1 + 1)
> -        self._rawuser = text[nl1 + 1:nl2]
> -
>          nl3 = text.index('\n', nl2 + 1)
> -        self._rawdateextra = text[nl2 + 1:nl3]
>
>          # The list of files may be empty. Which means nl3 is the first of
> the
>          # double newline that precedes the description.
>          if text[nl3 + 1] == '\n':
> -            self._rawfiles = None
> -            self._rawdesc = text[nl3 + 2:]
> +            doublenl = nl3
>          else:
>              doublenl = text.index('\n\n', nl3 + 1)
> -            self._rawfiles = text[nl3 + 1:doublenl]
> -            self._rawdesc = text[doublenl + 2:]
> +
> +        self._offsets = (nl1, nl2, nl3, doublenl)
> +        self._text = text
>
>          return self
>
>      @property
>      def manifest(self):
> -        return bin(self._rawmanifest)
> +        return bin(self._text[0:self._offsets[0]])
>
>      @property
>      def user(self):
> -        return encoding.tolocal(self._rawuser)
> +        off = self._offsets
> +        return encoding.tolocal(self._text[off[0] + 1:off[1]])
>
>      @property
>      def _rawdate(self):
> -        return self._rawdateextra.split(' ', 2)[0:2]
> +        off = self._offsets
> +        dateextra = self._text[off[1] + 1:off[2]]
> +        return dateextra.split(' ', 2)[0:2]
>
>      @property
>      def _rawextra(self):
> -        fields = self._rawdateextra.split(' ', 2)
> +        off = self._offsets
> +        dateextra = self._text[off[1] + 1:off[2]]
> +        fields = dateextra.split(' ', 2)
>          if len(fields) != 3:
>              return None
>
>          return fields[2]
>
>      @property
>      def date(self):
>          raw = self._rawdate
> @@ -242,24 +239,25 @@ class changelogrevision(object):
>          raw = self._rawextra
>          if raw is None:
>              return _defaultextra
>
>          return decodeextra(raw)
>
>      @property
>      def files(self):
> -        if self._rawfiles is None:
> +        off = self._offsets
> +        if off[2] == off[3]:
>              return []
>
> -        return self._rawfiles.split('\n')
> +        return self._text[off[2] + 1:off[3]].split('\n')
>
>      @property
>      def description(self):
> -        return encoding.tolocal(self._rawdesc)
> +        return encoding.tolocal(self._text[self._offsets[3] + 2:])
>
>  class changelog(revlog.revlog):
>      def __init__(self, opener):
>          revlog.revlog.__init__(self, opener, "00changelog.i")
>          if self._initempty:
>              # changelogs don't benefit from generaldelta
>              self.version &= ~revlog.REVLOGGENERALDELTA
>              self._generaldelta = False
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.mercurial-scm.org/pipermail/mercurial-devel/attachments/20160306/c6366288/attachment.html>