Startup time is regressing

Wed Nov 10 03:32:43 CST 2010

FUJIWARA Katsunori <fujiwara at ascade.co.jp> writes:

> I confirmed that defining the subclass of textwrap.TextWrapper causes
> the regression, as Martin says.
>
>> I just looked at the textwrap module, and the TextWrapper class is
>> fairly short -- lots of comments but little code. I suggest we borrow
>> the code, fix it to handle wide characters and take out the unneeded
>> parts such as handling double-space after periods. Then put the whole
>> thing in i18n.
>
> I also confirmed that both ways shown below can prevent startup time
> from regressing.
>
>     1. move the subclass definition from util.py to another file
>
>     2. define the subclass on demand, like:
>
>         ====================
>         MBTextWrapper = None
>
>         def wrap(...):
>
>             global MBTextWrapper
>             if not MBTextWrapper:
>                 class MBTextWrapper(textwrap.TextWrapper):
>                       ....
>
>             wrapper = MBTextWrapper()
>
>         ====================
>
> I think (1) is better for its simplicity.

Yes, I agree with using (1).

I just tried moving the MBTextWrapper class to i18n and copying the code
from textwrap.TextWrapper into i18n too. I then combined the two classes
so that only a single i18n.MBTextWrapper class is left.

I could not see any significant performance difference -- this is
without the patch:

% HGRCPATH= hg --config extensions.perf=contrib/perf.py perfstartup
! wall 0.025350 comb 0.000000 user 0.000000 sys 0.000000 (best of 118)

and this is with the patch:

% HGRCPATH= hg --config extensions.perf=contrib/perf.py perfstartup
! wall 0.024969 comb 0.000000 user 0.000000 sys 0.000000 (best of 120)

So on my machine (Core i7, normal hard disk) the speedup is about 0.4
milliseconds. The patch is below, so you can see if I've made a silly
mistake :)


# HG changeset patch
# User Martin Geisler <mg at aragost.com>
# Date 1289381429 -3600
# Node ID 59a4357e84875abce8cd8e3b6c71fc0f1795dff6
# Parent  9f2ac318b92e3bde06108bf8493b7d71219ad13e
imported patch startup

diff --git a/mercurial/i18n.py b/mercurial/i18n.py
--- a/mercurial/i18n.py
+++ b/mercurial/i18n.py
@@ -6,7 +6,286 @@
 # GNU General Public License version 2 or any later version.
 
 import encoding
-import gettext, sys, os
+import gettext, sys, os, string, re, unicodedata
+
+#### naming convention of below implementation follows 'textwrap' module
+
+# Hardcode the recognized whitespace characters to the US-ASCII
+# whitespace characters.  The main reason for doing this is that in
+# ISO-8859-1, 0xa0 is non-breaking whitespace, so in certain locales
+# that character winds up in string.whitespace.  Respecting
+# string.whitespace in those cases would 1) make textwrap treat 0xa0 the
+# same as any other whitespace char, which is clearly wrong (it's a
+# *non-breaking* space), 2) possibly cause problems with Unicode,
+# since 0xa0 is not in range(128).
+_whitespace = '\t\n\x0b\x0c\r '
+
+class MBTextWrapper:
+    """
+    Object for wrapping/filling text.  The public interface consists of
+    the wrap() and fill() methods; the other methods are just there for
+    subclasses to override in order to tweak the default behaviour.
+    If you want to completely replace the main wrapping algorithm,
+    you'll probably have to override _wrap_chunks().
+
+    Several instance attributes control various aspects of wrapping:
+      width (default: 70)
+        the maximum width of wrapped lines (unless break_long_words
+        is false)
+      initial_indent (default: "")
+        string that will be prepended to the first line of wrapped
+        output.  Counts towards the line's width.
+      subsequent_indent (default: "")
+        string that will be prepended to all lines save the first
+        of wrapped output; also counts towards each line's width.
+      expand_tabs (default: true)
+        Expand tabs in input text to spaces before further processing.
+        Each tab will become 1 .. 8 spaces, depending on its position in
+        its line.  If false, each tab is treated as a single character.
+      replace_whitespace (default: true)
+        Replace all whitespace characters in the input text by spaces
+        after tab expansion.  Note that if expand_tabs is false and
+        replace_whitespace is true, every tab will be converted to a
+        single space!
+      break_long_words (default: true)
+        Break words longer than 'width'.  If false, those words will not
+        be broken, and some lines might be longer than 'width'.
+      break_on_hyphens (default: true)
+        Allow breaking hyphenated words. If true, wrapping will occur
+        preferably on whitespaces and right after hyphens part of
+        compound words.
+      drop_whitespace (default: true)
+        Drop leading and trailing whitespace from lines.
+    """
+
+    whitespace_trans = string.maketrans(_whitespace, ' ' * len(_whitespace))
+
+    unicode_whitespace_trans = {}
+    uspace = ord(u' ')
+    for x in map(ord, _whitespace):
+        unicode_whitespace_trans[x] = uspace
+
+    # This funky little regex is just the trick for splitting
+    # text up into word-wrappable chunks.  E.g.
+    #   "Hello there -- you goof-ball, use the -b option!"
+    # splits into
+    #   Hello/ /there/ /--/ /you/ /goof-/ball,/ /use/ /the/ /-b/ /option!
+    # (after stripping out empty strings).
+    wordsep_re = re.compile(
+        r'(\s+|'                                  # any whitespace
+        r'[^\s\w]*\w+[^0-9\W]-(?=\w+[^0-9\W])|'   # hyphenated words
+        r'(?<=[\w\!\"\'\&\.\,\?])-{2,}(?=\w))')   # em-dash
+
+    # This less funky little regex just splits on recognized spaces. E.g.
+    #   "Hello there -- you goof-ball, use the -b option!"
+    # splits into
+    #   Hello/ /there/ /--/ /you/ /goof-ball,/ /use/ /the/ /-b/ /option!
+    wordsep_simple_re = re.compile(r'(\s+)')
+
+    # XXX this is not locale- or charset-aware -- string.lowercase
+    # is US-ASCII only (and therefore English-only)
+    sentence_end_re = re.compile(r'[%s]'              # lowercase letter
+                                 r'[\.\!\?]'          # sentence-ending punct.
+                                 r'[\"\']?'           # optional end-of-quote
+                                 r'\Z'                # end of chunk
+                                 % string.lowercase)
+
+
+    def __init__(self,
+                 width=70,
+                 initial_indent="",
+                 subsequent_indent="",
+                 expand_tabs=True,
+                 replace_whitespace=True,
+                 break_long_words=True,
+                 drop_whitespace=True,
+                 break_on_hyphens=True):
+        self.width = width
+        self.initial_indent = initial_indent
+        self.subsequent_indent = subsequent_indent
+        self.expand_tabs = expand_tabs
+        self.replace_whitespace = replace_whitespace
+        self.break_long_words = break_long_words
+        self.drop_whitespace = drop_whitespace
+        self.break_on_hyphens = break_on_hyphens
+
+        # recompile the regexes for Unicode mode -- done in this clumsy way for
+        # backwards compatibility because it's rather common to monkey-patch
+        # the TextWrapper class' wordsep_re attribute.
+        self.wordsep_re_uni = re.compile(self.wordsep_re.pattern, re.U)
+        self.wordsep_simple_re_uni = re.compile(
+            self.wordsep_simple_re.pattern, re.U)
+
+
+    # -- Private methods -----------------------------------------------
+    # (possibly useful for subclasses to override)
+
+    def _munge_whitespace(self, text):
+        """_munge_whitespace(text : string) -> string
+
+        Munge whitespace in text: expand tabs and convert all other
+        whitespace characters to spaces.  Eg. " foo\tbar\n\nbaz"
+        becomes " foo    bar  baz".
+        """
+        if self.expand_tabs:
+            text = text.expandtabs()
+        if self.replace_whitespace:
+            if isinstance(text, str):
+                text = text.translate(self.whitespace_trans)
+            elif isinstance(text, unicode):
+                text = text.translate(self.unicode_whitespace_trans)
+        return text
+
+
+    def _split(self, text):
+        """_split(text : string) -> [string]
+
+        Split the text to wrap into indivisible chunks.  Chunks are
+        not quite the same as words; see _wrap_chunks() for full
+        details.  As an example, the text
+          Look, goof-ball -- use the -b option!
+        breaks into the following chunks:
+          'Look,', ' ', 'goof-', 'ball', ' ', '--', ' ',
+          'use', ' ', 'the', ' ', '-b', ' ', 'option!'
+        if break_on_hyphens is True, or in:
+          'Look,', ' ', 'goof-ball', ' ', '--', ' ',
+          'use', ' ', 'the', ' ', '-b', ' ', 'option!'
+        otherwise.
+        """
+        if isinstance(text, unicode):
+            if self.break_on_hyphens:
+                pat = self.wordsep_re_uni
+            else:
+                pat = self.wordsep_simple_re_uni
+        else:
+            if self.break_on_hyphens:
+                pat = self.wordsep_re
+            else:
+                pat = self.wordsep_simple_re
+        chunks = pat.split(text)
+        chunks = filter(None, chunks)  # remove empty chunks
+        return chunks
+
+
+    def _cutdown(self, str, space_left):
+        l = 0
+        ucstr = unicode(str, encoding.encoding)
+        colwidth = unicodedata.east_asian_width
+        for i in xrange(len(ucstr)):
+            l += colwidth(ucstr[i]) in 'WFA' and 2 or 1
+            if space_left < l:
+                return (ucstr[:i].encode(encoding.encoding),
+                        ucstr[i:].encode(encoding.encoding))
+        return str, ''
+
+    def _handle_long_word(self, reversed_chunks, cur_line, cur_len, width):
+        space_left = max(width - cur_len, 1)
+
+        if self.break_long_words:
+            cut, res = self._cutdown(reversed_chunks[-1], space_left)
+            cur_line.append(cut)
+            reversed_chunks[-1] = res
+        elif not cur_line:
+            cur_line.append(reversed_chunks.pop())
+
+    def _wrap_chunks(self, chunks):
+        """_wrap_chunks(chunks : [string]) -> [string]
+
+        Wrap a sequence of text chunks and return a list of lines of
+        length 'self.width' or less.  (If 'break_long_words' is false,
+        some lines may be longer than this.)  Chunks correspond roughly
+        to words and the whitespace between them: each chunk is
+        indivisible (modulo 'break_long_words'), but a line break can
+        come between any two chunks.  Chunks should not have internal
+        whitespace; ie. a chunk is either all whitespace or a "word".
+        Whitespace chunks will be removed from the beginning and end of
+        lines, but apart from that whitespace is preserved.
+        """
+        lines = []
+        if self.width <= 0:
+            raise ValueError("invalid width %r (must be > 0)" % self.width)
+
+        # Arrange in reverse order so items can be efficiently popped
+        # from a stack of chunks.
+        chunks.reverse()
+
+        while chunks:
+
+            # Start the list of chunks that will make up the current line.
+            # cur_len is just the length of all the chunks in cur_line.
+            cur_line = []
+            cur_len = 0
+
+            # Figure out which static string will prefix this line.
+            if lines:
+                indent = self.subsequent_indent
+            else:
+                indent = self.initial_indent
+
+            # Maximum width for this line.
+            width = self.width - len(indent)
+
+            # First chunk on line is whitespace -- drop it, unless this
+            # is the very beginning of the text (ie. no lines started yet).
+            if self.drop_whitespace and chunks[-1].strip() == '' and lines:
+                del chunks[-1]
+
+            while chunks:
+                l = len(chunks[-1])
+
+                # Can at least squeeze this chunk onto the current line.
+                if cur_len + l <= width:
+                    cur_line.append(chunks.pop())
+                    cur_len += l
+
+                # Nope, this line is full.
+                else:
+                    break
+
+            # The current line is full, and the next chunk is too big to
+            # fit on *any* line (not just this one).
+            if chunks and len(chunks[-1]) > width:
+                self._handle_long_word(chunks, cur_line, cur_len, width)
+
+            # If the last chunk on this line is all whitespace, drop it.
+            if self.drop_whitespace and cur_line and cur_line[-1].strip() == '':
+                del cur_line[-1]
+
+            # Convert current line back to a string and store it in list
+            # of all lines (return value).
+            if cur_line:
+                lines.append(indent + ''.join(cur_line))
+
+        return lines
+
+
+    # -- Public interface ----------------------------------------------
+
+    def wrap(self, text):
+        """wrap(text : string) -> [string]
+
+        Reformat the single paragraph in 'text' so it fits in lines of
+        no more than 'self.width' columns, and return a list of wrapped
+        lines.  Tabs in 'text' are expanded with string.expandtabs(),
+        and all other whitespace characters (including newline) are
+        converted to space.
+        """
+        text = self._munge_whitespace(text)
+        chunks = self._split(text)
+        return self._wrap_chunks(chunks)
+
+    def fill(self, text):
+        """fill(text : string) -> string
+
+        Reformat the single paragraph in 'text' to fit in lines of no
+        more than 'self.width' columns, and return a new string
+        containing the entire wrapped paragraph.
+        """
+        return "\n".join(self.wrap(text))
+
+
+#### naming convention of above implementation follows 'textwrap' module
+
 
 # modelled after templater.templatepath:
 if hasattr(sys, 'frozen'):
diff --git a/mercurial/util.py b/mercurial/util.py
--- a/mercurial/util.py
+++ b/mercurial/util.py
@@ -13,10 +13,10 @@
 hide platform-specific details from the core.
 """
 
-from i18n import _
+from i18n import _, MBTextWrapper
 import error, osutil, encoding
 import errno, re, shutil, sys, tempfile, traceback
-import os, stat, time, calendar, textwrap, unicodedata, signal
+import os, stat, time, calendar, unicodedata, signal
 import imp, socket
 
 # Python compatibility
@@ -1325,49 +1325,6 @@
     # Avoid double backslash in Windows path repr()
     return repr(s).replace('\\\\', '\\')
 
-#### naming convention of below implementation follows 'textwrap' module
-
-class MBTextWrapper(textwrap.TextWrapper):
-    """
-    Extend TextWrapper for double-width characters.
-
-    Some Asian characters use two terminal columns instead of one.
-    A good example of this behavior can be seen with u'\u65e5\u672c',
-    the two Japanese characters for "Japan":
-    len() returns 2, but when printed to a terminal, they eat 4 columns.
-
-    (Note that this has nothing to do whatsoever with unicode
-    representation, or encoding of the underlying string)
-    """
-    def __init__(self, **kwargs):
-        textwrap.TextWrapper.__init__(self, **kwargs)
-
-    def _cutdown(self, str, space_left):
-        l = 0
-        ucstr = unicode(str, encoding.encoding)
-        colwidth = unicodedata.east_asian_width
-        for i in xrange(len(ucstr)):
-            l += colwidth(ucstr[i]) in 'WFA' and 2 or 1
-            if space_left < l:
-                return (ucstr[:i].encode(encoding.encoding),
-                        ucstr[i:].encode(encoding.encoding))
-        return str, ''
-
-    # ----------------------------------------
-    # overriding of base class
-
-    def _handle_long_word(self, reversed_chunks, cur_line, cur_len, width):
-        space_left = max(width - cur_len, 1)
-
-        if self.break_long_words:
-            cut, res = self._cutdown(reversed_chunks[-1], space_left)
-            cur_line.append(cut)
-            reversed_chunks[-1] = res
-        elif not cur_line:
-            cur_line.append(reversed_chunks.pop())
-
-#### naming convention of above implementation follows 'textwrap' module
-
 def wrap(line, width, initindent='', hangindent=''):
     maxindent = max(len(hangindent), len(initindent))
     if width <= maxindent:

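The wide-character part of the patch is the _cutdown method. Its idea
can be sketched in isolation as below. This is a Python 3 rendering that
works on text strings directly, so the decode/encode round trip through
encoding.encoding is left out; the standalone function name cutdown is
only for illustration.

```python
import unicodedata


def cutdown(text, space_left):
    """Split text so the first part fits in space_left terminal columns."""
    used = 0
    for i, ch in enumerate(text):
        # Wide (W), fullwidth (F) and ambiguous (A) characters are counted
        # as two columns, everything else as one, as in the patch.
        used += 2 if unicodedata.east_asian_width(ch) in 'WFA' else 1
        if space_left < used:
            return text[:i], text[i:]
    return text, ''
```

With four columns available, cutdown(u'\u65e5\u672c\u8a9e', 4) keeps the
first two characters and returns the third as the remainder, whereas a
len()-based cut would have kept all three.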

-- 
Martin Geisler

aragost Trifork
Professional Mercurial support
http://aragost.com/mercurial/