[PATCH] util: improve iterfile so it impacts little on performance

Jun Wu quark at fb.com
Tue Nov 15 16:04:13 UTC 2016


# HG changeset patch
# User Jun Wu <quark at fb.com>
# Date 1479225350 0
#      Tue Nov 15 15:55:50 2016 +0000
# Node ID 3cd2e9873bc1d565300b629e72100800075d12bb
# Parent  d1a0a64f6e16432333bea0476098c46a61222b9b
# Available At https://bitbucket.org/quark-zju/hg-draft
#              hg pull https://bitbucket.org/quark-zju/hg-draft -r 3cd2e9873bc1
util: improve iterfile so it impacts little on performance

We have performance concerns about "iterfile" because it is 4X slower on
normal files. Modern systems, however, have the nice property that reading a
"fast" (on-disk) file cannot be interrupted, and we should make use of that.
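
For anyone who wants to reproduce the comparison, a rough timing sketch
follows. The helper names (make_sample, count_native, count_readline) and the
sample size are made up for illustration and are not part of the patch. The
4X figure above refers to CPython 2; the ratio on other interpreters may well
differ, so treat the output as indicative only.

```python
import os
import tempfile
import timeit


def make_sample(nlines=20000):
    """Write a throwaway regular file (hypothetical helper) and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, 'wb') as fp:
        fp.write(b'some line of text\n' * nlines)
    return path


def count_native(path):
    # Native file iteration: what the fast path amounts to.
    with open(path, 'rb') as fp:
        return sum(1 for _line in fp)


def count_readline(path):
    # iter(callable, sentinel) calls fp.readline() until it returns b''.
    # This is the slower, EINTR-safe form used by the old iterfile.
    with open(path, 'rb') as fp:
        return sum(1 for _line in iter(fp.readline, b''))


if __name__ == '__main__':
    path = make_sample()
    try:
        for fn in (count_native, count_readline):
            secs = timeit.timeit(lambda: fn(path), number=20)
            print('%-15s %.3fs' % (fn.__name__, secs))
    finally:
        os.unlink(path)
```

Both functions read the same lines; only the per-line overhead differs.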

This patch documents the related knowledge in comments and tries to minimize
the performance impact: it only uses the slower but safer approach for
non-normal files. It gives up for Python < 2.7.4 because the slower approach
does not make a difference in terms of safety there. It also avoids the
workaround for Python >= 3 and PyPy, which do not have the EINTR issue.
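
As an illustration of the "normal file" test described above, here is a small
standalone sketch. The function name is_regular is hypothetical; the fstat and
S_ISREG calls and the fallback to the fast path mirror the diff below. It is
only meant to show how an on-disk file, a pipe, and an object without a file
descriptor are classified.

```python
import os
import stat
import tempfile


def is_regular(fp):
    """Return True if fp is backed by a regular (S_ISREG) file.

    Objects without a usable file descriptor count as regular, mirroring
    the patch's default of fastpath = True.
    """
    try:
        return stat.S_ISREG(os.fstat(fp.fileno()).st_mode)
    except (AttributeError, OSError):  # no fileno, or stat fails
        return True


if __name__ == '__main__':
    # An on-disk temporary file is a regular file: fast path.
    with tempfile.TemporaryFile() as regular:
        print('temp file: %s' % is_regular(regular))

    # A pipe is not a regular file: slower readline-based path.
    rfd, wfd = os.pipe()
    with os.fdopen(rfd, 'rb') as rpipe, os.fdopen(wfd, 'wb') as wpipe:
        print('pipe:      %s' % is_regular(rpipe))
```

Sockets and ttys would be classified like the pipe, which is where the EINTR
issue can actually show up.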

diff --git a/mercurial/util.py b/mercurial/util.py
--- a/mercurial/util.py
+++ b/mercurial/util.py
@@ -25,8 +25,10 @@ import hashlib
 import imp
 import os
+import platform as pyplatform
 import re as remod
 import shutil
 import signal
 import socket
+import stat
 import string
 import subprocess
@@ -2191,8 +2193,31 @@ def wrap(line, width, initindent='', han
     return wrapper.fill(line).encode(encoding.encoding)
 
-def iterfile(fp):
-    """like fp.__iter__ but does not have issues with EINTR. Python 2.7.12 is
-    known to have such issues."""
-    return iter(fp.readline, '')
+if (pyplatform.python_implementation() == 'CPython' and
+    sys.version_info <= (3, 0) and sys.version_info >= (2, 7, 4)):
+    # There is an issue with CPython 2 that file.__iter__ does not handle EINTR
+    # correctly. CPython <= 2.7.12 is known to have the issue.
+    # In CPython >= 2.7.4, file.read, file.readline etc. deal with EINTR
+    # correctly so we can use the workaround below. However the workaround is
+    # about 4X slower than the native iterator because the latter does
+    # readahead caching in CPython layer.
+    # On modern systems like Linux, the "read" syscall cannot be interrupted
+    # for reading "fast" files like on-disk files. So the EINTR issue only
+    # affects things like pipes, sockets, ttys etc. We treat "normal" (S_ISREG)
+    # files approximately as "fast" files and use the fast (unsafe) code path.
+    def iterfile(fp):
+        fastpath = True
+        try:
+            fastpath = stat.S_ISREG(os.fstat(fp.fileno()).st_mode)
+        except (AttributeError, OSError): # no fileno, or stat fails
+            pass
+        if fastpath:
+            return fp
+        else:
+            return iter(fp.readline, '')
+else:
+    # For CPython < 2.7.4, the workaround wouldn't make things better.
+    # PyPy and CPython 3 do not have the EINTR issue thus no workaround needed.
+    def iterfile(fp):
+        return fp
 
 def iterlines(iterator):

