Bug 2196 - convert/svn: abort: Interrupted system call when retrieving revision log
Summary: convert/svn: abort: Interrupted system call when retrieving revision log
Status: RESOLVED ARCHIVED
Alias: None
Product: Mercurial
Classification: Unclassified
Component: convert (show other bugs)
Version: unspecified
Hardware: All All
: normal bug
Assignee: Bugzilla
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-05-18 09:17 UTC by Aaron Digulla
Modified: 2015-02-13 01:01 UTC (History)
7 users (show)

See Also:
Python Version: ---


Attachments
(34 bytes, text/x-patch)
2010-05-20 02:05 UTC, Aaron Digulla
Details
(34 bytes, text/x-patch)
2010-05-20 04:24 UTC, Aaron Digulla
Details
(34 bytes, text/x-patch)
2010-08-02 05:23 UTC, Aaron Digulla
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Aaron Digulla 2010-05-18 09:17 UTC
When I run

hg --traceback convert -s svn http://svn.xxx.com/svn/... ~/hg/...

I get this traceback:


  File "/usr/lib/pymodules/python2.6/hgext/convert/convcmd.py", line 398, in
convert
    c.convert(sortmode)
  File "/usr/lib/pymodules/python2.6/hgext/convert/convcmd.py", line 312, in
convert
    parents = self.walktree(heads)
  File "/usr/lib/pymodules/python2.6/hgext/convert/convcmd.py", line 109, in
walktree
    commit = self.cachecommit(n)
  File "/usr/lib/pymodules/python2.6/hgext/convert/convcmd.py", line 267, in
cachecommit
    commit = self.source.getcommit(rev)
  File "/usr/lib/pymodules/python2.6/hgext/convert/subversion.py", line 433,
in getcommit
    self._fetch_revisions(revnum, stop)
  File "/usr/lib/pymodules/python2.6/hgext/convert/subversion.py", line 814,
in _fetch_revisions
    for entry in stream:
  File "/usr/lib/pymodules/python2.6/hgext/convert/subversion.py", line 122,
in __iter__
    entry = pickle.load(self._stdout)
IOError: [Errno 4] Interrupted system call
abort: Interrupted system call

Apparently, a signal is sent to Python which interrupts pickle.load(). Is
there something which Mercurial can do or do I have to file a bug report
with Python?

If you can't do anything about this, what signal could that be? SIGCHILD? Or
is there a timer running?
Comment 1 Patrick Mézard 2010-05-18 09:24 UTC
Could it be the child process retrieving svn logs is being killed because it
consumes too much memory or something like this?
Comment 2 Aaron Digulla 2010-05-18 09:41 UTC
Forgot the version info: hg version 1.5.2

OS: Linux xxx 2.6.32-22-generic-pae #33-Ubuntu SMP i686 GNU/Linux

Subversion: version 1.6.6
Comment 3 Aaron Digulla 2010-05-18 09:42 UTC
@pmezard: No, when I call "svn log" or the other operations which I see with
"-debug -v", then they proceed. ulimit also looks good:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 20
file size               (blocks, -f) unlimited
pending signals                 (-i) 16382
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Comment 4 Patrick Mézard 2010-05-18 09:54 UTC
Ok, then could it be the http/svn server closing the connection? Perhaps you
can measure how long it takes for the error to happen.

To explain a bit, before starting the revision conversion, convert extension
discovers the whole revision graph it has to convert. In svn case, it may
request large amount of changelog for many branches and your server may not
tolerate it. Also, because the svn python bindings where historically very
leaky, this operation is performed in a child process, killed once the log
is retrieved which probably explains the weird error message.

You can try passing --config convert.svn.startrev=1000 or something not too
large and see if it works better.
Comment 5 Patrick Mézard 2010-05-18 09:55 UTC
Making title more accurate
Comment 6 Aaron Digulla 2010-05-18 10:09 UTC
I had a strange failure which ended with "Error 200 OK" but that was because
of a corrupted Subversion repository which we fixed.

Now, I'm using convert to update this first clone so I only need about 10
revisions. There is no timeout, the processing happens very fast and then it
fails. time(1) says 1.246s. I doubt it's a memory issue. But killing the
child could be the problem. Can you tell me how to delay the killing until
after the pickle.load() returns None?
Comment 7 Aaron Digulla 2010-05-18 10:11 UTC
Oh, and it happens sometimes after "scanning source..." and before
"sorting..." and sometimes after "converting...". So it's not only "svn log"
but also the download of the revisions itself which fails.
Comment 8 Matt Mackall 2010-05-18 10:13 UTC
Can you run strace on it? 

strace -o trace.out -f <hg command>

That might give some hints.
Comment 9 Patrick Mézard 2010-05-18 10:23 UTC
And can you try to comment this line:

http://hg.intevation.org/mercurial/file/b5ef95b5ae14/hgext/convert/subversion.py#l100

An early termination issue was reported a long time ago and it seemed to fix it.
Comment 10 Aaron Digulla 2010-05-18 10:30 UTC
Sometimes, it fails after all the revisions are converted and sometimes, it
completes without error. Looks like a race condition.

strace makes it better but I got it to fail again ;-) I've cleaned the trace
(500KB). Should I add it or sent it to you directly?
Comment 11 Aaron Digulla 2010-05-18 10:34 UTC
Uncommenting exit(0) doesn't help.
Comment 12 Aaron Digulla 2010-05-18 10:47 UTC
What I see is a lot of calls to

execve("/usr/local/bin/hg", ["/usr/local/bin/hg", "debugsvnlog"],...)

What happens when this script throws an exception? Is such an exception
correctly pickled?
Comment 13 Aaron Digulla 2010-05-18 10:55 UTC
One step closer. I added "traceback.print_exc()" in "get_log_child()" in the
exception handler for IOError. I get this stack trace:

Traceback (most recent call last):
  File "/usr/local/lib/python2.6/dist-packages/hgext/convert/subversion.py",
line 90, in get_log_child
    receiver)
  File "/usr/lib/pymodules/python2.6/libsvn/ra.py", line 736, in svn_ra_get_log
    return apply(_ra.svn_ra_get_log, args)
  File "/usr/local/lib/python2.6/dist-packages/hgext/convert/subversion.py",
line 80, in receiver
    fp, protocol)
IOError: [Errno 32] Broken pipe

Like 80 is the call to "pickle.dump()" in the inner function "receiver()".
Comment 14 Matt Mackall 2010-05-18 12:33 UTC
Add it. Or at least the interesting bits.
Comment 15 Aaron Digulla 2010-05-19 07:30 UTC
Okay, I found a bug but that didn't solve the whole issue: Search for a line

for entry in self._getlog([self.module], from_revnum, to_revnum):

The returned stream is never closed. You must change this into:

stream = self._getlog([self.module], from_revnum, to_revnum)
try:
    for entry in stream:
    ...
finally:
    stream.close()

I'll now analyze the strace dump. I can't post this easily because it
contains internal data.
Comment 16 Patrick Mézard 2010-05-19 07:44 UTC
Ouch. Could you give us a "Name <email@domain>" for bug report/fix attribution?
Comment 17 Aaron Digulla 2010-05-19 08:09 UTC
You should check the main loop in _fetch_revisions(): It can stop before all
the output of the child has been read which causes the child to terminate on
SIGPIPE.

I'm Aaron Digulla <digulla@hepe.com>
Comment 18 Patrick Mézard 2010-05-19 14:38 UTC
I pushed a first fix for the missing .close() in gettags() before rereading
the issue again. So it's not solved but improving.

In crew: http://hg.intevation.org/mercurial/crew/rev/46bb49134498
Comment 19 Patrick Mézard 2010-05-19 14:47 UTC
About the SIGPIPE issue, it seems to be the expected behaviour and in theory
python ignores it (http://docs.python.org/library/signal.html).

Could you check this is the case on your system? Something like:

import signal
print signal.getsignal(signal.SIGPIPE)

should print '1' (signal.SIG_IGN) when run in get_log_child()

Otherwise, we may have to handle it explicitely.
Comment 20 HG Bot 2010-05-19 15:00 UTC
Fixed by http://hg.intevation.org/mercurial/crew/rev/46bb49134498
(convert/svn: close gettags() log stream (issue2196))
Comment 21 Aaron Digulla 2010-05-20 01:20 UTC
It does print '1' (one) in get_log_child() but operations like read(),
flush() and close() will still fail with an errno of EPIPE.

The problem is that the parent aborts with '[Errno 4] Interrupted system call'.

Right now, my guess is that this happens:

1. The parent starts to fetch a log
2. Child starts to send log entries
3. Parent has enough, breaks out of the loop in _fetch_revisions() and
closes the input stream.
4. Eventually the child tries to send more log entries which fails with
EPIPE. Child terminates.
5. Parent gets unexpected SIGCHILD while reading data from a different child.

Maybe the solution would be to terminate the child from the parent instead
of killing it implicitly by closing the input stream.
Comment 22 Aaron Digulla 2010-05-20 01:38 UTC
Okay, I have a solution:

Instead of returning the the streams in util.popen2/3(), you *must* return
the process instance.

Then, in logstream.close() and similar places, you *must* call
process.wait() so the SIGCHILD signal is properly processed in the parent
process.
Comment 23 Aaron Digulla 2010-05-20 02:05 UTC
Here is a patch against 1.5.2 which works around the issue for people who
can't wait for 1.5.4.

It contains a lot of debug output which I left in because I'm running out of
time. Clean up at your leisure :-)
Comment 24 Aaron Digulla 2010-05-20 03:11 UTC
Damn ... that helps but still doesn't always solve the issue :-(
Comment 25 Aaron Digulla 2010-05-20 04:24 UTC
I tried to write a version of logstream.__iter__ which reads the characters
from the child individually. This works ... it's not even very slow. But the
code is pretty hideous.
Comment 26 Patrick Mézard 2010-05-20 05:01 UTC
What python version do you have?

I am suprised this is the first time someone raise this issue, so I'd like
to know why.
Comment 27 Aaron Digulla 2010-05-20 05:19 UTC
I'm running Python 2.6.5. Maybe it's because my computer and the lan is so
fast that sometimes, child processes terminate before the parent can read
everything and sometimes afterwards.
Comment 28 Aaron Digulla 2010-08-02 05:23 UTC
Updated patch for convert for Mercurial 1.6
Comment 29 HG Bot 2011-07-28 14:00 UTC
See http://selenic.com/repo/hg/rev/5a0fdc715769
Augie Fackler <durin42@gmail.com>
check-code: disallow use of hasattr()
Comment 30 Bugzilla 2012-05-12 09:09 UTC

--- Bug imported by bugzilla@serpentine.com 2012-05-12 09:09 EDT  ---

This bug was previously known as _bug_ 2196 at http://mercurial.selenic.com/bts/issue2196
Imported an attachment (id=1384)
Imported an attachment (id=1385)
Imported an attachment (id=1386)
Comment 31 devurandom 2013-05-13 08:21 UTC
I think I encounter the same issue when using hgsubversion and mercurial 2.6:
---
[rM] ...
pulled N revisions
updating to branch default
abort: Interrupted system call
---
At other times the error is:
abort: Error
---

Mercurial gives no further information.

hgsubversion: f0bde4bf1f78
Subversion: 1.7.6
bindings: Subvertpy 0.9.0
Comment 32 Bugzilla 2015-02-13 01:01 UTC
Bug was inactive for 640 days, archiving