[PATCH 6 of 8 V5] worker: make sure killworkers runs at most once

Yuya Nishihara yuya at tcha.org
Wed Nov 16 07:25:45 EST 2016


On Tue, 15 Nov 2016 22:28:46 +0000, Jun Wu wrote:
> Excerpts from Yuya Nishihara's message of 2016-11-15 23:02:35 +0900:
> > On Tue, 15 Nov 2016 02:39:09 +0000, Jun Wu wrote:
> > > # HG changeset patch
> > > # User Jun Wu <quark at fb.com>
> > > # Date 1479176697 0
> > > #      Tue Nov 15 02:24:57 2016 +0000
> > > # Node ID 8402c91c250a9dd369296dcdf00f7b50110ff6ae
> > > # Parent  9dbb3532b173a980f341e41d9e96338a386364e5
> > > # Available At https://bitbucket.org/quark-zju/hg-draft 
> > > #              hg pull https://bitbucket.org/quark-zju/hg-draft  -r 8402c91c250a
> > > worker: make sure killworkers runs at most once
> > 
> > > @@ -90,4 +91,5 @@ def _posixworker(ui, func, staticargs, a
> > >      signal.signal(signal.SIGINT, signal.SIG_IGN)
> > >      pids, problem = set(), [0]
> > > +    problemcount = itertools.count()
> > >      def killworkers():
> > >          # if one worker bails, there's no good reason to wait for the rest
> > > @@ -114,5 +116,8 @@ def _posixworker(ui, func, staticargs, a
> > >                  pids.remove(p)
> > >                  st = _exitstatus(st)
> > > -            if st and not problem[0]:
> > > +            # next(itertools.count()) cannot be interrupted by a Python signal
> > > +            # handler - true for both CPython and PyPy. So killworkers() runs
> > > +            # at most once.
> > > +            if st and next(problemcount) == 0:
> > 
> > I'm not sure what this atomic counter buys us. Since we'll remove threading and
> > unregister SIGCHLD handler before killworkers(), the only concern would be
> > whether pids are collected by waitpid() or not. If waitforworkers() is
> > interrupted between os.waitpid() and pids.remove(), the waited pid could be
> > killed again. But this patch doesn't prevent it.
> 
> I agree that the only important thing is whether processes are waited or
> not.
> 
> And I was aware of the situation you described (so I used the words "making
> it harder" instead of "making it impossible" in patch 7).
> 
> However, this patch prevents killing the "unwaited" processes multiple
> times, which is not that important but is "nice-to-have", I think.
> 
> Up to you for decision.
> 
> >   two workers; the latter fails:
> >   # pids = [2, 3]
> >   SIGCHLD-2
> >     waitpid(2) -> p = 2, st = ok
> >     SIGCHLD-3
> >       waitpid(2) -> ECHLD
> >       waitpid(3) -> p = 3, st = error
> >       pids.remove(3)
> >       # pids = [2]
> >       killworkers()
> >     pids.remove(2)
> > 
> >   two workers; both fail:
> >   # pids = [2, 3]
> 
> If pids = [2, 3, 4, 5, 6, 7, 8]. killworkers may kill 4 to 8 twice without
> this patch.

Yes, but killing active or zombie processes twice would be harmless. And I
don't think it happens often enough to justify the special care of an atomic
counter.
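
(For reference, the once-only guard from the patch can be exercised in
isolation. This is a minimal sketch of the counter idiom, not the actual
worker code; the callback and list are illustrative only:)

```python
import itertools

# The guard from the patch in isolation: next() on itertools.count() is a
# single C-level operation on CPython (and atomic on PyPy too, per the
# patch comment), so a signal handler cannot observe a half-done increment.
problemcount = itertools.count()
killed = []

def on_child_failure():
    # only the very first failing child triggers the kill path
    if next(problemcount) == 0:
        killed.append("killworkers")

for _ in range(5):           # five SIGCHLD deliveries all reporting failure
    on_child_failure()
```

Every caller after the first sees a nonzero value, so the kill path runs at
most once no matter how many handlers report a failure.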

> >   SIGCHLD-2
> >     waitpid(2) -> p = 2, st = error
> >     SIGCHLD-3
> >       waitpid(2) -> ECHLD
> >       waitpid(3) -> p = 3, st = error
> >       pids.remove(3)
> >       # pids = [2]
> >       killworkers()
> >     pids.remove(2)
> >     # pids = []
> >     killworkers()
> > 
> > Maybe we can discard pid if waitpid() fails with ECHLD?
> 
> If ECHILD happens, it means the child was already waited on by another
> "waitforworkers", which is responsible for doing "pids.remove(p)". Discarding
> the pid when ECHILD happens would cause one of them to raise KeyError.

I meant set.discard(). My idea is:

 a. discard pid by anyone who notices it no longer exists
 b. then, kill the rest

something like this:

  for pid in pids.copy():
      p, st = waitpid(pid, WNOHANG)
      if ECHILD:
          pids.discard(pid)
      elif p:
          pids.discard(p)
      if st:
          problem[0] = st
  if problem[0]:
      restore_signal()
      killworkers()

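
A rough runnable version of the sketch above, assuming a POSIX platform
(`restore_signal()` and `killworkers()` are elided, and a blocking waitpid
is used instead of WNOHANG so the example is deterministic):

```python
import errno
import os

# Fork two children that exit immediately; pretend a concurrent
# waitforworkers() already reaped one of them, then run the
# "discard on ECHILD" loop.
pids = set()
for _ in range(2):
    pid = os.fork()
    if pid == 0:
        os._exit(0)          # child: exit cleanly right away
    pids.add(pid)

reaped = next(iter(pids))
os.waitpid(reaped, 0)        # simulates the other handler reaping it

problem = [0]
for pid in pids.copy():
    try:
        # blocking wait for determinism; the real loop would use WNOHANG
        p, st = os.waitpid(pid, 0)
    except OSError as err:
        if err.errno != errno.ECHILD:
            raise
        pids.discard(pid)    # reaped elsewhere: forget it, never kill it
        continue
    if p:
        pids.discard(p)
    if st:
        problem[0] = st
```

After the loop, every pid has been either waited on or discarded, so a
subsequent killworkers() would only ever signal processes we still own.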
