[PATCH] worker: ensure a posix worker exits with os._exit

Yuya Nishihara yuya at tcha.org
Tue Aug 9 10:45:37 EDT 2016


On Tue, 9 Aug 2016 00:20:47 +0100, Jun Wu wrote:
> # HG changeset patch
> # User Jun Wu <quark at fb.com>
> # Date 1470697491 -3600
> #      Tue Aug 09 00:04:51 2016 +0100
> # Node ID 6c1ce8f56180337cce7d67a77a71c41a4207b6b4
> # Parent  3dbc95f3eb31874ab4633f936acff4609714dc41
> # Available At https://bitbucket.org/quark-zju/hg-draft
> #              hg pull https://bitbucket.org/quark-zju/hg-draft -r 6c1ce8f56180
> worker: ensure a posix worker exits with os._exit
> 
> Like commandserver, the worker should never run other resource cleanup logic.
> 
> Previously this is not true for workers if they have exceptions other than
> KeyboardInterrupt.
> 
> This actually causes a deadlock with remotefilelog in production:
> 
> 1. remotefilelog/fileserverclient creates a sshpeer. pipei/o/e get created.
> 2. worker inherits that sshpeer's pipei/o/e.
> 3. worker runs sshpeer.cleanup (only happens without os._exit)
> 4. worker closes pipeo/i, which will normally make the sshpeer read EOF from
>    its stdin and exit. But the master process still have pipeo, so no EOF.
> 5. worker reads pipee (stderr of sshpeer), which never completes because
>    the ssh process does not exit, does not close its stderr.
> 6. master waits for all workers, which never completes because they never
>    complete sshpeer.cleanup.
> 
> This could also be addressed by closing these fds after fork, which is not
> easy because Python 2.x does not have an official "afterfork" hook. Hacking
> os.fork is also ugly. Besides, sshpeer is probably not the only troublemarker.
> 
> The patch changes _posixworker so all its code paths will use os._exit to
> avoid running unwanted resource clean-ups.
> 
> diff --git a/mercurial/worker.py b/mercurial/worker.py
> --- a/mercurial/worker.py
> +++ b/mercurial/worker.py
> @@ -97,8 +97,11 @@ def _posixworker(ui, func, staticargs, a
>                  os._exit(0)
>              except KeyboardInterrupt:
>                  os._exit(255)
> -                # other exceptions are allowed to propagate, we rely
> -                # on lock.py's pid checks to avoid release callbacks
> +            except: # never return, hence no re-raises
> +                try:
> +                    ui.traceback()
> +                finally:
> +                    os._exit(255)

I agree we should call _exit(), but the problem wouldn't be that simple. My
understanding is that worker has to propagate an exception so that it will
be caught at dispatch._runcatch() and mapped to an appropriate warning and
exit code.


More information about the Mercurial-devel mailing list