[PATCH] worker: ensure a posix worker exits with os._exit
Yuya Nishihara
yuya at tcha.org
Tue Aug 9 10:45:37 EDT 2016
On Tue, 9 Aug 2016 00:20:47 +0100, Jun Wu wrote:
> # HG changeset patch
> # User Jun Wu <quark at fb.com>
> # Date 1470697491 -3600
> # Tue Aug 09 00:04:51 2016 +0100
> # Node ID 6c1ce8f56180337cce7d67a77a71c41a4207b6b4
> # Parent 3dbc95f3eb31874ab4633f936acff4609714dc41
> # Available At https://bitbucket.org/quark-zju/hg-draft
> # hg pull https://bitbucket.org/quark-zju/hg-draft -r 6c1ce8f56180
> worker: ensure a posix worker exits with os._exit
>
> Like commandserver, the worker should never run other resource cleanup logic.
>
> Previously this is not true for workers if they have exceptions other than
> KeyboardInterrupt.
>
> This actually causes a deadlock with remotefilelog in production:
>
> 1. remotefilelog/fileserverclient creates a sshpeer. pipei/o/e get created.
> 2. worker inherits that sshpeer's pipei/o/e.
> 3. worker runs sshpeer.cleanup (only happens without os._exit)
> 4. worker closes pipeo/i, which will normally make the sshpeer read EOF from
> its stdin and exit. But the master process still have pipeo, so no EOF.
> 5. worker reads pipee (stderr of sshpeer), which never completes because
> the ssh process does not exit, does not close its stderr.
> 6. master waits for all workers, which never completes because they never
> complete sshpeer.cleanup.
>
> This could also be addressed by closing these fds after fork, which is not
> easy because Python 2.x does not have an official "afterfork" hook. Hacking
> os.fork is also ugly. Besides, sshpeer is probably not the only troublemarker.
>
> The patch changes _posixworker so all its code paths will use os._exit to
> avoid running unwanted resource clean-ups.
>
> diff --git a/mercurial/worker.py b/mercurial/worker.py
> --- a/mercurial/worker.py
> +++ b/mercurial/worker.py
> @@ -97,8 +97,11 @@ def _posixworker(ui, func, staticargs, a
> os._exit(0)
> except KeyboardInterrupt:
> os._exit(255)
> - # other exceptions are allowed to propagate, we rely
> - # on lock.py's pid checks to avoid release callbacks
> + except: # never return, hence no re-raises
> + try:
> + ui.traceback()
> + finally:
> + os._exit(255)
I agree we should call _exit(), but the problem wouldn't be that simple. My
understanding is that worker has to propagate an exception so that it will
be caught at dispatch._runcatch() and mapped to an appropriate warning and
exit code.
More information about the Mercurial-devel
mailing list