Transient Windows test failures

Mon Jul 10 03:15:01 UTC 2017

On Mon, 19 Jun 2017 11:30:28 -0400, Yuya Nishihara <yuya at tcha.org> wrote:

> On Sun, 18 Jun 2017 22:19:29 -0400, Augie Fackler wrote:
>>
>> > On Jun 16, 2017, at 22:02, Matt Harbison <mharbison72 at gmail.com> wrote:
>> >
>> > On Fri, 16 Jun 2017 09:59:30 -0400, Augie Fackler <raf at durin42.com> wrote:
>> >
>> >> On Fri, Jun 16, 2017 at 12:18:18AM -0400, Matt Harbison wrote:
>> >>> So apparently, this is a symptom of not having %SystemRoot% in the
>> >>> environment when calling CreateProcess().
>> >>>
>> >>> https://bugs.python.org/issue13524
>> >>> https://jpassing.com/2009/12/28/the-hidden-danger-of-forgetting-to-specify-systemroot-in-a-custom-environment-block/
>> >>>
>> >>> I see that setup.py special cases this variable.  I did a search for
>> >>> 'env =', and it looks like hooks and pager start with empty
>> >>> environments, so they must not inherit this.  IDR if any recent
>> >>> changes were made that start with an empty environment.
>> >>>
>> >>> The thing I can't get my mind around is the hit and miss nature of
>> >>> the error, if this is really the problem.
>> >>
>> >> It sounds like it should be harmless to just always forward
>> >> %SystemRoot% - should we just do that?
>> >
>> > Seems reasonable, but run-tests._getenv() already does an
>> > os.environ.copy(), so it should be there?
>> >
>> > It does seem like a good idea to do it for hooks and other things
>> > executed, where the environment is built from scratch.  The question
>> > is where?  There's util.popen[2-4](), plus some direct calls to
>> > subprocess.Popen(), and an os.system().  I considered
>> > util.shellenviron(), but there are far fewer calls to this than
>> > places where processes are spawned.
>
> (+CC foozy since he has Windows)
>
> Is the problem only seen in tests? I don't think environment variables
> are cleared in hg side.
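
For the from-scratch case mentioned above (hooks and other spawned
processes), forwarding %SystemRoot% could look roughly like the sketch
below.  forward_systemroot() is a hypothetical helper, not an existing
Mercurial function, and the HG_NODE value in the usage comment is only
illustrative:

```python
import os


def forward_systemroot(env, base=None):
    """Return a copy of env with SystemRoot carried over from base.

    Hypothetical helper for illustration only.  'base' defaults to the
    current process environment.  The lookup is case-insensitive, since
    Windows treats environment variable names that way.
    """
    if base is None:
        base = os.environ
    result = dict(env)
    # Leave the child environment alone if the caller already set it.
    if any(name.upper() == "SYSTEMROOT" for name in result):
        return result
    for name, value in base.items():
        if name.upper() == "SYSTEMROOT":
            result[name] = value
            break
    return result


# Illustrative usage (the variable value is made up):
#   subprocess.Popen(cmd, env=forward_systemroot({"HG_NODE": "abc"}))
```

On a platform where the parent has no SystemRoot set, the helper simply
returns a copy of the environment it was given.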

I hit this problem again this weekend, after it was quiet for the past
couple of weeks.  It looks like it might be caused by the system coming
close to running out of memory.

When it happened this time, Windows popped up a dialog box saying memory
was running low, offering to kill some programs.  I had Task Manager open,
and saw the Performance > System > Commit (MB) line was running around
5400/6076.  I closed thg, which exited thg.exe and hg.exe (listed at ~300
MB each in the process list), and the issue stopped.  After a day of quiet,
I was able to recreate it by opening up a bunch of tabs in Firefox and
pushing the memory usage back up around that threshold.  I tend to run the
tests with -j9, and I've seen the first number bounce around between 4900
and 5700+ during these failures, so I'm not sure what the exact problem
threshold is, as tests start and exit.  Interestingly, the free memory
number in Resource Monitor at the same time indicates only 20MB-150MB
free.
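
For anyone who wants to watch for this while running tests, here is a
rough sketch of reading similar commit numbers from Python.  It assumes
the Win32 GlobalMemoryStatusEx() call, whose ullTotalPageFile and
ullAvailPageFile fields report (as far as I know) the commit limit and
the commit still available; the values may not match Task Manager
exactly.  commit_usage_mb() and commit_fraction() are made-up names, and
commit_usage_mb() returns None on non-Windows platforms:

```python
import ctypes
import sys


class MEMORYSTATUSEX(ctypes.Structure):
    # Field layout of the Win32 MEMORYSTATUSEX structure.
    _fields_ = [
        ("dwLength", ctypes.c_uint32),
        ("dwMemoryLoad", ctypes.c_uint32),
        ("ullTotalPhys", ctypes.c_uint64),
        ("ullAvailPhys", ctypes.c_uint64),
        ("ullTotalPageFile", ctypes.c_uint64),
        ("ullAvailPageFile", ctypes.c_uint64),
        ("ullTotalVirtual", ctypes.c_uint64),
        ("ullAvailVirtual", ctypes.c_uint64),
        ("ullAvailExtendedVirtual", ctypes.c_uint64),
    ]


def commit_usage_mb():
    """Return (used_mb, limit_mb) for the commit charge, or None off Windows."""
    if sys.platform != "win32":
        return None
    stat = MEMORYSTATUSEX()
    stat.dwLength = ctypes.sizeof(MEMORYSTATUSEX)
    if not ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(stat)):
        raise ctypes.WinError()
    mb = 1024 * 1024
    limit = stat.ullTotalPageFile // mb
    used = limit - stat.ullAvailPageFile // mb
    return used, limit


def commit_fraction(used_mb, limit_mb):
    """Fraction of the commit limit currently in use."""
    return used_mb / limit_mb
```

With the numbers above, commit_fraction(5400, 6076) comes out at roughly
0.89.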

One of the "optimizations" of the SSD install software was to cap the page  
file, which is probably why I hadn't seen this until recently.  Kostia had  
mentioned to me that he was seeing errors saying "application failed to
start 0xc0000142", which I also saw (along with dialog box failures of
various msys executables, like env.exe and grep.exe).  So maybe this is  
useful to others wanting to run Windows tests.  It seems unlikely that it  
would be seen in the wild (the page file usually isn't capped), and I  
doubt there's anything we can do about it anyway.