
multiprocessing.Pool workers hold on to response body #95

Open
evanj opened this issue May 20, 2016 · 6 comments

@evanj

evanj commented May 20, 2016

In our app, we make use of pull task queues, and we pull large numbers of tasks with large bodies (e.g. ~5 MB payloads). I noticed what appeared to be a memory leak when running this in the python-compat runtime, with many megabytes of strings being retained. After a whole lot of hacking, I tracked it down to the following:

  • Multiprocessing pool workers hold on to the result object in their local variables until they receive another work item (see the sketch after this list).
  • The pool has 100 threads in the python-compat environment, so it can retain up to ~100 such items.
  • The retained result type is requests.Response, which stores the full body of the response.
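
A minimal, self-contained sketch of that retention behavior, using multiprocessing.pool.ThreadPool (which I assume matches the thread-based pool described above; the class name and sizes are purely illustrative):

```python
import weakref
from multiprocessing.pool import ThreadPool

class BigResult(object):
    """Stand-in for a requests.Response that holds a multi-megabyte body."""
    def __init__(self):
        self.payload = "x" * (5 * 1024 * 1024)  # ~5 MB, like the task payloads above

pool = ThreadPool(4)
# Keep only a weak reference to the value a worker thread produced.
ref = weakref.ref(pool.apply(BigResult))

# On interpreters that hold worker locals until the next task (e.g. stock
# Python 2.7.9), the result tuple in the worker's frame keeps the ~5 MB
# object alive even though nothing in our code references it any more.
print(ref() is not None)
pool.terminate()
```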

My hack to fix it: in google/appengine/ext/vmruntime/vmstub.py, I set response._content = None right after the response protocol buffer message is parsed. That dropped our process's memory usage from ~400 MB to ~100 MB.
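
For illustration, the same technique applied to a plain requests call (a rough sketch, assuming the caller has already consumed the body; it relies on the private _content attribute where requests caches the body, not on any public API, and the URL is a placeholder, not from the issue):

```python
import requests

resp = requests.get("https://example.com/")  # placeholder URL
data = resp.content   # copy out whatever the caller actually needs first
resp._content = None  # drop the cached body so a lingering Response no longer pins it
```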

Hacky output from my tool that found the large strings being retained, showing what is holding on to one of them:

found new string! len=4653338
referred by 139691536297608 <type 'dict'>
-referred by 139691536340816 <class 'requests_nologs.models.Response'>
--referred by 139691536378120 <type 'tuple'>
---referred by frame /usr/lib/python2.7/multiprocessing/pool.py 102 worker
---referred by frame /usr/lib/python2.7/multiprocessing/pool.py 380 _handle_results
@theacodes theacodes added the bug label May 20, 2016
@theacodes theacodes added this to the Beta milestone May 20, 2016
@theacodes

Thank you for your bug report!

@bryanmau1 can you take a look? This seems big enough to be a beta blocker.

@evanj

evanj commented May 23, 2016

FWIW: Something similar to this issue may exist in the standard environment as well. This same service grows from ~70 MB of memory to a "steady state" of ~170 MB over a few hours after we restart it. I tried running it in the flexible environment so I could trace whether there was an actual leak, and I couldn't find anything other than this issue. There is an issue report from ~2 years ago suggesting that others have observed high memory usage for big RPCs: https://code.google.com/p/googleappengine/issues/detail?id=10475

@duggelz duggelz assigned duggelz and unassigned bryanmau1 Nov 29, 2016
@mikelambert

mikelambert commented May 30, 2017

I am running into something similar (but with worker instead of _handle_results)...

I believe this change should fix the underlying CPython bug: python/cpython@5084ff7
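
If I'm reading that change right, the gist is releasing the worker's local references as soon as the result has been handed back, rather than holding them until the next task arrives. A paraphrased, self-contained sketch of that pattern (not the actual upstream diff; the worker loop here is heavily simplified):

```python
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

def worker(inqueue, outqueue):
    """Simplified stand-in for multiprocessing.pool.worker, showing the pattern
    of dropping local references once the result has been handed off."""
    while True:
        task = inqueue.get()
        if task is None:                    # sentinel: shut the worker down
            break
        job, func, args, kwds = task
        try:
            result = (True, func(*args, **kwds))
        except Exception as e:
            result = (False, e)
        outqueue.put((job, result))
        # The crux of the fix as I understand it: without this line, `task` and
        # `result` keep the last return value (e.g. a large Response) alive
        # while the worker blocks waiting for its next task.
        task = job = func = args = kwds = result = None

if __name__ == "__main__":
    inq, outq = queue.Queue(), queue.Queue()
    inq.put((1, pow, (2, 20), {}))
    inq.put(None)
    worker(inq, outq)
    print(outq.get())   # (1, (True, 1048576))
```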

Unfortunately, the latest CPython release (2.7.13) came out five months before this fix landed. According to the release schedule, 2.7.14 should arrive in mid-2017.

"Unfortunately", GCE's python-compat is on debian stable-jessie (python 2.7.9). Looks like debian testing-stretch (python 2.7.13) is now in full-freeze for a mid-2017 stable release, and therefore unlikely to grab 2.7.14...so we'll have to wait for the next release who-knows-when.

I suppose the easiest option would be to have our Dockerfile overwrite /usr/lib/python2.7/multiprocessing/pool.py with our own patched version...

@mikelambert

And if I understand things right, that's MAX_CONCURRENT_API_CALLS=100 threads per worker, and there are up to 100 workers in Flexible VM worker pools (according to the default gunicorn.conf.py).

(Correct me if I'm wrong here!) Gunicorn appears to reuse workers when possible, but the API call thread pool appears to distribute load randomly across all blocked threads. So under high load you'd quickly end up with 100 × 100 = 10,000 Response objects sitting around...
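
A back-of-the-envelope upper bound for that worst case, plugging in the ~5 MB payloads from the original report (an illustration of scale only, not a measurement; the real total is spread across processes and capped by what each instance actually handles):

```python
threads_per_worker = 100   # MAX_CONCURRENT_API_CALLS
gunicorn_workers = 100     # upper bound from the default gunicorn.conf.py, per above
payload_mb = 5             # ~5 MB task payloads from the original report

retained_responses = threads_per_worker * gunicorn_workers        # 10,000
print("%.1f GB retained across all workers, worst case"
      % (retained_responses * payload_mb / 1024.0))               # ~48.8 GB
```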

@duggelz

duggelz commented May 30, 2017

Updating the Python runtimes to CPython 2.7.13 is on my todo list. We can cherry-pick particular patches from upstream.

I should note that the python-compat runtime is deprecated, and will stop working on October 26, 2017, once the underlying service bridge is turned down.

@mikelambert

mikelambert commented May 31, 2017

FWIW this bug still manifests even outside python-compat and vmstub.py, just to a lesser extent (since it's 100 instead of 100*100).

But just to follow-up: adding in the fixed multiprocessing/pool.py appears to have significantly helped my memory issues.

Previously, memory usage would inch up to 100% within 24 hours (about +0.3 GB/hour), and the process would then slowly swap, crash, and restart. Now it seems to hover consistently around 27%.

Glad to hear about the new CPython! Is it possible to circle back if/when you pull this fix into a pushed CPython 2.7.13 (or if there's a tracking bug for it), so that I know when to delete my hack fix? Thanks!
