
multiprocessing.Pool workers hold on to response body #95

Open
evanj opened this issue May 20, 2016 · 6 comments

@evanj

evanj commented May 20, 2016

In our app, we make use of pull task queues, and we pull large numbers of tasks with large bodies (e.g. ~5 MB payloads). I noticed what appeared to be a memory leak when running this in the python-compat runtime, with many megabytes of strings being retained. After a whole lot of hacking, I tracked it down to the following:

  • Multiprocessing pool workers hold on to the result object in their local variables until they receive another work item (see the sketch after this list).
  • The pool has 100 threads in the python-compat environment, so it can retain up to ~100 such items.
  • The retained result type is requests.Response, which stores the full body of the response.
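
A minimal, self-contained sketch of that retention behavior, using multiprocessing.pool.ThreadPool (which I assume matches the thread-based pool described above; the class name and sizes are purely illustrative):

```python
import weakref
from multiprocessing.pool import ThreadPool

class BigResult(object):
    """Stand-in for a requests.Response that holds a multi-megabyte body."""
    def __init__(self):
        self.payload = "x" * (5 * 1024 * 1024)  # ~5 MB, like the task payloads above

pool = ThreadPool(4)
# Keep only a weak reference to the value a worker thread produced.
ref = weakref.ref(pool.apply(BigResult))

# On interpreters that hold worker locals until the next task (e.g. stock
# Python 2.7.9), the result tuple in the worker's frame keeps the ~5 MB
# object alive even though nothing in our code references it any more.
print(ref() is not None)
pool.terminate()
```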

My hack to fix it: in google/appengine/ext/vmruntime/vmstub.py, I set response._content = None right after the response protocol buffer message is parsed. That dropped our process's memory usage from ~400 MB to ~100 MB.
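
For illustration, the same technique applied to a plain requests call (a rough sketch, assuming the caller has already consumed the body; it relies on the private _content attribute where requests caches the body, not on any public API, and the URL is a placeholder, not from the issue):

```python
import requests

resp = requests.get("https://example.com/")  # placeholder URL
data = resp.content   # copy out whatever the caller actually needs first
resp._content = None  # drop the cached body so a lingering Response no longer pins it
```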

Hacky output from my tool that found the large strings being retained, showing what is holding on to one of them:

found new string! len=4653338
referred by 139691536297608 <type 'dict'>
-referred by 139691536340816 <class 'requests_nologs.models.Response'>
--referred by 139691536378120 <type 'tuple'>
---referred by frame /usr/lib/python2.7/multiprocessing/pool.py 102 worker
---referred by frame /usr/lib/python2.7/multiprocessing/pool.py 380 _handle_results
@theacodes theacodes added the bug label May 20, 2016
@theacodes theacodes added this to the Beta milestone May 20, 2016
@theacodes

Thank you for your bug report!

@bryanmau1 can you take a look? This seems big enough to be a beta blocker.

@evanj

evanj commented May 23, 2016

FWIW: Something similar to this issue may exist in the standard environment as well. This same service grows from ~70 MB of memory to a "steady state" of ~170 MB over a few hours after we restart it. I tried running it in the flexible environment so I could trace whether there was an actual leak, and I couldn't find anything other than this issue. There is an issue report from ~2 years ago suggesting that others have observed high memory usage for big RPCs: https://code.google.com/p/googleappengine/issues/detail?id=10475

@duggelz duggelz assigned duggelz and unassigned bryanmau1 Nov 29, 2016
@mikelambert

mikelambert commented May 30, 2017

I am running into something similar (but with worker instead of _handle_results)...

I believe this change should fix the underlying CPython bug: python/cpython@5084ff7
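
If I'm reading that change right, the gist is releasing the worker's local references as soon as the result has been handed back, rather than holding them until the next task arrives. A paraphrased, self-contained sketch of that pattern (not the actual upstream diff; the worker loop here is heavily simplified):

```python
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

def worker(inqueue, outqueue):
    """Simplified stand-in for multiprocessing.pool.worker, showing the pattern
    of dropping local references once the result has been handed off."""
    while True:
        task = inqueue.get()
        if task is None:                    # sentinel: shut the worker down
            break
        job, func, args, kwds = task
        try:
            result = (True, func(*args, **kwds))
        except Exception as e:
            result = (False, e)
        outqueue.put((job, result))
        # The crux of the fix as I understand it: without this line, `task` and
        # `result` keep the last return value (e.g. a large Response) alive
        # while the worker blocks waiting for its next task.
        task = job = func = args = kwds = result = None

if __name__ == "__main__":
    inq, outq = queue.Queue(), queue.Queue()
    inq.put((1, pow, (2, 20), {}))
    inq.put(None)
    worker(inq, outq)
    print(outq.get())   # (1, (True, 1048576))
```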

Unfortunately, the latest CPython release (2.7.13) came out five months before this fix landed. According to the release schedule, 2.7.14 should arrive in mid-2017.

"Unfortunately", GCE's python-compat is on debian stable-jessie (python 2.7.9). Looks like debian testing-stretch (python 2.7.13) is now in full-freeze for a mid-2017 stable release, and therefore unlikely to grab 2.7.14...so we'll have to wait for the next release who-knows-when.

I suppose the easiest option would be to have our Dockerfile overwrite /usr/lib/python2.7/multiprocessing/pool.py with our own patched version...

@mikelambert

And if I understand things right, that's MAX_CONCURRENT_API_CALLS=100 threads per worker, and there are up to 100 workers in Flexible VM worker pools (according to the default gunicorn.conf.py).

(Correct me if I'm wrong here!) Gunicorn appears to reuse workers when possible, but the API call thread pool appears to distribute load randomly across all blocked threads. So under high load you'd quickly end up with 100 × 100 = 10,000 Response objects sitting around...
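
A back-of-the-envelope upper bound for that worst case, plugging in the ~5 MB payloads from the original report (an illustration of scale only, not a measurement; the real total is spread across processes and capped by what each instance actually handles):

```python
threads_per_worker = 100   # MAX_CONCURRENT_API_CALLS
gunicorn_workers = 100     # upper bound from the default gunicorn.conf.py, per above
payload_mb = 5             # ~5 MB task payloads from the original report

retained_responses = threads_per_worker * gunicorn_workers        # 10,000
print("%.1f GB retained across all workers, worst case"
      % (retained_responses * payload_mb / 1024.0))               # ~48.8 GB
```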

@duggelz

duggelz commented May 30, 2017

Updating the Python runtimes to CPython 2.7.13 is on my todo list. We can cherry-pick particular patches from upstream.

I should note that the python-compat runtime is deprecated, and will stop working on October 26, 2017, once the underlying service bridge is turned down.

@mikelambert

mikelambert commented May 31, 2017

FWIW this bug still manifests even outside python-compat and vmstub.py, just to a lesser extent (since it's 100 instead of 100*100).

But just to follow-up: adding in the fixed multiprocessing/pool.py appears to have significantly helped my memory issues.

Previously, memory usage would inch up to 100% within 24 hours (about +0.3 GB/hour), and the process would then slowly swap, crash, and restart. Now it seems to hover consistently around 27%.

Glad to hear about the new CPython! Is it possible to circle back if/when you pull this fix into a pushed CPython 2.7.13 (or if there's a tracking bug for it), so that I know when to delete my hack fix? Thanks!
