Celery lms/cms-worker consumes too much RAM #1126
Comments
Related PR: #1010. For more information regarding additional options, see: edx/configuration#68
Hi Ivo, I agree that we need to provide a way to override the number of concurrent workers. And we also need to provide good defaults for that value. In that context, I'm curious to have your take on this comment by @dkaliberda, where he makes the case that we should default to concurrency=1 and instead scale the number of replicas. I'm not sure I agree with that comment, because in my experience there is quite a bit of overhead incurred by scaling celery workers horizontally (by increasing the number of replicas) as opposed to vertically (by increasing the number of workers). For instance, here are the figures for memory usage on my laptop, in idle mode:
In your current use case, what would be the ideal number of replicas/workers? (both for LMS and CMS)
EDIT: I just learned about process autoscaling in Celery and I'm very tempted to use that as the default. Of course, we would still need to provide a mechanism to override that.
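For illustration, here is a minimal sketch of what passing the autoscale option could look like as a Kubernetes override, assuming a cms-worker deployment with the same arguments as the snippet shared later in this thread; the 8,2 bounds are placeholders, not a recommended default:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cms-worker
spec:
  template:
    spec:
      containers:
        - name: cms-worker
          args:
            - celery
            - --app=cms.celery
            - worker
            - --loglevel=info
            # --autoscale=MAX,MIN lets celery grow and shrink the process pool between these bounds
            - --autoscale=8,2
            - --hostname=edx.cms.core.default.%%h
            - --exclude-queues=edx.lms.core.default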
Hi @regisb, I think a default concurrency=1 is a good value. Nevertheless, I would prefer that it be parameterized. About the autoscaling in Celery, I also just found out about it! I think I won't change my current setup, because it's just working, but if I had found out about it before, I would have been tempted to just use it. Even for our case, just 2 fixed replicas with vertical autoscaling on Celery would be a good option. The good news is that it would benefit everyone, both docker compose and K8s installations. 3 configurations with:
An upgrade note could be added to the docs about configuring a proper value for the concurrency. For example, this is a snippet of my custom tutor plugin to override the workers:
apiVersion: apps/v1
kind: Deployment
metadata:
name: cms-worker
spec:
template:
spec:
terminationGracePeriodSeconds: 900
containers:
- name: cms-worker
args:
- celery
- --app=cms.celery
- worker
- --loglevel=info
- --concurrency=1
- --hostname=edx.cms.core.default.%%h
- --max-tasks-per-child=100
- --prefetch-multiplier=1
- --exclude-queues=edx.lms.core.default
- --without-gossip
- --without-mingle
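For docker compose installations, a similar override could presumably go in the local docker-compose.override.yml file that Tutor supports (under $(tutor config printroot)/env/local/); that file location is an assumption based on Tutor's customisation docs, and the flags simply mirror the snippet above:
services:
  cms-worker:
    command:
      - celery
      - --app=cms.celery
      - worker
      - --loglevel=info
      # pin the number of worker processes instead of one per CPU
      - --concurrency=1
      - --hostname=edx.cms.core.default.%%h
      - --max-tasks-per-child=100
      - --prefetch-multiplier=1
      - --exclude-queues=edx.lms.core.default
      - --without-gossip
      - --without-mingle
The same idea would apply to the lms-worker service, swapping the app, hostname and excluded queue.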
While I do think that administrators should be able to customise the default celery concurrency, I disagree that concurrency=1 is the right default. I haven't looked yet at the other options that you are suggesting -- I'll investigate once I start working on this.
Bug description
The normal execution of the celery workers (the lms-worker and cms-worker services) consumes too much memory by default.
Tested on my tutor local installation, but I assume the same applies to tutor dev.
By default, celery starts one worker process per CPU. If the laptop or server it is deployed on has multiple CPUs, celery will launch multiple OS processes.
Each celery OS process consumes memory.
So the more CPUs you have, the more RAM the celery workers will consume.
The problem I see is that on a deployed server, the operator probably doesn't want this magic: they want to control how many celery OS processes each container/pod starts and, consequently, how much memory each container/pod consumes.
How to reproduce
Run a tutor local environment and see how much RAM your celery workers are consuming.
On a 16-CPU server/laptop running an idle environment, each lms/cms-worker container consumes >2GB of RAM.
If you add --concurrency=1 to the lms/cms-worker command, an idle tutor local environment uses <300MB per container.
Environment
tutor, version 14.2.3
But this also applies to newer versions.
Solution (my opinion)
Make the concurrency configurable, e.g. --concurrency={{ LMS_WORKER_CELERY_CONCURRENCY }} (and an equivalent setting for the cms-worker), with a default value of 1. Additionally, consider adding --prefetch-multiplier=1 --without-gossip --without-mingle to the worker command.
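To make the proposal concrete, here is a sketch of how the lms-worker command could be templated; LMS_WORKER_CELERY_CONCURRENCY is the setting name suggested above, while the surrounding service definition is only an approximation, not Tutor's actual template:
services:
  lms-worker:
    command:
      - celery
      - --app=lms.celery
      - worker
      - --loglevel=info
      # proposed new setting, defaulting to 1
      - --concurrency={{ LMS_WORKER_CELERY_CONCURRENCY }}
      # optional extras also suggested above
      - --prefetch-multiplier=1
      - --without-gossip
      - --without-mingle
      - --hostname=edx.lms.core.default.%%h
      - --max-tasks-per-child=100
      - --exclude-queues=edx.cms.core.default
Operators who prefer vertical scaling could then raise the value, or it could eventually be replaced by the --autoscale option discussed earlier.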