Pulp performance issues #1308
Is it expected that the workers would not show an online status if they're not processing a job? The API and Content apps seem to report the correct number of available processes, but not all 10 workers show online (even though their pods are healthy and there's nothing in the logs to indicate an issue).
For the workers going offline, I did see this in the API pod startup:
I mentioned this previously (pulp/pulp_container#1592) but it appears that this value is hard-coded. Is the only way to override it to overwrite settings.py as a volumeMount?
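For reference, the volumeMount approach can be sketched as a fragment of the API deployment's pod spec. All names and the settings path below are assumptions (they depend on how the operator lays out the deployment), and an operator may reconcile such a patch away unless it exposes a supported hook for custom settings:

```yaml
# Hypothetical fragment; ConfigMap name, container name, and mount path
# are assumptions for illustration.
volumes:
  - name: custom-settings
    configMap:
      name: pulp-custom-settings   # created from your settings.py
containers:
  - name: api
    volumeMounts:
      - name: custom-settings
        mountPath: /etc/pulp/settings.py
        subPath: settings.py
```

The `subPath` mount overlays just the one file rather than hiding the whole directory behind the volume.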
I saw that at one point one of the failing pods (I think it was the API but I'm not sure, it could've been a worker) had a message from PostgreSQL complaining about "too many clients already". I backed down our number of API pods to 5 but increased the number of gunicorn_workers to 4 to compensate, and reduced our number of worker pods from 10 to 5 (and, interestingly, all 5 show online now). It seems like things are running a little more smoothly but I'll keep an eye on it tomorrow during business hours. Gunicorn docs make a specific mention of too many workers thrashing your system, which makes sense...
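As a rough sanity check on the "too many clients already" error, the connection budget can be tallied. The sketch below assumes one DB connection per gunicorn worker and per task worker (real processes may hold more), and the per-pod worker counts marked hypothetical are guesses, not values from this deployment; PostgreSQL's stock `max_connections` is 100:

```python
# Back-of-envelope PostgreSQL connection budget. All counts are
# illustrative assumptions, not measured values.
def connection_budget(api_pods, gunicorn_workers, content_pods,
                      content_workers, task_workers):
    """Total DB connections if each worker process holds one connection."""
    return (api_pods * gunicorn_workers
            + content_pods * content_workers
            + task_workers)

# Before tuning: 10 API pods, 5 content pods, 10 task workers
# (per-pod gunicorn/content worker counts are hypothetical).
before = connection_budget(10, 4, 5, 2, 10)
# After tuning: 5 API pods x 4 gunicorn workers, 5 task workers.
after = connection_budget(5, 4, 5, 2, 5)
print(before, after)
```

Even modest pod counts multiply out quickly against a 100-connection ceiling, which is consistent with fewer, fatter API pods behaving better here.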
@grzleadams It seems to me that you are being constrained by the resources available to the database. Is the database being managed by the operator also? What resource constraints does your database have?
We are managing the PostgreSQL DB with the operator but didn't make any changes to the configuration (i.e., we use the defaults). There are no CPU/memory limits associated with the DB. I've thought about making changes to max_connections, work_mem, etc., but I haven't actually seen any evidence of congestion on the DB.
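A couple of standard PostgreSQL queries, run via psql against the Pulp database, can confirm or rule out connection pressure; these use stock catalog views and settings, nothing Pulp-specific:

```sql
-- Current connection count vs. the configured ceiling
SELECT count(*) AS connections FROM pg_stat_activity;
SHOW max_connections;   -- stock default is 100

-- Non-idle sessions, longest-running first (look for lock waits)
SELECT pid, state, wait_event_type,
       now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;
```

If the connection count sits near `max_connections`, the "too many clients" errors will appear on the client side well before the database itself looks busy.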
@dkliban How can I modify the PostgreSQL configuration (i.e.,
Nevermind, got it:
We can close this issue; I haven't seen any of the deadlocks, dead workers, etc., since tuning PostgreSQL and gunicorn. That said, if there's ever a Pulp performance tuning doc, I'd be happy to contribute the lessons I've learned the hard way. :)
Version
Deployed via Pulp Operator v1.0.0-beta.4 on K8s 1.26.
Describe the bug
We've seen increasingly poor performance from our Pulp instance lately, and it's not entirely clear to us why. Some of the behaviors we've seen are:
The Pulp pods use an NFS-based storageClass, but we have pretty much ruled out NFS congestion/slowness as a cause. We have 10 API pods, 5 content pods, and 10 worker pods (most of which are always idle), which seems like it should be enough to handle our use case, and none of them appear to be consuming unusual amounts of CPU or memory. We've identified some potential performance tuning that could be done on the DB, but since we're not seeing deadlocks or similar indications of congestion, I'm not confident that will necessarily solve things. I guess we're just wondering if there's some undocumented tuning/configuration you could point us to.
To Reproduce
Steps to reproduce the behavior:
It's unclear... it seems like, as our Pulp instance gets larger, it just slows down.
Expected behavior
Pulp should remain performant as we scale the infrastructure to support our use case.
Additional context
N/A