
Pulp performance issues #1308

Closed
grzleadams opened this issue Jul 25, 2024 · 8 comments


grzleadams commented Jul 25, 2024

Version
Deployed via Pulp Operator v1.0.0-beta.4 on K8s 1.26.

$ pulp status           
{                                                                                                        
  "versions": [                                                                                          
    {                                                                                                    
      "component": "core",                                                                               
      "version": "3.49.1",                                                                               
      "package": "pulpcore",                                                                             
      "module": "pulpcore.app",                                                                          
      "domain_compatible": true                                                                          
    },                                                                                                   
    <snip>                                                                                           
  ]
  <snip>

Describe the bug
We've seen (increasingly) poor performance from our Pulp instance lately, and it's not entirely clear to us why. Some of the behaviors we've seen are:

  • Workers failing liveness checks (but not so badly that K8s restarts the pods)
    2024-07-25T12:07:07.934721504Z pulp [92902ed0f2db4a62a53dfdaa49f34a24]: pulpcore.tasking.tasks:INFO: Task completed 0190e9c3-30b6-760c-8e26-deb015569ad7
    2024-07-25T12:07:50.658526639Z pulp [69a42f792d4f42728d6f9d49155a93e4]: pulpcore.tasking.worker:INFO: Worker '1@pulp-worker-746fc4f5cb-nl62g' is back online.
    
  • File uploads taking significant amounts of time (>8 hours for 800 files of roughly 10 MB each)
  • CLI operations seem to take a while:
    $ time pulp -v task list --state running
    tasks_list : get https://pulp.<domain>/pulp/api/v3/tasks/?state=running&offset=0&limit=25
    Response: 200
    []
    
    real	0m16.427s
    user	0m0.258s
    sys	0m0.032s
    
  • Potentially related (?): the issue in "Docker push to Pulp registry gives 429" (pulp_container#1716).

The Pulp pods use a NFS-based storageClass but we have pretty much ruled out NFS congestion/slowness as a cause. We have 10 API pods, 5 content pods, and 10 worker pods (most of which are always idle), which seems like it should be enough to handle our use case, and none of them seem to be consuming unusual CPU/memory. We've identified some potential performance tuning that could be done on the DB but we're not seeing deadlocks or similar indications of congestion so I'm not confident that will necessarily solve things. I guess we're just wondering if there's some undocumented tuning/configuration that you could point us to.
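
For reference, one way to keep ruling DB congestion in or out is to look directly at pg_stat_activity for long-running or lock-waiting sessions. A rough sketch of the kind of check we have in mind (the pod name, user, and database below are just placeholders for whatever the operator deployment actually uses):

$ kubectl exec -it <postgres-pod> -- psql -U pulp -d pulp -c \
    "SELECT pid, state, wait_event_type, wait_event,
            now() - query_start AS age, left(query, 60) AS query
     FROM pg_stat_activity
     WHERE state <> 'idle'
     ORDER BY age DESC;"

Sessions showing wait_event_type = 'Lock', or very old non-idle queries, would indicate congestion even without explicit deadlock errors.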

To Reproduce
Steps to reproduce the behavior:
It's unclear... it seems like, as our Pulp instance gets larger, it just slows down.

Expected behavior
Pulp should remain performant as we scale the infrastructure to support our use case.

Additional context
N/A

grzleadams changed the title from "Worker performance issues" to "Pulp performance issues" on Jul 25, 2024
@grzleadams (Author)

Is it expected that workers would not show an online status when they're not processing a job? The API and Content apps report the correct number of available processes, but not all 10 workers show as online (even though their pods are healthy and there's nothing in the logs to indicate a problem).

$ pulp status | jq -s '.[] | .online_content_apps | length'
10
$ pulp status | jq -s '.[] | .online_api_apps | length'
20
$ pulp status | jq -s '.[] | .online_workers | length'
2
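
For what it's worth, the heartbeat timestamps can be pulled out as well; something like this should work, assuming the worker entries in the status payload expose name and last_heartbeat fields:

$ pulp status | jq '.online_workers[] | {name, last_heartbeat}'

Workers whose last heartbeat is older than the worker TTL drop out of online_workers even if Kubernetes still considers the pods healthy.
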


grzleadams commented Jul 25, 2024

For the workers going offline, I did see this in the API pod startup:

2024-07-25T22:25:15.016075012Z pulp [None]: pulpcore.app.entrypoint:WARNING: API_APP_TTL (120) is smaller than double the gunicorn timeout (900.0). You may experience workers wrongly reporting as missing

I mentioned this previously (pulp/pulp_container#1592), but it appears that this value is hard-coded. Is overwriting settings.py via a volumeMount the only way to override it?
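
For anyone else who lands here: pulpcore reads its settings through Dynaconf, so an environment variable prefixed with PULP_ should override a value from settings.py without replacing the whole file. As a quick test, something like the line below might do it, though the deployment name here is a guess for this setup, and the operator may well reconcile the change away (in which case the env var would need to go through whatever the CR supports):

$ kubectl set env deployment/pulp-api PULP_API_APP_TTL=1800

The 1800 is only an illustration, picked to match double the 900-second gunicorn timeout from the warning.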

@grzleadams (Author)

I saw that at one point one of the failing pods (I think it was the API but I'm not sure, it could've been a worker) had a message from PostgreSQL complaining about "too many clients already". I backed down our number of API pods to 5 but increased the number of gunicorn_workers to 4 to compensate, and reduced our number of worker pods from 10 to 5 (and, interestingly, all 5 show online now).

It seems like things are running a little more smoothly, but I'll keep an eye on it tomorrow during business hours. The Gunicorn docs specifically warn that too many workers can thrash your system, which makes sense...
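
The "too many clients already" error also lines up with the connection math: PostgreSQL defaults to max_connections = 100, and every API gunicorn worker, content app, and task worker holds its own connection(s), so a large pod count can run up against that limit well before anything looks CPU- or memory-bound. A quick way to see which components are holding connections (placeholders again for the connection details):

$ kubectl exec -it <postgres-pod> -- psql -U pulp -d pulp -c \
    "SELECT usename, application_name, count(*)
     FROM pg_stat_activity
     GROUP BY usename, application_name
     ORDER BY count(*) DESC;"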


dkliban commented Jul 30, 2024

@grzleadams It seems to me that you are being constrained by the resources available to the database. Is the database being managed by the operator also? What resource constraints does your database have?

@grzleadams (Author)

We are managing the PostgreSQL DB with the operator but didn't make any changes to the configuration (i.e., we use the defaults). There are no CPU/memory limits associated with the DB. I've thought about making changes to max_connections, work_mem, etc., but I haven't actually seen any evidence on the DB of congestion.
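
If it helps with the comparison, the effective values can be read straight off the running instance (connection details are placeholders for our deployment):

$ kubectl exec -it <postgres-pod> -- psql -U pulp -d pulp -c \
    "SELECT name, setting, unit FROM pg_settings
     WHERE name IN ('max_connections', 'work_mem', 'shared_buffers');"
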

@grzleadams (Author)

@dkliban How can I modify the PostgreSQL configuration (i.e., max_connections) via the Operator?

@grzleadams (Author)

Nevermind, got it:

    postgres_extra_args:
    - -c
    - max_connections=1000
    - -c
    - shared_buffers=512MB
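
After the operator rolls the database pod, the new values can be double-checked with something like this (pod name, user, and database are placeholders):

$ kubectl exec -it <postgres-pod> -- psql -U pulp -d pulp -c "SHOW max_connections;"
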

@grzleadams (Author)

We can close this issue; I haven't seen any of the deadlocks, dead workers, etc., since tuning PostgreSQL and gunicorn. That said, if there's ever a Pulp performance tuning doc, I'd be happy to contribute the lessons I've learned the hard way. :)
