
Pulp performance issues #1308

Closed
grzleadams opened this issue Jul 25, 2024 · 8 comments


grzleadams commented Jul 25, 2024

Version
Deployed via Pulp Operator v1.0.0-beta.4 on K8s 1.26.

$ pulp status           
{                                                                                                        
  "versions": [                                                                                          
    {                                                                                                    
      "component": "core",                                                                               
      "version": "3.49.1",                                                                               
      "package": "pulpcore",                                                                             
      "module": "pulpcore.app",                                                                          
      "domain_compatible": true                                                                          
    },                                                                                                   
    <snip>                                                                                           
  ]
  <snip>

Describe the bug
We've seen (increasingly) poor performance from our Pulp instance lately, and it's not entirely clear to us why. Some of the behaviors we've seen are:

  • Workers failing liveness checks (but not so badly that K8s restarts the pods)
    2024-07-25T12:07:07.934721504Z pulp [92902ed0f2db4a62a53dfdaa49f34a24]: pulpcore.tasking.tasks:INFO: Task completed 0190e9c3-30b6-760c-8e26-deb015569ad7
    2024-07-25T12:07:50.658526639Z pulp [69a42f792d4f42728d6f9d49155a93e4]: pulpcore.tasking.worker:INFO: Worker '1@pulp-worker-746fc4f5cb-nl62g' is back online.
    
  • File uploads taking significant amounts of time (>8 hours for 800 files of roughly 10 MB each)
  • CLI operations seem to take a while:
    $ time pulp -v task list --state running
    tasks_list : get https://pulp.<domain>/pulp/api/v3/tasks/?state=running&offset=0&limit=25
    Response: 200
    []
    
    real	0m16.427s
    user	0m0.258s
    sys	0m0.032s
    
  • Potentially related (?): the issue in "Docker push to Pulp registry gives 429" (pulp_container#1716).

The Pulp pods use a NFS-based storageClass but we have pretty much ruled out NFS congestion/slowness as a cause. We have 10 API pods, 5 content pods, and 10 worker pods (most of which are always idle), which seems like it should be enough to handle our use case, and none of them seem to be consuming unusual CPU/memory. We've identified some potential performance tuning that could be done on the DB but we're not seeing deadlocks or similar indications of congestion so I'm not confident that will necessarily solve things. I guess we're just wondering if there's some undocumented tuning/configuration that you could point us to.
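
For reference, one way to keep ruling DB congestion in or out is to look directly at pg_stat_activity for long-running or lock-waiting sessions. A rough sketch of the kind of check we have in mind (the pod name, user, and database below are just placeholders for whatever the operator deployment actually uses):

$ kubectl exec -it <postgres-pod> -- psql -U pulp -d pulp -c \
    "SELECT pid, state, wait_event_type, wait_event,
            now() - query_start AS age, left(query, 60) AS query
     FROM pg_stat_activity
     WHERE state <> 'idle'
     ORDER BY age DESC;"

Sessions showing wait_event_type = 'Lock', or very old non-idle queries, would indicate congestion even without explicit deadlock errors.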

To Reproduce
Steps to reproduce the behavior:
It's unclear... it seems like, as our Pulp instance gets larger, it just slows down.

Expected behavior
Pulp should remain performant as we scale the infrastructure to support our use case.

Additional context
N/A

grzleadams changed the title from "Worker performance issues" to "Pulp performance issues" on Jul 25, 2024
@grzleadams (Author)

Is it expected that workers would not show an online status when they're not processing a job? The API and Content apps report the correct number of available processes, but not all 10 workers show as online (even though their pods are healthy and there's nothing in the logs to indicate a problem).

$ pulp status | jq -s '.[] | .online_content_apps | length'
10
$ pulp status | jq -s '.[] | .online_api_apps | length'
20
$ pulp status | jq -s '.[] | .online_workers | length'
2
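
For what it's worth, the heartbeat timestamps can be pulled out as well; something like this should work, assuming the worker entries in the status payload expose name and last_heartbeat fields:

$ pulp status | jq '.online_workers[] | {name, last_heartbeat}'

Workers whose last heartbeat is older than the worker TTL drop out of online_workers even if Kubernetes still considers the pods healthy.
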


grzleadams commented Jul 25, 2024

For the workers going offline, I did see this in the API pod startup:

2024-07-25T22:25:15.016075012Z pulp [None]: pulpcore.app.entrypoint:WARNING: API_APP_TTL (120) is smaller than double the gunicorn timeout (900.0). You may experience workers wrongly reporting as missing

I mentioned this previously (pulp/pulp_container#1592), but it appears that this value is hard-coded. Is overwriting settings.py via a volumeMount the only way to override it?
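
For anyone else who lands here: pulpcore reads its settings through Dynaconf, so an environment variable prefixed with PULP_ should override a value from settings.py without replacing the whole file. As a quick test, something like the line below might do it, though the deployment name here is a guess for this setup, and the operator may well reconcile the change away (in which case the env var would need to go through whatever the CR supports):

$ kubectl set env deployment/pulp-api PULP_API_APP_TTL=1800

The 1800 is only an illustration, picked to match double the 900-second gunicorn timeout from the warning.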

@grzleadams (Author)

I saw that at one point one of the failing pods (I think it was the API but I'm not sure, it could've been a worker) had a message from PostgreSQL complaining about "too many clients already". I backed down our number of API pods to 5 but increased the number of gunicorn_workers to 4 to compensate, and reduced our number of worker pods from 10 to 5 (and, interestingly, all 5 show online now).

It seems like things are running a little more smoothly, but I'll keep an eye on it tomorrow during business hours. The Gunicorn docs specifically warn that too many workers can thrash your system, which makes sense...
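
The "too many clients already" error also lines up with the connection math: PostgreSQL defaults to max_connections = 100, and every API gunicorn worker, content app, and task worker holds its own connection(s), so a large pod count can run up against that limit well before anything looks CPU- or memory-bound. A quick way to see which components are holding connections (placeholders again for the connection details):

$ kubectl exec -it <postgres-pod> -- psql -U pulp -d pulp -c \
    "SELECT usename, application_name, count(*)
     FROM pg_stat_activity
     GROUP BY usename, application_name
     ORDER BY count(*) DESC;"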


dkliban commented Jul 30, 2024

@grzleadams It seems to me that you are being constrained by the resources available to the database. Is the database being managed by the operator also? What resource constraints does your database have?

@grzleadams (Author)

We are managing the PostgreSQL DB with the operator but didn't make any changes to the configuration (i.e., we use the defaults). There are no CPU/memory limits associated with the DB. I've thought about making changes to max_connections, work_mem, etc., but I haven't actually seen any evidence on the DB of congestion.
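
If it helps with the comparison, the effective values can be read straight off the running instance (connection details are placeholders for our deployment):

$ kubectl exec -it <postgres-pod> -- psql -U pulp -d pulp -c \
    "SELECT name, setting, unit FROM pg_settings
     WHERE name IN ('max_connections', 'work_mem', 'shared_buffers');"
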

@grzleadams (Author)

@dkliban How can I modify the PostgreSQL configuration (i.e., max_connections) via the Operator?

@grzleadams (Author)

Nevermind, got it:

    postgres_extra_args:
    - -c
    - max_connections=1000
    - -c
    - shared_buffers=512MB
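
After the operator rolls the database pod, the new values can be double-checked with something like this (pod name, user, and database are placeholders):

$ kubectl exec -it <postgres-pod> -- psql -U pulp -d pulp -c "SHOW max_connections;"
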

@grzleadams (Author)

We can close this issue; I haven't seen any of the deadlocks, dead workers, etc., since tuning PostgreSQL and gunicorn. That said, if there's ever a Pulp performance tuning doc, I'd be happy to contribute the lessons I've learned the hard way. :)
