
Default operator (controller) memory resources are too low #1622

Open
bukovjanmic opened this issue Jul 29, 2024 · 13 comments
Labels
bug (Something isn't working), triage/accepted (Indicates an issue or PR is ready to be actively worked on.)

Comments

@bukovjanmic

The operator (as distributed via OLM) has its default memory limit set to 550MiB and memory request to 20MiB.

Our monitoring shows that memory usage spikes higher than that: on our relatively small cluster (4 worker nodes, tens of Grafana CRs), memory usage oscillates between 540Mi and 683Mi, so without any intervention the operator gets continuously OOMKilled.

On our largest cluster, memory consumption oscillates around 2-2.1 GB.

While we know how to adjust the values for individual operator instances, the defaults should probably be changed to something more reasonable; I suggest 600MiB for memory.request and 2.5GiB for memory.limit.

The default memory request (20 MiB) seems way off; the operator never runs with consumption anywhere near that low.
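
For reference, a rough sketch of how the values can be overridden per instance through the OLM Subscription's config block (the namespace, channel, and catalog source names below are illustrative and need to match your installation); the values mirror the suggested new defaults:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: grafana-operator
  namespace: grafana-operator        # illustrative namespace
spec:
  channel: v5                        # illustrative channel/source names
  name: grafana-operator
  source: operatorhubio-catalog
  sourceNamespace: olm
  config:
    resources:
      requests:
        memory: 600Mi
      limits:
        memory: 2560Mi               # ~2.5GiB, the suggested new limit
```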

bukovjanmic added the bug (Something isn't working) and needs triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) labels on Jul 29, 2024
@theSuess
Member

@NissesSenap I know you're running a larger deployment of the operator - can you share your resource consumption to see if we should change the defaults here?

@NissesSenap
Collaborator

I'm currently on PTO and don't have access to our metrics, but I can take a look at the end of the week when I'm back home.

theSuess added the triage/needs-information (Indicates an issue needs more information in order to work on it.) label and removed the needs triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) label on Aug 5, 2024
@weisdd
Collaborator

weisdd commented Aug 7, 2024

@bukovjanmic we agree that the default limits/requests are unlikely to be optimal, and we can certainly adjust the values. At the same time, the production memory consumption you highlighted seems quite high, so we'd like to treat this as an opportunity to assess whether 2Gi is reasonable or whether we can optimize it. @theSuess added a pprof endpoint as part of the latest release, so the best way forward would be to get both insights into your environment (how many of each CR you run, their resync period, etc.) and pprof traces. The second-best option is for you to share the above-mentioned insights, so that at least we can try to reproduce the behaviour. We understand that some of this might not be something you're willing to share publicly, so we can also get in touch on Slack. Please let us know how you'd like to proceed :)

@bukovjanmic
Author

Hi, no problem sharing the stats (green is the memory request, yellow the memory limit, blue the real consumption).

Here is our test cluster's consumption over 2 days (operator version 5.11.0):

[image: memory usage graph for the test cluster over 2 days]

This cluster-wide instance is reconciling 5 Grafana CRs, 48 GrafanaDashboards, 20 GrafanaDatasources, and 1 GrafanaFolder.

Here is one of our production clusters over 2 days (operator version 5.9.2):

[image: memory usage graph for a production cluster over 2 days]

This cluster-wide instance is reconciling 33 Grafana CRs, 187 GrafanaDashboards, 240 GrafanaDatasources, and 2 GrafanaFolders.

All operators are installed via OLM with defaults (except for resources); the resync period is left to the operator's default.

@bukovjanmic
Author

Also, I put version 5.0.12 on our test cluster and can see the pprof endpoint. Any pointers on what to do with it?

@weisdd
Collaborator

weisdd commented Aug 9, 2024

@bukovjanmic

  1. Install Go 1.22+ if needed;
  2. Run kubectl port-forward pods/<pod-name> 8888:8888;
  3. go tool pprof -pdf http://localhost:8888/debug/pprof/heap > heap.pdf
  4. go tool pprof -pdf http://localhost:8888/debug/pprof/goroutine > goroutine.pdf (a consolidated sketch of steps 2-4 follows the list);
  5. Ideally, a graph of memory consumption where we can see both overall usage and resident set size (RSS);
  6. Also, if you collect metrics from the operator itself, a graph of go_goroutines would be handy;
  7. Insights into the number and type of CRs that were deployed at the time you captured the data. Extra information such as whether they are in-cluster or external Grafana instances, the source for dashboards (e.g. whether they're fetched from URLs), etc. is appreciated.
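
Roughly, steps 2-4 strung together look like this (the namespace and label selector below are assumptions and need to match your installation):

```sh
NS=grafana-operator-system   # assumed namespace
POD=$(kubectl -n "$NS" get pods -l app.kubernetes.io/name=grafana-operator \
      -o jsonpath='{.items[0].metadata.name}')

# Forward the pprof port in the background, then capture the profiles
kubectl -n "$NS" port-forward "pods/$POD" 8888:8888 &
sleep 2
go tool pprof -pdf http://localhost:8888/debug/pprof/heap > heap.pdf
go tool pprof -pdf http://localhost:8888/debug/pprof/goroutine > goroutine.pdf
kill %1
```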

Thanks!

@bukovjanmic
Author

Here are pprof dumps from our test cluster. This instance of the Grafana operator has 5 Grafana CRs, 48 GrafanaDashboards, 20 GrafanaDatasources, and 1 GrafanaFolder.
grafana-operator-pprof.zip

@weisdd
Collaborator

weisdd commented Aug 12, 2024

@bukovjanmic in this profiling set, it seems like most of the memory is occupied by caching of ConfigMaps/Secrets. It would be interesting to see how memory is used in production.

@bukovjanmic
Author

Here are pprof files for our production cluster - 33 Grafana CRs, 187 GrafanaDashboards, 240 GrafanaDatasources, 2 GrafanaFolders.

grafana-operator-pprof-prod.zip

@theSuess
Member

theSuess commented Sep 3, 2024

Looks like the RabbitMQ operator had a similar issue in the past. I'll try to adapt their solution to our operator:

rabbitmq/cluster-operator#1549
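
If their fix is along the lines of restricting what the informer cache holds (which is what the heap profiles above point at), a minimal sketch of that idea with controller-runtime's cache options could look like this (controller-runtime v0.15+ assumed; the label key/value is illustrative, the actual selector is still to be decided):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	// Only cache Secrets/ConfigMaps carrying an (illustrative) label instead of
	// every object in the cluster, which is what drives the heap usage seen above.
	sel := labels.SelectorFromSet(labels.Set{"app.kubernetes.io/managed-by": "grafana-operator"})

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				&corev1.Secret{}:    {Label: sel},
				&corev1.ConfigMap{}: {Label: sel},
			},
		},
	})
	if err != nil {
		panic(err)
	}

	// Controllers would be registered here before starting the manager.
	_ = mgr
}
```

An alternative would be to bypass the cache entirely for those kinds, at the cost of direct API reads on every reconcile.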

@weisdd
Collaborator

weisdd commented Sep 3, 2024

The good thing is that it's not a memory leak, just part of "normal" operation. It would be nice to tune the caching, for sure.


github-actions bot commented Oct 4, 2024

This issue hasn't been updated for a while, marking as stale, please respond within the next 7 days to remove this label

github-actions bot added the stale label on Oct 4, 2024
theSuess removed the stale label on Oct 4, 2024

github-actions bot commented Nov 4, 2024

This issue hasn't been updated for a while, marking as stale, please respond within the next 7 days to remove this label

github-actions bot added the stale label on Nov 4, 2024
theSuess added the triage/accepted (Indicates an issue or PR is ready to be actively worked on.) label and removed the stale label on Nov 4, 2024
theSuess removed the triage/needs-information (Indicates an issue needs more information in order to work on it.) label on Nov 11, 2024