Default operator (controller) memory resources are too low #1622
The operator (as distributed via OLM) has a default memory limit of 550 MiB and a default memory request of 20 MiB.

Our monitoring shows that actual memory usage spikes higher than that: on a relatively small cluster (4 worker nodes, tens of Grafana CRs) usage oscillates between 540 MiB and 683 MiB, so without any intervention the operator gets continuously OOMKilled.

On our largest cluster, memory consumption oscillates around 2-2.1 GB.

While we know how to adjust the values for individual operator instances, the defaults should probably be changed to something more reasonable; I suggest 600 MiB as the memory request and 2.5 GiB as the memory limit.

The default memory request (20 MiB) seems to be way off; the operator never runs at a consumption that low.
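For concreteness, here is a minimal Go sketch of the suggested numbers expressed as a Kubernetes `ResourceRequirements` value, the shape they would ultimately take in the operator Deployment's container spec. The quantities come from the suggestion above; nothing here is taken from the operator's actual manifests:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Proposed defaults from this issue: request roughly what the operator
	// uses at idle on a small cluster, limit above the observed ~2.1 GB peak.
	proposed := corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceMemory: resource.MustParse("600Mi"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceMemory: resource.MustParse("2560Mi"), // 2.5 GiB
		},
	}
	fmt.Printf("requests: %v, limits: %v\n", proposed.Requests, proposed.Limits)
}
```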
@NissesSenap I know you're running a larger deployment of the operator - can you share your resource consumption to see if we should change the defaults here?
I'm currently on PTO and don't have access to our metrics, but I can take a look at the end of the week when I'm back home.
@bukovjanmic we agree that the default limits/requests are unlikely to be optimal, and we can certainly adjust the values. At the same time, the production memory consumption you highlighted seems quite high, so we'd like to treat this as an opportunity to assess whether 2Gi is reasonable or whether we can optimize it. @theSuess has added a pprof endpoint as part of the latest release, so the best way forward would be to get both insights into your environment (how many of each CR you run, their resync period, etc.) and pprof traces. The second-best option is to share just the above-mentioned insights, so we can at least try to reproduce the behavior. We understand that some of this might not be something you're willing to share publicly, so we can also get in touch on Slack. Please let us know how you'd like to continue :)
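For context, a minimal sketch of how a pprof endpoint is typically exposed in a controller-runtime based operator (controller-runtime v0.15+). The bind address below is an arbitrary example, and this is not necessarily how the grafana-operator release wires it up:

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	ctrl.SetLogger(zap.New())

	// PprofBindAddress serves the standard net/http/pprof handlers
	// (/debug/pprof/heap, /debug/pprof/profile, ...) on the given address.
	// The port is a placeholder, not a documented default of the operator.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		PprofBindAddress: ":8888",
	})
	if err != nil {
		ctrl.Log.Error(err, "unable to create manager")
		os.Exit(1)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		ctrl.Log.Error(err, "manager exited")
		os.Exit(1)
	}
}
```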
Hi, no problem sharing stats (green is the memory request, yellow the memory limit, blue real consumption). Here is our test cluster's consumption over 2 days (operator version 5.11.0). This cluster-wide instance is reconciling 5 Grafana CRs, 48 GrafanaDashboards, 20 GrafanaDatasources, and 1 GrafanaFolder. Here is one of our production clusters over 2 days (operator version 5.9.2). This cluster-wide instance is reconciling 33 Grafana CRs, 187 GrafanaDashboards, 240 GrafanaDatasources, and 2 GrafanaFolders. All operators are installed via OLM with defaults (except for resources); the resync period is left up to the operator.
Also, I put version 5.12.0 on our test cluster and can see the pprof endpoint. Any pointers on what to do with this?
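As a rough sketch of one way to use the endpoint: port-forward to the operator pod and save a heap profile for offline analysis with `go tool pprof`. The port and the assumption of a local port-forward are placeholders, not documented defaults:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

// Fetch a heap profile from the operator's pprof endpoint and save it to
// disk. Assumes something like
//   kubectl port-forward deploy/grafana-operator 8888:8888
// is running in another terminal; both the port and deployment name are
// hypothetical examples.
func main() {
	resp, err := http.Get("http://localhost:8888/debug/pprof/heap")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("heap.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Println("wrote heap.pprof; inspect with: go tool pprof -top heap.pprof")
}
```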
Thanks! |
Here are pprof dumps from our test cluster. This instance of the Grafana operator has 5 Grafana CRs, 48 GrafanaDashboards, 20 GrafanaDatasources, and 1 GrafanaFolder.
@bukovjanmic in this profiling set, it looks like most of the memory is used for caching ConfigMaps/Secrets. It would be interesting to see how memory is used in production.
Here are pprof files for our production cluster - 33 Grafana CRs, 187 GrafanaDashboards, 240 GrafanaDatasources, 2 GrafanaFolders.
Looks like the RabbitMQ operator had a similar issue in the past. I'll try to adapt their solution to our operator.
The good thing is that it's not a memory leak, but just part of the "normal" operations. Would be nice to tune the caching for sure. |
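A sketch of the kind of cache tuning being discussed, assuming controller-runtime v0.15+: scope the manager's informer cache for ConfigMaps and Secrets to a label selector so that only operator-relevant objects are held in memory. The label key below is a made-up example, and this is not the operator's actual configuration:

```go
package main

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	// Only cache ConfigMaps/Secrets carrying a label the operator sets on
	// objects it manages; everything else is read live from the API server
	// instead of being held in the informer cache. The label is hypothetical.
	sel := labels.SelectorFromSet(labels.Set{
		"app.kubernetes.io/managed-by": "grafana-operator",
	})

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			ByObject: map[client.Object]cache.ByObject{
				&corev1.ConfigMap{}: {Label: sel},
				&corev1.Secret{}:    {Label: sel},
			},
		},
	})
	if err != nil {
		os.Exit(1)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

An alternative with the same goal is `client.CacheOptions.DisableFor`, which skips caching those types entirely at the cost of live API reads on every access.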
This issue hasn't been updated for a while, marking as stale, please respond within the next 7 days to remove this label |