
Investigate and optimize Mimir Resource Usage #3514

Closed
2 of 3 tasks
Tracked by #3360
Rotfuks opened this issue Jun 24, 2024 · 11 comments


Rotfuks commented Jun 24, 2024

Motivation

With our dashboards and discussions we have identified that Mimir uses roughly 2 times as much RAM as our previous Prometheus setup. Since we want to keep the impact on our customers' resources as low as possible, we should investigate whether we can reduce this large usage of costly RAM.

Todo

  • Investigate how we can reduce Mimir's resource usage (you might find some tips in this thread)
  • Optimise Mimir's resource usage
  • Find out exactly how memory usage compares between our previous Prometheus setup and our new Mimir setup, and how we can explain the higher resource consumption

Outcome

One of the two options:

  • Mimir uses the same amount of resources as our previous setup, or less.
  • We know, and can explain, why Mimir has a higher resource consumption than Prometheus.
@Rotfuks Rotfuks added the team/atlas Team Atlas label Jun 25, 2024
@Rotfuks Rotfuks changed the title Investigate and optimize Mimir RAM Usage Investigate and optimize Mimir Resource Usage Jun 25, 2024

Rotfuks commented Jun 25, 2024

One of the potential ideas on what we can do is done here: giantswarm/giantswarm#31132


QuentinBisson commented Jul 17, 2024

@giantswarm/team-atlas I think the extra resource usage is explained by the replication factor, as we are definitely in the same ballpark as the old usage IIRC? I think this could be closed once we have #3576
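To make the replication-factor argument concrete, here is a back-of-the-envelope sketch (the function name and numbers are illustrative, not measured values): with replication factor N, each incoming series is held by N ingesters, so total ingester memory scales roughly linearly with N.

```python
# Rough sketch: with replication factor N, each incoming series is
# written to N ingesters, so total ingester memory across the cluster
# scales roughly linearly with the replication factor.
# (Illustrative only; real usage also depends on churn, labels, etc.)

def estimated_ingester_memory_gb(single_copy_gb: float, replication_factor: int) -> float:
    """Estimate total ingester memory given the size of one copy and the RF."""
    return single_copy_gb * replication_factor

# A hypothetical workload needing ~8 GB for a single copy of its series:
print(estimated_ingester_memory_gb(8, 3))  # RF=3 -> 24.0
print(estimated_ingester_memory_gb(8, 1))  # RF=1 -> 8.0
```

This is why lowering the replication factor (as discussed in #3576) is expected to reduce memory usage close to proportionally on the ingester side.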

@QuantumEnigmaa

I agree

@QuentinBisson

For instance, on grizzly, here is what we had before and have now. There are certainly differences, because we deleted the old setups when clusters were deleted, so memory used by prometheus was not measured over time but over 1h only. With mimir, we keep the data longer, even for deleted clusters. As our automation creates a lot of new clusters on grizzly, the numbers will be difficult to compare, but I would roughly guess that we are quite close.

[screenshot: grizzly memory usage]

@QuentinBisson

@hervenicol and @QuantumEnigmaa I'm assigning you because you did most of the investigation work on this, and I don't think this needs more work, but let's discuss it during standup.

@hervenicol

Grizzly is not the right installation to look at, because as you said it creates lots of ephemeral WCs.
We should look at more stable installations.
Unfortunately, we don't have big stable CAPI installations yet.

Let's have a look at gazelle: it went from 16GB (prometheus) to 20GB (Mimir).
[screenshot: gazelle memory usage]

Alba had new clusters, so its number of series grew, and comparison is difficult.
[screenshot: alba memory usage]

glippy went from very little (max 4GB) to 20GB.
[screenshot: glippy memory usage]

velvet went from 5GB to 8GB:
[screenshot: velvet memory usage]

Feel free to explore the dashboard here: https://giantswarm.grafana.net/d/edpqololp1ibkc/prometheus-and-mimir-resource-usage

While decreasing the replication factor reduced Mimir's memory usage, Mimir is still using more RAM than Prometheus. I didn't find an installation where that's not the case.
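Putting the approximate per-installation figures from this comment side by side (a quick sketch; the numbers are the rough GB values quoted above):

```python
# Approximate prometheus -> mimir memory figures (GB) quoted in this thread.
before_after_gb = {
    "gazelle": (16, 20),
    "glippy": (4, 20),
    "velvet": (5, 8),
}

# Print each installation's before/after figures and the growth ratio.
for installation, (prometheus_gb, mimir_gb) in before_after_gb.items():
    ratio = mimir_gb / prometheus_gb
    print(f"{installation}: {prometheus_gb} GB -> {mimir_gb} GB ({ratio:.2f}x)")
```

Except for glippy (whose Prometheus baseline was unusually small), the growth is in the 1.25x-1.6x range rather than the 2x mentioned in the issue description.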

@QuentinBisson

So I agree, and I was looking at this dashboard definitely less thoroughly. Now, if we only look at gazelle, I would think that going from 16GB to 20GB is in the same ballpark.

The issue, as you mentioned, is that it's almost impossible to compare because of migrations and other clusters. We could theoretically compare eagle with Enigma once eagle is fully migrated, though, but we don't know when this will happen.

@QuentinBisson

It could also be that some of that difference is due to new components in CAPI MCs 😅


QuentinBisson commented Jul 17, 2024

Oh, I stand corrected: we can use leopard as a baseline for that, but communication to grafana cloud is broken.

@QuentinBisson

Here is the leopard one after I've manually set the proxy url on the buddy:

[screenshot: leopard memory usage]

So it looks like we are using roughly 4GB more to run mimir as opposed to running prometheus, but do we want to investigate this?

@QuentinBisson QuentinBisson added the needs/refinement Needs refinement in order to be actionable label Jul 17, 2024

Rotfuks commented Jul 22, 2024

We're happy with the resource usage at this point, so this is done. Thanks everybody!

@Rotfuks Rotfuks closed this as completed Jul 22, 2024
@Rotfuks Rotfuks removed the needs/refinement Needs refinement in order to be actionable label Jul 22, 2024