
Investigate and optimize Mimir Resource Usage #3514

Closed
2 of 3 tasks
Tracked by #3360
Rotfuks opened this issue Jun 24, 2024 · 11 comments


Rotfuks commented Jun 24, 2024

Motivation

With our dashboards and discussions we have identified that Mimir uses roughly 2 times as much RAM as our previous Prometheus setup. Since we want to keep the impact on our customers' resources as low as possible, we should investigate whether we can reduce this large usage of costly RAM.

Todo

  • Investigate how we can reduce Mimir's resource usage (you might find some tips in this thread)
  • Optimise Mimir's resource usage
  • Find out exactly how memory usage compares between our previous Prometheus setup and our new Mimir setup, and how we can explain the higher resource consumption

Outcome

One of the two options:

  • Mimir uses the same amount of resources as our previous setup, or less.
  • We know, and can explain, why Mimir has a higher resource consumption than Prometheus.
@Rotfuks Rotfuks added the team/atlas Team Atlas label Jun 25, 2024
@Rotfuks Rotfuks changed the title Investigate and optimize Mimir RAM Usage Investigate and optimize Mimir Resource Usage Jun 25, 2024

Rotfuks commented Jun 25, 2024

One of the potential ideas on what we can do is done here: giantswarm/giantswarm#31132


QuentinBisson commented Jul 17, 2024

@giantswarm/team-atlas I think the extra resource usage is explained by the replication factor, as we are definitely in the same ballpark as the old usage IIRC? I think this could be closed once we have #3576
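To make the replication-factor argument concrete, here is a back-of-the-envelope sketch (the function name and numbers are illustrative, not measured values): with replication factor N, each incoming series is held by N ingesters, so total ingester memory scales roughly linearly with N.

```python
# Rough sketch: with replication factor N, each incoming series is
# written to N ingesters, so total ingester memory across the cluster
# scales roughly linearly with the replication factor.
# (Illustrative only; real usage also depends on churn, labels, etc.)

def estimated_ingester_memory_gb(single_copy_gb: float, replication_factor: int) -> float:
    """Estimate total ingester memory given the size of one copy and the RF."""
    return single_copy_gb * replication_factor

# A hypothetical workload needing ~8 GB for a single copy of its series:
print(estimated_ingester_memory_gb(8, 3))  # RF=3 -> 24.0
print(estimated_ingester_memory_gb(8, 1))  # RF=1 -> 8.0
```

This is why lowering the replication factor (as discussed in #3576) is expected to reduce memory usage close to proportionally on the ingester side.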

@QuantumEnigmaa

I agree

@QuentinBisson

For instance, on grizzly, here is what we had before and have now. There are certainly differences, because we deleted the old setups when clusters were deleted, so memory used by prometheus was not measured over time but over 1h only. With mimir, we keep the data longer, even for deleted clusters. As our automation creates a lot of new clusters on grizzly, the numbers will be difficult to compare, but I would roughly guess that we are quite close.

[screenshot: grizzly memory usage]

@QuentinBisson

@hervenicol and @QuantumEnigmaa I'm assigning you because you did most of the investigation work on this, and I don't think this needs more work, but let's discuss it during standup.

@hervenicol

Grizzly is not the right installation to look at, because as you said it creates lots of ephemeral WCs.
We should look at more stable installations.
Unfortunately, we don't have big stable CAPI installations yet.

Let's have a look at gazelle: it went from 16GB (prometheus) to 20GB (Mimir).
[screenshot: gazelle memory usage]

Alba had new clusters, so its number of series grew, and comparison is difficult.
[screenshot: alba memory usage]

glippy went from very little (max 4GB) to 20GB.
[screenshot: glippy memory usage]

velvet went from 5GB to 8GB:
[screenshot: velvet memory usage]

Feel free to explore the dashboard here: https://giantswarm.grafana.net/d/edpqololp1ibkc/prometheus-and-mimir-resource-usage

While decreasing the replication factor reduced Mimir's memory usage, Mimir is still using more RAM than Prometheus. I didn't find an installation where that's not the case.
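Putting the approximate per-installation figures from this comment side by side (a quick sketch; the numbers are the rough GB values quoted above):

```python
# Approximate prometheus -> mimir memory figures (GB) quoted in this thread.
before_after_gb = {
    "gazelle": (16, 20),
    "glippy": (4, 20),
    "velvet": (5, 8),
}

# Print each installation's before/after figures and the growth ratio.
for installation, (prometheus_gb, mimir_gb) in before_after_gb.items():
    ratio = mimir_gb / prometheus_gb
    print(f"{installation}: {prometheus_gb} GB -> {mimir_gb} GB ({ratio:.2f}x)")
```

Except for glippy (whose Prometheus baseline was unusually small), the growth is in the 1.25x-1.6x range rather than the 2x mentioned in the issue description.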

@QuentinBisson

So I agree, and I was looking at this dashboard definitely less thoroughly. Now, if we only look at gazelle, I would think that going from 16GB to 20GB is in the same ballpark.

The issue, as you mentioned, is that it's almost impossible to compare because of migrations and other clusters. We could theoretically compare eagle with Enigma once eagle is fully migrated, though, but we don't know when this will happen.

@QuentinBisson

It could also be that some of that difference is due to new components in CAPI MCs 😅


QuentinBisson commented Jul 17, 2024

Oh, I stand corrected: we can use leopard as a baseline for that, but communication to grafana cloud is broken.

@QuentinBisson

Here is the leopard one after I've manually set the proxy url on the buddy:

[screenshot: leopard memory usage]

So it looks like we are using roughly 4GB more to run mimir as opposed to running prometheus, but do we want to investigate this?

@QuentinBisson QuentinBisson added the needs/refinement Needs refinement in order to be actionable label Jul 17, 2024

Rotfuks commented Jul 22, 2024

We're happy with the resource usage at this point, so this is done. Thanks everybody!

@Rotfuks Rotfuks closed this as completed Jul 22, 2024
@Rotfuks Rotfuks removed the needs/refinement Needs refinement in order to be actionable label Jul 22, 2024