Monitoring infrastructure next level with Mimir 🚀 #1817

Closed
12 tasks done
Tracked by #2007
mcbenjemaa opened this issue Dec 27, 2022 · 12 comments
Assignees
Labels
needs/refinement Needs refinement in order to be actionable team/atlas Team Atlas topic/monitoring

Comments

mcbenjemaa commented Dec 27, 2022

Context

To address the limitations of our current monitoring setup, we need to try new tools that give us long-term metrics storage, high availability (HA), and global-view querying.

One of the best tools on the market is Grafana Mimir.
We need to try it and find the best way to integrate it into our setup.

User Story

As a Giant Swarm engineer, I want to be able to access a history of metrics for the components managing our platform, so that I can both investigate ongoing operational issues and provide details for incident reports.

Ideally a month of metrics history should be available.

Tasks

  1. team/atlas topic/monitoring
    QuantumEnigmaa
  2. team/atlas topic/monitoring
    QuantumEnigmaa
  3. team/atlas topic/monitoring
    QuantumEnigmaa hervenicol
  4. team/atlas topic/monitoring
    hervenicol
  5. team/atlas topic/monitoring
    QuantumEnigmaa hervenicol
  6. component/prometheus team/atlas topic/monitoring
    QuantumEnigmaa
  7. team/atlas team/phoenix topic/monitoring
    QuentinBisson
  8. Planning team/atlas topic/monitoring
  9. Planning team/atlas topic/monitoring
  10. 39 of 41
    team/atlas topic/monitoring

Later

  • Define cost dashboard
    • current prometheus cost
    • mimir cost
    • prometheus-agent transfer cost
  • How to migrate
  • Try-it mixins dashboards
  • Define/migrate alerting rules
  • What happens to SLOs

Related issues

@mcbenjemaa mcbenjemaa self-assigned this Dec 27, 2022

TheoBrigitte commented Apr 4, 2023

No description provided.

@hervenicol

The "Kubernetes / Compute Resources / Namespace (Pods)" dashboard shows nice info about CPU/RAM usage for the whole mimir namespace. 👍

@TheoBrigitte TheoBrigitte added the needs/refinement Needs refinement in order to be actionable label Apr 18, 2023

TheoBrigitte commented May 2, 2023

We want to compare Mimir vs Prometheus at the installation level.

Let's aim for something similar to the Loki cost dashboard from #2182.

@TheoBrigitte TheoBrigitte removed the needs/refinement Needs refinement in order to be actionable label May 9, 2023
@carillan81

From the hackathon:

@TheoBrigitte TheoBrigitte changed the title from "Try mimir" to "Monitoring infrastructure next level with Mimir 🚀" Jun 13, 2023

TheoBrigitte commented Jun 27, 2023

Status update

Mimir is currently deployed on gauss via GitOps (giantswarm/management-cluster).
No multi-tenancy yet: everything uses the anonymous tenant.

Metrics

  • Data sent via remote write from current Prometheus servers
  • Rules are imported within Mimir using Grafana Agent
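
For reference, the remote-write path in the first bullet could look roughly like the sketch below. It builds a Prometheus `remote_write` entry as a plain dictionary. The hostname is a placeholder and the helper name `mimir_remote_write_entry` is hypothetical; neither comes from the actual gauss setup. `/api/v1/push` is Mimir's remote-write endpoint, and the `X-Scope-OrgID` header only matters once multi-tenancy is enabled, so it is left out by default.

```python
import json
from typing import Optional


def mimir_remote_write_entry(base_url: str, tenant: Optional[str] = None) -> dict:
    """Build one Prometheus `remote_write` entry that targets Mimir.

    `base_url` is the address of the Mimir write path (placeholder here).
    `tenant` is only set when multi-tenancy is enabled; when it is None,
    no X-Scope-OrgID header is added.
    """
    entry = {"url": base_url.rstrip("/") + "/api/v1/push"}
    if tenant is not None:
        entry["headers"] = {"X-Scope-OrgID": tenant}
    return entry


if __name__ == "__main__":
    # Placeholder address, not the real endpoint.
    config = {"remote_write": [mimir_remote_write_entry("http://mimir.example.internal")]}
    print(json.dumps(config, indent=2))
```

Keeping the tenant optional means the same helper would still work when multi-tenancy is turned on later.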

Alerting

  • Mimir Alertmanager deployed
  • Alerts available in Grafana via Mimir Alertmanager datasource
  • Mimir Alertmanager config deployed manually via mimir tool using current Alertmanager config
  • Mimir ruler evaluates rules for alerting
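
The manual Alertmanager config upload mentioned above could be scripted along these lines. This is only a sketch: the file path and address are placeholders, the helper name is hypothetical, and it assumes the `mimirtool alertmanager load` subcommand with its `--address` and `--id` flags. The tenant defaults to `anonymous` because multi-tenancy is not enabled.

```python
import shlex
from typing import List


def mimirtool_load_command(config_path: str, address: str, tenant: str = "anonymous") -> List[str]:
    """Return the argv for uploading an Alertmanager config to Mimir.

    Only builds the command; running it is left to the caller.
    """
    return [
        "mimirtool", "alertmanager", "load", config_path,
        f"--address={address}",
        f"--id={tenant}",
    ]


if __name__ == "__main__":
    # Placeholder path and address.
    cmd = mimirtool_load_command("alertmanager.yaml", "http://mimir.example.internal")
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where `mimirtool` is installed.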

LTS / Grafana Cloud

  • Dedicated Prometheus used for rule evaluation
    • remote read from Mimir
    • evaluate aggregation:* (grafana cloud) rules
    • rules split for better performance
    • remote write to Grafana Cloud (currently duplicated in Mimir ruler)
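
One way to picture the `aggregation:*` selection and the rule split is the sketch below. It assumes rule groups in the usual Prometheus rule-file shape (`name` plus a list of `rules`) and that "split" means spreading the selected recording rules over several smaller groups; the function name, group naming, and chunk size are hypothetical, not what is actually deployed.

```python
from typing import Dict, List


def split_aggregation_rules(
    groups: List[Dict],
    prefix: str = "aggregation:",
    max_rules: int = 2,
) -> List[Dict]:
    """Keep recording rules whose name starts with `prefix`, then
    redistribute them into groups of at most `max_rules` rules."""
    selected = [
        rule
        for group in groups
        for rule in group.get("rules", [])
        # Alerting rules have no "record" key, so they are skipped here.
        if rule.get("record", "").startswith(prefix)
    ]
    return [
        {"name": f"aggregation-{i // max_rules}", "rules": selected[i:i + max_rules]}
        for i in range(0, len(selected), max_rules)
    ]


if __name__ == "__main__":
    # Made-up rules, for illustration only.
    example = [
        {"name": "mixed", "rules": [
            {"record": "aggregation:a", "expr": "sum(up)"},
            {"alert": "SomethingDown", "expr": "up == 0"},
            {"record": "aggregation:b", "expr": "count(up)"},
            {"record": "aggregation:c", "expr": "avg(up)"},
        ]},
    ]
    for group in split_aggregation_rules(example):
        print(group["name"], [r["record"] for r in group["rules"]])
```

Rules inside one group are evaluated sequentially while separate groups run independently, so a split like this is only safe when the rules do not depend on each other's output within the same evaluation cycle.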

TODOs

  • How to keep external labels
  • Fix source link in alerts
  • Meta monitoring
    • Mimir
    • Grafana Agent
    • Dedicated Prometheus
  • Use different S3 bucket for Mimir ruler & Mimir alertmanager config
  • Silence-operator: add support for multi-tenancy, to configure silence per tenant
  • Define migration path

@TheoBrigitte

  • How to debug Prometheus targets?

TODOs

  • Monitor Mimir, Grafana Agent
  • Fix alert source link
  • Add new alert templating for Mimir
  • Scrape old targets via scrape config
    • Keep old Prometheus to scrape targets via scrape config

Deploy steps

  • Switch prometheus-agent remoteWrite from old Prometheus to Mimir
  • Stop rule evaluation on old Prometheus

@QuentinBisson

@TheoBrigitte how about we start by writing down an RFC?

@TheoBrigitte TheoBrigitte added the needs/refinement Needs refinement in order to be actionable label Oct 17, 2023
@QuentinBisson

The discovery phase is over; the epic can be found in #3039. Closing this one.

@github-project-automation github-project-automation bot moved this from Backlog 📦 to Done ✅ in Roadmap Jan 22, 2024

6 participants