Setup Mimir on CAPA installation #3039

Closed
39 of 41 tasks
Tracked by #1817
TheoBrigitte opened this issue Dec 12, 2023 · 4 comments

Comments

@TheoBrigitte
Member

TheoBrigitte commented Dec 12, 2023

The goal here is to replace the current monitoring on a testing CAPA installation and have Mimir in place to send alerts and visualize data via Grafana.

This solution should be fully automated so that we can replicate the setup on multiple installations through a toggle flag that triggers the deployment and configuration of this monitoring solution. Automation for the new monitoring setup should not be added to the existing prometheus-meta-operator.

Here are some goals to reach:

  • Being able to send alerts to OpsGenie from rules evaluated by Mimir
  • Being able to link alerts and visualize actual data and dashboards in Grafana
  • Being able to silence alerts (ideally using our silence repository)

Mimir Setup

  1. team/atlas topic/monitoring
    QuantumEnigmaa
  2. team/atlas topic/monitoring
    QuantumEnigmaa
  3. team/atlas topic/monitoring
    QuantumEnigmaa QuentinBisson
  4. team/atlas
    QuantumEnigmaa
  5. team/atlas
    QuantumEnigmaa
  6. component/mimir team/atlas
    QuantumEnigmaa
  7. component/mimir team/atlas topic/monitoring
    QuantumEnigmaa
  8. component/mimir team/atlas
    QuentinBisson
  9. component/mimir
  10. team/atlas
    QuentinBisson
  11. team/atlas
    QuentinBisson
  12. 3 of 3
    team/atlas
    QuentinBisson
  13. 3 of 3
    team/atlas
    marieroque
  14. team/atlas topic/monitoring
    QuantumEnigmaa
  15. team/atlas
    hervenicol
  16. blocked-upstream team/atlas
    QuantumEnigmaa
  17. component/mimir team/atlas topic/monitoring
    QuantumEnigmaa

After Mimir is successfully set up on one installation, we can start onboarding our internal customers and check the new system before rolling it out everywhere and cleaning up the old system.

Migration and Cleanup

  1. component/alertmanager team/atlas
    QuentinBisson marieroque
  2. team/atlas topic/monitoring
    QuantumEnigmaa
  3. capi/migration component/mimir epic/capi team/turtles topic/observability
  4. capi/migration component/mimir epic/capi team/phoenix topic/observability
  5. team/atlas
    QuantumEnigmaa QuentinBisson
    marieroque
  6. 8 of 8
    blocked component/mimir component/prometheus team/atlas topic/observability
    QuentinBisson
  7. blocked team/atlas
    QuentinBisson
@github-project-automation github-project-automation bot moved this to Inbox 📥 in Roadmap Dec 12, 2023
@TheoBrigitte TheoBrigitte changed the title from "Setup Mimir on gazelle" to "Prototype Mimir on a CAPA installation" Dec 12, 2023
@TheoBrigitte TheoBrigitte removed this from Roadmap Dec 12, 2023
@github-project-automation github-project-automation bot moved this to Inbox 📥 in Roadmap Dec 12, 2023
@snizhana-dynnyk snizhana-dynnyk moved this from Inbox 📥 to Backlog 📦 in Roadmap Jan 9, 2024
@TheoBrigitte TheoBrigitte changed the title from "Prototype Mimir on a CAPA installation" to "Setup Mimir on a CAPA installation" Jan 9, 2024
@TheoBrigitte TheoBrigitte changed the title from "Setup Mimir on a CAPA installation" to "Setup Mimir on CAPA installation" Jan 9, 2024
@QuentinBisson

QuentinBisson commented May 2, 2024

To unblock Mimir:

  • Release announcement for mimir changes and opsctl open removal
  • Adding missing datasource to all dashboards and finish atlas alert reviews
  • Clean up the heartbeats per cluster and clean up prometheus heartbeat alert and remove the heartbeat creation
  • Add information about the grafana agent in the heartbeat ops-recipe
  • Prometheus to grafana cloud prometheus needs an alert and an ops-recipe update
  • Explain in v20 release note that the prometheus agent resource usage will grow bigger
  • Finish removing the prometheus on the MC to reduce resource usage
  • Add HPAs on every component we can except the ingester, add an ops-recipe on how to scale the ingester up and down, and make sure we are alerted when it needs to be scaled up or down
  • Review opsrecipes around prometheus/mimir
  • Try out Mimir on Goat (private MCs)
  • Remove the mimir specific opsgenie policies

@QuentinBisson

cc @Rotfuks as we still need those items :)

@Rotfuks
Contributor

Rotfuks commented May 16, 2024

Here's the list of stories discussed from our onsite:

  • giantswarm/giantswarm#30836
  • giantswarm/giantswarm#30833
  • giantswarm/giantswarm#30834
  • giantswarm/giantswarm#30835
  • giantswarm/giantswarm#30837
  • giantswarm/giantswarm#30839

Need to do later:
Once we have alert review finished:

  • Remove the mimir specific opsgenie policies that silence alerts coming from mimir out of BH

Once we have mimir rolled out:

  • Clean up the heartbeats per cluster and clean up prometheus heartbeat alert and remove the heartbeat creation

@Rotfuks
Contributor

Rotfuks commented Jul 16, 2024

This is done, hurray! Great job @giantswarm/team-atlas !!!! 🎊

@Rotfuks Rotfuks closed this as completed Jul 16, 2024
@github-project-automation github-project-automation bot moved this from Backlog 📦 to Done ✅ in Roadmap Jul 16, 2024
3 participants