
After COS load failure dashboards disappear from Grafana #344

Open
natalytvinova opened this issue Jul 29, 2024 · 2 comments
natalytvinova commented Jul 29, 2024

Bug Description

Hi team!
I have a relation between ceph-mon:cos-agent and grafana-agent.
I can see the Ceph dashboards with their data. However, if COS goes down for some reason (the machine is powered off, or it runs out of RAM and everything restarts), those dashboards and their data disappear. I need to remove and re-add the relation to restore them. I was able to reproduce this on the orangebox.

Workaround:

juju remove-relation ceph-mon:cos-agent grafana-agent-container

juju add-relation ceph-mon:cos-agent grafana-agent-container

To Reproduce

cos--bundle.txt
openstack-bundle.txt
cos-status.txt
openstack-status.txt

Environment

OpenStack Yoga, Ceph Quincy, Juju 3.4.2 and 3.4.4; also seen on 3.5

Relevant log output

-

Additional context

No response

@lucabello
Contributor

We should try to reproduce this by:

  1. If on Multipass, restart the VM; otherwise, `kubectl delete` both the controller pod and the Grafana pod.
  2. When everything comes back up, check whether the relation data is still there, whether the dashboards are present in the Grafana container, and whether they fail to appear in Grafana.
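The steps above could be sketched roughly as below; the pod, namespace, unit, and address names are assumptions based on a typical MicroK8s COS deployment and will differ per environment:

```shell
# 1. Simulate the COS outage by deleting the controller and Grafana pods
#    (namespace and pod names here are assumptions).
kubectl -n cos delete pod grafana-0
kubectl -n controller-microk8s delete pod controller-0

# 2. After everything settles, check whether the dashboard payload is
#    still present in the cos-agent relation databag on the machine side...
juju show-unit ceph-mon/0 --format yaml

# ...and whether Grafana itself still lists the Ceph dashboards, e.g. via
# its HTTP search API (password obtainable from the charm's
# get-admin-password action; IP/port are placeholders).
curl -s -u admin:<password> "http://<grafana-ip>:3000/api/search?query=ceph"
```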

@lucabello lucabello changed the title After COS failure and restore relations don't come back After COS load failure dashboards disappear from Grafana Sep 5, 2024
@ca-scribner
Contributor

This feels like a case where the data is probably still there, but the event sequence we receive differs from what was expected. It would be interesting to trigger this issue, then force the charm to loop over its existing relation data and see if everything is repopulated. We should also document here the charm event sequence that occurs when this failure is triggered, so we know what events the charm should be designed to handle.
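One hedged way to test that hypothesis without removing the relation: confirm the databag still holds the dashboard payload, then fire a spare event so the charm re-processes its existing relation data. The application and config option names below are placeholders:

```shell
# Dump the unit's relation databags; if only the event handling was
# missed, the dashboard payload should still be present here.
juju show-unit grafana-agent/0 --format yaml

# Nudge the charm into a config-changed hook without touching the
# relation, by toggling any existing config option (name is a
# placeholder) and then resetting it.
juju config grafana-agent some-option=some-value
juju config grafana-agent --reset some-option

# If the dashboards reappear after this, the data survived and the bug
# is in the event sequence, not in the relation data itself.
```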
