Notes on today's open talk about Observability
What is it for?
Each one of us wrote down their thoughts, and once categorised we had this:
incident detection ++++++++++
visibility ++++++++++
debugging +++++
billing ++
performance ++
business +
security +
Current pain points?
Alerting / on-call experience:
can we ease problem solving?
links to graphs / dashboards? (requires having quality dashboards)
Should we rely more on error budget / SLOs to reduce the number of alerts?
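As a side note on that SLO point, here is a minimal sketch of the error-budget / burn-rate idea (plain Python, made-up numbers, not tied to our actual alerting stack): page only when the budget is being burned too fast, rather than on every symptom.

```python
# Illustrative only: the error-budget / burn-rate idea behind SLO-based alerting.

def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed.
    1.0 = the budget lasts exactly the SLO window; >1.0 = it runs out early."""
    error_budget = 1.0 - slo_target            # e.g. 0.001 for a 99.9% SLO
    return error_ratio / error_budget

# 0.5% of requests failing against a 99.9% SLO burns the budget 5x too fast.
rate = burn_rate(error_ratio=0.005, slo_target=0.999)
print(round(rate, 2))                          # -> 5.0: budget gone in 1/5 of the window

# Page only on a fast burn; the SRE-workbook multi-window variant checks a
# short and a long window to cut noise further.
PAGE_THRESHOLD = 14.4                          # commonly cited for a 1h window / 30d SLO
if rate > PAGE_THRESHOLD:
    print("page on-call")
```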
Debugging:
low metrics capacity is impacting visibility
could we increase retention?
could we downsample old data?
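To make the downsampling idea concrete, a toy sketch (plain Python, not tied to any particular TSDB; systems like Thanos do this natively): old high-resolution samples are averaged into coarser buckets, so longer retention costs less while trends stay visible.

```python
# Sketch: downsample (timestamp, value) samples into coarser buckets by averaging.
from collections import defaultdict
from statistics import mean

def downsample(samples, bucket_seconds=300):
    """Average raw (timestamp, value) samples into fixed-size time buckets."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % bucket_seconds].append(value)
    return sorted((ts, mean(values)) for ts, values in buckets.items())

raw = [(0, 1.0), (60, 2.0), (120, 3.0), (300, 10.0), (360, 20.0)]
print(downsample(raw))  # [(0, 2.0), (300, 15.0)]
```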
Logs
Logs are often lost (like after restarts), so we need to store them somewhere for debugging
Also, alerting on logs would be great
based on keywords
or based on patterns (like sudden increase in number of logs)
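As a rough illustration of the pattern-based idea (plain Python, hypothetical window size and threshold; in practice this would be an alert on a log-derived metric, not a script): compare the current window's log-line count against a recent baseline and flag sudden increases.

```python
# Sketch: flag a window whose log-line count far exceeds the recent baseline.
from collections import deque

class LogSpikeDetector:
    def __init__(self, baseline_windows: int = 12, factor: float = 5.0):
        self.history = deque(maxlen=baseline_windows)  # counts of past windows
        self.factor = factor                           # how big a jump counts as a spike

    def observe_window(self, line_count: int) -> bool:
        baseline = sum(self.history) / len(self.history) if self.history else None
        self.history.append(line_count)
        return baseline is not None and line_count > self.factor * max(baseline, 1.0)

detector = LogSpikeDetector()
for count in [100, 110, 95, 105, 900]:     # log lines per 5-minute window
    if detector.observe_window(count):
        print(f"possible incident: {count} lines in one window")
```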
Which logs?
kube system logs
no customer logs because they may contain sensitive info.
but could we create metrics from customer logs (like the amount of logs), so we can alert them in case something seems strange? (see the sketch after this list)
give control to teams, like with Prometheus rules:
what logs to collect
alerts
recording rules to create metrics from logs
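For the "metrics from customer logs" and per-team control points above, a small sketch using the Python prometheus_client library (metric names, labels, port and log format are all assumptions): count log lines per source and level and expose only those counters, never the log contents, so teams could build their own alerts and recording rules on top.

```python
# Sketch: derive a metric (log line count) from logs without exporting their
# contents, so alerts/recording rules can be built on volume alone.
# All names, labels and the port are assumptions for illustration.
import sys
from prometheus_client import Counter, start_http_server

LOG_LINES = Counter("log_lines_total", "Log lines seen", ["source", "level"])

def process(line: str, source: str) -> None:
    # Only coarse labels, no message content, to avoid leaking sensitive info.
    level = "error" if " ERROR " in line else "other"
    LOG_LINES.labels(source=source, level=level).inc()

if __name__ == "__main__":
    start_http_server(8000)        # /metrics endpoint for Prometheus to scrape
    for line in sys.stdin:         # e.g. piped from a log tailer
        process(line, source="customer-app")
```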
Future:
Traces would be great
Linkerd generates great data (metrics, logs, traces...) that would give great insight. Do we use this data?
Automatic prediction is a great promise, but not realistic for the moment. Some have tried and failed.
Final word
I really loved this discussion, I collected so much feedback! Thanks to all those who participated!
Now we have a lot of homework :rolling_on_the_floor_laughing:
Should we create issues to track actionable items? In sig-monitoring, in Atlas? Or just use this feedback to better prioritise existing tasks?
Immediately actionable follow-up tasks (let's revisit logs and tracing when we have the stack)
The biggest takeaway for Atlas here is that the focus should clearly be on logging.
Also, we should involve @weatherhog and schedule a session with sig-monitoring and other interested parties to identify action items in the areas of debugging, alerting and on-call experience. @anvddriesch (myself) can schedule this session.
context: https://gigantic.slack.com/archives/C0E0V16DC/p1661961550750789