Notes on today's open talk about Observability
What is it for?
Each one of us wrote down their thoughts, and once categorised we had this:
incident detection ++++++++++
visibility ++++++++++
debugging +++++
billing ++
performance ++
business +
security +
Current pain points?
Alerting / on-call experience:
can we ease problem solving?
links to graphs / dashboards? (requires having quality dashboards)
Should we rely more on error budget / SLOs to reduce the number of alerts?
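As a side note on that SLO point, here is a minimal sketch of the error-budget / burn-rate idea (plain Python, made-up numbers, not tied to our actual alerting stack): page only when the budget is being burned too fast, rather than on every symptom.

```python
# Illustrative only: the error-budget / burn-rate idea behind SLO-based alerting.

def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed.
    1.0 = the budget lasts exactly the SLO window; >1.0 = it runs out early."""
    error_budget = 1.0 - slo_target            # e.g. 0.001 for a 99.9% SLO
    return error_ratio / error_budget

# 0.5% of requests failing against a 99.9% SLO burns the budget 5x too fast.
rate = burn_rate(error_ratio=0.005, slo_target=0.999)
print(round(rate, 2))                          # -> 5.0: budget gone in 1/5 of the window

# Page only on a fast burn; the SRE-workbook multi-window variant checks a
# short and a long window to cut noise further.
PAGE_THRESHOLD = 14.4                          # commonly cited for a 1h window / 30d SLO
if rate > PAGE_THRESHOLD:
    print("page on-call")
```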
Debugging:
low metrics capacity is impacting visibility
could we increase retention?
could we downsample old data?
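To make the downsampling idea concrete, a toy sketch (plain Python, not tied to any particular TSDB; systems like Thanos do this natively): old high-resolution samples are averaged into coarser buckets, so longer retention costs less while trends stay visible.

```python
# Sketch: downsample (timestamp, value) samples into coarser buckets by averaging.
from collections import defaultdict
from statistics import mean

def downsample(samples, bucket_seconds=300):
    """Average raw (timestamp, value) samples into fixed-size time buckets."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % bucket_seconds].append(value)
    return sorted((ts, mean(values)) for ts, values in buckets.items())

raw = [(0, 1.0), (60, 2.0), (120, 3.0), (300, 10.0), (360, 20.0)]
print(downsample(raw))  # [(0, 2.0), (300, 15.0)]
```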
Logs
Logs are often lost (like after restarts), so we need to store them somewhere for debugging
Also, alerting on logs would be great
based on keywords
or based on patterns (like sudden increase in number of logs)
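As a rough illustration of the pattern-based idea (plain Python, hypothetical window size and threshold; in practice this would be an alert on a log-derived metric, not a script): compare the current window's log-line count against a recent baseline and flag sudden increases.

```python
# Sketch: flag a window whose log-line count far exceeds the recent baseline.
from collections import deque

class LogSpikeDetector:
    def __init__(self, baseline_windows: int = 12, factor: float = 5.0):
        self.history = deque(maxlen=baseline_windows)  # counts of past windows
        self.factor = factor                           # how big a jump counts as a spike

    def observe_window(self, line_count: int) -> bool:
        baseline = sum(self.history) / len(self.history) if self.history else None
        self.history.append(line_count)
        return baseline is not None and line_count > self.factor * max(baseline, 1.0)

detector = LogSpikeDetector()
for count in [100, 110, 95, 105, 900]:     # log lines per 5-minute window
    if detector.observe_window(count):
        print(f"possible incident: {count} lines in one window")
```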
Which logs?
kube system logs
no customer logs because they may contain sensitive info.
but could we create metrics from customer logs (like the amount of logs), so we can alert them in case something seems strange? (see the sketch after this list)
give control to teams, like with Prometheus rules:
what logs to collect
alerts
recording rules to create metrics from logs
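For the "metrics from customer logs" and per-team control points above, a small sketch using the Python prometheus_client library (metric names, labels, port and log format are all assumptions): count log lines per source and level and expose only those counters, never the log contents, so teams could build their own alerts and recording rules on top.

```python
# Sketch: derive a metric (log line count) from logs without exporting their
# contents, so alerts/recording rules can be built on volume alone.
# All names, labels and the port are assumptions for illustration.
import sys
from prometheus_client import Counter, start_http_server

LOG_LINES = Counter("log_lines_total", "Log lines seen", ["source", "level"])

def process(line: str, source: str) -> None:
    # Only coarse labels, no message content, to avoid leaking sensitive info.
    level = "error" if " ERROR " in line else "other"
    LOG_LINES.labels(source=source, level=level).inc()

if __name__ == "__main__":
    start_http_server(8000)        # /metrics endpoint for Prometheus to scrape
    for line in sys.stdin:         # e.g. piped from a log tailer
        process(line, source="customer-app")
```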
Future:
Traces would be great
Linkerd generates great data (metrics, logs, traces...) that would give great insight. Do we use this data?
Automatic prediction is a great promise, but not realistic for the moment. Some have tried and failed.
Final word
I really loved this discussion, I collected so much feedback! Thanks to all those who participated!
Now we have a lot of homework :rolling_on_the_floor_laughing:
Should we create issues to track actionable items? In sig-monitoring, in Atlas? Or just use this feedback to better prioritise existing tasks?
Immediately actionable follow-up tasks (let's revisit logs and tracing when we have the stack)
The biggest takeaway for Atlas here is that the focus should clearly be on logging.
Also, we should involve @weatherhog and schedule a session with sig-monitoring and other interested parties to identify action items in the areas of debugging, alerting and on-call experience. @anvddriesch (myself) can schedule this session.
context: https://gigantic.slack.com/archives/C0E0V16DC/p1661961550750789