Organize outcome of on-site observability session #1358

Closed
4 tasks done
anvddriesch opened this issue Sep 5, 2022 · 3 comments

anvddriesch commented Sep 5, 2022

context: https://gigantic.slack.com/archives/C0E0V16DC/p1661961550750789

Notes on today's open talk about Observability

What is it for?

Each of us wrote down their thoughts; once categorised, the tally was:
incident detection ++++++++++
visibility ++++++++++
debugging +++++
billing ++
performance ++
business +
security +
Current pain points?

Alerting / oncall experience:

can we ease problem solving?
links to graphs / dashboards? (requires having quality dashboards)
Should we rely more on error budgets / SLOs to reduce the number of alerts?
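
For the SLO question above, here is a minimal sketch of a multi-window burn-rate alert, assuming the SLI is exposed via recording rules like `slo:request_error_ratio:rate5m` / `:rate1h` (hypothetical names and thresholds):

```yaml
groups:
  - name: slo-burn-rate
    rules:
      - alert: ErrorBudgetBurnRateHigh
        # Page only when the error budget of a 99.9% SLO is burning fast over
        # both a short and a long window, instead of paging on every symptom.
        expr: |
          slo:request_error_ratio:rate5m > (14.4 * 0.001)
          and
          slo:request_error_ratio:rate1h > (14.4 * 0.001)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: Error budget is burning 14.4x faster than sustainable
          dashboard: https://grafana.example/d/slo-overview  # link alerts to dashboards
```

This would also cover the "links to graphs / dashboards" point, since the dashboard URL travels with the alert.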

Debugging:

low metrics capacity is impacting visibility
could we increase retention?
could we downsample old data?
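
On retention / downsampling: if the long-term store is Thanos (an assumption here), the Compactor already supports per-resolution retention, so downsampled data could be kept much longer than raw data. Illustrative values only, not our current config:

```yaml
# Thanos Compactor container args (illustrative sketch)
args:
  - compact
  - --wait
  - --objstore.config-file=/etc/thanos/objstore.yaml
  - --retention.resolution-raw=30d   # raw samples: 30 days
  - --retention.resolution-5m=180d   # 5m downsampled: 6 months
  - --retention.resolution-1h=2y     # 1h downsampled: 2 years
```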

Logs

Logs are often lost (like after restarts), so we need to store them somewhere for debugging
Also, alerting on logs would be great
based on keywords
or based on patterns (like a sudden increase in the number of logs)
Which logs?
kube system logs
no customer logs because they may contain sensitive info.
but could we create metrics from customer logs (like the amount of logs), so we can alert them in case something seems strange?
give control to teams, like with Prometheus-rules:
what logs to collect
alerts
recording rules to create metrics from logs
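
A minimal sketch of what such team-owned log rules could look like with a Loki-style ruler (hypothetical labels, keywords and thresholds), combining a keyword-based alert with a recording rule that turns log volume into a metric:

```yaml
groups:
  - name: team-log-rules
    rules:
      # Keyword-based alert on kube-system logs.
      - alert: KubeSystemOOMKills
        expr: sum(count_over_time({namespace="kube-system"} |= "OOMKilled" [10m])) > 5
        for: 5m
        labels:
          severity: notify
      # Recording rule: log volume per namespace as a metric, so a sudden
      # increase in the number of logs can be graphed and alerted on without
      # storing the (possibly sensitive) log content itself.
      - record: namespace:log_lines:rate5m
        expr: sum by (namespace) (rate({cluster="example"}[5m]))
```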
Future:

Traces would be great
Linkerd generates great data (metrics, logs, traces...), which would give great insight. Do we use this data?
Automatic prediction is a great promise, but not realistic for the moment. Some have tried and failed.
Final word

I really loved this discussion, I collected so much feedback! Thanks to all those who participated!
Now we have a lot of homework :rolling_on_the_floor_laughing:
Should we create issues to track actionable items? In sig-monitoring, in Atlas? Or just use this feedback to better prioritise existing tasks?

Immediately actionable follow-up tasks (let's revisit logs and tracing when we have the stack)

@anvddriesch (Author)

The biggest takeaway for Atlas here is that the focus should clearly be on logging.
We should also involve @weatherhog and schedule a session with sig-monitoring and other interested parties to identify action items in the field of debugging, alerting, and on-call experience.
@anvddriesch (myself) can schedule this session.

@anvddriesch (Author)

We identified some action items related to debugging and alerting (issues linked in top comment)

@TheoBrigitte (Member)

We took care of most items mentioned above.

Also we are now working on:

I am therefore closing this, but I'm always happy to have a chat and hear feedback on how to improve things further.
