This repository contains all the SRE (Site Reliability Engineering) principles and guidelines for managing the Operate First services.
SRE is a software engineering approach to managing operations for systems, applications, and services. We use software as a tool to manage systems, solve problems, and automate operations tasks.
If you'd like to learn SRE practices and get hands-on experience with them, but aren't sure where or how to start, let us help!
- Follow this link to find beginner-friendly issues.
- Tag yourself in the issue you'd like to work on.
- Join the Slack and let us know that you're interested in helping: post a short introduction of yourself in the #support channel, along with a link to the issue you'd like to complete.
To learn more, check out the incident management procedures and the GitHub receiver setup, learn to configure Prometheus alerts, or browse the GitHub repo.