Sloth (Prometheus SLO generator) packaged for the Giant Swarm app platform

giantswarm/sloth-app

Sloth APP

This is the Giant Swarm managed app for the Sloth SLO framework.

Sloth generates understandable, uniform and reliable Prometheus SLOs for any kind of service, using a simple SLO spec that results in multiple metrics and multi-window, multi-burn-rate alerts.

Visualize the SLOs

Sloth-app provides Grafana dashboards for quick and simple visualization of the defined SLOs. These dashboards are defined in the customizations/templates folder.

For more detailed information about the Sloth SLO dashboards, see the official Sloth documentation.

Example

Part of a Sloth SLO dashboard: sloth slo dashboard

Part of a Sloth SLO Overview dashboard: sloth slo overview dashboard

Rule management

Sloth can create PrometheusRule CRs as well as plain Prometheus rules. It can run as an operator in a Kubernetes cluster (for the PrometheusRule CRs), and it also provides a CLI tool for generating plain Prometheus rules.
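
For the CLI path, a minimal sketch of a plain (non-Kubernetes) SLO spec is shown below. It reuses the queries from the operator example in the Rule configuration section; the file name `slo.yaml`, the alert name and the label values are illustrative assumptions, and the field names follow the upstream `prometheus/v1` spec as we understand it, so check them against the Sloth version you run.

```yaml
# slo.yaml (hypothetical file name): plain Sloth spec for use with the CLI.
# Assumed invocation, rules are printed to stdout unless an output file is given:
#   sloth generate -i ./slo.yaml
version: "prometheus/v1"
service: "controller-manager"
labels:
  component: "controller-manager"
slos:
  - name: "latency"
    objective: 99
    description: Reconciliation time for each resource controlled by controller manager
    sli:
      events:
        # The plain spec uses snake_case keys, unlike the camelCase keys of the CR.
        error_query: |-
          clamp_min(sum(rate(workqueue_queue_duration_seconds_count{}[{{.window}}])) by (cluster_id) - sum(rate(workqueue_queue_duration_seconds_bucket{le="10"}[{{.window}}])) by (cluster_id), 0)
        total_query: |-
          sum(rate(workqueue_queue_duration_seconds_count{}[{{.window}}])) by (cluster_id)
    alerting:
      name: ControllerManagerReconciliationLatencyTooHigh
      page_alert:
        labels:
          severity: page
      ticket_alert:
        labels:
          severity: "slack"
```

The CLI route suits setups where Prometheus rules are managed as files; the operator route suits clusters that already run the Prometheus Operator.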

Rule configuration

Sloth rules are highly configurable. For example, a rule can be prevented from paging under certain conditions by adding inhibition labels in the alerting section.

Sloth rule example:

apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: kaas-phoenix-controller-manager-latency
  namespace: monitoring
  labels:
    release: prometheus
spec:
  service: "controller-manager"
  labels:
    component: "controller-manager"
  slos:
    - name: "latency"
      objective: 99
      description: Reconciliation time for each resource controlled by controller manager
      sli:
        events:
          errorQuery: |-
            clamp_min(sum(rate(workqueue_queue_duration_seconds_count{}[{{.window}}])) by (cluster_id) - sum(rate(workqueue_queue_duration_seconds_bucket{le="10"}[{{.window}}])) by (cluster_id), 0)
          totalQuery: |-
            sum(rate(workqueue_queue_duration_seconds_count{}[{{.window}}])) by (cluster_id)
      alerting:
        name: ControllerManagerReconciliationLatencyTooHigh
        labels:
          team: phoenix
          area: kaas
        annotations: {}
        pageAlert:
          labels:
            cancel_if_cluster_status_creating: "true"
            cancel_if_cluster_status_deleting: "true"
            severity: page
            team: phoenix
        ticketAlert:
          labels:
            severity: "slack"
            slack_channel: "#responsible-team"
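
With `objective: 99`, the error budget for this SLO is 1% of events. As a rough illustration of the "multiple metrics" Sloth derives from such a spec, the sketch below shows what one generated SLI recording rule could look like for the 5m window, with `{{.window}}` substituted. The record name and the `sloth_*` labels follow the naming convention documented upstream; treat the exact output as an assumption, since it can differ between Sloth versions.

```yaml
# Illustrative only: one of several recording rules generated per SLO
# (one per window). Not copied from real output of this chart.
- record: slo:sli_error:ratio_rate5m
  expr: |
    (clamp_min(sum(rate(workqueue_queue_duration_seconds_count{}[5m])) by (cluster_id) - sum(rate(workqueue_queue_duration_seconds_bucket{le="10"}[5m])) by (cluster_id), 0))
    /
    (sum(rate(workqueue_queue_duration_seconds_count{}[5m])) by (cluster_id))
  labels:
    component: controller-manager        # copied from spec.labels
    sloth_id: controller-manager-latency # <service>-<slo name>
    sloth_service: controller-manager
    sloth_slo: latency
    sloth_window: 5m
```

The page and ticket alerts are then built on top of these per-window error ratios, which is where the labels under `pageAlert` and `ticketAlert` end up.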

Update to the latest version

Run

bash bin/import_upstream_chart

Credit