Copilot AI commented Aug 17, 2025

This PR adds monitoring for GKE cluster deployments, defining Service Level Indicators (SLIs) and a Service Level Objective (SLO) that track uptime against a 90% availability target (10% error budget).

Key Changes

1. Cluster Configuration Updates

  • Updated default cluster name from gke to prod-k8s as specified in requirements
  • Added new monitoring-specific variables for configurability

2. Log-Based Metrics (monitoring.tf)

Created four key log-based metrics to track deployment health:

  • Deployment Availability: Monitors deployment creation/failure events using container logs
  • Replica Readiness: Tracks replica scaling and readiness status
  • Pod Readiness: Monitors pod lifecycle and readiness probe results
  • Uptime Percentage: Calculates deployment uptime as the ratio of ready to desired replicas
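
As an illustration of what one of these log-based metrics might look like, here is a minimal sketch using the `google_logging_metric` resource. It is not the code from monitoring.tf: the metric name and the log filter are assumptions, in particular that Kubernetes events reach Cloud Logging and that failed readiness probes carry `jsonPayload.reason="Unhealthy"`. It relies on the `target_namespace` variable described under "Variables and Flexibility".

```hcl
# Hypothetical sketch: counts readiness-probe failure events for pods in the
# target namespace. The name and filter are assumptions, not the PR's code.
resource "google_logging_metric" "pod_unready_events" {
  name = "pod_unready_events" # hypothetical metric name

  # Assumes Kubernetes events are exported to Cloud Logging and that failed
  # readiness probes are logged with jsonPayload.reason = "Unhealthy".
  filter = "resource.type=\"k8s_pod\" AND resource.labels.namespace_name=\"${var.target_namespace}\" AND jsonPayload.reason=\"Unhealthy\""

  metric_descriptor {
    metric_kind = "DELTA" # counter: one increment per matching log entry
    value_type  = "INT64"
  }
}
```

A counter like this yields the "bad" side of an availability ratio; a matching counter over all readiness events would be needed for the total.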

3. SLO Configuration

  • Target: 90% availability (10% error budget)
  • Period: 7-day rolling window
  • Implementation: Request-based SLI using log data for accurate measurement
  • Scope: All deployments in the configurable target namespace (default: "app")
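
The configuration above could be expressed roughly as follows. This is a sketch under stated assumptions rather than the PR's actual resources: the service ID, SLO ID, and the two log-based metric names (`replica_ready_events`, `replica_total_events`) are hypothetical, and `slo_availability_target` is the variable described under "Variables and Flexibility".

```hcl
# Hypothetical sketch of a 7-day, request-based SLO. IDs and metric names are
# illustrative assumptions.
resource "google_monitoring_custom_service" "deployments" {
  service_id   = "gke-deployments" # hypothetical
  display_name = "GKE deployments"
}

resource "google_monitoring_slo" "deployment_availability" {
  service             = google_monitoring_custom_service.deployments.service_id
  slo_id              = "deployment-availability" # hypothetical
  display_name        = "Deployment availability (7-day rolling)"
  goal                = var.slo_availability_target # 0.9 by default
  rolling_period_days = 7

  request_based_sli {
    good_total_ratio {
      # Assumed user-defined log-based metrics: one counting "ready" events,
      # one counting all readiness events.
      good_service_filter  = "metric.type=\"logging.googleapis.com/user/replica_ready_events\" resource.type=\"k8s_pod\""
      total_service_filter = "metric.type=\"logging.googleapis.com/user/replica_total_events\" resource.type=\"k8s_pod\""
    }
  }
}
```

With a good/total ratio, the SLI is the fraction of counted events that were good, so the 90% goal applies to events rather than to wall-clock time.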

4. Alerting System

  • Burn Rate Alert: Triggers when the error budget is being consumed at more than 2x the normal rate
  • Investigation Guide: Includes structured troubleshooting steps
  • Auto-close: 7-day auto-resolution for alert hygiene
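
A burn-rate alert of this kind might be sketched as below. The variable `slo_resource_name`, the display names, the documentation text, and the 60-minute lookback window are all assumptions made for illustration; the PR description does not state which lookback period is used.

```hcl
# Hypothetical input: full resource name of the SLO to alert on, e.g. the
# `name` attribute of a google_monitoring_slo resource.
variable "slo_resource_name" {
  type        = string
  description = "Full resource name of the SLO (hypothetical input)"
}

resource "google_monitoring_alert_policy" "slo_burn_rate" {
  display_name = "Deployment SLO burn rate above 2x"
  combiner     = "OR"

  conditions {
    display_name = "Error budget burn rate > 2x"

    condition_threshold {
      # The 60-minute lookback is an assumption, not taken from the PR.
      filter          = "select_slo_burn_rate(\"${var.slo_resource_name}\", \"60m\")"
      comparison      = "COMPARISON_GT"
      threshold_value = 2
      duration        = "0s"
    }
  }

  documentation {
    mime_type = "text/markdown"
    content   = "Check pod status and recent rollouts in the target namespace." # placeholder guide text
  }

  alert_strategy {
    auto_close = "604800s" # 7 days, matching the auto-close period above
  }
}
```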

5. Variables and Flexibility

Added configurable variables:

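# Namespace whose deployments are monitored
variable "target_namespace" {
  type        = string
  default     = "app"
  description = "Target namespace for deployment monitoring"
}

# SLO goal as a fraction between 0 and 1
variable "slo_availability_target" {
  type        = number
  default     = 0.9
  description = "SLO availability target (90% = 0.9)"
}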

Usage

Deploy the monitoring setup:

terraform apply

Create deployments in the target namespace:

kubectl create namespace app
kubectl apply -f deployment.yaml -n app

Monitor SLO compliance through Google Cloud Console → Monitoring → Services.

Error Budget Management

With a 90% SLO target over the 7-day rolling window:

  • Allowed downtime: 16.8 hours per week (10% of 168 hours)
  • Alert threshold: 2x burn rate (at this rate the weekly budget would be exhausted in 3.5 days)
  • Monitoring scope: All deployments in the target namespace
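
The budget arithmetic can be checked with a small standalone Terraform sketch. The local and output names are hypothetical, and the variable is redeclared here only so the fragment is self-contained; it would conflict if pasted into a module that already declares it.

```hcl
variable "slo_availability_target" {
  type    = number
  default = 0.9
}

locals {
  slo_window_hours    = 7 * 24                          # 7-day rolling window = 168 hours
  error_budget_ratio  = 1 - var.slo_availability_target # 10% at a 90% target
  allowed_downtime_h  = local.slo_window_hours * local.error_budget_ratio
  burn_rate_threshold = 2
  # At a constant burn rate of N, the whole budget is spent in window / N.
  hours_to_exhaustion = local.slo_window_hours / local.burn_rate_threshold
}

output "allowed_downtime_hours" {
  value = local.allowed_downtime_h # about 16.8 hours per week
}

output "hours_to_budget_exhaustion_at_threshold" {
  value = local.hours_to_exhaustion # 84 hours, i.e. 3.5 days
}
```

Raising `slo_availability_target` to 0.99 would shrink the weekly budget to roughly 1.7 hours while leaving the exhaustion time at a 2x burn rate unchanged.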

The implementation uses Google Cloud's native logging and monitoring services to provide accurate, real-time visibility into deployment availability without requiring additional infrastructure or agents.

Fixes #18.

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • checkpoint-api.hashicorp.com
    • Triggering command: terraform fmt (dns block)
    • Triggering command: terraform init (dns block)
    • Triggering command: terraform validate (dns block)
  • metadata.google.internal
    • Triggering command: .terraform/providers/registry.terraform.io/hashicorp/google/4.53.1/linux_amd64/terraform-provider-google_v4.53.1_x5 (dns block)


Copilot AI commented Aug 17, 2025

@sharmavijay86 👋 This repository doesn't have Copilot instructions. With Copilot instructions, I can understand the repository better, work faster and produce higher quality PRs.

I can generate a .github/copilot-instructions.md file for you automatically. Click here to open a pre-filled issue and assign it to me. I'll write the instructions, and then tag you for review.

Copilot AI changed the title from "[WIP] gke cluster deployed services uptime SLi SLI monitoring setup" to "Implement GKE cluster SLI/SLO monitoring for deployment uptime tracking" on Aug 17, 2025
Copilot AI requested a review from sharmavijay86 on August 17, 2025 10:09
Copilot finished work on behalf of sharmavijay86 on August 17, 2025 10:09