Date: 2023-12-29
Accepted
We are standardizing metrics across our services because operational decisions and performance analysis depend on clear, consistent, and actionable data. Prometheus, with its robust monitoring capabilities, is a suitable platform for this. The decision aligns with our commitment to high service availability and reliability, and it builds on our previous ADR on Prometheus Metrics Naming (ADR #23).
We have decided to adopt standard metrics for our services using Prometheus.
service_availability_ratio: A gauge metric representing the ratio of uptime to total time, aligning with our SLA uptime guarantees.
http_request_duration_seconds: A histogram metric measuring request response times, used to track our response-time SLOs.
error_rate_per_minute: A counter metric tracking the number of errors, from which the per-minute error rate is derived; an SLI for system reliability.
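To illustrate how a histogram such as http_request_duration_seconds can back a response-time SLO, here is a minimal sketch in plain Python. It mimics the cumulative-bucket layout of a Prometheus histogram but does not use the Prometheus client library. The bucket bounds and the helper names `observe_all` and `fraction_within` are assumptions made for illustration; the ADR does not prescribe them.

```python
import math

# Hypothetical bucket upper bounds in seconds; the ADR does not prescribe any.
BUCKET_BOUNDS = (0.1, 0.5, 1.0, math.inf)


def observe_all(durations, bounds=BUCKET_BOUNDS):
    """Return cumulative bucket counts, sum, and count for the durations."""
    buckets = {bound: 0 for bound in bounds}
    for d in durations:
        for bound in bounds:
            # Histogram buckets are cumulative: an observation is counted
            # in every bucket whose upper bound is >= the observed value.
            if d <= bound:
                buckets[bound] += 1
    return {"buckets": buckets, "sum": sum(durations), "count": len(durations)}


def fraction_within(snapshot, bound):
    """Fraction of requests that finished within `bound` seconds."""
    if snapshot["count"] == 0:
        return None  # no traffic yet, so the fraction is undefined
    return snapshot["buckets"][bound] / snapshot["count"]


snapshot = observe_all([0.05, 0.2, 0.7, 2.0])
print(fraction_within(snapshot, 0.5))  # 2 of 4 requests took <= 0.5 s
```

An SLO of the form "X% of requests complete within 0.5 s" can then be checked by comparing this fraction against the target.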
cache_hit_total and cache_miss_total: Counter metrics for the total number of cache hits and misses, respectively.
cache_hit_ratio: A ratio calculated from cache_hit_total and cache_miss_total to determine the effectiveness of the cache.
hot_key_access_frequency: A gauge metric indicating the frequency of access for hot keys.
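The derivation of cache_hit_ratio from the two counters can be sketched as follows. This is an illustrative calculation only; the function name and the choice to return `None` when there have been no lookups are assumptions, not part of the ADR.

```python
def cache_hit_ratio(hit_total, miss_total):
    """Hits divided by all lookups, or None when there were no lookups."""
    lookups = hit_total + miss_total
    if lookups == 0:
        # Avoid dividing by zero before the cache has served any traffic.
        return None
    return hit_total / lookups


print(cache_hit_ratio(90, 10))  # 90 hits out of 100 lookups
```

Computing the ratio from the raw counters, rather than storing it directly, keeps the underlying hit and miss totals available for aggregation across instances.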
requests_per_second (RPS): Counter metric for service requests, from which the request load per second is derived.
transactions_per_second (TPS) and queries_per_second (QPS): Counter metrics for transactions and database queries, from which the respective per-second rates are derived.
response_time_seconds: Histogram metric tracking service response time.
error_rate_percentage: Gauge metric for the percentage of requests that result in errors.
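Because Prometheus counters only accumulate, per-second figures such as RPS, TPS, and QPS are normally derived from successive counter samples. The sketch below shows that idea in plain Python, including handling of counter resets. It is a simplification in the spirit of PromQL's `rate()` and does not reproduce its extrapolation behavior; the function name and sample layout are assumptions.

```python
def per_second_rate(samples):
    """Average per-second increase over (timestamp_seconds, value) samples.

    `samples` must be ordered oldest first.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two points
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # A drop means the counter was reset (for example by a restart);
            # count the new value as the increase since the reset.
            increase += curr
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


# Counter grew by 120 requests over 60 seconds.
print(per_second_rate([(0, 100), (30, 160), (60, 220)]))
```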
network_traffic_bytes: Counter metric for inbound and outbound network traffic.
cpu_usage_percentage: Gauge metric for CPU utilization.
ram_usage_bytes: Gauge metric for RAM usage.
disk_usage_bytes: Gauge metric for disk space usage (HDD/SSD).
queue_size: Gauge metric for the size of each critical queue.
process_count and thread_count: Gauge metrics for monitoring the number of processes and threads.
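Gauges report a current value that can rise or fall, so they are typically sampled at scrape time. As a rough sketch, assuming in-process `queue.Queue` objects stand in for the critical queues, queue_size and thread_count could be sampled as shown here. The `sample_gauges` helper and the `orders` queue are hypothetical.

```python
import queue
import threading


def sample_gauges(critical_queues):
    """Return current gauge values as {(metric_name, label_value): value}."""
    samples = {}
    for name, q in critical_queues.items():
        # qsize() is approximate under concurrent access, which is
        # acceptable for a gauge that is re-sampled on every scrape.
        samples[("queue_size", name)] = q.qsize()
    samples[("thread_count", "")] = threading.active_count()
    return samples


orders = queue.Queue()  # hypothetical critical queue
orders.put("job-1")
orders.put("job-2")
print(sample_gauges({"orders": orders}))
```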
Consistent with ADR #23, all metrics will follow the prescribed naming conventions and utilize labels for additional dimensions. This will enhance clarity, ease of understanding, and consistency in metric categorization.
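As an illustration of names and labels working together, the sketch below validates a metric name and its label names against Prometheus's general syntax rules and renders one sample line in the text exposition format. It does not check the ADR #23 conventions themselves, which are not reproduced here, and the `queue` label and `render_sample` helper are assumptions.

```python
import re

# General Prometheus syntax rules for metric and label names.
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def escape_label_value(value):
    """Escape backslashes, double quotes, and newlines in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name, labels, value):
    """Render one sample as a text exposition line, validating names first."""
    if not METRIC_NAME_RE.fullmatch(name):
        raise ValueError(f"invalid metric name: {name!r}")
    for label in labels:
        # Label names beginning with "__" are reserved for internal use.
        if not LABEL_NAME_RE.fullmatch(label) or label.startswith("__"):
            raise ValueError(f"invalid label name: {label!r}")
    pairs = ",".join(
        f'{k}="{escape_label_value(str(v))}"' for k, v in sorted(labels.items())
    )
    return f"{name}{{{pairs}}} {value}" if pairs else f"{name} {value}"


print(render_sample("queue_size", {"queue": "orders"}, 42))
```

Using a label for the queue name keeps a single queue_size metric for all critical queues instead of one metric name per queue.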
This standardization will enable systematic monitoring and improvement of service performance. The main challenges are ensuring metric accuracy in distributed systems and avoiding over-reliance on quantitative metrics. We will mitigate these through continuous refinement of the monitoring strategy and periodic metric reviews.