Backport of docs: Well Architected Framework content migration into r…
…elease/1.18.x (#21145)

backport of commit c36efc8

Co-authored-by: boruszak <[email protected]>
hc-github-team-consul-core and boruszak authored May 20, 2024
1 parent 52fe1fb commit 51f2470
Showing 5 changed files with 435 additions and 2 deletions.
83 changes: 83 additions & 0 deletions website/content/docs/agent/monitor/alerts.mdx
@@ -0,0 +1,83 @@
---
layout: docs
page_title: Consul monitoring and alerts recommendations
description: >-
  Apply best practices towards Consul monitoring and alerts.
---

# Consul monitoring and alerts recommendations

This document will guide you through which host resources to monitor and how monitoring tools can help you set up alerts to notify you when your Consul cluster exceeds its limits. By monitoring Consul and setting up alerts, you can ensure Consul works as expected for all your service discovery and service mesh needs.

## Instance level monitoring

While each host environment and Consul deployment is unique, these recommendations can serve as a starting point for you to reference to meet the unique needs of your deployment.

A Consul datacenter is the smallest unit of Consul infrastructure that can perform basic Consul operations like service discovery or service mesh. A datacenter contains at least one Consul server agent, but a real-world deployment contains three or five server agents and several Consul client agents.

Consul server agents store all state information, including service and node IP addresses, health checks, and configuration. Consul clients report node and service health status to the Consul cluster. In a typical deployment, you must run client agents on every compute node in your datacenter. If you have Kubernetes workloads, you can also run Consul with an alternate service mesh configuration that deploys Envoy proxies but not client agents. Refer to [Simplified service mesh with Consul dataplanes](/consul/docs/connect/dataplane) for more information.

We recommend monitoring the following parameters for Consul agent health:
- Disk space and file handles
- [RAM utilization](/consul/docs/agent/telemetry#memory-usage)
- CPU utilization
- Network activity and utilization

We recommend using an [application performance monitoring (APM) system](#monitoring-tools) to track these metrics. For a full list of key metrics, visit the [Key metrics](/consul/docs/agent/telemetry#key-metrics) section of Telemetry documentation.

## Recommendations for host-level alerts

We recommend starting with a small cluster for most initial production deployments or for testing environments. For production environments with a consistently high workload, we recommend large clusters. Refer to the [Consul capacity planning](/well-architected-framework/reliability/reliability-consul-capacity-planning#minimum-hardware-requirements) article for more information.

When collecting metrics, it is important to establish a baseline. This baseline ensures your Consul deployment is healthy, and serves as a reference point when troubleshooting abnormal cluster behavior. Complete the [Monitor Consul datacenter health](/consul/tutorials/day-2-operations/monitor-datacenter-health#how-to-collect-metrics) tutorial to learn how to collect metrics.

Once you have established a baseline for your metrics, use them and the following recommendations to configure reasonable alerts for your Consul agent.

### Memory alert recommendations

Consul uses RAM as the primary storage for data on its leader node, while periodically flushing it to disk. Refer to the [Memory usage](/consul/docs/agent/telemetry#memory-usage) section of the Telemetry documentation for more details. The recommended instance type depends on your hosting provider. Refer to [Hardware sizing for Consul servers](/consul/tutorials/production-deploy/reference-architecture#hardware-sizing-for-consul-servers) for recommended instance types for most cloud providers along with other up-to-date hardware recommendations.

When determining how much RAM you should allocate, we recommend allocating enough RAM for your server agents to hold 2 to 4 times the working set size. You can determine the working set size by noting the value of `consul.runtime.alloc_bytes` in the telemetry data.

Set up an alert if your RAM usage exceeds a reasonable threshold (for example, 90% of your allocated RAM).
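The sizing rule and alert threshold above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample numbers are hypothetical, and the 2x to 4x multipliers and the 90% threshold are taken from the text above rather than from any Consul API.

```python
from __future__ import annotations

GIB = 1024 ** 3


def recommended_ram_range(alloc_bytes: int) -> tuple[int, int]:
    """Return (low, high) RAM targets as 2x and 4x the working set size."""
    return 2 * alloc_bytes, 4 * alloc_bytes


def ram_alert(used_bytes: int, allocated_bytes: int, threshold: float = 0.90) -> bool:
    """Return True when RAM usage exceeds the threshold fraction of allocated RAM."""
    if allocated_bytes <= 0:
        raise ValueError("allocated_bytes must be positive")
    return used_bytes / allocated_bytes > threshold


if __name__ == "__main__":
    # Hypothetical reading of consul.runtime.alloc_bytes from your telemetry.
    working_set = 3 * GIB
    low, high = recommended_ram_range(working_set)
    print(f"Allocate between {low / GIB:.0f} and {high / GIB:.0f} GiB of RAM")
    print("alert:", ram_alert(used_bytes=15 * GIB, allocated_bytes=16 * GIB))
```

In practice, you would express the same comparison as an alert rule in your monitoring tool instead of running a script.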

### CPU alert recommendations

Your Consul servers should scale up to handle peak CPU load, not idle load. When idle, Consul servers are waiting to react to changes in service health, placement, or other configuration. If there are any service state changes, the Consul server has to notify all impacted Consul clients simultaneously. For example, if the Consul server has to notify hundreds or thousands of Consul clients of a service state update, the Consul server CPU may spike.

If this happens, your monitoring dashboard will show a CPU spike on all servers immediately after a big registration/deregistration operation. This should not happen — you should be able to do a rollout or other high-change operation without taxing the Consul servers.

Set up an alert to detect CPU spikes on your Consul server agents. When this happens, evaluate the size of your Consul servers and upgrade them accordingly.
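One way to tell a brief, expected burst apart from a spike that suggests undersized servers is to alert only when CPU stays high for several consecutive samples. The following Python sketch shows that idea. The 80% threshold, the three-sample window, and the function name are assumptions for illustration, not values recommended by Consul.

```python
def sustained_spike(samples, threshold=80.0, min_consecutive=3):
    """Return True if CPU utilization stays above threshold for
    at least min_consecutive samples in a row.

    samples: CPU utilization percentages in time order.
    threshold and min_consecutive are hypothetical defaults.
    """
    run = 0
    for value in samples:
        # Extend the current run of high samples, or reset it.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


if __name__ == "__main__":
    # Isolated peaks after a registration burst: no alert.
    print(sustained_spike([10, 95, 20, 96, 15]))
    # CPU stays high across three samples: alert.
    print(sustained_spike([10, 85, 90, 95, 20]))
```

Tune both parameters against the baseline you collected for your own servers.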

### Network recommendations

The data sent between all Consul agents must follow latency requirements for total round trip time (RTT):

- Average RTT for all traffic cannot exceed 50ms.
- RTT for 99 percent of traffic cannot exceed 100ms.

Refer to the [Reference architecture](/consul/tutorials/production-deploy/reference-architecture#network-latency-and-bandwidth) to learn more about network latency and bandwidth guidance.

Set an alert to detect when the RTT exceeds these values. You should also monitor metrics related to the host's network latency so that you can act before the RTT exceeds these values.
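As a rough illustration, the two RTT limits can be checked against a window of latency samples as follows. The function names are hypothetical, the samples are assumed to be in milliseconds, and the nearest-rank percentile used here is one of several common definitions, so your monitoring tool may compute a slightly different value.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def rtt_violations(rtts_ms, avg_limit=50.0, p99_limit=100.0):
    """Return a list of messages describing which RTT limits were exceeded."""
    avg = sum(rtts_ms) / len(rtts_ms)
    p99 = percentile(rtts_ms, 99)
    problems = []
    if avg > avg_limit:
        problems.append(f"average RTT {avg:.1f}ms exceeds {avg_limit}ms")
    if p99 > p99_limit:
        problems.append(f"99th percentile RTT {p99:.1f}ms exceeds {p99_limit}ms")
    return problems


if __name__ == "__main__":
    # Hypothetical sample window: mostly fast, with two slow round trips.
    window = [10.0] * 98 + [150.0, 150.0]
    for message in rtt_violations(window):
        print(message)
```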

## Monitoring tools

### Monitoring Consul using Prometheus and Grafana

Time-series-based observability tools, such as Grafana and Prometheus, help you monitor the health of Consul clusters over long intervals of time. Refer to the [Monitoring for Layer 7 observability with Prometheus, Grafana, and Kubernetes](/consul/tutorials/day-2-operations/kubernetes-layer7-observability) tutorial for additional information.

### Monitoring Consul using Datadog

Datadog is a SaaS-based monitoring and analytics platform for large-scale applications and infrastructure. It is one of the supported platforms for monitoring Consul. Datadog agents run on your hosts and report logs, metrics, and traces. By configuring Datadog agents on your Consul server and client instances, you can monitor your Consul cluster's health.

Refer to the following resources for more information:

- [Set up Consul logging with Datadog](https://www.datadoghq.com/blog/consul-datadog/)
- [Datadog monitoring solutions brief](https://www.datocms-assets.com/2885/1576713622-datadog-consul.pdf)
- [HashiCorp partner portal for Consul support on Datadog](https://www.hashicorp.com/partners/tech/datadog#consul)

## Next steps

In this guide, you learned which host resources to monitor and how monitoring tools can help you set up alerts to notify you when your Consul cluster exceeds its limits.

- To learn about monitoring the Consul control and data plane, visit our [Monitoring Consul components](/well-architected-framework/reliability/reliability-consul-monitoring-consul-components) documentation.
- Complete the [Monitor Consul datacenter health with Telegraf](/consul/tutorials/day-2-operations/monitor-health-telegraf) tutorial for additional metrics and alerting recommendations.
121 changes: 121 additions & 0 deletions website/content/docs/agent/monitor/components.mdx
@@ -0,0 +1,121 @@
---
layout: docs
page_title: Monitoring Consul components
description: >-
  Apply best practices monitoring your Consul control and data plane.
---

# Monitoring Consul components

This document provides recommendations for monitoring your Consul control and data plane. By keeping track of these components and setting up alerts, you can better maintain the overall health and resilience of your service mesh.

## Background

A Consul datacenter is the smallest unit of Consul infrastructure that can perform basic Consul operations like service discovery or service mesh. A datacenter contains at least one Consul server agent, but a real-world deployment contains three or five server agents and several Consul client agents.

The Consul control plane consists of server agents that store all state information, including service and node IP addresses, health checks, and configuration. The control plane is also responsible for securing the mesh, facilitating service discovery, health checking, policy enforcement, and other similar operational concerns. It also contains client agents that report node and service health status to the Consul cluster. In a typical deployment, you must run client agents on every compute node in your datacenter.

The Consul data plane consists of proxies deployed locally alongside each service instance. These proxies, called sidecar proxies, receive mesh configuration data from the control plane, and control network communication between their local service instance and other services in the network. The sidecar proxy handles inbound and outbound service connections, and ensures TLS connections between services are both verified and encrypted.

If you have Kubernetes workloads, you can also run Consul with an alternate service mesh configuration that deploys Envoy proxies but not client agents. Refer to [Simplified service mesh with Consul dataplanes](/consul/docs/connect/dataplane) for more information.

## Consul control plane monitoring

The Consul control plane consists of the following components:

- RPC communication between Consul servers and clients
- Data plane routing instructions for the Envoy Layer 7 proxy
- Serf traffic: LAN and WAN
- Consul cluster peering and server federation

It is important to monitor Consul control plane components and to establish baselines and alert thresholds so that you can detect abnormal behavior. Note that planned events such as Consul cluster upgrades, configuration changes, or leadership changes can also trigger these alerts.

To help monitor your Consul control plane, we recommend establishing a baseline and standard deviation for the following:

- [Server health](/consul/docs/agent/telemetry#server-health)
- [Leadership changes](/consul/docs/agent/telemetry#leadership-changes)
- [Key metrics](/consul/docs/agent/telemetry#key-metrics)
- [Autopilot](/consul/docs/agent/telemetry#autopilot)
- [Network activity](/consul/docs/agent/telemetry#network-activity-rpc-count)
- [Certificate authority expiration](/consul/docs/agent/telemetry#certificate-authority-expiration)
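The following Python sketch shows one simple way to turn a baseline and standard deviation into an alert condition. It assumes you already export samples of a metric from your telemetry pipeline. The function names, the sample values, and the choice of three standard deviations as the cutoff are illustrative assumptions, not Consul recommendations.

```python
import statistics


def build_baseline(samples):
    """Return (mean, sample standard deviation) for a list of metric samples."""
    return statistics.mean(samples), statistics.stdev(samples)


def is_anomalous(value, mean, stdev, k=3.0):
    """Return True when value is more than k standard deviations from the mean.

    k=3.0 is a hypothetical default; tune it to your own tolerance for noise.
    """
    return abs(value - mean) > k * stdev


if __name__ == "__main__":
    # Hypothetical samples of one metric collected during normal operation.
    history = [10, 12, 11, 9, 13]
    mean, stdev = build_baseline(history)
    print(f"baseline mean={mean}, stdev={stdev:.2f}")
    print("anomalous:", is_anomalous(20, mean, stdev))
```

Because planned events can also move these metrics, consider silencing or widening such alerts during upgrades and configuration changes.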

It is important to have a highly performant network with low network latency. Ensure that network latency for gossip in all datacenters is within the 8ms latency budget for all Consul agents. View the [Production server requirements](/consul/docs/install/performance#production-server-requirements) for more information.

### Raft recommendations

Consul uses [Raft as its consensus protocol](/consul/docs/architecture/consensus). High saturation of the Raft goroutines can lead to elevated latency in the rest of the system and may cause the Consul cluster to be unstable. As a result, it is important to monitor Raft to track your control plane health. We recommend the following actions to keep the control plane healthy:

- Create an alert that notifies you when [Raft thread saturation](/consul/docs/agent/telemetry#raft-thread-saturation) exceeds 50%.
- Monitor [Raft replication capacity](/consul/docs/agent/telemetry#raft-replication-capacity-issues) when Consul is handling large amounts of data and high write throughput.
- Lower [`raft_multiplier`](/consul/docs/install/performance#production) to keep your Consul cluster stable. The value of `raft_multiplier` is the scaling factor that Consul servers apply to key Raft timing parameters. The default value for `raft_multiplier` is 5.

A low multiplier minimizes failure detection and election time, but failure detection may trigger frequently in high-latency situations. This can cause constant leadership churn and associated unavailability. A high multiplier reduces the chance that spurious failures cause leadership churn, but it takes longer to detect real failures and therefore longer to restore Consul cluster availability.

Wide networks with higher latency will perform better with larger `raft_multiplier` values.
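To make the trade-off concrete, the sketch below scales a set of base Raft timeouts by `raft_multiplier`. The base values (1000ms heartbeat timeout, 1000ms election timeout, 500ms leader lease timeout) are assumptions based on the defaults of the underlying Raft library; verify them against your Consul version before relying on the numbers. The function and constant names are hypothetical.

```python
# Assumed base Raft timeouts in milliseconds at raft_multiplier = 1.
# These are illustrative values, not read from a running Consul server.
BASE_TIMEOUTS_MS = {
    "heartbeat_timeout": 1000,
    "election_timeout": 1000,
    "leader_lease_timeout": 500,
}


def scaled_timeouts(raft_multiplier):
    """Return the assumed base timeouts multiplied by raft_multiplier."""
    if not isinstance(raft_multiplier, int) or raft_multiplier < 1:
        raise ValueError("raft_multiplier must be a positive integer")
    return {name: base * raft_multiplier for name, base in BASE_TIMEOUTS_MS.items()}


if __name__ == "__main__":
    for multiplier in (1, 5):
        print(multiplier, scaled_timeouts(multiplier))
```

Under these assumptions, the default multiplier of 5 stretches the election timeout to several seconds, which tolerates latency better but delays recovery after a real leader failure.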

Raft uses BoltDB for storing data and maintaining its own state. Refer to the [Bolt DB performance metrics](/consul/docs/agent/telemetry#bolt-db-performance) when you are troubleshooting Raft performance issues.

## Consul data plane monitoring

The data plane of Consul consists of Consul clients or [Connect proxies](/consul/docs/connect/proxies) interacting with each other through service-to-service communication. Service-to-service traffic always stays within the data plane, while the control plane only enforces traffic rules. Monitoring service-to-service communication is important but may become extremely complex in an enterprise setup with multiple services communicating with each other across federated Consul clusters through mesh, ingress, and terminating gateways.

### Service monitoring

You can extract the following service-related information:

- Use the [`catalog`](/consul/commands/catalog) command or the Consul UI to query all registered services in a Consul datacenter.
- Use the [`/agent/service/:service_id`](/consul/api-docs/agent/service#get-service-configuration) API endpoint to query individual services. Connect proxies use this endpoint to discover embedded configuration.
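As a minimal sketch, you can query the individual service endpoint from Python with only the standard library. It assumes a local agent listening on the default HTTP address `http://127.0.0.1:8500` with the `/v1` API prefix and no ACL token required. The helper names and the `web-1` service ID are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def service_url(service_id, base="http://127.0.0.1:8500"):
    """Build the URL of the agent endpoint for a single service ID."""
    return f"{base}/v1/agent/service/{urllib.parse.quote(service_id, safe='')}"


def get_service(service_id, base="http://127.0.0.1:8500", timeout=5.0):
    """Fetch and decode the service definition from the local agent.

    Raises urllib.error.URLError if the agent is unreachable and
    urllib.error.HTTPError if the service ID is not registered.
    """
    with urllib.request.urlopen(service_url(service_id, base), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # "web-1" is a placeholder service ID; no request is sent here.
    print(service_url("web-1"))
```

If your datacenter enforces ACLs, the request also needs a token, which this sketch does not handle.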

### Proxy monitoring

Envoy is the supported Connect proxy for Consul service mesh. For virtual machines (VMs), Envoy starts as a sidecar service process. For Kubernetes, Envoy starts as a sidecar container in a Kubernetes service pod.
Refer to the [Supported Envoy versions](/consul/docs/connect/proxies/envoy#supported-versions) documentation to find the compatible Envoy versions for your version of Consul.

For troubleshooting service mesh issues, set Consul logs to `trace` or `debug`. The following example annotation sets Envoy logging to `debug`.

```yaml
annotations:
  consul.hashicorp.com/envoy-extra-args: '--log-level debug --disable-hot-restart'
```

Refer to the [Enable logging on Envoy sidecar pods](/consul/docs/k8s/annotations-and-labels#consul-hashicorp-com-envoy-extra-args) documentation for more information.

#### Envoy admin interface

To troubleshoot service-to-service communication issues, monitor Envoy host statistics. Envoy exposes a local administration interface that can be used to query and modify different aspects of the server on port `19000` by default. Envoy also exposes a public listener port to receive mTLS connections from other proxies in the mesh on port `20000` by default.

All endpoints exposed by Envoy are available at the node running Envoy on port `19000`. The node can be either a pod in Kubernetes or a VM running Consul service mesh. For example, if you forward the Envoy port to your local machine, you can access the Envoy admin interface at `http://localhost:19000/`.

The following Envoy admin interface endpoints are particularly useful:

- The `/listeners` endpoint lists all listeners running on `localhost`. This allows you to confirm whether the upstream services are binding correctly to Envoy.

```shell-session
$ curl http://localhost:19000/listeners
public_listener:192.168.19.168:20000::192.168.19.168:20000
outbound_listener:127.0.0.1:15001::127.0.0.1:15001
```

- The `/clusters` endpoint displays information about the xDS clusters, such as service requests and mTLS related data. The following example shows a truncated output.

```shell-session
$ curl http://localhost:19000/clusters
local_app::observability_name::local_app
local_app::default_priority::max_connections::1024
local_app::default_priority::max_pending_requests::1024
local_app::default_priority::max_requests::1024
local_app::default_priority::max_retries::3
local_app::high_priority::max_connections::1024
local_app::high_priority::max_pending_requests::1024
local_app::high_priority::max_requests::1024
local_app::high_priority::max_retries::3
local_app::added_via_api::true
## ...
```

Visit the main admin interface (`http://localhost:19000`) to find the full list of available Envoy admin endpoints. Refer to the [Envoy docs](https://www.envoyproxy.io/docs/envoy/latest/operations/admin) for more information.
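If you want to inspect `/clusters` output programmatically, the `::`-separated lines are straightforward to parse. The following Python sketch groups values by cluster name. The function name and the embedded sample, which is copied from the truncated output above, are for illustration only, and the parser assumes the text format shown in this document.

```python
SAMPLE = """\
local_app::observability_name::local_app
local_app::default_priority::max_connections::1024
local_app::default_priority::max_retries::3
local_app::added_via_api::true
## ...
"""


def parse_cluster_stats(text):
    """Parse Envoy /clusters text output into {cluster: {key: value}}.

    Each line is expected to look like cluster::key[::subkey...]::value.
    Lines with fewer than three fields, such as truncation markers, are skipped.
    """
    stats = {}
    for line in text.splitlines():
        parts = line.strip().split("::")
        if len(parts) < 3:
            continue
        cluster, *path, value = parts
        stats.setdefault(cluster, {})["::".join(path)] = value
    return stats


if __name__ == "__main__":
    parsed = parse_cluster_stats(SAMPLE)
    for key, value in parsed["local_app"].items():
        print(f"{key} = {value}")
```

All values are kept as strings, so convert numeric fields yourself before comparing them to thresholds.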

## Next steps

In this guide, you learned recommendations for monitoring your Consul control and data plane.

To learn about monitoring the Consul host and instance resources, visit our [Monitoring best practices](/well-architected-framework/reliability/reliability-monitoring-service-to-service-communication-with-envoy) documentation.