Implement Circuit Breaker Pattern to Protect PD Leader #8678

Tema · 2024-10-03T00:19:23Z

Feature Request

Describe your feature request related problem

In large TiDB clusters with hundreds of tidb and tikv nodes, the PD leader can become overwhelmed during certain failure conditions, leading to a "retry storm" or other feedback loop scenarios. Once triggered, PD transitions into a metastable state and cannot recover autonomously, leaving the cluster in a degraded or unavailable state. Existing mechanisms, such as the one discussed in Issue #4480 and the PR #6834, introduce rate-limiting and backoff strategies but are insufficient to prevent PD from being overloaded by a high volume of traffic, even before reaching the server-side limits.

Describe the feature you'd like

I propose implementing a circuit breaker pattern to protect the PD leader from overloading due to retry storms or similar feedback loops. This circuit breaker would:

Actively monitor incoming requests through both gRPC and HTTP channels from TiDB, TiKV, TiFlash, CDC, and PD-ctl components.
Trip the circuit breaker when a predefined threshold of errors, retries, or resource exhaustion is detected, preventing further requests from overwhelming the PD leader.
Allow PD to enter a fail-fast state, limiting incoming traffic until the system has had time to recover or the underlying issue has been resolved.
Gradually restore normal operations by allowing a limited number of requests to flow through once the circuit has cooled down and conditions have stabilized.

This feature is especially critical for large clusters where a high number of pods can continuously hammer a single PD instance during failures, leading to cascading effects that worsen recovery times and overall cluster health.

Describe alternatives you've considered

While existing solutions like rate-limiting in Issue #4480 and PR #6834 provide some protection, they are reactive and dependent on the server-side limiter thresholds being hit. These protections do not adequately account for sudden traffic spikes or complex feedback loops that can overload PD before those thresholds are reached. A proactive circuit breaker would mitigate these scenarios by preemptively tripping before PD becomes overwhelmed, ensuring a smoother recovery process.

Teachability, Documentation, Adoption, Migration Strategy

Introducing the circuit breaker pattern would likely require adjustments in the client request logic across TiDB, TiKV, TiFlash, CDC, and PD-ctl components. This feature could be made configurable, allowing users to set custom thresholds and recovery parameters to fit their specific cluster sizes and workloads.

Documentation would need to include clear guidelines on:

How the circuit breaker operates across the different components.
Configurable options for tuning the circuit breaker thresholds.
Best practices for monitoring and adjusting the circuit breaker behavior to avoid false positives or unnecessary tripping.

Scenarios where this feature could be helpful include:

A large number of TiDB nodes restart simultaneously due to a failure, causing a surge of reconnection requests to the PD leader and rebuilding local region cache
Continuous retries from components like TiKV, TiDB, TiFlash or CDC during network partitions, causing feedback loops that overwhelm PD.

zhangjinpeng87 · 2024-10-03T16:51:52Z

@niubell PTAL

siddontang · 2024-10-14T20:57:29Z

how about TiKV or other services? do we also need to consider together?

Tema · 2024-10-30T21:03:50Z

how about TiKV or other services? do we also need to consider together?

@siddontang I agree. Given that a single key could be served only by on TiKV, we can get many tidb servers hammering single tikv server, so this looks like a similar problem as with PD.

rleungx · 2024-11-05T08:04:47Z

Stage 1: Introduce circuit breaker for go client

RFC ready
- introduce client circuit breaker rfcs#115
Implement circuit breaker in PD client
- Implement the basic framework.
- Wrap some of the current functions, like GetRegion with the circuit breaker.
Support changing settings dynamically through TiDB variables
- Make circuit breaker dynamically configurable.
- Add a new TiDB variable to switch on/off the circuit breaker, e.g., tidb_cb_pd_metadata_error_rate_threshold_pct.
Verify the functionality
- Add endless cases for the circuit breaker.
- All existing tests, after we enable the circuit breaker, can pass without a drawback.

The more details can be expanded when we implement it.

niubell · 2024-11-08T06:41:40Z

how about TiKV or other services? do we also need to consider together?

@siddontang Yes, the circuit-breaker pattern also applies to client-go(TiKV), this RFC only focuses on the high QPS domains/APIs in PD, and other components can reuse some base libs about circuit-breaker, cc @Tema

niubell · 2024-11-08T06:45:38Z

@niubell PTAL

@zhangjinpeng87 Thanks for the reminder, we have discussed details several times online before the formal RFC, LGTM

Tema added the type/feature-request Categorizes issue or PR as related to a new feature. label Oct 3, 2024

Tema mentioned this issue Oct 18, 2024

add retry limiter to backoff function tikv/client-go#1478

Open

Tema mentioned this issue Oct 31, 2024

introduce client circuit breaker tikv/rfcs#115

Open

rleungx pinned this issue Nov 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement Circuit Breaker Pattern to Protect PD Leader #8678

Implement Circuit Breaker Pattern to Protect PD Leader #8678

Tema commented Oct 3, 2024

zhangjinpeng87 commented Oct 3, 2024

siddontang commented Oct 14, 2024

Tema commented Oct 30, 2024

rleungx commented Nov 5, 2024

niubell commented Nov 8, 2024

niubell commented Nov 8, 2024

Implement Circuit Breaker Pattern to Protect PD Leader #8678

Implement Circuit Breaker Pattern to Protect PD Leader #8678

Comments

Tema commented Oct 3, 2024

Feature Request

Describe your feature request related problem

Describe the feature you'd like

Describe alternatives you've considered

Teachability, Documentation, Adoption, Migration Strategy

zhangjinpeng87 commented Oct 3, 2024

siddontang commented Oct 14, 2024

Tema commented Oct 30, 2024

rleungx commented Nov 5, 2024

Stage 1: Introduce circuit breaker for go client

niubell commented Nov 8, 2024

niubell commented Nov 8, 2024