Commit c346e9a

Fixes formatting inconsistencies and updates terminology in A100 proposal. Adds links, clarifies slow start implementation details, and aligns with linked A24 proposal.
Signed-off-by: anurag.ag <[email protected]>
1 parent 082fc33 commit c346e9a

2 files changed

+61
-44
lines changed

A100-client-side-weighted-round-robin-slow-start.md

Lines changed: 60 additions & 44 deletions
@@ -1,4 +1,4 @@
1-
A100: Client-side weighted round robin slow start configuration
1+
A100: Client-side weighted round-robin slow start configuration
22
----
33

44
* Author(s): [Anurag Agarwal](https://github.com/anuragagarwal561994)
@@ -10,40 +10,40 @@ A100: Client-side weighted round robin slow start configuration
1010
## Abstract
1111

1212
This proposal introduces an enhancement to the existing client-side weighted_round_robin (WRR) load balancing policy in
13-
gRPC by incorporating a configurable `slow_start_config` mechanism. The intent of this feature is to gradually increase
14-
traffic to backend endpoints that are newly introduced or have recently rejoined the cluster, allowing them time to warm
15-
up and reach their optimal performance level before handling their full share of traffic. This change increases system
16-
stability and resilience in environments with dynamic scaling and volatile workloads.
13+
gRPC by incorporating a configurable `slow_start_config` mechanism. This feature enables a controlled, gradual increase
14+
in traffic allocation to backend endpoints that are either newly introduced or recently rejoined the cluster. By
15+
gradually ramping up their traffic, this enhancement ensures backend endpoints have enough time to warm up, optimize
16+
performance, and stabilize before serving their full traffic share.
1717

18-
The design borrows from production-ready practices in other data plances such as Envoy, where gradual traffic ramp-up (
19-
slow start) is a [well-established technique][Envoy Slow Start Documentation] for avoiding performance degradation and
18+
The design borrows from production-ready practices in other data planes such as Envoy, where gradual traffic ramp-up
19+
(slow start) is a [well-established technique][Envoy Slow Start Documentation] for avoiding performance degradation and
2020
request failures during backend startup or recovery. The slow start feature gradually increases the traffic sent to
2121
newly added endpoints during a warmup period, allowing them to warm up their caches and establish connections before
22-
receiving full traffic load.
22+
receiving the full traffic load.
2323

2424
## Background
2525

2626
gRPC's WRR load balancing policy allows clients to route requests to backend endpoints in proportion to assigned
2727
weights. These weights are usually derived from backend metrics, such as CPU usage, QPS, and error rates published by
2828
the backend servers. This allows gRPC clients to adapt traffic distribution dynamically based on backend capacity.
2929

30-
The current WRR implementation has a blackout period mechanism that provides some warm-up functionality, but it's not
31-
sufficient for all cases, as described in detail below. When a new endpoint appears—either due to autoscaling,
32-
replacement, or recovery from a failure—the client still routes traffic to it based on weights that may not account for
33-
its initialization state. This can result in overloading the endpoint before it is fully initialized, negatively
34-
impacting response times and service reliability. It is especially problematic for systems with cold caches, JIT
35-
compilation warm-up delays, or dependency initialization steps.
30+
The current WRR implementation includes a blackout period with some warm-up functionality, but it falls short in various
31+
scenarios. When a new endpoint is introduced due to autoscaling, replacement, or recovery—clients still route traffic
32+
based on weights that don’t account for the endpoint’s initialization state. This can overwhelm the endpoint before it
33+
is fully stable, leading to degraded response times and reduced service reliability. It is especially problematic for
34+
systems with cold caches, JIT compilation warm-up delays, or dependency initialization steps.
3635

3736
In contrast, many modern systems adopt slow start strategies in load balancing to address these issues. These strategies
3837
allow endpoints to ramp up traffic gradually over a defined window, smoothing transitions and mitigating the risks of
39-
traffic spikes. Similar functionality exists in Envoy's load balancing policies, where slow start is implemented for
40-
round robin and least request policies.
38+
traffic spikes. Similar functionality exists in Envoy's load balancing policies, where a slow start is implemented for
39+
round-robin and the least request policies.
4140

4241
Introducing a `slow_start_config` configuration in gRPC WRR will offer these benefits within the native client policy,
4342
reducing reliance on external traffic-shaping mechanisms or manual intervention.
4443

4544
### Related Proposals:
4645

46+
* [gRFC A24][A24]
4747
* [gRFC A58][A58]
4848
* [gRFC A66][A66]
4949
* [gRFC A78][A78]
@@ -57,27 +57,28 @@ computed weights for endpoints during their warmup period, gradually increasing
5757

5858
### LB Policy Config and Parameters
5959

60-
The `weighted_round_robin` LB policy config will be extended to include slow start configuration:
60+
The `weighted_round_robin` [LB policy config][A24] will be extended to include slow start configuration:
6161

6262
```textproto
6363
message LoadBalancingConfig {
6464
oneof policy {
65-
ClientSideWeightedRoundRobin weighted_round_robin = 20 [json_name = "weighted_round_robin"];
65+
WeightedRoundRobinLbConfig weighted_round_robin = 20 [json_name = "weighted_round_robin"];
6666
}
6767
}
6868
69-
message ClientSideWeightedRoundRobin {
69+
message WeightedRoundRobinLbConfig {
7070
// ... existing fields ...
7171
7272
// Configuration for slow start feature
73-
SlowStartConfig slow_start_config = 8;
73+
SlowStartConfig slow_start_config = 7;
7474
}
7575
7676
message SlowStartConfig {
7777
// Represents the size of slow start window.
78-
// If set, the newly created endpoint remains in slow start mode starting from its creation time
78+
//
79+
// The newly created endpoint remains in slow start mode starting from its creation time
7980
// for the duration of slow start window.
80-
google.protobuf.Duration slow_start_window = 1;
81+
google.protobuf.Duration slow_start_window = 1; // Required
8182
8283
// This parameter controls the speed of traffic increase over the slow start window. Defaults to 1.0,
8384
// so that endpoint would get linearly increasing amount of traffic.
@@ -87,16 +88,16 @@ message SlowStartConfig {
8788
//
8889
// During slow start window, effective weight of an endpoint would be scaled with time factor and aggression:
8990
// ``new_weight = weight * max(min_weight_percent / 100, time_factor ^ (1 / aggression))``,
90-
// where ``time_factor=max(time_since_start_seconds, 1) / slow_start_window_seconds``.
91+
// where ``time_factor = max(time_since_start_seconds, 1) / slow_start_window_seconds``.
9192
//
9293
// As time progresses, more and more traffic would be sent to endpoint, which is in slow start window.
9394
// Once endpoint exits slow start, time_factor and aggression no longer affect its weight.
94-
google.protobuf.FloatValue aggression = 2;
95+
float aggression = 2;
9596
9697
// Configures the minimum percentage of the original weight that will be used for an endpoint
9798
// in slow start. This helps to avoid a scenario in which endpoints receive no traffic during the
98-
// slow start window. Valid range is between 0 and 100. If the value is not specified, the default is 10%.
99-
google.protobuf.UInt32Value min_weight_percent = 3;
99+
// slow start window. Valid range is [0.0, 100.0]. If the value is not specified, the default is 10%.
100+
float min_weight_percent = 3;
100101
}
101102
```
102103
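The scaling formula in the `SlowStartConfig` comments above can be sketched as follows. This is a minimal illustration and not the gRPC implementation: the function name `slow_start_scale` and its parameter names are hypothetical, the default of 10 for `min_weight_percent` follows the comment above, and clamping the result to 1.0 is an added assumption to keep the factor bounded for windows shorter than one second.

```python
def slow_start_scale(time_since_start_s: float,
                     window_s: float,
                     aggression: float = 1.0,
                     min_weight_percent: float = 10.0) -> float:
    """Return the factor applied to an endpoint's weight during slow start.

    Hypothetical helper that mirrors the formula in the proto comments:
      time_factor = max(time_since_start_seconds, 1) / slow_start_window_seconds
      scale       = max(min_weight_percent / 100, time_factor ^ (1 / aggression))
    """
    # Outside the window (or with no window configured) the weight is unscaled.
    if window_s <= 0 or time_since_start_s >= window_s:
        return 1.0
    time_factor = max(time_since_start_s, 1.0) / window_s
    scale = max(min_weight_percent / 100.0, time_factor ** (1.0 / aggression))
    # Assumption: never scale a weight above its original value.
    return min(scale, 1.0)


# With aggression = 1.0 the ramp is linear: halfway through a 60 s window
# the endpoint gets half of its weight.
halfway = slow_start_scale(30.0, 60.0)
# Right after becoming ready, the floor from min_weight_percent applies.
at_start = slow_start_scale(0.0, 60.0)
```

With `aggression` above 1.0 the factor rises faster early in the window; for example, `aggression = 2.0` gives the square root of `time_factor`.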

@@ -140,21 +141,22 @@ When an endpoint is not in the warmup period, the scale factor is set to 1.0, me
140141
without modification. This ensures that the slow start mechanism only affects endpoints during their initial warmup
141142
phase, after which they participate in normal load balancing based on their actual performance metrics.
142143

143-
### Blackout Period vs Slow Start
144+
### Blackout Period vs. Slow Start
144145

145146
The WRR load balancing policy will offer two independent mechanisms for handling new endpoints: the blackout period and
146147
slow start. These mechanisms can be used independently or in combination, allowing operators to choose the approach that
147148
best fits their needs.
148149

149-
The blackout period, which defaults to 10 seconds, begins when an endpoint receives its first non-zero load report (
150-
tracked by `non_empty_since` timestamp). During this period, the endpoint continues to receive traffic, but instead of
151-
using the weights reported by the backend servers, the load balancer uses the mean of all backend-reported weights. This
152-
period helps prevent churn in the load balancing decisions when the set of endpoint addresses changes, ensuring that the
153-
weights used are based on stable, continuous load reporting.
150+
The blackout period begins when an endpoint’s first nonzero load report is observed (tracked by `non_empty_since`) and
151+
defaults to 10 seconds. Traffic is still routed to the endpoint during this time, but the load balancer ignores
152+
per‑endpoint reported weights and uses the average of all backend-reported weights instead. This reduces churn in
153+
load‑balancing decisions as the set of endpoints changes and allows time for weight reports to become stable before they
154+
are used.
154155

155156
The slow start period begins when an endpoint transitions to ready state (tracked by `ready_since` timestamp) and
156157
applies a gradual scaling factor to the weights over a configurable duration. This scaling is applied to whatever weight
157-
is being used (either the mean weight during blackout period or the actual backend-reported weight after blackout
158+
is being used (either the mean weight during the blackout period or the actual backend-reported weight after the
159+
blackout
158160
period). The slow start period operates independently of the blackout period, meaning it will continue to scale the
159161
weights regardless of whether the blackout period is still active or has ended.
160162
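How the two mechanisms compose can be sketched as below, assuming the behavior described above: the blackout period selects which base weight is used, and slow start scales that base weight independently. The function `effective_weight`, its parameters, and the choice to treat a missing `non_empty_since` as still being in blackout are illustrative assumptions and not part of the proposal text.

```python
from typing import Optional


def effective_weight(reported_weight: float,
                     mean_weight: float,
                     now_s: float,
                     non_empty_since_s: Optional[float],
                     ready_since_s: float,
                     blackout_s: float = 10.0,
                     window_s: float = 60.0,
                     aggression: float = 1.0,
                     min_weight_percent: float = 10.0) -> float:
    """Hypothetical sketch: choose the base weight, then apply slow start."""
    # Blackout: until load reports have been non-empty for blackout_s seconds,
    # use the mean of all backend-reported weights instead of this endpoint's.
    in_blackout = (non_empty_since_s is None
                   or now_s - non_empty_since_s < blackout_s)
    base = mean_weight if in_blackout else reported_weight

    # Slow start: measured from ready_since, independent of the blackout state.
    elapsed = now_s - ready_since_s
    if window_s <= 0 or elapsed >= window_s:
        return base
    time_factor = max(elapsed, 1.0) / window_s
    scale = max(min_weight_percent / 100.0, time_factor ** (1.0 / aggression))
    return base * min(scale, 1.0)


# Endpoint became ready and started reporting at t = 0 with a 20 s window.
during_blackout = effective_weight(200.0, 100.0, 5.0, 0.0, 0.0, window_s=20.0)
after_blackout = effective_weight(200.0, 100.0, 15.0, 0.0, 0.0, window_s=20.0)
after_window = effective_weight(200.0, 100.0, 30.0, 0.0, 0.0, window_s=20.0)
```

In this example the mean weight (100) is scaled while the blackout is active, the reported weight (200) is scaled once the blackout ends, and the reported weight is used unmodified after the slow start window.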

@@ -181,11 +183,12 @@ expected.
181183

182184
These mechanisms can be configured in different ways:
183185

184-
- Using only blackout period: Ensures stable weight reporting by using mean weights before switching to backend weights
186+
- Using only the blackout period: Ensures stable weight reporting by using mean weights before switching to backend
187+
weights
185188
- Using only slow start: Allows immediate use of backend weights but scales them gradually
186189
- Using both: Provides both stable weight reporting and gradual scaling
187-
- The slow start period will scale the mean weights during blackout period
188-
- After blackout period ends, it will continue to scale the actual backend-reported weights
190+
- The slow start period will scale the mean weights during the blackout period
191+
- After the blackout period ends, it will continue to scale the actual backend-reported weights
189192

190193
This flexible design allows operators to tune the behavior based on their specific needs, whether they want to
191194
prioritize stable weight reporting, faster weight adoption, or gradual traffic ramp-up.
@@ -223,14 +226,14 @@ function get_final_weight(endpoint_weight: float, scaling_factor: float) -> floa
223226

224227
### xDS Integration
225228

226-
The slow start configuration will be added to the xDS proto for the weighted round robin policy:
229+
The slow start configuration will be added to the xDS proto for the weighted round-robin policy:
227230

228231
```textproto
229232
package envoy.extensions.load_balancing_policies.client_side_weighted_round_robin.v3;
230233
231234
message ClientSideWeightedRoundRobin {
232235
// ... existing fields ...
233-
cluster.v3.Cluster.SlowStartConfig slow_start_config = 8;
236+
common.v3.SlowStartConfig slow_start_config = 8;
234237
}
235238
236239
message SlowStartConfig {
@@ -242,14 +245,25 @@ message SlowStartConfig {
242245

243246
xDS PR: https://github.com/envoyproxy/envoy/pull/40090
244247

248+
#### Transforming xDS message to gRPC service config
249+
250+
The gRPC client converts the xDS policy config into the gRPC service config format defined
251+
in [LB Policy Config](#lb-policy-config-and-parameters). For the slow start fields, the transformation is as follows:
252+
253+
* `ClientSideWeightedRoundRobin.slow_start_config` -> `LoadBalancingConfig.weighted_round_robin.slow_start_config`
254+
* `SlowStartConfig.slow_start_window` -> `SlowStartConfig.slow_start_window`
255+
* `SlowStartConfig.aggression` -> `SlowStartConfig.aggression`
256+
* `SlowStartConfig.min_weight_percent.value` -> `SlowStartConfig.min_weight_percent`
257+
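The field mapping listed above can be illustrated with plain dictionaries. This is only a sketch under stated assumptions: the inputs are JSON-shaped dictionaries rather than real protobuf messages, the function name `xds_to_service_config` is hypothetical, and durations are passed through unchanged as strings such as `"60s"`.

```python
def xds_to_service_config(xds_wrr: dict) -> dict:
    """Map slow start fields from an xDS-shaped dict to a service-config-shaped dict.

    Hypothetical helper; it only covers the slow start fields listed above.
    """
    policy = {}
    xds_slow_start = xds_wrr.get("slow_start_config")
    if xds_slow_start is not None:
        slow_start = {}
        if "slow_start_window" in xds_slow_start:
            slow_start["slow_start_window"] = xds_slow_start["slow_start_window"]
        if "aggression" in xds_slow_start:
            slow_start["aggression"] = xds_slow_start["aggression"]
        if "min_weight_percent" in xds_slow_start:
            # The xDS field wraps the number in a message with a `value` field.
            slow_start["min_weight_percent"] = xds_slow_start["min_weight_percent"]["value"]
        policy["slow_start_config"] = slow_start
    return {"weighted_round_robin": policy}


converted = xds_to_service_config({
    "slow_start_config": {
        "slow_start_window": "60s",
        "aggression": 1.0,
        "min_weight_percent": {"value": 10.0},
    }
})
```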
245258
### Metrics
246259

247260
The following metric will be exposed to help monitor the slow start behavior:
248261

249262
`grpc.lb.wrr.endpoints_in_slow_start`
250263

251264
- Type: Counter
252-
- Description: Number of endpoints currently in slow start period
265+
- Description: Number of endpoints currently in the slow start period. This is incremented when a new scheduler is
266+
created.
253267
- Labels:
254268
- `grpc.lb.locality`: The locality of the endpoints [gRFC A78][A78]
255269
- `grpc.lb.backend_service`: The backend service name [gRFC A89][A89]
@@ -280,7 +294,7 @@ The slow start feature is most effective in scenarios where:
280294

281295
- Few new endpoints are added at a time (e.g., scale events in Kubernetes)
282296
- Endpoints need time to warm up caches or establish connections
283-
- The system has sufficient traffic to gradually increase load
297+
- The system has enough traffic to gradually increase the load
284298

285299
The feature may be less effective when:
286300

@@ -296,14 +310,14 @@ In these cases, the slow start feature may lead to:
296310

297311
### Scope and Limitations
298312

299-
This proposal specifically focuses on implementing slow start for the weighted round robin load balancing policy. While
300-
similar slow start functionality could potentially be implemented for other load balancing algorithms like Round Robin
301-
and Least Request, these are not included in this proposal for the following reasons:
313+
This proposal only adds a slow start to weighted round-robin. While similar slow start functionality could potentially
314+
be implemented for other load balancing algorithms like Round Robin and Least Request, these are not included in this
315+
proposal for the following reasons:
302316

303317
1. These algorithms don't use weights to determine endpoint selection, making the implementation of slow start more
304318
complex
305319
2. Additional considerations would be needed for how to gradually increase traffic to new endpoints in these algorithms
306-
3. The implementation details would likely differ significantly from the weighted round robin approach
320+
3. The implementation details would likely differ significantly from the weighted round-robin approach
307321

308322
These other load balancing algorithms can be considered for slow start implementation in future proposals, with their
309323
own specific design considerations and requirements.
@@ -325,3 +339,5 @@ Java Implementation: https://github.com/grpc/grpc-java/pull/12200
325339
[A79]: A79-non-per-call-metrics-architecture.md
326340

327341
[A89]: A89-backend-service-metric-label.md
342+
343+
[A24]: A24-lb-policy-config.md

A24-lb-policy-config.md

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@ Load Balancing Policy Configuration
66
* Implemented in: C-core
77
* Last updated: 2018-12-05
88
* Discussion at: https://groups.google.com/d/topic/grpc-io/K03NV5H8HoE/discussion
9+
* Updated By: [A100-client-side-weighted-round-robin-slow-start.md](A100-client-side-weighted-round-robin-slow-start.md)
910

1011
## Abstract
1112
