1- A100: Client-side weighted round robin slow start configuration
1+ A100: Client-side weighted round-robin slow start configuration
22----
33
44* Author(s): [Anurag Agarwal](https://github.com/anuragagarwal561994)
@@ -10,40 +10,40 @@ A100: Client-side weighted round robin slow start configuration
1010## Abstract
1111
1212This proposal introduces an enhancement to the existing client-side weighted_round_robin (WRR) load balancing policy in
13- gRPC by incorporating a configurable ` slow_start_config ` mechanism. The intent of this feature is to gradually increase
14- traffic to backend endpoints that are newly introduced or have recently rejoined the cluster, allowing them time to warm
15- up and reach their optimal performance level before handling their full share of traffic. This change increases system
16- stability and resilience in environments with dynamic scaling and volatile workloads.
13+ gRPC by incorporating a configurable ` slow_start_config ` mechanism. This feature enables a controlled, gradual increase
14+ in traffic allocation to backend endpoints that are either newly introduced or recently rejoined the cluster. By
15+ gradually ramping up their traffic, this enhancement ensures backend endpoints have enough time to warm up, optimize
16+ performance, and stabilize before serving their full traffic share.
1717
18- The design borrows from production-ready practices in other data plances such as Envoy, where gradual traffic ramp-up (
19- slow start) is a [ well-established technique] [ Envoy Slow Start Documentation ] for avoiding performance degradation and
18+ The design borrows from production-ready practices in other data planes such as Envoy, where gradual traffic ramp-up
19+ ( slow start) is a [ well-established technique] [ Envoy Slow Start Documentation ] for avoiding performance degradation and
2020request failures during backend startup or recovery. The slow start feature gradually increases the traffic sent to
2121newly added endpoints during a warmup period, allowing them to warm up their caches and establish connections before
22- receiving full traffic load.
22+ receiving the full traffic load.
2323
2424## Background
2525
2626gRPC's WRR load balancing policy allows clients to route requests to backend endpoints in proportion to assigned
2727weights. These weights are usually derived from backend metrics, such as CPU usage, QPS, and error rates published by
2828the backend servers. This allows gRPC clients to adapt traffic distribution dynamically based on backend capacity.
2929
30- The current WRR implementation has a blackout period mechanism that provides some warm-up functionality, but it's not
31- sufficient for all cases, as described in detail below. When a new endpoint appears—either due to autoscaling,
32- replacement, or recovery from a failure—the client still routes traffic to it based on weights that may not account for
33- its initialization state. This can result in overloading the endpoint before it is fully initialized, negatively
34- impacting response times and service reliability. It is especially problematic for systems with cold caches, JIT
35- compilation warm-up delays, or dependency initialization steps.
30+ The current WRR implementation includes a blackout period with some warm-up functionality, but it falls short in various
31+ scenarios. When a new endpoint is introduced due to autoscaling, replacement, or recovery, clients still route traffic
32+ based on weights that don’t account for the endpoint’s initialization state. This can overwhelm the endpoint before it
33+ is fully stable, leading to degraded response times and reduced service reliability. It is especially problematic for
34+ systems with cold caches, JIT compilation warm-up delays, or dependency initialization steps.
3635
3736In contrast, many modern systems adopt slow start strategies in load balancing to address these issues. These strategies
3837allow endpoints to ramp up traffic gradually over a defined window, smoothing transitions and mitigating the risks of
39- traffic spikes. Similar functionality exists in Envoy's load balancing policies, where slow start is implemented for
40- round robin and least request policies.
38+ traffic spikes. Similar functionality exists in Envoy's load balancing policies, where slow start is implemented for
39+ the round-robin and least-request policies.
4140
4241Introducing a ` slow_start_config ` configuration in gRPC WRR will offer these benefits within the native client policy,
4342reducing reliance on external traffic-shaping mechanisms or manual intervention.
4443
4544### Related Proposals:
4645
46+ * [ gRFC A24] [ A24 ]
4747* [ gRFC A58] [ A58 ]
4848* [ gRFC A66] [ A66 ]
4949* [ gRFC A78] [ A78 ]
@@ -57,27 +57,28 @@ computed weights for endpoints during their warmup period, gradually increasing
5757
5858### LB Policy Config and Parameters
5959
60- The `weighted_round_robin` LB policy config will be extended to include slow start configuration:
60+ The `weighted_round_robin` [LB policy config][A24] will be extended to include slow start configuration:
6161
6262``` textproto
6363message LoadBalancingConfig {
6464 oneof policy {
65- ClientSideWeightedRoundRobin weighted_round_robin = 20 [json_name = "weighted_round_robin"];
65+ WeightedRoundRobinLbConfig weighted_round_robin = 20 [json_name = "weighted_round_robin"];
6666 }
6767}
6868
69- message ClientSideWeightedRoundRobin {
69+ message WeightedRoundRobinLbConfig {
7070 // ... existing fields ...
7171
7272 // Configuration for slow start feature
73- SlowStartConfig slow_start_config = 8;
73+ SlowStartConfig slow_start_config = 7;
7474}
7575
7676message SlowStartConfig {
7777 // Represents the size of slow start window.
78- // If set, the newly created endpoint remains in slow start mode starting from its creation time
78+ //
79+ // The newly created endpoint remains in slow start mode starting from its creation time
7980 // for the duration of slow start window.
80- google.protobuf.Duration slow_start_window = 1;
81+ google.protobuf.Duration slow_start_window = 1; // Required
8182
8283 // This parameter controls the speed of traffic increase over the slow start window. Defaults to 1.0,
8384 // so that endpoint would get linearly increasing amount of traffic.
@@ -87,16 +88,16 @@ message SlowStartConfig {
8788 //
8889 // During slow start window, effective weight of an endpoint would be scaled with time factor and aggression:
8990 // ``new_weight = weight * max(min_weight_percent / 100, time_factor ^ (1 / aggression))``,
90- // where ``time_factor= max(time_since_start_seconds, 1) / slow_start_window_seconds``.
91+ // where ``time_factor = max(time_since_start_seconds, 1) / slow_start_window_seconds``.
9192 //
9293 // As time progresses, more and more traffic would be sent to endpoint, which is in slow start window.
9394 // Once endpoint exits slow start, time_factor and aggression no longer affect its weight.
94- google.protobuf.FloatValue aggression = 2;
95+ float aggression = 2;
9596
9697 // Configures the minimum percentage of the original weight that will be used for an endpoint
9798 // in slow start. This helps to avoid a scenario in which endpoints receive no traffic during the
98- // slow start window. Valid range is between 0 and 100. If the value is not specified, the default is 10%.
99- google.protobuf.UInt32Value min_weight_percent = 3;
99+ // slow start window. Valid range is [0.0, 100.0]. If the value is not specified, the default is 10%.
100+ float min_weight_percent = 3;
100101}
101102```
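
The scaling rule described in the `SlowStartConfig` comments can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `slow_start_scale` and its parameter names are hypothetical, durations are plain seconds rather than `google.protobuf.Duration`, and the final clamp to 1.0 is an added safeguard that the formula itself does not state.

```python
def slow_start_scale(time_since_start_s: float,
                     window_s: float,
                     aggression: float = 1.0,
                     min_weight_percent: float = 10.0) -> float:
    """Return the factor applied to an endpoint's weight (hypothetical helper)."""
    # Outside the slow start window the weight is used unmodified.
    if window_s <= 0 or time_since_start_s >= window_s:
        return 1.0
    # time_factor = max(time_since_start_seconds, 1) / slow_start_window_seconds
    time_factor = max(time_since_start_s, 1.0) / window_s
    # new_weight = weight * max(min_weight_percent / 100, time_factor ^ (1 / aggression))
    scale = max(min_weight_percent / 100.0, time_factor ** (1.0 / aggression))
    # Assumption: never scale a weight above its original value.
    return min(1.0, scale)


# Halfway through a 30 s window with linear ramp-up (aggression = 1.0).
print(slow_start_scale(15.0, 30.0))
# Right after start the 10% floor applies instead of 1/30.
print(slow_start_scale(0.0, 30.0))
```

With `aggression` above 1.0 the exponent `1 / aggression` is below 1, so the factor rises faster early in the window; values below 1.0 delay the ramp-up.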
102103
@@ -140,21 +141,22 @@ When an endpoint is not in the warmup period, the scale factor is set to 1.0, me
140141without modification. This ensures that the slow start mechanism only affects endpoints during their initial warmup
141142phase, after which they participate in normal load balancing based on their actual performance metrics.
142143
143- ### Blackout Period vs Slow Start
144+ ### Blackout Period vs. Slow Start
144145
145146The WRR load balancing policy will offer two independent mechanisms for handling new endpoints: the blackout period and
146147slow start. These mechanisms can be used independently or in combination, allowing operators to choose the approach that
147148best fits their needs.
148149
149- The blackout period, which defaults to 10 seconds, begins when an endpoint receives its first non-zero load report (
150- tracked by `non_empty_since` timestamp). During this period, the endpoint continues to receive traffic, but instead of
151- using the weights reported by the backend servers, the load balancer uses the mean of all backend-reported weights. This
152- period helps prevent churn in the load balancing decisions when the set of endpoint addresses changes, ensuring that the
153- weights used are based on stable, continuous load reporting.
150+ The blackout period begins when an endpoint’s first non‑zero load report is observed (tracked by `non_empty_since`) and
151+ defaults to 10 seconds. Traffic is still routed to the endpoint during this time, but the load balancer ignores
152+ per‑endpoint reported weights and uses the average of all backend‑reported weights instead. This reduces churn in
153+ load‑balancing decisions as the set of endpoints changes and allows time for weight reports to become stable before they
154+ are used.
154155
155156The slow start period begins when an endpoint transitions to ready state (tracked by ` ready_since ` timestamp) and
156157applies a gradual scaling factor to the weights over a configurable duration. This scaling is applied to whatever weight
157- is being used (either the mean weight during blackout period or the actual backend-reported weight after blackout
158+ is being used (either the mean weight during the blackout period or the actual backend-reported weight after the
159+ blackout
158160period). The slow start period operates independently of the blackout period, meaning it will continue to scale the
159161weights regardless of whether the blackout period is still active or has ended.
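
How the two mechanisms compose can be shown with a small sketch. It is an illustration only: the function names, the use of plain float timestamps in seconds, and the choice to treat an endpoint with no load report yet as still in blackout are assumptions, and the ramp is the linear case (aggression 1.0) with the default 10% floor.

```python
from typing import Optional


def base_weight(now: float, non_empty_since: Optional[float],
                reported: float, mean: float, blackout_s: float = 10.0) -> float:
    """Pick the weight before slow start scaling (hypothetical helper)."""
    # Assumption: no load report yet is treated the same as being in blackout.
    in_blackout = non_empty_since is None or now - non_empty_since < blackout_s
    return mean if in_blackout else reported


def linear_scale(now: float, ready_since: float, window_s: float,
                 min_weight_percent: float = 10.0) -> float:
    """Linear slow start factor measured from ready_since (aggression = 1.0)."""
    elapsed = now - ready_since
    if elapsed >= window_s:
        return 1.0
    return max(min_weight_percent / 100.0, max(elapsed, 1.0) / window_s)


def effective_weight(now: float, ready_since: float, non_empty_since: Optional[float],
                     reported: float, mean: float,
                     blackout_s: float = 10.0, window_s: float = 20.0) -> float:
    # Slow start scales whichever weight is currently in use.
    return base_weight(now, non_empty_since, reported, mean, blackout_s) * \
        linear_scale(now, ready_since, window_s)


# Endpoint became ready and started reporting at t=0; reported weight 200, mean 100.
for t in (5.0, 15.0, 25.0):
    print(t, effective_weight(t, 0.0, 0.0, reported=200.0, mean=100.0))
```

At t=5 the endpoint is in both periods, so the mean weight is scaled; at t=15 the blackout has ended and the reported weight is scaled; at t=25 both periods are over and the reported weight is used as-is.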
160162
@@ -181,11 +183,12 @@ expected.
181183
182184These mechanisms can be configured in different ways:
183185
184- - Using only blackout period: Ensures stable weight reporting by using mean weights before switching to backend weights
186+ - Using only the blackout period: Ensures stable weight reporting by using mean weights before switching to backend
187+ weights
185188- Using only slow start: Allows immediate use of backend weights but scales them gradually
186189- Using both: Provides both stable weight reporting and gradual scaling
187- - The slow start period will scale the mean weights during blackout period
188- - After blackout period ends, it will continue to scale the actual backend-reported weights
190+ - The slow start period will scale the mean weights during the blackout period
191+ - After the blackout period ends, it will continue to scale the actual backend-reported weights
189192
190193This flexible design allows operators to tune the behavior based on their specific needs, whether they want to
191194prioritize stable weight reporting, faster weight adoption, or gradual traffic ramp-up.
@@ -223,14 +226,14 @@ function get_final_weight(endpoint_weight: float, scaling_factor: float) -> floa
223226
224227### xDS Integration
225228
226- The slow start configuration will be added to the xDS proto for the weighted round robin policy:
229+ The slow start configuration will be added to the xDS proto for the weighted round-robin policy:
227230
228231``` textproto
229232package envoy.extensions.load_balancing_policies.client_side_weighted_round_robin.v3;
230233
231234message ClientSideWeightedRoundRobin {
232235 // ... existing fields ...
233- cluster.v3.Cluster.SlowStartConfig slow_start_config = 8;
236+ common.v3.SlowStartConfig slow_start_config = 8;
234237}
235238
236239message SlowStartConfig {
@@ -242,14 +245,25 @@ message SlowStartConfig {
242245
243246xDS PR: https://github.com/envoyproxy/envoy/pull/40090
244247
248+ #### Transforming xDS message to gRPC service config
249+
250+ The gRPC client converts the xDS policy config into the gRPC service config format defined
251+ in [ LB Policy Config] ( #lb-policy-config-and-parameters ) . For the slow start fields, the transformation is as follows:
252+
253+ * ` ClientSideWeightedRoundRobin.slow_start_config ` -> ` LoadBalancingConfig.weighted_round_robin.slow_start_config `
254+ * ` SlowStartConfig.slow_start_window ` -> ` SlowStartConfig.slow_start_window `
255+ * ` SlowStartConfig.aggression ` -> ` SlowStartConfig.aggression `
256+ * ` SlowStartConfig.min_weight_percent.value ` -> ` SlowStartConfig.min_weight_percent `
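
The field mapping above could look roughly like the following sketch. Plain dictionaries stand in for the parsed xDS and service config messages, the function name is hypothetical, and the output keys simply reuse the proto field names from this document; a real implementation would operate on the generated proto types and apply its own validation.

```python
def xds_slow_start_to_service_config(xds_wrr: dict) -> dict:
    """Map the xDS slow start fields to the service config shape (hypothetical helper)."""
    xds_cfg = xds_wrr.get("slow_start_config")
    if xds_cfg is None:
        # No slow start configured: nothing to add to the WRR policy config.
        return {"weighted_round_robin": {}}

    # slow_start_window is copied through unchanged.
    out = {"slow_start_window": xds_cfg["slow_start_window"]}
    if "aggression" in xds_cfg:
        out["aggression"] = xds_cfg["aggression"]
    # min_weight_percent is a wrapper on the xDS side; unwrap its `value`.
    percent = xds_cfg.get("min_weight_percent")
    if percent is not None:
        out["min_weight_percent"] = percent["value"]
    return {"weighted_round_robin": {"slow_start_config": out}}


xds = {"slow_start_config": {"slow_start_window": "30s",
                             "aggression": 1.0,
                             "min_weight_percent": {"value": 10.0}}}
print(xds_slow_start_to_service_config(xds))
```

Fields that are absent on the xDS side are left out of the result so that the defaults described in the LB policy config apply.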
257+
245258### Metrics
246259
247260The following metric will be exposed to help monitor the slow start behavior:
248261
249262` grpc.lb.wrr.endpoints_in_slow_start `
250263
251264- Type: Counter
252- - Description: Number of endpoints currently in slow start period
265+ - Description: Number of endpoints currently in the slow start period. This is incremented when a new scheduler is
266+ created.
253267- Labels:
254268 - ` grpc.lb.locality ` : The locality of the endpoints [ gRFC A78] [ A78 ]
255269 - ` grpc.lb.backend_service ` : The backend service name [ gRFC A89] [ A89 ]
@@ -280,7 +294,7 @@ The slow start feature is most effective in scenarios where:
280294
281295- Few new endpoints are added at a time (e.g., scale events in Kubernetes)
282296- Endpoints need time to warm up caches or establish connections
283- - The system has sufficient traffic to gradually increase load
297+ - The system has enough traffic to gradually increase the load
284298
285299The feature may be less effective when:
286300
@@ -296,14 +310,14 @@ In these cases, the slow start feature may lead to:
296310
297311### Scope and Limitations
298312
299- This proposal specifically focuses on implementing slow start for the weighted round robin load balancing policy. While
300- similar slow start functionality could potentially be implemented for other load balancing algorithms like Round Robin
301- and Least Request, these are not included in this proposal for the following reasons:
313+ This proposal only adds slow start to weighted round-robin. While similar slow start functionality could potentially
314+ be implemented for other load balancing algorithms like Round Robin and Least Request, these are not included in this
315+ proposal for the following reasons:
302316
3033171 . These algorithms don't use weights to determine endpoint selection, making the implementation of slow start more
304318 complex
3053192 . Additional considerations would be needed for how to gradually increase traffic to new endpoints in these algorithms
306- 3 . The implementation details would likely differ significantly from the weighted round robin approach
320+ 3. The implementation details would likely differ significantly from the weighted round-robin approach
307321
308322These other load balancing algorithms can be considered for slow start implementation in future proposals, with their
309323own specific design considerations and requirements.
@@ -325,3 +339,5 @@ Java Implementation: https://github.com/grpc/grpc-java/pull/12200
325339[A79]: A79-non-per-call-metrics-architecture.md
326340
327341[A89]: A89-backend-service-metric-label.md
342+
343+ [A24]: A24-lb-policy-config.md