Fixes formatting inconsistencies and updates terminology in the A100 proposal. Adds links, clarifies slow start implementation details, and aligns the proposal with the linked A24 proposal.
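
For reviewers, the weight scaling documented in the `SlowStartConfig` comments (`new_weight = weight * max(min_weight_percent / 100, time_factor ^ (1 / aggression))`) can be sketched as below. This is only an illustration of the formula as written in the proposal, not the code from any gRPC implementation. The function and parameter names are hypothetical, and the clamp to 1.0 is an added assumption to cover windows shorter than one second.

```python
def slow_start_scale(time_since_ready_s: float,
                     window_s: float,
                     aggression: float = 1.0,
                     min_weight_percent: float = 10.0) -> float:
    """Return the factor applied to an endpoint's weight during slow start.

    Hypothetical helper: names and the upper clamp are assumptions made
    for illustration; only the formula itself comes from the proposal.
    """
    # Outside the warmup window (or with no window), weights are not scaled.
    if window_s <= 0 or time_since_ready_s >= window_s:
        return 1.0
    # time_factor = max(time_since_start_seconds, 1) / slow_start_window_seconds
    time_factor = max(time_since_ready_s, 1.0) / window_s
    scale = max(min_weight_percent / 100.0, time_factor ** (1.0 / aggression))
    # Assumption: never scale a weight above its unscaled value.
    return min(1.0, scale)


def scaled_weight(weight: float, scale: float) -> float:
    """Apply the slow start factor to whichever weight is currently in use."""
    return weight * scale


if __name__ == "__main__":
    # 60 s window with default aggression and a 10% floor.
    for t in (0, 15, 30, 60):
        print(t, scaled_weight(100.0, slow_start_scale(t, 60)))
```

With the defaults, an endpoint starts at the `min_weight_percent` floor and ramps linearly; an `aggression` above 1.0 raises the factor earlier in the window.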

anuragagarwal561994 · anuragagarwal561994 · commit c346e9ab012a · 2025-08-29T07:08:18.000+05:30
Signed-off-by: anurag.ag <anuragagarwal561994@users.noreply.github.com>
diff --git a/A100-client-side-weighted-round-robin-slow-start.md b/A100-client-side-weighted-round-robin-slow-start.md
@@ -1,4 +1,4 @@
-A100: Client-side weighted round robin slow start configuration
+A100: Client-side weighted round-robin slow start configuration
 ----
 
 * Author(s): [Anurag Agarwal](https://github.com/anuragagarwal561994)
@@ -10,40 +10,40 @@ A100: Client-side weighted round robin slow start configuration
 ## Abstract
 
 This proposal introduces an enhancement to the existing client-side weighted_round_robin (WRR) load balancing policy in
-gRPC by incorporating a configurable `slow_start_config` mechanism. The intent of this feature is to gradually increase
-traffic to backend endpoints that are newly introduced or have recently rejoined the cluster, allowing them time to warm
-up and reach their optimal performance level before handling their full share of traffic. This change increases system
-stability and resilience in environments with dynamic scaling and volatile workloads.
+gRPC by incorporating a configurable `slow_start_config` mechanism. This feature enables a controlled, gradual increase
+in traffic allocation to backend endpoints that are newly introduced or have recently rejoined the cluster. Ramping up
+their traffic gradually gives these endpoints time to warm up and stabilize before they serve their full share of
+traffic.
 
-The design borrows from production-ready practices in other data plances such as Envoy, where gradual traffic ramp-up (
-slow start) is a [well-established technique][Envoy Slow Start Documentation] for avoiding performance degradation and
+The design borrows from production-ready practices in other data planes such as Envoy, where gradual traffic ramp-up
+(slow start) is a [well-established technique][Envoy Slow Start Documentation] for avoiding performance degradation and
 request failures during backend startup or recovery. The slow start feature gradually increases the traffic sent to
 newly added endpoints during a warmup period, allowing them to warm up their caches and establish connections before
-receiving full traffic load.
+receiving the full traffic load.
 
 ## Background
 
 gRPC's WRR load balancing policy allows clients to route requests to backend endpoints in proportion to assigned
 weights. These weights are usually derived from backend metrics, such as CPU usage, QPS, and error rates published by
 the backend servers. This allows gRPC clients to adapt traffic distribution dynamically based on backend capacity.
 
-The current WRR implementation has a blackout period mechanism that provides some warm-up functionality, but it's not
-sufficient for all cases, as described in detail below. When a new endpoint appears—either due to autoscaling,
-replacement, or recovery from a failure—the client still routes traffic to it based on weights that may not account for
-its initialization state. This can result in overloading the endpoint before it is fully initialized, negatively
-impacting response times and service reliability. It is especially problematic for systems with cold caches, JIT
-compilation warm-up delays, or dependency initialization steps.
+The current WRR implementation includes a blackout period with some warm-up functionality, but it falls short in some
+scenarios, as described below. When a new endpoint is introduced due to autoscaling, replacement, or recovery, clients
+still route traffic based on weights that don't account for the endpoint's initialization state. This can overwhelm the
+endpoint before it is fully initialized, leading to degraded response times and reduced service reliability. It is
+especially problematic for systems with cold caches, JIT compilation warm-up delays, or dependency initialization steps.
 
 In contrast, many modern systems adopt slow start strategies in load balancing to address these issues. These strategies
 allow endpoints to ramp up traffic gradually over a defined window, smoothing transitions and mitigating the risks of
-traffic spikes. Similar functionality exists in Envoy's load balancing policies, where slow start is implemented for
-round robin and least request policies.
+traffic spikes. Similar functionality exists in Envoy's load balancing policies, where slow start is implemented for the
+round-robin and least request policies.
 
 Introducing a `slow_start_config` configuration in gRPC WRR will offer these benefits within the native client policy,
 reducing reliance on external traffic-shaping mechanisms or manual intervention.
 
 ### Related Proposals:
 
+* [gRFC A24][A24]
 * [gRFC A58][A58]
 * [gRFC A66][A66]
 * [gRFC A78][A78]
@@ -57,27 +57,28 @@ computed weights for endpoints during their warmup period, gradually increasing
 
 ### LB Policy Config and Parameters
 
-The `weighted_round_robin` LB policy config will be extended to include slow start configuration:
+The `weighted_round_robin` [LB policy config][A24] will be extended to include slow start configuration:
 
 ```textproto
 message LoadBalancingConfig {
   oneof policy {
-    ClientSideWeightedRoundRobin weighted_round_robin = 20 [json_name = "weighted_round_robin"];
+    WeightedRoundRobinLbConfig weighted_round_robin = 20 [json_name = "weighted_round_robin"];
   }
 }
 
-message ClientSideWeightedRoundRobin {
+message WeightedRoundRobinLbConfig {
   // ... existing fields ...
 
   // Configuration for slow start feature
-  SlowStartConfig slow_start_config = 8;
+  SlowStartConfig slow_start_config = 7;
 }
 
 message SlowStartConfig {
   // Represents the size of slow start window.
-  // If set, the newly created endpoint remains in slow start mode starting from its creation time
+  //
+  // The newly created endpoint remains in slow start mode starting from its creation time
   // for the duration of slow start window.
-  google.protobuf.Duration slow_start_window = 1;
+  google.protobuf.Duration slow_start_window = 1; // Required
 
   // This parameter controls the speed of traffic increase over the slow start window. Defaults to 1.0,
   // so that endpoint would get linearly increasing amount of traffic.
@@ -87,16 +88,16 @@ message SlowStartConfig {
   //
   // During slow start window, effective weight of an endpoint would be scaled with time factor and aggression:
   // ``new_weight = weight * max(min_weight_percent / 100, time_factor ^ (1 / aggression))``,
-  // where ``time_factor=max(time_since_start_seconds, 1) / slow_start_window_seconds``.
+  // where ``time_factor = max(time_since_start_seconds, 1) / slow_start_window_seconds``.
   //
   // As time progresses, more and more traffic would be sent to endpoint, which is in slow start window.
   // Once endpoint exits slow start, time_factor and aggression no longer affect its weight.
-  google.protobuf.FloatValue aggression = 2;
+  float aggression = 2;
 
   // Configures the minimum percentage of the original weight that will be used for an endpoint
   // in slow start. This helps to avoid a scenario in which endpoints receive no traffic during the
-  // slow start window. Valid range is between 0 and 100. If the value is not specified, the default is 10%.
-  google.protobuf.UInt32Value min_weight_percent = 3;
+  // slow start window. Valid range is [0.0, 100.0]. If the value is not specified, the default is 10%.
+  float min_weight_percent = 3;
 }
 ```
 
@@ -140,21 +141,22 @@ When an endpoint is not in the warmup period, the scale factor is set to 1.0, me
 without modification. This ensures that the slow start mechanism only affects endpoints during their initial warmup
 phase, after which they participate in normal load balancing based on their actual performance metrics.
 
-### Blackout Period vs Slow Start
+### Blackout Period vs. Slow Start
 
 The WRR load balancing policy will offer two independent mechanisms for handling new endpoints: the blackout period and
 slow start. These mechanisms can be used independently or in combination, allowing operators to choose the approach that
 best fits their needs.
 
-The blackout period, which defaults to 10 seconds, begins when an endpoint receives its first non-zero load report (
-tracked by `non_empty_since` timestamp). During this period, the endpoint continues to receive traffic, but instead of
-using the weights reported by the backend servers, the load balancer uses the mean of all backend-reported weights. This
-period helps prevent churn in the load balancing decisions when the set of endpoint addresses changes, ensuring that the
-weights used are based on stable, continuous load reporting.
+The blackout period begins when an endpoint's first non-zero load report is observed (tracked by `non_empty_since`) and
+defaults to 10 seconds. Traffic is still routed to the endpoint during this time, but the load balancer ignores
+per-endpoint reported weights and uses the average of all backend-reported weights instead. This reduces churn in
+load-balancing decisions as the set of endpoints changes and allows time for weight reports to become stable before they
+are used.
 
 The slow start period begins when an endpoint transitions to ready state (tracked by `ready_since` timestamp) and
 applies a gradual scaling factor to the weights over a configurable duration. This scaling is applied to whatever weight
-is being used (either the mean weight during blackout period or the actual backend-reported weight after blackout
+is being used (either the mean weight during the blackout period or the actual backend-reported weight after the
+blackout
 period). The slow start period operates independently of the blackout period, meaning it will continue to scale the
 weights regardless of whether the blackout period is still active or has ended.
 
@@ -181,11 +183,12 @@ expected.
 
 These mechanisms can be configured in different ways:
 
-- Using only blackout period: Ensures stable weight reporting by using mean weights before switching to backend weights
+- Using only the blackout period: Ensures stable weight reporting by using mean weights before switching to backend
+  weights
 - Using only slow start: Allows immediate use of backend weights but scales them gradually
 - Using both: Provides both stable weight reporting and gradual scaling
-    - The slow start period will scale the mean weights during blackout period
-    - After blackout period ends, it will continue to scale the actual backend-reported weights
+    - The slow start period will scale the mean weights during the blackout period
+    - After the blackout period ends, it will continue to scale the actual backend-reported weights
 
 This flexible design allows operators to tune the behavior based on their specific needs, whether they want to
 prioritize stable weight reporting, faster weight adoption, or gradual traffic ramp-up.
@@ -223,14 +226,14 @@ function get_final_weight(endpoint_weight: float, scaling_factor: float) -> floa
 
 ### xDS Integration
 
-The slow start configuration will be added to the xDS proto for the weighted round robin policy:
+The slow start configuration will be added to the xDS proto for the weighted round-robin policy:
 
 ```textproto
 package envoy.extensions.load_balancing_policies.client_side_weighted_round_robin.v3;
 
 message ClientSideWeightedRoundRobin {
   // ... existing fields ...
-  cluster.v3.Cluster.SlowStartConfig slow_start_config = 8;
+  common.v3.SlowStartConfig slow_start_config = 8;
 }
 
 message SlowStartConfig {
@@ -242,14 +245,25 @@ message SlowStartConfig {
 
 xDS PR: https://github.com/envoyproxy/envoy/pull/40090
 
+#### Transforming xDS message to gRPC service config
+
+The gRPC client converts the xDS policy config into the gRPC service config format defined
+in [LB Policy Config](#lb-policy-config-and-parameters). For the slow start fields, the transformation is as follows:
+
+* `ClientSideWeightedRoundRobin.slow_start_config` -> `LoadBalancingConfig.weighted_round_robin.slow_start_config`
+* `SlowStartConfig.slow_start_window` -> `SlowStartConfig.slow_start_window`
+* `SlowStartConfig.aggression` -> `SlowStartConfig.aggression`
+* `SlowStartConfig.min_weight_percent.value` -> `SlowStartConfig.min_weight_percent`
+
 ### Metrics
 
 The following metric will be exposed to help monitor the slow start behavior:
 
 `grpc.lb.wrr.endpoints_in_slow_start`
 
 - Type: Counter
-- Description: Number of endpoints currently in slow start period
+- Description: Number of endpoints currently in the slow start period. This is incremented when a new scheduler is
+  created.
 - Labels:
     - `grpc.lb.locality`: The locality of the endpoints [gRFC A78][A78]
     - `grpc.lb.backend_service`: The backend service name [gRFC A89][A89]
@@ -280,7 +294,7 @@ The slow start feature is most effective in scenarios where:
 
 - Few new endpoints are added at a time (e.g., scale events in Kubernetes)
 - Endpoints need time to warm up caches or establish connections
-- The system has sufficient traffic to gradually increase load
+- The system has enough traffic to gradually increase the load
 
 The feature may be less effective when:
 
@@ -296,14 +310,14 @@ In these cases, the slow start feature may lead to:
 
 ### Scope and Limitations
 
-This proposal specifically focuses on implementing slow start for the weighted round robin load balancing policy. While
-similar slow start functionality could potentially be implemented for other load balancing algorithms like Round Robin
-and Least Request, these are not included in this proposal for the following reasons:
+This proposal adds slow start only to the weighted round-robin policy. While similar slow start functionality could
+potentially be implemented for other load balancing algorithms like Round Robin and Least Request, these are not
+included in this proposal for the following reasons:
 
 1. These algorithms don't use weights to determine endpoint selection, making the implementation of slow start more
    complex
 2. Additional considerations would be needed for how to gradually increase traffic to new endpoints in these algorithms
-3. The implementation details would likely differ significantly from the weighted round robin approach
+3. The implementation details would likely differ significantly from the weighted round-robin approach
 
 These other load balancing algorithms can be considered for slow start implementation in future proposals, with their
 own specific design considerations and requirements.
@@ -325,3 +339,5 @@ Java Implementation: https://github.com/grpc/grpc-java/pull/12200
 [A79]: A79-non-per-call-metrics-architecture.md
 
 [A89]: A89-backend-service-metric-label.md
+
+[A24]: A24-lb-policy-config.md
diff --git a/A24-lb-policy-config.md b/A24-lb-policy-config.md
@@ -6,6 +6,7 @@ Load Balancing Policy Configuration
 * Implemented in: C-core
 * Last updated: 2018-12-05
 * Discussion at: https://groups.google.com/d/topic/grpc-io/K03NV5H8HoE/discussion
+* Updated By: [A100-client-side-weighted-round-robin-slow-start.md](A100-client-side-weighted-round-robin-slow-start.md)
 
 ## Abstract