@hugehoo hugehoo commented Oct 4, 2025

Fixes: #8516

The priority balancer's init timer was restarting when a child balancer received multiple CONNECTING state updates.
This caused unnecessary delays in failover to lower priority children.

RELEASE NOTES:

  • priority: Fix a bug that was resulting in increased failover time
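The restart described above can be illustrated with a toy model. This is a minimal sketch, not the code from this PR: the `child` type, the `connState` constants, and the `timerStarts` counter are all hypothetical stand-ins, and the real balancer tracks more state than this. It only shows the idea of comparing the old and new connectivity states so that a repeated CONNECTING update does not start the timer again.

```go
package main

import "fmt"

// connState is a hypothetical stand-in for a connectivity state.
type connState int

const (
	idle connState = iota
	connecting
	ready
	transientFailure
)

// child is a simplified, hypothetical model of one priority's child policy.
// timerStarts counts how often the init timer would have been (re)started.
type child struct {
	state       connState
	timerStarts int
}

// update records newState and starts the init timer only when the child
// enters connecting from some other state. A connecting -> connecting
// update leaves the running timer alone.
func (c *child) update(newState connState) {
	oldState := c.state
	c.state = newState
	if newState == connecting && oldState != connecting {
		c.timerStarts++
	}
}

func main() {
	c := &child{state: idle}
	for _, s := range []connState{connecting, connecting, connecting} {
		c.update(s)
	}
	fmt.Println(c.timerStarts) // prints 1
}
```

Without the `oldState != connecting` check, the same three updates would count three timer starts, which is the kind of repeated restart that delays failover.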


codecov bot commented Oct 4, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.97%. Comparing base (8389ddb) to head (d515983).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8628      +/-   ##
==========================================
- Coverage   82.13%   81.97%   -0.17%     
==========================================
  Files         415      415              
  Lines       40711    40712       +1     
==========================================
- Hits        33437    33372      -65     
- Misses       5897     5948      +51     
- Partials     1377     1392      +15     
Files with missing lines Coverage Δ
...nternal/xds/balancer/priority/balancer_priority.go 76.74% <100.00%> (-4.44%) ⬇️

... and 29 files with indirect coverage changes


@hugehoo hugehoo marked this pull request as ready for review October 6, 2025 15:39
@easwars easwars added this to the 1.77 Release milestone Oct 6, 2025
Comment on lines +2123 to +2126
defer func(old time.Duration) {
	DefaultPriorityInitTimeout = old
}(DefaultPriorityInitTimeout)
DefaultPriorityInitTimeout = defaultTestShortTimeout

Apologies for existing art like this.

Would you mind adding a helper function instead:

func overrideInitTimeout(t *testing.T, val time.Duration) {
  orig := DefaultPriorityInitTimeout
  DefaultPriorityInitTimeout = val
  t.Cleanup(func() { DefaultPriorityInitTimeout = orig })
}

Then all existing tests, and this new one, can override the init timeout with a one-liner.

// child-0 will be started, and will create a SubConn.
addrs0 := <-cc.NewSubConnAddrsCh
if got, want := addrs0[0].Addr, testBackendAddrStrs[0]; got != want {
t.Fatalf("got unexpected new subconn addr: %v, want %v", got, want)

Nit: How about a more readable error message like New subchannel created for address: %q, want: %q?


// handleChildStateUpdate start/close priorities based on the connectivity
// state.
func (b *priorityBalancer) handleChildStateUpdate(childName string, s balancer.State) {

While you are here, would you mind renaming s to newState? And then you can rename the new local variable to origState or oldState. That way, it is more explicit which state we are dealing with when

},
BalancerConfig: &LBConfig{
Children: map[string]*Child{
"child-0": {Config: &internalserviceconfig.BalancerConfig{Name: roundrobin.Name}},

Nit: Could we use pick_first instead of round_robin, since the latter delegates to the former anyway?

}
sc0 := <-cc.NewSubConnCh

// Send CONNECTING for child-0 - this should start the init timer.

The init timer is initially started when the child policy for that priority is created, which will happen when we call pb.UpdateClientConnState above. And as part of that in newChildBalancer, we also set the state in the childBalancer to Connecting.

Instead of relying on keeping track of the time elapsed between these events, what do you think about using the following state transitions, which are possible in the real world:

  • Move the subchannel corresponding to priority 0 to Connecting
  • Move the subchannel corresponding to priority 0 to Ready
  • Move the subchannel corresponding to priority 0 to Idle
  • Move the subchannel corresponding to priority 0 to Connecting

And create a package-level variable named timeAfterFunc, initialized to time.AfterFunc, which is what the code calls directly today:

// As a package global
var timeAfterFunc = time.AfterFunc

func (cb *childBalancer) startInitTimer() {
	...
	// Instead of directly using time.AfterFunc, use the variable timeAfterFunc
	timerW.timer = timeAfterFunc(DefaultPriorityInitTimeout, func() {
		...
	})
}

From the test, override timeAfterFunc so that it writes to a channel controlled by the test and then delegates to the original implementation. So, something like:

initTimerStarted := make(chan struct{}, 1)
origTimeAfterFunc := timeAfterFunc
timeAfterFunc = func(d time.Duration, f func()) *time.Timer {
  initTimerStarted <- struct{}{}
  return origTimeAfterFunc(d, f)
}
t.Cleanup(func() { timeAfterFunc = origTimeAfterFunc })

Then, you can verify that a new timer is not created by reading from the channel.

@easwars easwars added Type: Bug Area: Resolvers/Balancers Includes LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities. Status: Requires Reporter Clarification labels Oct 8, 2025
Development

Successfully merging this pull request may close these issues.

xds: Priority policy restarts timer on CONNECTING->CONNECTING transition
2 participants