Skip to content

Conversation

elijah-rou
Copy link
Contributor

Implement comprehensive pod state management system in the activator to enable
graceful handling of pod lifecycle transitions and connection management.

Pod State Management:

  • Add pod state tracking with states: podHealthy, podPending, podDraining, podRemoved
  • Implement graceful draining with reference counting for in-flight requests
  • Add proper synchronization for concurrent pod state transitions
  • Ensure atomic updates to pod tracker assignments

Connection Management Improvements:

  • Add connection-level metrics for monitoring pod health
  • Track transport failures and timeouts per pod
  • Implement proper error attribution for healthy vs unhealthy targets
  • Add pending request tracking metrics

Breaker Integration:

  • Update queue breaker to track pending vs in-flight requests
  • Add metrics for requests waiting to acquire pod trackers
  • Implement proper capacity management during state transitions

Testing:

  • Add comprehensive tests for pod state transitions
  • Add race condition tests for concurrent state updates
  • Verify graceful draining behavior with in-flight requests
  • Test proper capacity calculations with multiple activators

Performance:

  • Optimize pod assignment using consistent hashing
  • Reduce lock contention in hot paths
  • Improve capacity update calculations

This change improves reliability by ensuring connections are properly managed
during pod lifecycle events, preventing request failures during scaling operations
and pod terminations.

This succeeds work initially done in #15811

@knative-prow knative-prow bot requested review from dprotaso and skonto September 12, 2025 15:27
@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 12, 2025
@knative-prow knative-prow bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 12, 2025
Copy link

knative-prow bot commented Sep 12, 2025

Hi @elijah-rou. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@knative-prow knative-prow bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Sep 12, 2025
Copy link

knative-prow bot commented Sep 12, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: elijah-rou
Once this PR has been reviewed and has the lgtm label, please assign dprotaso for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@elijah-rou elijah-rou force-pushed the feat/inconsistent-breaker-state branch 2 times, most recently from 6a11d8a to d006bd9 Compare September 12, 2025 15:39
@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 12, 2025
Copy link

codecov bot commented Sep 12, 2025

Codecov Report

❌ Patch coverage is 78.38617% with 75 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.99%. Comparing base (26a8cec) to head (2cc54da).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pkg/activator/net/throttler.go 80.93% 45 Missing and 12 partials ⚠️
pkg/activator/handler/context.go 10.00% 17 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16083      +/-   ##
==========================================
- Coverage   80.20%   79.99%   -0.21%     
==========================================
  Files         214      214              
  Lines       16887    17111     +224     
==========================================
+ Hits        13544    13688     +144     
- Misses       2987     3054      +67     
- Partials      356      369      +13     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@elijah-rou
Copy link
Contributor Author

/retest

Copy link

knative-prow bot commented Sep 12, 2025

@elijah-rou: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@elijah-rou elijah-rou force-pushed the feat/inconsistent-breaker-state branch from d15000e to 117fa87 Compare September 12, 2025 17:33
@dprotaso
Copy link
Member

/ok-to-test

@knative-prow knative-prow bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 15, 2025
@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 16, 2025
@elijah-rou elijah-rou force-pushed the feat/inconsistent-breaker-state branch from 117fa87 to f2c2e98 Compare September 19, 2025 18:36
@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 19, 2025
Copy link

knative-prow bot commented Sep 19, 2025

There are empty aliases in OWNER_ALIASES, cleanup is advised.

@elijah-rou elijah-rou force-pushed the feat/inconsistent-breaker-state branch from ff00a58 to 43edb91 Compare September 19, 2025 18:46
@elijah-rou
Copy link
Contributor Author

/retest

@elijah-rou
Copy link
Contributor Author

@dprotaso what's happening with build-tests_serving_main? I am a bit confused what is broken

@dprotaso
Copy link
Member

diff in the protobuf formatting it looks like - the build tests run this script

https://github.com/knative/serving/blob/main/hack/verify-codegen.sh

We're using this version of protoc - https://github.com/knative/infra/blob/8acbaddaa605095b52df9f569e97a631a412ecb8/images/prow-tests/Dockerfile#L25

If you're messing with proto files we have a flag in our ./hack/update-codegen.sh script here

--generate-protobufs) generate_protobufs=1 ;;

knative-automation and others added 4 commits September 30, 2025 12:40
bumping knative.dev/hack f88b7db...af735b2:
  > af735b2 Fix dot releases (# 434)

Signed-off-by: Knative Automation <[email protected]>
…on handling

Implement comprehensive pod state management system in the activator to
enable
graceful handling of pod lifecycle transitions and connection
management.

Pod State Management:
- Add pod state tracking with states: podHealthy, podPending,
podDraining, podRemoved
- Implement graceful draining with reference counting for in-flight
requests
- Add proper synchronization for concurrent pod state transitions
- Ensure atomic updates to pod tracker assignments

Connection Management Improvements:
- Add connection-level metrics for monitoring pod health
- Track transport failures and timeouts per pod
- Implement proper error attribution for healthy vs unhealthy targets
- Add pending request tracking metrics

Breaker Integration:
- Update queue breaker to track pending vs in-flight requests
- Add metrics for requests waiting to acquire pod trackers
- Implement proper capacity management during state transitions

Testing:
- Add comprehensive tests for pod state transitions
- Add race condition tests for concurrent state updates
- Verify graceful draining behavior with in-flight requests
- Test proper capacity calculations with multiple activators

Performance:
- Optimize pod assignment using consistent hashing
- Reduce lock contention in hot paths
- Improve capacity update calculations

This change improves reliability by ensuring connections are properly
managed
during pod lifecycle events, preventing request failures during scaling
operations
and pod terminations.

chore: go mod vendor after rebase

chore: update codegen
… management

Add extensive race condition tests to verify thread-safe behavior in:
- Pod state transitions (healthy, draining, pending states)
- Concurrent Reserve operations and state changes
- Reference counting under concurrent access
- Weight modifications across goroutines
- RevisionThrottler state updates
- Tracker acquisition with concurrent modifications

These tests help ensure atomic operations properly prevent race conditions
in the activator's pod tracking and throttling logic.
@elijah-rou elijah-rou force-pushed the feat/inconsistent-breaker-state branch from 43edb91 to a92824b Compare September 30, 2025 16:40
@elijah-rou
Copy link
Contributor Author

@dprotaso any thoughts?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants