OCPBUGS-60524: podresources: list: use active pods #2413

Open
haircommander wants to merge 2 commits into release-4.18 from podresources-list-active-pods-backport-4.18

Conversation

haircommander
Member

What type of PR is this?

What this PR does / why we need it:

backport of #2391

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Aug 14, 2025
@openshift-ci-robot

@haircommander: This pull request references Jira Issue OCPBUGS-60524, which is invalid:

  • expected the bug to target either version "4.20." or "openshift-4.20.", but it targets "4.19.z" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 14, 2025
@openshift-ci-robot openshift-ci-robot added the backports/unvalidated-commits Indicates that not all commits come to merged upstream PRs. label Aug 14, 2025
@openshift-ci-robot

@haircommander: the contents of this pull request could not be automatically validated.

The following commits are valid:

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

@haircommander haircommander changed the base branch from master to release-4.18 August 14, 2025 13:53
@openshift-ci-robot

@haircommander: This pull request references Jira Issue OCPBUGS-60524, which is invalid:

  • expected the bug to target either version "4.18." or "openshift-4.18.", but it targets "4.19.z" instead
  • expected Jira Issue OCPBUGS-60524 to depend on a bug targeting a version in 4.19.0, 4.19.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

@openshift-ci openshift-ci bot requested review from rphillips and sjenning August 14, 2025 13:54
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 14, 2025
@haircommander
Member Author

/retest

openshift-ci bot commented Aug 14, 2025

@haircommander: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-techpreview-serial | 50d0ecb | link | false | `/test e2e-aws-ovn-techpreview-serial` |
| ci/prow/e2e-agnostic-ovn-cmd | 50d0ecb | link | false | `/test e2e-agnostic-ovn-cmd` |
| ci/prow/e2e-aws-ovn-downgrade | 50d0ecb | link | true | `/test e2e-aws-ovn-downgrade` |
| ci/prow/e2e-aws-ovn-fips | 50d0ecb | link | true | `/test e2e-aws-ovn-fips` |
| ci/prow/e2e-aws-ovn-hypershift | 50d0ecb | link | true | `/test e2e-aws-ovn-hypershift` |
| ci/prow/e2e-aws-ovn-cgroupsv2 | 50d0ecb | link | true | `/test e2e-aws-ovn-cgroupsv2` |
| ci/prow/okd-scos-images | 50d0ecb | link | true | `/test okd-scos-images` |
| ci/prow/e2e-aws-ovn-crun | 50d0ecb | link | true | `/test e2e-aws-ovn-crun` |
| ci/prow/integration | 50d0ecb | link | true | `/test integration` |
| ci/prow/k8s-e2e-aws-ovn-serial | 50d0ecb | link | false | `/test k8s-e2e-aws-ovn-serial` |
| ci/prow/okd-scos-e2e-aws-ovn | 50d0ecb | link | false | `/test okd-scos-e2e-aws-ovn` |
| ci/prow/e2e-aws-ovn-serial | 50d0ecb | link | true | `/test e2e-aws-ovn-serial` |
| ci/prow/e2e-aws-ovn-techpreview | 50d0ecb | link | false | `/test e2e-aws-ovn-techpreview` |
| ci/prow/k8s-e2e-gcp-serial | 50d0ecb | link | true | `/test k8s-e2e-gcp-serial` |
| ci/prow/k8s-e2e-gcp-ovn | 50d0ecb | link | true | `/test k8s-e2e-gcp-ovn` |
| ci/prow/k8s-e2e-conformance-aws | 50d0ecb | link | true | `/test k8s-e2e-conformance-aws` |
| ci/prow/images | 50d0ecb | link | true | `/test images` |
| ci/prow/e2e-aws-ovn-runc | 50d0ecb | link | true | `/test e2e-aws-ovn-runc` |
| ci/prow/verify-commits | 50d0ecb | link | true | `/test verify-commits` |
| ci/prow/verify | 50d0ecb | link | true | `/test verify` |
| ci/prow/e2e-aws-csi | 50d0ecb | link | false | `/test e2e-aws-csi` |
| ci/prow/e2e-gcp | 50d0ecb | link | true | `/test e2e-gcp` |
| ci/prow/e2e-aws-crun-wasm | 50d0ecb | link | true | `/test e2e-aws-crun-wasm` |
| ci/prow/artifacts | 50d0ecb | link | true | `/test artifacts` |
| ci/prow/unit | 50d0ecb | link | true | `/test unit` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@haircommander
Member Author

/skip
/jira refresh

@openshift-ci-robot

@haircommander: This pull request references Jira Issue OCPBUGS-60524, which is invalid:

  • expected the bug to target either version "4.18." or "openshift-4.18.", but it targets "4.19.z" instead
  • expected Jira Issue OCPBUGS-60524 to depend on a bug targeting a version in 4.19.0, 4.19.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

@rphillips

/approve
/label backport-risk-assessed

@openshift-ci openshift-ci bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label Aug 18, 2025
@rphillips

/remove-label backports/unvalidated-commits

openshift-ci bot commented Aug 18, 2025

@rphillips: Can not set label backports/unvalidated-commits: Must be member in one of these teams: [openshift-staff-engineers]

@haircommander
Member Author

/jira refresh

@openshift-ci-robot

@haircommander: This pull request references Jira Issue OCPBUGS-60524, which is invalid:

  • expected Jira Issue OCPBUGS-60524 to depend on a bug targeting a version in 4.19.0, 4.19.z and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

@haircommander
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Aug 18, 2025
@openshift-ci-robot

@haircommander: This pull request references Jira Issue OCPBUGS-60524, which is valid. The bug has been moved to the POST state.

7 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.z) matches configured target version for branch (4.18.z)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
  • release note type set to "Release Note Not Required"
  • dependent bug Jira Issue OCPBUGS-60074 is in the state Verified, which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA))
  • dependent Jira Issue OCPBUGS-60074 targets the "4.19.z" version, which is one of the valid target versions: 4.19.0, 4.19.z
  • bug has dependents

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

@haircommander
Member Author

/skip

@tkashem

tkashem commented Aug 19, 2025

/lgtm
/approve
/remove-label backports/unvalidated-commits

/cc @bertinatto

@openshift-ci openshift-ci bot requested a review from bertinatto August 19, 2025 13:54
@openshift-ci openshift-ci bot removed the backports/unvalidated-commits Indicates that not all commits come to merged upstream PRs. label Aug 19, 2025
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 19, 2025
openshift-ci bot commented Aug 19, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: haircommander, rphillips, tkashem

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 19, 2025
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD d3330e0 and 2 for PR HEAD 50d0ecb in total

@bertinatto
Member

  1. Is this part of an upstream PR? If so, can you add the PR number to the PR instead of <carry>?
  2. The second commit says it could be dropped once we rebase to 1.34; shouldn't it be a <drop>?
  3. If not, can these commits be squashed instead?

CC @jacobsee

The podresources API List implementation uses the internal data of the
resource managers as source of truth.
Looking at the implementation here:
https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/apis/podresources/server_v1.go#L60
we take care of syncing the device allocation data before querying the
device manager to return its pod->devices assignment.
This is needed because otherwise the device manager (and all the other
resource managers) would do the cleanup asynchronously, so the `List` call
will return incorrect data.

But we do this syncing for neither CPUs nor memory, so when we report
these we get stale data, as issue kubernetes#132020 demonstrates.

For the CPU manager, however, we have the reconcile loop, which cleans up the stale data periodically.
It turns out this timing interplay was the reason the existing issue kubernetes#119423 seemed fixed
(see: kubernetes#119423 (comment)).
But it is only a matter of timing. If in the reproducer we set `cpuManagerReconcilePeriod` to a
very high value (>= 5 minutes), the issue still reproduces against the current master branch
(https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/test/e2e_node/podresources_test.go#L983).

Taking a step back, we can see multiple problems:
1. we do not sync the resource managers' internal data before querying
   for pod assignment (no removeStaleState calls), but most importantly
2. the `List` call iterates over all the pods known to the kubelet. But
   the resource managers do NOT hold resources for non-running pods, so
   it is better, and actually correct, to iterate only over the active
   pods. This will also avoid issue 1 above.
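
A minimal Go sketch of the second problem, not the kubelet code: the `Pod` type, `filterActive`, and `listCPUs` are hypothetical names invented for illustration, and the assignment map stands in for a resource manager whose stale entries have not been cleaned up yet. It only shows why iterating over active pods keeps stale assignments out of the reported view.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified stand-in for the kubelet's pod object.
type Pod struct {
	Name     string
	Terminal bool // true once the pod has finished running
}

// filterActive keeps only the pods that are not terminal. This is an
// assumption-level model of "active pods", not the kubelet's real filter.
func filterActive(pods []Pod) []Pod {
	active := make([]Pod, 0, len(pods))
	for _, p := range pods {
		if !p.Terminal {
			active = append(active, p)
		}
	}
	return active
}

// listCPUs reports the CPU assignment recorded for each pod in the given set.
func listCPUs(pods []Pod, assignments map[string][]int) map[string][]int {
	out := make(map[string][]int, len(pods))
	for _, p := range pods {
		out[p.Name] = assignments[p.Name]
	}
	return out
}

func main() {
	pods := []Pod{{Name: "running"}, {Name: "finished", Terminal: true}}
	// The entry for "finished" is stale: it models a manager that has not
	// run its periodic cleanup yet.
	assignments := map[string][]int{"running": {0, 1}, "finished": {2, 3}}

	fmt.Println(len(listCPUs(pods, assignments)))               // all pods: stale entry is reported
	fmt.Println(len(listCPUs(filterActive(pods), assignments))) // active pods only
}
```

With the full pod set the finished pod still shows CPUs 2 and 3; with the active set it is never queried, so no cleanup call is needed first.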

Furthermore, the resource managers all iterate over the active pods
anyway, while `List` uses all the pods known to the kubelet:
1. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/kubelet.go#L3135 goes in
2. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/pod/pod_manager.go#L215

But all the resource managers are using the list of active pods:
1. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/kubelet.go#L1666 goes in
2. https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/kubelet_pods.go#L198

So this change will also make the `List` view consistent with the
resource managers' view, which is also a promise of the API that is
currently broken.

We also need to acknowledge the warning in the docstring of GetActivePods.
Arguably, having the endpoint use a different pod set than the resource
managers, with the related desync, causes way more harm than good.
And arguably, it's better to fix this issue in just one place instead of
having `List` use a different pod set for unclear reasons.
For these reasons, while important, I don't think the warning per se
invalidates this change.

We further need to acknowledge that the `List` endpoint has used the full
pod list since its inception. So, we will add a Feature Gate to disable
this fix and restore the old behavior. We plan to keep this Feature Gate
for quite a long time (at least 4 more releases) considering how stable
this change was. Should a consumer of the API be broken by this change,
we have the option to restore the old behavior and to craft a more
elaborate fix.
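
A rough Go sketch of the escape hatch described above, under stated assumptions: `podSource` and `selectPodSource` are hypothetical names, and the boolean stands in for whatever gate check upstream uses. It is not the upstream implementation; it only illustrates choosing the pod set once, in a single place.

```go
package main

import "fmt"

// podSource returns the names of the pods a List call should iterate over.
// The type is hypothetical and exists only for this illustration.
type podSource func() []string

// selectPodSource returns the active-pods source when the (assumed) gate is
// enabled, and falls back to the full pod list otherwise.
func selectPodSource(useActivePods bool, allPods, activePods podSource) podSource {
	if useActivePods {
		return activePods
	}
	return allPods
}

func main() {
	all := func() []string { return []string{"running", "finished"} }
	active := func() []string { return []string{"running"} }

	fmt.Println(selectPodSource(true, all, active)())  // fixed behavior
	fmt.Println(selectPodSource(false, all, active)()) // old behavior restored
}
```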

The old `v1alpha1` endpoint is intentionally left unmodified.

***RELEASE-4.19 BACKPORT NOTE***
Dropped the versioned feature gate entry, as we don't have versioned
feature gates in this version.

Signed-off-by: Francesco Romani <[email protected]>

To facilitate backports (see OCPBUGS-56785), we prefer to remove the
feature gate that upstream added as a safety measure, which also
disables the escape hatch it provided.

This commit must be dropped once we rebase on top of 1.34.

Signed-off-by: Francesco Romani <[email protected]>
@haircommander haircommander force-pushed the podresources-list-active-pods-backport-4.18 branch from 50d0ecb to 2bb7140 Compare August 21, 2025 16:06
@openshift-ci-robot openshift-ci-robot added the backports/unvalidated-commits Indicates that not all commits come to merged upstream PRs. label Aug 21, 2025
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 21, 2025
@openshift-ci-robot

@haircommander: the contents of this pull request could not be automatically validated.

The following commits are valid:

The following commits could not be validated and must be approved by a top-level approver:

Comment /validate-backports to re-evaluate validity of the upstream PRs, for example when they are merged upstream.

openshift-ci bot commented Aug 21, 2025

New changes are detected. LGTM label has been removed.

@haircommander
Member Author

@bertinatto I've updated!

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. backports/unvalidated-commits Indicates that not all commits come to merged upstream PRs. jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.