Finalize Pods if StatefulSet is not found. #4150

mbobrovskyi
Contributor

@mbobrovskyi mbobrovskyi commented Feb 5, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

Finalize Pods if StatefulSet is not found.

Which issue(s) this PR fixes:

Fixes #4160
Fixes #4138

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix a bug that prevented Kueue from deleting Pods after their StatefulSet is deleted.
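
As a rough illustration of what "finalize Pods if StatefulSet is not found" means, here is a minimal sketch. It is not the code merged in this PR: `Pod`, `managedFinalizer`, and `finalizePods` are hypothetical, simplified stand-ins for the real Kubernetes types and the real reconciler, and the finalizer name is a placeholder.

```go
package main

// Pod is a simplified, hypothetical stand-in for corev1.Pod.
type Pod struct {
	Name       string
	Finalizers []string
}

// managedFinalizer is a placeholder finalizer name used only for illustration.
const managedFinalizer = "example.com/managed"

// removeFinalizer returns the finalizers without name and reports whether
// anything was removed.
func removeFinalizer(finalizers []string, name string) ([]string, bool) {
	kept := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	return kept, len(kept) != len(finalizers)
}

// finalizePods strips managedFinalizer from every pod when the owning
// StatefulSet was not found, so that the pods can actually be deleted.
// It returns the names of the pods it changed.
func finalizePods(stsFound bool, pods []*Pod) []string {
	if stsFound {
		// The StatefulSet still exists: leave the finalizers in place.
		return nil
	}
	var updated []string
	for _, p := range pods {
		kept, changed := removeFinalizer(p.Finalizers, managedFinalizer)
		if changed {
			p.Finalizers = kept
			updated = append(updated, p.Name)
		}
	}
	return updated
}
```

In a real controller the "not found" signal would come from the API server lookup of the StatefulSet, and each changed pod would need to be written back; both are omitted here.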

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 5, 2025
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Feb 5, 2025

netlify bot commented Feb 5, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit b1e6a9f
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/67a9f838aa5c3e0008974c66

@mbobrovskyi mbobrovskyi force-pushed the fix/sts-remove-finalizer-on-pod-reconciler branch from 83f5b60 to b0d5f5c Compare February 5, 2025 13:24
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 5, 2025
@mbobrovskyi mbobrovskyi force-pushed the fix/sts-remove-finalizer-on-pod-reconciler branch 2 times, most recently from 7f9ce80 to 1905f9d Compare February 6, 2025 16:17
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 6, 2025
@mbobrovskyi mbobrovskyi changed the title Remove finalizer on PodReconciler instead of StatefulSetReconciler. Remove Pod finalizers if StatefulSet is not found. Feb 6, 2025
@mbobrovskyi mbobrovskyi changed the title Remove Pod finalizers if StatefulSet is not found. Finalize Pods if StatefulSet is not found. Feb 6, 2025
@mbobrovskyi mbobrovskyi force-pushed the fix/sts-remove-finalizer-on-pod-reconciler branch from 1905f9d to 1c345bb Compare February 7, 2025 09:05
@mbobrovskyi mbobrovskyi marked this pull request as ready for review February 7, 2025 09:05
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 7, 2025
@k8s-ci-robot k8s-ci-robot requested a review from PBundyra February 7, 2025 09:05
@mbobrovskyi
Contributor Author

/cc @mimowo

@k8s-ci-robot k8s-ci-robot requested a review from mimowo February 7, 2025 09:07
@mbobrovskyi mbobrovskyi force-pushed the fix/sts-remove-finalizer-on-pod-reconciler branch 2 times, most recently from 836f0e5 to c0b5f6d Compare February 7, 2025 09:21
@mbobrovskyi
Contributor Author

/cc @mszadkow

@k8s-ci-robot k8s-ci-robot requested a review from mszadkow February 7, 2025 10:59
@mimowo
Contributor

mimowo commented Feb 10, 2025

/hold cancel

I believe this comment needs more work / investigation: https://github.com/kubernetes-sigs/kueue/pull/4150/files#r1946593745

Actually, the pod is indeed owned directly by STS (tested without Kueue)

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 10, 2025
@mbobrovskyi mbobrovskyi force-pushed the fix/sts-remove-finalizer-on-pod-reconciler branch from 7d63d33 to b1e6a9f Compare February 10, 2025 12:59
@mszadkow
Contributor

I have nothing to add here, lgtm

Contributor

@mimowo mimowo left a comment

/lgtm
/approve
Thanks!
/cherry-pick release-0.10 release-0.9

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 7e0c03cc43f697cb30e8ecdabf0a427cda0e9e08

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mbobrovskyi, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 10, 2025
@k8s-ci-robot k8s-ci-robot merged commit 6d724d2 into kubernetes-sigs:main Feb 10, 2025
18 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.11 milestone Feb 10, 2025
@mbobrovskyi mbobrovskyi deleted the fix/sts-remove-finalizer-on-pod-reconciler branch February 10, 2025 13:40
@mimowo
Contributor

mimowo commented Feb 10, 2025

@mbobrovskyi please follow up with an analogous fix for LWS.

@mimowo
Contributor

mimowo commented Feb 10, 2025

/cherry-pick release-0.10 release-0.9

@k8s-infra-cherrypick-robot
Contributor

@mimowo: #4150 failed to apply on top of branch "release-0.10":

Applying: Finalize Pods if StatefulSet is not found.
Using index info to reconstruct a base tree...
M	pkg/controller/jobs/statefulset/statefulset_reconciler.go
M	pkg/controller/jobs/statefulset/statefulset_reconciler_test.go
M	test/e2e/singlecluster/statefulset_test.go
Falling back to patching base and 3-way merge...
Auto-merging test/e2e/singlecluster/statefulset_test.go
CONFLICT (content): Merge conflict in test/e2e/singlecluster/statefulset_test.go
Auto-merging pkg/controller/jobs/statefulset/statefulset_reconciler_test.go
CONFLICT (content): Merge conflict in pkg/controller/jobs/statefulset/statefulset_reconciler_test.go
Auto-merging pkg/controller/jobs/statefulset/statefulset_reconciler.go
CONFLICT (content): Merge conflict in pkg/controller/jobs/statefulset/statefulset_reconciler.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 Finalize Pods if StatefulSet is not found.

In response to this:

/cherry-pick release-0.10 release-0.9

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mimowo
Contributor

mimowo commented Feb 10, 2025

@mbobrovskyi please also prepare the cherry-picks manually

@tenzen-y
Member

/cherry-pick release-0.9

@k8s-infra-cherrypick-robot
Contributor

@tenzen-y: #4150 failed to apply on top of branch "release-0.9":

Applying: Finalize Pods if StatefulSet is not found.
Using index info to reconstruct a base tree...
A	pkg/controller/jobs/statefulset/statefulset_reconciler.go
A	pkg/controller/jobs/statefulset/statefulset_reconciler_test.go
M	test/e2e/singlecluster/statefulset_test.go
Falling back to patching base and 3-way merge...
Auto-merging test/e2e/singlecluster/statefulset_test.go
CONFLICT (content): Merge conflict in test/e2e/singlecluster/statefulset_test.go
CONFLICT (modify/delete): pkg/controller/jobs/statefulset/statefulset_reconciler_test.go deleted in HEAD and modified in Finalize Pods if StatefulSet is not found.. Version Finalize Pods if StatefulSet is not found. of pkg/controller/jobs/statefulset/statefulset_reconciler_test.go left in tree.
CONFLICT (modify/delete): pkg/controller/jobs/statefulset/statefulset_reconciler.go deleted in HEAD and modified in Finalize Pods if StatefulSet is not found.. Version Finalize Pods if StatefulSet is not found. of pkg/controller/jobs/statefulset/statefulset_reconciler.go left in tree.
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 Finalize Pods if StatefulSet is not found.

In response to this:

/cherry-pick release-0.9

@mbobrovskyi
Contributor Author

/cherry-pick release-0.9

The release-0.9 branch is tricky because it doesn't have SuspendedByParentAnnotation, which we use to determine whether a Pod belongs to the StatefulSet integration.

@tenzen-y
Member

> SuspendedByParentAnnotation

Oh, I see. In that case, I'm ok without cherry-picking to release-0.9
@mimowo wdyt?

@mimowo
Contributor

mimowo commented Feb 10, 2025

I'm not sure, let's consider options for 0.9:

  1. without this fix the StatefulSet integration has a severe issue which can result in orphaned pods when deleting a StatefulSet
  2. if we cherry-pick without looking at SuspendedByParentAnnotation it will have worse performance, because the STS reconciler would fire for all STS (not only managed by Kueue)
  3. somehow check if a pod belongs to STS managed by Kueue (maybe by the presence of pod-group annotation)?

Ranking the options from worst to best: 1 < 2 < 3. Option 3 requires a bit of code that is not in the main branch, but I think it is still better than options 1 and 2.

@tenzen-y
Member

tenzen-y commented Feb 10, 2025

I think the key point is which aspect matters more: performance, or not leaving orphaned pods behind.
TBH, I'm leaning toward (2.), since hundreds of orphaned pods would themselves hurt cluster performance (api-server, etcd, ...).

@mimowo
Contributor

mimowo commented Feb 10, 2025

Right, to me correctness is also more important than performance.

However, I believe with (3.) we can have both.
Just check whether the Kueue-specific pod-group annotation is present on the pod (and check that a StatefulSet is its ownerReference). With these two constraints the performance will be good and the behavior correct.

I think the check for SuspendedByParentAnnotation on the main branch is only marginally better, because it also skips reconciliation of pods owned by LWS via StatefulSet. But since 0.9 does not support LWS, looking for the pod-group annotation is perfectly fine IMO.
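
The two constraints proposed for option (3.) can be sketched as a simple predicate. This is only an illustration under stated assumptions, not the code from the release-0.9 backport: `Pod` and `OwnerReference` are simplified stand-ins for the Kubernetes API types, `podGroupAnnotation` is a placeholder key (the thread does not name the real annotation), and `shouldHandle` is a hypothetical helper name.

```go
package main

// OwnerReference is a simplified, hypothetical stand-in for metav1.OwnerReference.
type OwnerReference struct {
	Kind string
	Name string
}

// Pod is a simplified, hypothetical stand-in for corev1.Pod.
type Pod struct {
	Name            string
	Annotations     map[string]string
	OwnerReferences []OwnerReference
}

// podGroupAnnotation is a placeholder key used only for illustration.
const podGroupAnnotation = "example.com/pod-group"

// ownedByStatefulSet reports whether any owner reference points at a StatefulSet.
func ownedByStatefulSet(p Pod) bool {
	for _, ref := range p.OwnerReferences {
		if ref.Kind == "StatefulSet" {
			return true
		}
	}
	return false
}

// shouldHandle applies both constraints: the pod carries the pod-group
// annotation and it is owned by a StatefulSet. Pods failing either check
// are skipped, so StatefulSets not managed by Kueue cost nothing extra.
func shouldHandle(p Pod) bool {
	_, hasGroup := p.Annotations[podGroupAnnotation]
	return hasGroup && ownedByStatefulSet(p)
}
```

With such a filter, a pod of a plain StatefulSet (no annotation) and an annotated pod owned by something else are both ignored; only pods meeting both conditions would be considered for finalization.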

@tenzen-y
Member

> However, I believe with (3.) we can have both. Just check whether the Kueue-specific pod-group annotation is present on the pod (and check that a StatefulSet is its ownerReference). With these two constraints the performance will be good and the behavior correct.
>
> I think the check for SuspendedByParentAnnotation on the main branch is only marginally better, because it also skips reconciliation of pods owned by LWS via StatefulSet. But since 0.9 does not support LWS, looking for the pod-group annotation is perfectly fine IMO.

Yeah, one more problem is conflicts. What if we select (3.)? I want to compare the amount of conflicts between the release-0.9 branch and main.
If the amount of conflicts is mostly the same, it would be great to select (3.), but if (3.) brings more conflicts, I'd like to select (2.).
@mimowo wdyt?

@mimowo
Contributor

mimowo commented Feb 10, 2025

I think the "fix" is very small, PTAL #4206 (comment). In that comment I also proposed adding a code comment that clearly indicates why the code is there, which I believe will help us resolve conflicts in the future.

@tenzen-y
Member

> I think the "fix" is very small, PTAL #4206 (comment). In that comment I also proposed adding a code comment that clearly indicates why the code is there, which I believe will help us resolve conflicts in the future.

commented in PR
