Allow to mutate PodTemplate when suspending a JobSet and support resuming such JobSet #624
Comments
/assign
Let's fix this, but rather than doing a one-off fix here, we need to iron out the specific requirements for the JobSet + Kueue integration and align our roadmaps so that changes in Kueue don't break it. We had a similar issue a couple of months ago: Kueue tried to mutate certain podTemplate fields on suspended JobSets, but those fields are immutable in JobSet, which led to a customer/user reporting the issue (#579). One thing we could do is make the entire podTemplate mutable in JobSet, to prevent further issues like this.
I think this is a good point. At the technical layer, we should keep extending the JobSet e2e test suite in Kueue, which was started recently. EDIT: the test suite for reference: https://github.com/kubernetes-sigs/kueue/blob/main/test/e2e/singlecluster/jobset_test.go. I'm going to extend it as part of kubernetes-sigs/kueue#2691 (started the PR in kubernetes-sigs/kueue#2700).
The proposal for the e2e test scenario which covers this and #623: #623 (comment)
There are currently two related issues which prevent the JobSet-Kueue integration from working:
When Kueue evicts a workload (represented by JobSet) it stops the JobSet and tries to restore the PodTemplate to enable re-admitting the same JobSet to another ResourceFlavor (with potentially different nodeSelectors).
For example, the following e2e test for Job shows how Kueue can preempt a workload and re-admit with another nodeSelector: link.
However, the integration with Kueue does not currently work, because the Kueue request to suspend the JobSet fails if it also updates the PodTemplate.
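To make the requested behavior concrete, here is a minimal Go sketch of an update check that would accept the requests described above. It is not the JobSet webhook code: the `JobSet` and `PodTemplate` types, the `validateUpdate` function, and the `pool` label are hypothetical simplifications that model only `suspend` and `nodeSelector`. The assumed rule is that the PodTemplate may change whenever the JobSet is suspended before the update or becomes suspended by it.

```go
package main

import (
	"errors"
	"fmt"
)

// PodTemplate is a hypothetical, trimmed-down stand-in for the pod template
// of a JobSet; only the nodeSelector is modelled here.
type PodTemplate struct {
	NodeSelector map[string]string
}

// JobSet is a hypothetical, simplified stand-in for the real JobSet object.
type JobSet struct {
	Suspend     bool
	PodTemplate PodTemplate
}

var errImmutable = errors.New("podTemplate cannot change while the JobSet is running")

// sameSelector reports whether two nodeSelectors hold the same key/value
// pairs; a nil map and an empty map count as equal.
func sameSelector(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if w, ok := b[k]; !ok || w != v {
			return false
		}
	}
	return true
}

// validateUpdate accepts a PodTemplate change only if the JobSet was already
// suspended or the same update suspends it (assumed rule, see lead-in).
func validateUpdate(oldJS, newJS JobSet) error {
	if sameSelector(oldJS.PodTemplate.NodeSelector, newJS.PodTemplate.NodeSelector) {
		return nil
	}
	if oldJS.Suspend || newJS.Suspend {
		return nil
	}
	return errImmutable
}

func main() {
	running := JobSet{PodTemplate: PodTemplate{NodeSelector: map[string]string{"pool": "a"}}}

	// Eviction: suspend and restore the original (empty) nodeSelector in one update.
	evicted := JobSet{Suspend: true}
	fmt.Println("suspend + restore:", validateUpdate(running, evicted))

	// Re-admission: resume with the nodeSelector of another flavor.
	resumed := JobSet{PodTemplate: PodTemplate{NodeSelector: map[string]string{"pool": "b"}}}
	fmt.Println("resume + inject:", validateUpdate(evicted, resumed))

	// Changing the nodeSelector of a running JobSet is still rejected.
	fmt.Println("mutate while running:", validateUpdate(running, resumed))
}
```

Under this rule, both the eviction request (suspend and restore the template) and the re-admission request (resume with a new nodeSelector) pass, while a running JobSet stays protected from template changes.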