support elastic jobset #622
base: main
Conversation
/assign @danielvegamyhre @mimowo would you be open for a review also?
/cc @tenzen-y
/hold Missed the conflicts.
Ack, I will try this week, but don't wait for my review unnecessarily if you get enough input from others.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: kannon92. The full list of commands accepted by this bot can be found here.
@danielvegamyhre do you know how to remove the rebase label? I did resolve conflicts in my branch and pushed those up. I think the prow removal of it is not working. ref: https://kubernetes.slack.com/archives/C09QZ4DQB/p1708018595891739 Looks like the rebase label can have issues, so I think removing it is fine. It also doesn't block the merge, so this is ready for review.
/remove needs-rebase
I removed it via the github UI, the "labels" button on the top right.
I think since you have admin rights to the repo you have access to that. Thank you though!
/hold cancel
// Comparing job templates can be slow
// Only do it if we detect a difference.
if !equality.Semantic.DeepEqual(mungedSpec.ReplicatedJobs, oldJS.Spec.ReplicatedJobs) {
	if err := validateReplicatedJobsUpdate(mungedSpec.ReplicatedJobs, oldJS.Spec.ReplicatedJobs); err != nil {
It seems this allows updating the number of replicas for an unsuspended JobSet. Is this desired?
I think this would not work if there is no mechanism to recreate the Jobs. Maybe it is better to only support mutability of replicas when the JobSet is suspended?
I was thinking that one would want to add more replicated jobs as a major use case for Elastic JobSet.
We debated about whether or not we should modify the indexed job or the JobSet. I think we went with scaling up/down on the replicas of a ReplicatedJob.
Are you thinking that elasticity should only occur on a suspended object?
I think it is ok to support elasticity for running Jobs, but you should test end-to-end whether this really works in this PR (I doubt it).
IIUC this requires an update to the underlying Job object, no? IIUC you only update the JobSet object, but this does not impact the underlying Jobs which are running, but I might be missing something.
Also, I think changing parallelism only works for Indexed Jobs where completions = parallelism.
If my thinking is correct, then your PR works ok, but only if combined with #625, and only if the JobSet is updated while suspended.
Overall it would be good to demonstrate this feature works with an e2e test.
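The parallelism condition mentioned above (mutating parallelism only takes effect for Indexed Jobs where completions equals parallelism, per the upstream Elastic Indexed Jobs rules) can be sketched as a simple predicate. The `JobSpec` struct here is a simplified stand-in for the Kubernetes batch/v1 type, not the real API:

```go
package main

import "fmt"

// JobSpec is a simplified stand-in for batch/v1 JobSpec; only the
// fields relevant to the elasticity check are modeled here.
type JobSpec struct {
	CompletionMode string // "Indexed" or "NonIndexed"
	Completions    int32
	Parallelism    int32
}

// isElasticIndexed reports whether parallelism/completions of this Job
// may be mutated together, mirroring the Elastic Indexed Jobs rule:
// the Job must be Indexed with completions == parallelism.
func isElasticIndexed(spec JobSpec) bool {
	return spec.CompletionMode == "Indexed" && spec.Completions == spec.Parallelism
}

func main() {
	fmt.Println(isElasticIndexed(JobSpec{CompletionMode: "Indexed", Completions: 4, Parallelism: 4})) // true
	fmt.Println(isElasticIndexed(JobSpec{CompletionMode: "Indexed", Completions: 4, Parallelism: 2})) // false
}
```

A check like this would have to run against the real Job template before allowing a replica or parallelism mutation to take effect on running Jobs.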
Sorry, I have probably mixed up single Job mutability and the number of Job replicas.
In this PR you only allow updating the Job replicas count, right?
Still, it is not clear to me if we already have code to scale the number of replicas up and down.
I did an e2e test and scaling up works. I create a JobSet and then edit the replicated jobs of a worker JobSet, and I can confirm that new Jobs are created.
I change the replicated jobs. I should check on scaling down, actually.
I create a JobSet and then I edit the replicated jobs of a worker jobset and I can confirm that new jobs are created.
Great!
I change the replicated jobs. I should check on scaling down actually.
Please share the results of the testing.
Downscaling doesn't seem to work.
I think I'll add some e2e tests for this, as it "works" for integration tests, so there may be some subtleties.
Thanks for the update!
@@ -1599,7 +1599,7 @@ func TestValidateUpdate(t *testing.T) {
	}.ToAggregate(),
},
{
-	name: "replicated job pod template can be updated for suspended jobset",
+	name: "replicated jobs updates can change replicas",
Make sure there are two versions of the test: for suspended JobSet and unsuspended.
Would that be relevant once #625 merges?
sorry, I see what you mean.
Pushed up a change.
},
{
	checkJobCreation: func(js *jobset.JobSet) {
		expectedStarts := 7
It would be good to add a comment explaining why 7 is expected; it is a bit of a magic number here. Maybe "numJobs" is a better name? Not sure why this is called expectedStarts while it is checked against NumJobs.
Updated.
},
{
	checkJobCreation: func(js *jobset.JobSet) {
		expectedStarts := 2
Similarly here. Please add a comment why 2 is expected.
Updated.
Pushed up some e2e tests for this. I need to work on the downscale case. What should be the behavior for elastic jobs on a downscale? If more jobs exist, do we delete only the extra jobs, or should we delete and recreate all the jobs? On upscale, the extra jobs are created and I think the original ones are left alone. On downscale, I guess we could delete the extra replicas.
@kannon92 Looks like e2e tests are failing, can you take a look?
Yes, my hope is to get some clarity on the design/code. Downscaling doesn't work for sure. What should we do on a downscale of replicas in ReplicatedJobs? Do we delete all the jobs and recreate? Or do we terminate the jobs that are not in the list of replicas?
Hmm, I think if replicatedJob.replicas is downscaled from N to N-2, we should just delete 2 jobs so the state matches the spec.
Maybe I'm overthinking it. Say I have 4 jobs and I downscale to 2. Should I delete only the 3rd and 4th replicated jobs by index, so the 1st and 2nd jobs keep running and only the excess ones are removed?
If I recall correctly, the Job controller works this way for Elastic Indexed Jobs: it removes indexes from largest to smallest until there are no more excess pods, so it would make sense to follow that pattern here for consistency.
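The largest-to-smallest trimming discussed here can be sketched as a small helper. `indexesToDelete` is a hypothetical illustration, not the actual JobSet controller code:

```go
package main

import "fmt"

// indexesToDelete returns the replicated-job indexes to remove when
// scaling down from oldReplicas to newReplicas, highest index first,
// mirroring how Elastic Indexed Jobs trim excess indexes so the
// lowest indexes keep running.
func indexesToDelete(oldReplicas, newReplicas int) []int {
	var victims []int
	for i := oldReplicas - 1; i >= newReplicas; i-- {
		victims = append(victims, i)
	}
	return victims
}

func main() {
	// Downscale from 4 replicas to 2: indexes 3 and 2 are deleted,
	// indexes 0 and 1 keep running.
	fmt.Println(indexesToDelete(4, 2)) // prints [3 2]
}
```

The controller would then issue deletes for the Jobs whose replica index appears in the returned slice, leaving the surviving replicas untouched.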
I got the downscale to work so I think this PR is ready for review.
	want: fmt.Errorf("updates can not change job names or reorder the jobs"),
},
{
	name: "updates on replicated job templates are not allowed",
Why do we want to disallow job template updates? This would prevent us from using Elastic Indexed Jobs, which I'm hoping could be used in a more efficient implementation of exclusive job placement (see #482)
See this comment: #463 (comment).
I wasn't sure if there was a use case for this, so I omitted support for it.
@ahg-g would this be a strong enough use case to add support for Elastic Indexed Jobs?
Fixes #463
Add elastic JobSet support (downscaling/upscaling replicas of a ReplicatedJob).
Before this change, we treated ReplicatedJobs (in JobSet.Spec) as immutable. This relaxes that restriction.
I chose to validate that names cannot change and that job templates must not change during updates. I don't see a reason for names to change, and we also don't allow reordering of the replicated job list, so we can detect differences easily.
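The validation rules described above (names and order immutable, templates immutable, only replicas mutable) can be sketched roughly as follows. `ReplicatedJob` here is a simplified stand-in for the JobSet API type, and this is an illustration, not the actual validateReplicatedJobsUpdate implementation:

```go
package main

import (
	"fmt"
	"reflect"
)

// ReplicatedJob is a simplified stand-in for the JobSet API type;
// Template stands in for the full Job template spec.
type ReplicatedJob struct {
	Name     string
	Replicas int32
	Template string
}

// validateUpdate enforces the rules described above: the list length,
// the names (and their order), and the templates are immutable; only
// Replicas may change between old and new.
func validateUpdate(newJobs, oldJobs []ReplicatedJob) error {
	if len(newJobs) != len(oldJobs) {
		return fmt.Errorf("updates can not add or remove replicated jobs")
	}
	for i := range newJobs {
		if newJobs[i].Name != oldJobs[i].Name {
			return fmt.Errorf("updates can not change job names or reorder the jobs")
		}
		if !reflect.DeepEqual(newJobs[i].Template, oldJobs[i].Template) {
			return fmt.Errorf("updates on replicated job templates are not allowed")
		}
	}
	return nil
}

func main() {
	old := []ReplicatedJob{{Name: "workers", Replicas: 2, Template: "tmpl"}}
	scaled := []ReplicatedJob{{Name: "workers", Replicas: 4, Template: "tmpl"}}
	fmt.Println(validateUpdate(scaled, old)) // replicas-only change passes: <nil>
}
```

Comparing names positionally (index by index) is what makes reordering detectable without any sorting: a swap shows up as a name mismatch at some index.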
I am reopening this to resolve the rebase label.