Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support elastic jobset #622

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

kannon92
Copy link
Contributor

Fixes #463

Add elastic jobset support (downscale/upscaling replicas of a Replicated Job).

Before this change, we treated ReplicateJobs (in JobSet.Spec) as immutable. This relaxes that restriction.

I choose to validate that names cannot be changed and the job templates must not change during updates. I don't see a reason for names to change and we also don't allow order change of the replicated job list so we can detect differences easily.

I am reopening this to resolve the rebase labels..

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 22, 2024
Copy link

netlify bot commented Jul 22, 2024

Deploy Preview for kubernetes-sigs-jobset canceled.

Name Link
🔨 Latest commit 58ca914
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-jobset/deploys/66c0acbe016c860008d463a0

@kannon92
Copy link
Contributor Author

/assign @danielvegamyhre

@mimowo wouls you be open for a review also?

@kannon92
Copy link
Contributor Author

/cc @tenzen-y

@kannon92
Copy link
Contributor Author

/hold

Missed the conflicts..

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 27, 2024
@mimowo
Copy link
Contributor

mimowo commented Jul 29, 2024

@mimowo wouls you be open for a review also?

Ack, I will try this week, but don't wait for my review unnecessarily if you get enough input from others.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kannon92
Once this PR has been reviewed and has the lgtm label, please ask for approval from danielvegamyhre. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kannon92
Copy link
Contributor Author

kannon92 commented Jul 29, 2024

@danielvegamyhre do you know how to remove the rebase label? I did resolve conflicts in my branch and pushes those up. I think the prow removal of it is not working.

ref: https://kubernetes.slack.com/archives/C09QZ4DQB/p1708018595891739

Looks like rebase label can have issues. So I think removing is fine. It also doesn't block the merge so this is ready for review.

@kannon92
Copy link
Contributor Author

/remove needs-rebase

@danielvegamyhre danielvegamyhre removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 29, 2024
@danielvegamyhre
Copy link
Contributor

@danielvegamyhre do you know how to remove the rebase label? I did resolve conflicts in my branch and pushes those up. I think the prow removal of it is not working.

ref: https://kubernetes.slack.com/archives/C09QZ4DQB/p1708018595891739

Looks like rebase label can have issues. So I think removing is fine. It also doesn't block the merge so this is ready for review.

I removed it via the github UI, the "labels" button on the top right.

@kannon92
Copy link
Contributor Author

kannon92 commented Jul 29, 2024

I think since you have admin rights to the repo you have access to that.

Thank you though!

@kannon92
Copy link
Contributor Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 30, 2024
// Comparing job templates can be slow
// Only do it if we detect a difference.
if !equality.Semantic.DeepEqual(mungedSpec.ReplicatedJobs, oldJS.Spec.ReplicatedJobs) {
if err := validateReplicatedJobsUpdate(mungedSpec.ReplicatedJobs, oldJS.Spec.ReplicatedJobs); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this allows to update the number of replicas for an unsuspended JobSet. Is this desired?
I this would not work if there is no mechanism to recreate the Jobs. Maybe it is better to only support mutability of replicas when the JobSet is suspended?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that one would want to add more replicated jobs as a major use case for Elastic JobSet.

We debated about whether or not we should modify the indexed job or the JobSet. I think we went with scaling up/down on the replicas of a ReplicatedJob.

Are you thinking that elasticity should only occur on a suspended object?

Copy link
Contributor

@mimowo mimowo Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is ok to support elasticity for running Jobs, but you may test e2e if this really works in this PR (I doubt).

IIUC this requires an update to the underlying Job object no? IIUC you only update the JobSet object, but this does not impact the underlying Jobs which are running, but I might be missing something.

Also, I think changing paralellism only works for IndexedJobs where completions = parallelism.

If my thinking is correct, then your PR works ok, by only if combined with #625, and only if the JobSet is updated while suspended.

Overall it would be good to demonstrate this feature works with an e2e test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I have probably mixed a single Job mutability and the number of Job replicas.

In this PR you only let to update the Job replicas count, right?

Still, it is not clear to me if we already have code to wcale up and down the number of replicas.

Copy link
Contributor Author

@kannon92 kannon92 Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a e2e test and scaling up works. I create a JobSet and then I edit the replicated jobs of a worker jobset and I can confirm that new jobs are created.

I change the replicated jobs. I should check on scaling down actually.

Copy link
Contributor

@mimowo mimowo Aug 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I create a JobSet and then I edit the replicated jobs of a worker jobset and I can confirm that new jobs are created.

Great!

I change the replicated jobs. I should check on scaling down actually.

Please share the results of the testing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Downscaling doesn’t seem to work.

I think I’ll add some e2e tests for this as it “works” for integration tests so there may be some subtleties

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update!

@@ -1599,7 +1599,7 @@ func TestValidateUpdate(t *testing.T) {
}.ToAggregate(),
},
{
name: "replicated job pod template can be updated for suspended jobset",
name: "replicated jobs updates can change replicas",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure there are two versions of the test: for suspended JobSet and unsuspended.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that be relevant once #625 merges?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I see what you mean.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pushed up a change.

},
{
checkJobCreation: func(js *jobset.JobSet) {
expectedStarts := 7
Copy link
Contributor

@mimowo mimowo Aug 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be good to add a comment why 7 is expected. It is a bit magical number here. Maybe also "numJobs" is a better name? Not sure why this is called expectedStarts while it is checked against NumJobs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

},
{
checkJobCreation: func(js *jobset.JobSet) {
expectedStarts := 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly here. Please add a comment why 2 is expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@kannon92
Copy link
Contributor Author

kannon92 commented Aug 4, 2024

Pushed up some e2e tests for this. I need to work on the downscale case.

What should be the behavior for elastic jobs on a downscale?

If more jobs exists, do we delete the extra jobs only? Or should we delete and recreate all the jobs?

On upscale, the extra jobs are created and I think the original ones are left alone.

On downscale, I guess we could delete the extra replicas.

@danielvegamyhre
Copy link
Contributor

@kannon92 Looks like e2e tests are failing, can you take a look?

@kannon92
Copy link
Contributor Author

kannon92 commented Aug 6, 2024

Yes, my hope is to get some clarity on the design/code.

Downscaling doesn't work for sure. What should we do on a downscale of replicas in ReplicatedJobs? Do we delete all the jobs and recreate? Or do we terminate the jobs that are not in the list of replicas?

@danielvegamyhre
Copy link
Contributor

Yes, my hope is to get some clarity on the design/code.

Downscaling doesn't work for sure. What should we do on a downscale of replicas in ReplicatedJobs? Do we delete all the jobs and recreate? Or do we terminate the jobs that are not in the list of replicas?

Hmm, I think if replicatedJob.replicas is downscaled from N to N-2, we should just delete 2 jobs so the state matches the spec.

@kannon92
Copy link
Contributor Author

kannon92 commented Aug 8, 2024

Yes, my hope is to get some clarity on the design/code.
Downscaling doesn't work for sure. What should we do on a downscale of replicas in ReplicatedJobs? Do we delete all the jobs and recreate? Or do we terminate the jobs that are not in the list of replicas?

Hmm, I think if replicatedJob.replicas is downscaled from N to N-2, we should just delete 2 jobs so the state matches the spec.

Maybe I’m overthinking it.

Say I have 4 jobs and I downscale to 2. Should I only delete the 3rd and 4th replicated job via index? Say I want to keep the 1st and 2nd job running and I remove the excess ones?

@danielvegamyhre
Copy link
Contributor

danielvegamyhre commented Aug 8, 2024

Yes, my hope is to get some clarity on the design/code.
Downscaling doesn't work for sure. What should we do on a downscale of replicas in ReplicatedJobs? Do we delete all the jobs and recreate? Or do we terminate the jobs that are not in the list of replicas?

Hmm, I think if replicatedJob.replicas is downscaled from N to N-2, we should just delete 2 jobs so the state matches the spec.

Maybe I’m overthinking it.

Say I have 4 jobs and I downscale to 2. Should I only delete the 3rd and 4th replicated job via index? Say I want to keep the 1st and 2nd job running and I remove the excess ones?

If I recall correctly the Job controller works this way for Elastic Indexed Jobs, removing indexes from largest to smallest until there are no more excess pods, so it would make sense to follow that pattern here for consistency I think.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 10, 2024
@kannon92 kannon92 force-pushed the elastic-jobset-reopen branch 2 times, most recently from f4e6034 to b266034 Compare August 13, 2024 00:18
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 13, 2024
@kannon92
Copy link
Contributor Author

@danielvegamyhre @mimowo

I got the downscale to work so I think this PR is ready for review.

@danielvegamyhre danielvegamyhre added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Sep 8, 2024
want: fmt.Errorf("updates can not change job names or reorder the jobs"),
},
{
name: "updates on replicated job templates are not allowed",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we want to disallow job template updates? This would prevent us from using Elastic Indexed Jobs, which I'm hoping could be used in a more efficient implementation of exclusive job placement (see #482)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See this comment: #463 (comment).

I wasn’t sure if there was a use case for this so I omitted support for it.

@ahg-g would this be a strong enough usecase to add support for ElasticIndexedJobs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Elastic JobSets
4 participants