
support successPolicy and failurePolicy on pytorchjob #1575

Closed


@qiankunli (Contributor) commented Apr 17, 2022:

  1. Support successPolicy and failurePolicy; related issue: support elastic scheduling for pytorch #1571
  2. Support adding the volcano.sh/preemptable label on pods; related volcano PR: support elastic annotation in preempt/reclaim plugin volcano-sh/volcano#2105
  3. Support watching PodGroup; fixes: trainning-operator may need to monitor PodGroup #1574 (deleted)

@aws-kf-ci-bot (Contributor) commented:

Hi @qiankunli. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@qiankunli force-pushed the feature/success_failure_policy branch 3 times, most recently from c50681b to 679807e on April 18, 2022 03:16
@cheimu (Member) commented Apr 18, 2022:

Will other job controllers also need to watch podGroup? It seems like others also have this problem.

@zw0610 (Member) left a comment:

Please also change the title of this pull request so other developers can understand this change only applies to PyTorchJob.

type SuccessPolicy string

const (
	SuccessPolicyDefault SuccessPolicy = ""
Member: What does this default SuccessPolicy refer to?

@qiankunli (Contributor, author): OK, I will add a comment.

const (
	SuccessPolicyDefault    SuccessPolicy = ""           // the job succeeds once worker 0 succeeds
	SuccessPolicyAllWorkers SuccessPolicy = "AllWorkers" // the job succeeds only if all worker pods succeed
)
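For illustration, a minimal sketch (not the PR's code; the helper name and counters are assumptions) of how a reconciler could evaluate the two policies:

```go
// jobSucceeded is a hypothetical helper showing the semantics of the two
// policies: succeeded/total count worker pods, worker0Completed tracks rank 0.
func jobSucceeded(policy SuccessPolicy, worker0Completed bool, succeeded, total int32) bool {
	switch policy {
	case SuccessPolicyAllWorkers:
		// Succeed only once every worker pod has succeeded.
		return succeeded == total
	default:
		// SuccessPolicyDefault (""): worker 0 finishing marks the job succeeded.
		return worker0Completed
	}
}
```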


// EnvVarGenerator is the environment variable generator interface.
type EnvVarGenerator interface {
	Generate(job *pytorchv1.PyTorchJob) ([]corev1.EnvVar, error)
}

func setPodLabel(obj interface{}, podTemplateSpec *corev1.PodTemplateSpec, rtype, index string) error {
@zw0610 (Member) commented Apr 18, 2022:

This function name is not self-explanatory enough. Without looking into the function body, a reader may never know that it actually sets the volcano preemptable label on the PodTemplate in a PyTorchJob.

@qiankunli (Contributor, author): I just wanted to leave an extension point here. If there is a need to add a new label in the future, we can add it in this function.

@@ -489,6 +560,9 @@ func (r *PyTorchJobReconciler) UpdateJobStatusInApiServer(job interface{}, jobSt

// SetClusterSpec sets the cluster spec and init container for the pod
func (r *PyTorchJobReconciler) SetClusterSpec(job interface{}, podTemplate *corev1.PodTemplateSpec, rtype, index string) error {
	if err := setPodLabel(job, podTemplate, rtype, index); err != nil {
Member: So no matter whether this PyTorchJob is elastic or not, the volcano preemptable label will always be injected?

// Leave a succeeded condition for the following two cases:
// 1. If default success policy is used and worker 0 has completed.
// 2. If `SuccessPolicyAllWorkers` success policy is used and all workers are succeeded.
if expected == 0 || (worker0Completed && *pytorchjob.Spec.ElasticPolicy.SuccessPolicy != pytorchv1.SuccessPolicyAllWorkers) {
Member: What if ElasticPolicy is nil?

@qiankunli (Contributor, author): In tf-operator/pkg/apis/pytorch/v1/defaults.go, I added code to make sure that ElasticPolicy is not nil.

@zw0610 (Member) commented Apr 20, 2022:

Sorry, I cannot find the exact code. Could you point out which line in defaults.go ensures ElasticPolicy is not nil?
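For reference, a hypothetical sketch of what such a defaulting guard could look like; the function name and struct literals are assumptions, not the PR's exact code (the commit history later adds an explicit "check elasticPolicy nil" instead):

```go
// setDefaultElasticPolicy ensures ElasticPolicy and its SuccessPolicy pointer
// are non-nil before the reconciler dereferences them (hypothetical sketch).
func setDefaultElasticPolicy(job *PyTorchJob) {
	if job.Spec.ElasticPolicy == nil {
		job.Spec.ElasticPolicy = &ElasticPolicy{}
	}
	if job.Spec.ElasticPolicy.SuccessPolicy == nil {
		policy := SuccessPolicyDefault
		job.Spec.ElasticPolicy.SuccessPolicy = &policy
	}
}
```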

@@ -428,6 +450,8 @@ func (r *PyTorchJobReconciler) UpdateJobStatus(job interface{},
return err
}
trainingoperatorcommon.RestartedJobsCounterInc(pytorchjob.Namespace, pytorchv1.FrameworkName)
} else if running >= *pytorchjob.Spec.ElasticPolicy.MinReplicas && *pytorchjob.Spec.ElasticPolicy.FailurePolicy == pytorchv1.FailurePolicyByMinReplicas {
Member: Same question: what if ElasticPolicy is nil?
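For context, the diff implies a FailurePolicy type mirroring SuccessPolicy. A hypothetical sketch follows; only FailurePolicyByMinReplicas appears in the diff, so the other names and the string value are assumptions:

```go
type FailurePolicy string

const (
	// FailurePolicyDefault: any failed worker fails the job (assumed default).
	FailurePolicyDefault FailurePolicy = ""
	// FailurePolicyByMinReplicas: the job keeps running as long as at least
	// ElasticPolicy.MinReplicas workers are still running.
	FailurePolicyByMinReplicas FailurePolicy = "ByMinReplicas"
)
```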


// getPodSlices returns a slice whose elements are slices of pods.
// It gives the caller enough information to decide whether to scale resources up or down.
func (p *PyTorchJobReconciler) getPodSlices(
Member: This function name is not self-explanatory, as it gets pod slices only for replica type Worker rather than all pods of this job.

ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/envtest"
"sigs.k8s.io/controller-runtime/pkg/envtest/printer"
logf "sigs.k8s.io/controller-runtime/pkg/log"
"sigs.k8s.io/controller-runtime/pkg/log/zap"
"testing"
Member: The imports are not in order ("testing" belongs with the standard-library group).

@qiankunli changed the title from "support successPolicy and failurePolicy" to "support successPolicy and failurePolicy on pytorchjob" on Apr 20, 2022
}
if pytorchjob.Spec.PyTorchReplicaSpecs[pytorchv1.PyTorchReplicaTypeMaster] != nil {
	if rtype == strings.ToLower(string(pytorchv1.PyTorchReplicaTypeMaster)) {
		podTemplateSpec.Labels[volcanov1beta1.PodPreemptable] = "false"
Member: Could podTemplateSpec.Labels be nil?
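A common defensive pattern for the reviewer's concern, sketched here rather than quoted from the PR:

```go
// Initialize the label map before writing to it; writing into a nil map panics.
if podTemplateSpec.Labels == nil {
	podTemplateSpec.Labels = map[string]string{}
}
podTemplateSpec.Labels[volcanov1beta1.PodPreemptable] = "false"
```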

@qiankunli (Contributor, author) commented Apr 20, 2022:

> Will other job controllers also need to watch podGroup? It seems like others also have this problem.

@cheimu No, it only works on PyTorchJob for now; TFJob and MPIJob may use other elastic mechanisms, and it will take time to decide whether to move elasticPolicy into kubeflow/common.

@qiankunli force-pushed the feature/success_failure_policy branch from 679807e to faa78d9 on April 20, 2022 07:56
@cheimu (Member) commented Apr 20, 2022:

> Will other job controllers also need to watch podGroup? It seems like others also have this problem.
>
> @cheimu No, it only works on PyTorchJob for now; TFJob and MPIJob may use other elastic mechanisms, and it will take time to decide whether to move elasticPolicy into kubeflow/common.

Oh, I mean the bug where a pending PodGroup leaves the job with no pods. I think this problem also exists in other controllers.

@qiankunli force-pushed the feature/success_failure_policy branch 2 times, most recently from 86bb1ae to 2f32ce4 on April 20, 2022 08:30
@qiankunli (Contributor, author) commented Apr 20, 2022:

@cheimu It would be better to discuss this in a community meeting and make changes once everyone agrees. @zw0610

@qiankunli force-pushed the feature/success_failure_policy branch from 2f32ce4 to a5dac12 on April 20, 2022 09:48
@@ -207,6 +210,18 @@ func (r *PyTorchJobReconciler) SetupWithManager(mgr ctrl.Manager) error {
return err
}

// inject a watch for the job's related PodGroups
Member: Is adding PodGroup to the watch list a must for this pull request?
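For context (the commit history below shows this watch was later removed), a sketch of how such a PodGroup watch could be injected. It assumes the pre-v0.15 controller-runtime watch signatures of that era and that the PodGroup carries the PyTorchJob as its controller owner; the function name is illustrative, not the PR's code:

```go
import (
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/source"

	pytorchv1 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1"
	volcanov1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

// watchPodGroups re-enqueues the owning PyTorchJob whenever one of its
// PodGroups changes, e.g. when a pending PodGroup is admitted by volcano.
func watchPodGroups(c controller.Controller) error {
	return c.Watch(
		&source.Kind{Type: &volcanov1beta1.PodGroup{}},
		&handler.EnqueueRequestForOwner{
			OwnerType:    &pytorchv1.PyTorchJob{},
			IsController: true,
		},
	)
}
```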

@qiankunli force-pushed the feature/success_failure_policy branch from a5dac12 to 552e42b on April 20, 2022 12:46
The google-oss-prow bot added the size/XL label and removed size/L on Apr 20, 2022
@coveralls commented Apr 20, 2022:

Pull Request Test Coverage Report for Build 3407247525

  • 51 of 108 (47.22%) changed or added relevant lines in 6 files are covered.
  • 16 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.3%) to 40.073%

Changes missing coverage:

| File | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/common/util/util.go | 0 | 8 | 0.0% |
| pkg/apis/kubeflow.org/v1/zz_generated.deepcopy.go | 0 | 10 | 0.0% |
| pkg/controller.v1/pytorch/label.go | 17 | 27 | 62.96% |
| pkg/apis/kubeflow.org/v1/openapi_generated.go | 0 | 12 | 0.0% |
| pkg/controller.v1/pytorch/pytorchjob_controller.go | 26 | 43 | 60.47% |

Files with coverage reduction:

| File | New Missed Lines | % |
| --- | --- | --- |
| pkg/controller.v1/mpi/mpijob_controller.go | 7 | 77.58% |
| pkg/controller.v1/pytorch/pytorchjob_controller.go | 9 | 60.19% |

Totals:

  • Change from base Build 3392246605: +0.3%
  • Covered Lines: 2416
  • Relevant Lines: 6029

💛 - Coveralls

@zw0610 (Member) commented Apr 20, 2022:

/ok-to-test
/hold

@qiankunli force-pushed the feature/success_failure_policy branch 2 times, most recently from fe8d90f to 33558a7 on April 20, 2022 13:40
@zw0610 (Member) commented Apr 22, 2022:

/retest

@gaocegege (Member) commented:

Summarizing 1 Failure:

[Fail] TFJob controller Test Status [It] should update TFJob with desired status
/home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/status_test.go:458

Ran 15 of 15 Specs in 20.956 seconds
FAIL! -- 14 Passed | 1 Failed | 0 Pending | 0 Skipped

@gaocegege (Member) commented:

A unit test case failed.

@qiankunli force-pushed the feature/success_failure_policy branch 2 times, most recently from 0ef0931 to 3642e74 on April 27, 2022 13:16
@zw0610 (Member) commented Apr 28, 2022:

/retest

@Jeffwan (Member) commented Apr 29, 2022:

(screenshot of the CI failure omitted)
Please update the cluster version in the CI scripts.

@Jeffwan (Member) commented Apr 29, 2022:

BTW, are you aware of the AWS testing infra deprecation? Cost support might stop soon, and we will need to migrate the testing infra. @kubeflow/wg-training-leads

@zw0610 (Member) commented Apr 29, 2022:

/retest

@terrytangyuan (Member) left a comment:

Great work! Are you planning to support this in other jobs as well?

@zw0610 (Member) commented Apr 30, 2022:

/retest

@aws-kf-ci-bot (Contributor) commented:

@qiankunli: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| kubeflow-training-operator-presubmit | 3642e74 | link | /test kubeflow-training-operator-presubmit |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@zw0610 (Member) commented Apr 30, 2022:

@Jeffwan It seems the EKS Kubernetes version in the tests is still 1.18. Could you take a look?

@Jeffwan (Member) commented May 3, 2022:

I cut PR #1582 to fix the problem. Once it's merged, please rebase onto upstream master and update this PR.

@qiankunli force-pushed the feature/success_failure_policy branch from 3642e74 to e588c04 on June 15, 2022 04:34
@qiankunli force-pushed the feature/success_failure_policy branch from e588c04 to 200e513 on July 19, 2022 05:02
@google-oss-prow bot commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: qiankunli
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval by writing /assign @gaocegege in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@qiankunli force-pushed the feature/success_failure_policy branch from df86f71 to 200e513 on October 25, 2022 11:49
@qiankunli force-pushed the feature/success_failure_policy branch from 200e513 to a951633 on November 7, 2022 02:25
Commit history (each commit signed off by qiankunli <[email protected]>):

- (earlier commit, title not shown)
- run codegen
- support watch pg and preemptable label
- fix test case
- fix test case
- fix test case
- refactor
- fix make
- fix test
- add corev1 schema
- add podgroups crd
- remove watch podgroups
- fix test
- fix test
- check elasticPolicy nil
- refactor label.go
- fix python sdk
@qiankunli force-pushed the feature/success_failure_policy branch from a951633 to aab7042 on November 7, 2022 03:09
@github-actions bot commented:

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions bot commented Oct 5, 2023:

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The github-actions bot closed this on Oct 5, 2023.
Successfully merging this pull request may close this issue: trainning-operator may need to monitor PodGroup (#1574)