
support successPolicy and failurePolicy on pytorchjob #1575

Closed


@qiankunli (Contributor) commented Apr 17, 2022:

  1. Support successPolicy and failurePolicy; related issue: support elastic scheduling for pytorch #1571
  2. Support adding the volcano.sh/preemptable label on pods; related volcano PR: support elastic annotation in preempt/reclaim plugin volcano-sh/volcano#2105
  3. Support watching PodGroup; fixes: trainning-operator may need to monitor PodGroup #1574 (deleted)

@aws-kf-ci-bot (Contributor) commented:

Hi @qiankunli. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@qiankunli force-pushed the feature/success_failure_policy branch 3 times, most recently from c50681b to 679807e on April 18, 2022 03:16
@cheimu (Member) commented Apr 18, 2022:

Will other job controllers also need to watch podGroup? It seems like others also have this problem.

@zw0610 (Member) left a comment:

Please also change the title of this pull request so other developers can understand this change only applies to PyTorchJob.

type SuccessPolicy string

const (
	SuccessPolicyDefault SuccessPolicy = ""
Member: What does this default SuccessPolicy refer to?

@qiankunli (Contributor, author): OK, I will add a comment.

const (
	SuccessPolicyDefault    SuccessPolicy = ""           // the job succeeds once worker 0 succeeds
	SuccessPolicyAllWorkers SuccessPolicy = "AllWorkers" // the job succeeds only if all worker pods succeed
)
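For illustration, a minimal sketch (not the PR's code; the helper name and counters are assumptions) of how a reconciler could evaluate the two policies:

```go
// jobSucceeded is a hypothetical helper showing the semantics of the two
// policies: succeeded/total count worker pods, worker0Completed tracks rank 0.
func jobSucceeded(policy SuccessPolicy, worker0Completed bool, succeeded, total int32) bool {
	switch policy {
	case SuccessPolicyAllWorkers:
		// Succeed only once every worker pod has succeeded.
		return succeeded == total
	default:
		// SuccessPolicyDefault (""): worker 0 finishing marks the job succeeded.
		return worker0Completed
	}
}
```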


// EnvVarGenerator is the environment variable generator interface.
type EnvVarGenerator interface {
	Generate(job *pytorchv1.PyTorchJob) ([]corev1.EnvVar, error)
}

func setPodLabel(obj interface{}, podTemplateSpec *corev1.PodTemplateSpec, rtype, index string) error {
@zw0610 (Member) commented Apr 18, 2022:

This function name is not self-explanatory enough. Without looking into the function body, a reader may never know that it actually sets the volcano preemptable label on the PodTemplate in a PyTorchJob.

@qiankunli (Contributor, author): I just wanted to leave an extension point here. If there is a need to add a new label in the future, we can add it in this function.

@@ -489,6 +560,9 @@ func (r *PyTorchJobReconciler) UpdateJobStatusInApiServer(job interface{}, jobSt

// SetClusterSpec sets the cluster spec and init container for the pod
func (r *PyTorchJobReconciler) SetClusterSpec(job interface{}, podTemplate *corev1.PodTemplateSpec, rtype, index string) error {
	if err := setPodLabel(job, podTemplate, rtype, index); err != nil {
Member: So no matter whether this PyTorchJob is elastic or not, the volcano preemptable label will always be injected?

// Leave a succeeded condition for the following two cases:
// 1. If default success policy is used and worker 0 has completed.
// 2. If `SuccessPolicyAllWorkers` success policy is used and all workers are succeeded.
if expected == 0 || (worker0Completed && *pytorchjob.Spec.ElasticPolicy.SuccessPolicy != pytorchv1.SuccessPolicyAllWorkers) {
Member: What if ElasticPolicy is nil?

@qiankunli (Contributor, author): In tf-operator/pkg/apis/pytorch/v1/defaults.go, I added code to make sure that ElasticPolicy is not nil.

@zw0610 (Member) commented Apr 20, 2022:

Sorry, I cannot find the exact code. Could you point out which line in defaults.go ensures ElasticPolicy is not nil?
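For reference, a hypothetical sketch of what such a defaulting guard could look like; the function name and struct literals are assumptions, not the PR's exact code (the commit history later adds an explicit "check elasticPolicy nil" instead):

```go
// setDefaultElasticPolicy ensures ElasticPolicy and its SuccessPolicy pointer
// are non-nil before the reconciler dereferences them (hypothetical sketch).
func setDefaultElasticPolicy(job *PyTorchJob) {
	if job.Spec.ElasticPolicy == nil {
		job.Spec.ElasticPolicy = &ElasticPolicy{}
	}
	if job.Spec.ElasticPolicy.SuccessPolicy == nil {
		policy := SuccessPolicyDefault
		job.Spec.ElasticPolicy.SuccessPolicy = &policy
	}
}
```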

@@ -428,6 +450,8 @@ func (r *PyTorchJobReconciler) UpdateJobStatus(job interface{},
return err
}
trainingoperatorcommon.RestartedJobsCounterInc(pytorchjob.Namespace, pytorchv1.FrameworkName)
} else if running >= *pytorchjob.Spec.ElasticPolicy.MinReplicas && *pytorchjob.Spec.ElasticPolicy.FailurePolicy == pytorchv1.FailurePolicyByMinReplicas {
Member: Same question: what if ElasticPolicy is nil?
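For context, the diff implies a FailurePolicy type mirroring SuccessPolicy. A hypothetical sketch follows; only FailurePolicyByMinReplicas appears in the diff, so the other names and the string value are assumptions:

```go
type FailurePolicy string

const (
	// FailurePolicyDefault: any failed worker fails the job (assumed default).
	FailurePolicyDefault FailurePolicy = ""
	// FailurePolicyByMinReplicas: the job keeps running as long as at least
	// ElasticPolicy.MinReplicas workers are still running.
	FailurePolicyByMinReplicas FailurePolicy = "ByMinReplicas"
)
```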


// getPodSlices returns a slice whose elements are slices of pods.
// It gives the caller enough information to decide whether to scale resources up or down.
func (p *PyTorchJobReconciler) getPodSlices(
Member: This function name is not self-explanatory, as it gets pod slices only for replica type Worker rather than all pods of this job.

ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/envtest"
"sigs.k8s.io/controller-runtime/pkg/envtest/printer"
logf "sigs.k8s.io/controller-runtime/pkg/log"
"sigs.k8s.io/controller-runtime/pkg/log/zap"
"testing"
Member: The imports are not in order ("testing" belongs with the standard-library group).

@qiankunli changed the title from "support successPolicy and failurePolicy" to "support successPolicy and failurePolicy on pytorchjob" on Apr 20, 2022
}
if pytorchjob.Spec.PyTorchReplicaSpecs[pytorchv1.PyTorchReplicaTypeMaster] != nil {
	if rtype == strings.ToLower(string(pytorchv1.PyTorchReplicaTypeMaster)) {
		podTemplateSpec.Labels[volcanov1beta1.PodPreemptable] = "false"
Member: Could podTemplateSpec.Labels be nil?
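A common defensive pattern for the reviewer's concern, sketched here rather than quoted from the PR:

```go
// Initialize the label map before writing to it; writing into a nil map panics.
if podTemplateSpec.Labels == nil {
	podTemplateSpec.Labels = map[string]string{}
}
podTemplateSpec.Labels[volcanov1beta1.PodPreemptable] = "false"
```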

@qiankunli (Contributor, author) commented Apr 20, 2022:

> Will other job controllers also need to watch podGroup? It seems like others also have this problem.

@cheimu No, it only works on PyTorchJob for now; TFJob and MPIJob may use other elastic mechanisms, and it will take time to decide whether to move elasticPolicy into kubeflow/common.

@qiankunli force-pushed the feature/success_failure_policy branch from 679807e to faa78d9 on April 20, 2022 07:56
@cheimu (Member) commented Apr 20, 2022:

> Will other job controllers also need to watch podGroup? It seems like others also have this problem.
>
> @cheimu No, it only works on PyTorchJob for now; TFJob and MPIJob may use other elastic mechanisms, and it will take time to decide whether to move elasticPolicy into kubeflow/common.

Oh, I mean the bug where a pending PodGroup leaves the job with no pods. I think this problem also exists in other controllers.

@qiankunli force-pushed the feature/success_failure_policy branch 2 times, most recently from 86bb1ae to 2f32ce4 on April 20, 2022 08:30
@qiankunli (Contributor, author) commented Apr 20, 2022:

@cheimu It would be better to discuss this in a community meeting and make changes once everyone agrees. @zw0610

@qiankunli force-pushed the feature/success_failure_policy branch from 2f32ce4 to a5dac12 on April 20, 2022 09:48
@@ -207,6 +210,18 @@ func (r *PyTorchJobReconciler) SetupWithManager(mgr ctrl.Manager) error {
return err
}

// inject a watch for the job's related PodGroups
Member: Is adding PodGroup to the watch list a must for this pull request?
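For context (the commit history below shows this watch was later removed), a sketch of how such a PodGroup watch could be injected. It assumes the pre-v0.15 controller-runtime watch signatures of that era and that the PodGroup carries the PyTorchJob as its controller owner; the function name is illustrative, not the PR's code:

```go
import (
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/source"

	pytorchv1 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1"
	volcanov1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

// watchPodGroups re-enqueues the owning PyTorchJob whenever one of its
// PodGroups changes, e.g. when a pending PodGroup is admitted by volcano.
func watchPodGroups(c controller.Controller) error {
	return c.Watch(
		&source.Kind{Type: &volcanov1beta1.PodGroup{}},
		&handler.EnqueueRequestForOwner{
			OwnerType:    &pytorchv1.PyTorchJob{},
			IsController: true,
		},
	)
}
```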

@qiankunli force-pushed the feature/success_failure_policy branch from a5dac12 to 552e42b on April 20, 2022 12:46
The google-oss-prow bot added the size/XL label and removed size/L on Apr 20, 2022
@coveralls commented Apr 20, 2022:

Pull Request Test Coverage Report for Build 3407247525

  • 51 of 108 (47.22%) changed or added relevant lines in 6 files are covered.
  • 16 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.3%) to 40.073%

Changes missing coverage:

| File | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/common/util/util.go | 0 | 8 | 0.0% |
| pkg/apis/kubeflow.org/v1/zz_generated.deepcopy.go | 0 | 10 | 0.0% |
| pkg/controller.v1/pytorch/label.go | 17 | 27 | 62.96% |
| pkg/apis/kubeflow.org/v1/openapi_generated.go | 0 | 12 | 0.0% |
| pkg/controller.v1/pytorch/pytorchjob_controller.go | 26 | 43 | 60.47% |

Files with coverage reduction:

| File | New Missed Lines | % |
| --- | --- | --- |
| pkg/controller.v1/mpi/mpijob_controller.go | 7 | 77.58% |
| pkg/controller.v1/pytorch/pytorchjob_controller.go | 9 | 60.19% |

Totals:

  • Change from base Build 3392246605: +0.3%
  • Covered Lines: 2416
  • Relevant Lines: 6029

💛 - Coveralls

@zw0610 (Member) commented Apr 20, 2022:

/ok-to-test
/hold

@qiankunli force-pushed the feature/success_failure_policy branch 2 times, most recently from fe8d90f to 33558a7 on April 20, 2022 13:40
@zw0610 (Member) commented Apr 22, 2022:

/retest

@gaocegege (Member) commented:

Summarizing 1 Failure:

[Fail] TFJob controller Test Status [It] should update TFJob with desired status
/home/runner/work/training-operator/training-operator/go/src/github.com/kubeflow/training-operator/pkg/controller.v1/tensorflow/status_test.go:458

Ran 15 of 15 Specs in 20.956 seconds
FAIL! -- 14 Passed | 1 Failed | 0 Pending | 0 Skipped

@gaocegege (Member) commented:

A unit test case failed.

@qiankunli force-pushed the feature/success_failure_policy branch 2 times, most recently from 0ef0931 to 3642e74 on April 27, 2022 13:16
@zw0610 (Member) commented Apr 28, 2022:

/retest

@Jeffwan (Member) commented Apr 29, 2022:

(screenshot of the CI failure omitted)
Please update the cluster version in the CI scripts.

@Jeffwan (Member) commented Apr 29, 2022:

BTW, are you aware of the AWS testing infra deprecation? Cost support might stop soon, and we will need to migrate the testing infra. @kubeflow/wg-training-leads

@zw0610 (Member) commented Apr 29, 2022:

/retest

@terrytangyuan (Member) left a comment:

Great work! Are you planning to support this in other jobs as well?

@zw0610 (Member) commented Apr 30, 2022:

/retest

@aws-kf-ci-bot (Contributor) commented:

@qiankunli: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| kubeflow-training-operator-presubmit | 3642e74 | link | /test kubeflow-training-operator-presubmit |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@zw0610 (Member) commented Apr 30, 2022:

@Jeffwan It seems the EKS Kubernetes version in the tests is still 1.18. Could you take a look?

@Jeffwan (Member) commented May 3, 2022:

I cut PR #1582 to fix the problem. Once it's merged, please rebase onto upstream master and update this PR.

@qiankunli force-pushed the feature/success_failure_policy branch from 3642e74 to e588c04 on June 15, 2022 04:34
@qiankunli force-pushed the feature/success_failure_policy branch from e588c04 to 200e513 on July 19, 2022 05:02
@google-oss-prow bot commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: qiankunli
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval by writing /assign @gaocegege in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@qiankunli force-pushed the feature/success_failure_policy branch from df86f71 to 200e513 on October 25, 2022 11:49
@qiankunli force-pushed the feature/success_failure_policy branch from 200e513 to a951633 on November 7, 2022 02:25
Commit history (each commit signed off by qiankunli <[email protected]>):

- (earlier commit, title not shown)
- run codegen
- support watch pg and preemptable label
- fix test case
- fix test case
- fix test case
- refactor
- fix make
- fix test
- add corev1 schema
- add podgroups crd
- remove watch podgroups
- fix test
- fix test
- check elasticPolicy nil
- refactor label.go
- fix python sdk
@qiankunli force-pushed the feature/success_failure_policy branch from a951633 to aab7042 on November 7, 2022 03:09
@github-actions bot commented:

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions bot commented Oct 5, 2023:

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The github-actions bot closed this on Oct 5, 2023.
Successfully merging this pull request may close this issue: trainning-operator may need to monitor PodGroup (#1574)