support successPolicy and failurePolicy on pytorchjob #1575
Conversation
Hi @qiankunli. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
(force-pushed from c50681b to 679807e)
Will other job controllers also need to watch podGroup? It seems like others also have this problem.
Please also change the title of this pull request so other developers can understand this change only applies to PyTorchJob.
pkg/apis/pytorch/v1/types.go (Outdated)

type SuccessPolicy string

const (
	SuccessPolicyDefault SuccessPolicy = ""
What does this default SuccessPolicy refer to?
OK, I will add comments:

const (
	SuccessPolicyDefault    SuccessPolicy = ""           // if worker 0 succeeds, the job is marked succeeded
	SuccessPolicyAllWorkers SuccessPolicy = "AllWorkers" // the job is marked succeeded only if all worker pods succeed
)
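For readers skimming the thread, a minimal sketch of how these two policies could be evaluated (jobSucceeded is a hypothetical helper for illustration, not the PR's actual implementation):

// jobSucceeded reports whether the job should be marked succeeded under the
// given policy. worker0Completed, succeeded, and total are assumed to be
// computed elsewhere by the controller.
func jobSucceeded(policy SuccessPolicy, worker0Completed bool, succeeded, total int32) bool {
	switch policy {
	case SuccessPolicyAllWorkers:
		// Succeed only once every worker pod has succeeded.
		return succeeded == total
	default: // SuccessPolicyDefault ("")
		// Succeed as soon as worker 0 completes.
		return worker0Completed
	}
}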
pkg/controller.v1/pytorch/envvar.go (Outdated)

)

// EnvVarGenerator is the environment variable generator interface.
type EnvVarGenerator interface {
	Generate(job *pytorchv1.PyTorchJob) ([]corev1.EnvVar, error)
}

func setPodLabel(obj interface{}, podTemplateSpec *corev1.PodTemplateSpec, rtype, index string) error {
This function name does not seem self-explanatory enough. Without looking into the function body, a reader may never know that it actually sets the volcano preemptable labels on the PodTemplate of a PyTorchJob.
By the way, I just want to reserve an extension point here. If there is a need to add a new label in the future, we can add it here.
@@ -489,6 +560,9 @@ func (r *PyTorchJobReconciler) UpdateJobStatusInApiServer(job interface{}, jobSt

// SetClusterSpec sets the cluster spec and init container for the pod
func (r *PyTorchJobReconciler) SetClusterSpec(job interface{}, podTemplate *corev1.PodTemplateSpec, rtype, index string) error {
	if err := setPodLabel(job, podTemplate, rtype, index); err != nil {
So that means the volcano preemptable labels will always be injected, no matter whether this PyTorchJob is elastic or not?
// Leave a succeeded condition for the following two cases:
// 1. If the default success policy is used and worker 0 has completed.
// 2. If the `SuccessPolicyAllWorkers` success policy is used and all workers have succeeded.
if expected == 0 || (worker0Completed && *pytorchjob.Spec.ElasticPolicy.SuccessPolicy != pytorchv1.SuccessPolicyAllWorkers) {
What if ElasticPolicy is nil?
In tf-operator/pkg/apis/pytorch/v1/defaults.go, I added code to make sure that ElasticPolicy is never nil.
Sorry, I cannot find the exact code. Could you point out which line in defaults.go ensures that ElasticPolicy is not nil?
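For context, defaulting logic that guarantees a non-nil ElasticPolicy might look like the following sketch; the function name and placement are illustrative, not necessarily what defaults.go actually contains:

// setElasticPolicyDefaults is a hypothetical defaulting hook: it fills in
// ElasticPolicy and its SuccessPolicy so downstream code can dereference
// them without nil checks.
func setElasticPolicyDefaults(job *PyTorchJob) {
	if job.Spec.ElasticPolicy == nil {
		job.Spec.ElasticPolicy = &ElasticPolicy{}
	}
	if job.Spec.ElasticPolicy.SuccessPolicy == nil {
		policy := SuccessPolicyDefault
		job.Spec.ElasticPolicy.SuccessPolicy = &policy
	}
}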
@@ -428,6 +450,8 @@ func (r *PyTorchJobReconciler) UpdateJobStatus(job interface{},
	return err
}
trainingoperatorcommon.RestartedJobsCounterInc(pytorchjob.Namespace, pytorchv1.FrameworkName)
} else if running >= *pytorchjob.Spec.ElasticPolicy.MinReplicas && *pytorchjob.Spec.ElasticPolicy.FailurePolicy == pytorchv1.FailurePolicyByMinReplicas {
Same question, what if ElasticPolicy is nil?
// getPodSlices returns a slice whose elements are slices of pods.
// It gives the caller enough information to decide whether to scale resources up or down.
func (p *PyTorchJobReconciler) getPodSlices(
This function name does not seem self-explanatory: it gets the pod slices only for the Worker replica type, not all pods of this job.
ctrl "sigs.k8s.io/controller-runtime" | ||
"sigs.k8s.io/controller-runtime/pkg/client" | ||
"sigs.k8s.io/controller-runtime/pkg/envtest" | ||
"sigs.k8s.io/controller-runtime/pkg/envtest/printer" | ||
logf "sigs.k8s.io/controller-runtime/pkg/log" | ||
"sigs.k8s.io/controller-runtime/pkg/log/zap" | ||
"testing" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The imports are not in order.
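For reference, the conventional gofmt grouping (standard library first, then third-party packages, separated by a blank line) would be:

import (
	"testing"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
	"sigs.k8s.io/controller-runtime/pkg/envtest/printer"
	logf "sigs.k8s.io/controller-runtime/pkg/log"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)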
pkg/controller.v1/pytorch/envvar.go (Outdated)

}
if pytorchjob.Spec.PyTorchReplicaSpecs[pytorchv1.PyTorchReplicaTypeMaster] != nil {
	if rtype == strings.ToLower(string(pytorchv1.PyTorchReplicaTypeMaster)) {
		podTemplateSpec.Labels[volcanov1beta1.PodPreemptable] = "false"
Could podTemplateSpec.Labels be nil?
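If it can be, a defensive guard along these lines would avoid a nil-map panic (a sketch, not necessarily the PR's final code):

// Initialize the label map before writing to it; assigning into a nil map
// panics in Go.
if podTemplateSpec.Labels == nil {
	podTemplateSpec.Labels = make(map[string]string)
}
podTemplateSpec.Labels[volcanov1beta1.PodPreemptable] = "false"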
@cheimu No, it only works on PyTorchJob for now. TFJob and MPIJob may use other elastic mechanisms; it will take time to see whether the elasticPolicy should move into kubeflow/common.
(force-pushed from 679807e to faa78d9)
Oh, I mean the bug where a pending podgroup leaves the job with no pods. I think this problem also exists in other controllers.
(force-pushed from 86bb1ae to 2f32ce4)
(force-pushed from 2f32ce4 to a5dac12)
@@ -207,6 +210,18 @@ func (r *PyTorchJobReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return err
}

// inject watching for job related podgroup
Is adding PodGroup to the watch list a must for this pull request?
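For context, wiring a PodGroup watch with controller-runtime might look roughly like the sketch below (assuming the volcano v1beta1 API and an owner-based event handler; the exact builder calls depend on the controller-runtime version in use):

// Re-reconcile the owning PyTorchJob whenever one of its PodGroups changes.
return ctrl.NewControllerManagedBy(mgr).
	For(&pytorchv1.PyTorchJob{}).
	Owns(&corev1.Pod{}).
	Watches(&source.Kind{Type: &volcanov1beta1.PodGroup{}},
		&handler.EnqueueRequestForOwner{
			OwnerType:    &pytorchv1.PyTorchJob{},
			IsController: true,
		}).
	Complete(r)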
(force-pushed from a5dac12 to 552e42b)
Pull Request Test Coverage Report for Build 3407247525
💛 - Coveralls
/ok-to-test
(force-pushed from 552e42b to 8e9f794)
(force-pushed from fe8d90f to 33558a7)
/retest
Summarizing 1 Failure:
[Fail] TFJob controller Test Status [It] should update TFJob with desired status
Ran 15 of 15 Specs in 20.956 seconds
A unit test case failed.
(force-pushed from 0ef0931 to 3642e74)
/retest
BTW, are you aware of the AWS testing infra deprecation? Cost support might stop soon, and we need to migrate the testing infra. @kubeflow/wg-training-leads
/retest
Great work! Are you planning to support this in other jobs as well?
/retest
@qiankunli: The following test failed, say /retest to rerun all failed tests:

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@Jeffwan It seems the EKS Kubernetes version in the tests is still 1.18. Could you take a look?
I cut PR #1582 to fix the problem. Once it's merged, please rebase on upstream master and update the PR.
(force-pushed from 3642e74 to e588c04)
(force-pushed from e588c04 to 200e513)
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull request has been approved by: qiankunli
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
(force-pushed from df86f71 to 200e513)
(force-pushed from 200e513 to a951633)
Commit messages (each signed off by qiankunli <[email protected]>):
run codegen
support watch pg and preemptable label
fix test case
fix test case
fix test case
refactor
fix make
fix test
add corev1 schema
add podgroups crd
remove watch podgroups
fix test
fix test
check elasticPolicy nil
refactor label.go
fix python sdk
(force-pushed from a951633 to aab7042)
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This PR sets the volcano.sh/preemptable label on pods. Related volcano PR: support elastic annotation in preempt/reclaim plugin (volcano-sh/volcano#2105).