-
Notifications
You must be signed in to change notification settings - Fork 671
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Evaluator in Kubeflow TensorFlow Training Operator #4168
Support Evaluator in Kubeflow TensorFlow Training Operator #4168
Conversation
Signed-off-by: Future Outlier <[email protected]>
Signed-off-by: Future Outlier <[email protected]>
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #4168 +/- ##
==========================================
+ Coverage 59.00% 59.98% +0.98%
==========================================
Files 619 534 -85
Lines 52827 39205 -13622
==========================================
- Hits 31170 23519 -7651
+ Misses 19173 13398 -5775
+ Partials 2484 2288 -196
Flags with carried forward coverage won't be shown. Click here to find out more.
☔ View full report in Codecov by Sentry. |
… kf-operator-evaluator Signed-off-by: Future Outlier <[email protected]>
Signed-off-by: Future Outlier <[email protected]>
DistributedTensorflowTrainingReplicaSpec evaluator_replicas = 4; | ||
|
||
// RunPolicy encapsulates various runtime policies of the distributed training | ||
// job, for example how to clean up resources and how long the job can stay | ||
// active. | ||
RunPolicy run_policy = 4; | ||
RunPolicy run_policy = 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not change the type of an existing field in a protobuf message as per https://protobuf.dev/programming-guides/dos-donts/ as that breaks backwards compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks really much!
This is pretty useful information, thanks a lot for your time.
I will improve it.
flyteplugins/go/tasks/plugins/k8s/kfoperators/common/common_operator_test.go
Outdated
Show resolved
Hide resolved
… kf-operator-evaluator
Signed-off-by: Future Outlier <[email protected]>
… kf-operator-evaluator
Signed-off-by: Future Outlier <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks
I've asked Linkedin software engineer @yubofredwang about the PR, he said that it is great! |
Describe your changes
Enable running a data service by utilizing the evaluator section in the TF_CONFIG to configure data service worker information, as discussed in this Slack conversation.
The use case previously doesn't include the evaluator section, so we have to give it a default value so that we can take the case into account.
I test it in two ways, by specifying the Dockerfile or using ImageSpec.
Dockerfile
Use the code below
Run it to flyte-console by this command
ImageSpec
Screenshot
Dockerfile
ImageSpec
Kubeflow Training Operator Pods
Tracking issue
#4167
flyteorg/flytekit#1870