Support Evaluator in Kubeflow TensorFlow Training Operator #4168

Future-Outlier · 2023-10-04T09:58:57Z

Describe your changes

Enable running a data service by utilizing the evaluator section in the TF_CONFIG to configure data service worker information, as discussed in this Slack conversation.

Previously, the use case did not include the evaluator section, so we give it a default value to cover that case.

I tested it in two ways: by specifying a Dockerfile and by using ImageSpec.
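As a rough illustration of what the evaluator section looks like from inside a pod, the sketch below parses a TF_CONFIG value and reads the evaluator addresses, falling back to an empty list when the section is missing. The host names, ports, and the `evaluator_addresses` helper are hypothetical and exist only for this example; the default value this PR actually adds lives in the plugin code, which is not shown here.

```python
import json
import os

# Hypothetical TF_CONFIG of the general shape injected into each replica.
# Host names and ports are made up for illustration.
example_tf_config = {
    "cluster": {
        "chief": ["tfjob-chief-0:2222"],
        "worker": ["tfjob-worker-0:2222"],
        "ps": ["tfjob-ps-0:2222"],
        "evaluator": ["tfjob-evaluator-0:2222"],
    },
    "task": {"type": "worker", "index": 0},
}


def evaluator_addresses(raw_tf_config: str) -> list:
    """Return the evaluator addresses, or an empty list if the section is absent."""
    config = json.loads(raw_tf_config) if raw_tf_config else {}
    # Assumed default for this sketch: no evaluator section means no evaluators.
    return config.get("cluster", {}).get("evaluator", [])


os.environ["TF_CONFIG"] = json.dumps(example_tf_config)
print(evaluator_addresses(os.environ["TF_CONFIG"]))
```

A job that never configured an evaluator would simply get an empty list back, which is the kind of case the default value is meant to handle.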

Dockerfile

FROM python:3.9-slim-buster
USER root
WORKDIR /root
ENV PYTHONPATH=/root
RUN apt-get update && apt-get install build-essential -y
RUN apt-get install git -y
# The following line is an example of how to install your modified plugins. In this case, it demonstrates how to install the 'deck' plugin.
# RUN pip install -U git+https://github.com/Yicheng-Lu-llll/flytekit.git@"demo#egg=flytekitplugins-deck-standard&subdirectory=plugins/flytekit-deck-standard" # replace with your own repo and branch
RUN pip install -U git+https://github.com/Future-Outlier/flytekit.git@98ddd542a02551a9a9eb122b98004d0d092abbe9#subdirectory=plugins/flytekit-kf-tensorflow

RUN pip install -U git+https://github.com/Future-Outlier/flyte.git@647b8f4eeeab1a65866d19fab13c416ed0e4a07f#subdirectory=flyteidl

RUN pip install -U git+https://github.com/Future-Outlier/flytekit.git@98ddd542a02551a9a9eb122b98004d0d092abbe9

Use the code below

from flytekit import Resources, task
from flytekitplugins.kftensorflow import PS, Chief, Evaluator, TfJob, Worker

task_config = TfJob(
    worker=Worker(replicas=1),
    chief=Chief(replicas=1),
    ps=PS(replicas=1),
    evaluator=Evaluator(replicas=1),
)


@task(
    task_config=task_config,
    cache=True,
    requests=Resources(cpu="1"),
    cache_version="1",
)
def my_tensorflow_task(x: int, y: str) -> int:
    return x


if __name__ == "__main__":
    print(my_tensorflow_task(x=10, y="hello"))

Run it remotely (the execution then appears in the Flyte console) with this command:

pyflyte run --remote --image futureoutlier/kubeflow:tfoperator-v2 \
kubeflow_tf_evaluator.py my_tensorflow_task --x 100 --y acc

ImageSpec

from flytekit import ImageSpec, Resources, task
from flytekitplugins.kftensorflow import PS, Chief, Evaluator, TfJob, Worker

kubeflow_plugin = "git+https://github.com/Future-Outlier/flytekit.git@98ddd542a02551a9a9eb122b98004d0d092abbe9#subdirectory=plugins/flytekit-kf-tensorflow"
kubeflow_idl = "git+https://github.com/Future-Outlier/flyte.git@e3d022ae86466632f0b8eeae80bc07441827e403#subdirectory=flyteidl"
flytekit = "git+https://github.com/Future-Outlier/flytekit.git@98ddd542a02551a9a9eb122b98004d0d092abbe9"

# base_image="futureoutlier/kubeflow:tfoperator-v2"
image_spec = ImageSpec(
    packages=[flytekit, kubeflow_idl, kubeflow_plugin],
    apt_packages=["git"],
    registry="futureoutlier",
)
# build-essential git
task_config = TfJob(
    worker=Worker(replicas=1),
    chief=Chief(replicas=1),
    ps=PS(replicas=1),
    evaluator=Evaluator(replicas=1),
)


@task(
    task_config=task_config,
    cache=True,
    requests=Resources(cpu="1"),
    cache_version="1",
    container_image=image_spec,
)
def my_tensorflow_task(x: int, y: str) -> int:
    return x


if __name__ == "__main__":
    print(my_tensorflow_task(x=10, y="hello"))

pyflyte run --remote kubeflow_tf_evaluator.py my_tensorflow_task --x 20231008 --y AMAZING

Screenshot

(Screenshots not included: Dockerfile run, ImageSpec run, Kubeflow Training Operator Pods.)

Tracking issue

#4167
flyteorg/flytekit#1870

I updated the documentation accordingly.
All new and existing tests passed.
All commits are signed-off.

Signed-off-by: Future Outlier <[email protected]>

codecov · 2023-10-04T21:18:44Z

Codecov Report

Attention: 11 lines in your changes are missing coverage. Please review.

Comparison is base (a18da03) 59.00% compared to head (755a60e) 59.98%.
Report is 2 commits behind head on master.

❗ Current head 755a60e differs from pull request most recent head e3d022a. Consider uploading reports for the commit e3d022a to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #4168      +/-   ##
==========================================
+ Coverage   59.00%   59.98%   +0.98%     
==========================================
  Files         619      534      -85     
  Lines       52827    39205   -13622     
==========================================
- Hits        31170    23519    -7651     
+ Misses      19173    13398    -5775     
+ Partials     2484     2288     -196

Flag	Coverage Δ
unittests	`?`


Files	Coverage Δ
...lugins/go/tasks/plugins/k8s/kfoperators/mpi/mpi.go	`74.49% <100.00%> (+2.51%)`	⬆️
...o/tasks/plugins/k8s/kfoperators/pytorch/pytorch.go	`79.22% <100.00%> (+1.32%)`	⬆️
...s/plugins/k8s/kfoperators/tensorflow/tensorflow.go	`78.69% <90.47%> (+3.69%)`	⬆️
.../plugins/k8s/kfoperators/common/common_operator.go	`64.55% <10.00%> (+1.42%)`	⬆️

... and 573 files with indirect coverage changes



eapolinario · 2023-10-06T18:44:48Z

flyteidl/protos/flyteidl/plugins/kubeflow/tensorflow.proto

+  DistributedTensorflowTrainingReplicaSpec evaluator_replicas = 4;
+
  // RunPolicy encapsulates various runtime policies of the distributed training
  // job, for example how to clean up resources and how long the job can stay
  // active.
-  RunPolicy run_policy = 4;
+  RunPolicy run_policy = 5;


We should not change the type of an existing field in a protobuf message, as that breaks backward compatibility (see https://protobuf.dev/programming-guides/dos-donts/). Here, field number 4 would change from RunPolicy to DistributedTensorflowTrainingReplicaSpec.

Future-Outlier · 2023-10-07

Thank you very much! This is very useful information, and I appreciate your time.
I will fix it.
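The compatibility concern in this review thread can be illustrated with a small sketch of the protobuf wire format, where a field is identified only by its number, never by its name. The code below is not the protobuf library; it hand-encodes a single length-delimited field to show how bytes written under the old numbering would be read under a renumbered schema. The schema dictionaries and the payload are hypothetical, and `APPEND_ONLY_SCHEMA` is only one possible backward-compatible layout, not necessarily the one this PR ended up with.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_message_field(field_number: int, payload: bytes) -> bytes:
    """Encode one length-delimited field (wire type 2), as used for nested messages."""
    tag = (field_number << 3) | 2
    return encode_varint(tag) + encode_varint(len(payload)) + payload


def read_field_number(data: bytes) -> int:
    """Read the field number from the leading tag (single-byte tags only, for brevity)."""
    return data[0] >> 3


# Hypothetical number-to-name maps standing in for compiled schemas.
OLD_SCHEMA = {4: "run_policy"}
RENUMBERED_SCHEMA = {4: "evaluator_replicas", 5: "run_policy"}
APPEND_ONLY_SCHEMA = {4: "run_policy", 5: "evaluator_replicas"}

# An old client serializes its RunPolicy under field number 4.
wire_bytes = encode_message_field(4, b"run-policy-bytes")
number = read_field_number(wire_bytes)

print(OLD_SCHEMA[number])          # run_policy
print(RENUMBERED_SCHEMA[number])   # evaluator_replicas
print(APPEND_ONLY_SCHEMA[number])  # run_policy
```

Under the renumbered schema, a reader would try to interpret the old RunPolicy bytes as an evaluator replica spec, whereas giving the new field a fresh number leaves existing messages readable.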

flyteidl/protos/flyteidl/plugins/tensorflow.proto

flyteplugins/go/tasks/plugins/k8s/kfoperators/common/common_operator_test.go


pingsutw

LGTM, thanks

Future-Outlier · 2023-10-11T13:06:46Z

I asked LinkedIn software engineer @yubofredwang about the PR, and he said that it is great!

support evaluator in kubeflow tensorflow training job

ec31fde

Signed-off-by: Future Outlier <[email protected]>

Future-Outlier mentioned this pull request Oct 4, 2023

Kubeflow TensorFlow Training Operator Add Evaluator flyteorg/flytekit#1870

Merged

8 tasks

Future Outlier and others added 2 commits October 4, 2023 18:01

others

6eb6d92

Signed-off-by: Future Outlier <[email protected]>

Merge branch 'flyteorg:master' into kf-operator-evaluator

f4b4a82

Future Outlier added 2 commits October 6, 2023 11:32

Merge branch 'master' of https://github.com/Future-Outlier/flyte into…

bcb1e74

… kf-operator-evaluator Signed-off-by: Future Outlier <[email protected]>

new flyteidl

172031c

Signed-off-by: Future Outlier <[email protected]>

eapolinario reviewed Oct 6, 2023


Future Outlier added 4 commits October 7, 2023 12:57

Merge branch 'master' of https://github.com/Future-Outlier/flyte into…

30426cf

… kf-operator-evaluator

add evaluator proto with correct order ;gpsh;

647b8f4

Signed-off-by: Future Outlier <[email protected]>

Merge branch 'master' of https://github.com/Future-Outlier/flyte into…

a11c7f5

… kf-operator-evaluator

fix generate bugs and time zone error

e3d022a

Signed-off-by: Future Outlier <[email protected]>

Future-Outlier requested a review from eapolinario October 8, 2023 15:47

pingsutw approved these changes Oct 10, 2023


pingsutw merged commit 26228bd into flyteorg:master Oct 11, 2023
40 checks passed

Future-Outlier mentioned this pull request Nov 6, 2023

[Housekeeping] Flyte and Flytekit both of there pull request template should add more sections. #4366

Closed

2 tasks
