Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Evaluator in Kubeflow TensorFlow Training Operator #4168

Merged
merged 9 commits into from
Oct 11, 2023

Conversation

Future-Outlier
Copy link
Member

@Future-Outlier Future-Outlier commented Oct 4, 2023

Describe your changes

Enable running a data service by utilizing the evaluator section in the TF_CONFIG to configure data service worker information, as discussed in this Slack conversation.

The use case previously doesn't include the evaluator section, so we have to give it a default value so that we can take the case into account.

I test it in two ways, by specifying the Dockerfile or using ImageSpec.

Dockerfile

FROM python:3.9-slim-buster
USER root
WORKDIR /root
ENV PYTHONPATH /root
RUN apt-get update && apt-get install build-essential -y
RUN apt-get install git -y
# The following line is an example of how to install your modified plugins. In this case, it demonstrates how to install the 'deck' plugin.
# RUN pip install -U git+https://github.com/Yicheng-Lu-llll/flytekit.git@"demo#egg=flytekitplugins-deck-standard&subdirectory=plugins/flytekit-deck-standard" # replace with your own repo and branch
RUN pip install -U git+https://github.com/Future-Outlier/flytekit.git@98ddd542a02551a9a9eb122b98004d0d092abbe9#subdirectory=plugins/flytekit-kf-tensorflow

RUN pip install -U git+https://github.com/Future-Outlier/flyte.git@647b8f4eeeab1a65866d19fab13c416ed0e4a07f#subdirectory=flyteidl

RUN pip install -U git+https://github.com/Future-Outlier/flytekit.git@98ddd542a02551a9a9eb122b98004d0d092abbe9

Use the code below

from flytekit import ImageSpec, Resources, task
from flytekit.configuration import Image, ImageConfig, SerializationSettings
from flytekitplugins.kftensorflow import (PS, Chief, CleanPodPolicy, Evaluator,
                                          RestartPolicy, RunPolicy, TfJob,
                                          Worker)

task_config = TfJob(
    worker=Worker(replicas=1),
    chief=Chief(replicas=1),
    ps=PS(replicas=1),
    evaluator=Evaluator(replicas=1),
)


@task(
    task_config=task_config,
    cache=True,
    requests=Resources(cpu="1"),
    cache_version="1",
)
def my_tensorflow_task(x: int, y: str) -> int:
    return x


if __name__ == "__main__":
    print(my_tensorflow_task(x=10, y="hello"))

Run it to flyte-console by this command

pyflyte run --remote --image futureoutlier/kubeflow:tfoperator-v2 \
kubeflow_tf_evaluator.py my_tensorflow_task --x 100 --y acc

ImageSpec

from flytekit import ImageSpec, Resources, task
from flytekit.configuration import Image, ImageConfig, SerializationSettings
from flytekitplugins.kftensorflow import (PS, Chief, CleanPodPolicy, Evaluator,
                                          RestartPolicy, RunPolicy, TfJob,
                                          Worker)

kubeflow_plugin = "git+https://github.com/Future-Outlier/flytekit.git@98ddd542a02551a9a9eb122b98004d0d092abbe9#subdirectory=plugins/flytekit-kf-tensorflow"
kubeflow_idl = "git+https://github.com/Future-Outlier/flyte.git@e3d022ae86466632f0b8eeae80bc07441827e403#subdirectory=flyteidl"
flytekit = "git+https://github.com/Future-Outlier/flytekit.git@98ddd542a02551a9a9eb122b98004d0d092abbe9"

# base_image="futureoutlier/kubeflow:tfoperator-v2"
image_spec = ImageSpec(
    packages=[flytekit, kubeflow_idl, kubeflow_plugin],
    apt_packages=["git"],
    registry="futureoutlier",
)
# build-essential git
task_config = TfJob(
    worker=Worker(replicas=1),
    chief=Chief(replicas=1),
    ps=PS(replicas=1),
    evaluator=Evaluator(replicas=1),
)


@task(
    task_config=task_config,
    cache=True,
    requests=Resources(cpu="1"),
    cache_version="1",
    container_image=image_spec,
)
def my_tensorflow_task(x: int, y: str) -> int:
    return x


if __name__ == "__main__":
    print(my_tensorflow_task(x=10, y="hello"))
pyflyte run --remote kubeflow_tf_evaluator.py my_tensorflow_task --x 20231008 --y AMAZING

Screenshot

Dockerfile

image

ImageSpec

image

Kubeflow Training Operator Pods

image

Tracking issue

#4167
flyteorg/flytekit#1870

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Future Outlier and others added 2 commits October 4, 2023 18:01
@codecov
Copy link

codecov bot commented Oct 4, 2023

Codecov Report

Attention: 11 lines in your changes are missing coverage. Please review.

Comparison is base (a18da03) 59.00% compared to head (755a60e) 59.98%.
Report is 2 commits behind head on master.

❗ Current head 755a60e differs from pull request most recent head e3d022a. Consider uploading reports for the commit e3d022a to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4168      +/-   ##
==========================================
+ Coverage   59.00%   59.98%   +0.98%     
==========================================
  Files         619      534      -85     
  Lines       52827    39205   -13622     
==========================================
- Hits        31170    23519    -7651     
+ Misses      19173    13398    -5775     
+ Partials     2484     2288     -196     
Flag Coverage Δ
unittests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...lugins/go/tasks/plugins/k8s/kfoperators/mpi/mpi.go 74.49% <100.00%> (+2.51%) ⬆️
...o/tasks/plugins/k8s/kfoperators/pytorch/pytorch.go 79.22% <100.00%> (+1.32%) ⬆️
...s/plugins/k8s/kfoperators/tensorflow/tensorflow.go 78.69% <90.47%> (+3.69%) ⬆️
.../plugins/k8s/kfoperators/common/common_operator.go 64.55% <10.00%> (+1.42%) ⬆️

... and 573 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Future Outlier added 2 commits October 6, 2023 11:32
… kf-operator-evaluator

Signed-off-by: Future Outlier <[email protected]>
Signed-off-by: Future Outlier <[email protected]>
Comment on lines 22 to 27
DistributedTensorflowTrainingReplicaSpec evaluator_replicas = 4;

// RunPolicy encapsulates various runtime policies of the distributed training
// job, for example how to clean up resources and how long the job can stay
// active.
RunPolicy run_policy = 4;
RunPolicy run_policy = 5;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not change the type of an existing field in a protobuf message as per https://protobuf.dev/programming-guides/dos-donts/ as that breaks backwards compatibility.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks really much!
This is pretty useful information, thanks a lot for your time.
I will improve it.

flyteidl/protos/flyteidl/plugins/tensorflow.proto Outdated Show resolved Hide resolved
Copy link
Member

@pingsutw pingsutw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks

@Future-Outlier
Copy link
Member Author

I've asked Linkedin software engineer @yubofredwang about the PR, he said that it is great!

@pingsutw pingsutw merged commit 26228bd into flyteorg:master Oct 11, 2023
40 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants