Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Update files with respect to common ReplicaSpec refactor #2424

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

MortalHappiness
Copy link
Member

@MortalHappiness MortalHappiness commented May 16, 2024

Tracking issue

Resolves: flyteorg/flyte#4408

Why are the changes needed?

flyteorg/flyte#5355 changes protobuf files, so we need to update the corresponding files in flytekit.

What changes were proposed in this pull request?

Update files with respect to common ReplicaSpec refactor.

How was this patch tested?

Setup process

In flyte repo

  1. Checkout [Feature] Refactor distributed job using common ReplicaSpec flyte#5355
  2. make compile
  3. flytectl demo start --dev
  4. kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"
  5. POD_NAMESPACE=flyte ./flyte start --config kubeflow.yaml

where kubeflow.yaml is

# This is a sample configuration file for running single-binary Flyte locally against
# a sandbox.
admin:
  # This endpoint is used by flytepropeller to talk to admin
  # and artifacts to talk to admin,
  # and _also_, admin to talk to artifacts
  endpoint: localhost:30080
  insecure: true

catalog-cache:
  endpoint: localhost:8081
  insecure: true
  type: datacatalog

cluster_resources:
  standaloneDeployment: false
  templatePath: $HOME/.flyte/sandbox/cluster-resource-templates

logger:
  show-source: true
  level: 5

propeller:
  create-flyteworkflow-crd: true
  kube-config: $HOME/.flyte/sandbox/kubeconfig
  rawoutput-prefix: s3://my-s3-bucket/data

server:
  kube-config: $HOME/.flyte/sandbox/kubeconfig

webhook:
  certDir: $HOME/.flyte/webhook-certs
  localCert: true
  secretName: flyte-sandbox-webhook-secret
  serviceName: flyte-sandbox-local
  servicePort: 9443

tasks:
  task-plugins:
    enabled-plugins:
      - container
      - sidecar
      - K8S-ARRAY
      #- pytorch
      - tensorflow
      #- mpi
    default-for-task-types:
      - container: container
      - container_array: K8S-ARRAY
      - sidecar: sidecar
      #- pytorch: pytorch
      - tensorflow: tensorflow
      #- mpi: mpi
    fallback-to-container-handler: false

plugins:
  logs:
    kubernetes-enabled: true
    kubernetes-template-uri: http://localhost:30080/kubernetes-dashboard/#/log/{{.namespace }}/{{ .podName }}/pod?namespace={{ .namespace }}
    cloudwatch-enabled: false
    stackdriver-enabled: false
  k8s:
    image-pull-policy: Always
    default-env-vars:
      - FLYTE_AWS_ENDPOINT: http://flyte-sandbox-minio.flyte:9000
      - FLYTE_AWS_ACCESS_KEY_ID: minio
      - FLYTE_AWS_SECRET_ACCESS_KEY: miniostorage
  k8s-array:
    logs:
      config:
        kubernetes-enabled: true
        kubernetes-template-uri: http://localhost:30080/kubernetes-dashboard/#/log/{{.namespace }}/{{ .podName }}/pod?namespace={{ .namespace }}
        cloudwatch-enabled: false
        stackdriver-enabled: false

database:
  postgres:
    username: postgres
    password: postgres
    host: 127.0.0.1
    port: 30001
    dbname: flyte
    options: "sslmode=disable"
storage:
  type: stow
  stow:
    kind: s3
    config:
      region: us-east-1
      disable_ssl: true
      v2_signing: true
      endpoint: http://localhost:30002
      auth_type: accesskey
      access_key_id: minio
      secret_key: miniostorage
  container: my-s3-bucket

task_resources:
  defaults:
    cpu: 2
    memory: 1Gi
  limits:
    cpu: 4
    memory: 4Gi

In the parent folder of flyte and flytekit repo

  1. Create Dockerfile
FROM python:3.11-slim-bookworm as builder

WORKDIR /root
ENV PYTHONPATH /root

# Install build dependencies
RUN apt update \
    && apt install build-essential git wget -y \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Copy necessary directories
COPY flyte /flyte
COPY flytekit /flytekit

# Install Python packages (Order is important!)
RUN pip install --no-cache-dir /flytekit/plugins/flytekit-kf-tensorflow \
    && pip install --no-cache-dir /flytekit \
    && pip install --no-cache-dir /flyte/flyteidl
  1. Run docker buildx build -t localhost:30000/flytekit:dev --file Dockerfile --push .

In an arbitrary folder

  1. Create kubeflow_tf_evaluator.py
from flytekitplugins.kftensorflow import PS, Chief, Evaluator, TfJob, Worker

from flytekit import Resources, task

task_config = TfJob(
    worker=Worker(replicas=2),
    chief=Chief(replicas=1),
    ps=PS(replicas=1),
    evaluator=Evaluator(replicas=1),
)


@task(
    task_config=task_config,
    requests=Resources(cpu="1"),
)
def my_tensorflow_task(x: int, y: str) -> str:
    return f"{x=}, {y=}"
  1. Run pyflyte run --remote --image localhost:30000/flytekit:dev kubeflow_tf_evaluator.py my_tensorflow_task --x 100 --y acc

Test backward compatibility

  1. Checkout to master branch in flytekit repo.
  2. Rebuild docker image with docker buildx build -t localhost:30000/flytekit:dev --file Dockerfile --push . (In the parent folder of flyte and flytekit folder)
  3. Run pyflyte run --remote --image localhost:30000/flytekit:dev kubeflow_tf_evaluator.py my_tensorflow_task --x 100 --y acc

Screenshots

Note that the worker replica is 2.

image

image

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

flyteorg/flyte#5355

Docs link

@Future-Outlier
Copy link
Member

This is a PR for you to test evaluator.
#1870

@MortalHappiness MortalHappiness force-pushed the feature/#4408-refactor-replica-spec branch 2 times, most recently from 4a076c6 to 94e8e43 Compare May 17, 2024 04:51
@MortalHappiness MortalHappiness force-pushed the feature/#4408-refactor-replica-spec branch from 94e8e43 to 7b5ea22 Compare May 25, 2024 15:52
@MortalHappiness MortalHappiness marked this pull request as ready for review May 25, 2024 16:03
@MortalHappiness MortalHappiness marked this pull request as draft May 25, 2024 16:05
@MortalHappiness MortalHappiness marked this pull request as ready for review May 25, 2024 16:44
@MortalHappiness MortalHappiness force-pushed the feature/#4408-refactor-replica-spec branch 2 times, most recently from df30c4f to 5e21058 Compare May 26, 2024 10:03
Copy link

codecov bot commented May 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.31%. Comparing base (69445ff) to head (cd6232a).
Report is 4 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2424       +/-   ##
===========================================
- Coverage   79.24%   58.31%   -20.94%     
===========================================
  Files         196      250       +54     
  Lines       19785    22092     +2307     
  Branches     4008     4006        -2     
===========================================
- Hits        15678    12882     -2796     
- Misses       3407     8700     +5293     
+ Partials      700      510      -190     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@eapolinario
Copy link
Collaborator

@MortalHappiness , can you merge master to get rid of the dbt failures?

In order to fix the CI failures for kf-mpi and kf-tensorflow you'll need to add a line similar to this to https://github.com/flyteorg/flytekit/blob/master/plugins/flytekit-kf-pytorch/dev-requirements.in and create the corresponding dev-requirements.in in https://github.com/flyteorg/flytekit/tree/master/plugins/flytekit-kf-tensorflow.

@MortalHappiness MortalHappiness force-pushed the feature/#4408-refactor-replica-spec branch 2 times, most recently from e2d3a4b to d79a6f1 Compare June 14, 2024 05:11
@MortalHappiness MortalHappiness force-pushed the feature/#4408-refactor-replica-spec branch 2 times, most recently from 426f72c to 443f8b1 Compare June 14, 2024 08:46
pingsutw
pingsutw previously approved these changes Jun 14, 2024
@MortalHappiness MortalHappiness force-pushed the feature/#4408-refactor-replica-spec branch from d263f73 to 5f38258 Compare June 16, 2024 03:35
@MortalHappiness MortalHappiness force-pushed the feature/#4408-refactor-replica-spec branch from 5f38258 to a972abb Compare July 8, 2024 10:44
Comment on lines +468 to 470
common=plugins_common.CommonReplicaSpec(replicas=self.max_nodes),
# The following fields are deprecated. They are kept for backwards compatibility.
replicas=self.max_nodes,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the replica arg stay duplicated here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean? To ensure backward-compatibility, both replicas and common.replicas need to be sent to the backend.

Copy link
Member

@fg91 fg91 Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the respective backend plugin we already distinguish between taskTemplate.TaskTypeVersion == 0/1, see here. We'll have to add backwards compatibility for the refactoring in this PR there as well, right?

I wonder whether removing the duplication in the proto definitions is worth having to check for backwards compatibility in flytekit and flyteplugins. While it might have been better to share the replica spec from the beginning, maybe it's now better to leave as is?

Happy to be convinced otherwise!! 🙏

@pingsutw fyi, let's maybe discuss in the contrib sync?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fg91 @pingsutw Have you finished the discussion in the contrib sync? What do you think in the end?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fg91 We have duplicated replica here to maintain backward compatibility. If people only upgrade flytekit and don't upgrade flyte backend, it should still work, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you test it? If it's duplicated, it should but let's definitely test it.

My personal opinion is that maybe we should have used a shared replica in the first place but duplicating the replica now feels more cluttered than leaving separate replicas.

Copy link
Member

@fg91 fg91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's please explicitly discuss this before we decide to merge.

Signed-off-by: Kevin Su <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Core feature] Refactor distributed job using common ReplicaSpec
5 participants