[Bug] Code runs in sequence rather than parallel on RayJobs #1964

Closed · mbzomowski opened this issue Mar 5, 2024 · 2 comments
Labels: bug (Something isn't working), P1 (Issue that should be fixed within a few weeks), tpu

Comments


mbzomowski commented Mar 5, 2024

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

I am attempting to run a simple Python script through a RayJob on a GKE cluster with a multi-host v5e TPU node pool. Instead of running in parallel across the hosts, the code runs sequentially.

This appears to be an issue only with RayJobs: I ran the exact same code on a RayCluster backed by an identical multi-host v5e TPU node pool, and there it executed correctly, in parallel. I also made sure the rayClusterSpec in the RayJob exactly matched the RayCluster configuration.

Reproduction script

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
  labels:
    app.kubernetes.io/name: kuberay
spec:
  entrypoint: python3 /home/ray/samples/sample_code.py
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 0
  rayClusterSpec:
    rayVersion: '2.6.1' # should match the Ray version in the image of the containers
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
        block: 'true'
      template:
        metadata:
          labels:
            cloud.google.com/gke-ray-node-type: head
            app.kubernetes.io/name: kuberay
        spec:
          containers:
            - name: ray-head
              image: gcr.io/tpu-vm-gke-testing/bzmarke-ray:latest # rayproject/ray:2.9.3-py311
              resources:
                limits:
                  cpu: "8"
                  memory: "40G"
                  ephemeral-storage: 2Gi
                requests:
                  cpu: "8"
                  memory: "40G"
                  ephemeral-storage: 2Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            - name: code-sample
              configMap:
                name: ray-job-code-sample
                items:
                  - key: sample_code.py
                    path: sample_code.py
    workerGroupSpecs:
      - replicas: 2
        minReplicas: 2
        maxReplicas: 2
        groupName: small-group
        rayStartParams:
          block: 'true'
          resources: '"{\"TPU\": 4}"'
        template:
          metadata:
            labels:
              cloud.google.com/gke-ray-node-type: worker
              app.kubernetes.io/name: kuberay
          spec:
            containers:
              - name: ray-worker
                image: gcr.io/tpu-vm-gke-testing/bzmarke-ray:latest
                lifecycle:
                  preStop:
                    exec:
                      command: [ "/bin/sh","-c","ray stop" ]
                resources:
                  limits:
                    google.com/tpu: "4"
                    memory: "40G"
                    cpu: "1"
                    ephemeral-storage: 20Gi
                  requests:
                    google.com/tpu: "4"
                    memory: "40G"
                    cpu: "1"
                    ephemeral-storage: 20Gi
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
              cloud.google.com/gke-placement-group: "tpu-nodepool-multihost"
######################Ray code sample#################################
# this sample is from https://docs.ray.io/en/latest/cluster/job-submission.html#quick-start-example
# it is mounted into the container and executed to show the Ray job at work
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    ray.init()
    import os

    @ray.remote
    def hello_world():
        print("started hello_world")
        import socket
        print(socket.gethostname())
        print(os.environ)
        import time
        time.sleep(30)

    num_workers = 2
    result = [hello_world.remote() for _ in range(num_workers)]
    print(ray.get(result))
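
For what it's worth, a small variant of the sample task (sketched below, not part of the original reproduction) returns each task's hostname instead of just printing it, which makes it easier to confirm whether the two tasks actually landed on different TPU hosts:

    import socket
    import time

    import ray

    ray.init()

    @ray.remote
    def which_host():
        # Mirror the 30-second sleep from the sample above.
        time.sleep(30)
        return socket.gethostname()

    # Launch one task per expected worker host and collect the hostnames.
    hostnames = ray.get([which_host.remote() for _ in range(2)])
    print(hostnames)
    # Two distinct hostnames mean the tasks were spread across both hosts.
    print("distinct hosts:", len(set(hostnames)))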

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
mbzomowski added the bug and triage labels on Mar 5, 2024
kevin85421 (Member) commented

cc @richardsliu @ryanaoleary for the TPU questions

kevin85421 added the tpu and P1 labels and removed the triage label on Mar 7, 2024
mbzomowski (Author) commented

Solved by:

(1) setting workerGroupSpecs to:

    workerGroupSpecs:
      - replicas: 1
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 2
...

(2) using kuberay-tpu-webhook from this PR: GoogleCloudPlatform/ai-on-gke#180

(3) switching to kuberay-operator v1.1.0-rc.0

(4) switching to images that used Ray v2.9.3

The job was running in parallel, but JAX was having trouble communicating across the TPU hosts due to the lack of a -tpu-worker-svc Kubernetes Service.
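
Put together, a corrected worker group for the multi-host slice looks roughly like the sketch below. This is not copied from a working manifest; it re-applies the fields from the reproduction above with the changes from steps (1) and (4), trims the labels, lifecycle hook, and CPU/memory/storage requests for brevity, and uses rayproject/ray:2.9.3-py311 as a stand-in for any Ray 2.9.3 image:

    workerGroupSpecs:
      - replicas: 1        # one replica per TPU slice
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 2      # one worker pod per host in the 2x4 slice
        groupName: small-group
        rayStartParams:
          block: 'true'
          resources: '"{\"TPU\": 4}"'
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.9.3-py311
                resources:
                  limits:
                    google.com/tpu: "4"
                  requests:
                    google.com/tpu: "4"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4

As far as I can tell, numOfHosts is only honored by newer operator releases, which is why step (3) is needed, and the webhook from step (2) appears to be what provides the per-slice -tpu-worker-svc Service that JAX relies on.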
