[Bug] Code runs in sequence rather than parallel on RayJobs #1964

Closed · mbzomowski opened this issue Mar 5, 2024 · 2 comments
Labels: bug (Something isn't working), P1 (Issue that should be fixed within a few weeks), tpu

Comments


mbzomowski commented Mar 5, 2024

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

I am attempting to run a simple Python script through a RayJob on a GKE cluster with a multi-host v5e TPU node pool. Instead of running in parallel across the hosts, the code runs sequentially.

This appears to be an issue only with RayJobs: I ran the exact same code on a RayCluster backed by an identical multi-host v5e TPU node pool, and there it executed correctly, in parallel. I also made sure the rayClusterSpec in the RayJob exactly matched the RayCluster configuration.

Reproduction script

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
  labels:
    app.kubernetes.io/name: kuberay
spec:
  entrypoint: python3 /home/ray/samples/sample_code.py
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 0
  rayClusterSpec:
    rayVersion: '2.6.1' # should match the Ray version in the image of the containers
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
        block: 'true'
      template:
        metadata:
          labels:
            cloud.google.com/gke-ray-node-type: head
            app.kubernetes.io/name: kuberay
        spec:
          containers:
            - name: ray-head
              image: gcr.io/tpu-vm-gke-testing/bzmarke-ray:latest # rayproject/ray:2.9.3-py311
              resources:
                limits:
                  cpu: "8"
                  memory: "40G"
                  ephemeral-storage: 2Gi
                requests:
                  cpu: "8"
                  memory: "40G"
                  ephemeral-storage: 2Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            - name: code-sample
              configMap:
                name: ray-job-code-sample
                items:
                  - key: sample_code.py
                    path: sample_code.py
    workerGroupSpecs:
      - replicas: 2
        minReplicas: 2
        maxReplicas: 2
        groupName: small-group
        rayStartParams:
          block: 'true'
          resources: '"{\"TPU\": 4}"'
        template:
          metadata:
            labels:
              cloud.google.com/gke-ray-node-type: worker
              app.kubernetes.io/name: kuberay
          spec:
            containers:
              - name: ray-worker
                image: gcr.io/tpu-vm-gke-testing/bzmarke-ray:latest
                lifecycle:
                  preStop:
                    exec:
                      command: [ "/bin/sh","-c","ray stop" ]
                resources:
                  limits:
                    google.com/tpu: "4"
                    memory: "40G"
                    cpu: "1"
                    ephemeral-storage: 20Gi
                  requests:
                    google.com/tpu: "4"
                    memory: "40G"
                    cpu: "1"
                    ephemeral-storage: 20Gi
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
              cloud.google.com/gke-placement-group: "tpu-nodepool-multihost"
######################Ray code sample#################################
# this sample is from https://docs.ray.io/en/latest/cluster/job-submission.html#quick-start-example
# it is mounted into the container and executed to show the Ray job at work
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    ray.init()
    import os

    @ray.remote
    def hello_world():
        print("started hello_world")
        import socket
        print(socket.gethostname())
        print(os.environ)
        import time
        time.sleep(30)

    num_workers = 2
    result = [hello_world.remote() for _ in range(num_workers)]
    print(ray.get(result))
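
For what it's worth, a small variant of the sample task (sketched below, not part of the original reproduction) returns each task's hostname instead of just printing it, which makes it easier to confirm whether the two tasks actually landed on different TPU hosts:

    import socket
    import time

    import ray

    ray.init()

    @ray.remote
    def which_host():
        # Mirror the 30-second sleep from the sample above.
        time.sleep(30)
        return socket.gethostname()

    # Launch one task per expected worker host and collect the hostnames.
    hostnames = ray.get([which_host.remote() for _ in range(2)])
    print(hostnames)
    # Two distinct hostnames mean the tasks were spread across both hosts.
    print("distinct hosts:", len(set(hostnames)))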

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
mbzomowski added the bug and triage labels on Mar 5, 2024
kevin85421 (Member) commented

cc @richardsliu @ryanaoleary for the TPU questions

kevin85421 added the tpu and P1 labels and removed the triage label on Mar 7, 2024
mbzomowski (Author) commented

Solved by:

(1) setting workerGroupSpecs to:

    workerGroupSpecs:
      - replicas: 1
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 2
...

(2) using kuberay-tpu-webhook from this PR: GoogleCloudPlatform/ai-on-gke#180

(3) switching to kuberay-operator v1.1.0-rc.0

(4) switching to images that used Ray v2.9.3

The job was running in parallel, but JAX was having trouble communicating across the TPU hosts due to the lack of a -tpu-worker-svc Kubernetes Service.
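
Put together, a corrected worker group for the multi-host slice looks roughly like the sketch below. This is not copied from a working manifest; it re-applies the fields from the reproduction above with the changes from steps (1) and (4), trims the labels, lifecycle hook, and CPU/memory/storage requests for brevity, and uses rayproject/ray:2.9.3-py311 as a stand-in for any Ray 2.9.3 image:

    workerGroupSpecs:
      - replicas: 1        # one replica per TPU slice
        minReplicas: 1
        maxReplicas: 1
        numOfHosts: 2      # one worker pod per host in the 2x4 slice
        groupName: small-group
        rayStartParams:
          block: 'true'
          resources: '"{\"TPU\": 4}"'
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.9.3-py311
                resources:
                  limits:
                    google.com/tpu: "4"
                  requests:
                    google.com/tpu: "4"
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4

As far as I can tell, numOfHosts is only honored by newer operator releases, which is why step (3) is needed, and the webhook from step (2) appears to be what provides the per-slice -tpu-worker-svc Service that JAX relies on.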
