Search before asking
I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
I am attempting to run a simple Python script through a RayJob on a GKE cluster with a multi-host v5e TPU node pool. Instead of running in parallel across the hosts, the code runs sequentially.
This appears to be an issue only with RayJobs: I ran the exact same code on a RayCluster backed by an identical multi-host v5e TPU node pool, and it executed correctly, in parallel. I also ensured that the rayClusterSpec in the RayJob exactly matched the RayCluster configuration.
Reproduction script

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample
  labels:
    app.kubernetes.io/name: kuberay
spec:
  entrypoint: python3 /home/ray/samples/sample_code.py
  shutdownAfterJobFinishes: true
  ttlSecondsAfterFinished: 0
  rayClusterSpec:
    rayVersion: '2.6.1' # should match the Ray version in the image of the containers
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
        block: 'true'
      template:
        metadata:
          labels:
            cloud.google.com/gke-ray-node-type: head
            app.kubernetes.io/name: kuberay
        spec:
          containers:
            - name: ray-head
              image: gcr.io/tpu-vm-gke-testing/bzmarke-ray:latest # rayproject/ray:2.9.3-py311
              resources:
                limits:
                  cpu: "8"
                  memory: "40G"
                  ephemeral-storage: 2Gi
                requests:
                  cpu: "8"
                  memory: "40G"
                  ephemeral-storage: 2Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265 # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
              volumeMounts:
                - mountPath: /home/ray/samples
                  name: code-sample
          volumes:
            - name: code-sample
              configMap:
                name: ray-job-code-sample
                items:
                  - key: sample_code.py
                    path: sample_code.py
    workerGroupSpecs:
      - replicas: 2
        minReplicas: 2
        maxReplicas: 2
        groupName: small-group
        rayStartParams:
          block: 'true'
          resources: '"{\"TPU\": 4}"'
        template:
          metadata:
            labels:
              cloud.google.com/gke-ray-node-type: worker
              app.kubernetes.io/name: kuberay
          spec:
            containers:
              - name: ray-worker
                image: gcr.io/tpu-vm-gke-testing/bzmarke-ray:latest
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  limits:
                    google.com/tpu: "4"
                    memory: "40G"
                    cpu: "1"
                    ephemeral-storage: 20Gi
                  requests:
                    google.com/tpu: "4"
                    memory: "40G"
                    cpu: "1"
                    ephemeral-storage: 20Gi
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
              cloud.google.com/gke-placement-group: "tpu-nodepool-multihost"
######################Ray code sample#################################
# this sample is from https://docs.ray.io/en/latest/cluster/job-submission.html#quick-start-example
# it is mounted into the container and executed to show the Ray job at work
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    ray.init()
    import os

    @ray.remote
    def hello_world():
        print("started hello_world")
        import socket
        print(socket.gethostname())
        print(os.environ)
        import time
        time.sleep(30)

    num_workers = 2
    result = [hello_world.remote() for _ in range(num_workers)]
    print(ray.get(result))
```
Anything else
No response
Are you willing to submit a PR?
Yes I am willing to submit a PR!