This repository contains a Terraform template for running Ray on Google Kubernetes Engine.
We've also included some example notebooks (`applications/ray/example_notebooks`), including one that serves a GPT-J-6B model with Ray AIR (see the Ray documentation for the original notebook).
This module assumes you already have a functional GKE cluster. If not, follow the instructions under `infrastructure/README.md` to install a Standard or Autopilot GKE cluster, then follow the instructions in this module to install Ray.
This module deploys the following, once per user:
- User namespace
- Kubernetes service accounts
- KubeRay cluster
- Prometheus monitoring
- Logging container
Preinstall the following on your computer:
- Terraform
- gcloud CLI
NOTE: Terraform keeps state metadata in a local file called `terraform.tfstate`. Deleting this file may cause some resources not to be cleaned up correctly even if you delete the cluster. We suggest running `terraform destroy` before reapplying/reinstalling.
- If needed, clone the repo: `git clone https://github.com/GoogleCloudPlatform/ai-on-gke`
- `cd applications/ray`
- Find the name and location of the GKE cluster you want to use. Run `gcloud container clusters list --project=<your GCP project>` to see all the available clusters. Note: If you created the GKE cluster via the infrastructure repo, you can get the cluster info from `platform.tfvars`.
- Edit `workloads.tfvars` with your environment-specific variables and configurations.
- Run `terraform init && terraform apply --var-file workloads.tfvars`
- To connect to the remote GKE cluster with the Ray API, set up the Ray dashboard. Run the following command to port-forward:
  `kubectl port-forward -n <namespace> service/example-cluster-kuberay-head-svc 8265:8265`
  Then open the dashboard at http://localhost:8265
- Set the RAY_ADDRESS environment variable: `export RAY_ADDRESS="http://127.0.0.1:8265"`
- Create a working directory containing a job file `ray_job.py` (a minimal sketch is shown after this list).
- Submit the job:
  `ray job submit --working-dir %YOUR_WORKING_DIRECTORY% -- python ray_job.py`
- Note the job submission ID from the output, e.g.:
  `Job 'raysubmit_inB2ViQuE29aZRJ5' succeeded`
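For reference, here is a minimal sketch of what `ray_job.py` could contain. The task body and values are placeholders for your own workload, not part of this module:

```python
# ray_job.py -- minimal placeholder job; the function and values are illustrative only.
import ray

# When launched via `ray job submit`, this connects to the cluster running the job.
ray.init()

@ray.remote
def square(x: int) -> int:
    return x * x

# Fan out a few tasks across the cluster and collect the results.
print(ray.get([square.remote(i) for i in range(10)]))
```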
See Ray docs for more info.
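If you prefer to submit jobs programmatically instead of using the `ray job` CLI, a sketch using Ray's `JobSubmissionClient` is shown below. It assumes the port-forward from the earlier step is still running and that `./your_working_directory` (a placeholder) contains `ray_job.py`:

```python
from ray.job_submission import JobSubmissionClient

# Points at the port-forwarded Ray dashboard from the earlier step.
client = JobSubmissionClient("http://127.0.0.1:8265")

# "./your_working_directory" is a placeholder; it should contain ray_job.py.
job_id = client.submit_job(
    entrypoint="python ray_job.py",
    runtime_env={"working_dir": "./your_working_directory"},
)
print(f"Submitted job: {job_id}")
print(client.get_job_status(job_id))
print(client.get_job_logs(job_id))
```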
If you want to connect to the Ray cluster via a Jupyter notebook or to try the example notebooks in the repo, please first install JupyterHub via `applications/jupyter/README.md`.
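Once JupyterHub is installed, one way to connect from a notebook running inside the cluster is the Ray Client. The sketch below assumes the head service name used in this module (`example-cluster-kuberay-head-svc`) and the default Ray Client port (10001); adjust the namespace and service name for your deployment:

```python
import ray

# Replace <namespace> with your user namespace; 10001 is the default Ray Client port.
ray.init("ray://example-cluster-kuberay-head-svc.<namespace>.svc.cluster.local:10001")

# Quick sanity check that the connection works.
print(ray.cluster_resources())
```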
This repository comes with out-of-the-box integrations with Google Cloud Logging and Managed Prometheus for monitoring. To see your Ray cluster logs:
- Open Cloud Console and open Logging
- If using Jupyter notebook for job submission, use the following query parameters:
  resource.type="k8s_container"
  resource.labels.cluster_name=%CLUSTER_NAME%
  resource.labels.pod_name=%RAY_HEAD_POD_NAME%
  resource.labels.container_name="fluentbit"
- If using Ray Jobs API:
  (a) Note the job ID returned by the `ray job submit` API. E.g.:
      Job submission: `ray job submit --working-dir %YOUR_WORKING_DIRECTORY% -- python script.py`
      Job submission ID: `Job 'raysubmit_kFWB6VkfyqK1CbEV' submitted successfully`
  (b) Get the namespace name from `user/variables.tf` or `kubectl get namespaces`.
  (c) Use the following query to search for the job logs:
      resource.labels.namespace_name=%NAMESPACE_NAME%
      jsonPayload.job_id=%RAY_JOB_ID%
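If you'd rather fetch these logs from a script than the Console, a sketch using the `google-cloud-logging` Python client is shown below. The project ID, namespace, and job ID values are placeholders you'd substitute yourself:

```python
from google.cloud import logging

# Placeholders: substitute your own project, namespace, and Ray job ID.
PROJECT_ID = "your-gcp-project"
NAMESPACE = "your-namespace"
RAY_JOB_ID = "raysubmit_kFWB6VkfyqK1CbEV"

client = logging.Client(project=PROJECT_ID)

# Same filter as the Console query above.
log_filter = (
    f'resource.labels.namespace_name="{NAMESPACE}" '
    f'jsonPayload.job_id="{RAY_JOB_ID}"'
)

for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    print(entry.payload)
```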
To see monitoring metrics:
- Open Cloud Console and open Metrics Explorer
- In "Target", select "Prometheus Target" and then "Ray".
- Select the metric you want to view, and then click "Apply".
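These metrics can also be pulled programmatically. The sketch below uses the `google-cloud-monitoring` client and assumes Ray metrics are ingested by Managed Prometheus under metric types prefixed with `prometheus.googleapis.com/ray_`; the project ID is a placeholder:

```python
import time
from google.cloud import monitoring_v3

PROJECT_ID = "your-gcp-project"  # placeholder

client = monitoring_v3.MetricServiceClient()

# Look at the last hour of data.
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

# Assumption: Ray metrics land under the prometheus.googleapis.com/ray_ prefix.
results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": 'metric.type = starts_with("prometheus.googleapis.com/ray_")',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    print(series.metric.type, len(series.points), "points")
```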