Reorg in folders and added new README and LICENSE
Showing 60 changed files with 669 additions and 155 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,3 @@ | ||
|
||
Apache License | ||
Version 2.0, January 2004 | ||
http://www.apache.org/licenses/ | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,159 +1,13 @@ | ||
# Ray on GKE | ||
# Kubernetes Engine Samples | ||
|
||
This repository contains a Terraform template for running [Ray](https://www.ray.io/) on Google Kubernetes Engine. | ||
We've also included some example notebooks, including one that serves a GPT-J-6B model with Ray AIR (see | ||
[here](https://docs.ray.io/en/master/ray-air/examples/gptj_serving.html) for the original notebook). | ||
[![Open in Cloud Shell](https://gstatic.com/cloudssh/images/open-btn.svg)](https://ssh.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https://github.com/GoogleCloudPlatform/ai-on-gke&cloudshell_tutorial=README.md&cloudshell_workspace=hello-app) | ||
|
||
The solution is split into `platform` and `user` resources. | ||
This repository contains assets related to AI/ML workloads on | ||
[Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine/). | ||
|
||
Platform resources (deployed once): | ||
* GKE Cluster | ||
* Nvidia GPU drivers | ||
* Kuberay operator and CRDs | ||
## Important Note | ||
The use of the assets contained in this repository is subject to compliance with [Google's AI Principles](https://ai.google/responsibility/principles/). | ||
|
||
User resources (deployed once per user): | ||
* User namespace | ||
* Kubernetes service accounts | ||
* Kuberay cluster | ||
* Prometheus monitoring | ||
* Logging container | ||
* Jupyter notebook | ||
## Licensing | ||
|
||
## Installation | ||
|
||
Note: Terraform keeps state metadata in a local file called `terraform.tfstate`. | ||
If you need to reinstall any resources, make sure to delete this file as well. | ||
|
||
### Platform | ||
|
||
1. `cd platform` | ||
|
||
2. Edit `variables.tf` with your GCP settings. | ||
|
||
3. Run `terraform init` | ||
|
||
4. Run `terraform apply` | ||
|
||
### User | ||
|
||
1. `cd user` | ||
|
||
2. Edit `variables.tf` with your GCP settings. | ||
|
||
3. Run `terraform init` | ||
|
||
4. Get the GKE cluster name and location/region from `platform/variables.tf`. | ||
Run `gcloud container clusters get-credentials %gke_cluster_name% --location=%location%` | ||
If `gcloud` is not configured yet, see the [initialization instructions](https://cloud.google.com/sdk/docs/initializing). | ||
|
||
5. Run `terraform apply` | ||
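The credentials step above can also be scripted. The following is a minimal sketch, assuming the cluster name and location have already been read from `platform/variables.tf`; the function name and the sample values are hypothetical and only illustrate how the command from step 4 is assembled.

```python
import shlex


def get_credentials_command(cluster_name: str, location: str) -> list:
    """Build the `gcloud` command from step 4 as an argument list."""
    if not cluster_name or not location:
        raise ValueError("cluster_name and location must both be set")
    return [
        "gcloud", "container", "clusters", "get-credentials",
        cluster_name,
        f"--location={location}",
    ]


# Hypothetical values; substitute the ones from platform/variables.tf.
cmd = get_credentials_command("my-ray-cluster", "us-central1")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues.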
|
||
## Using Ray with Jupyter | ||
|
||
1. Run `kubectl get services -n <namespace>` | ||
|
||
2. Copy the external IP for the notebook. | ||
|
||
3. Open the external IP in a browser and log in. The default usernames and | ||
passwords can be found in the [Jupyter | ||
settings](https://github.com/richardsliu/ray-on-gke/blob/main/user/modules/jupyterhub/jupyterhub-values.yaml) file. | ||
|
||
4. The Ray cluster is available at `ray://example-cluster-kuberay-head-svc:10001`. To access the cluster, open one of the sample notebooks under `example_notebooks` (in the Jupyter notebook window, choose `File` -> `Open from URL` and enter the raw file URL from GitHub) and run through the example. Example URL: https://raw.githubusercontent.com/richardsliu/ray-on-gke/main/example_notebooks/gpt-j-online.ipynb | ||
|
||
5. To use the Ray dashboard, run the following command to port-forward: | ||
``` | ||
kubectl port-forward -n ray service/example-cluster-kuberay-head-svc 8265:8265 | ||
``` | ||
|
||
And then open the dashboard using the following URL: | ||
``` | ||
http://localhost:8265 | ||
``` | ||
|
||
## Using Ray with Ray Jobs API | ||
|
||
1. To connect to the remote GKE cluster with the Ray API, first set up access to the Ray dashboard. | ||
Run the following command to port-forward: | ||
``` | ||
kubectl port-forward -n ray service/example-cluster-kuberay-head-svc 8265:8265 | ||
``` | ||
|
||
And then open the dashboard using the following URL: | ||
``` | ||
http://localhost:8265 | ||
``` | ||
|
||
2. Set the RAY_ADDRESS environment variable: | ||
`export RAY_ADDRESS="http://127.0.0.1:8265"` | ||
|
||
3. Create a working directory with some job file `ray_job.py`. | ||
|
||
4. Submit the job: | ||
`ray job submit --working-dir %your_working_directory% -- python ray_job.py` | ||
|
||
5. Note the job submission ID from the output, e.g.: | ||
`Job 'raysubmit_inB2ViQuE29aZRJ5' succeeded` | ||
|
||
See [Ray docs](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/quickstart.html#submitting-a-job) for more info. | ||
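When the submission is scripted, the job submission ID from step 5 can be pulled out of the CLI output. This is a minimal sketch: the regular expression is inferred from the two sample output lines in this README (`raysubmit_` followed by letters and digits) and is an assumption, not a documented format.

```python
import re
from typing import Optional

# Assumption: submission IDs look like the samples in this README,
# i.e. "raysubmit_" followed by ASCII letters and digits.
_JOB_ID_PATTERN = re.compile(r"Job '(raysubmit_[A-Za-z0-9]+)'")


def extract_job_id(cli_output: str) -> Optional[str]:
    """Return the first submission ID found in `ray job submit` output."""
    match = _JOB_ID_PATTERN.search(cli_output)
    return match.group(1) if match else None


print(extract_job_id("Job 'raysubmit_inB2ViQuE29aZRJ5' succeeded"))
```

The function returns `None` when no ID is present, so callers can detect a failed or unexpected submission.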
|
||
## Securing Your Cluster Endpoints | ||
|
||
For demo purposes, this repo creates a public IP for the Jupyter notebook with basic dummy authentication. To secure your cluster, it is *strongly recommended* to replace | ||
this with your own secure endpoints. | ||
|
||
For more information, please take a look at the following links: | ||
* https://cloud.google.com/iap/docs/enabling-kubernetes-howto | ||
* https://cloud.google.com/endpoints/docs/openapi/get-started-kubernetes-engine | ||
* https://jupyterhub.readthedocs.io/en/stable/tutorial/getting-started/authenticators-users-basics.html | ||
|
||
|
||
## Running GPT-J-6B | ||
|
||
This example is adapted from Ray AIR's examples [here](https://docs.ray.io/en/master/ray-air/examples/gptj_serving.html). | ||
|
||
1. Open the `gpt-j-online.ipynb` notebook. | ||
|
||
2. Open a terminal in the Jupyter session and install Ray AIR: | ||
``` | ||
pip install ray[air] | ||
``` | ||
|
||
3. Run through the notebook cells. You can change the prompt in the last cell: | ||
``` | ||
prompt = ( | ||
## Input your own prompt here | ||
) | ||
``` | ||
|
||
4. This should output a generated text response. | ||
|
||
|
||
## Logging and Monitoring | ||
|
||
This repository comes with out-of-the-box integrations with Google Cloud Logging | ||
and Managed Prometheus for monitoring. To see your Ray cluster logs: | ||
|
||
1. Open Cloud Console and open Logging | ||
2. If using Jupyter notebook for job submission, use the following query parameters: | ||
``` | ||
resource.type="k8s_container" | ||
resource.labels.cluster_name=%CLUSTER_NAME% | ||
resource.labels.pod_name=%RAY_HEAD_POD_NAME% | ||
resource.labels.container_name="fluentbit" | ||
``` | ||
3. If using Ray Jobs API: | ||
(a) Note the job ID returned by the `ray job submit` API. | ||
Example job submission: `ray job submit --working-dir /Users/imreddy/ray_working_directory -- python script.py` | ||
Job submission ID: `Job 'raysubmit_kFWB6VkfyqK1CbEV' submitted successfully` | ||
(b) Get the namespace name from `user/variables.tf` or `kubectl get namespaces` | ||
(c) Use the following query to search for the job logs: | ||
|
||
``` | ||
resource.labels.namespace_name=%NAMESPACE_NAME% | ||
jsonpayload.job_id=%RAY_JOB_ID% | ||
``` | ||
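The two log queries above differ only in the values substituted for the `%...%` placeholders. The sketch below assembles them as strings; the function names and sample values are hypothetical, and the field names are copied verbatim from the queries in this README.

```python
def jupyter_log_filter(cluster_name: str, head_pod_name: str) -> str:
    """Query for jobs submitted from the Jupyter notebook."""
    return "\n".join([
        'resource.type="k8s_container"',
        f'resource.labels.cluster_name="{cluster_name}"',
        f'resource.labels.pod_name="{head_pod_name}"',
        'resource.labels.container_name="fluentbit"',
    ])


def ray_job_log_filter(namespace_name: str, ray_job_id: str) -> str:
    """Query for jobs submitted through the Ray Jobs API."""
    return "\n".join([
        f'resource.labels.namespace_name="{namespace_name}"',
        # Field name spelled as in the README query above.
        f'jsonpayload.job_id="{ray_job_id}"',
    ])


# Hypothetical namespace; the job ID is the sample from this README.
print(ray_job_log_filter("ray", "raysubmit_kFWB6VkfyqK1CbEV"))
```

Each clause sits on its own line, which the Logs Explorer treats as an implicit `AND` between the conditions.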
|
||
To see monitoring metrics: | ||
1. Open Cloud Console and open Metrics Explorer | ||
2. In "Target", select "Prometheus Target" and then "Ray". | ||
3. Select the metric you want to view, and then click "Apply". | ||
* See [LICENSE](/LICENSE) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
FROM nvcr.io/nvidia/tensorflow:22.11-tf2-py3 | ||
|
||
# Install a pinned JAX release (0.4.2) with CUDA support | ||
RUN pip install --upgrade pip | ||
RUN pip install jax[cuda]==0.4.2 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html | ||
|
||
WORKDIR /scripts | ||
ADD train.py /scripts |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
# JAX 'Hello World' on GKE + A100-80GB | ||
|
||
[![Open in Cloud Shell](https://gstatic.com/cloudssh/images/open-btn.svg)](https://ssh.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https://github.com/GoogleCloudPlatform/kubernetes-engine-samples&cloudshell_tutorial=README.md&cloudshell_workspace=ai-ml/gke-a100-jax) | ||
|
||
This tutorial shows how to run a simple JAX program on NVIDIA A100 80GB GPUs in a GKE cluster. | ||
|
||
Visit https://cloud.google.com/blog/products/containers-kubernetes/machine-learning-with-jax-on-kubernetes-with-nvidia-gpus to follow the tutorial. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
# Copyright 2023 Google LLC | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# https://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
#!/bin/bash | ||
|
||
export PROJECT=$(gcloud config list project --format "value(core.project)") | ||
|
||
docker build . -f Dockerfile -t "gcr.io/${PROJECT}/jax/hello:latest" | ||
|
||
docker push "gcr.io/${PROJECT}/jax/hello:latest" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
# Copyright 2023 Google LLC | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
apiVersion: batch/v1 | ||
kind: Job | ||
metadata: | ||
name: job-name | ||
spec: | ||
template: | ||
spec: | ||
tolerations: | ||
- key: nvidia.com/gpu | ||
operator: Exists | ||
effect: NoSchedule | ||
containers: | ||
- name: jax-worker | ||
resources: | ||
limits: | ||
nvidia.com/gpu: 1 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
# Copyright 2023 Google LLC | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
apiVersion: batch/v1 | ||
kind: Job | ||
metadata: | ||
name: job-name | ||
spec: | ||
completions: 1 | ||
parallelism: 1 | ||
completionMode: Indexed | ||
backoffLimit: 1 | ||
template: | ||
spec: | ||
subdomain: headless-svc | ||
restartPolicy: Never | ||
containers: | ||
- name: jax-worker | ||
image: gcr.io/<<PROJECT>>/jax/hello:latest | ||
command: ["python", "train.py"] | ||
args: | ||
- --num_processes | ||
- WILL_BE_REPLACED | ||
- --job_name | ||
- WILL_BE_REPLACED | ||
- --sub_domain | ||
- WILL_BE_REPLACED | ||
- --coordinator_port | ||
- WILL_BE_REPLACED | ||
ports: | ||
- containerPort: 1234 |
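The Job above passes four flags to `train.py`, which is not shown in this part of the diff. The following is a hypothetical sketch of how such a script might turn those flags into settings for multi-process JAX. It assumes that pods of an Indexed Job get the hostname `<job-name>-<index>`, that the headless Service named by `subdomain` makes `<job-name>-0.<subdomain>` resolvable, and that Kubernetes sets `JOB_COMPLETION_INDEX` in each pod. The function names are illustrative only.

```python
import argparse
import os


def parse_args(argv):
    """Parse the four flags the Job manifest passes to the container."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_processes", type=int, required=True)
    parser.add_argument("--job_name", required=True)
    parser.add_argument("--sub_domain", required=True)
    parser.add_argument("--coordinator_port", type=int, required=True)
    return parser.parse_args(argv)


def distributed_settings(args, env=os.environ):
    """Derive coordinator address and process ID for this pod."""
    # The pod with completion index 0 acts as the coordinator.
    coordinator = f"{args.job_name}-0.{args.sub_domain}:{args.coordinator_port}"
    return {
        "coordinator_address": coordinator,
        "num_processes": args.num_processes,
        "process_id": int(env.get("JOB_COMPLETION_INDEX", "0")),
    }


# Example values matching the kustomization in this commit.
example = parse_args([
    "--num_processes", "32", "--job_name", "jax-hello-world",
    "--sub_domain", "headless-svc", "--coordinator_port", "1234",
])
print(distributed_settings(example, env={"JOB_COMPLETION_INDEX": "3"}))
```

In a real training script, the resulting dictionary could be passed as keyword arguments to `jax.distributed.initialize`; that call is left out here so the sketch runs without JAX installed.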
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,97 @@ | ||
# Copyright 2023 Google LLC | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
|
||
namespace: default | ||
|
||
resources: | ||
- job.yaml | ||
- service.yaml | ||
|
||
apiVersion: kustomize.config.k8s.io/v1beta1 | ||
kind: Kustomization | ||
patches: | ||
- path: job-patch.yaml | ||
- patch: |- | ||
- op: replace | ||
path: /spec/completions | ||
value: 32 | ||
- op: replace | ||
path: /metadata/name | ||
value: jax-hello-world | ||
- op: add | ||
path: /spec/template/spec/nodeSelector | ||
value: | ||
cloud.google.com/gke-accelerator: | ||
nvidia-a100-80gb | ||
target: | ||
kind: Job | ||
version: v1 | ||
|
||
images: | ||
- name: gcr.io/<<PROJECT>>/jax/hello:latest | ||
newName: gcr.io/<<PROJECT>>/jax/hello | ||
newTag: latest | ||
|
||
replacements: | ||
- source: | ||
kind: Job | ||
version: v1 | ||
targets: | ||
- select: | ||
kind: Service | ||
version: v1 | ||
fieldPaths: | ||
- spec.selector.job-name | ||
- select: | ||
kind: Job | ||
version: v1 | ||
fieldPaths: | ||
- spec.template.spec.containers.[name=jax-worker].args.3 | ||
- source: | ||
kind: Job | ||
version: v1 | ||
fieldPath: spec.template.spec.subdomain | ||
targets: | ||
- select: | ||
kind: Service | ||
version: v1 | ||
fieldPaths: | ||
- metadata.name | ||
- select: | ||
kind: Job | ||
version: v1 | ||
fieldPaths: | ||
- spec.template.spec.containers.[name=jax-worker].args.5 | ||
- source: | ||
kind: Job | ||
version: v1 | ||
fieldPath: spec.completions | ||
targets: | ||
- select: | ||
kind: Job | ||
version: v1 | ||
fieldPaths: | ||
- spec.template.spec.containers.[name=jax-worker].args.1 | ||
- spec.parallelism | ||
- source: | ||
kind: Job | ||
version: v1 | ||
fieldPath: spec.template.spec.containers.[name=jax-worker].ports.0.containerPort | ||
targets: | ||
- select: | ||
kind: Job | ||
version: v1 | ||
fieldPaths: | ||
- spec.template.spec.containers.[name=jax-worker].args.7 |
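The `replacements` entries above fill in the `WILL_BE_REPLACED` placeholders of the Job by copying values from other fields. The sketch below imitates that mapping on a plain dictionary; it does not call kustomize. It assumes the patches (name `jax-hello-world`, `completions: 32`) are applied before the replacements and that a source without `fieldPath` falls back to `metadata.name`. Values are rendered as strings for readability, and the function name is hypothetical.

```python
import copy

# Job after the patches, reduced to the fields the replacements touch.
patched_job = {
    "metadata": {"name": "jax-hello-world"},
    "spec": {
        "completions": 32,
        "parallelism": 1,
        "template": {"spec": {
            "subdomain": "headless-svc",
            "containers": [{
                "name": "jax-worker",
                "args": [
                    "--num_processes", "WILL_BE_REPLACED",
                    "--job_name", "WILL_BE_REPLACED",
                    "--sub_domain", "WILL_BE_REPLACED",
                    "--coordinator_port", "WILL_BE_REPLACED",
                ],
                "ports": [{"containerPort": 1234}],
            }],
        }},
    },
}


def apply_job_replacements(job):
    """Copy source fields into the Job targets listed in the kustomization."""
    out = copy.deepcopy(job)
    pod_spec = out["spec"]["template"]["spec"]
    worker = next(c for c in pod_spec["containers"] if c["name"] == "jax-worker")
    args = worker["args"]
    args[1] = str(out["spec"]["completions"])            # spec.completions
    out["spec"]["parallelism"] = out["spec"]["completions"]
    args[3] = out["metadata"]["name"]                    # metadata.name (default)
    args[5] = pod_spec["subdomain"]                      # spec.template.spec.subdomain
    args[7] = str(worker["ports"][0]["containerPort"])   # ports.0.containerPort
    return out


result = apply_job_replacements(patched_job)
print(result["spec"]["template"]["spec"]["containers"][0]["args"])
```

The same sources also target the Service (`spec.selector.job-name` and `metadata.name`), which is what ties the headless Service to the Job's pods; that half is omitted from the sketch because `service.yaml` is not shown here.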