Commit

Merge pull request #2 from rbarberop/main
Reorg in folders and added new README and LICENSE
richardsliu committed Jun 13, 2023
2 parents ece26c9 + faf6fff commit 320348a
Showing 60 changed files with 669 additions and 155 deletions.
1 change: 0 additions & 1 deletion LICENSE
@@ -1,4 +1,3 @@

Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
162 changes: 8 additions & 154 deletions README.md
@@ -1,159 +1,13 @@
# Ray on GKE
# Kubernetes Engine Samples

This repository contains a Terraform template for running [Ray](https://www.ray.io/) on Google Kubernetes Engine.
We've also included some example notebooks, including one that serves a GPT-J-6B model with Ray AIR (see
[here](https://docs.ray.io/en/master/ray-air/examples/gptj_serving.html) for the original notebook).
[![Open in Cloud Shell](https://gstatic.com/cloudssh/images/open-btn.svg)](https://ssh.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https://github.com/GoogleCloudPlatform/ai-on-gke&cloudshell_tutorial=README.md&cloudshell_workspace=hello-app)

The solution is split into `platform` and `user` resources.
This repository contains assets related to AI/ML workloads on
[Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine/).

Platform resources (deployed once):
* GKE Cluster
* Nvidia GPU drivers
* Kuberay operator and CRDs
## Important Note
Use of the assets contained in this repository is subject to compliance with [Google's AI Principles](https://ai.google/responsibility/principles/).

User resources (deployed once per user):
* User namespace
* Kubernetes service accounts
* Kuberay cluster
* Prometheus monitoring
* Logging container
* Jupyter notebook
## Licensing

## Installation

Note: Terraform keeps state metadata in a local file called `terraform.tfstate`.
If you need to reinstall any resources, make sure to delete this file as well.

### Platform

1. `cd platform`

2. Edit `variables.tf` with your GCP settings.

3. Run `terraform init`

4. Run `terraform apply`

### User

1. `cd user`

2. Edit `variables.tf` with your GCP settings.

3. Run `terraform init`

4. Get the GKE cluster name and location/region from `platform/variables.tf`.
Run `gcloud container clusters get-credentials %gke_cluster_name% --location=%location%`
Configuring `gcloud` [instructions](https://cloud.google.com/sdk/docs/initializing)

5. Run `terraform apply`

## Using Ray with Jupyter

1. Run `kubectl get services -n <namespace>`

2. Copy the external IP for the notebook.

3. Open the external IP in a browser and log in. The default user names and
passwords can be found in the [Jupyter
settings](https://github.com/richardsliu/ray-on-gke/blob/main/user/modules/jupyterhub/jupyterhub-values.yaml) file.

4. The Ray cluster is available at `ray://example-cluster-kuberay-head-svc:10001`. To access the cluster, you can open one of the sample notebooks under `example_notebooks` (via `File` -> `Open from URL` in the Jupyter notebook window and use the raw file URL from GitHub) and run through the example. Ex url: https://raw.githubusercontent.com/richardsliu/ray-on-gke/main/example_notebooks/gpt-j-online.ipynb

5. To use the Ray dashboard, run the following command to port-forward:
```
kubectl port-forward -n ray service/example-cluster-kuberay-head-svc 8265:8265
```

And then open the dashboard using the following URL:
```
http://localhost:8265
```

## Using Ray with Ray Jobs API

1. To connect to the remote GKE cluster with the Ray API, set up access to the Ray dashboard.
Run the following command to port-forward:
```
kubectl port-forward -n ray service/example-cluster-kuberay-head-svc 8265:8265
```

And then open the dashboard using the following URL:
```
http://localhost:8265
```

2. Set the RAY_ADDRESS environment variable:
`export RAY_ADDRESS="http://127.0.0.1:8265"`

3. Create a working directory with some job file `ray_job.py`.

4. Submit the job:
`ray job submit --working-dir %your_working_directory% -- python ray_job.py`

5. Note the job submission ID from the output, e.g.:
`Job 'raysubmit_inB2ViQuE29aZRJ5' succeeded`

See [Ray docs](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/quickstart.html#submitting-a-job) for more info.

## Securing Your Cluster Endpoints

For demo purposes, this repo creates a public IP for the Jupyter notebook with basic dummy authentication. To secure your cluster, it is *strongly recommended* that you replace
this with your own secure endpoints.

For more information, please take a look at the following links:
* https://cloud.google.com/iap/docs/enabling-kubernetes-howto
* https://cloud.google.com/endpoints/docs/openapi/get-started-kubernetes-engine
* https://jupyterhub.readthedocs.io/en/stable/tutorial/getting-started/authenticators-users-basics.html


## Running GPT-J-6B

This example is adapted from Ray AIR's examples [here](https://docs.ray.io/en/master/ray-air/examples/gptj_serving.html).

1. Open the `gpt-j-online.ipynb` notebook.

2. Open a terminal in the Jupyter session and install Ray AIR:
```
pip install ray[air]
```

3. Run through the notebook cells. You can change the prompt in the last cell:
```
prompt = (
## Input your own prompt here
)
```

4. This should output a generated text response.


## Logging and Monitoring

This repository comes with out-of-the-box integrations with Google Cloud Logging
and Managed Prometheus for monitoring. To see your Ray cluster logs:

1. Open Cloud Console and open Logging
2. If using Jupyter notebook for job submission, use the following query parameters:
```
resource.type="k8s_container"
resource.labels.cluster_name=%CLUSTER_NAME%
resource.labels.pod_name=%RAY_HEAD_POD_NAME%
resource.labels.container_name="fluentbit"
```
3. If using Ray Jobs API:
(a) Note the job ID returned by the `ray job submit` API.
Eg: Job submission: `ray job submit --working-dir /Users/imreddy/ray_working_directory -- python script.py`
Job submission ID: `Job 'raysubmit_kFWB6VkfyqK1CbEV' submitted successfully`
(b) Get the namespace name from `user/variables.tf` or `kubectl get namespaces`
(c) Use the following query to search for the job logs:

```
resource.labels.namespace_name=%NAMESPACE_NAME%
jsonpayload.job_id=%RAY_JOB_ID%
```

To see monitoring metrics:
1. Open Cloud Console and open Metrics Explorer
2. In "Target", select "Prometheus Target" and then "Ray".
3. Select the metric you want to view, and then click "Apply".
* See [LICENSE](/LICENSE)
8 changes: 8 additions & 0 deletions gke-a100-jax/Dockerfile
@@ -0,0 +1,8 @@
FROM nvcr.io/nvidia/tensorflow:22.11-tf2-py3

# Install the latest jax
RUN pip install --upgrade pip
RUN pip install jax[cuda]==0.4.2 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

WORKDIR /scripts
ADD train.py /scripts
7 changes: 7 additions & 0 deletions gke-a100-jax/README.md
@@ -0,0 +1,7 @@
# JAX 'Hello World' on GKE + A100-80GB

[![Open in Cloud Shell](https://gstatic.com/cloudssh/images/open-btn.svg)](https://ssh.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https://github.com/GoogleCloudPlatform/kubernetes-engine-samples&cloudshell_tutorial=README.md&cloudshell_workspace=ai-ml/gke-a100-jax)

This tutorial shows how to run a simple JAX program on a GKE cluster using NVIDIA A100-80GB GPUs.

To follow the tutorial, visit https://cloud.google.com/blog/products/containers-kubernetes/machine-learning-with-jax-on-kubernetes-with-nvidia-gpus.
21 changes: 21 additions & 0 deletions gke-a100-jax/build_push_container.sh
@@ -0,0 +1,21 @@
#!/bin/bash

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

export PROJECT=$(gcloud config list project --format "value(core.project)")

docker build . -f Dockerfile -t "gcr.io/${PROJECT}/jax/hello:latest"

docker push "gcr.io/${PROJECT}/jax/hello:latest"
30 changes: 30 additions & 0 deletions gke-a100-jax/kubernetes/job-patch.yaml
@@ -0,0 +1,30 @@
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: batch/v1
kind: Job
metadata:
  name: job-name
spec:
  template:
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: jax-worker
        resources:
          limits:
            nvidia.com/gpu: 1
42 changes: 42 additions & 0 deletions gke-a100-jax/kubernetes/job.yaml
@@ -0,0 +1,42 @@
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: batch/v1
kind: Job
metadata:
  name: job-name
spec:
  completions: 1
  parallelism: 1
  completionMode: Indexed
  backoffLimit: 1
  template:
    spec:
      subdomain: headless-svc
      restartPolicy: Never
      containers:
      - name: jax-worker
        image: gcr.io/<<PROJECT>>/jax/hello:latest
        command: ["python", "train.py"]
        args:
        - --num_processes
        - WILL_BE_REPLACED
        - --job_name
        - WILL_BE_REPLACED
        - --sub_domain
        - WILL_BE_REPLACED
        - --coordinator_port
        - WILL_BE_REPLACED
        ports:
        - containerPort: 1234
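
Reviewer note: `train.py` is not shown in this part of the diff, so the following is only a sketch of how a script might consume the four flags that `job.yaml` passes. The helper names (`parse_args`, `coordinator_address`, `process_id`) are hypothetical. It assumes the usual Kubernetes behavior for Indexed Jobs: each pod gets the hostname `<job-name>-<index>` and the `JOB_COMPLETION_INDEX` environment variable, and with `subdomain` pointing at a headless Service, pod 0 is reachable as `<job-name>-0.<subdomain>` inside the namespace.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the four flags that job.yaml passes to train.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_processes", type=int, required=True)
    parser.add_argument("--job_name", required=True)
    parser.add_argument("--sub_domain", required=True)
    parser.add_argument("--coordinator_port", type=int, required=True)
    return parser.parse_args(argv)


def coordinator_address(job_name, sub_domain, port):
    """Build the host:port of the pod with completion index 0."""
    # Assumption: index 0 acts as the coordinator for all processes.
    return f"{job_name}-0.{sub_domain}:{port}"


def process_id(environ=os.environ):
    """Read this pod's completion index (defaults to 0 outside a Job)."""
    return int(environ.get("JOB_COMPLETION_INDEX", "0"))


# Example with the values that the kustomization in this commit would produce.
demo = parse_args([
    "--num_processes", "32",
    "--job_name", "jax-hello-world",
    "--sub_domain", "headless-svc",
    "--coordinator_port", "1234",
])
address = coordinator_address(demo.job_name, demo.sub_domain, demo.coordinator_port)
print(address, demo.num_processes, process_id())
```

A real training script would presumably hand these three values to `jax.distributed.initialize(coordinator_address, num_processes, process_id)` before running any JAX computation; that call is left out here so the sketch runs without JAX installed.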
97 changes: 97 additions & 0 deletions gke-a100-jax/kubernetes/kustomization.yaml
@@ -0,0 +1,97 @@
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


namespace: default

resources:
- job.yaml
- service.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
patches:
- path: job-patch.yaml
- patch: |-
    - op: replace
      path: /spec/completions
      value: 32
    - op: replace
      path: /metadata/name
      value: jax-hello-world
    - op: add
      path: /spec/template/spec/nodeSelector
      value:
        cloud.google.com/gke-accelerator:
          nvidia-a100-80gb
  target:
    kind: Job
    version: v1

images:
- name: gcr.io/<<PROJECT>>/jax/hello:latest
  newName: gcr.io/<<PROJECT>>/jax/hello
  newTag: latest

replacements:
- source:
    kind: Job
    version: v1
  targets:
  - select:
      kind: Service
      version: v1
    fieldPaths:
    - spec.selector.job-name
  - select:
      kind: Job
      version: v1
    fieldPaths:
    - spec.template.spec.containers.[name=jax-worker].args.3
- source:
    kind: Job
    version: v1
    fieldPath: spec.template.spec.subdomain
  targets:
  - select:
      kind: Service
      version: v1
    fieldPaths:
    - metadata.name
  - select:
      kind: Job
      version: v1
    fieldPaths:
    - spec.template.spec.containers.[name=jax-worker].args.5
- source:
    kind: Job
    version: v1
    fieldPath: spec.completions
  targets:
  - select:
      kind: Job
      version: v1
    fieldPaths:
    - spec.template.spec.containers.[name=jax-worker].args.1
    - spec.parallelism
- source:
    kind: Job
    version: v1
    fieldPath: spec.template.spec.containers.[name=jax-worker].ports.0.containerPort
  targets:
  - select:
      kind: Job
      version: v1
    fieldPaths:
    - spec.template.spec.containers.[name=jax-worker].args.7
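
Reviewer note: the `replacements` block fills the four `WILL_BE_REPLACED` placeholders in `job.yaml` by list index. The Python below is only a model of that mapping, not how Kustomize is implemented; `apply_replacements` and the dictionary keys are hypothetical names. It assumes the patches are applied before the replacements, so the Job name is already `jax-hello-world` and `completions` is already 32, and it relies on a replacement source without `fieldPath` defaulting to `metadata.name`.

```python
def apply_replacements(args, job):
    """Return a copy of the container args with the placeholders filled in."""
    filled = list(args)
    filled[1] = str(job["completions"])     # args.1 <- spec.completions
    filled[3] = job["name"]                 # args.3 <- metadata.name (default source field)
    filled[5] = job["subdomain"]            # args.5 <- spec.template.spec.subdomain
    filled[7] = str(job["container_port"])  # args.7 <- ports.0.containerPort
    return filled


# Container args as written in job.yaml.
template_args = [
    "--num_processes", "WILL_BE_REPLACED",
    "--job_name", "WILL_BE_REPLACED",
    "--sub_domain", "WILL_BE_REPLACED",
    "--coordinator_port", "WILL_BE_REPLACED",
]

# Job fields after the patches in kustomization.yaml (assumed ordering).
patched_job = {
    "completions": 32,
    "name": "jax-hello-world",
    "subdomain": "headless-svc",
    "container_port": 1234,
}

final_args = apply_replacements(template_args, patched_job)
print(final_args)
```

Because the targets are addressed by position, reordering the `args` list in `job.yaml` would silently send values to the wrong flags. Running `kubectl kustomize` on the directory and inspecting the rendered Job is the reliable way to confirm the actual output.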