Refactor the repo structure to make Jupyterhub/Kuberay modular #33

Merged 4 commits on Sep 13, 2023
31 changes: 31 additions & 0 deletions gke-platform/README.md
@@ -0,0 +1,31 @@
# Ray on GKE

This repository contains a Terraform template for installing a Standard or Autopilot GKE cluster in your GCP project.
It sets up the cluster to work seamlessly with the `ray-on-gke` and `jupyter-on-gke` modules in the `ai-on-gke` repo.

Platform resources:
* GKE Cluster
* Nvidia GPU drivers
* Kuberay operator and CRDs

## Installation

Preinstall the following on your computer:
* Kubectl
* Terraform
* Helm
* Gcloud

> **_NOTE:_** Terraform keeps state metadata in a local file called `terraform.tfstate`. If you need to reinstall any resources, make sure to delete this file as well.

### Platform

1. Run `git clone https://github.com/GoogleCloudPlatform/ai-on-gke`

2. `cd ai-on-gke/gke-platform`

3. Edit `variables.tf` with your GCP settings.

4. Run `terraform init`

5. Run `terraform apply`
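
For reference, the whole flow condensed into a single shell session (a sketch; edit `variables.tf` with your own project settings before the final `terraform apply`):
```
# Clone the repo and move into the platform module
git clone https://github.com/GoogleCloudPlatform/ai-on-gke
cd ai-on-gke/gke-platform

# Edit variables.tf with your GCP project and cluster options first
terraform init
terraform apply
```
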
File renamed without changes.
@@ -40,4 +40,4 @@ variable "enable_tpu" {
type = bool
description = "Set to true to create TPU node pool"
default = false
}
}
File renamed without changes.
119 changes: 119 additions & 0 deletions jupyter-on-gke/README.md
@@ -0,0 +1,119 @@
# Ray on GKE

This repository contains a Terraform template for running [JupyterHub](https://jupyter.org/hub) on Google Kubernetes Engine.
We've also included some example notebooks (`ai-on-gke/ray-on-gke/example_notebooks`), including one that serves a GPT-J-6B model with Ray AIR (see
[here](https://docs.ray.io/en/master/ray-air/examples/gptj_serving.html) for the original notebook).

This module assumes you already have a functional GKE cluster. If not, follow the instructions under `ai-on-gke/gke-platform/README.md`
to install a Standard or Autopilot GKE cluster, then follow the instructions in this module to install Ray.

This module deploys the following resources, once per user:
* JupyterHub deployment
* User namespace
* Kubernetes service accounts

## Installation

Preinstall the following on your computer:
* Kubectl
* Terraform
* Helm
* Gcloud

> **_NOTE:_** Terraform keeps state metadata in a local file called `terraform.tfstate`. If you need to reinstall any resources, make sure to delete this file as well.

### JupyterHub

> **_NOTE:_** Currently `jupyter_config/config.yaml` has 3 profiles that use the same Jupyter image. The image, as well as the descriptions of these profiles, can be changed.

1. `cd ai-on-gke/jupyter-on-gke/jupyterhub`

2. Edit `variables.tf` with your GCP settings. The `<your user name>` that you specify will become a K8s namespace for your JupyterHub services.
Note: If using this with the Ray module (`ai-on-gke/ray-on-gke/`), it is recommended to use the same K8s namespace
for both, i.e. set this to the same namespace as in `ai-on-gke/ray-on-gke/user/variables.tf`.
If not, set `enable_create_namespace` to `true` so a new K8s namespace is created for the Jupyter resources.

3. Run `terraform init`

4. Find the name and location of the GKE cluster you want to use.
Run `gcloud container clusters list --project=<your GCP project>` to see all the available clusters.
Note: If you created the GKE cluster via the `ai-on-gke/gke-platform` module, you can get the cluster info from `ai-on-gke/gke-platform/variables.tf`

5. Run `gcloud container clusters get-credentials %gke_cluster_name% --location=%location%`.
See these [instructions](https://cloud.google.com/sdk/docs/initializing) for configuring `gcloud`.

6. Run `terraform apply`
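
For reference, steps 3 to 6 condensed into a single shell session (a sketch; substitute the cluster name and location from the `clusters list` output):
```
terraform init

# Find the cluster you want to target, then fetch credentials for it
gcloud container clusters list --project=<your GCP project>
gcloud container clusters get-credentials <gke_cluster_name> --location=<location>

terraform apply
```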

## Using Ray with Jupyter

1. Run `kubectl get services -n <namespace>`. The namespace is the user name that you specified above.

2. Copy the external IP for the notebook.

3. Open the external IP in a browser and login. The default user names and
passwords can be found in the [Jupyter
settings](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/jupyter-on-gke/jupyter_config/jupyter_config/config.yaml) file.

4. The Ray cluster is available at `ray://example-cluster-kuberay-head-svc:10001`. To access the cluster, you can open one of the sample notebooks under `example_notebooks` (via `File` -> `Open from URL` in the Jupyter notebook window, using the raw file URL from GitHub) and run through the example. Example URL: https://raw.githubusercontent.com/GoogleCloudPlatform/ai-on-gke/main/ray-on-gke/example_notebooks/gpt-j-online.ipynb

5. To use the Ray dashboard, run the following command to port-forward:
```
kubectl port-forward -n <namespace> service/example-cluster-kuberay-head-svc 8265:8265
```

And then open the dashboard using the following URL:
```
http://localhost:8265
```
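
If the dashboard does not come up, a quick sanity check (a sketch, using the default service name created by this module) is to confirm the Ray head service and pods exist before port-forwarding:
```
# Both commands should show running resources in your user namespace
kubectl get service example-cluster-kuberay-head-svc -n <namespace>
kubectl get pods -n <namespace>
```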

## Securing Your Cluster Endpoints

For demo purposes, this repo creates a public IP for the Jupyter notebook with basic dummy authentication. To secure your cluster, it is *strongly recommended* to replace
this with your own secure endpoints.

For more information, please take a look at the following links:
* https://cloud.google.com/iap/docs/enabling-kubernetes-howto
* https://cloud.google.com/endpoints/docs/openapi/get-started-kubernetes-engine
* https://jupyterhub.readthedocs.io/en/stable/tutorial/getting-started/authenticators-users-basics.html


## Running GPT-J-6B

This example is adapted from Ray AIR's examples [here](https://docs.ray.io/en/master/ray-air/examples/gptj_serving.html).

1. Open the `gpt-j-online.ipynb` notebook under `ai-on-gke/ray-on-gke/example_notebooks`.

2. Open a terminal in the Jupyter session and install Ray AIR:
```
pip install ray[air]
```

3. Run through the notebook cells. You can change the prompt in the last cell:
```
prompt = (
## Input your own prompt here
)
```

4. This should output a generated text response.


## Logging and Monitoring

This repository comes with out-of-the-box integrations with Google Cloud Logging
and Managed Prometheus for monitoring. To see your Ray cluster logs:

1. Open Cloud Console and open Logging
2. If using a Jupyter notebook for job submission, use the following query:
```
resource.type="k8s_container"
resource.labels.cluster_name=%CLUSTER_NAME%
resource.labels.pod_name=%RAY_HEAD_POD_NAME%
resource.labels.container_name="fluentbit"
```
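
The same filter can also be queried from the command line with `gcloud`; a minimal sketch, assuming you substitute your own project and cluster names:
```
gcloud logging read \
  'resource.type="k8s_container" AND resource.labels.cluster_name="<CLUSTER_NAME>" AND resource.labels.container_name="fluentbit"' \
  --project=<your GCP project> --limit=20
```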

To see monitoring metrics:
1. Open Cloud Console and open Metrics Explorer
2. In "Target", select "Prometheus Target" and then "Ray".
3. Select the metric you want to view, and then click "Apply".
@@ -72,11 +72,11 @@ singleuser:
# - >
# gitpuller https://github.com/data-8/materials-fa17 master materials-fa;
profileList:
- display_name: "Profile1 name"
description: "description here"
- display_name: "Basic"
description: "Creates CPU VMs as the compute for notebook execution."
default: true
- display_name: "Profile2 name"
description: "description here"
- display_name: "GPU T4"
description: "Creates GPU VMs (T4) as the compute for notebook execution"
kubespawner_override:
image: jupyter/tensorflow-notebook:python-3.10
extra_resource_limits:
@@ -85,8 +85,8 @@ singleuser:
# possible values: nvidia-tesla-k80, nvidia-tesla-p100, nvidia-tesla-p4, nvidia-tesla-v100, nvidia-tesla-t4, nvidia-tesla-a100, nvidia-a100-80gb, nvidia-l4
node_selector:
cloud.google.com/gke-accelerator: "nvidia-tesla-t4"
- display_name: "Profile3 name"
description: "description"
- display_name: "GPU A100"
description: "Creates GPU VMs (A100) as the compute for notebook execution"
kubespawner_override:
image: jupyter/tensorflow-notebook:python-3.10
extra_resource_limits:
@@ -95,7 +95,7 @@ singleuser:
extra_resource_guarantees:
nvidia.com/gpu: "2"
node_selector:
cloud.google.com/gke-accelerator: "nvidia-tesla-t4"
cloud.google.com/gke-accelerator: "nvidia-tesla-a100"
cmd: null
cloudMetadata:
blockWithIptables: false
File renamed without changes.
File renamed without changes.
File renamed without changes.
110 changes: 14 additions & 96 deletions ray-on-gke/README.md
@@ -1,27 +1,19 @@
# Ray on GKE

This repository contains a Terraform template for running [Ray](https://www.ray.io/) on Google Kubernetes Engine.
We've also included some example notebooks, including one that serves a GPT-J-6B model with Ray AIR (see
We've also included some example notebooks (`ai-on-gke/ray-on-gke/example_notebooks`), including one that serves a GPT-J-6B model with Ray AIR (see
[here](https://docs.ray.io/en/master/ray-air/examples/gptj_serving.html) for the original notebook).

The solution is split into `platform` and `user` resources.
This module assumes you already have a functional GKE cluster. If not, follow the instructions under `ai-on-gke/gke-platform/README.md`
to install a Standard or Autopilot GKE cluster, then follow the instructions in this module to install Ray.

Platform resources (deployed once):
* GKE Cluster
* Nvidia GPU drivers
* Kuberay operator and CRDs

User resources (deployed once per user):
This module deploys the following, once per user:
* User namespace
* Kubernetes service accounts
* Kuberay cluster
* Prometheus monitoring
* Logging container

Jupyter resources:
* Jupyter notebook
* User namespace (if not exist)

## Installation

Preinstall the following on your computer:
@@ -32,69 +24,22 @@ Preinstall the following on your computer:

> **_NOTE:_** Terraform keeps state metadata in a local file called `terraform.tfstate`. If you need to reinstall any resources, make sure to delete this file as well.

### Platform

1. git clone https://github.com/GoogleCloudPlatform/ai-on-gke

2. `cd ray-on-gke/platform`
2. `cd ai-on-gke/ray-on-gke/user`

3. Edit `variables.tf` with your GCP settings.
3. Edit `variables.tf` with your GCP settings. The `<your user name>` that you specify will become a K8s namespace for your Ray services.

4. Run `terraform init`

5. Run `terraform apply`

### User (No JupyterHub)

> **_NOTE:_** Skip this part if you only want to install JupyterHub

1. `cd ../user`
5. Find the name and location of the GKE cluster you want to use.
Run `gcloud container clusters list --project=<your GCP project>` to see all the available clusters.
Note: If you created the GKE cluster via the `ai-on-gke/gke-platform` module, you can get the cluster info from `ai-on-gke/gke-platform/variables.tf`

2. Edit `variables.tf` with your GCP settings. The `<your user name>` that you specify will become a K8s namespace for your Ray services.

3. Run `terraform init`

4. Get the GKE cluster name and location/region from `platform/variables.tf`.
Run `gcloud container clusters get-credentials %gke_cluster_name% --location=%location%`
6. Run `gcloud container clusters get-credentials %gke_cluster_name% --location=%location%`.
See these [instructions](https://cloud.google.com/sdk/docs/initializing) for configuring `gcloud`.

5. Run `terraform apply`

### JupyterHub

> **_NOTE:_** Currently the cd/user/jupyterhub/jupyter_config/config.yaml has 3 profiles that uses the same jupyter images, this can be changed, as well as the description of these profiles

1. `cd user/jupyterhub`

2. Edit `variables.tf` with your GCP settings. The `<your user name>` that you specify will become a K8s namespace for your Jupyterhub services.

3. If the namespace does not exist, set `enable_create_namespace` to `true`

4. Run `terraform init`

5. Run `terraform apply`

## Using Ray with Jupyter

1. Run `kubectl get services -n <namespace>`. The namespace is the user name that you specified above.

2. Copy the external IP for the notebook.

3. Open the external IP in a browser and login. The default user names and
passwords can be found in the [Jupyter
settings](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/ray-on-gke/user/jupyterhub/jupyter_config/config.yaml) file.

4. The Ray cluster is available at `ray://example-cluster-kuberay-head-svc:10001`. To access the cluster, you can open one of the sample notebooks under `example_notebooks` (via `File` -> `Open from URL` in the Jupyter notebook window and use the raw file URL from GitHub) and run through the example. Ex url: https://raw.githubusercontent.com/GoogleCloudPlatform/ai-on-gke/main/ray-on-gke/example_notebooks/gpt-j-online.ipynb

5. To use the Ray dashboard, run the following command to port-forward:
```
kubectl port-forward -n <namespace> service/example-cluster-kuberay-head-svc 8265:8265
```

And then open the dashboard using the following URL:
```
http://localhost:8265
```
7. Run `terraform apply`
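
For reference, steps 4 to 7 condensed into a single shell session (a sketch; substitute the cluster name and location from the `clusters list` output):
```
terraform init
gcloud container clusters list --project=<your GCP project>
gcloud container clusters get-credentials <gke_cluster_name> --location=<location>
terraform apply
```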

## Using Ray with Ray Jobs API

@@ -122,37 +67,10 @@

See [Ray docs](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/quickstart.html#submitting-a-job) for more info.
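
For example, with the Ray dashboard port-forwarded to `localhost:8265`, a job can be submitted from your machine (a sketch; `my_script.py` is a placeholder for your own script):
```
ray job submit --address http://localhost:8265 --working-dir . -- python my_script.py
```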

## Securing Your Cluster Endpoints

For demo purposes, this repo creates a public IP for the Jupyter notebook with basic dummy authentication. To secure your cluster, it is *strong recommended* to replace
this with your own secure endpoints.

For more information, please take a look at the following links:
* https://cloud.google.com/iap/docs/enabling-kubernetes-howto
* https://cloud.google.com/endpoints/docs/openapi/get-started-kubernetes-engine
* https://jupyterhub.readthedocs.io/en/stable/tutorial/getting-started/authenticators-users-basics.html


## Running GPT-J-6B

This example is adapted from Ray AIR's examples [here](https://docs.ray.io/en/master/ray-air/examples/gptj_serving.html).

1. Open the `gpt-j-online.ipynb` notebook.

2. Open a terminal in the Jupyter session and install Ray AIR:
```
pip install ray[air]
```

3. Run through the notebook cells. You can change the prompt in the last cell:
```
prompt = (
## Input your own prompt here
)
```

4. This should output a generated text response.
## Using Ray with Jupyter

If you want to connect to the Ray cluster via a Jupyter notebook, or to try the example notebooks in the repo, please
first install JupyterHub by following `ai-on-gke/jupyter-on-gke/README.md`.

## Logging and Monitoring
