Refactor the repo structure to make Jupyterhub/Kuberay modular #33

Merged 4 commits on Sep 13, 2023
31 changes: 31 additions & 0 deletions gke-platform/README.md
@@ -0,0 +1,31 @@
# Ray on GKE

This repository contains a Terraform template for installing a Standard or Autopilot GKE cluster in your GCP project.
It sets up the cluster to work seamlessly with the `ray-on-gke` and `jupyter-on-gke` modules in the `ai-on-gke` repo.

Platform resources:
* GKE Cluster
* Nvidia GPU drivers
* Kuberay operator and CRDs

## Installation

Preinstall the following on your computer:
* Kubectl
* Terraform
* Helm
* Gcloud

> **_NOTE:_** Terraform keeps state metadata in a local file called `terraform.tfstate`. If you need to reinstall any resources, make sure to delete this file as well.

### Platform

1. Run `git clone https://github.com/GoogleCloudPlatform/ai-on-gke`

2. `cd ai-on-gke/gke-platform`

3. Edit `variables.tf` with your GCP settings.

4. Run `terraform init`

5. Run `terraform apply`
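
For reference, the whole flow condensed into a single shell session (a sketch; edit `variables.tf` with your own project settings before the final `terraform apply`):
```
# Clone the repo and move into the platform module
git clone https://github.com/GoogleCloudPlatform/ai-on-gke
cd ai-on-gke/gke-platform

# Edit variables.tf with your GCP project and cluster options first
terraform init
terraform apply
```
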
File renamed without changes.
@@ -40,4 +40,4 @@ variable "enable_tpu" {
type = bool
description = "Set to true to create TPU node pool"
default = false
}
}
File renamed without changes.
119 changes: 119 additions & 0 deletions jupyter-on-gke/README.md
@@ -0,0 +1,119 @@
# Ray on GKE

This repository contains a Terraform template for running [JupyterHub](https://jupyter.org/hub) on Google Kubernetes Engine.
We've also included some example notebooks (`ai-on-gke/ray-on-gke/example_notebooks`), including one that serves a GPT-J-6B model with Ray AIR (see
[here](https://docs.ray.io/en/master/ray-air/examples/gptj_serving.html) for the original notebook).

This module assumes you already have a functional GKE cluster. If not, follow the instructions under `ai-on-gke/gke-platform/README.md`
to install a Standard or Autopilot GKE cluster, then follow the instructions in this module to install Ray.

This module deploys the following resources, once per user:
* JupyterHub deployment
* User namespace
* Kubernetes service accounts

## Installation

Preinstall the following on your computer:
* Kubectl
* Terraform
* Helm
* Gcloud

> **_NOTE:_** Terraform keeps state metadata in a local file called `terraform.tfstate`. If you need to reinstall any resources, make sure to delete this file as well.

### JupyterHub

> **_NOTE:_** Currently `jupyter_config/config.yaml` has 3 profiles that use the same Jupyter image. The image, as well as the descriptions of these profiles, can be changed.

1. `cd ai-on-gke/jupyter-on-gke/jupyterhub`

2. Edit `variables.tf` with your GCP settings. The `<your user name>` that you specify will become a K8s namespace for your JupyterHub services.
Note: If using this with the Ray module (`ai-on-gke/ray-on-gke/`), it is recommended to use the same K8s namespace
for both, i.e. set this to the same namespace as in `ai-on-gke/ray-on-gke/user/variables.tf`.
If not, set `enable_create_namespace` to `true` so a new K8s namespace is created for the Jupyter resources.

3. Run `terraform init`

4. Find the name and location of the GKE cluster you want to use.
Run `gcloud container clusters list --project=<your GCP project>` to see all the available clusters.
Note: If you created the GKE cluster via the `ai-on-gke/gke-platform` module, you can get the cluster info from `ai-on-gke/gke-platform/variables.tf`

5. Run `gcloud container clusters get-credentials %gke_cluster_name% --location=%location%`.
See these [instructions](https://cloud.google.com/sdk/docs/initializing) for configuring `gcloud`.

6. Run `terraform apply`
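
For reference, steps 3 to 6 condensed into a single shell session (a sketch; substitute the cluster name and location from the `clusters list` output):
```
terraform init

# Find the cluster you want to target, then fetch credentials for it
gcloud container clusters list --project=<your GCP project>
gcloud container clusters get-credentials <gke_cluster_name> --location=<location>

terraform apply
```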

## Using Ray with Jupyter

1. Run `kubectl get services -n <namespace>`. The namespace is the user name that you specified above.

2. Copy the external IP for the notebook.

3. Open the external IP in a browser and login. The default user names and
passwords can be found in the [Jupyter
settings](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/jupyter-on-gke/jupyter_config/jupyter_config/config.yaml) file.

4. The Ray cluster is available at `ray://example-cluster-kuberay-head-svc:10001`. To access the cluster, you can open one of the sample notebooks under `example_notebooks` (via `File` -> `Open from URL` in the Jupyter notebook window, using the raw file URL from GitHub) and run through the example. Example URL: https://raw.githubusercontent.com/GoogleCloudPlatform/ai-on-gke/main/ray-on-gke/example_notebooks/gpt-j-online.ipynb

5. To use the Ray dashboard, run the following command to port-forward:
```
kubectl port-forward -n <namespace> service/example-cluster-kuberay-head-svc 8265:8265
```

And then open the dashboard using the following URL:
```
http://localhost:8265
```
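
If the dashboard does not come up, a quick sanity check (a sketch, using the default service name created by this module) is to confirm the Ray head service and pods exist before port-forwarding:
```
# Both commands should show running resources in your user namespace
kubectl get service example-cluster-kuberay-head-svc -n <namespace>
kubectl get pods -n <namespace>
```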

## Securing Your Cluster Endpoints

For demo purposes, this repo creates a public IP for the Jupyter notebook with basic dummy authentication. To secure your cluster, it is *strongly recommended* to replace
this with your own secure endpoints.

For more information, please take a look at the following links:
* https://cloud.google.com/iap/docs/enabling-kubernetes-howto
* https://cloud.google.com/endpoints/docs/openapi/get-started-kubernetes-engine
* https://jupyterhub.readthedocs.io/en/stable/tutorial/getting-started/authenticators-users-basics.html


## Running GPT-J-6B

This example is adapted from Ray AIR's examples [here](https://docs.ray.io/en/master/ray-air/examples/gptj_serving.html).

1. Open the `gpt-j-online.ipynb` notebook under `ai-on-gke/ray-on-gke/example_notebooks`.

2. Open a terminal in the Jupyter session and install Ray AIR:
```
pip install ray[air]
```

3. Run through the notebook cells. You can change the prompt in the last cell:
```
prompt = (
## Input your own prompt here
)
```

4. This should output a generated text response.


## Logging and Monitoring

This repository comes with out-of-the-box integrations with Google Cloud Logging
and Managed Prometheus for monitoring. To see your Ray cluster logs:

1. Open Cloud Console and open Logging
2. If using a Jupyter notebook for job submission, use the following query:
```
resource.type="k8s_container"
resource.labels.cluster_name=%CLUSTER_NAME%
resource.labels.pod_name=%RAY_HEAD_POD_NAME%
resource.labels.container_name="fluentbit"
```
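
The same filter can also be queried from the command line with `gcloud`; a minimal sketch, assuming you substitute your own project and cluster names:
```
gcloud logging read \
  'resource.type="k8s_container" AND resource.labels.cluster_name="<CLUSTER_NAME>" AND resource.labels.container_name="fluentbit"' \
  --project=<your GCP project> --limit=20
```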

To see monitoring metrics:
1. Open Cloud Console and open Metrics Explorer
2. In "Target", select "Prometheus Target" and then "Ray".
3. Select the metric you want to view, and then click "Apply".
@@ -72,11 +72,11 @@ singleuser:
# - >
# gitpuller https://github.com/data-8/materials-fa17 master materials-fa;
profileList:
- display_name: "Profile1 name"
description: "description here"
- display_name: "Basic"
description: "Creates CPU VMs as the compute for notebook execution."
default: true
- display_name: "Profile2 name"
description: "description here"
- display_name: "GPU T4"
description: "Creates GPU VMs (T4) as the compute for notebook execution"
kubespawner_override:
image: jupyter/tensorflow-notebook:python-3.10
extra_resource_limits:
@@ -85,8 +85,8 @@ singleuser:
# possible values: nvidia-tesla-k80, nvidia-tesla-p100, nvidia-tesla-p4, nvidia-tesla-v100, nvidia-tesla-t4, nvidia-tesla-a100, nvidia-a100-80gb, nvidia-l4
node_selector:
cloud.google.com/gke-accelerator: "nvidia-tesla-t4"
- display_name: "Profile3 name"
description: "description"
- display_name: "GPU A100"
description: "Creates GPU VMs (A100) as the compute for notebook execution"
kubespawner_override:
image: jupyter/tensorflow-notebook:python-3.10
extra_resource_limits:
@@ -95,7 +95,7 @@ singleuser:
extra_resource_guarantees:
nvidia.com/gpu: "2"
node_selector:
cloud.google.com/gke-accelerator: "nvidia-tesla-t4"
cloud.google.com/gke-accelerator: "nvidia-tesla-a100"
cmd: null
cloudMetadata:
blockWithIptables: false
File renamed without changes.
File renamed without changes.
File renamed without changes.
110 changes: 14 additions & 96 deletions ray-on-gke/README.md
@@ -1,27 +1,19 @@
# Ray on GKE

This repository contains a Terraform template for running [Ray](https://www.ray.io/) on Google Kubernetes Engine.
We've also included some example notebooks, including one that serves a GPT-J-6B model with Ray AIR (see
We've also included some example notebooks (`ai-on-gke/ray-on-gke/example_notebooks`), including one that serves a GPT-J-6B model with Ray AIR (see
[here](https://docs.ray.io/en/master/ray-air/examples/gptj_serving.html) for the original notebook).

The solution is split into `platform` and `user` resources.
This module assumes you already have a functional GKE cluster. If not, follow the instructions under `ai-on-gke/gke-platform/README.md`
to install a Standard or Autopilot GKE cluster, then follow the instructions in this module to install Ray.

Platform resources (deployed once):
* GKE Cluster
* Nvidia GPU drivers
* Kuberay operator and CRDs

User resources (deployed once per user):
This module deploys the following, once per user:
* User namespace
* Kubernetes service accounts
* Kuberay cluster
* Prometheus monitoring
* Logging container

Jupyter resources:
* Jupyter notebook
* User namespace (if not exist)

## Installation

Preinstall the following on your computer:
@@ -32,69 +24,22 @@ Preinstall the following on your computer:

> **_NOTE:_** Terraform keeps state metadata in a local file called `terraform.tfstate`. If you need to reinstall any resources, make sure to delete this file as well.

### Platform

1. git clone https://github.com/GoogleCloudPlatform/ai-on-gke

2. `cd ray-on-gke/platform`
2. `cd ai-on-gke/ray-on-gke/user`

3. Edit `variables.tf` with your GCP settings.
3. Edit `variables.tf` with your GCP settings. The `<your user name>` that you specify will become a K8s namespace for your Ray services.

4. Run `terraform init`

5. Run `terraform apply`

### User (No JupyterHub)

> **_NOTE:_** Skip this part if you only want to install JupyterHub

1. `cd ../user`
5. Find the name and location of the GKE cluster you want to use.
Run `gcloud container clusters list --project=<your GCP project>` to see all the available clusters.
Note: If you created the GKE cluster via the `ai-on-gke/gke-platform` module, you can get the cluster info from `ai-on-gke/gke-platform/variables.tf`

2. Edit `variables.tf` with your GCP settings. The `<your user name>` that you specify will become a K8s namespace for your Ray services.

3. Run `terraform init`

4. Get the GKE cluster name and location/region from `platform/variables.tf`.
Run `gcloud container clusters get-credentials %gke_cluster_name% --location=%location%`
6. Run `gcloud container clusters get-credentials %gke_cluster_name% --location=%location%`.
See these [instructions](https://cloud.google.com/sdk/docs/initializing) for configuring `gcloud`.

5. Run `terraform apply`

### JupyterHub

> **_NOTE:_** Currently the cd/user/jupyterhub/jupyter_config/config.yaml has 3 profiles that uses the same jupyter images, this can be changed, as well as the description of these profiles

1. `cd user/jupyterhub`

2. Edit `variables.tf` with your GCP settings. The `<your user name>` that you specify will become a K8s namespace for your Jupyterhub services.

3. If the namespace does not exist, set `enable_create_namespace` to `true`

4. Run `terraform init`

5. Run `terraform apply`

## Using Ray with Jupyter

1. Run `kubectl get services -n <namespace>`. The namespace is the user name that you specified above.

2. Copy the external IP for the notebook.

3. Open the external IP in a browser and login. The default user names and
passwords can be found in the [Jupyter
settings](https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/ray-on-gke/user/jupyterhub/jupyter_config/config.yaml) file.

4. The Ray cluster is available at `ray://example-cluster-kuberay-head-svc:10001`. To access the cluster, you can open one of the sample notebooks under `example_notebooks` (via `File` -> `Open from URL` in the Jupyter notebook window and use the raw file URL from GitHub) and run through the example. Ex url: https://raw.githubusercontent.com/GoogleCloudPlatform/ai-on-gke/main/ray-on-gke/example_notebooks/gpt-j-online.ipynb

5. To use the Ray dashboard, run the following command to port-forward:
```
kubectl port-forward -n <namespace> service/example-cluster-kuberay-head-svc 8265:8265
```

And then open the dashboard using the following URL:
```
http://localhost:8265
```
7. Run `terraform apply`
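
For reference, steps 4 to 7 condensed into a single shell session (a sketch; substitute the cluster name and location from the `clusters list` output):
```
terraform init
gcloud container clusters list --project=<your GCP project>
gcloud container clusters get-credentials <gke_cluster_name> --location=<location>
terraform apply
```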

## Using Ray with Ray Jobs API

@@ -122,37 +67,10 @@

See [Ray docs](https://docs.ray.io/en/latest/cluster/running-applications/job-submission/quickstart.html#submitting-a-job) for more info.
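
For example, with the Ray dashboard port-forwarded to `localhost:8265`, a job can be submitted from your machine (a sketch; `my_script.py` is a placeholder for your own script):
```
ray job submit --address http://localhost:8265 --working-dir . -- python my_script.py
```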

## Securing Your Cluster Endpoints

For demo purposes, this repo creates a public IP for the Jupyter notebook with basic dummy authentication. To secure your cluster, it is *strong recommended* to replace
this with your own secure endpoints.

For more information, please take a look at the following links:
* https://cloud.google.com/iap/docs/enabling-kubernetes-howto
* https://cloud.google.com/endpoints/docs/openapi/get-started-kubernetes-engine
* https://jupyterhub.readthedocs.io/en/stable/tutorial/getting-started/authenticators-users-basics.html


## Running GPT-J-6B

This example is adapted from Ray AIR's examples [here](https://docs.ray.io/en/master/ray-air/examples/gptj_serving.html).

1. Open the `gpt-j-online.ipynb` notebook.

2. Open a terminal in the Jupyter session and install Ray AIR:
```
pip install ray[air]
```

3. Run through the notebook cells. You can change the prompt in the last cell:
```
prompt = (
## Input your own prompt here
)
```

4. This should output a generated text response.
## Using Ray with Jupyter

If you want to connect to the Ray cluster via a Jupyter notebook, or to try the example notebooks in the repo, please
first install JupyterHub by following `ai-on-gke/jupyter-on-gke/README.md`.

## Logging and Monitoring
