Skip to content

Commit

Permalink
Merge branch 'main' into website
Browse files Browse the repository at this point in the history
  • Loading branch information
moficodes committed Apr 2, 2024
2 parents d478143 + 96e54be commit 129ee01
Show file tree
Hide file tree
Showing 5,127 changed files with 1,547,740 additions and 12,596 deletions.
The diff you're trying to view is too large. We only load the first 3000 changed files.
35 changes: 35 additions & 0 deletions .github/workflows/ci.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
name: Terraform CI
on:
push:
branches:
- main
pull_request:
branches:
- main
jobs:
Terraform-Lint-Check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: hashicorp/setup-terraform@v3
with:
terraform_version: "1.5.7"

- name: Terraform fmt
id: fmt
run: terraform fmt -check -recursive

- name: Terraform Init
id: init
run: |
terraform -chdir=applications/rag init
terraform -chdir=applications/ray init
terraform -chdir=applications/jupyter init
- name: Terraform Validate
id: validate
run: |
terraform -chdir=applications/rag validate -no-color
terraform -chdir=applications/ray validate -no-color
terraform -chdir=applications/jupyter validate -no-color
26 changes: 23 additions & 3 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -12,11 +12,31 @@
# See the License for the specific language governing permissions and
# limitations under the License.

terraform.tfvars
terraform.tfstate*
## Archives
**/*.tar
**/*.tar.gz
**/*.zip

# Directories
bin/
deploy/

# IDEs
.idea/

# Python
__pycache__/

# Terraform
default.tfstate
default.tfstate.backup
.terraform*
__pycache__/

public/
node_modules/
resources/
resources/
terraform.tfstate*
terraform.tfvars
tfplan

4 changes: 4 additions & 0 deletions CODEOWNERS
Validating CODEOWNERS rules …
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
# @name owns any files in the /benchmarks/
# directory at the root of the repository and any of its
# subdirectories.
/benchmarks/ @achandrasekar @ahg-g @annapendleton
171 changes: 165 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,13 +1,172 @@
# Kubernetes Engine Samples

[![Open in Cloud Shell](https://gstatic.com/cloudssh/images/open-btn.svg)](https://ssh.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https://github.com/GoogleCloudPlatform/ai-on-gke&cloudshell_tutorial=README.md&cloudshell_workspace=hello-app)
# AI on GKE Assets

This repository contains assets related to AI/ML workloads on
[Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine/).
[Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine/docs/integrations/ai-infra).

## Overview

Run optimized AI/ML workloads with Google Kubernetes Engine (GKE) platform orchestration capabilities. A robust AI/ML platform considers the following layers:

- Infrastructure orchestration that support GPUs and TPUs for training and serving workloads at scale
- Flexible integration with distributed computing and data processing frameworks
- Support for multiple teams on the same infrastructure to maximize utilization of resources

## Infrastructure

The AI-on-GKE application modules assumes you already have a functional GKE cluster. If not, follow the instructions under [infrastructure/README.md](./infrastructure/README.md) to install a Standard or Autopilot GKE cluster.

```bash
.
├── LICENSE
├── README.md
├── infrastructure
│ ├── README.md
│ ├── backend.tf
│ ├── main.tf
│ ├── outputs.tf
│ ├── platform.tfvars
│ ├── variables.tf
│ └── versions.tf
├── modules
│ ├── gke-autopilot-private-cluster
│ ├── gke-autopilot-public-cluster
│ ├── gke-standard-private-cluster
│ ├── gke-standard-public-cluster
│ ├── jupyter
│ ├── jupyter_iap
│ ├── jupyter_service_accounts
│ ├── kuberay-cluster
│ ├── kuberay-logging
│ ├── kuberay-monitoring
│ ├── kuberay-operator
│ └── kuberay-serviceaccounts
└── tutorial.md
```

To deploy new GKE cluster update the `platform.tfvars` file with the appropriate values and then execute below terraform commands:
```
terraform init
terraform apply -var-file platform.tfvars
```


## Applications

The repo structure looks like this:

```bash
.
├── LICENSE
├── Makefile
├── README.md
├── applications
│ ├── jupyter
│ └── ray
├── contributing.md
├── dcgm-on-gke
│ ├── grafana
│ └── quickstart
├── gke-a100-jax
│ ├── Dockerfile
│ ├── README.md
│ ├── build_push_container.sh
│ ├── kubernetes
│ └── train.py
├── gke-batch-refarch
│ ├── 01_gke
│ ├── 02_platform
│ ├── 03_low_priority
│ ├── 04_high_priority
│ ├── 05_compact_placement
│ ├── 06_jobset
│ ├── Dockerfile
│ ├── README.md
│ ├── cloudbuild-create.yaml
│ ├── cloudbuild-destroy.yaml
│ ├── create-platform.sh
│ ├── destroy-platform.sh
│ └── images
├── gke-disk-image-builder
│ ├── README.md
│ ├── cli
│ ├── go.mod
│ ├── go.sum
│ ├── imager.go
│ └── script
├── gke-dws-examples
│ ├── README.md
│ ├── dws-queues.yaml
│ ├── job.yaml
│ └── kueue-manifests.yaml
├── gke-online-serving-single-gpu
│ ├── README.md
│ └── src
├── gke-tpu-examples
│ ├── single-host-inference
│ └── training
├── indexed-job
│ ├── Dockerfile
│ ├── README.md
│ └── mnist.py
├── jobset
│ └── pytorch
├── modules
│ ├── gke-autopilot-private-cluster
│ ├── gke-autopilot-public-cluster
│ ├── gke-standard-private-cluster
│ ├── gke-standard-public-cluster
│ ├── jupyter
│ ├── jupyter_iap
│ ├── jupyter_service_accounts
│ ├── kuberay-cluster
│ ├── kuberay-logging
│ ├── kuberay-monitoring
│ ├── kuberay-operator
│ └── kuberay-serviceaccounts
├── saxml-on-gke
│ ├── httpserver
│ └── single-host-inference
├── training-single-gpu
│ ├── README.md
│ ├── data
│ └── src
├── tutorial.md
└── tutorials
├── e2e-genai-langchain-app
├── finetuning-llama-7b-on-l4
└── serving-llama2-70b-on-l4-gpus
```


### Jupyter Hub

This repository contains a Terraform template for running JupyterHub on Google Kubernetes Engine. We've also included some example notebooks ( under `applications/ray/example_notebooks`), including one that serves a GPT-J-6B model with Ray AIR (see here for the original notebook). To run these, follow the instructions at [applications/ray/README.md](./applications/ray/README.md) to install a Ray cluster.

This jupyter module deploys the following resources, once per user:
- JupyterHub deployment
- User namespace
- Kubernetes service accounts

Learn more [about JupyterHub on GKE here](./applications/jupyter/README.md)

### Ray

This repository contains a Terraform template for running Ray on Google Kubernetes Engine.

This module deploys the following, once per user:
- User namespace
- Kubernetes service accounts
- Kuberay cluster
- Prometheus monitoring
- Logging container

Learn more [about Ray on GKE here](./applications/ray/README.md)

## Important Considerations
- Make sure to configure terraform backend to use GCS bucket, in order to persist terraform state across different environments.

## Important Note
The use of the assets contained in this repository is subject to compliance with [Google's AI Principles](https://ai.google/responsibility/principles/)

## Licensing

* The use of the assets contained in this repository is subject to compliance with [Google's AI Principles](https://ai.google/responsibility/principles/)
* See [LICENSE](/LICENSE)
Loading

0 comments on commit 129ee01

Please sign in to comment.