diff --git a/best-practices/ml-platform/README.md b/best-practices/ml-platform/README.md new file mode 100644 index 000000000..10c43eb52 --- /dev/null +++ b/best-practices/ml-platform/README.md @@ -0,0 +1,74 @@ +# Machine learning platform (MLP) on GKE reference architecture for enabling Machine Learning Operations (MLOps) + +## Platform Principles + +This reference architecture demonstrates how to build a GKE platform that facilitates Machine Learning. The reference architecture is based on the following principles: + +- The platform admin will create the GKE platform using an IaC tool like [Terraform][terraform]. The IaC will come with reusable modules that can be referenced to create more resources as the demand grows. +- The platform will be based on [GitOps][gitops]. +- After the GKE platform has been created, cluster scoped resources on it will be created through [Config Sync][config-sync] by the admins. +- Platform admins will create a namespace per application and give the application team members full access to it. +- The namespace scoped resources will be created by the Application/ML teams either via [Config Sync][config-sync] or through a deployment tool like [Cloud Deploy][cloud-deploy]. + +## Critical User Journeys (CUJs) + +### Persona : Platform Admin + +- Offer a platform that incorporates established best practices. +- Grant end users the essential resources, guided by the principle of least privilege, empowering them to manage and maintain their workloads. +- Establish secure channels for end users to interact seamlessly with the platform. +- Enable the enforcement of robust security policies across the platform. 
+ +### Persona : Machine Learning Engineer + +- Deploy the model with ease and make the endpoints available only to the intended audience +- Continuously monitor the model performance and resource utilization +- Troubleshoot any performance or integration issues +- Ability to version, store and access the models and model artifacts: + - To debug & troubleshoot in production and track back to the specific model version & associated training data + - To perform a quick & controlled rollback to a previous, more stable version +- Implement the feedback loop to adapt to changing data & business needs: + - Ability to retrain / fine-tune the model. + - Ability to split the traffic between models (A/B testing) + - Ability to switch between models without breaking the inference system for the end-users +- Ability to scale the infrastructure up/down to accommodate changing needs +- Ability to share the insights and findings with stakeholders to make data-driven decisions + +### Persona : Machine Learning Operator + +- Provide and maintain software required by the end users of the platform. +- Operationalize experimental workloads by providing guidance and best practices for running the workloads on the platform. +- Deploy the workloads on the platform. +- Assist with enabling observability and monitoring for the workloads to ensure smooth operations. + +## Prerequisites + +- This guide is meant to be run on [Cloud Shell](https://shell.cloud.google.com) which comes preinstalled with the [Google Cloud SDK](https://cloud.google.com/sdk) and other tools that are required to complete this tutorial. 
+- Familiarity with following + - [Google Kubernetes Engine][gke] + - [Terraform][terraform] + - [git][git] + - [Google Configuration Management root-sync][root-sync] + - [Google Configuration Management repo-sync][repo-sync] + - [GitHub][github] + +## Deploy the platform + +[Sandbox Reference Architecture Guide](examples/platform/sandbox/README.md): Set up an environment to familiarize yourself with the architecture and get an understanding of the concepts. + +## Use cases + +- [Distributed Data Processing with Ray](examples/use-case/ray/dataprocessing/README.md): Run a distributed data processing job using Ray. + +[gitops]: https://about.gitlab.com/topics/gitops/ +[repo-sync]: https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields +[root-sync]: https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields +[config-sync]: https://cloud.google.com/anthos-config-management/docs/config-sync-overview +[cloud-deploy]: https://cloud.google.com/deploy?hl=en +[terraform]: https://www.terraform.io/ +[gke]: https://cloud.google.com/kubernetes-engine?hl=en +[git]: https://git-scm.com/ +[github]: https://github.com/ +[gcp-project]: https://cloud.google.com/resource-manager/docs/creating-managing-projects +[personal-access-token]: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens +[machine-user-account]: https://docs.github.com/en/get-started/learning-about-github/types-of-github-accounts diff --git a/best-practices/ml-platform/docs/images/configsync.png b/best-practices/ml-platform/docs/images/configsync.png new file mode 100644 index 000000000..75ed75ca4 Binary files /dev/null and b/best-practices/ml-platform/docs/images/configsync.png differ diff --git a/best-practices/ml-platform/docs/images/ray-dataprocessing-workflow.png b/best-practices/ml-platform/docs/images/ray-dataprocessing-workflow.png new file mode 100644 index 000000000..3d99daae1 
Binary files /dev/null and b/best-practices/ml-platform/docs/images/ray-dataprocessing-workflow.png differ diff --git a/best-practices/ml-platform/examples/platform/sandbox/README.md b/best-practices/ml-platform/examples/platform/sandbox/README.md new file mode 100644 index 000000000..0ce83cbc7 --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/README.md @@ -0,0 +1,340 @@ +# Machine learning platform (MLP) on GKE reference architecture: Sandbox + +This quick-start deployment guide can be used to set up an environment to familiarize yourself with the architecture and get an understanding of the concepts. + +**NOTE: This environment is not intended to be a long-lived environment. It is intended for temporary demonstration and learning purposes.** + +### Requirements + +In this guide you can choose to bring your own project (BYOP) or have Terraform create a new project for you. The requirements differ based on the option that you choose. + +#### Bring your own project (BYOP) + +- Project ID of a new Google Cloud Project, preferably with no APIs enabled +- `roles/owner` IAM permissions on the project +- GitHub Personal Access Token, steps to create the token are provided below + +#### Terraform managed project + +- Billing account ID +- Organization or folder ID +- `roles/billing.user` IAM permissions on the billing account specified +- `roles/resourcemanager.projectCreator` IAM permissions on the organization or folder specified +- GitHub Personal Access Token, steps to create the token are provided below + +### Pull the source code + +- Clone the repository and change directory to the guide directory + + ``` + git clone https://github.com/GoogleCloudPlatform/ai-on-gke && \ + cd ai-on-gke/best-practices/ml-platform + ``` + +- Set environment variables + + ``` + export MLP_BASE_DIR=$(pwd) && \ + echo "export MLP_BASE_DIR=${MLP_BASE_DIR}" >> ${HOME}/.bashrc + + cd examples/platform/sandbox && \ + export MLP_TYPE_BASE_DIR=$(pwd) && \ + echo 
"export MLP_TYPE_BASE_DIR=${MLP_TYPE_BASE_DIR}" >> ${HOME}/.bashrc + ``` + +### GitHub Configuration + +- Create a [Personal Access Token][personal-access-token] in [GitHub][github]: + + Note: It is recommended to use a [machine user account][machine-user-account] for this but you can use a personal user account just to try this reference architecture. + + **Fine-grained personal access token** + + - Go to https://github.com/settings/tokens and log in using your credentials + - Click "Generate new token" >> "Generate new token (Beta)". + - Enter a Token name. + - Select the expiration. + - Select the Resource owner. + - Select All repositories + - Set the following Permissions: + - Repository permissions + - Administration: Read and write + - Contents: Read and write + - Click "Generate token" + + **Personal access tokens (classic)** + + - Go to https://github.com/settings/tokens and log in using your credentials + - Click "Generate new token" >> "Generate new token (classic)". + - You will be directed to a screen to create the new token. Provide the note and expiration. + - Select the following two scopes: + - [x] repo - Full control of private repositories + - [x] delete_repo - Delete repositories + - Click "Generate token" + +- Store the token in a secure file. 
+ + ``` + # Create a secure directory + mkdir -p ${HOME}/secrets/ + chmod go-rwx ${HOME}/secrets + + # Create a secure file + touch ${HOME}/secrets/mlp-github-token + chmod go-rwx ${HOME}/secrets/mlp-github-token + + # Put the token in the secure file using your preferred editor + nano ${HOME}/secrets/mlp-github-token + ``` + +- Set the GitHub environment variables in Cloud Shell + + Replace the following values: + + - `` is the GitHub organization or user namespace to use for the repositories + - `` is the GitHub account to use for authentication + - `` is the email address to use for commit + + ``` + export MLP_GITHUB_ORG="" + export MLP_GITHUB_USER="" + export MLP_GITHUB_EMAIL="" + ``` + +- Set the configuration variables + + ``` + sed -i "s/YOUR_GITHUB_EMAIL/${MLP_GITHUB_EMAIL}/g" ${MLP_TYPE_BASE_DIR}/mlp.auto.tfvars + sed -i "s/YOUR_GITHUB_ORG/${MLP_GITHUB_ORG}/g" ${MLP_TYPE_BASE_DIR}/mlp.auto.tfvars + sed -i "s/YOUR_GITHUB_USER/${MLP_GITHUB_USER}/g" ${MLP_TYPE_BASE_DIR}/mlp.auto.tfvars + ``` + +### Project Configuration + +You only need to complete the section for the option that you have selected. 
+ +#### Bring your own project (BYOP) + +- Set the project environment variables in Cloud Shell + + Replace the following values + + - `` is the ID of your existing Google Cloud project + + ``` + export MLP_PROJECT_ID="" + export MLP_STATE_BUCKET="${MLP_PROJECT_ID}-tf-state" + ``` + +- Set the default `gcloud` project + + ``` + gcloud config set project ${MLP_PROJECT_ID} + ``` + +- Authorize `gcloud` + + ``` + gcloud auth login --activate --no-launch-browser --quiet --update-adc + ``` + +- Create a Cloud Storage bucket to store the Terraform state + + ``` + gcloud storage buckets create gs://${MLP_STATE_BUCKET} --project ${MLP_PROJECT_ID} + ``` + +- Set the configuration variables + + ``` + sed -i "s/YOUR_STATE_BUCKET/${MLP_STATE_BUCKET}/g" ${MLP_TYPE_BASE_DIR}/backend.tf + sed -i "s/YOUR_PROJECT_ID/${MLP_PROJECT_ID}/g" ${MLP_TYPE_BASE_DIR}/mlp.auto.tfvars + ``` + +#### Terraform managed project + +- Set the configuration variables + + ``` + nano ${MLP_BASE_DIR}/terraform/features/initialize/initialize.auto.tfvars + ``` + + ``` + project = { + billing_account_id = "XXXXXX-XXXXXX-XXXXXX" + folder_id = "############" + name = "mlp" + org_id = "############" + } + ``` + + > `project.billing_account_id` the billing account ID + > + > Enter either `project.folder_id` **OR** `project.org_id` + > `project.folder_id` the folder ID + > `project.org_id` the organization ID + +- Authorize `gcloud` + + ``` + gcloud auth login --activate --no-launch-browser --quiet --update-adc + ``` + +- Create a new project + + ``` + cd ${MLP_BASE_DIR}/terraform/features/initialize + terraform init && \ + terraform plan -input=false -out=tfplan && \ + terraform apply -input=false tfplan && \ + rm tfplan && \ + terraform init -force-copy -migrate-state && \ + rm -rf state + ``` + +### Run Terraform + +- Create the resources + + ``` + cd ${MLP_TYPE_BASE_DIR} && \ + terraform init && \ + terraform plan -input=false -var github_token="$(tr --delete '\n' < ${HOME}/secrets/mlp-github-token)" 
-out=tfplan && \ + terraform apply -input=false tfplan + rm tfplan + ``` + +### Review the resources + +#### GKE clusters and ConfigSync + +- Go to Google Cloud Console, click on the navigation menu and click on Kubernetes Engine > Clusters. You should see one cluster. + +- Go to Google Cloud Console, click on the navigation menu and click on Kubernetes Engine > Config. If you haven't enabled GKE Enterprise in the project earlier, click the `LEARN AND ENABLE` button and then `ENABLE GKE ENTERPRISE`. You should see a RootSync and RepoSync object. + ![configsync](/best-practices/ml-platform/docs/images/configsync.png) + +#### Software installed via RepoSync and RootSync + +Open Cloud Shell to execute the following commands: + +- Store your GKE cluster name in an environment variable: + + `export GKE_CLUSTER=` + +- Get cluster credentials: + + ``` + gcloud container fleet memberships get-credentials ${GKE_CLUSTER} + ``` + +- Fetch KubeRay operator CRDs + + ``` + kubectl get crd | grep ray + ``` + + The output will be similar to the following: + + ``` + rayclusters.ray.io 2024-02-12T21:19:06Z + rayjobs.ray.io 2024-02-12T21:19:09Z + rayservices.ray.io 2024-02-12T21:19:12Z + ``` + +- Fetch KubeRay operator pod + + ``` + kubectl get pods + ``` + + The output will be similar to the following: + + ``` + NAME READY STATUS RESTARTS AGE + kuberay-operator-56b8d98766-2nvht 1/1 Running 0 6m26s + ``` + +- Check that the `ml-team` namespace was created: + + ``` + kubectl get ns | grep ml-team + ``` + +- Check that the RepoSync object was created in the `ml-team` namespace: + ``` + kubectl get reposync -n ml-team + ``` +- Check the `raycluster` in the `ml-team` namespace + + ``` + kubectl get raycluster -n ml-team + ``` + + The output will be similar to the following: + + ``` + NAME DESIRED WORKERS AVAILABLE WORKERS STATUS AGE + ray-cluster-kuberay 1 1 ready 29m + ``` + +- Check the head and worker pods of kuberay in the `ml-team` namespace + ``` + kubectl get pods -n ml-team + ``` + The 
output will be similar to the following: 
+ ``` + NAME READY STATUS RESTARTS AGE + ray-cluster-kuberay-head-sp6dg 2/2 Running 0 3m21s + ray-cluster-kuberay-worker-workergroup-rzpjw 2/2 Running 0 3m21s + ``` + +### Cleanup + +- Destroy the resources + + ``` + cd ${MLP_TYPE_BASE_DIR} && \ + terraform init && \ + terraform destroy -auto-approve -var github_token="$(tr --delete '\n' < ${HOME}/secrets/mlp-github-token)" && \ + rm -rf .terraform .terraform.lock.hcl + ``` + +#### Project + +You only need to complete the section for the option that you have selected. + +##### Bring your own project (BYOP) + +- Delete the project + + ``` + gcloud projects delete ${MLP_PROJECT_ID} + ``` + +##### Terraform managed project + +- Destroy the project + + ``` + cd ${MLP_BASE_DIR}/terraform/features/initialize && \ + TERRAFORM_BUCKET_NAME=$(grep bucket backend.tf | awk -F"=" '{print $2}' | xargs) && \ + cp backend.tf.local backend.tf && \ + terraform init -force-copy -lock=false -migrate-state && \ + gsutil -m rm -rf gs://${TERRAFORM_BUCKET_NAME}/* && \ + terraform init && \ + terraform destroy -auto-approve && \ + rm -rf .terraform .terraform.lock.hcl + ``` + +[gitops]: https://about.gitlab.com/topics/gitops/ +[repo-sync]: https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields +[root-sync]: https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields +[config-sync]: https://cloud.google.com/anthos-config-management/docs/config-sync-overview +[cloud-deploy]: https://cloud.google.com/deploy?hl=en +[terraform]: https://www.terraform.io/ +[gke]: https://cloud.google.com/kubernetes-engine?hl=en +[git]: https://git-scm.com/ +[github]: https://github.com/ +[gcp-project]: https://cloud.google.com/resource-manager/docs/creating-managing-projects +[personal-access-token]: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens +[machine-user-account]: 
https://docs.github.com/en/get-started/learning-about-github/types-of-github-accounts diff --git a/best-practices/ml-platform/examples/platform/sandbox/backend.tf b/best-practices/ml-platform/examples/platform/sandbox/backend.tf new file mode 100644 index 000000000..959028bb0 --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/backend.tf @@ -0,0 +1,20 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +terraform { + backend "gcs" { + prefix = "terraform" + bucket = "YOUR_STATE_BUCKET" + } +} diff --git a/best-practices/ml-platform/examples/platform/sandbox/main.tf b/best-practices/ml-platform/examples/platform/sandbox/main.tf new file mode 100644 index 000000000..8da37d4b1 --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/main.tf @@ -0,0 +1,520 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +# +# Project +########################################################################## +data "google_project" "environment" { + project_id = var.environment_project_id +} + +resource "google_project_service" "containerfilesystem_googleapis_com" { + disable_dependent_services = false + disable_on_destroy = false + project = data.google_project.environment.project_id + service = "containerfilesystem.googleapis.com" +} + +resource "google_project_service" "serviceusage_googleapis_com" { + disable_dependent_services = false + disable_on_destroy = false + project = data.google_project.environment.project_id + service = "serviceusage.googleapis.com" +} + +resource "google_project_service" "project_services-cr" { + disable_dependent_services = false + disable_on_destroy = false + project = data.google_project.environment.project_id + service = "cloudresourcemanager.googleapis.com" +} + +resource "google_project_service" "project_services-an" { + disable_dependent_services = false + disable_on_destroy = false + project = data.google_project.environment.project_id + service = "anthos.googleapis.com" +} + +resource "google_project_service" "project_services-anc" { + disable_dependent_services = false + disable_on_destroy = false + project = data.google_project.environment.project_id + service = "anthosconfigmanagement.googleapis.com" +} + +resource "google_project_service" "project_services-con" { + disable_dependent_services = false + disable_on_destroy = false + project = data.google_project.environment.project_id + service = "container.googleapis.com" +} + +resource "google_project_service" "project_services-com" { + disable_dependent_services = false + disable_on_destroy = false + project = data.google_project.environment.project_id + service = "compute.googleapis.com" +} + +resource "google_project_service" "project_services-gkecon" { + disable_dependent_services = false + disable_on_destroy = false + project = data.google_project.environment.project_id + service = 
"gkeconnect.googleapis.com" +} + +resource "google_project_service" "project_services-gkeh" { + disable_dependent_services = false + disable_on_destroy = false + project = data.google_project.environment.project_id + service = "gkehub.googleapis.com" +} + +resource "google_project_service" "project_services-iam" { + disable_dependent_services = false + disable_on_destroy = false + project = data.google_project.environment.project_id + service = "iam.googleapis.com" +} + +resource "google_project_service" "project_services-gate" { + disable_dependent_services = false + disable_on_destroy = false + project = data.google_project.environment.project_id + service = "connectgateway.googleapis.com" +} + +# +# Networking +########################################################################## +module "create-vpc" { + source = "../../../terraform/modules/network" + + depends_on = [ + google_project_service.project_services-com + ] + + network_name = format("%s-%s", var.network_name, var.environment_name) + project_id = data.google_project.environment.project_id + routing_mode = var.routing_mode + subnet_01_ip = var.subnet_01_ip + subnet_01_name = format("%s-%s", var.subnet_01_name, var.environment_name) + subnet_01_region = var.subnet_01_region + subnet_02_ip = var.subnet_02_ip + subnet_02_name = format("%s-%s", var.subnet_02_name, var.environment_name) + subnet_02_region = var.subnet_02_region +} + +module "cloud-nat" { + source = "../../../terraform/modules/cloud-nat" + + create_router = true + name = format("%s-%s", "nat-for-acm", var.environment_name) + network = module.create-vpc.vpc + project_id = data.google_project.environment.project_id + region = split("/", module.create-vpc.subnet-1)[3] + router = format("%s-%s", "router-for-acm", var.environment_name) +} + +# +# GKE +########################################################################## +resource "google_gke_hub_feature" "configmanagement_acm_feature" { + depends_on = [ + 
google_project_service.project_services-gkeh, + google_project_service.project_services-anc, + google_project_service.project_services-an, + google_project_service.project_services-com, + google_project_service.project_services-gkecon + ] + + location = "global" + name = "configmanagement" + project = data.google_project.environment.project_id +} + +module "gke" { + source = "../../../terraform/modules/cluster" + + depends_on = [ + google_gke_hub_feature.configmanagement_acm_feature, + google_project_service.project_services-con, + google_project_service.project_services-com + ] + + cluster_name = format("%s-%s", var.cluster_name, var.environment_name) + env = var.environment_name + initial_node_count = 1 + machine_type = "n2-standard-8" + master_auth_networks_ipcidr = var.subnet_01_ip + network = module.create-vpc.vpc + project_id = data.google_project.environment.project_id + region = var.subnet_01_region + remove_default_node_pool = false + subnet = module.create-vpc.subnet-1 + zone = "${var.subnet_01_region}-a" +} + +module "reservation" { + source = "../../../terraform/modules/vm-reservations" + + cluster_name = module.gke.cluster_name + project_id = data.google_project.environment.project_id + zone = "${var.subnet_01_region}-a" +} + +module "node_pool-reserved" { + source = "../../../terraform/modules/node-pools" + + depends_on = [ + module.reservation + ] + + cluster_name = module.gke.cluster_name + node_pool_name = "reservation" + project_id = data.google_project.environment.project_id + region = var.subnet_01_region + reservation_name = module.reservation.reservation_name + resource_type = "reservation" + taints = var.reserved_taints +} + +module "node_pool-ondemand" { + source = "../../../terraform/modules/node-pools" + + depends_on = [ + module.gke + ] + + cluster_name = module.gke.cluster_name + node_pool_name = "ondemand" + project_id = data.google_project.environment.project_id + region = var.subnet_01_region + resource_type = "ondemand" + taints = 
var.ondemand_taints +} + +module "node_pool-spot" { + source = "../../../terraform/modules/node-pools" + + depends_on = [ + module.gke + ] + + cluster_name = module.gke.cluster_name + node_pool_name = "spot" + project_id = data.google_project.environment.project_id + region = var.subnet_01_region + resource_type = "spot" + taints = var.spot_taints +} + +resource "google_gke_hub_membership" "membership" { + depends_on = [ + google_gke_hub_feature.configmanagement_acm_feature, + google_project_service.project_services-gkeh, + google_project_service.project_services-gkecon + ] + + membership_id = module.gke.cluster_name + project = data.google_project.environment.project_id + + endpoint { + gke_cluster { + resource_link = "//container.googleapis.com/${module.gke.cluster_id}" + } + } +} + +resource "google_gke_hub_feature_membership" "feature_member" { + depends_on = [ + google_project_service.project_services-gkecon, + google_project_service.project_services-gkeh, + google_project_service.project_services-an, + google_project_service.project_services-anc + ] + + feature = "configmanagement" + location = "global" + membership = google_gke_hub_membership.membership.membership_id + project = data.google_project.environment.project_id + + configmanagement { + version = var.config_management_version + + config_sync { + source_format = "unstructured" + + git { + policy_dir = "manifests/clusters" + secret_type = "token" + sync_branch = github_branch.environment.branch + sync_repo = github_repository.acm_repo.http_clone_url + } + } + + policy_controller { + enabled = true + referential_rules_enabled = true + template_library_installed = true + + } + } +} + +# +# Git Repository +########################################################################## +# data "github_organization" "default" { +# name = var.github_org +# } + +resource "github_repository" "acm_repo" { + # depends_on = [ + # data.github_organization.default + # ] + + allow_merge_commit = true + 
allow_rebase_merge = true + allow_squash_merge = true + auto_init = true + delete_branch_on_merge = false + description = "Repo for Config Sync" + has_issues = false + has_projects = false + has_wiki = false + name = var.configsync_repo_name + visibility = "private" + vulnerability_alerts = true +} + +resource "github_branch" "environment" { + branch = var.environment_name + repository = github_repository.acm_repo.name +} + +resource "github_branch_default" "environment" { + branch = github_branch.environment.branch + repository = github_repository.acm_repo.name +} + +resource "github_branch_protection_v3" "environment" { + repository = github_repository.acm_repo.name + branch = github_branch.environment.branch + + required_pull_request_reviews { + require_code_owner_reviews = true + required_approving_review_count = 1 + } + + restrictions { + } +} + +# +# Scripts +########################################################################## +resource "null_resource" "create_cluster_yamls" { + depends_on = [ + google_gke_hub_feature_membership.feature_member + ] + + provisioner "local-exec" { + command = "${path.module}/scripts/create_cluster_yamls.sh ${var.github_org} ${github_repository.acm_repo.full_name} ${var.github_user} ${var.github_email} ${var.environment_name} ${module.gke.cluster_name}" + environment = { + GIT_TOKEN = var.github_token + } + } + + triggers = { + md5_files = md5(join("", [for f in fileset("${path.module}/templates/acm-template", "**") : md5("${path.module}/templates/acm-template/${f}")])) + md5_script = filemd5("${path.module}/scripts/create_cluster_yamls.sh") + } +} + +resource "null_resource" "create_git_cred_cms" { + depends_on = [ + google_gke_hub_feature_membership.feature_member, + module.gke, + module.node_pool-reserved, + module.node_pool-ondemand, + module.node_pool-spot, + module.cloud-nat + ] + + provisioner "local-exec" { + command = "${path.module}/scripts/create_git_cred.sh ${module.gke.cluster_name} 
${data.google_project.environment.project_id} ${var.github_user} config-management-system" + environment = { + GIT_TOKEN = var.github_token + } + } + + triggers = { + md5_credentials = md5(join("", [var.github_user, var.github_token])) + md5_script = filemd5("${path.module}/scripts/create_git_cred.sh") + } +} + +resource "null_resource" "install_kuberay_operator" { + depends_on = [ + google_gke_hub_feature_membership.feature_member, + null_resource.create_git_cred_cms + ] + + provisioner "local-exec" { + command = "${path.module}/scripts/install_kuberay_operator.sh ${github_repository.acm_repo.full_name} ${var.github_email} ${var.github_org} ${var.github_user}" + environment = { + GIT_TOKEN = var.github_token + } + } + + triggers = { + md5_files = md5(join("", [for f in fileset("${path.module}/templates/acm-template/templates/_cluster_template/kuberay", "**") : md5("${path.module}/templates/acm-template/templates/_cluster_template/kuberay/${f}")])) + md5_script = filemd5("${path.module}/scripts/install_kuberay_operator.sh") + } +} + +locals { + namespace_default_kubernetes_service_account = "default" +} + +resource "google_service_account" "namespace_default" { + account_id = "wi-${var.namespace}-${local.namespace_default_kubernetes_service_account}" + display_name = "${var.namespace}/${local.namespace_default_kubernetes_service_account} workload identity service account" + project = data.google_project.environment.project_id +} + +resource "google_service_account_iam_member" "namespace_default_iam_workload_identity_user" { + depends_on = [ + module.gke + ] + + member = "serviceAccount:${data.google_project.environment.project_id}.svc.id.goog[${var.namespace}/${local.namespace_default_kubernetes_service_account}]" + role = "roles/iam.workloadIdentityUser" + service_account_id = google_service_account.namespace_default.id +} + +resource "null_resource" "create_namespace" { + depends_on = [ + google_gke_hub_feature_membership.feature_member, + 
null_resource.install_kuberay_operator + ] + + provisioner "local-exec" { + command = "${path.module}/scripts/create_namespace.sh ${github_repository.acm_repo.full_name} ${var.github_email} ${var.github_org} ${var.github_user} ${var.namespace} ${var.environment_name}" + environment = { + GIT_TOKEN = var.github_token + } + } + + triggers = { + md5_files = md5(join("", [for f in fileset("${path.module}/templates/acm-template/templates/_cluster_template/team", "**") : md5("${path.module}/templates/acm-template/templates/_cluster_template/team/${f}")])) + md5_script = filemd5("${path.module}/scripts/create_namespace.sh") + } +} + +resource "null_resource" "create_git_cred_ns" { + depends_on = [ + google_gke_hub_feature_membership.feature_member, + null_resource.create_namespace + ] + + provisioner "local-exec" { + command = "${path.module}/scripts/create_git_cred.sh ${module.gke.cluster_name} ${module.gke.gke_project_id} ${var.github_user} ${var.namespace}" + environment = { + GIT_TOKEN = var.github_token + } + } + + triggers = { + md5_credentials = md5(join("", [var.github_user, var.github_token])) + md5_script = filemd5("${path.module}/scripts/create_git_cred.sh") + } +} + +locals { + ray_head_kubernetes_service_account = "ray-head" + ray_worker_kubernetes_service_account = "ray-worker" +} + +resource "google_service_account" "namespace_ray_head" { + account_id = "wi-${var.namespace}-${local.ray_head_kubernetes_service_account}" + display_name = "${var.namespace}/${local.ray_head_kubernetes_service_account} workload identity service account" + project = data.google_project.environment.project_id +} + +resource "google_service_account_iam_member" "namespace_ray_head_iam_workload_identity_user" { + depends_on = [ + module.gke + ] + + member = "serviceAccount:${data.google_project.environment.project_id}.svc.id.goog[${var.namespace}/${local.ray_head_kubernetes_service_account}]" + role = "roles/iam.workloadIdentityUser" + service_account_id = 
google_service_account.namespace_ray_head.id +} + +resource "google_service_account" "namespace_ray_worker" { + account_id = "wi-${var.namespace}-${local.ray_worker_kubernetes_service_account}" + display_name = "${var.namespace}/${local.ray_worker_kubernetes_service_account} workload identity service account" + project = data.google_project.environment.project_id +} + +resource "google_service_account_iam_member" "namespace_ray_worker_iam_workload_identity_user" { + depends_on = [ + module.gke + ] + + member = "serviceAccount:${data.google_project.environment.project_id}.svc.id.goog[${var.namespace}/${local.ray_worker_kubernetes_service_account}]" + role = "roles/iam.workloadIdentityUser" + service_account_id = google_service_account.namespace_ray_worker.id +} + +resource "null_resource" "install_ray_cluster" { + depends_on = [ + google_gke_hub_feature_membership.feature_member, + null_resource.create_git_cred_ns + ] + + provisioner "local-exec" { + command = "${path.module}/scripts/install_ray_cluster.sh ${github_repository.acm_repo.full_name} ${var.github_email} ${var.github_org} ${var.github_user} ${var.namespace} ${google_service_account.namespace_ray_head.email} ${local.ray_head_kubernetes_service_account} ${google_service_account.namespace_ray_worker.email} ${local.ray_worker_kubernetes_service_account}" + environment = { + GIT_TOKEN = var.github_token + } + } + + triggers = { + md5_files = md5(join("", [for f in fileset("${path.module}/templates/acm-template/templates/_namespace_template/app", "**") : md5("${path.module}/templates/acm-template/templates/_namespace_template/app/${f}")])) + md5_script = filemd5("${path.module}/scripts/install_ray_cluster.sh") + } +} + +resource "null_resource" "manage_ray_ns" { + depends_on = [ + google_gke_hub_feature_membership.feature_member, + null_resource.create_git_cred_ns, + null_resource.install_ray_cluster + ] + + provisioner "local-exec" { + command = "${path.module}/scripts/manage_ray_ns.sh 
${github_repository.acm_repo.full_name} ${var.github_email} ${var.github_org} ${var.github_user} ${var.namespace}" + environment = { + GIT_TOKEN = var.github_token + } + } + + triggers = { + md5_script = filemd5("${path.module}/scripts/manage_ray_ns.sh") + } +} diff --git a/best-practices/ml-platform/examples/platform/sandbox/mlp.auto.tfvars b/best-practices/ml-platform/examples/platform/sandbox/mlp.auto.tfvars new file mode 100644 index 000000000..a20c54ee0 --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/mlp.auto.tfvars @@ -0,0 +1,5 @@ +environment_name = "dev" +environment_project_id = "YOUR_PROJECT_ID" +github_email = "YOUR_GITHUB_EMAIL" +github_org = "YOUR_GITHUB_ORG" +github_user = "YOUR_GITHUB_USER" diff --git a/best-practices/ml-platform/examples/platform/sandbox/outputs.tf b/best-practices/ml-platform/examples/platform/sandbox/outputs.tf new file mode 100644 index 000000000..633bb7f1e --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/outputs.tf @@ -0,0 +1,13 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
diff --git a/best-practices/ml-platform/examples/platform/sandbox/scripts/create_cluster_yamls.sh b/best-practices/ml-platform/examples/platform/sandbox/scripts/create_cluster_yamls.sh new file mode 100755 index 000000000..36cece324 --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/scripts/create_cluster_yamls.sh @@ -0,0 +1,60 @@ +#!/bin/bash +# +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +github_org=${1} +acm_repo_name=${2} +github_user=${3} +github_email=${4} +cluster_env=${5} +cluster_name=${6} + +random=$( + echo $RANDOM | md5sum | head -c 20 + echo +) +log="$(pwd)/log" +flag=0 + +download_acm_repo_name="/tmp/$(echo ${acm_repo_name} | awk -F "/" '{print $2}')-${random}" +git config --global user.name ${github_user} +git config --global user.email ${github_email} +git clone https://${github_user}:${GIT_TOKEN}@github.com/${acm_repo_name} ${download_acm_repo_name} || exit 1 + +if [ ! -d "${download_acm_repo_name}/manifests" ] && [ ! -d "${download_acm_repo_name}/templates" ]; then + echo "copying files" + cp -r templates/acm-template/* ${download_acm_repo_name} + flag=1 +fi + +cd ${download_acm_repo_name}/manifests/clusters +if [ "${flag}" -eq 0 ]; then + echo "not copying files" +fi + +cp ../../templates/_cluster_template/cluster.yaml ./${cluster_name}-cluster.yaml +cp ../../templates/_cluster_template/selector.yaml ./${cluster_env}-selector.yaml + +find .
-type f -name ${cluster_name}-cluster.yaml -exec sed -i "s/CLUSTER_NAME/${cluster_name}/g" {} + +find . -type f -name ${cluster_name}-cluster.yaml -exec sed -i "s/ENV/${cluster_env}/g" {} + +find . -type f -name ${cluster_env}-selector.yaml -exec sed -i "s/ENV/${cluster_env}/g" {} + + +git add ../../. +git config --global user.name ${github_user} +git config --global user.email ${github_email} +git commit -m "Adding ${cluster_name} cluster to the ${cluster_env} environment." +git push origin + +rm -rf ${download_acm_repo_name} diff --git a/best-practices/ml-platform/examples/platform/sandbox/scripts/create_git_cred.sh b/best-practices/ml-platform/examples/platform/sandbox/scripts/create_git_cred.sh new file mode 100755 index 000000000..3c9711558 --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/scripts/create_git_cred.sh @@ -0,0 +1,33 @@ +#!/bin/bash +# +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +gke_cluster=${1} +project_id=${2} +git_user=${3} +namespace=${4} + +gcloud container fleet memberships get-credentials ${gke_cluster} --project ${project_id} + +echo "Waiting for namespace '${namespace}' to be created..." +while ! 
kubectl get ns ${namespace} >/dev/null 2>&1; do + sleep 2 +done + +if kubectl get secret git-creds -n ${namespace} >/dev/null 2>&1; then + kubectl create secret generic git-creds --namespace="${namespace}" --save-config --dry-run=client --from-literal=username="${git_user}" --from-literal=token="${GIT_TOKEN}" -o yaml | kubectl apply -f - +else + kubectl create secret generic git-creds --namespace="${namespace}" --save-config --from-literal=username="${git_user}" --from-literal=token="${GIT_TOKEN}" +fi diff --git a/best-practices/ml-platform/examples/platform/sandbox/scripts/create_namespace.sh b/best-practices/ml-platform/examples/platform/sandbox/scripts/create_namespace.sh new file mode 100755 index 000000000..938c62b4e --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/scripts/create_namespace.sh @@ -0,0 +1,64 @@ +#!/bin/bash +# +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +configsync_repo_name=${1} +github_email=${2} +github_org=${3} +github_user=${4} +namespace=${5} +cluster_env=${6} + +logfile=$(pwd)/log +random=$( + echo $RANDOM | md5sum | head -c 20 + echo +) +download_acm_repo_name="/tmp/$(echo ${configsync_repo_name} | awk -F "/" '{print $2}')-${random}" +git config --global user.name ${github_user} +git config --global user.email ${github_email} +git clone https://${github_user}:${GIT_TOKEN}@github.com/${configsync_repo_name} ${download_acm_repo_name} || exit 1 +cd ${download_acm_repo_name}/manifests/clusters + +if [ -d "${namespace}" ]; then + exit 0 +fi + +#TODO: This most likely needs to be fixed for multiple environments +chars_in_namespace=$(echo -n ${namespace} | wc -c) +chars_in_cluster_env=$(echo -n ${cluster_env} | wc -c) +chars_in_reposync_name=$(expr $chars_in_namespace + ${chars_in_cluster_env} + 1) +mkdir ${namespace} || exit 1 +cp -r ../../templates/_cluster_template/team/* ${namespace} +sed -i "s?NAMESPACE?$namespace?g" ${namespace}/* +sed -ni '/#END OF SINGLE ENV DECLARATION/q;p' ${namespace}/reposync.yaml +sed -i "s?ENV?$cluster_env?g" ${namespace}/reposync.yaml +sed -i "s?GIT_REPO?https://github.com/$configsync_repo_name?g" ${namespace}/reposync.yaml +sed -i "s??$chars_in_reposync_name?g" ${namespace}/reposync.yaml + +mkdir ../apps/${namespace} +touch ../apps/${namespace}/.gitkeep + +cat <<EOF >>kustomization.yaml +- ./${namespace} +EOF +cd .. +git add . +git config --global user.name ${github_user} +git config --global user.email ${github_email} +git commit -m "Adding manifests to create a new namespace."
+git push origin + +rm -rf ${download_acm_repo_name} diff --git a/best-practices/ml-platform/examples/platform/sandbox/scripts/install_kuberay_operator.sh b/best-practices/ml-platform/examples/platform/sandbox/scripts/install_kuberay_operator.sh new file mode 100755 index 000000000..11f0397aa --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/scripts/install_kuberay_operator.sh @@ -0,0 +1,51 @@ +#!/bin/bash +# +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +configsync_repo_name=${1} +github_email=${2} +github_org=${3} +github_user=${4} + +random=$( + echo $RANDOM | md5sum | head -c 20 + echo +) +download_acm_repo_name="/tmp/$(echo ${configsync_repo_name} | awk -F "/" '{print $2}')-${random}" +git config --global user.name ${github_user} +git config --global user.email ${github_email} +git clone https://${github_user}:${GIT_TOKEN}@github.com/${configsync_repo_name} ${download_acm_repo_name} || exit 1 +cd ${download_acm_repo_name}/manifests/clusters +if [ -f "kustomization.yaml" ]; then + exit 0 +fi + +yamlfiles=$(find . -type f -name "*.yaml") +cp ../../templates/_cluster_template/kustomization.yaml . +for yamlfile in $(echo ${yamlfiles}); do + cat <<EOF >>kustomization.yaml + +- ${yamlfile} +EOF +done + +cp -r ../../templates/_cluster_template/kuberay . +git add . +git config --global user.name ${github_user} +git config --global user.email ${github_email} +git commit -m "Adding manifests to install kuberay operator."
+git push origin + +rm -rf ${download_acm_repo_name} diff --git a/best-practices/ml-platform/examples/platform/sandbox/scripts/install_ray_cluster.sh b/best-practices/ml-platform/examples/platform/sandbox/scripts/install_ray_cluster.sh new file mode 100755 index 000000000..3e95bf170 --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/scripts/install_ray_cluster.sh @@ -0,0 +1,59 @@ +#!/bin/bash +# +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +configsync_repo_name=${1} +github_email=${2} +github_org=${3} +github_user=${4} +namespace=${5} +google_service_account_head=${6} +kubernetes_service_account_head=${7} +google_service_account_worker=${8} +kubernetes_service_account_worker=${9} + +random=$( + echo $RANDOM | md5sum | head -c 20 + echo +) +download_acm_repo_name="/tmp/$(echo ${configsync_repo_name} | awk -F "/" '{print $2}')-${random}" +git config --global user.name ${github_user} +git config --global user.email ${github_email} +git clone https://${github_user}:${GIT_TOKEN}@github.com/${configsync_repo_name} ${download_acm_repo_name} || exit 1 +cd ${download_acm_repo_name}/manifests/apps +if [ !
-d "${namespace}" ]; then + echo "${namespace} folder doesn't exist in the configsync repo" + exit 1 +fi + +if [ -f "${namespace}/kustomization.yaml" ]; then + echo "${namespace} is already set up" + exit 0 +fi + +cp -r ../../templates/_namespace_template/app/* ${namespace}/ +sed -i "s?NAMESPACE?${namespace}?g" ${namespace}/* +sed -i "s?GOOGLE_SERVICE_ACCOUNT_RAY_HEAD?$google_service_account_head?g" ${namespace}/* +sed -i "s?KUBERNETES_SERVICE_ACCOUNT_RAY_HEAD?$kubernetes_service_account_head?g" ${namespace}/* +sed -i "s?GOOGLE_SERVICE_ACCOUNT_RAY_WORKER?$google_service_account_worker?g" ${namespace}/* +sed -i "s?KUBERNETES_SERVICE_ACCOUNT_RAY_WORKER?$kubernetes_service_account_worker?g" ${namespace}/* + +git add . +git config --global user.name ${github_user} +git config --global user.email ${github_email} +git commit -m "Installing ray cluster in ${namespace} namespace." +git push origin + +rm -rf ${download_acm_repo_name} diff --git a/best-practices/ml-platform/examples/platform/sandbox/scripts/manage_ray_ns.sh b/best-practices/ml-platform/examples/platform/sandbox/scripts/manage_ray_ns.sh new file mode 100755 index 000000000..a1ca8b2cb --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/scripts/manage_ray_ns.sh @@ -0,0 +1,46 @@ +#!/bin/bash +# +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
+ +configsync_repo_name=${1} +github_email=${2} +github_org=${3} +github_user=${4} +namespace=${5} + +random=$( + echo $RANDOM | md5sum | head -c 20 + echo +) +download_acm_repo_name="/tmp/$(echo ${configsync_repo_name} | awk -F "/" '{print $2}')-${random}" +git config --global user.name ${github_user} +git config --global user.email ${github_email} +git clone https://${github_user}:${GIT_TOKEN}@github.com/${configsync_repo_name} ${download_acm_repo_name} || exit 1 +cd ${download_acm_repo_name}/manifests/clusters/kuberay +ns_exists=$(grep ${namespace} values.yaml | wc -l) +if [ "${ns_exists}" -ne 0 ]; then + echo "namespace already present in values.yaml" + exit 0 +fi + +sed -i "s/watchNamespace:/watchNamespace:\n - ${namespace}/g" values.yaml + +git add . +git config --global user.name ${github_user} +git config --global user.email ${github_email} +git commit -m "Installing ray cluster in ${namespace} namespace." +git push origin + +rm -rf ${download_acm_repo_name} diff --git a/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/manifests/apps/.gitkeep b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/manifests/apps/.gitkeep new file mode 100644 index 000000000..e69de29bb diff --git a/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/manifests/clusters/.gitkeep b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/manifests/clusters/.gitkeep new file mode 100644 index 000000000..e69de29bb diff --git a/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/cluster.yaml b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/cluster.yaml new file mode 100644 index 000000000..c27d6a578 --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/cluster.yaml @@ -0,0 +1,21 @@ +# Copyright 2024 Google LLC +# +#
Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +kind: Cluster +apiVersion: clusterregistry.k8s.io/v1alpha1 +metadata: + name: CLUSTER_NAME + labels: + environment: ENV + clusterName: CLUSTER_NAME \ No newline at end of file diff --git a/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/config-selector.yaml b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/config-selector.yaml new file mode 100644 index 000000000..3f22b4d64 --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/config-selector.yaml @@ -0,0 +1,23 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +kind: ClusterSelector +apiVersion: configmanagement.gke.io/v1 +metadata: + name: config +spec: + selector: + matchLabels: + clusterName: CLUSTER_NAME + diff --git a/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/kuberay/kustomization.yaml b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/kuberay/kustomization.yaml new file mode 100644 index 000000000..cc63d55d2 --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/kuberay/kustomization.yaml @@ -0,0 +1,30 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization +resources: +- rbac.yaml +patches: +- path: rayclusters.yaml +- path: rayservices.yaml +- path: rayjobs.yaml + +helmCharts: +- name: kuberay-operator + repo: https://ray-project.github.io/kuberay-helm/ + version: 1.0.0 + releaseName: kuberay-operator + includeCRDs: true + valuesFile: values.yaml diff --git a/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/kuberay/rayclusters.yaml b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/kuberay/rayclusters.yaml new file mode 100644 index 000000000..d552cc2d9 --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/kuberay/rayclusters.yaml @@ -0,0 +1,22 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: apiextensions.k8s.io/v1 +kind: CustomResourceDefinition +metadata: + name: rayclusters.ray.io + annotations: + controller-gen.kubebuilder.io/version: v0.6.0 +status: + $patch: delete \ No newline at end of file diff --git a/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/kuberay/rayjobs.yaml b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/kuberay/rayjobs.yaml new file mode 100644 index 000000000..c18a0c21b --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/kuberay/rayjobs.yaml @@ -0,0 +1,22 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: apiextensions.k8s.io/v1 +kind: CustomResourceDefinition +metadata: + name: rayjobs.ray.io + annotations: + controller-gen.kubebuilder.io/version: v0.6.0 +status: + $patch: delete \ No newline at end of file diff --git a/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/kuberay/rayservices.yaml b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/kuberay/rayservices.yaml new file mode 100644 index 000000000..4e10d8ab6 --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/kuberay/rayservices.yaml @@ -0,0 +1,22 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: apiextensions.k8s.io/v1 +kind: CustomResourceDefinition +metadata: + name: rayservices.ray.io + annotations: + controller-gen.kubebuilder.io/version: v0.6.0 +status: + $patch: delete \ No newline at end of file diff --git a/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/kuberay/rbac.yaml b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/kuberay/rbac.yaml new file mode 100644 index 000000000..a0a1a686d --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/kuberay/rbac.yaml @@ -0,0 +1,44 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: rbac.authorization.k8s.io/v1 +kind: Role +metadata: + name: kuberay-operator-role +rules: +- apiGroups: + - '*' + resources: + - '*' + verbs: + - get + - list + - watch + - create + - update + - patch + - delete +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: RoleBinding +metadata: + name: kuberay-operator-binding +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: Role + name: kuberay-operator-role +subjects: +- kind: ServiceAccount + name: kuberay-operator + namespace: default diff --git a/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/kuberay/values.yaml b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/kuberay/values.yaml new file mode 100644 index 000000000..626a6cb2a --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/kuberay/values.yaml @@ -0,0 +1,112 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +image: + repository: kuberay/operator + tag: v1.0.0 + pullPolicy: IfNotPresent + +nameOverride: "kuberay-operator" +fullnameOverride: "kuberay-operator" + +serviceAccount: + # Specifies whether a service account should be created + create: true + # The name of the service account to use. 
+ # If not set and create is true, a name is generated using the fullname template + name: "kuberay-operator" + +service: + type: ClusterIP + port: 8080 + +resources: + # We usually recommend not to specify default resources and to leave this as a conscious + # choice for the user. This also increases chances charts run on environments with little + # resources, such as Minikube. If you do want to specify resources, uncomment the following + # lines, adjust them as necessary, and remove the curly braces after 'resources:'. + limits: + cpu: 100m + # Anecdotally, managing 500 Ray pods requires roughly 500MB memory. + # Monitor memory usage and adjust as needed. + memory: 512Mi + # requests: + # cpu: 100m + # memory: 512Mi + +livenessProbe: + initialDelaySeconds: 10 + periodSeconds: 5 + failureThreshold: 5 + +readinessProbe: + initialDelaySeconds: 10 + periodSeconds: 5 + failureThreshold: 5 + +batchScheduler: + enabled: false + + # Set up `securityContext` to improve Pod security. + # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/pod-security.md for further guidance. +securityContext: {} + + + # If rbacEnable is set to false, no RBAC resources will be created, including the Role for leader election, the Role for Pods and Services, and so on. +rbacEnable: true + + # When crNamespacedRbacEnable is set to true, the KubeRay operator will create a Role for RayCluster preparation (e.g., Pods, Services) + # and a corresponding RoleBinding for each namespace listed in the "watchNamespace" parameter. Please note that even if crNamespacedRbacEnable + # is set to false, the Role and RoleBinding for leader election will still be created. + # + # Note: + # (1) This variable is only effective when rbacEnable and singleNamespaceInstall are both set to true. + # (2) In most cases, it should be set to true, unless you are using a Kubernetes cluster managed by GitOps tools such as ArgoCD.
+crNamespacedRbacEnable: true + + # When singleNamespaceInstall is true: + # - Install namespaced RBAC resources such as Role and RoleBinding instead of cluster-scoped ones like ClusterRole and ClusterRoleBinding so that + # the chart can be installed by users with permissions restricted to a single namespace. + # (Please note that this excludes the CRDs, which can only be installed at the cluster scope.) + # - If "watchNamespace" is not set, the KubeRay operator will, by default, only listen + # to resource events within its own namespace. +singleNamespaceInstall: true + +# The KubeRay operator will watch the custom resources in the namespaces listed in the "watchNamespace" parameter. +watchNamespace: + + +# Environment variables +env: +# If not set or set to true, kuberay auto injects an init container waiting for ray GCS. +# If false, you will need to inject your own init container to ensure ray GCS is up before the ray workers start. +# Warning: we highly recommend setting to true and let kuberay handle for you. +# - name: ENABLE_INIT_CONTAINER_INJECTION +# value: "true" +# If not set or set to "", kuberay will pick up the default k8s cluster domain `cluster.local` +# Otherwise, kuberay will use your custom domain +# - name: CLUSTER_DOMAIN +# value: "" +# If not set or set to false, when running on OpenShift with Ingress creation enabled, kuberay will create OpenShift route +# Otherwise, regardless of the type of cluster with Ingress creation enabled, kuberay will create Ingress +# - name: USE_INGRESS_ON_OPENSHIFT +# value: "true" +# Unconditionally requeue after the number of seconds specified in the +# environment variable RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV. If the +# environment variable is not set, requeue after the default value (300). +# - name: RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV +# value: 300 +# If not set or set to "true", KubeRay will clean up the Redis storage namespace when a GCS FT-enabled RayCluster is deleted. 
+# - name: ENABLE_GCS_FT_REDIS_CLEANUP +# value: "true" \ No newline at end of file diff --git a/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/kustomization.yaml b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/kustomization.yaml new file mode 100644 index 000000000..448f68961 --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/kustomization.yaml @@ -0,0 +1,19 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization +resources: +- "https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml" +- ./kuberay \ No newline at end of file diff --git a/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/selector.yaml b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/selector.yaml new file mode 100644 index 000000000..cfd6f6ede --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/selector.yaml @@ -0,0 +1,22 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +kind: ClusterSelector +apiVersion: configmanagement.gke.io/v1 +metadata: + name: ENV +spec: + selector: + matchLabels: + environment: ENV \ No newline at end of file diff --git a/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/team/kustomization.yaml b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/team/kustomization.yaml new file mode 100644 index 000000000..93f6f77e9 --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/team/kustomization.yaml @@ -0,0 +1,22 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +resources: +- namespace.yaml +- network-policy.yaml +- rbac.yaml +- reposync.yaml \ No newline at end of file diff --git a/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/team/namespace.yaml b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/team/namespace.yaml new file mode 100644 index 000000000..08474cb90 --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/team/namespace.yaml @@ -0,0 +1,20 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: Namespace +metadata: + name: NAMESPACE + labels: + app: NAMESPACE \ No newline at end of file diff --git a/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/team/network-policy.yaml b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/team/network-policy.yaml new file mode 100644 index 000000000..de02d2a5a --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/team/network-policy.yaml @@ -0,0 +1,32 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: networking.k8s.io/v1 +kind: NetworkPolicy +metadata: + name: deny-from-other-namespaces + namespace: NAMESPACE +spec: + podSelector: + # Apply policy to all pods in this namespace. + matchLabels: {} + ingress: + - from: + # Allow traffic between all pods in this namespace. + - podSelector: {} + # Example that allows traffic from another app's namespace. 
+ #- from: + # - namespaceSelector: + # matchLabels: + # app: another-app-namespace \ No newline at end of file diff --git a/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/team/rbac.yaml b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/team/rbac.yaml new file mode 100644 index 000000000..398a617bb --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/team/rbac.yaml @@ -0,0 +1,59 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: rbac.authorization.k8s.io/v1 +kind: Role +metadata: + name: NAMESPACE-FullAccess + namespace: NAMESPACE +rules: +- apiGroups: + - '*' + resources: + - '*' + verbs: + - get + - list + - watch + - create + - update + - patch + - delete +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: RoleBinding +metadata: + name: NAMESPACE-user-access + namespace: NAMESPACE +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: Role + name: NAMESPACE-FullAccess +subjects: +- kind: User + name: USERNAME1 +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: RoleBinding +metadata: + name: kuberay-sa-access + namespace: NAMESPACE +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: Role + name: NAMESPACE-FullAccess +subjects: +- kind: ServiceAccount + name: kuberay-operator + namespace: default \ No newline at end of file diff --git a/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/team/reposync.yaml b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/team/reposync.yaml new file mode 100644 index 000000000..4b50dcee5 --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_cluster_template/team/reposync.yaml @@ -0,0 +1,172 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +#ROOT_SOURCE/namespaces/NAMESPACE/repo-sync.yaml +apiVersion: configsync.gke.io/v1beta1 +kind: RepoSync +metadata: + name: ENV-NAMESPACE + namespace: NAMESPACE + annotations: + configmanagement.gke.io/cluster-selector: ENV +spec: + sourceType: git + # Since this is for a namespace repository, the format is unstructured + sourceFormat: unstructured + git: + repo: "GIT_REPO" + revision: "ENV" + #branch: NAMESPACE_BRANCH + dir: "manifests/apps/NAMESPACE" + auth: token + secretRef: + name: git-creds +--- +#ROOT_REPO/namespaces/NAMESPACE/sync-rolebinding.yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: RoleBinding +metadata: + name: ENV-rb-NAMESPACE + namespace: NAMESPACE + annotations: + configmanagement.gke.io/cluster-selector: ENV +subjects: +- kind: ServiceAccount + name: ns-reconciler-NAMESPACE-ENV-NAMESPACE- + namespace: config-management-system +roleRef: + kind: ClusterRole + name: cluster-admin + apiGroup: rbac.authorization.k8s.io +--- +#END OF SINGLE ENV DECLARATION +apiVersion: configsync.gke.io/v1beta1 +kind: RepoSync +metadata: + name: dev-NAMESPACE + namespace: NAMESPACE + annotations: + configmanagement.gke.io/cluster-selector: dev +spec: + sourceType: git + # Since this is for a namespace repository, the format is unstructured + sourceFormat: unstructured + git: + repo: "GIT_REPO" + revision: "dev" + #branch: NAMESPACE_BRANCH + dir: "manifests/apps/NAMESPACE" + auth: token + secretRef: + name: git-creds +--- +#ROOT_REPO/namespaces/NAMESPACE/sync-rolebinding.yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: RoleBinding +metadata: + name: dev-rb-NAMESPACE + namespace: NAMESPACE + annotations: + configmanagement.gke.io/cluster-selector: dev +subjects: +- kind: ServiceAccount + name: ns-reconciler-NAMESPACE-dev-NAMESPACE- + namespace: config-management-system +roleRef: + kind: ClusterRole + name: cluster-admin + apiGroup: rbac.authorization.k8s.io +--- +#ROOT_SOURCE/namespaces/NAMESPACE/repo-sync.yaml +apiVersion: configsync.gke.io/v1beta1 +kind: 
RepoSync +metadata: + name: staging-NAMESPACE + namespace: NAMESPACE + annotations: + configmanagement.gke.io/cluster-selector: staging +spec: + sourceType: git + # Since this is for a namespace repository, the format is unstructured + sourceFormat: unstructured + git: + repo: "GIT_REPO" + revision: "staging" + #branch: NAMESPACE_BRANCH + dir: "manifests/apps/NAMESPACE" + auth: token + secretRef: + name: git-creds +--- +#ROOT_REPO/namespaces/NAMESPACE/sync-rolebinding.yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: RoleBinding +metadata: + name: staging-rb-NAMESPACE + namespace: NAMESPACE + annotations: + configmanagement.gke.io/cluster-selector: staging +subjects: +- kind: ServiceAccount + name: ns-reconciler-NAMESPACE-staging-NAMESPACE- + namespace: config-management-system +roleRef: + kind: ClusterRole + name: cluster-admin + apiGroup: rbac.authorization.k8s.io +--- + +#ROOT_SOURCE/namespaces/NAMESPACE/repo-sync.yaml +apiVersion: configsync.gke.io/v1beta1 +kind: RepoSync +metadata: + name: prod-NAMESPACE + namespace: NAMESPACE + annotations: + configmanagement.gke.io/cluster-selector: prod +spec: + sourceType: git + # Since this is for a namespace repository, the format is unstructured + sourceFormat: unstructured + git: + repo: "GIT_REPO" + revision: "prod" + #branch: NAMESPACE_BRANCH + dir: "manifests/apps/NAMESPACE" + auth: token + secretRef: + name: git-creds +--- +#ROOT_REPO/namespaces/NAMESPACE/sync-rolebinding.yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: RoleBinding +metadata: + name: prod-rb-NAMESPACE + namespace: NAMESPACE + annotations: + configmanagement.gke.io/cluster-selector: prod +subjects: +- kind: ServiceAccount + name: ns-reconciler-NAMESPACE-prod-NAMESPACE- + namespace: config-management-system +roleRef: + kind: ClusterRole + name: cluster-admin + apiGroup: rbac.authorization.k8s.io +--- +#What should be the name of the reconciler's service account? 
+#If the RepoSync name is repo-sync, SERVICE_ACCOUNT_NAME is ns-reconciler-NAMESPACE. +# Otherwise, it is ns-reconciler-NAMESPACE-REPO_SYNC_NAME-REPO_SYNC_NAME_LENGTH. +#For example, if your RepoSync name is prod, then the SERVICE_ACCOUNT_NAME would be ns-reconciler-NAMESPACE-prod-4. The integer 4 is used as prod contains 4 characters. +# https://cloud.google.com/anthos-config-management/docs/how-to/multiple-repositories \ No newline at end of file diff --git a/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_namespace_template/app/fluentd_config.yaml b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_namespace_template/app/fluentd_config.yaml new file mode 100644 index 000000000..e85cf38d0 --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_namespace_template/app/fluentd_config.yaml @@ -0,0 +1,44 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: ConfigMap +metadata: + name: fluentbit-config +data: + fluent-bit.conf: | + [SERVICE] + Parsers_File parsers.conf + [INPUT] + Name tail + Path /tmp/ray/session_latest/logs/* + Tag ray + Path_Key filename + Refresh_Interval 5 + [FILTER] + Name parser + Match ray + Key_Name filename + Parser rayjob + Reserve_Data On + [OUTPUT] + Name stdout + Format json_lines + Match * + + parsers.conf: | + [PARSER] + Name rayjob + Format regex + Regex (?raysubmit_[^.]*) \ No newline at end of file diff --git a/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_namespace_template/app/kustomization.yaml b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_namespace_template/app/kustomization.yaml new file mode 100644 index 000000000..5d213bc4b --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_namespace_template/app/kustomization.yaml @@ -0,0 +1,29 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization +namespace: NAMESPACE + +resources: +- fluentd_config.yaml +- serviceaccount_ray_head.yaml +- serviceaccount_ray_worker.yaml + +helmCharts: +- name: ray-cluster + repo: https://ray-project.github.io/kuberay-helm/ + version: 1.0.0 + releaseName: ray-cluster + valuesFile: values.yaml diff --git a/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_namespace_template/app/serviceaccount_ray_head.yaml b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_namespace_template/app/serviceaccount_ray_head.yaml new file mode 100644 index 000000000..b88329a3c --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_namespace_template/app/serviceaccount_ray_head.yaml @@ -0,0 +1,21 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: ServiceAccount +metadata: + name: KUBERNETES_SERVICE_ACCOUNT_RAY_HEAD + namespace: NAMESPACE + annotations: + iam.gke.io/gcp-service-account: GOOGLE_SERVICE_ACCOUNT_RAY_HEAD diff --git a/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_namespace_template/app/serviceaccount_ray_worker.yaml b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_namespace_template/app/serviceaccount_ray_worker.yaml new file mode 100644 index 000000000..eefd56a56 --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_namespace_template/app/serviceaccount_ray_worker.yaml @@ -0,0 +1,21 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +apiVersion: v1 +kind: ServiceAccount +metadata: + name: KUBERNETES_SERVICE_ACCOUNT_RAY_WORKER + namespace: NAMESPACE + annotations: + iam.gke.io/gcp-service-account: GOOGLE_SERVICE_ACCOUNT_RAY_WORKER diff --git a/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_namespace_template/app/values.yaml b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_namespace_template/app/values.yaml new file mode 100644 index 000000000..e8cf46a4b --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/templates/acm-template/templates/_namespace_template/app/values.yaml @@ -0,0 +1,319 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +image: + # Replace this with your own image if needed. + repository: rayproject/ray + tag: 2.7.1-py310-gpu + pullPolicy: IfNotPresent + +nameOverride: "kuberay" +fullnameOverride: "" + +imagePullSecrets: [] +# - name: an-existing-secret + +head: + groupName: headgroup + rayVersion: 2.7.1 + enableInTreeAutoscaling: true + # If enableInTreeAutoscaling is true, the autoscaler sidecar will be added to the Ray head pod. + # Ray autoscaler integration is supported only for Ray versions >= 1.11.0 + # Ray autoscaler integration is Beta with KubeRay >= 0.3.0 and Ray >= 2.0.0. + # enableInTreeAutoscaling: true + # autoscalerOptions is an OPTIONAL field specifying configuration overrides for the Ray autoscaler. 
+ # The example configuration shown below represents the DEFAULT values. + autoscalerOptions: + # upscalingMode: Default + # idleTimeoutSeconds: 60 + # securityContext: {} + # env: [] + # envFrom: [] + # resources specifies optional resource request and limit overrides for the autoscaler container. + # For large Ray clusters, we recommend monitoring container resource usage to determine if overriding the defaults is required. + resources: + limits: + cpu: "500m" + memory: "512Mi" + requests: + cpu: "500m" + memory: "512Mi" + labels: + cloud.google.com/gke-ray-node-type: head + created-by: ray-on-gke + serviceAccountName: "KUBERNETES_SERVICE_ACCOUNT_RAY_HEAD" + rayStartParams: + dashboard-host: '0.0.0.0' + block: 'true' + num-cpus: '0' + # containerEnv specifies environment variables for the Ray container, + # Follows standard K8s container env schema. + image: + repository: rayproject/ray + tag: 2.7.1-py310 + pullPolicy: IfNotPresent + containerEnv: + # - name: EXAMPLE_ENV + # value: "1" + - name: RAY_memory_monitor_refresh_ms + value: "0" + envFrom: [] + # - secretRef: + # name: my-env-secret + # ports optionally allows specifying ports for the Ray container. + ports: [] + # resource requests and limits for the Ray head container. + # Modify as needed for your application. + # Note that the resources in this example are much too small for production; + # we don't recommend allocating less than 8G memory for a Ray pod in production. + # Ray pods should be sized to take up entire K8s nodes when possible. + # Always set CPU and memory limits for Ray pods. + # It is usually best to set requests equal to limits. + # See https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html#resources + # for further guidance. + resources: + limits: + cpu: "4" + # To avoid out-of-memory issues, never allocate less than 2G memory for the Ray head. 
+ memory: "10G" + ephemeral-storage: 20Gi + requests: + cpu: "4" + memory: "10G" + ephemeral-storage: 10Gi + annotations: {} + nodeSelector: + iam.gke.io/gke-metadata-server-enabled: "true" + tolerations: + - key: "reserved" + operator: "Exists" + effect: "NoSchedule" + affinity: {} + # Ray container security context. + securityContext: {} + volumes: + - name: ray-logs + emptyDir: {} + - name: fluentbit-config + configMap: + name: fluentbit-config + # Ray writes logs to /tmp/ray/session_latest/logs + volumeMounts: + - mountPath: /tmp/ray + name: ray-logs + # sidecarContainers specifies additional containers to attach to the Ray pod. + # Follows standard K8s container spec. + sidecarContainers: + - name: fluentbit + image: fluent/fluent-bit:1.9.6 + # These resource requests for Fluent Bit should be sufficient in production. + resources: + requests: + cpu: 100m + memory: 128Mi + ephemeral-storage: 2Gi + limits: + cpu: 100m + memory: 128Mi + ephemeral-storage: 4Gi + volumeMounts: + - mountPath: /tmp/ray + name: ray-logs + - mountPath: /fluent-bit/etc/ + name: fluentbit-config + +worker: + # If you want to disable the default workergroup + # uncomment the line below + # disabled: true + groupName: workergroup + minReplicas: 1 + maxReplicas: 3 + replicas: 1 + type: worker + labels: + cloud.google.com/gke-ray-node-type: worker + created-by: ray-on-gke + serviceAccountName: "KUBERNETES_SERVICE_ACCOUNT_RAY_WORKER" + rayStartParams: + block: 'true' + resources: '"{\"accelerator_type_l4\": 2}"' + initContainerImage: 'busybox:1.28' # Enable users to specify the image for init container. Users can pull the busybox image from their private repositories. + # Security context for the init container. + initContainerSecurityContext: {} + # containerEnv specifies environment variables for the Ray container, + # Follows standard K8s container env schema. 
+ containerEnv: [] + # - name: EXAMPLE_ENV + # value: "1" + envFrom: [] + # - secretRef: + # name: my-env-secret + # ports optionally allows specifying ports for the Ray container. + ports: [] + # resource requests and limits for the Ray worker container. + # Modify as needed for your application. + # Note that the resources in this example are much too small for production; + # we don't recommend allocating less than 8G memory for a Ray pod in production. + # Ray pods should be sized to take up entire K8s nodes when possible. + # Always set CPU and memory limits for Ray pods. + # It is usually best to set requests equal to limits. + # See https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html#resources + # for further guidance. + resources: + limits: + cpu: "22" + nvidia.com/gpu: "2" + memory: "90G" + ephemeral-storage: 20Gi + requests: + cpu: "22" + nvidia.com/gpu: "2" + memory: "90G" + ephemeral-storage: 10Gi + annotations: + key: value + nodeSelector: + iam.gke.io/gke-metadata-server-enabled: "true" + cloud.google.com/gke-accelerator: "nvidia-l4" + tolerations: + - key: "nvidia.com/gpu" + operator: "Exists" + effect: "NoSchedule" + - key: "reserved" + operator: "Exists" + effect: "NoSchedule" + affinity: {} + # Ray container security context. + securityContext: {} + volumes: + - name: ray-logs + emptyDir: {} + - name: fluentbit-config + configMap: + name: fluentbit-config + # Ray writes logs to /tmp/ray/session_latest/logs + volumeMounts: + - mountPath: /tmp/ray + name: ray-logs + # sidecarContainers specifies additional containers to attach to the Ray pod. + # Follows standard K8s container spec. + sidecarContainers: + - name: fluentbit + image: fluent/fluent-bit:1.9.6 + # These resource requests for Fluent Bit should be sufficient in production. 
+ resources: + requests: + cpu: 100m + memory: 128Mi + ephemeral-storage: 2Gi + limits: + cpu: 100m + memory: 128Mi + ephemeral-storage: 4Gi + volumeMounts: + - mountPath: /tmp/ray + name: ray-logs + - mountPath: /fluent-bit/etc/ + name: fluentbit-config + +# The map's key is used as the groupName. +# For example, key:smallGroup in the map below +# will be used as the groupName +additionalWorkerGroups: + smallGroup: + # Disabled by default + disabled: true + replicas: 1 + minReplicas: 1 + maxReplicas: 3 + type: worker + labels: {} + rayStartParams: + block: 'true' + initContainerImage: 'busybox:1.28' # Enable users to specify the image for init container. Users can pull the busybox image from their private repositories. + # Security context for the init container. + initContainerSecurityContext: {} + # containerEnv specifies environment variables for the Ray container, + # Follows standard K8s container env schema. + containerEnv: [] + # - name: EXAMPLE_ENV + # value: "1" + envFrom: [] + # - secretRef: + # name: my-env-secret + # ports optionally allows specifying ports for the Ray container. + ports: [] + # resource requests and limits for the Ray worker container. + # Modify as needed for your application. + # Note that the resources in this example are much too small for production; + # we don't recommend allocating less than 8G memory for a Ray pod in production. + # Ray pods should be sized to take up entire K8s nodes when possible. + # Always set CPU and memory limits for Ray pods. + # It is usually best to set requests equal to limits. + # See https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html#resources + # for further guidance. 
+ resources: + limits: + cpu: 1 + memory: "1G" + requests: + cpu: 1 + memory: "1G" + annotations: + key: value + nodeSelector: {} + tolerations: + - key: "nvidia.com/gpu" + operator: "Exists" + effect: "NoSchedule" + - key: "reserved" + operator: "Exists" + effect: "NoSchedule" + affinity: {} + # Ray container security context. + securityContext: {} + volumes: + - name: ray-logs + emptyDir: {} + - name: fluentbit-config + configMap: + name: fluentbit-config + # Ray writes logs to /tmp/ray/session_latest/logs + volumeMounts: + - mountPath: /tmp/ray + name: ray-logs + # sidecarContainers specifies additional containers to attach to the Ray pod. + # Follows standard K8s container spec. + sidecarContainers: + - name: fluentbit + image: fluent/fluent-bit:1.9.6 + # These resource requests for Fluent Bit should be sufficient in production. + resources: + requests: + cpu: 100m + memory: 128Mi + ephemeral-storage: 2Gi + limits: + cpu: 100m + memory: 128Mi + ephemeral-storage: 4Gi + volumeMounts: + - mountPath: /tmp/ray + name: ray-logs + - mountPath: /fluent-bit/etc/ + name: fluentbit-config + +service: + type: ClusterIP diff --git a/best-practices/ml-platform/examples/platform/sandbox/variables.tf b/best-practices/ml-platform/examples/platform/sandbox/variables.tf new file mode 100644 index 000000000..e19c2538e --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/variables.tf @@ -0,0 +1,182 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +variable "cluster_name" { + default = "gke-ml" + description = "Name of the GKE cluster" + type = string +} + +variable "config_management_version" { + default = "1.17.1" + description = "Version of Config Management to enable" + type = string +} + +variable "configsync_repo_name" { + default = "config-sync-repo" + description = "Name of the GitHub repo that will be synced to the cluster with Config sync." + type = string +} + +variable "environment_name" { + default = "dev" + description = "Name of the environment" + type = string +} + +variable "environment_project_id" { + description = "The GCP project where the resources will be created" + type = string +} + +variable "env" { + default = ["dev"] + description = "List of environments" + type = set(string) +} + +variable "github_email" { + description = "GitHub user email." + type = string +} + +variable "github_org" { + description = "GitHub org." + type = string +} + +variable "github_token" { + description = "GitHub token. It is a token with write permissions as it will create a repo in the GitHub org." + type = string +} + +variable "github_user" { + description = "GitHub user name." + type = string +} + +variable "namespace" { + default = "ml-team" + description = "Name of the namespace to demo." + type = string +} + +variable "network_name" { + default = "ml-vpc" + description = "VPC network where GKE cluster will be created" + type = string +} + +variable "ondemand_taints" { + default = [{ + key = "ondemand" + value = true + effect = "NO_SCHEDULE" + }] + description = "Taints to be applied to the on-demand node pool." + type = list(object({ + key = string + value = any + effect = string + })) +} + +variable "reserved_taints" { + default = [{ + key = "reserved" + value = true + effect = "NO_SCHEDULE" + }] + description = "Taints to be applied to the reserved node pool." 
+ type = list(object({ + key = string + value = any + effect = string + })) +} + +variable "routing_mode" { + default = "GLOBAL" + description = "VPC routing mode." + type = string +} + +variable "secret_for_rootsync" { + default = 1 + description = "Create git-cred in config-management-system namespace." + type = number +} + +variable "spot_taints" { + default = [{ + key = "spot" + value = true + effect = "NO_SCHEDULE" + }] + description = "Taints to be applied to the spot node pool." + type = list(object({ + key = string + value = any + effect = string + })) +} + +variable "subnet_01_description" { + default = "subnet 01" + description = "Description of the first subnet." + type = string +} + +variable "subnet_01_ip" { + default = "10.40.0.0/22" + description = "CIDR of the first subnet." + type = string +} + +variable "subnet_01_name" { + default = "ml-vpc-subnet-01" + description = "Name of the first subnet in the VPC network." + type = string +} + +variable "subnet_01_region" { + default = "us-central1" + description = "Region of the first subnet." + type = string +} + +variable "subnet_02_description" { + default = "subnet 02" + description = "Description of the second subnet." + type = string +} + +variable "subnet_02_ip" { + default = "10.12.0.0/22" + description = "CIDR of the second subnet." + type = string +} + +variable "subnet_02_name" { + default = "gke-vpc-subnet-02" + description = "Name of the second subnet in the VPC network." + type = string +} + +variable "subnet_02_region" { + default = "us-west2" + description = "Region of the second subnet." 
+ type = string +} diff --git a/best-practices/ml-platform/examples/platform/sandbox/versions.tf b/best-practices/ml-platform/examples/platform/sandbox/versions.tf new file mode 100644 index 000000000..1b4d21b72 --- /dev/null +++ b/best-practices/ml-platform/examples/platform/sandbox/versions.tf @@ -0,0 +1,39 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +terraform { + required_providers { + github = { + source = "integrations/github" + version = "6.0.1" + } + google = { + source = "hashicorp/google" + version = "5.19.0" + } + google-beta = { + source = "hashicorp/google-beta" + version = "5.19.0" + } + null = { + source = "hashicorp/null" + version = "3.2.2" + } + } +} + +provider "github" { + owner = var.github_org + token = var.github_token +} diff --git a/best-practices/ml-platform/examples/use-case/ray/dataprocessing/CONVERSION.md b/best-practices/ml-platform/examples/use-case/ray/dataprocessing/CONVERSION.md new file mode 100644 index 000000000..7e99345cd --- /dev/null +++ b/best-practices/ml-platform/examples/use-case/ray/dataprocessing/CONVERSION.md @@ -0,0 +1,31 @@ +# Steps to convert the code from a notebook to run with Ray on GKE + +1. Decorate the function that needs to run as a remote function on the Ray workers + + ``` + import ray + @ray.remote(num_cpus=1) + ``` + +1. 
Create the runtime environment with the libraries needed by the remote function + + ``` + runtime_env = {"pip": ["google-cloud-storage==2.16.0", "spacy==3.7.4", "jsonpickle==3.0.3"]} + ``` + +1. Initialize Ray against the Ray cluster you created and pass the runtime environment along + + ``` + ray.init("ray://"+RAY_CLUSTER_HOST, runtime_env=runtime_env) + ``` + +1. Retrieve the remote results using the ray.get() method + + ``` + results = ray.get([get_clean_df.remote(res[i]) for i in range(len(res))]) + ``` + +1. After the execution completes, shut down Ray + ``` + ray.shutdown() + ``` diff --git a/best-practices/ml-platform/examples/use-case/ray/dataprocessing/README.md b/best-practices/ml-platform/examples/use-case/ray/dataprocessing/README.md new file mode 100644 index 000000000..8e02fe3e7 --- /dev/null +++ b/best-practices/ml-platform/examples/use-case/ray/dataprocessing/README.md @@ -0,0 +1,133 @@ +# Distributed Data Processing with Ray on GKE + +## Dataset + +[This](https://www.kaggle.com/datasets/PromptCloudHQ/flipkart-products) is a pre-crawled public dataset, taken as a subset of a bigger dataset (more than 5.8 million products) that was created by extracting data from [Flipkart](https://www.flipkart.com/), a leading Indian eCommerce store. + +## Architecture + +![DataPreprocessing](/best-practices/ml-platform/docs/images/ray-dataprocessing-workflow.png) + +## Data processing steps + +The dataset has product information such as id, name, brand, description, image URLs, and product specifications. + +The preprocessing.py file does the following: + +- Read the CSV from Cloud Storage +- Clean up the product description text +- Extract image URLs, validate and download the images into Cloud Storage +- Clean up & extract attributes as key-value pairs + +## How to use this repo: + +1. 
Clone the repository and change directory to the guide directory + + ``` + git clone https://github.com/GoogleCloudPlatform/ai-on-gke && \ + cd ai-on-gke/best-practices/ml-platform/examples/use-case/ray/dataprocessing + ``` + +1. Set environment variables + + ``` + CLUSTER_NAME= + PROJECT_ID= + PROCESSING_BUCKET= + DOCKER_IMAGE_URL=us-docker.pkg.dev/${PROJECT_ID}/dataprocessing/dp:v0.0.1 + ``` + +1. Create a Cloud Storage bucket to store raw data + + ``` + gcloud storage buckets create gs://${PROCESSING_BUCKET} --project ${PROJECT_ID} + ``` + +1. Download the raw data csv file from above and store into the bucket created in the previous step. + The kaggle cli can be installed using the following [instructions](https://github.com/Kaggle/kaggle-api#installation) + To use the cli you must create an API token (Kaggle > User Profile > API > Create New Token), the downloaded file should be stored in HOME/.kaggle/kaggle.json. + Alternatively, it can be [downloaded](https://www.kaggle.com/datasets/atharvjairath/flipkart-ecommerce-dataset) from the kaggle website + + ``` + kaggle datasets download --unzip atharvjairath/flipkart-ecommerce-dataset && \ + gcloud storage cp flipkart_com-ecommerce_sample.csv \ + gs://${PROCESSING_BUCKET}/flipkart_raw_dataset/flipkart_com-ecommerce_sample.csv + ``` + +1. Provide respective GCS bucket access rights to GKE Kubernetes Service Accounts. + Ray head with access to read the raw source data in the storage bucket + Ray worker(s) with the access to write data to the storage bucket. + + ``` + gcloud projects add-iam-policy-binding ${PROJECT_ID} \ + --member "serviceAccount:wi-ml-team-ray-head@${PROJECT_ID}.iam.gserviceaccount.com" \ + --role roles/storage.objectViewer + + gcloud projects add-iam-policy-binding ${PROJECT_ID} \ + --member "serviceAccount:wi-ml-team-ray-worker@${PROJECT_ID}.iam.gserviceaccount.com" \ + --role roles/storage.objectAdmin + ``` + +1. 
Create Artifact Registry repository for your docker image + + ``` + gcloud artifacts repositories create dataprocessing \ + --repository-format=docker \ + --location=us \ + --project=${PROJECT_ID} \ + --async + ``` + +1. Enable the Cloud Build APIs + + ``` + gcloud services enable cloudbuild.googleapis.com --project ${PROJECT_ID} + ``` + +1. Build container image using Cloud Build and push the image to Artifact Registry + + ``` + cd src && \ + gcloud builds submit --tag ${DOCKER_IMAGE_URL} . && \ + cd .. + ``` + +1. Update respective variables in the Job submission manifest to reflect your configuration. + + - Image is the docker image that was built in the previous step + - Processing bucket is the location of the GCS bucket where the source data and results will be stored + - Ray Cluster Host - if used in this example, it should not need to be changed, but if your Ray cluster service is named differently or in a different namespace, update accordingly. + + ``` + sed -i "s|#IMAGE|${DOCKER_IMAGE_URL}|" job.yaml && \ + sed -i "s|#PROCESSING_BUCKET|${PROCESSING_BUCKET}|" job.yaml + ``` + +1. Get credentials for the GKE cluster + + ``` + gcloud container fleet memberships get-credentials ${CLUSTER_NAME} + ``` + +1. Create the Job in the “ml-team” namespace using kubectl command + + ``` + kubectl apply -f job.yaml + ``` + +1. Monitor the execution in Ray Dashboard + + - Jobs -> Running Job ID + - See the Tasks/actors overview for Running jobs + - See the Task Table for a detailed view of task and assigned node(s) + - Cluster -> Node List + - See the Ray actors running on the worker process + +1. Once the Job is completed, both the prepared dataset as a CSV and the images are stored in Google Cloud Storage. 
+ + ``` + gcloud storage ls gs://${PROCESSING_BUCKET}/flipkart_preprocessed_dataset/flipkart.csv + gcloud storage ls gs://${PROCESSING_BUCKET}/flipkart_images + ``` + +> For additional information about converting you code from a notebook to run as a Job on GKE see the [Conversion Guide](CONVERSION.md) diff --git a/best-practices/ml-platform/examples/use-case/ray/dataprocessing/job.yaml b/best-practices/ml-platform/examples/use-case/ray/dataprocessing/job.yaml new file mode 100644 index 000000000..cc44a972d --- /dev/null +++ b/best-practices/ml-platform/examples/use-case/ray/dataprocessing/job.yaml @@ -0,0 +1,21 @@ +apiVersion: batch/v1 +kind: Job +metadata: + name: job + namespace: ml-team +spec: + template: + metadata: + labels: + app: job + spec: + containers: + - name: job + image: #IMAGE + env: + - name: "PROCESSING_BUCKET" + value: #PROCESSING_BUCKET + - name: "RAY_CLUSTER_HOST" + value: "ray-cluster-kuberay-head-svc.ml-team:10001" + restartPolicy: Never + serviceAccountName: ray-worker diff --git a/best-practices/ml-platform/examples/use-case/ray/dataprocessing/src/Dockerfile b/best-practices/ml-platform/examples/use-case/ray/dataprocessing/src/Dockerfile new file mode 100644 index 000000000..98aed63de --- /dev/null +++ b/best-practices/ml-platform/examples/use-case/ray/dataprocessing/src/Dockerfile @@ -0,0 +1,14 @@ +FROM python:3.10-slim-bullseye as build-stage + +ENV PATH=/venv/bin:${PATH} +ENV PYTHONDONTWRITEBYTECODE=1 +ENV PYTHONUNBUFFERED=1 + +COPY requirements.txt /venv/requirements.txt +RUN pip install --no-cache-dir -r /venv/requirements.txt + +COPY preprocessing.py /app/preprocessing.py + +WORKDIR /app + +CMD python preprocessing.py diff --git a/best-practices/ml-platform/examples/use-case/ray/dataprocessing/src/preprocessing.py b/best-practices/ml-platform/examples/use-case/ray/dataprocessing/src/preprocessing.py new file mode 100644 index 000000000..7aef363d0 --- /dev/null +++ 
b/best-practices/ml-platform/examples/use-case/ray/dataprocessing/src/preprocessing.py @@ -0,0 +1,158 @@ +import os +import ray +import pandas as pd +from typing import List +import urllib.request, urllib.error +import time +from google.cloud import storage +import spacy +import jsonpickle +import re + +IMAGE_BUCKET = os.environ['PROCESSING_BUCKET'] +RAY_CLUSTER_HOST = os.environ['RAY_CLUSTER_HOST'] +GCS_IMAGE_FOLDER = 'flipkart_images' + +@ray.remote(num_cpus=1) +def get_clean_df(df): + + def extract_url(image_list: str) -> List[str]: + image_list = image_list.replace('[', '') + image_list = image_list.replace(']', '') + image_list = image_list.replace('"', '') + image_urls = image_list.split(',') + return image_urls + + def download_image(image_url, image_file_name, destination_blob_name): + storage_client = storage.Client() + image_found_flag = False + try: + urllib.request.urlretrieve(image_url, image_file_name) + bucket = storage_client.bucket(IMAGE_BUCKET) + blob = bucket.blob(destination_blob_name) + blob.upload_from_filename(image_file_name) + print( + f"File {image_file_name} uploaded to {destination_blob_name}." 
+ ) + image_found_flag = True + except urllib.error.HTTPError: + print("HTTPError exception") + except urllib.error.URLError: + print("URLError exception") + except Exception: + print("Unknown exception") + return image_found_flag + + def prep_product_desc(df): + # Cleaning the description text + spacy.cli.download("en_core_web_sm") + model = spacy.load("en_core_web_sm") + + def parse_nlp_description(description) -> str: + if not pd.isna(description): + doc = model(description.lower()) + lemmas = [] + for token in doc: + if token.lemma_ not in lemmas and not token.is_stop and token.is_alpha: + lemmas.append(token.lemma_) + return ' '.join(lemmas) + + df['description'] = df['description'].apply(parse_nlp_description) + return df + + # Extract product attributes as key-value pair + def parse_attributes(specification: str): + spec_match_one = re.compile("(.*?)\\[(.*)\\](.*)") + spec_match_two = re.compile("(.*?)=>\"(.*?)\"(.*?)=>\"(.*?)\"(.*)") + if pd.isna(specification): + return None + m = spec_match_one.match(specification) + out = {} + if m is not None and m.group(2) is not None: + phrase = '' + for c in m.group(2): + if c == '}': + m2 = spec_match_two.match(phrase) + if m2 and m2.group(2) is not None and m2.group(4) is not None: + out[m2.group(2)] = m2.group(4) + phrase = '' + else: + phrase += c + json_string = jsonpickle.encode(out) + return json_string + + def get_product_image(df): + products_with_no_image_count = 0 + products_with_no_image = [] + gcs_image_url = [] + image_found_flag = False + for id, image_list in zip(df['uniq_id'], df['image']): + + if pd.isnull(image_list): # No image url + # print("WARNING: No image url: product ", id) + products_with_no_image_count += 1 + products_with_no_image.append(id) + gcs_image_url.append(None) + continue + image_urls = extract_url(image_list) + for index in range(len(image_urls)): + image_url = image_urls[index] + image_file_name = '{}_{}.jpg'.format(id, index) + destination_blob_name = GCS_IMAGE_FOLDER + '/' + 
image_file_name + image_found_flag = download_image(image_url, image_file_name, destination_blob_name) + if image_found_flag: + gcs_image_url.append('gs://' + IMAGE_BUCKET + '/' + destination_blob_name) + break + if not image_found_flag: + # print("WARNING: No image: product ", id) + products_with_no_image_count += 1 + products_with_no_image.append(id) + gcs_image_url.append(None) + + # appending gcs image uri into dataframe + gcs_image_loc = pd.DataFrame(gcs_image_url, index=df.index) + gcs_image_loc.columns = ["image_uri"] + df_with_gcs_image_uri = pd.concat([df, gcs_image_loc], axis=1) + return df_with_gcs_image_uri + + df_with_gcs_image_uri = get_product_image(df) + df_with_desc = prep_product_desc(df_with_gcs_image_uri) + df_with_desc['attributes'] = df_with_desc['product_specifications'].apply(parse_attributes) + + return df_with_desc + + +def split_dataframe(df, chunk_size=199): + chunks = list() + num_chunks = len(df) // chunk_size + 1 + for i in range(num_chunks): + chunks.append(df[i * chunk_size:(i + 1) * chunk_size]) + return chunks + + +# This function invokes ray task +def run_remote(): + df = pd.read_csv('gs://'+IMAGE_BUCKET+'/flipkart_raw_dataset/flipkart_com-ecommerce_sample.csv') + df = df[['uniq_id','product_name','description','brand','image','product_specifications']] + runtime_env = {"pip": ["google-cloud-storage==2.16.0", "spacy==3.7.4", "jsonpickle==3.0.3"]} + ray.init("ray://"+RAY_CLUSTER_HOST, runtime_env=runtime_env) + print("STARTED") + start_time = time.time() + res = split_dataframe(df) + results = ray.get([get_clean_df.remote(res[i]) for i in range(len(res))]) + print("FINISHED IN ") + duration = time.time() - start_time + print(duration) + ray.shutdown() + result_df = pd.concat(results, axis=0, ignore_index=True) + result_df.to_csv('gs://'+IMAGE_BUCKET+'/flipkart_preprocessed_dataset/flipkart.csv', index=False) + return result_df + + +def main(): + clean_df = run_remote() + + +if __name__ == "__main__": + """ This is executed when 
run from the command line """ + main() diff --git a/best-practices/ml-platform/examples/use-case/ray/dataprocessing/src/requirements.txt b/best-practices/ml-platform/examples/use-case/ray/dataprocessing/src/requirements.txt new file mode 100644 index 000000000..f2abde391 --- /dev/null +++ b/best-practices/ml-platform/examples/use-case/ray/dataprocessing/src/requirements.txt @@ -0,0 +1,8 @@ +ray==2.7.1 +ray[client]==2.7.1 +spacy==3.7.4 +google-cloud-storage==2.16.0 +pandas==2.2.1 +gcsfs==2024.3.1 +fsspec==2024.3.1 +jsonpickle==3.0.3 diff --git a/best-practices/ml-platform/terraform/features/initialize/backend.tf b/best-practices/ml-platform/terraform/features/initialize/backend.tf new file mode 100644 index 000000000..8d5d67421 --- /dev/null +++ b/best-practices/ml-platform/terraform/features/initialize/backend.tf @@ -0,0 +1,19 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +terraform { + backend "local" { + path = "state/default.tfstate" + } +} diff --git a/best-practices/ml-platform/terraform/features/initialize/backend.tf.bucket b/best-practices/ml-platform/terraform/features/initialize/backend.tf.bucket new file mode 100644 index 000000000..991e86976 --- /dev/null +++ b/best-practices/ml-platform/terraform/features/initialize/backend.tf.bucket @@ -0,0 +1,20 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +terraform { + backend "gcs" { + prefix = "terraform/initialize" + bucket = "" + } +} diff --git a/best-practices/ml-platform/terraform/features/initialize/initialize.auto.tfvars b/best-practices/ml-platform/terraform/features/initialize/initialize.auto.tfvars new file mode 100644 index 000000000..8ef26e4b0 --- /dev/null +++ b/best-practices/ml-platform/terraform/features/initialize/initialize.auto.tfvars @@ -0,0 +1,7 @@ +environment_name = "dev" +project = { + billing_account_id = "" + folder_id = "" + name = "mlp" + org_id = "" +} diff --git a/best-practices/ml-platform/terraform/features/initialize/main.tf b/best-practices/ml-platform/terraform/features/initialize/main.tf new file mode 100644 index 000000000..bd9f5e55c --- /dev/null +++ b/best-practices/ml-platform/terraform/features/initialize/main.tf @@ -0,0 +1,131 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +locals { + backend_file = "../../../examples/platform/${var.platform_type}/backend.tf" + project_id_prefix = "${var.project.name}-${var.environment_name}" + project_id_suffix_length = 29 - length(local.project_id_prefix) + tfvars_file = "../../../examples/platform/${var.platform_type}/mlp.auto.tfvars" +} + +resource "random_string" "project_id_suffix" { + length = local.project_id_suffix_length + lower = true + numeric = true + special = false + upper = false +} + +resource "google_project" "environment" { + billing_account = var.project.billing_account_id + folder_id = var.project.folder_id == "" ? null : var.project.folder_id + name = local.project_id_prefix + org_id = var.project.org_id == "" ? null : var.project.org_id + project_id = "${local.project_id_prefix}-${random_string.project_id_suffix.result}" +} + + +resource "google_storage_bucket" "mlp" { + force_destroy = false + location = var.storage_bucket_location + name = "${google_project.environment.project_id}-mlp" + project = google_project.environment.project_id + uniform_bucket_level_access = true + + versioning { + enabled = true + } +} + +resource "null_resource" "write_environment_name" { + triggers = { + md5 = var.environment_name + tfvars_file = local.tfvars_file + } + + provisioner "local-exec" { + command = <=0.13, please open an issue. 
+ +## Usage + +```hcl +module "cloud-nat" { + source = "terraform-google-modules/cloud-nat/google" + version = "~> 1.2" + project_id = var.project_id + region = var.region + router = google_compute_router.router.name +} +``` + +Then perform the following commands on the root folder: + +- `terraform init` to get the plugins +- `terraform plan` to see the infrastructure plan +- `terraform apply` to apply the infrastructure build +- `terraform destroy` to destroy the built infrastructure + + +## Inputs + +| Name | Description | Type | Default | Required | +|------|-------------|------|---------|:--------:| +| create\_router | Create router instead of using an existing one, uses 'router' variable for new resource name. | `bool` | `false` | no | +| enable\_dynamic\_port\_allocation | Enable Dynamic Port Allocation. If minPorts is set, minPortsPerVm must be set to a power of two greater than or equal to 32. | `bool` | `false` | no | +| enable\_endpoint\_independent\_mapping | Specifies if endpoint independent mapping is enabled. | `bool` | `null` | no | +| icmp\_idle\_timeout\_sec | Timeout (in seconds) for ICMP connections. Defaults to 30s if not set. Changing this forces a new NAT to be created. | `string` | `"30"` | no | +| log\_config\_enable | Indicates whether or not to export logs | `bool` | `false` | no | +| log\_config\_filter | Specifies the desired filtering of logs on this NAT. Valid values are: "ERRORS\_ONLY", "TRANSLATIONS\_ONLY", "ALL" | `string` | `"ALL"` | no | +| max\_ports\_per\_vm | Maximum number of ports allocated to a VM from this NAT. This field can only be set when enableDynamicPortAllocation is enabled.This will be ignored if enable\_dynamic\_port\_allocation is set to false. | `string` | `null` | no | +| min\_ports\_per\_vm | Minimum number of ports allocated to a VM from this NAT config. Defaults to 64 if not set. Changing this forces a new NAT to be created. | `string` | `"64"` | no | +| name | Defaults to 'cloud-nat-RANDOM\_SUFFIX'. 
Changing this forces a new NAT to be created. | `string` | `""` | no | +| nat\_ips | List of self\_links of external IPs. Changing this forces a new NAT to be created. Value of `nat_ip_allocate_option` is inferred based on nat\_ips. If present set to MANUAL\_ONLY, otherwise AUTO\_ONLY. | `list(string)` | `[]` | no | +| network | VPC network name, only if router is not passed in and is created by the module. | `string` | `""` | no | +| project\_id | The project ID to deploy to | `string` | n/a | yes | +| region | The region to deploy to | `string` | n/a | yes | +| router | The name of the router in which this NAT will be configured. Changing this forces a new NAT to be created. | `string` | n/a | yes | +| router\_asn | Router ASN, only if router is not passed in and is created by the module. | `string` | `"64514"` | no | +| router\_keepalive\_interval | Router keepalive\_interval, only if router is not passed in and is created by the module. | `string` | `"20"` | no | +| source\_subnetwork\_ip\_ranges\_to\_nat | Defaults to ALL\_SUBNETWORKS\_ALL\_IP\_RANGES. How NAT should be configured per Subnetwork. Valid values include: ALL\_SUBNETWORKS\_ALL\_IP\_RANGES, ALL\_SUBNETWORKS\_ALL\_PRIMARY\_IP\_RANGES, LIST\_OF\_SUBNETWORKS. Changing this forces a new NAT to be created. | `string` | `"ALL_SUBNETWORKS_ALL_IP_RANGES"` | no | +| subnetworks | Specifies one or more subnetwork NAT configurations | `list(object({ name = string, source_ip_ranges_to_nat = list(string), secondary_ip_range_names = list(string) }))` | `[]` | no | +| tcp\_established\_idle\_timeout\_sec | Timeout (in seconds) for TCP established connections. Defaults to 1200s if not set. Changing this forces a new NAT to be created. | `string` | `"1200"` | no | +| tcp\_time\_wait\_timeout\_sec | Timeout (in seconds) for TCP connections that are in TIME\_WAIT state. Defaults to 120s if not set. | `string` | `"120"` | no | +| tcp\_transitory\_idle\_timeout\_sec | Timeout (in seconds) for TCP transitory connections. Defaults to 30s if not set. Changing this forces a new NAT to be created. | `string` | `"30"` | no | +| udp\_idle\_timeout\_sec | Timeout (in seconds) for UDP connections. Defaults to 30s if not set. Changing this forces a new NAT to be created. | `string` | `"30"` | no | + +## Outputs + +| Name | Description | +|------|-------------| +| name | Name of the Cloud NAT | +| nat\_ip\_allocate\_option | NAT IP allocation mode | +| region | Cloud NAT region | +| router\_name | Cloud NAT router name | + + + +## Requirements + +Before this module can be used on a project, you must ensure that the following prerequisites are fulfilled: + +1. Terraform and kubectl are [installed](#terraform-plugins) on the machine where Terraform is executed. +2. The Service Account you execute the module with has the right [permissions](#configure-a-service-account). +3. The APIs are [active](#enable-apis) on the project you will launch the cluster in. +4. If you are using a Shared VPC, the APIs must also be activated on the Shared VPC host project and your service account needs the proper permissions there. 
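
The first and third requirements above can be checked from a shell before running Terraform. The following is a minimal sketch rather than part of the module: the `check_tools` helper and the `PROJECT_ID` variable are assumptions introduced for illustration, and the API step is skipped unless `gcloud` is installed and `PROJECT_ID` is set.

```shell
# Hypothetical helper (not part of the module): print each command that is
# missing from PATH and return non-zero if any of them is missing.
check_tools() {
  local tool
  local missing=0
  for tool in "$@"; do
    if ! command -v "${tool}" >/dev/null 2>&1; then
      echo "missing: ${tool}"
      missing=1
    fi
  done
  return "${missing}"
}

# Requirement 1: Terraform and kubectl are installed.
check_tools terraform kubectl || echo "Install the tools listed above first."

# Requirement 3: the Compute Engine API is active on the project.
# PROJECT_ID is an assumed variable; this step is skipped when gcloud or
# PROJECT_ID is not available.
if command -v gcloud >/dev/null 2>&1 && [ -n "${PROJECT_ID:-}" ]; then
  gcloud services enable compute.googleapis.com --project "${PROJECT_ID}"
fi
```

Enabling an API that is already active should be harmless, so the last step can be re-run. The Service Account permissions in requirement 2 still need to be verified separately.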
+ +### Terraform plugins + +- [Terraform](https://www.terraform.io/downloads.html) >= 0.13.0 +- [terraform-provider-google](https://github.com/terraform-providers/terraform-provider-google) plugin v4.27.0 + +### Configure a Service Account + +In order to execute this module you must have a Service Account with the +following project roles: + +- [roles/compute.networkAdmin](https://cloud.google.com/nat/docs/using-nat#iam_permissions) + +### Enable APIs + +In order to operate with the Service Account you must activate the following APIs on the project where the Service Account was created: + +- Compute Engine API - compute.googleapis.com + +## Contributing + +Refer to the [contribution guidelines](./CONTRIBUTING.md) for information on contributing to this module. diff --git a/best-practices/ml-platform/terraform/modules/cloud-nat/main.tf b/best-practices/ml-platform/terraform/modules/cloud-nat/main.tf new file mode 100644 index 000000000..a85277dce --- /dev/null +++ b/best-practices/ml-platform/terraform/modules/cloud-nat/main.tf @@ -0,0 +1,80 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +resource "random_string" "name_suffix" { + length = 6 + special = false + upper = false +} + +locals { + default_name = "cloud-nat-${random_string.name_suffix.result}" + name = var.name != "" ? var.name : local.default_name + nat_ip_allocate_option = length(var.nat_ips) > 0 ? "MANUAL_ONLY" : "AUTO_ONLY" + router = var.create_router ? 
google_compute_router.router[0].name : var.router +} + +resource "google_compute_router" "router" { + count = var.create_router ? 1 : 0 + + name = var.router + network = var.network + project = var.project_id + region = var.region + + bgp { + asn = var.router_asn + keepalive_interval = var.router_keepalive_interval + } +} + +resource "google_compute_router_nat" "main" { + enable_dynamic_port_allocation = var.enable_dynamic_port_allocation + enable_endpoint_independent_mapping = var.enable_endpoint_independent_mapping + icmp_idle_timeout_sec = var.icmp_idle_timeout_sec + max_ports_per_vm = var.enable_dynamic_port_allocation ? var.max_ports_per_vm : null + min_ports_per_vm = var.min_ports_per_vm + name = local.name + nat_ip_allocate_option = local.nat_ip_allocate_option + nat_ips = var.nat_ips + project = var.project_id + region = var.region + router = local.router + source_subnetwork_ip_ranges_to_nat = var.source_subnetwork_ip_ranges_to_nat + tcp_established_idle_timeout_sec = var.tcp_established_idle_timeout_sec + tcp_time_wait_timeout_sec = var.tcp_time_wait_timeout_sec + tcp_transitory_idle_timeout_sec = var.tcp_transitory_idle_timeout_sec + udp_idle_timeout_sec = var.udp_idle_timeout_sec + + dynamic "log_config" { + for_each = var.log_config_enable == true ? [{ + enable = var.log_config_enable + filter = var.log_config_filter + }] : [] + + content { + enable = log_config.value.enable + filter = log_config.value.filter + } + } + + dynamic "subnetwork" { + for_each = var.subnetworks + content { + name = subnetwork.value.name + source_ip_ranges_to_nat = subnetwork.value.source_ip_ranges_to_nat + secondary_ip_range_names = contains(subnetwork.value.source_ip_ranges_to_nat, "LIST_OF_SECONDARY_IP_RANGES") ? 
subnetwork.value.secondary_ip_range_names : [] + } + } +} diff --git a/best-practices/ml-platform/terraform/modules/cloud-nat/outputs.tf b/best-practices/ml-platform/terraform/modules/cloud-nat/outputs.tf new file mode 100644 index 000000000..acd7f8ce6 --- /dev/null +++ b/best-practices/ml-platform/terraform/modules/cloud-nat/outputs.tf @@ -0,0 +1,33 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +output "name" { + description = "Name of the Cloud NAT" + value = local.name +} + +output "nat_ip_allocate_option" { + description = "NAT IP allocation mode" + value = local.nat_ip_allocate_option +} + +output "region" { + description = "Cloud NAT region" + value = google_compute_router_nat.main.region +} + +output "router_name" { + description = "Cloud NAT router name" + value = local.router +} diff --git a/best-practices/ml-platform/terraform/modules/cloud-nat/variables.tf b/best-practices/ml-platform/terraform/modules/cloud-nat/variables.tf new file mode 100644 index 000000000..a329cfbfb --- /dev/null +++ b/best-practices/ml-platform/terraform/modules/cloud-nat/variables.tf @@ -0,0 +1,146 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +variable "create_router" { + default = false + description = "Create router instead of using an existing one, uses 'router' variable for new resource name." + type = bool +} + +variable "enable_dynamic_port_allocation" { + default = false + description = "Enable Dynamic Port Allocation. If minPorts is set, minPortsPerVm must be set to a power of two greater than or equal to 32." + type = bool +} + +variable "enable_endpoint_independent_mapping" { + default = null + description = "Specifies if endpoint independent mapping is enabled." + type = bool +} + +variable "icmp_idle_timeout_sec" { + default = "30" + description = "Timeout (in seconds) for ICMP connections. Defaults to 30s if not set. Changing this forces a new NAT to be created." + type = string +} + +variable "log_config_enable" { + default = false + description = "Indicates whether or not to export logs" + type = bool +} + +variable "log_config_filter" { + default = "ALL" + description = "Specifies the desired filtering of logs on this NAT. Valid values are: \"ERRORS_ONLY\", \"TRANSLATIONS_ONLY\", \"ALL\"" + type = string +} + +variable "max_ports_per_vm" { + default = null + description = "Maximum number of ports allocated to a VM from this NAT. This field can only be set when enableDynamicPortAllocation is enabled.This will be ignored if enable_dynamic_port_allocation is set to false." + type = string +} + +variable "min_ports_per_vm" { + default = "64" + description = "Minimum number of ports allocated to a VM from this NAT config. Defaults to 64 if not set. 
Changing this forces a new NAT to be created." + type = string +} + +variable "name" { + default = "" + description = "Defaults to 'cloud-nat-RANDOM_SUFFIX'. Changing this forces a new NAT to be created." + type = string +} + +variable "nat_ips" { + default = [] + description = "List of self_links of external IPs. Changing this forces a new NAT to be created. Value of `nat_ip_allocate_option` is inferred based on nat_ips. If present set to MANUAL_ONLY, otherwise AUTO_ONLY." + type = list(string) +} + +variable "network" { + default = "" + description = "VPC network name, only if router is not passed in and is created by the module." + type = string +} + +variable "project_id" { + description = "The project ID to deploy to" + type = string +} + +variable "region" { + description = "The region to deploy to" + type = string +} + +variable "router" { + description = "The name of the router in which this NAT will be configured. Changing this forces a new NAT to be created." + type = string +} + +variable "router_asn" { + default = "64514" + description = "Router ASN, only if router is not passed in and is created by the module." + type = string +} + +variable "router_keepalive_interval" { + default = "20" + description = "Router keepalive_interval, only if router is not passed in and is created by the module." + type = string +} + +variable "source_subnetwork_ip_ranges_to_nat" { + default = "ALL_SUBNETWORKS_ALL_IP_RANGES" + description = "Defaults to ALL_SUBNETWORKS_ALL_IP_RANGES. How NAT should be configured per Subnetwork. Valid values include: ALL_SUBNETWORKS_ALL_IP_RANGES, ALL_SUBNETWORKS_ALL_PRIMARY_IP_RANGES, LIST_OF_SUBNETWORKS. Changing this forces a new NAT to be created." 
+ type = string +} + +variable "subnetworks" { + default = [] + description = "Specifies one or more subnetwork NAT configurations" + type = list(object({ + name = string, + secondary_ip_range_names = list(string) + source_ip_ranges_to_nat = list(string) + })) +} + +variable "tcp_established_idle_timeout_sec" { + default = "1200" + description = "Timeout (in seconds) for TCP established connections. Defaults to 1200s if not set. Changing this forces a new NAT to be created." + type = string +} + +variable "tcp_time_wait_timeout_sec" { + default = "120" + description = "Timeout (in seconds) for TCP connections that are in TIME_WAIT state. Defaults to 120s if not set." + type = string +} + +variable "tcp_transitory_idle_timeout_sec" { + default = "30" + description = "Timeout (in seconds) for TCP transitory connections. Defaults to 30s if not set. Changing this forces a new NAT to be created." + type = string +} + +variable "udp_idle_timeout_sec" { + default = "30" + description = "Timeout (in seconds) for UDP connections. Defaults to 30s if not set. Changing this forces a new NAT to be created." + type = string +} diff --git a/best-practices/ml-platform/terraform/modules/cloud-nat/versions.tf b/best-practices/ml-platform/terraform/modules/cloud-nat/versions.tf new file mode 100644 index 000000000..b19ea5231 --- /dev/null +++ b/best-practices/ml-platform/terraform/modules/cloud-nat/versions.tf @@ -0,0 +1,34 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +terraform { + required_providers { + github = { + source = "integrations/github" + version = "6.0.1" + } + google = { + source = "hashicorp/google" + version = "5.19.0" + } + google-beta = { + source = "hashicorp/google-beta" + version = "5.19.0" + } + random = { + source = "hashicorp/random" + version = "3.6.0" + } + } +} diff --git a/best-practices/ml-platform/terraform/modules/cluster/gke.tf b/best-practices/ml-platform/terraform/modules/cluster/gke.tf new file mode 100644 index 000000000..9c51f4d64 --- /dev/null +++ b/best-practices/ml-platform/terraform/modules/cluster/gke.tf @@ -0,0 +1,197 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +data "google_client_config" "default" {} + +data "google_project" "project" { + project_id = var.project_id +} + +resource "google_container_cluster" "mlp" { + provider = google-beta + + deletion_protection = false + enable_shielded_nodes = true + initial_node_count = var.initial_node_count + location = var.region + name = var.cluster_name + network = var.network + node_locations = ["${var.region}-a", "${var.region}-b", "${var.region}-c"] + project = var.project_id + remove_default_node_pool = var.remove_default_node_pool + subnetwork = var.subnet + + addons_config { + gcp_filestore_csi_driver_config { + enabled = true + } + + gcs_fuse_csi_driver_config { + enabled = true + } + + gce_persistent_disk_csi_driver_config { + enabled = true + } + } + + cluster_autoscaling { + autoscaling_profile = "OPTIMIZE_UTILIZATION" + enabled = true + + auto_provisioning_defaults { + oauth_scopes = [ + "https://www.googleapis.com/auth/cloud-platform" + ] + + management { + auto_repair = true + auto_upgrade = true + } + + shielded_instance_config { + enable_integrity_monitoring = true + enable_secure_boot = true + } + + upgrade_settings { + max_surge = 0 + max_unavailable = 1 + strategy = "SURGE" + } + } + + resource_limits { + resource_type = "cpu" + minimum = 4 + maximum = 600 + } + + resource_limits { + resource_type = "memory" + minimum = 16 + maximum = 2400 + } + + resource_limits { + resource_type = "nvidia-a100-80gb" + maximum = 30 + } + + resource_limits { + resource_type = "nvidia-l4" + maximum = 30 + } + + resource_limits { + resource_type = "nvidia-tesla-t4" + maximum = 300 + } + + resource_limits { + resource_type = "nvidia-tesla-a100" + maximum = 50 + } + + resource_limits { + resource_type = "nvidia-tesla-k80" + maximum = 30 + } + + resource_limits { + resource_type = "nvidia-tesla-p4" + maximum = 30 + } + + resource_limits { + resource_type = "nvidia-tesla-p100" + maximum = 30 + } + + resource_limits { + resource_type = "nvidia-tesla-v100" + maximum = 30 + } + } + + 
logging_config { + enable_components = [ + "APISERVER", + "CONTROLLER_MANAGER", + "SCHEDULER", + "SYSTEM_COMPONENTS", + "WORKLOADS" + ] + } + + ip_allocation_policy { + } + + master_authorized_networks_config { + cidr_blocks { + cidr_block = var.master_auth_networks_ipcidr + display_name = "vpc-cidr" + } + } + + monitoring_config { + enable_components = [ + "APISERVER", + "CONTROLLER_MANAGER", + "DAEMONSET", + "DEPLOYMENT", + "HPA", + "POD", + "SCHEDULER", + "STATEFULSET", + "STORAGE", + "SYSTEM_COMPONENTS" + ] + + managed_prometheus { + enabled = true + } + } + + node_config { + machine_type = var.machine_type + + shielded_instance_config { + enable_integrity_monitoring = true + enable_secure_boot = true + } + } + + node_pool_defaults { + node_config_defaults { + gcfs_config { + enabled = true + } + } + } + + private_cluster_config { + enable_private_nodes = true + enable_private_endpoint = true + master_ipv4_cidr_block = "172.16.0.32/28" + } + + release_channel { + channel = "STABLE" + } + + workload_identity_config { + workload_pool = "${var.project_id}.svc.id.goog" + } +} diff --git a/best-practices/ml-platform/terraform/modules/cluster/outputs.tf b/best-practices/ml-platform/terraform/modules/cluster/outputs.tf new file mode 100644 index 000000000..9c813071a --- /dev/null +++ b/best-practices/ml-platform/terraform/modules/cluster/outputs.tf @@ -0,0 +1,33 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +output "cluster_id" { + value = google_container_cluster.mlp.id +} + +output "cluster_location" { + value = google_container_cluster.mlp.location +} + +output "cluster_name" { + value = google_container_cluster.mlp.name +} + +output "env" { + value = var.env +} + +output "gke_project_id" { + value = var.project_id +} diff --git a/best-practices/ml-platform/terraform/modules/cluster/variables.tf b/best-practices/ml-platform/terraform/modules/cluster/variables.tf new file mode 100644 index 000000000..e2f8c0daa --- /dev/null +++ b/best-practices/ml-platform/terraform/modules/cluster/variables.tf @@ -0,0 +1,75 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +variable "cluster_name" { + default = "" + description = "GKE cluster name" + type = string +} + +variable "env" { + description = "environment" + type = string +} + +variable "initial_node_count" { + default = 1 + description = "The number of nodes to create in this cluster's default node pool. In regional or multi-zonal clusters, this is the number of nodes per zone. Must be set if node_pool is not set. If you're using google_container_node_pool objects with no default node pool, you'll need to set this to a value of at least 1, alongside setting remove_default_node_pool to true." + type = number +} + +variable "machine_type" { + default = "e2-medium" + description = "The name of a Google Compute Engine machine type." 
+ type = string +} + +variable "master_auth_networks_ipcidr" { + description = "master authorized network" + type = string +} + +variable "network" { + description = "VPC network where the cluster will be created" + type = string +} + +variable "project_id" { + default = "" + description = "The GCP project where the resources will be created" + type = string +} + +variable "region" { + default = "us-central1" + description = "The GCP region where the GKE cluster will be created" + type = string +} + +variable "remove_default_node_pool" { + default = true + description = "If true, deletes the default node pool upon cluster creation. If you're using google_container_node_pool resources with no default node pool, this should be set to true, alongside setting initial_node_count to at least 1." + type = bool +} + +variable "subnet" { + description = "subnetwork where the cluster will be created" + type = string +} + +variable "zone" { + default = "us-central1-a" + description = "The GCP zone where the reservation will be created" + type = string +} diff --git a/best-practices/ml-platform/terraform/modules/cluster/versions.tf b/best-practices/ml-platform/terraform/modules/cluster/versions.tf new file mode 100644 index 000000000..b19f861ad --- /dev/null +++ b/best-practices/ml-platform/terraform/modules/cluster/versions.tf @@ -0,0 +1,26 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +terraform { + required_providers { + google = { + source = "hashicorp/google" + version = "5.19.0" + } + google-beta = { + source = "hashicorp/google-beta" + version = "5.19.0" + } + } +} diff --git a/best-practices/ml-platform/terraform/modules/network/README.md b/best-practices/ml-platform/terraform/modules/network/README.md new file mode 100644 index 000000000..6de9bdc13 --- /dev/null +++ b/best-practices/ml-platform/terraform/modules/network/README.md @@ -0,0 +1,110 @@ +Copyright 2024 Google LLC + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +## Requirements + +| Name | Version | +|------|---------| +| [google](#requirement\_google) | >= 4.28.0 | + +## Modules + +| Name | Source | Version | +|------|--------|---------| +| [vpc](#module\_vpc) | terraform-google-modules/network/google | 5.2.0 | + +## Inputs + +| Name | Description | Type | Default | Required | +|------|-------------|------|---------|:--------:| +| [network\_name](#input\_network\_name) | Name of the VPC network. | `string` | n/a | yes | +| [project\_id](#input\_project\_id) | Id of the GCP project where VPC is to be created. | `string` | n/a | yes | +| [routing\_mode](#input\_routing\_mode) | The network routing mode. | `string` | n/a | yes | +| [subnet\_01\_description](#input\_subnet\_01\_description) | Subnet description. | `string` | n/a | yes | +| [subnet\_01\_ip](#input\_subnet\_01\_ip) | IP range of first subnet. | `string` | n/a | yes | +| [subnet\_01\_name](#input\_subnet\_01\_name) | Name of first subnet. 
| `string` | n/a | yes | +| [subnet\_01\_region](#input\_subnet\_01\_region) | Region of first subnet. | `string` | n/a | yes | +| [subnet\_01\_secondary\_pod\_name](#input\_subnet\_01\_secondary\_pod\_name) | Name of pods IP range. | `string` | n/a | yes | +| [subnet\_01\_secondary\_pod\_range](#input\_subnet\_01\_secondary\_pod\_range) | IP range of the pods. | `string` | n/a | yes | +| [subnet\_01\_secondary\_svc\_1\_name](#input\_subnet\_01\_secondary\_svc\_1\_name) | Name of service IP range. | `string` | n/a | yes | +| [subnet\_01\_secondary\_svc\_1\_range](#input\_subnet\_01\_secondary\_svc\_1\_range) | IP range of the service. | `string` | n/a | yes | +| [subnet\_01\_secondary\_svc\_2\_name](#input\_subnet\_01\_secondary\_svc\_2\_name) | Name of service IP range. | `string` | n/a | yes | +| [subnet\_01\_secondary\_svc\_2\_range](#input\_subnet\_01\_secondary\_svc\_2\_range) | IP range of the service. | `string` | n/a | yes | +| [subnet\_02\_description](#input\_subnet\_02\_description) | Subnet description. | `string` | n/a | yes | +| [subnet\_02\_ip](#input\_subnet\_02\_ip) | IP range of second subnet. | `string` | n/a | yes | +| [subnet\_02\_name](#input\_subnet\_02\_name) | Name of the second subnet. | `string` | n/a | yes | +| [subnet\_02\_region](#input\_subnet\_02\_region) | Region of second subnet. | `string` | n/a | yes | +| [subnet\_02\_secondary\_pod\_name](#input\_subnet\_02\_secondary\_pod\_name) | Name of pods IP range. | `string` | n/a | yes | +| [subnet\_02\_secondary\_pod\_range](#input\_subnet\_02\_secondary\_pod\_range) | IP range of the pods. | `string` | n/a | yes | +| [subnet\_02\_secondary\_svc\_1\_name](#input\_subnet\_02\_secondary\_svc\_1\_name) | Name of service IP range. | `string` | n/a | yes | +| [subnet\_02\_secondary\_svc\_1\_range](#input\_subnet\_02\_secondary\_svc\_1\_range) | IP range of the service. 
| `string` | n/a | yes | +| [subnet\_02\_secondary\_svc\_2\_name](#input\_subnet\_02\_secondary\_svc\_2\_name) | Name of service IP range. | `string` | n/a | yes | +| [subnet\_02\_secondary\_svc\_2\_range](#input\_subnet\_02\_secondary\_svc\_2\_range) | IP range of the service. | `string` | n/a | yes | + +## Outputs + +| Name | Description | +|------|-------------| +| [network](#output\_network) | Object containing details of the VPC network. | + +## Usage + +```hcl +module "network" { + source = "git::https://github.com/YOUR_GITHUB_ORG/terraform-modules.git//vpc/" + project_id = "my-project" + network_name = "my-network" + routing_mode = "GLOBAL" + subnet_01_name = "subnet-1" + subnet_01_ip = "10.40.0.0/22" + subnet_01_region = "us-central1" + subnet_01_description = "subnet 1" + subnet_02_name = "subnet-2" + subnet_02_ip = "10.12.0.0/22" + subnet_02_region = "us-central1" + subnet_02_description = "subnet 2" + subnet_01_secondary_svc_1_name = "subnet1-service1" + subnet_01_secondary_svc_1_range = "10.5.0.0/20" + subnet_01_secondary_svc_2_name = "subnet1-service2" + subnet_01_secondary_svc_2_range = "10.5.16.0/20" + subnet_01_secondary_pod_name = "subnet1-pod" + subnet_01_secondary_pod_range = "10.0.0.0/14" + subnet_02_secondary_svc_1_name = "subnet2-service1" + subnet_02_secondary_svc_1_range = "10.13.0.0/20" + subnet_02_secondary_svc_2_name = "subnet2-service2" + subnet_02_secondary_svc_2_range = "10.13.16.0/20" + subnet_02_secondary_pod_name = "subnet2-pod" + subnet_02_secondary_pod_range = "10.8.0.0/14" + +} +``` + +## Workflow + +This module is called from the [multi-tenant platform repo][muti-tenant-platform-repo], which stands up multi-tenant infrastructure for the [dev][dev-multi-tenant], [staging][staging-multi-tenant] and [prod][prod-multi-tenant] environments, to create a VPC network. It can also be called by the [infrastructure repo][infra-repo] if an application needs its own VPC networks inside its projects. 
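As a rough illustration of the workflow described above, the sketch below wires this module's outputs into the `cluster` module from the same change. It relies on the outputs actually defined in this module's `outputs.tf` (`vpc`, `subnet-1`, `subnet-2`), which differ from the single `network` output listed in the table above. The module labels, relative `source` paths, project ID, cluster name, and CIDR values are placeholders assumed for the example and are not taken from the repository.

```hcl
# Hypothetical root module; labels, paths and values are illustrative assumptions.
module "network" {
  source = "./modules/network" # assumed path relative to the root module

  network_name     = "my-network"
  project_id       = "my-project"
  subnet_01_ip     = "10.40.0.0/22"
  subnet_01_name   = "subnet-1"
  subnet_01_region = "us-central1"
  subnet_02_ip     = "10.12.0.0/22"
  subnet_02_name   = "subnet-2"
  subnet_02_region = "us-central1"
}

module "cluster" {
  source = "./modules/cluster" # assumed path relative to the root module

  cluster_name                = "mlp-dev"
  env                         = "dev"
  master_auth_networks_ipcidr = "10.40.0.0/22" # placeholder authorized range
  project_id                  = "my-project"
  region                      = "us-central1"

  # Outputs as declared in modules/network/outputs.tf.
  network = module.network.vpc
  subnet  = module.network.subnet-1
}
```

Only the inputs declared in the module's `variables.tf` are set here; the secondary-range inputs shown in the Usage block are not declared there, so passing them would likely be rejected by `terraform validate`.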
+ +## Contributing + +* [Contributing guidelines][contributing-guidelines] +* [Code of conduct][code-of-conduct] + + + +[contributing-guidelines]: CONTRIBUTING.md +[code-of-conduct]: code-of-conduct.md + + +[muti-tenant-platform-repo]: ../../platform-template +[dev-multi-tenant]: ../../platform-template/env/dev/main.tf?plain=1#L50 +[staging-multi-tenant]: ../../platform-template/env/staging/main.tf?plain=1#L50 +[prod-multi-tenant]: ../../platform-template/env/prod/main.tf?plain=1#L50 +[infra-repo]: ../../app-factory-template/README.md?plain=1#L64 diff --git a/best-practices/ml-platform/terraform/modules/network/outputs.tf b/best-practices/ml-platform/terraform/modules/network/outputs.tf new file mode 100644 index 000000000..d05cb77d0 --- /dev/null +++ b/best-practices/ml-platform/terraform/modules/network/outputs.tf @@ -0,0 +1,28 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +output "subnet-1" { + description = "subnet1." + value = google_compute_subnetwork.subnet-1.id +} + +output "subnet-2" { + description = "subnet2." + value = google_compute_subnetwork.subnet-2.id +} + +output "vpc" { + description = "VPC." 
+ value = google_compute_network.vpc-network.id +} diff --git a/best-practices/ml-platform/terraform/modules/network/variables.tf b/best-practices/ml-platform/terraform/modules/network/variables.tf new file mode 100644 index 000000000..44a83c9bb --- /dev/null +++ b/best-practices/ml-platform/terraform/modules/network/variables.tf @@ -0,0 +1,59 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +variable "project_id" { + type = string + description = "Id of the GCP project where VPC is to be created." +} + +variable "network_name" { + type = string + description = "Name of the VPC network." +} + +variable "routing_mode" { + default = "GLOBAL" + description = "The network routing mode." + type = string +} + +variable "subnet_01_ip" { + type = string + description = "IP range of first subnet." +} + +variable "subnet_01_name" { + type = string + description = "Name of first subnet." +} + +variable "subnet_01_region" { + type = string + description = "Region of first subnet." +} + +variable "subnet_02_ip" { + type = string + description = "IP range of second subnet." +} + +variable "subnet_02_name" { + type = string + description = "Name of the second subnet." +} + +variable "subnet_02_region" { + type = string + description = "Region of second subnet." 
+} diff --git a/best-practices/ml-platform/terraform/modules/network/versions.tf b/best-practices/ml-platform/terraform/modules/network/versions.tf new file mode 100644 index 000000000..466fd04d7 --- /dev/null +++ b/best-practices/ml-platform/terraform/modules/network/versions.tf @@ -0,0 +1,22 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +terraform { + required_providers { + google = { + source = "hashicorp/google" + version = "5.19.0" + } + } +} diff --git a/best-practices/ml-platform/terraform/modules/network/vpc.tf b/best-practices/ml-platform/terraform/modules/network/vpc.tf new file mode 100644 index 000000000..a573374b4 --- /dev/null +++ b/best-practices/ml-platform/terraform/modules/network/vpc.tf @@ -0,0 +1,38 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +resource "google_compute_network" "vpc-network" { + auto_create_subnetworks = false + name = var.network_name + project = var.project_id + routing_mode = var.routing_mode +} + +resource "google_compute_subnetwork" "subnet-1" { + ip_cidr_range = var.subnet_01_ip + name = var.subnet_01_name + network = google_compute_network.vpc-network.id + private_ip_google_access = true + project = var.project_id + region = var.subnet_01_region +} + +resource "google_compute_subnetwork" "subnet-2" { + ip_cidr_range = var.subnet_02_ip + name = var.subnet_02_name + network = google_compute_network.vpc-network.id + private_ip_google_access = true + project = var.project_id + region = var.subnet_02_region +} diff --git a/best-practices/ml-platform/terraform/modules/node-pools/nodepools.tf b/best-practices/ml-platform/terraform/modules/node-pools/nodepools.tf new file mode 100644 index 000000000..79fd15029 --- /dev/null +++ b/best-practices/ml-platform/terraform/modules/node-pools/nodepools.tf @@ -0,0 +1,85 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +resource "google_container_node_pool" "node-pool" { + cluster = var.cluster_name + location = var.region + name = format("%s-%s", var.cluster_name, var.node_pool_name) + project = var.project_id + + autoscaling { + location_policy = var.autoscaling["location_policy"] + total_max_node_count = var.autoscaling["total_max_node_count"] + total_min_node_count = var.autoscaling["total_min_node_count"] + } + + network_config { + enable_private_nodes = true + } + + node_config { + machine_type = var.machine_type + oauth_scopes = [ + "https://www.googleapis.com/auth/cloud-platform" + ] + + labels = { + "resource-type" : var.resource_type + } + + gcfs_config { + enabled = true + } + + guest_accelerator { + count = var.accelerator_count + type = var.accelerator + } + + dynamic "reservation_affinity" { + for_each = var.reservation_name != "" ? [1] : [] + content { + consume_reservation_type = "SPECIFIC_RESERVATION" + key = "compute.googleapis.com/reservation-name" + values = [var.reservation_name] + } + } + + shielded_instance_config { + enable_integrity_monitoring = true + enable_secure_boot = true + } + + dynamic "taint" { + for_each = var.taints + content { + effect = taint.value.effect + key = taint.value.key + value = taint.value.value + } + } + } + + lifecycle { + ignore_changes = [ + node_config[0].labels, + node_config[0].taint, + ] + } + + timeouts { + create = "30m" + update = "20m" + } +} diff --git a/best-practices/ml-platform/terraform/modules/node-pools/variables.tf b/best-practices/ml-platform/terraform/modules/node-pools/variables.tf new file mode 100644 index 000000000..f298cf5eb --- /dev/null +++ b/best-practices/ml-platform/terraform/modules/node-pools/variables.tf @@ -0,0 +1,90 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +variable "accelerator" { + default = "nvidia-l4" + description = "The GPU accelerator to use." + type = string +} + +variable "accelerator_count" { + default = 2 + description = "The number of accelerators per machine." + type = number +} + +variable "autoscaling" { + default = { + "total_min_node_count" : 0, + "total_max_node_count" : 24, + "location_policy" : "ANY" + } + type = map(any) +} + +variable "cluster_name" { + default = "" + description = "GKE cluster name" + type = string +} + +variable "machine_reservation_count" { + default = 4 + description = "Number of reserved machine instances with GPUs" + type = number +} + +variable "machine_type" { + default = "g2-standard-24" + description = "The machine type to use." + type = string +} + +variable "node_pool_name" { + description = "Name of the node pool" + type = string +} + +variable "project_id" { + default = "" + description = "The GCP project where the resources will be created" + type = string +} + +variable "region" { + default = "us-central1" + description = "The GCP region of the GKE cluster in which the node pool will be created" + type = string +} + +variable "reservation_name" { + default = "" + description = "Name of the reservation with which the node pool will be associated" + type = string +} + +variable "resource_type" { + default = "ondemand" + description = "Value of the resource-type node label: ondemand, spot or reserved." + type = string +} + +variable "taints" { + description = "Taints to be applied to the node pool." 
+ type = list(object({ + effect = string + key = string + value = any + })) +} diff --git a/best-practices/ml-platform/terraform/modules/node-pools/versions.tf b/best-practices/ml-platform/terraform/modules/node-pools/versions.tf new file mode 100644 index 000000000..b19f861ad --- /dev/null +++ b/best-practices/ml-platform/terraform/modules/node-pools/versions.tf @@ -0,0 +1,26 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +terraform { + required_providers { + google = { + source = "hashicorp/google" + version = "5.19.0" + } + google-beta = { + source = "hashicorp/google-beta" + version = "5.19.0" + } + } +} diff --git a/best-practices/ml-platform/terraform/modules/vm-reservations/outputs.tf b/best-practices/ml-platform/terraform/modules/vm-reservations/outputs.tf new file mode 100644 index 000000000..11ffcc6d8 --- /dev/null +++ b/best-practices/ml-platform/terraform/modules/vm-reservations/outputs.tf @@ -0,0 +1,17 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +output "reservation_name" { + value = split("/", google_compute_reservation.machine_reservation.id)[5] +} diff --git a/best-practices/ml-platform/terraform/modules/vm-reservations/reservations.tf b/best-practices/ml-platform/terraform/modules/vm-reservations/reservations.tf new file mode 100644 index 000000000..d7a0c1ad3 --- /dev/null +++ b/best-practices/ml-platform/terraform/modules/vm-reservations/reservations.tf @@ -0,0 +1,33 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +resource "google_compute_reservation" "machine_reservation" { + name = format("%s-%s", var.cluster_name, "reservation") + project = var.project_id + specific_reservation_required = true + zone = var.zone + + specific_reservation { + count = var.machine_reservation_count + + instance_properties { + machine_type = var.machine_type + + guest_accelerators { + accelerator_count = var.accelerator_count + accelerator_type = var.accelerator + } + } + } +} diff --git a/best-practices/ml-platform/terraform/modules/vm-reservations/variables.tf b/best-practices/ml-platform/terraform/modules/vm-reservations/variables.tf new file mode 100644 index 000000000..c534c75a6 --- /dev/null +++ b/best-practices/ml-platform/terraform/modules/vm-reservations/variables.tf @@ -0,0 +1,55 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +variable "accelerator" { + default = "nvidia-l4" + description = "The GPU accelerator to use." + type = string +} + +variable "accelerator_count" { + default = 2 + description = "The number of accelerators per machine." + type = number +} + +variable "cluster_name" { + default = "" + description = "GKE cluster name" + type = string +} + +variable "machine_reservation_count" { + default = 2 + description = "Number of reserved machine instances with GPUs" + type = number +} + +variable "machine_type" { + default = "g2-standard-24" + description = "The machine type to use." 
+ type = string +} + +variable "project_id" { + default = "" + description = "The GCP project where the resources will be created" + type = string +} + +variable "zone" { + default = "us-central1-a" + description = "The GCP zone where the reservation will be created" + type = string +} diff --git a/best-practices/ml-platform/terraform/modules/vm-reservations/versions.tf b/best-practices/ml-platform/terraform/modules/vm-reservations/versions.tf new file mode 100644 index 000000000..b19f861ad --- /dev/null +++ b/best-practices/ml-platform/terraform/modules/vm-reservations/versions.tf @@ -0,0 +1,26 @@ +# Copyright 2024 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +terraform { + required_providers { + google = { + source = "hashicorp/google" + version = "5.19.0" + } + google-beta = { + source = "hashicorp/google-beta" + version = "5.19.0" + } + } +}
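The `vm-reservations` module exposes a `reservation_name` output, and the `node-pools` module accepts a `reservation_name` input that switches on its `reservation_affinity` block. A minimal sketch of how the two might be connected from a root module follows. The module labels, relative `source` paths, project ID, cluster name, node pool name, and taint values are assumptions made for illustration; none of them come from this change.

```hcl
# Hypothetical root-module wiring; labels, paths and values are illustrative assumptions.
module "reservation" {
  source = "./modules/vm-reservations" # assumed path relative to the root module

  cluster_name              = "mlp-dev"
  machine_reservation_count = 2
  project_id                = "my-project"
  zone                      = "us-central1-a"
}

module "reserved_node_pool" {
  source = "./modules/node-pools" # assumed path relative to the root module

  cluster_name   = "mlp-dev"
  node_pool_name = "reserved"
  project_id     = "my-project"
  region         = "us-central1" # set explicitly to the cluster's location
  resource_type  = "reserved"

  # A non-empty value enables the dynamic reservation_affinity block in nodepools.tf.
  reservation_name = module.reservation.reservation_name

  # Placeholder taint; key and value are made up for the example.
  taints = [{
    effect = "NO_SCHEDULE"
    key    = "reserved"
    value  = "true"
  }]
}
```

Both modules default to `nvidia-l4` accelerators (two per machine) on `g2-standard-24`, so the reservation and the node pool request the same instance shape as long as neither default is overridden on only one side. The reservation is zonal, so whether nodes can actually consume it depends on the node pool being able to schedule into that zone.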