Adding ML platform reference architecture in the folder ml-platform #266

Merged
merged 39 commits into from
Mar 30, 2024
be5f89e
Adding ML platform reference architecture in the folder ml-platform (…
gushob21 Feb 29, 2024
9736e16
fixing project_id variable defaults (#267)
gushob21 Feb 29, 2024
6bde1ae
Fixing documentation (#268)
gushob21 Mar 1, 2024
9ab2b07
Formatted Terraform files
arueth Mar 6, 2024
2aba065
Commented out the license in the README files
arueth Mar 6, 2024
2aa6017
Mlops platform udates (#326)
gushob21 Mar 11, 2024
63f6272
MLP restructure (#347)
arueth Mar 13, 2024
1b16ad0
Changed the Configuration flow
arueth Mar 13, 2024
6f7807e
Fixed terraform fmt issues
arueth Mar 13, 2024
f6a4245
Enabled managed prometheus
arueth Mar 14, 2024
816ddf6
Enabled logging and monitoring
arueth Mar 14, 2024
86d1b5f
Alphabetized and standardized variables.tf
arueth Mar 14, 2024
8bf2696
Renamed cluster resource from gke_batch to mlp
arueth Mar 14, 2024
bccea3e
Fixed typos in README
arueth Mar 14, 2024
3df4a62
Fixed guide path
arueth Mar 14, 2024
d16c055
Changed README to reference a single cluster
arueth Mar 14, 2024
53cf5bd
Formatted for Terraform files readability and supportability
arueth Mar 14, 2024
44a23b0
Fixed 'set configuration variables' environment variable for project ID
arueth Mar 14, 2024
ca9fa68
Added serviceaccount.yaml to _namespace_template/app/kustomization.yaml
arueth Mar 14, 2024
d0861e0
Enabled serviceusage.googleapis.com APIs
arueth Mar 14, 2024
ba03eb2
Fixed terraform fmt issues
arueth Mar 14, 2024
30e88b9
Added shielded VMs
arueth Mar 14, 2024
3396d4f
Added a dependency on the GKE cluster for the WI SA IAM binding
arueth Mar 15, 2024
f8e2d8b
add shielded config for nap pools as well
kenthua Mar 15, 2024
0cf408b
mlops platform kh (#363)
kenthua Mar 15, 2024
a866414
Added additional setup and cleanup steps
arueth Mar 22, 2024
696c94c
Removed code for multiple environment and added additional service ac…
arueth Mar 22, 2024
d48b5be
Added remove_default_node_pool
arueth Mar 22, 2024
9d96a92
update to push work to workers
kenthua Mar 26, 2024
a9f352f
adding autoscaling config
kenthua Mar 26, 2024
dd7f910
Update values.yaml
kenthua Mar 26, 2024
8b59056
Project cleanup (#469)
arueth Mar 28, 2024
702c8da
Code Sample for Ray Dataprocessing on GKE (#507)
karajendran Mar 29, 2024
4951554
Restructured folder
arueth Mar 29, 2024
3f0df9a
Cleanup from folder restructuring
arueth Mar 29, 2024
cd069f0
Updated the CUJs
arueth Mar 29, 2024
bbff3bb
Leaving default node pool for sanbox to decrease provisioning time
arueth Mar 29, 2024
5984eac
Cleaned up the data processing README
arueth Mar 29, 2024
f421af9
Cleaned up the initialize feature
arueth Mar 29, 2024
74 changes: 74 additions & 0 deletions best-practices/ml-platform/README.md
@@ -0,0 +1,74 @@
# Machine learning platform (MLP) on GKE reference architecture for enabling Machine Learning Operations (MLOps)

## Platform Principles

This reference architecture demonstrates how to build a GKE platform that facilitates machine learning. The reference architecture is based on the following principles:

- The platform admin creates the GKE platform using an IaC tool such as [Terraform][terraform]. The IaC comes with reusable modules that can be referenced to create more resources as demand grows.
- The platform is based on [GitOps][gitops].
- After the GKE platform has been created, the admins create cluster-scoped resources on it through [Config Sync][config-sync].
- Platform admins create a namespace per application and give the application team members full access to it.
- The Application/ML teams create namespace-scoped resources either via [Config Sync][config-sync] or through a deployment tool such as [Cloud Deploy][cloud-deploy].
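
As a minimal sketch of the namespace-per-application principle above, the following manifests create a namespace and bind a team group to the built-in Kubernetes `admin` ClusterRole within it. The namespace name `ml-team-a` and the group `ml-team-a@example.com` are hypothetical placeholders, and using a group as the RBAC subject assumes the cluster is configured to resolve group membership. This is an illustration only, not the template this repository ships.

```yaml
# Hypothetical namespace for one application team.
apiVersion: v1
kind: Namespace
metadata:
  name: ml-team-a
---
# Grants the team full access inside its own namespace only.
# "admin" is a built-in ClusterRole; binding it with a RoleBinding
# limits its scope to the namespace named below.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-team-a-admin
  namespace: ml-team-a
subjects:
  - kind: Group
    name: ml-team-a@example.com # assumed group identity
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: admin
  apiGroup: rbac.authorization.k8s.io
```

Committing manifests like these to the Git repository that Config Sync watches would keep namespace creation and access grants under GitOps control, in line with the principles listed above.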

## Critical User Journeys (CUJs)

### Persona: Platform Admin

- Offer a platform that incorporates established best practices.
- Grant end users the essential resources, guided by the principle of least privilege, empowering them to manage and maintain their workloads.
- Establish secure channels for end users to interact seamlessly with the platform.
- Enable the enforcement of robust security policies across the platform.

### Persona: Machine Learning Engineer

- Deploy the model with ease and make the endpoints available only to the intended audience.
- Continuously monitor model performance and resource utilization.
- Troubleshoot any performance or integration issues.
- Version, store, and access the models and model artifacts:
  - To debug and troubleshoot in production and trace back to the specific model version and associated training data.
  - To perform a quick, controlled rollback to a previous, more stable version.
- Implement the feedback loop to adapt to changing data and business needs:
  - Retrain or fine-tune the model.
  - Split the traffic between models (A/B testing).
  - Switch between models without breaking the inference system for end users.
- Scale the infrastructure up or down to accommodate changing needs.
- Share insights and findings with stakeholders to support data-driven decisions.

### Persona: Machine Learning Operator

- Provide and maintain software required by the end users of the platform.
- Operationalize experimental workloads by providing guidance and best practices for running them on the platform.
- Deploy the workloads on the platform.
- Assist with enabling observability and monitoring for the workloads to ensure smooth operations.

## Prerequisites

- This guide is meant to be run on [Cloud Shell](https://shell.cloud.google.com), which comes preinstalled with the [Google Cloud SDK](https://cloud.google.com/sdk) and other tools that are required to complete this tutorial.
- Familiarity with the following:
  - [Google Kubernetes Engine][gke]
  - [Terraform][terraform]
  - [git][git]
  - [Google Configuration Management root-sync][root-sync]
  - [Google Configuration Management repo-sync][repo-sync]
  - [GitHub][github]

## Deploy the platform

[Sandbox Reference Architecture Guide](examples/platform/sandbox/README.md): Set up an environment to familiarize yourself with the architecture and get an understanding of the concepts.

## Use cases

- [Distributed Data Processing with Ray](examples/use-case/ray/dataprocessing/README.md): Run a distributed data processing job using Ray.

[gitops]: https://about.gitlab.com/topics/gitops/
[repo-sync]: https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields
[root-sync]: https://cloud.google.com/anthos-config-management/docs/reference/rootsync-reposync-fields
[config-sync]: https://cloud.google.com/anthos-config-management/docs/config-sync-overview
[cloud-deploy]: https://cloud.google.com/deploy?hl=en
[terraform]: https://www.terraform.io/
[gke]: https://cloud.google.com/kubernetes-engine?hl=en
[git]: https://git-scm.com/
[github]: https://github.com/
[gcp-project]: https://cloud.google.com/resource-manager/docs/creating-managing-projects
[personal-access-token]: https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens
[machine-user-account]: https://docs.github.com/en/get-started/learning-about-github/types-of-github-accounts