Cloud-Cost-Optimization

Summary

This document describes ideas, opportunities, and best practices for optimizing the cloud cost of Red Hat OpenShift AI. It also explains the design and implementation of the infrastructure that carries out this plan. It closes with a summary of the results and the cost savings achieved for Red Hat OpenShift AI using this infrastructure.

Cost Optimization Plan

Each cluster has inactive periods during which it is not in use but is still running and adding to the cloud cost
We should hibernate clusters, or freeze their major cloud resources, during inactive periods to save cloud cost
Possible inactive periods
- Weekend - All clusters should be hibernated each weekend and resumed when the week starts. This can save up to 25% of the cost per month
- Daily - Each cluster has some inactive hours each day (to be confirmed by the cluster owner). The cluster can be hibernated during these hours, which can save ~50% of the cost per month
Hibernation and resume should both be automated to:
- Save manual effort
- Make them happen consistently, without failure
- Avoid disruption to cluster users
Users should be able to resume a cluster manually whenever needed
OSD and ROSA-Classic clusters have standard hibernation support
ROSA-Hosted (Hypershift) and IPI clusters do not support hibernation, so we should develop a custom hibernation infrastructure for these cluster types
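
The weekend and daily windows above can be turned into a rough estimate with a few lines of Python. This is a minimal sketch, not the project's own calculation: the function name and parameters are hypothetical, and it only computes the share of weekly hours a cluster spends hibernated. That share is an upper bound on the compute saving, because charges such as storage usually continue while instances are stopped, so the percentages quoted in the plan remain the authors' own estimates.

```python
HOURS_PER_WEEK = 7 * 24


def hibernated_fraction(weekend_hours: float = 48.0,
                        daily_inactive_hours: float = 0.0,
                        weekdays: int = 5) -> float:
    """Return the fraction of a week's hours spent hibernated.

    weekend_hours: hours hibernated over the weekend (assumed 48 by default).
    daily_inactive_hours: hours hibernated on each weekday.
    weekdays: number of working days per week.
    """
    if not 0 <= weekdays <= 7:
        raise ValueError("weekdays must be between 0 and 7")
    if not 0 <= daily_inactive_hours <= 24:
        raise ValueError("daily_inactive_hours must be between 0 and 24")
    if not 0 <= weekend_hours <= (7 - weekdays) * 24:
        raise ValueError("weekend_hours does not fit in the non-working days")
    hibernated = weekend_hours + weekdays * daily_inactive_hours
    return hibernated / HOURS_PER_WEEK


if __name__ == "__main__":
    # Weekend hibernation only.
    print(f"weekend only: {hibernated_fraction():.1%}")
    # Weekend hibernation plus 12 inactive hours on each weekday.
    print(f"weekend + 12h/day: {hibernated_fraction(daily_inactive_hours=12):.1%}")
```

Comparing the printed shares with the actual bill for a cluster is one way to see how much of the cost is tied to running instances.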

Implementation completed

Audit and review of existing clusters, along with guiding the team to select a cost-optimized cluster for their use case using the Cluster Selection Guide
Hibernation Infrastructure design and development
1. Hypershift Hibernation infrastructure - to hibernate and resume the HCP clusters
2. IPI Hibernation infrastructure - to hibernate and resume the IPI clusters
3. Automated Weekend Hibernation infrastructure - to automatically hibernate all clusters each weekend and resume them when the week starts
4. Automated Daily Hibernation infrastructure - to automatically hibernate and resume clusters based on the inactive hours provided by the cluster owner
5. Cluster Stats Smartsheet - an auto-populated smartsheet that always holds the latest details of all RHOAI clusters from the PROD and STAGE accounts. It is also used to configure the inactive hours for daily hibernation
6. On-Demand Hibernation / Resume infrastructure - to let team members hibernate or resume their clusters whenever they need
Automated Cloud Cleanup infrastructure - to regularly clean up leftover cloud resources from deleted OpenShift clusters
Best Practices Formulation - devised and documented team-wide best practices for saving cloud cost, and educated the team about them

Best Practices

Before creating a new cluster, please refer to the Cluster Selection Guide to identify the least expensive cluster for your use case
Please update the “Inactive Hours” for your cluster so that Automated Daily Hibernation applies to it
Automated Weekend Hibernation applies to all clusters; please use the DevOps infra to resume your cluster if you need it in between
Please hibernate your personal OSD or ROSA clusters before going on a long vacation

Make sure to register your disconnected clusters to OCM; follow this doc for detailed steps

FAQs

How to update the inactive hours for your cluster

Open the RHOAI Clusters smartsheet
If a login screen is shown, select “Sign in with Google”, provide your RH email, and follow the single sign-on flow
Find your cluster in the list
Update the “Inactive Hours - Start (UTC)” and “Inactive Hours - End (UTC)” columns for your cluster
Times must be given in the UTC timezone
Times must be in HH:MM:SS 24-hour format (without AM or PM)
If the inactive hours are left empty, the cluster will not be hibernated or resumed.
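
The format rules above can be checked with a short Python sketch before you enter values in the smartsheet. It is illustrative only and is not taken from the project's tooling: the function names are hypothetical, and the handling of a window that crosses midnight (for example 22:00:00 to 06:00:00) is an assumption about how such a window would reasonably be interpreted.

```python
from datetime import datetime, time, timezone
from typing import Optional


def parse_utc_time(value: str) -> time:
    """Parse an HH:MM:SS 24-hour string; raises ValueError on other formats."""
    return datetime.strptime(value.strip(), "%H:%M:%S").time()


def is_inactive(start: Optional[str], end: Optional[str],
                now: Optional[datetime] = None) -> bool:
    """Return True if `now` falls inside the inactive window.

    `now` must be timezone-aware; it defaults to the current UTC time.
    Empty start or end values mean no window is configured.
    """
    if not start or not end:
        return False
    start_t, end_t = parse_utc_time(start), parse_utc_time(end)
    current = (now or datetime.now(timezone.utc)).astimezone(timezone.utc).time()
    if start_t <= end_t:
        # Window within one day; an identical start and end is treated as empty.
        return start_t <= current < end_t
    # Window that crosses midnight (assumed interpretation).
    return current >= start_t or current < end_t


if __name__ == "__main__":
    late = datetime(2024, 1, 1, 23, 0, tzinfo=timezone.utc)
    print(is_inactive("22:00:00", "06:00:00", late))
```

A value such as "10:00 PM" fails in `parse_utc_time` with a `ValueError`, which mirrors the requirement to avoid AM or PM.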

How ROSA-Hosted clusters are hibernated

ROSA-Hosted clusters do not support hibernation as a standard feature
We have devised a custom hibernation strategy that switches off the corresponding EC2 instances and deletes the root EBS volumes to save cost
We have designed and developed the tooling / infrastructure using Python and GitHub Actions to implement our custom hibernation strategy

How IPI clusters are hibernated

IPI clusters do not support hibernation as a standard feature
We have devised a custom hibernation strategy that switches off the corresponding EC2 instances to save cost
We have designed and developed the tooling / infrastructure using Python and GitHub Actions to implement our custom hibernation strategy
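
As an illustration of the "switch off the EC2 instances" step, the sketch below finds a cluster's running instances by tag and stops them. It is not the repository's implementation. It assumes a boto3-style EC2 client is passed in (the document only says the tooling uses Python), and it assumes the cluster's instances carry a tag key of the form `kubernetes.io/cluster/<infra-id>`. The function names are hypothetical.

```python
from typing import Any, List


def find_cluster_instance_ids(ec2: Any, tag_key: str,
                              state: str = "running") -> List[str]:
    """List IDs of instances that carry `tag_key` and are in `state`.

    `ec2` is expected to behave like a boto3 EC2 client.
    Pagination is omitted for brevity, so very large accounts may need it.
    """
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag-key", "Values": [tag_key]},
            {"Name": "instance-state-name", "Values": [state]},
        ]
    )
    return [
        instance["InstanceId"]
        for reservation in response.get("Reservations", [])
        for instance in reservation.get("Instances", [])
    ]


def stop_cluster_instances(ec2: Any, tag_key: str) -> List[str]:
    """Stop every running instance of the cluster and return their IDs."""
    instance_ids = find_cluster_instance_ids(ec2, tag_key)
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids
```

With boto3 installed and AWS credentials configured, a call could look like `stop_cluster_instances(boto3.client("ec2", region_name="us-east-1"), "kubernetes.io/cluster/my-infra-id")`, where the region and infra ID are placeholders. Resuming would be the mirror image, selecting stopped instances and calling `start_instances`.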

How to manually hibernate a ROSA-Hosted or IPI cluster

Go to the Hibernate Cluster GitHub action
Click the “Run Workflow” button on the right side of the page
Provide the “Cluster Name” and select the correct “OCM Account”
Click “Run Workflow”

PS - This workflow can hibernate an OSD cluster as well, but it will not wait for the hibernation to complete.
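
The same "Run Workflow" action can in principle be triggered without the web page, through GitHub's REST endpoint for workflow dispatch events. The sketch below only builds the request. The workflow file name `hibernate-cluster.yml`, the input keys `cluster_name` and `ocm_account`, and the `main` branch are all hypothetical placeholders, so check the workflow definition under `.github/workflows` for the real names before using it.

```python
import json
import urllib.request
from typing import Dict


def build_dispatch_request(repo: str, workflow_file: str, ref: str,
                           inputs: Dict[str, str],
                           token: str) -> urllib.request.Request:
    """Build a POST request that creates a workflow dispatch event.

    repo: "owner/name" of the repository.
    workflow_file: file name of the workflow (hypothetical in the example).
    ref: branch or tag the workflow should run on.
    inputs: workflow inputs; the keys must match the workflow definition.
    token: a GitHub token allowed to run workflows in the repository.
    """
    url = (f"https://api.github.com/repos/{repo}"
           f"/actions/workflows/{workflow_file}/dispatches")
    body = json.dumps({"ref": ref, "inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    request = build_dispatch_request(
        repo="red-hat-data-services/Cloud-Cost-Optimization",
        workflow_file="hibernate-cluster.yml",  # hypothetical file name
        ref="main",  # assumed default branch
        inputs={"cluster_name": "my-cluster", "ocm_account": "STAGE"},  # hypothetical keys
        token="<your token>",
    )
    print(request.get_method(), request.full_url)
    # To send it: urllib.request.urlopen(request)
```

The resume workflow could be dispatched the same way by pointing `workflow_file` at its own file.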

How to manually hibernate an OSD cluster

Log in to the OCM account from which the cluster was created
Go to the “Clusters” page and locate your cluster in the list
Click the three dots next to the cluster, then click “Hibernate”

How to manually resume a ROSA-Hosted or IPI cluster

Go to the Resume Cluster GitHub action
Click the “Run Workflow” button on the right side of the page
Provide the “Cluster Name” and select the correct “OCM Account”
Click “Run Workflow”

PS - This workflow can resume an OSD cluster as well, but it will not wait for the resumption to complete.

How to manually resume an OSD cluster

Log in to the OCM account from which the cluster was created
Go to the “Clusters” page and locate your cluster in the list
Click the three dots next to the cluster, then click “Resume from Hibernation”

How to check the status of your cluster

Open the RHOAI Clusters smartsheet
If a login screen is shown, select “Sign in with Google”, provide your RH email, and follow the single sign-on flow
Find your cluster in the list and check its status
