This document describes ideas, opportunities, and best practices for optimizing cloud cost for Red Hat OpenShift AI (RHOAI). It also explains in detail the design and implementation of the infrastructure that carries out this plan. Finally, it summarizes the outcome and the cost savings achieved for Red Hat OpenShift AI using this infrastructure.
-
Each cluster has inactive periods when it is not being used but is still running and adding to the cloud cost
-
We should hibernate clusters, or freeze their major cloud resources, during inactive periods to save cloud cost
-
Possible inactive durations
-
Weekend - All clusters should be hibernated each weekend and resumed when the week starts. This can achieve up to 25% cost saving per month
-
Daily - Each cluster has some inactive hours each day (which can be confirmed by the cluster owner). The cluster can be hibernated during these inactive hours, which can result in ~50% cost saving per month
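As a rough sanity check on the figures above, the fraction of the week a cluster spends hibernated gives an upper bound on the saving. The sketch below assumes that cost scales linearly with running hours. That is a simplification, since storage and other charges may continue during hibernation. The function name and its parameters are illustrative and are not part of the actual tooling.

```python
HOURS_PER_WEEK = 7 * 24


def hibernated_fraction(weekend_days: int = 2, daily_inactive_hours: float = 0.0) -> float:
    """Return the fraction of a week during which a cluster is hibernated.

    Assumption: whole weekend days are hibernated, and the daily inactive
    hours apply only to the remaining (non-weekend) days.
    """
    if not 0 <= weekend_days <= 7:
        raise ValueError("weekend_days must be between 0 and 7")
    if not 0 <= daily_inactive_hours <= 24:
        raise ValueError("daily_inactive_hours must be between 0 and 24")
    weekend_hours = weekend_days * 24
    weekday_hours = (7 - weekend_days) * daily_inactive_hours
    return (weekend_hours + weekday_hours) / HOURS_PER_WEEK


if __name__ == "__main__":
    # Upper bounds on running-hour reduction, not measured savings.
    print(f"weekend only: {hibernated_fraction():.1%}")
    print(f"12 inactive hours every day: {hibernated_fraction(0, 12):.1%}")
```

Under these assumptions, twelve inactive hours every day corresponds to the ~50% figure, and a two-day weekend gives a bound slightly above the quoted 25%.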
-
-
Both hibernation and resume should be automated to:
-
Save manual effort
-
Make it happen consistently without failure
-
Avoid disruption to cluster users
-
-
Users should be able to resume a cluster manually whenever needed
-
OSD and ROSA-Classic clusters have standard hibernation support
-
ROSA-Hosted (Hypershift) and IPI clusters do not support hibernation, so we should develop a custom hibernation infrastructure for these types of clusters
-
Audit and review of existing clusters, along with guiding the team to select a cost-optimized cluster for their use case using the Cluster Selection Guide
-
Hibernation Infrastructure design and development
-
Hypershift Hibernation infrastructure - to hibernate and resume the HCP clusters
-
IPI Hibernation infrastructure - to hibernate and resume the IPI clusters
-
Automated Weekend Hibernation infrastructure - to automatically hibernate all the clusters each weekend and resume when the week starts
-
Automated Daily Hibernation Infrastructure - to automatically hibernate and resume all the clusters based on the inactive hours provided by the cluster owner
-
Cluster Stats Smartsheet - An auto-populated smartsheet that is always up to date with the details of all RHOAI clusters from the PROD and STAGE accounts. It is also used to configure the inactive hours for daily hibernation
-
On-Demand Hibernation / Resume infrastructure - to enable team members to hibernate or resume their clusters whenever they need
-
-
Automated Cloud Cleanup infrastructure - to regularly clean up leftover cloud resources from deleted OpenShift clusters
-
Best Practices Formulation - Devised and documented team-wide best practices for saving cloud cost, and educated the team about them
-
Before creating a new cluster, please refer to the Cluster Selection Guide to identify the least expensive cluster for your use case
-
Please update the “Inactive Hours” for your cluster so that Automated Daily Hibernation applies to it
-
Automated Weekend Hibernation applies to all clusters; please use the DevOps infra to resume your cluster if you need it in between
-
Please hibernate your personal OSD or ROSA clusters before going on a long vacation
Make sure to register your disconnected clusters with OCM; follow this doc for detailed steps
-
Open the RHOAI Clusters smartsheet
-
If a login screen is shown, select “Sign in with Google”, provide your RH email, and follow the single sign-on flow
-
Find your cluster in the list
-
Update the “Inactive Hours - Start (UTC)” and “Inactive Hours - End (UTC)” columns for your cluster
-
The times must be given in the UTC timezone
-
The format must be HH:MM:SS in 24-hour time (without AM or PM)
-
If the inactive hours are left empty, the cluster will not be hibernated or resumed.
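The rules above can be sketched as a small check. This is a minimal illustration, not the actual automation: the function names are hypothetical, and the handling of windows that cross midnight (for example 18:00:00 to 02:30:00) is an assumption, since the behavior for that case is not stated here.

```python
from datetime import datetime, time, timezone
from typing import Optional


def parse_inactive_time(value: str) -> Optional[time]:
    """Parse an HH:MM:SS 24-hour string; an empty cell means 'not configured'.

    Raises ValueError for anything else, e.g. values with AM/PM.
    """
    value = value.strip()
    if not value:
        return None
    return datetime.strptime(value, "%H:%M:%S").time()


def is_inactive(now: datetime, start: Optional[time], end: Optional[time]) -> bool:
    """Return True if `now` falls inside the inactive window [start, end).

    `now` must be timezone-aware; it is converted to UTC before comparing.
    """
    if start is None or end is None:
        return False  # empty columns: neither hibernate nor resume
    current = now.astimezone(timezone.utc).time()
    if start <= end:
        return start <= current < end
    # Assumption: a start later than the end means the window crosses midnight.
    return current >= start or current < end


if __name__ == "__main__":
    start = parse_inactive_time("18:00:00")
    end = parse_inactive_time("02:30:00")
    print(is_inactive(datetime.now(timezone.utc), start, end))
```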
-
ROSA-Hosted clusters do not support hibernation as a standard feature
-
We have devised a custom hibernation strategy that switches off the corresponding EC2 instances and deletes the root EBS volumes to save cost
-
We have designed and developed the tooling / infrastructure using Python and GitHub Actions to implement our custom hibernation strategy
-
IPI clusters do not support hibernation as a standard feature
-
We have devised a custom hibernation strategy that switches off the corresponding EC2 instances to save cost
-
We have designed and developed the tooling / infrastructure using Python and GitHub Actions to implement our custom hibernation strategy
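To make the "switch off the EC2 instances" step concrete, here is a minimal sketch of how it could look. It is not the team's actual tooling. It assumes the `ec2` argument is a boto3 EC2 client (for example `boto3.client("ec2")`) and that the cluster's instances carry the `kubernetes.io/cluster/<infra-id>` tag with value `owned`, as the IPI installer typically sets. The function names and the `infra_id` parameter are hypothetical.

```python
from typing import Any, Dict, List


def find_running_instance_ids(ec2: Any, infra_id: str) -> List[str]:
    """List IDs of running instances tagged as owned by the given cluster."""
    kwargs: Dict[str, Any] = {
        "Filters": [
            # Assumption: instances are tagged by the installer with this key.
            {"Name": f"tag:kubernetes.io/cluster/{infra_id}", "Values": ["owned"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    }
    instance_ids: List[str] = []
    while True:
        page = ec2.describe_instances(**kwargs)
        for reservation in page.get("Reservations", []):
            for instance in reservation.get("Instances", []):
                instance_ids.append(instance["InstanceId"])
        token = page.get("NextToken")
        if not token:
            return instance_ids
        kwargs["NextToken"] = token  # fetch the next page of results


def stop_cluster_instances(ec2: Any, infra_id: str) -> List[str]:
    """Stop all running instances of the cluster and return their IDs."""
    instance_ids = find_running_instance_ids(ec2, infra_id)
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids
```

Passing the client in as an argument keeps the sketch independent of credentials and makes it easy to exercise with a stand-in object. A resume counterpart would presumably look up stopped instances the same way and call `start_instances`.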
-
Go to the Hibernate Cluster GitHub action
-
Click the “Run Workflow” button on the right side of the page
-
Provide the “Cluster Name” and select the correct “OCM Account”
-
Click “Run Workflow”
PS - This workflow can hibernate an OSD cluster as well, but it will not wait for the hibernation to complete.
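The manual steps above could also be scripted through GitHub's REST endpoint for workflow dispatch events. The sketch below only builds the request and does not send it. The repository owner, repository name, workflow file name, and the input keys `cluster_name` and `ocm_account` are all hypothetical placeholders, since the real workflow's input names are not given here.

```python
import json
import urllib.request


def build_dispatch_request(
    owner: str,
    repo: str,
    workflow_file: str,
    token: str,
    cluster_name: str,
    ocm_account: str,
    ref: str = "main",
) -> urllib.request.Request:
    """Build a POST request that triggers a workflow_dispatch run."""
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    payload = {
        "ref": ref,  # branch or tag that holds the workflow definition
        # Assumption: the workflow declares inputs with these names.
        "inputs": {"cluster_name": cluster_name, "ocm_account": ocm_account},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

Sending the result with `urllib.request.urlopen(request)` would start the run. The same shape would apply to a resume workflow by pointing at a different workflow file.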
-
Log in to the OCM account from which the cluster was created
-
Go to the “Clusters” page and locate your cluster in the list
-
Click the three dots next to the cluster, then click “Hibernate”
-
Go to the Resume Cluster GitHub action
-
Click the “Run Workflow” button on the right side of the page
-
Provide the “Cluster Name” and select the correct “OCM Account”
-
Click “Run Workflow”
PS - This workflow can resume an OSD cluster as well, but it will not wait for the resumption to complete.
-
Log in to the OCM account from which the cluster was created
-
Go to the “Clusters” page and locate your cluster in the list
-
Click the three dots next to the cluster, then click “Resume from Hibernation”
-
Open the RHOAI Clusters smartsheet
-
If a login screen is shown, select “Sign in with Google”, provide your RH email, and follow the single sign-on flow
-
Find your cluster in the list and check the status