Commit acd30fe

add instructions for RHOAI 2.13 (#60)

dgrove-oss authored Sep 16, 2024
1 parent b5ab2e7 commit acd30fe

Showing 17 changed files with 969 additions and 2 deletions.
6 changes: 6 additions & 0 deletions SETUP.md
@@ -29,6 +29,12 @@ Instructions are provided for the following OpenShift AI ***stable*** releases:
  + [RHOAI 2.10 Cluster Setup](./setup.RHOAI-v2.10/CLUSTER-SETUP.md)
  + [RHOAI 2.10 Team Setup](./setup.RHOAI-v2.10/TEAM-SETUP.md)
  + [RHOAI 2.10 Uninstall](./setup.RHOAI-v2.10/UNINSTALL.md)
+ OpenShift AI 2.13
  + [RHOAI 2.13 Cluster Setup](./setup.RHOAI-v2.13/CLUSTER-SETUP.md)
  + [RHOAI 2.13 Team Setup](./setup.RHOAI-v2.13/TEAM-SETUP.md)
  + [UPGRADING from RHOAI 2.10](./setup.RHOAI-v2.13/UPGRADE-STABLE.md)
  + [UPGRADING from RHOAI 2.12](./setup.RHOAI-v2.13/UPGRADE-FAST.md)
  + [RHOAI 2.13 Uninstall](./setup.RHOAI-v2.13/UNINSTALL.md)

Instructions are provided for the following OpenShift AI ***fast*** releases:
+ OpenShift AI 2.11
4 changes: 2 additions & 2 deletions setup.RHOAI-v2.12/UPGRADE.md
@@ -28,10 +28,10 @@ oc apply -f setup.RHOAI-v2.12/mlbatch-upgrade-configmaps.yaml
Second, approve the install plan, replacing the example plan name below with the actual
value on your cluster:
```sh
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-st8vh
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-xs6gq
```

Apply this patch:
Third, apply this patch:
```sh
oc apply -f setup.RHOAI-v2.12/mlbatch-rbac-fix.yaml
```
160 changes: 160 additions & 0 deletions setup.RHOAI-v2.13/CLUSTER-SETUP.md
@@ -0,0 +1,160 @@
# Cluster Setup

The cluster setup installs OpenShift AI and Coscheduler, configures Kueue,
cluster roles, and priority classes.

If MLBatch is deployed on a cluster that used to run earlier versions of ODH,
[MCAD](https://github.com/project-codeflare/mcad), OpenShift AI, or Coscheduler,
make sure to scrub traces of these installations. In particular, make sure to
delete the following custom resource definitions (CRD) if present on the
cluster. Make sure to delete all instances prior to deleting the CRDs:
```sh
# Delete old appwrappers and crd
oc delete appwrappers --all -A
oc delete crd appwrappers.workload.codeflare.dev

# Delete old noderesourcetopologies and crd
oc delete noderesourcetopologies --all -A
oc delete crd noderesourcetopologies.topology.node.k8s.io
```
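
If you are unsure whether any of these leftovers exist, a quick read-only check (using only standard `oc` and `grep`) is:
```sh
# Look for CRDs left behind by earlier MCAD or Coscheduler installations
oc get crd | grep -E 'appwrappers.workload.codeflare.dev|noderesourcetopologies.topology.node.k8s.io' \
  || echo "no leftover CRDs found"
```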

## Priorities

Create `default-priority`, `high-priority`, and `low-priority` priority classes:
```sh
oc apply -f setup.RHOAI-v2.13/mlbatch-priorities.yaml
```

## Coscheduler

Install Coscheduler v0.28.9 as a secondary scheduler and configure packing:
```sh
helm install scheduler-plugins --namespace scheduler-plugins --create-namespace \
    scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ \
    --set-json pluginConfig='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"}]'
```
Patch Coscheduler pod priorities:
```sh
oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.13/coscheduler-priority-patch.yaml scheduler-plugins-controller
oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.13/coscheduler-priority-patch.yaml scheduler-plugins-scheduler
```
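
To confirm that both patched deployments roll out cleanly after the priority change, you can check their rollout status (a simple sanity check using the deployment names patched above):
```sh
oc rollout status deployment/scheduler-plugins-controller -n scheduler-plugins
oc rollout status deployment/scheduler-plugins-scheduler -n scheduler-plugins
```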

## OpenShift AI

Create the OpenShift AI subscription:
```sh
oc apply -f setup.RHOAI-v2.13/mlbatch-subscription.yaml
```
Identify install plan:
```sh
oc get ip -n redhat-ods-operator
```
```
NAMESPACE             NAME            CSV                     APPROVAL   APPROVED
redhat-ods-operator   install-kmh8w   rhods-operator.2.13.0   Manual     false
```
Approve the install plan, replacing the generated plan name below with the actual
value:
```sh
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kmh8w
```
Create DSC Initialization:
```sh
oc apply -f setup.RHOAI-v2.13/mlbatch-dsci.yaml
```
Create Data Science Cluster:
```sh
oc apply -f setup.RHOAI-v2.13/mlbatch-dsc.yaml
```
The provided DSCI and DSC are intended to install a minimal set of OpenShift
AI managed components: `codeflare`, `kueue`, `ray`, and `trainingoperator`. The
remaining components, such as `dashboard`, can optionally be enabled.
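
For example, to enable the `dashboard` component later, one option (a sketch; the DSC name matches the one created by `mlbatch-dsc.yaml`, and `managementState` is the standard OpenShift AI component toggle) is to patch the Data Science Cluster:
```sh
# Hypothetical example: switch the dashboard component to Managed
oc patch dsc mlbatch-dsc --type merge \
  --patch '{"spec":{"components":{"dashboard":{"managementState":"Managed"}}}}'
```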

The configuration of the managed components differs from the default OpenShift
AI configuration as follows:
- Kubeflow Training Operator:
  - `gang-scheduler-name` is set to `scheduler-plugins-scheduler`,
- Kueue:
  - `manageJobsWithoutQueueName` is enabled,
  - `batch/job` integration is disabled,
  - `waitForPodsReady` is disabled,
  - `LendingLimit` feature gate is enabled,
  - `enableClusterQueueResources` metrics is enabled,
- Codeflare operator:
  - the AppWrapper controller is enabled and configured as follows:
    - `userRBACAdmissionCheck` is disabled,
    - `schedulerName` is set to `scheduler-plugins-scheduler`,
    - `queueName` is set to `default-queue`,
    - pod priorities, resource requests and limits have been adjusted.

To work around https://issues.redhat.com/browse/RHOAIENG-7887 (a race condition
in OpenShift AI installation), do a rolling restart of the Kueue manager:
```sh
oc rollout restart deployment/kueue-controller-manager -n redhat-ods-applications
```

After doing the restart, verify that you see the following lines in the
kueue-controller-manager's log:
```sh
{"level":"info","ts":"2024-06-25T20:17:25.689638786Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:189","msg":"Registering a validating webhook","GVK":"kubeflow.org/v1, Kind=PyTorchJob","path":"/validate-kubeflow-org-v1-pytorchjob"}
{"level":"info","ts":"2024-06-25T20:17:25.689698615Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:183","msg":"Registering webhook","path":"/validate-kubeflow-org-v1-pytorchjob"}
{"level":"info","ts":"2024-06-25T20:17:25.689743757Z","logger":"setup","caller":"jobframework/setup.go:81","msg":"Set up controller and webhook for job framework","jobFrameworkName":"kubeflow.org/pytorchjob"}

```
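
To check for these lines without scrolling through the full log, you can filter it (the deployment name is the one restarted above):
```sh
oc logs deployment/kueue-controller-manager -n redhat-ods-applications | grep pytorchjob
```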

## Kueue Configuration

Create Kueue's default flavor:
```sh
oc apply -f setup.RHOAI-v2.13/default-flavor.yaml
```

## Cluster Role

Create `mlbatch-edit` role:
```sh
oc apply -f setup.RHOAI-v2.13/mlbatch-edit-role.yaml
```

## Slack Cluster Queue

Create the designated slack `ClusterQueue`, which will be used to automate
minor adjustments to cluster capacity caused by node failures and
scheduler maintenance:
```sh
oc apply -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: slack-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  preemption:
    withinClusterQueue: LowerOrNewerEqualPriority
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8000m
      - name: "memory"
        nominalQuota: 128Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8
      - name: "nvidia.com/roce_gdr"
        nominalQuota: 1
      - name: "pods"
        nominalQuota: 100
EOF
```
Edit the above quantities to adjust the quota to the desired
values. Pod counts are optional and can be omitted from the list of
covered resources. The `lendingLimit` for each resource will be
dynamically adjusted by the MLBatch system to reflect reduced cluster
capacity. See [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md) for a
detailed discussion of the role of the slack `ClusterQueue`.
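
To observe the nominal quotas and any lending limits that MLBatch later sets on this queue, you can inspect the `ClusterQueue` directly (a read-only check; the name matches the manifest above):
```sh
oc get clusterqueue slack-cluster-queue -o yaml
```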
91 changes: 91 additions & 0 deletions setup.RHOAI-v2.13/TEAM-SETUP.md
@@ -0,0 +1,91 @@
# Team Setup

A *team* in MLBatch is a group of users that share a resource quota.

Before setting up your teams and quotas, please read [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md)
for a discussion of our recommended best practices.


Setting up a new team requires the cluster admin to create a project,
a user group, a quota, a queue, and the required role bindings as described below.

Create project:
```sh
oc new-project team1
```
Create user group:
```sh
oc adm groups new team1-edit-group
```
Add users to the group, for example:
```sh
oc adm groups add-users team1-edit-group user1
```
Bind the `mlbatch-edit` cluster role to the group in the team's namespace:
```sh
oc adm policy add-role-to-group mlbatch-edit team1-edit-group --role-namespace="" --namespace team1
```
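
As a quick sanity check (assuming the `mlbatch-edit` role grants AppWrapper permissions), you can verify that a team member is authorized to create workloads in the project:
```sh
# Impersonate user1 to test the binding; requires sufficient privileges to impersonate
oc auth can-i create appwrappers.workload.codeflare.dev -n team1 --as user1
```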

Specify the intended quota for the namespace by creating a `ClusterQueue`:
```sh
oc apply -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team1-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  preemption:
    withinClusterQueue: LowerOrNewerEqualPriority
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8000m
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "memory"
        nominalQuota: 128Gi
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "nvidia.com/gpu"
        nominalQuota: 16
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "nvidia.com/roce_gdr"
        nominalQuota: 4
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "pods"
        nominalQuota: 100
        # borrowingLimit: 0
        # lendingLimit: 0
EOF
```
Edit the above quantities to adjust the quota to the desired values. Pod counts
are optional and can be omitted from the list of covered resources.

Uncomment all `borrowingLimit` lines to prevent this namespace from borrowing
quota from other namespaces. Uncomment all `lendingLimit` lines to prevent other
namespaces from borrowing quota from this namespace.
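
For example, a fully isolated GPU entry, with both lines uncommented, would look like this (values taken from the manifest above):
```yaml
- name: "nvidia.com/gpu"
  nominalQuota: 16
  borrowingLimit: 0
  lendingLimit: 0
```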

Create a `LocalQueue` to bind the `ClusterQueue` to the namespace:
```sh
oc apply -n team1 -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default-queue
spec:
  clusterQueue: team1-cluster-queue
EOF
```
We recommend naming the local queue `default-queue`, as `AppWrappers` will
default to this queue name.
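
If a workload needs to target the queue explicitly rather than rely on this default, the standard Kueue queue label can be added to its metadata; a minimal fragment:
```yaml
metadata:
  labels:
    kueue.x-k8s.io/queue-name: default-queue
```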

23 changes: 23 additions & 0 deletions setup.RHOAI-v2.13/UNINSTALL.md
@@ -0,0 +1,23 @@
# Uninstall

***First, remove all team projects and corresponding cluster queues.***
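
For a team configured as in [TEAM-SETUP.md](./TEAM-SETUP.md), that cleanup looks roughly like the following (names follow the `team1` example; adjust for each of your teams):
```sh
oc delete localqueue default-queue -n team1
oc delete clusterqueue team1-cluster-queue
oc delete project team1
```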

Then, to uninstall the MLBatch controllers and reclaim the corresponding
namespaces, run:
```sh
# OpenShift AI uninstall
oc delete dsc mlbatch-dsc
oc delete dsci mlbatch-dsci
oc delete subscription -n redhat-ods-operator rhods-operator
oc delete csv -n redhat-ods-operator -l operators.coreos.com/rhods-operator.redhat-ods-operator
oc delete crd featuretrackers.features.opendatahub.io \
    dscinitializations.dscinitialization.opendatahub.io \
    datascienceclusters.datasciencecluster.opendatahub.io
oc delete operators rhods-operator.redhat-ods-operator
oc delete operatorgroup -n redhat-ods-operator rhods-operator
oc delete namespace redhat-ods-applications redhat-ods-monitoring redhat-ods-operator

# Coscheduler uninstall
helm uninstall -n scheduler-plugins scheduler-plugins
oc delete namespace scheduler-plugins
```
31 changes: 31 additions & 0 deletions setup.RHOAI-v2.13/UPGRADE-FAST.md
@@ -0,0 +1,31 @@
# Upgrading from RHOAI 2.12

These instructions assume you installed and configured RHOAI 2.12 following
the MLBatch [install instructions for RHOAI-v2.12](../setup.RHOAI-v2.12/CLUSTER-SETUP.md)
or the [upgrade instructions for RHOAI-v2.12](../setup.RHOAI-v2.12/UPGRADE.md).

Your subscription will have automatically created an unapproved
install plan to upgrade to RHOAI 2.13.

Before beginning, verify that the expected install plan exists:
```sh
oc get ip -n redhat-ods-operator
```
Typical output would be:
```sh
NAME            CSV                     APPROVAL   APPROVED
install-kpzzl   rhods-operator.2.13.0   Manual     false
install-nqrbp   rhods-operator.2.10.0   Manual     true
install-st8vh   rhods-operator.2.11.0   Manual     true
install-xs6gq   rhods-operator.2.12.0   Manual     true
```

Assuming the install plan exists, you can begin the upgrade process.

There are no MLBatch modifications to the default RHOAI configuration maps
beyond those already made in previous installs. Therefore, you can simply
approve the install plan, replacing the example plan name below with the actual
value on your cluster:
```sh
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kpzzl
```
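
After approval, the operator upgrade proceeds on its own; you can confirm that the new CSV reaches the `Succeeded` phase with a read-only check:
```sh
oc get csv -n redhat-ods-operator
```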
33 changes: 33 additions & 0 deletions setup.RHOAI-v2.13/UPGRADE-STABLE.md
@@ -0,0 +1,33 @@
# Upgrading from RHOAI 2.10

These instructions assume you installed and configured RHOAI 2.10 following
the MLBatch [install instructions for RHOAI-v2.10](../setup.RHOAI-v2.10/CLUSTER-SETUP.md).

Your subscription will have automatically created an unapproved
install plan to upgrade to RHOAI 2.13.

Before beginning, verify that the expected install plan exists:
```sh
oc get ip -n redhat-ods-operator
```
Typical output would be:
```sh
NAME            CSV                     APPROVAL   APPROVED
install-kpzzl   rhods-operator.2.13.0   Manual     false
install-nqrbp   rhods-operator.2.10.0   Manual     true
```

Assuming the install plan exists, you can begin the upgrade process.

First, update the MLBatch modifications to the default RHOAI configuration maps:
```sh
oc apply -f setup.RHOAI-v2.13/mlbatch-upgrade-stable-configmaps.yaml
```

Second, approve the install plan, replacing the example plan name below with the actual
value on your cluster:
```sh
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kpzzl
```

Finally, create the Slack Cluster Queue as described in [CLUSTER-SETUP.md for RHOAI 2.13](./CLUSTER-SETUP.md#Slack-Cluster-Queue).
3 changes: 3 additions & 0 deletions setup.RHOAI-v2.13/coscheduler-priority-patch.yaml
@@ -0,0 +1,3 @@
- op: add
  path: /spec/template/spec/priorityClassName
  value: system-node-critical
4 changes: 4 additions & 0 deletions setup.RHOAI-v2.13/default-flavor.yaml
@@ -0,0 +1,4 @@
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor