Commit acd30fe

add instructions for RHOAI 2.13 (#60)

dgrove-oss authored Sep 16, 2024
1 parent b5ab2e7 commit acd30fe

Showing 17 changed files with 969 additions and 2 deletions.
6 changes: 6 additions & 0 deletions SETUP.md
@@ -29,6 +29,12 @@ Instructions are provided for the following OpenShift AI ***stable*** releases:
  + [RHOAI 2.10 Cluster Setup](./setup.RHOAI-v2.10/CLUSTER-SETUP.md)
  + [RHOAI 2.10 Team Setup](./setup.RHOAI-v2.10/TEAM-SETUP.md)
  + [RHOAI 2.10 Uninstall](./setup.RHOAI-v2.10/UNINSTALL.md)
+ OpenShift AI 2.13
  + [RHOAI 2.13 Cluster Setup](./setup.RHOAI-v2.13/CLUSTER-SETUP.md)
  + [RHOAI 2.13 Team Setup](./setup.RHOAI-v2.13/TEAM-SETUP.md)
  + [UPGRADING from RHOAI 2.10](./setup.RHOAI-v2.13/UPGRADE-STABLE.md)
  + [UPGRADING from RHOAI 2.12](./setup.RHOAI-v2.13/UPGRADE-FAST.md)
  + [RHOAI 2.13 Uninstall](./setup.RHOAI-v2.13/UNINSTALL.md)

Instructions are provided for the following OpenShift AI ***fast*** releases:
+ OpenShift AI 2.11
4 changes: 2 additions & 2 deletions setup.RHOAI-v2.12/UPGRADE.md
@@ -28,10 +28,10 @@ oc apply -f setup.RHOAI-v2.12/mlbatch-upgrade-configmaps.yaml
Second, approve the install plan, replacing the example plan name below with the actual
value on your cluster:
```sh
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-st8vh
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-xs6gq
```

Apply this patch:
Third, apply this patch:
```sh
oc apply -f setup.RHOAI-v2.12/mlbatch-rbac-fix.yaml
```
160 changes: 160 additions & 0 deletions setup.RHOAI-v2.13/CLUSTER-SETUP.md
@@ -0,0 +1,160 @@
# Cluster Setup

The cluster setup installs OpenShift AI and Coscheduler, configures Kueue,
cluster roles, and priority classes.

If MLBatch is deployed on a cluster that used to run earlier versions of ODH,
[MCAD](https://github.com/project-codeflare/mcad), OpenShift AI, or Coscheduler,
make sure to scrub traces of these installations. In particular, make sure to
delete the following custom resource definitions (CRD) if present on the
cluster. Make sure to delete all instances prior to deleting the CRDs:
```sh
# Delete old appwrappers and crd
oc delete appwrappers --all -A
oc delete crd appwrappers.workload.codeflare.dev

# Delete old noderesourcetopologies and crd
oc delete noderesourcetopologies --all -A
oc delete crd noderesourcetopologies.topology.node.k8s.io
```
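
If you are unsure whether any of these leftovers exist, a quick read-only check (using only standard `oc` and `grep`) is:
```sh
# Look for CRDs left behind by earlier MCAD or Coscheduler installations
oc get crd | grep -E 'appwrappers.workload.codeflare.dev|noderesourcetopologies.topology.node.k8s.io' \
  || echo "no leftover CRDs found"
```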

## Priorities

Create `default-priority`, `high-priority`, and `low-priority` priority classes:
```sh
oc apply -f setup.RHOAI-v2.13/mlbatch-priorities.yaml
```

## Coscheduler

Install Coscheduler v0.28.9 as a secondary scheduler and configure packing:
```sh
helm install scheduler-plugins --namespace scheduler-plugins --create-namespace \
    scheduler-plugins/manifests/install/charts/as-a-second-scheduler/ \
    --set-json pluginConfig='[{"args":{"scoringStrategy":{"resources":[{"name":"nvidia.com/gpu","weight":1}],"requestedToCapacityRatio":{"shape":[{"utilization":0,"score":0},{"utilization":100,"score":10}]},"type":"RequestedToCapacityRatio"}},"name":"NodeResourcesFit"}]'
```
Patch Coscheduler pod priorities:
```sh
oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.13/coscheduler-priority-patch.yaml scheduler-plugins-controller
oc patch deployment -n scheduler-plugins --type=json --patch-file setup.RHOAI-v2.13/coscheduler-priority-patch.yaml scheduler-plugins-scheduler
```
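
To confirm that both patched deployments roll out cleanly after the priority change, you can check their rollout status (a simple sanity check using the deployment names patched above):
```sh
oc rollout status deployment/scheduler-plugins-controller -n scheduler-plugins
oc rollout status deployment/scheduler-plugins-scheduler -n scheduler-plugins
```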

## OpenShift AI

Create the OpenShift AI subscription:
```sh
oc apply -f setup.RHOAI-v2.13/mlbatch-subscription.yaml
```
Identify install plan:
```sh
oc get ip -n redhat-ods-operator
```
```
NAMESPACE             NAME            CSV                     APPROVAL   APPROVED
redhat-ods-operator   install-kmh8w   rhods-operator.2.13.0   Manual     false
```
Approve the install plan, replacing the generated plan name below with the actual
value:
```sh
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kmh8w
```
Create DSC Initialization:
```sh
oc apply -f setup.RHOAI-v2.13/mlbatch-dsci.yaml
```
Create Data Science Cluster:
```sh
oc apply -f setup.RHOAI-v2.13/mlbatch-dsc.yaml
```
The provided DSCI and DSC are intended to install a minimal set of OpenShift
AI managed components: `codeflare`, `kueue`, `ray`, and `trainingoperator`. The
remaining components, such as `dashboard`, can optionally be enabled.
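
For example, to enable the `dashboard` component later, one option (a sketch; the DSC name matches the one created by `mlbatch-dsc.yaml`, and `managementState` is the standard OpenShift AI component toggle) is to patch the Data Science Cluster:
```sh
# Hypothetical example: switch the dashboard component to Managed
oc patch dsc mlbatch-dsc --type merge \
  --patch '{"spec":{"components":{"dashboard":{"managementState":"Managed"}}}}'
```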

The configuration of the managed components differs from the default OpenShift
AI configuration as follows:
- Kubeflow Training Operator:
  - `gang-scheduler-name` is set to `scheduler-plugins-scheduler`,
- Kueue:
  - `manageJobsWithoutQueueName` is enabled,
  - `batch/job` integration is disabled,
  - `waitForPodsReady` is disabled,
  - `LendingLimit` feature gate is enabled,
  - `enableClusterQueueResources` metrics is enabled,
- Codeflare operator:
  - the AppWrapper controller is enabled and configured as follows:
    - `userRBACAdmissionCheck` is disabled,
    - `schedulerName` is set to `scheduler-plugins-scheduler`,
    - `queueName` is set to `default-queue`,
    - pod priorities, resource requests and limits have been adjusted.

To work around https://issues.redhat.com/browse/RHOAIENG-7887 (a race condition
in OpenShift AI installation), do a rolling restart of the Kueue manager:
```sh
oc rollout restart deployment/kueue-controller-manager -n redhat-ods-applications
```

After doing the restart, verify that you see the following lines in the
kueue-controller-manager's log:
```sh
{"level":"info","ts":"2024-06-25T20:17:25.689638786Z","logger":"controller-runtime.builder","caller":"builder/webhook.go:189","msg":"Registering a validating webhook","GVK":"kubeflow.org/v1, Kind=PyTorchJob","path":"/validate-kubeflow-org-v1-pytorchjob"}
{"level":"info","ts":"2024-06-25T20:17:25.689698615Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:183","msg":"Registering webhook","path":"/validate-kubeflow-org-v1-pytorchjob"}
{"level":"info","ts":"2024-06-25T20:17:25.689743757Z","logger":"setup","caller":"jobframework/setup.go:81","msg":"Set up controller and webhook for job framework","jobFrameworkName":"kubeflow.org/pytorchjob"}

```
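
To check for these lines without scrolling through the full log, you can filter it (the deployment name is the one restarted above):
```sh
oc logs deployment/kueue-controller-manager -n redhat-ods-applications | grep pytorchjob
```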

## Kueue Configuration

Create Kueue's default flavor:
```sh
oc apply -f setup.RHOAI-v2.13/default-flavor.yaml
```

## Cluster Role

Create `mlbatch-edit` role:
```sh
oc apply -f setup.RHOAI-v2.13/mlbatch-edit-role.yaml
```

## Slack Cluster Queue

Create the designated slack `ClusterQueue`, which will be used to automate
minor adjustments to cluster capacity caused by node failures and
scheduler maintenance:
```sh
oc apply -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: slack-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  preemption:
    withinClusterQueue: LowerOrNewerEqualPriority
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8000m
      - name: "memory"
        nominalQuota: 128Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8
      - name: "nvidia.com/roce_gdr"
        nominalQuota: 1
      - name: "pods"
        nominalQuota: 100
EOF
```
Edit the above quantities to adjust the quota to the desired
values. Pod counts are optional and can be omitted from the list of
covered resources. The `lendingLimit` for each resource will be
dynamically adjusted by the MLBatch system to reflect reduced cluster
capacity. See [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md) for a
detailed discussion of the role of the slack `ClusterQueue`.
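
To observe the nominal quotas and any lending limits that MLBatch later sets on this queue, you can inspect the `ClusterQueue` directly (a read-only check; the name matches the manifest above):
```sh
oc get clusterqueue slack-cluster-queue -o yaml
```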
91 changes: 91 additions & 0 deletions setup.RHOAI-v2.13/TEAM-SETUP.md
@@ -0,0 +1,91 @@
# Team Setup

A *team* in MLBatch is a group of users that share a resource quota.

Before setting up your teams and quotas, please read [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md)
for a discussion of our recommended best practices.


Setting up a new team requires the cluster admin to create a project,
a user group, a quota, a queue, and the required role bindings as described below.

Create project:
```sh
oc new-project team1
```
Create user group:
```sh
oc adm groups new team1-edit-group
```
Add users to the group, for example:
```sh
oc adm groups add-users team1-edit-group user1
```
Bind the `mlbatch-edit` cluster role to the group in the team's namespace:
```sh
oc adm policy add-role-to-group mlbatch-edit team1-edit-group --role-namespace="" --namespace team1
```
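
As a quick sanity check (assuming the `mlbatch-edit` role grants AppWrapper permissions), you can verify that a team member is authorized to create workloads in the project:
```sh
# Impersonate user1 to test the binding; requires sufficient privileges to impersonate
oc auth can-i create appwrappers.workload.codeflare.dev -n team1 --as user1
```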

Specify the intended quota for the namespace by creating a `ClusterQueue`:
```sh
oc apply -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team1-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  preemption:
    withinClusterQueue: LowerOrNewerEqualPriority
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8000m
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "memory"
        nominalQuota: 128Gi
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "nvidia.com/gpu"
        nominalQuota: 16
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "nvidia.com/roce_gdr"
        nominalQuota: 4
        # borrowingLimit: 0
        # lendingLimit: 0
      - name: "pods"
        nominalQuota: 100
        # borrowingLimit: 0
        # lendingLimit: 0
EOF
```
Edit the above quantities to adjust the quota to the desired values. Pod counts
are optional and can be omitted from the list of covered resources.

Uncomment all `borrowingLimit` lines to prevent this namespace from borrowing
quota from other namespaces. Uncomment all `lendingLimit` lines to prevent other
namespaces from borrowing quota from this namespace.
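
For example, a fully isolated GPU entry, with both lines uncommented, would look like this (values taken from the manifest above):
```yaml
- name: "nvidia.com/gpu"
  nominalQuota: 16
  borrowingLimit: 0
  lendingLimit: 0
```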

Create a `LocalQueue` to bind the `ClusterQueue` to the namespace:
```sh
oc apply -n team1 -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: default-queue
spec:
  clusterQueue: team1-cluster-queue
EOF
```
We recommend naming the local queue `default-queue`, as `AppWrappers` will
default to this queue name.
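
If a workload needs to target the queue explicitly rather than rely on this default, the standard Kueue queue label can be added to its metadata; a minimal fragment:
```yaml
metadata:
  labels:
    kueue.x-k8s.io/queue-name: default-queue
```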

23 changes: 23 additions & 0 deletions setup.RHOAI-v2.13/UNINSTALL.md
@@ -0,0 +1,23 @@
# Uninstall

***First, remove all team projects and corresponding cluster queues.***
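
For a team configured as in [TEAM-SETUP.md](./TEAM-SETUP.md), that cleanup looks roughly like the following (names follow the `team1` example; adjust for each of your teams):
```sh
oc delete localqueue default-queue -n team1
oc delete clusterqueue team1-cluster-queue
oc delete project team1
```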

Then, to uninstall the MLBatch controllers and reclaim the corresponding
namespaces, run:
```sh
# OpenShift AI uninstall
oc delete dsc mlbatch-dsc
oc delete dsci mlbatch-dsci
oc delete subscription -n redhat-ods-operator rhods-operator
oc delete csv -n redhat-ods-operator -l operators.coreos.com/rhods-operator.redhat-ods-operator
oc delete crd featuretrackers.features.opendatahub.io \
    dscinitializations.dscinitialization.opendatahub.io \
    datascienceclusters.datasciencecluster.opendatahub.io
oc delete operators rhods-operator.redhat-ods-operator
oc delete operatorgroup -n redhat-ods-operator rhods-operator
oc delete namespace redhat-ods-applications redhat-ods-monitoring redhat-ods-operator

# Coscheduler uninstall
helm uninstall -n scheduler-plugins scheduler-plugins
oc delete namespace scheduler-plugins
```
31 changes: 31 additions & 0 deletions setup.RHOAI-v2.13/UPGRADE-FAST.md
@@ -0,0 +1,31 @@
# Upgrading from RHOAI 2.12

These instructions assume you installed and configured RHOAI 2.12 following
the MLBatch [install instructions for RHOAI-v2.12](../setup.RHOAI-v2.12/CLUSTER-SETUP.md)
or the [upgrade instructions for RHOAI-v2.12](../setup.RHOAI-v2.12/UPGRADE.md).

Your subscription will have automatically created an unapproved
install plan to upgrade to RHOAI 2.13.

Before beginning, verify that the expected install plan exists:
```sh
oc get ip -n redhat-ods-operator
```
Typical output would be:
```sh
NAME            CSV                     APPROVAL   APPROVED
install-kpzzl   rhods-operator.2.13.0   Manual     false
install-nqrbp   rhods-operator.2.10.0   Manual     true
install-st8vh   rhods-operator.2.11.0   Manual     true
install-xs6gq   rhods-operator.2.12.0   Manual     true
```

Assuming the install plan exists, you can begin the upgrade process.

There are no MLBatch modifications to the default RHOAI configuration maps
beyond those already made in previous installs. Therefore, you can simply
approve the install plan, replacing the example plan name below with the actual
value on your cluster:
```sh
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kpzzl
```
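
After approval, the operator upgrade proceeds on its own; you can confirm that the new CSV reaches the `Succeeded` phase with a read-only check:
```sh
oc get csv -n redhat-ods-operator
```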
33 changes: 33 additions & 0 deletions setup.RHOAI-v2.13/UPGRADE-STABLE.md
@@ -0,0 +1,33 @@
# Upgrading from RHOAI 2.10

These instructions assume you installed and configured RHOAI 2.10 following
the MLBatch [install instructions for RHOAI-v2.10](../setup.RHOAI-v2.10/CLUSTER-SETUP.md).

Your subscription will have automatically created an unapproved
install plan to upgrade to RHOAI 2.13.

Before beginning, verify that the expected install plan exists:
```sh
oc get ip -n redhat-ods-operator
```
Typical output would be:
```sh
NAME            CSV                     APPROVAL   APPROVED
install-kpzzl   rhods-operator.2.13.0   Manual     false
install-nqrbp   rhods-operator.2.10.0   Manual     true
```

Assuming the install plan exists, you can begin the upgrade process.

First, update the MLBatch modifications to the default RHOAI configuration maps:
```sh
oc apply -f setup.RHOAI-v2.13/mlbatch-upgrade-stable-configmaps.yaml
```

Second, approve the install plan, replacing the example plan name below with the actual
value on your cluster:
```sh
oc patch ip -n redhat-ods-operator --type merge --patch '{"spec":{"approved":true}}' install-kpzzl
```

Finally, create the Slack Cluster Queue as described in [CLUSTER-SETUP.md for RHOAI 2.13](./CLUSTER-SETUP.md#Slack-Cluster-Queue).
3 changes: 3 additions & 0 deletions setup.RHOAI-v2.13/coscheduler-priority-patch.yaml
@@ -0,0 +1,3 @@
- op: add
  path: /spec/template/spec/priorityClassName
  value: system-node-critical
4 changes: 4 additions & 0 deletions setup.RHOAI-v2.13/default-flavor.yaml
@@ -0,0 +1,4 @@
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor