document setup of slack cluster queue #36

Merged
merged 1 commit into from
Aug 20, 2024
43 changes: 43 additions & 0 deletions setup.k8s-v1.25/CLUSTER-SETUP.md
@@ -98,3 +98,46 @@ Create `mlbatch-edit` role:
```sh
kubectl apply -f setup.k8s-v1.25/mlbatch-edit-role.yaml
```

## Slack Cluster Queue

Create the designated slack `ClusterQueue`, which is used to automate
minor adjustments to cluster capacity caused by node failures and
scheduler maintenance.
```sh
kubectl apply -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: slack-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  preemption:
    withinClusterQueue: LowerOrNewerEqualPriority
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8000m
      - name: "memory"
        nominalQuota: 128Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8
      - name: "nvidia.com/roce_gdr"
        nominalQuota: 1
      - name: "pods"
        nominalQuota: 100
EOF
```
Edit the above quantities to adjust the quota to the desired
values. Pod counts are optional and can be omitted from the list of
covered resources. The `lendingLimit` for each resource will be
dynamically adjusted by the MLBatch system to reflect reduced cluster
capacity. See [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md) for a
detailed discussion of the role of the slack `ClusterQueue`.
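
As an illustration of the note on pod counts, a variant of the
`resourceGroups` stanza without the `pods` resource might look like the
sketch below. The quantities are the same placeholder values used in the
manifest above, not recommendations; the only change is that `pods` is
dropped from both `coveredResources` and the `resources` list, which
have to name the same resources.
```yaml
  # Sketch only: resourceGroups for slack-cluster-queue with pod counts omitted.
  # Quantities are placeholders; adjust them to the cluster's actual capacity.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8000m
      - name: "memory"
        nominalQuota: 128Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8
      - name: "nvidia.com/roce_gdr"
        nominalQuota: 1
```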
1 change: 1 addition & 0 deletions setup.k8s-v1.25/appwrapper/config_patch.yaml
@@ -13,6 +13,7 @@ data:
enable: false
defaultQueueName: default-queue
schedulerName: scheduler-plugins-scheduler
slackQueueName: slack-cluster-queue
userRBACAdmissionCheck: false
controllerManager:
health:
43 changes: 43 additions & 0 deletions setup.k8s-v1.30/CLUSTER-SETUP.md
@@ -104,3 +104,46 @@ will have local queue names and thus be subject to Kueue's quota management.
```sh
kubectl apply -f setup.k8s-v1.30/admission-policy.yaml
```

## Slack Cluster Queue

Create the designated slack `ClusterQueue`, which is used to automate
minor adjustments to cluster capacity caused by node failures and
scheduler maintenance.
```sh
kubectl apply -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: slack-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  preemption:
    withinClusterQueue: LowerOrNewerEqualPriority
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8000m
      - name: "memory"
        nominalQuota: 128Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8
      - name: "nvidia.com/roce_gdr"
        nominalQuota: 1
      - name: "pods"
        nominalQuota: 100
EOF
```
Edit the above quantities to adjust the quota to the desired
values. Pod counts are optional and can be omitted from the list of
covered resources. The `lendingLimit` for each resource will be
dynamically adjusted by the MLBatch system to reflect reduced cluster
capacity. See [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md) for a
detailed discussion of the role of the slack `ClusterQueue`.
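
For orientation only, the sketch below shows what a single resource
entry of the slack `ClusterQueue` could look like after such an
adjustment. It assumes, hypothetically, that 2 of the 8 GPUs covered by
the queue have become unavailable and that the adjustment lowers
`lendingLimit` by that amount. The actual values are written by the
MLBatch system rather than by hand, and the precise rule is the one
described in QUOTA_MAINTENANCE.md.
```yaml
# Hypothetical state of one resource entry in slack-cluster-queue
# after 2 GPUs become unavailable (values are illustrative only).
- name: "nvidia.com/gpu"
  nominalQuota: 8   # set by the cluster admin, as in the manifest above
  lendingLimit: 6   # assumed: lowered automatically to reflect reduced capacity
```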
1 change: 1 addition & 0 deletions setup.k8s-v1.30/appwrapper/config_patch.yaml
@@ -13,6 +13,7 @@ data:
enable: false
defaultQueueName: default-queue
schedulerName: scheduler-plugins-scheduler
slackQueueName: slack-cluster-queue
userRBACAdmissionCheck: false
controllerManager:
health:
46 changes: 46 additions & 0 deletions setup.tmpl/CLUSTER-SETUP.md.tmpl
@@ -196,3 +196,49 @@ will have local queue names and thus be subject to Kueue's quota management.
{{ .KUBECTL }} apply -f setup.{{ .VERSION }}/admission-policy.yaml
```
{{- end }}

{{- if .SLACKCQ }}

## Slack Cluster Queue

Create the designated slack `ClusterQueue`, which is used to automate
minor adjustments to cluster capacity caused by node failures and
scheduler maintenance.
```sh
{{ .KUBECTL }} apply -f- << EOF
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: slack-cluster-queue
spec:
  namespaceSelector: {}
  cohort: default-cohort
  preemption:
    withinClusterQueue: LowerOrNewerEqualPriority
    reclaimWithinCohort: Any
    borrowWithinCohort:
      policy: Never
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "nvidia.com/roce_gdr", "pods"]
    flavors:
    - name: default-flavor
      resources:
      - name: "cpu"
        nominalQuota: 8000m
      - name: "memory"
        nominalQuota: 128Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 8
      - name: "nvidia.com/roce_gdr"
        nominalQuota: 1
      - name: "pods"
        nominalQuota: 100
EOF
```
Edit the above quantities to adjust the quota to the desired
values. Pod counts are optional and can be omitted from the list of
covered resources. The `lendingLimit` for each resource will be
dynamically adjusted by the MLBatch system to reflect reduced cluster
capacity. See [QUOTA_MAINTENANCE.md](../QUOTA_MAINTENANCE.md) for a
detailed discussion of the role of the slack `ClusterQueue`.
{{- end }}
1 change: 1 addition & 0 deletions setup.tmpl/Kubernetes-v1.25.yaml
@@ -4,3 +4,4 @@ OPENSHIFT: false
VERSION: k8s-v1.25
KUBECTL: kubectl
VAP: false
SLACKCQ: true
1 change: 1 addition & 0 deletions setup.tmpl/Kubernetes-v1.30.yaml
@@ -4,3 +4,4 @@ OPENSHIFT: false
VERSION: k8s-v1.30
KUBECTL: kubectl
VAP: true
SLACKCQ: true
3 changes: 2 additions & 1 deletion setup.tmpl/RHOAI-v2.10.yaml
@@ -2,4 +2,5 @@

OPENSHIFT: true
VERSION: RHOAI-v2.10
KUBECTL: oc
KUBECTL: oc
SLACKCQ: false
3 changes: 1 addition & 2 deletions setup.tmpl/RHOAI-v2.11.yaml
@@ -3,5 +3,4 @@
OPENSHIFT: true
VERSION: RHOAI-v2.11
KUBECTL: oc


SLACKCQ: false