add support for cluster downsize
We do this by adding backoffLimitPerIndex and setting it to 0,
meaning that a pod (follower broker) is not recreated when it
is killed. We might want to do this for autoscaling. See
examples/elasticity/downsize for details.

Signed-off-by: vsoch <[email protected]>
vsoch committed Jan 10, 2024
1 parent 0a119ba commit 1c01f9d
Showing 22 changed files with 480 additions and 1,112 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/helm.yaml
@@ -13,10 +13,10 @@ jobs:
name: Prepare chart
steps:
- name: Checkout Repository
uses: actions/checkout@v3
uses: actions/checkout@v4
- uses: actions/setup-go@v3
with:
go-version: ^1.18.1
go-version: ^1.20
- name: GHCR Login
if: (github.event_name != 'pull_request')
uses: docker/login-action@v2
@@ -44,4 +44,4 @@ jobs:
PKG_RESPONSE=$(helm package ./chart)
echo "$PKG_RESPONSE"
CHART_TAR_GZ=$(basename "$PKG_RESPONSE")
helm push "$CHART_TAR_GZ" oci://ghcr.io/flux-framework/flux-operator-helm
helm push "$CHART_TAR_GZ" oci://ghcr.io/flux-framework/flux-operator-helm
2 changes: 1 addition & 1 deletion .github/workflows/main.yaml
@@ -92,7 +92,7 @@ jobs:
run: sudo apt-get update && sudo apt-get install -y libsodium-dev libzmq3-dev libczmq-dev

- name: Start minikube
uses: medyagh/setup-minikube@697f2b7aaed5f70bf2a94ee21a4ec3dde7b12f92 # v0.0.9
uses: medyagh/setup-minikube@master

- name: Pull Docker Containers to MiniKube
env:
1 change: 1 addition & 0 deletions Makefile
@@ -342,6 +342,7 @@ $(KUSTOMIZE): $(LOCALBIN)
controller-gen: $(CONTROLLER_GEN) ## Download controller-gen locally if necessary.
$(CONTROLLER_GEN): $(LOCALBIN)
GOBIN=$(LOCALBIN) go install sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_TOOLS_VERSION)
go mod tidy

# Build the latest openapi-gen from source
.PHONY: openapi-gen
6 changes: 6 additions & 0 deletions api/v1alpha2/minicluster_types.go
@@ -66,6 +66,12 @@ type MiniClusterSpec struct {
// +optional
ShareProcessNamespace bool `json:"shareProcessNamespace"`

// Restart failed workers (defaults to true)
// This is setting backoffLimitPerIndex to 0 on the backend
// This requires an additional feature gate to be enabled.
// +optional
SuspendWorkers bool `json:"suspendWorkers"`

// Cleanup the pods and storage when the index broker pod is complete
// +kubebuilder:default=false
// +default=false
5 changes: 5 additions & 0 deletions api/v1alpha2/swagger.json
@@ -622,6 +622,11 @@
"format": "int32",
"default": 1
},
"suspendWorkers": {
"description": "Restart failed workers (defaults to true) This is setting backoffLimitPerIndex to 0 on the backend This requires an additional feature gate to be enabled.",
"type": "boolean",
"default": false
},
"tasks": {
"description": "Total number of CPUs being run across entire cluster",
"type": "integer",
8 changes: 8 additions & 0 deletions api/v1alpha2/zz_generated.openapi.go
5 changes: 5 additions & 0 deletions chart/templates/minicluster-crd.yaml
@@ -689,6 +689,11 @@ spec:
pods) This is also the minimum number required to start Flux
format: int32
type: integer
suspendWorkers:
description: Restart failed workers (defaults to true) This is setting
backoffLimitPerIndex to 0 on the backend This requires an additional
feature gate to be enabled.
type: boolean
tasks:
default: 1
description: Total number of CPUs being run across entire cluster
5 changes: 5 additions & 0 deletions config/crd/bases/flux-framework.org_miniclusters.yaml
@@ -695,6 +695,11 @@ spec:
in pods) This is also the minimum number required to start Flux
format: int32
type: integer
suspendWorkers:
description: Restart failed workers (defaults to true) This is setting
backoffLimitPerIndex to 0 on the backend This requires an additional
feature gate to be enabled.
type: boolean
tasks:
default: 1
description: Total number of CPUs being run across entire cluster
6 changes: 6 additions & 0 deletions controllers/flux/job.go
@@ -77,6 +77,12 @@ func NewMiniClusterJob(cluster *api.MiniCluster) (*batchv1.Job, error) {
},
}

// Does the user want backoff limit per index?
if cluster.Spec.SuspendWorkers {
var backoffLimitPerIndex int32 = 0
job.Spec.BackoffLimitPerIndex = &backoffLimitPerIndex
}

// Add Affinity to map one pod / node only if the user hasn't disabled it
if !cluster.Spec.Network.DisableAffinity {
job.Spec.Template.Spec.Affinity = getAffinity(cluster)
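The branch added in this hunk sets the pointer field only when the boolean is true. The Kubernetes API reference states that `backoffLimitPerIndex` can only be set when the Job's completion mode is `Indexed` and the pod restart policy is `Never`. The sketch below illustrates both points using only the standard library; `jobSpec`, `applySuspendWorkers`, and `validate` are hypothetical stand-ins, not the `batchv1` types or the operator's code.

```go
package main

import (
	"errors"
	"fmt"
)

// jobSpec is a hypothetical stand-in for the few Job fields relevant here;
// it is not the Kubernetes batchv1.JobSpec type.
type jobSpec struct {
	CompletionMode       string // "Indexed" or "NonIndexed"
	RestartPolicy        string // restart policy of the pod template
	BackoffLimitPerIndex *int32 // nil means the field is left unset
}

// applySuspendWorkers mirrors the branch in the diff: the pointer is set
// (to zero) only when the boolean is true, and left nil otherwise.
func applySuspendWorkers(spec *jobSpec, suspendWorkers bool) {
	if suspendWorkers {
		var limit int32 = 0
		spec.BackoffLimitPerIndex = &limit
	}
}

// validate checks the preconditions the Kubernetes API reference lists
// for backoffLimitPerIndex (assumed here; the API server enforces them).
func validate(spec jobSpec) error {
	if spec.BackoffLimitPerIndex == nil {
		return nil
	}
	if spec.CompletionMode != "Indexed" {
		return errors.New("backoffLimitPerIndex requires completionMode=Indexed")
	}
	if spec.RestartPolicy != "Never" {
		return errors.New("backoffLimitPerIndex requires restartPolicy=Never")
	}
	return nil
}

func main() {
	spec := jobSpec{CompletionMode: "Indexed", RestartPolicy: "Never"}
	applySuspendWorkers(&spec, true)
	fmt.Println(*spec.BackoffLimitPerIndex, validate(spec))
}
```

Using a pointer keeps "unset" distinguishable from an explicit zero, which matters here because zero is the meaningful value.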
82 changes: 0 additions & 82 deletions controllers/flux/suite_test.go

This file was deleted.

13 changes: 13 additions & 0 deletions docs/getting_started/custom-resource-definition.md
@@ -92,6 +92,19 @@ This would be equivalent to giving a start command of `sleep infinity` however o
(e.g., if there is a flux shutdown from within the Flux instance) the sleep command would
not exit with a failed code.

### suspendWorkers

By default, when a pod fails, Kubernetes attempts to recreate it. With the [JobBackoffLimitPerIndex](https://kubernetes.io/blog/2023/08/21/kubernetes-1-28-jobapi-update/#backoff-limit-per-index) feature gate (Kubernetes 1.28) you can set an explicit number of retries allowed per index. For example, a value of 0 means the pod is allowed to fail once and is not recreated.
Since we primarily want this to be a choice between "restart the workers" and "don't restart them," we expose it as a boolean.
Setting `suspendWorkers` to true indicates that on failure, we do not restart the worker.

```yaml
spec:
suspendWorkers: true
```

This can be useful when autoscaling down: you can drain a node and then delete the pod without it being recreated.
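To make the retry semantics concrete, here is a simplified model of the per-index decision, assuming an index is marked failed once its failure count exceeds the limit. The function `shouldRecreate` is hypothetical and only illustrates the arithmetic; it is not how the Job controller is implemented.

```go
package main

import "fmt"

// shouldRecreate reports whether a failed pod for one index would be
// retried, given how many times that index has failed so far (including
// the current failure) and the per-index backoff limit.
// This is a simplified model for illustration only.
func shouldRecreate(failures, backoffLimitPerIndex int32) bool {
	return failures <= backoffLimitPerIndex
}

func main() {
	// Compare the limit used by suspendWorkers (0) with a larger limit.
	for _, limit := range []int32{0, 2} {
		for failures := int32(1); failures <= 3; failures++ {
			fmt.Printf("limit=%d failures=%d recreate=%t\n",
				limit, failures, shouldRecreate(failures, limit))
		}
	}
}
```

With a limit of 0, the first failure already exceeds the allowed retries, which is the behavior `suspendWorkers: true` relies on.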

### launcher

If you are using an executor that launches Flux Jobs (e.g., workflow managers such as Snakemake and Nextflow do!)
5 changes: 5 additions & 0 deletions examples/dist/flux-operator-arm.yaml
@@ -701,6 +701,11 @@ spec:
in pods) This is also the minimum number required to start Flux
format: int32
type: integer
suspendWorkers:
description: Restart failed workers (defaults to true) This is setting
backoffLimitPerIndex to 0 on the backend This requires an additional
feature gate to be enabled.
type: boolean
tasks:
default: 1
description: Total number of CPUs being run across entire cluster
5 changes: 5 additions & 0 deletions examples/dist/flux-operator.yaml
@@ -701,6 +701,11 @@ spec:
in pods) This is also the minimum number required to start Flux
format: int32
type: integer
suspendWorkers:
description: Restart failed workers (defaults to true) This is setting
backoffLimitPerIndex to 0 on the backend This requires an additional
feature gate to be enabled.
type: boolean
tasks:
default: 1
description: Total number of CPUs being run across entire cluster