prepare 1.7.2 docs, introduction of specified-pod is only added into current(wait review)

Signed-off-by: Abner-1 <[email protected]>
ABNER-1 committed Oct 10, 2024
1 parent cc4ea61 commit 9489c32
Showing 28 changed files with 560 additions and 370 deletions.
6 changes: 3 additions & 3 deletions docs/installation.md
@@ -23,7 +23,7 @@ $ helm repo add openkruise https://openkruise.github.io/charts/
$ helm repo update

# Install the latest version.
$ helm install kruise openkruise/kruise --version 1.7.1
$ helm install kruise openkruise/kruise --version 1.7.2
```
**Note:** [Changelog](https://github.com/openkruise/kruise/blob/master/CHANGELOG.md).

@@ -37,7 +37,7 @@ $ helm repo add openkruise https://openkruise.github.io/charts/
$ helm repo update

# Upgrade to the latest version.
$ helm upgrade kruise openkruise/kruise --version 1.7.1 [--force]
$ helm upgrade kruise openkruise/kruise --version 1.7.2 [--force]
```

Note that:
@@ -83,7 +83,7 @@ The following table lists the configurable parameters of the chart and their def
| `manager.log.level` | Log level that kruise-manager printed | `4` |
| `manager.replicas` | Replicas of kruise-controller-manager deployment | `2` |
| `manager.image.repository` | Repository for kruise-manager image | `openkruise/kruise-manager` |
| `manager.image.tag` | Tag for kruise-manager image | `v1.7.1` |
| `manager.image.tag` | Tag for kruise-manager image | `v1.7.2` |
| `manager.resources.limits.cpu` | CPU resource limit of kruise-manager container | `200m` |
| `manager.resources.limits.memory` | Memory resource limit of kruise-manager container | `512Mi` |
| `manager.resources.requests.cpu` | CPU resource request of kruise-manager container | `100m` |
193 changes: 109 additions & 84 deletions docs/user-manuals/advancedstatefulset.md
@@ -76,10 +76,10 @@ spec:
image: nginx:alpine
```

### User Stories
#### User Stories
The main motivation of this feature is to support a more flexible StatefulSet, a building block in an ecosystem where Stateful applications can be migrated across Kubernetes clusters with more automation. As follows:

#### Story 1
##### Story 1

**Migrating across namespaces**: Many organizations use namespaces for team isolation. Consider a team that is migrating a `StatefulSet` to a new namespace in a cluster. Migration could be motivated by a branding change, or a requirement to move out of a shared namespace. Consider the StatefulSet `my-app` with `replicas: 5`, running in a shared namespace.

@@ -108,15 +108,48 @@ ordinals.start: 0 ordinals.start: 3

The `replicasStatefulSet` and `replicas` fields should be updated jointly, depending on the requirements of the migration.

#### Story 2
##### Story 2

**Migrating across clusters**: Organizations taking a multi cluster approach may need to move workloads across clusters due to capacity constraints, infrastructure constraints, or for better application isolation. Similar to namespace migration, the application operator should manage network connectivity, volumes and slice orchestration.

#### Story 3
##### Story 3

**Non-Zero Based Indexing:** A user may want to number their StatefulSet starting from ordinal `1`, rather than ordinal `0`. Using
`1`-based numbering may be easier to reason about and conceptualize (e.g. ordinal `k` is the `k`-th replica, not the `k+1`-th replica).
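
As a minimal sketch of Story 3, assuming Advanced StatefulSet exposes the `ordinals.start` field in the same place as the `ordinals.start` shown in the migration diagram above, a workload numbered from `1` might look like this. The `replicas` value is illustrative only.

```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
  # ...
  replicas: 3        # illustrative value
  ordinals:
    start: 1         # assumed field path; Pods would get ordinals 1, 2, 3
```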

## Scale features

### PersistentVolumeClaim retention

**FEATURE STATE:** Kruise v1.1.0

If you have enabled the `StatefulSetAutoDeletePVC` feature-gate during [Kruise installation or upgrade](../installation#optional-feature-gate),
you can use `.spec.persistentVolumeClaimRetentionPolicy` field to control if and how PVCs are deleted during the lifecycle of a StatefulSet.

This works the same way as in the upstream StatefulSet (K8s >= 1.23 [alpha]); please refer to [the upstream document for it](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#persistentvolumeclaim-retention).
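
A minimal sketch of the field, assuming Advanced StatefulSet accepts the same `whenDeleted` and `whenScaled` keys and the same `Retain`/`Delete` values as the upstream StatefulSet:

```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
  # ...
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete   # remove PVCs when the StatefulSet itself is deleted
    whenScaled: Retain    # keep PVCs of Pods removed by scaling down
```

With this combination, scaling down and back up would reattach the existing volumes, while deleting the workload would clean them up.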

### Scaling with rate limiting

**FEATURE STATE:** Kruise v0.10.0

To avoid creating all failing Pods at once when a new Advanced StatefulSet is applied, a `maxUnavailable` field for the scale strategy has been available since Kruise `v0.10.0`.

```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
  # ...
  replicas: 100
  scaleStrategy:
    maxUnavailable: 10% # percentage or absolute number
```
When this field is set, Advanced StatefulSet creates Pods with the guarantee that the number of unavailable Pods during scaling cannot exceed this value.
For example, the StatefulSet above will first create 10 Pods. After that, it will create one more Pod each time a created Pod becomes running and ready.
Note that this field only works with the `Parallel` podManagementPolicy.
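
Because the scale limit depends on the `Parallel` policy, a fuller sketch would set both fields together. The absolute value `10` below is illustrative and is equivalent to `10%` of 100 replicas.

```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
  # ...
  replicas: 100
  podManagementPolicy: Parallel   # needed for scaleStrategy.maxUnavailable to apply
  scaleStrategy:
    maxUnavailable: 10            # absolute number instead of a percentage
```
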
### Ordinals reserve(skip)
Since Advanced StatefulSet `v1beta1` (Kruise >= v0.7.0), it supports ordinals reserve.
@@ -138,45 +171,35 @@ spec:
For an Advanced StatefulSet with `replicas=4, reserveOrdinals=[1]`, the ordinals of running Pods will be `[0,2,3,4]`.

- If you want to migrate Pod-3 and reserve this ordinal, just append `3` into `reserveOrdinals` list.
Then controller will delete Pod-3 and create Pod-5 (existing Pods will be `[0,2,4,5]`).
Then controller will delete Pod-3 and create Pod-5 (existing Pods will be `[0,2,4,5]`).
- If you just want to delete Pod-3, you should append `3` into `reserveOrdinals` list and set `replicas` to `3`.
Then controller will delete Pod-3 (existing Pods will be `[0,2,4]`).
Then controller will delete Pod-3 (existing Pods will be `[0,2,4]`).
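
The second case above (deleting Pod-3 without replacing it) can be sketched as the following spec change, starting from `replicas=4, reserveOrdinals=[1]`:

```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
  # ...
  replicas: 3        # lowered from 4 so that no replacement Pod is created
  reserveOrdinals:
  - 1                # already reserved
  - 3                # newly appended; Pod-3 is deleted
```

The remaining Pods are then `[0,2,4]`, as described in the list above.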

## MaxUnavailable
### Specified Pod Deletion

Advanced StatefulSet adds a `maxUnavailable` capability in the `RollingUpdateStatefulSetStrategy` to allow parallel Pod
updates with the guarantee that the number of unavailable pods during the update cannot exceed this value.
It is only allowed to use when the podManagementPolicy is `Parallel`.
**FEATURE STATE:** Kruise v1.5.5, Kruise v1.6.4, Kruise v1.7.2+

This feature achieves similar update efficiency like Deployment for cases where the order of
update is not critical to the workload. Without this feature, the native `StatefulSet` controller can only
update Pods one by one even if the podManagementPolicy is `Parallel`.
Compared with deleting a Pod manually, deleting a Pod by labeling it with `apps.kruise.io/specified-delete: true` is protected by the `maxUnavailable` of the Advanced StatefulSet,
and it triggers the `PreparingDelete` lifecycle hook (see below).

```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
apiVersion: v1
kind: Pod
metadata:
labels:
# ...
apps.kruise.io/specified-delete: true
spec:
containers:
- name: main
# ...
podManagementPolicy: Parallel
updateStrategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 20%
```

For example, assuming an Advanced StatefulSet has five Pods named P0 to P4, and the application can
tolerate losing three replicas temporally. If we want to update the StatefulSet Pod spec from v1 to
v2, we can perform the following steps using the `MaxUnavailable` feature for fast update.

1. Set `MaxUnavailable` to 3 to allow three unavailable Pods maximally.
2. Optionally, Set `Partition` to 4 in case canary update is needed. Partition means all Pods with an ordinal that is
greater than or equal to the partition will be updated. In this case P4 will be updated even though `MaxUnavailable`
is 3.
3. After P4 finish update, change `Partition` to 0. The controller will update P1,P2 and P3 concurrently.
Note that with default StatefulSet, the Pods will be updated sequentially in the order of P3, P2, P1.
4. Once one of P1, P2 and P3 finishes update, P0 will be updated immediately.
When the controller receives the above Pod update, it triggers the deletion process for the Pod carrying the specified-delete label and ensures that the `maxUnavailable` limit is not exceeded.
The Pod will be rebuilt by the workload if its ordinal is not reserved.
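
Kubernetes label values must be strings, and an unquoted `true` is parsed by YAML as a boolean, so in a real manifest the value should be quoted. A minimal sketch follows; the Pod name `sample-2` is hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sample-2    # hypothetical Pod owned by an Advanced StatefulSet
  labels:
    # ...
    apps.kruise.io/specified-delete: "true"   # quoted so it stays a string
```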

## In-Place Update
## Update features
### In-Place Update

Advanced StatefulSet adds a `podUpdatePolicy` field in `spec.updateStrategy.rollingUpdate`
which controls recreate or in-place update for Pods.
@@ -244,14 +267,38 @@ spec:
maxUnavailable: 2
```

## Update sequence
### Pre-download image for in-place update

**FEATURE STATE:** Kruise v0.10.0

If you have enabled the `PreDownloadImageForInPlaceUpdate` feature-gate during [Kruise installation or upgrade](../installation#optional-feature-gate),
the Advanced StatefulSet controller will automatically pre-download the new image onto the nodes of all old Pods.
This can noticeably speed up application upgrades.

By default, the pre-download parallelism for each new image is `1`, which means the image is downloaded onto the nodes one by one.
You can change the parallelism with the `apps.kruise.io/image-predownload-parallelism` annotation on the Advanced StatefulSet, according to the capability of your image registry.
For registries with more bandwidth or P2P image distribution, a larger parallelism can speed up the pre-download process.

Since Kruise v1.1.0, you can use `apps.kruise.io/image-predownload-min-updated-ready-pods` to make sure that pre-downloading of the new image starts only after a certain number of new Pods have been updated and are ready. Its value can be an absolute number or a percentage.

```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
metadata:
  annotations:
    apps.kruise.io/image-predownload-parallelism: "10"
    apps.kruise.io/image-predownload-min-updated-ready-pods: "3"
```

Note that, to avoid unnecessary image downloads, the controller currently pre-downloads images only for Advanced StatefulSets with replicas > `3`.

### Update sequence

Advanced StatefulSet adds a `unorderedUpdate` field in `spec.updateStrategy.rollingUpdate`, which contains strategies for non-ordered update.
If `unorderedUpdate` is not nil, Pods will be updated in a non-ordered sequence. Note that `unorderedUpdate` only works with the `Parallel` podManagementPolicy.

Currently `unorderedUpdate` only contains one field: `priorityStrategy`.

### Priority strategy
#### Priority strategy

This strategy defines rules for calculating the priority of updating pods.
All update candidates will be applied with the priority terms.
@@ -291,79 +338,57 @@ spec:
unorderedUpdate:
priorityStrategy:
orderPriority:
- orderedKey: some-label-key
- orderedKey: some-label-key
```

## Paused update
### MaxUnavailable

`paused` indicates that Pods updating is paused, controller will not update Pods but just maintain the number of replicas.
Advanced StatefulSet adds a `maxUnavailable` capability to the `RollingUpdateStatefulSetStrategy` to allow parallel Pod
updates with the guarantee that the number of unavailable Pods during the update cannot exceed this value.
It can only be used when the podManagementPolicy is `Parallel`.

This feature achieves update efficiency similar to that of a Deployment in cases where the order of
updates is not critical to the workload. Without this feature, the native `StatefulSet` controller can only
update Pods one by one, even if the podManagementPolicy is `Parallel`.

```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
# ...
podManagementPolicy: Parallel
updateStrategy:
type: RollingUpdate
rollingUpdate:
paused: true
```

## Pre-download image for in-place update

**FEATURE STATE:** Kruise v0.10.0

If you have enabled the `PreDownloadImageForInPlaceUpdate` feature-gate during [Kruise installation or upgrade](../installation#optional-feature-gate),
Advanced StatefulSet controller will automatically pre-download the image you want to update to the nodes of all old Pods.
It is quite useful to accelerate the progress of applications upgrade.

The parallelism of each new image pre-downloading by Advanced StatefulSet is `1`, which means the image is downloaded on nodes one by one.
You can change the parallelism using `apps.kruise.io/image-predownload-parallelism` annotation on Advanced StatefulSet according to the capability of image registry,
for registries with more bandwidth and P2P image downloading ability, a larger parallelism can speed up the pre-download process.

Since Kruise v1.1.0, you can use `apps.kruise.io/image-predownload-min-updated-ready-pods` to make sure the new image starting pre-download after a few new Pods have been updated ready. Its value can be absolute number or percentage.

```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
metadata:
annotations:
apps.kruise.io/image-predownload-parallelism: "10"
apps.kruise.io/image-predownload-min-updated-ready-pods: "3"
maxUnavailable: 20%
```

Note that to avoid most unnecessary image downloading, now controller will only pre-download images for Advanced StatefulSet with replicas > `3`.
For example, assume an Advanced StatefulSet has five Pods named P0 to P4, and the application can
tolerate losing three replicas temporarily. If we want to update the StatefulSet Pod spec from v1 to
v2, we can perform the following steps using the `MaxUnavailable` feature for a fast update.

## Scaling with rate limiting
1. Set `MaxUnavailable` to 3 to allow at most three unavailable Pods.
2. Optionally, set `Partition` to 4 if a canary update is needed. Partition means that all Pods with an ordinal that is
greater than or equal to the partition will be updated. In this case P4 will be updated even though `MaxUnavailable`
is 3.
3. After P4 finishes updating, change `Partition` to 0. The controller will update P1, P2 and P3 concurrently.
Note that with the default StatefulSet, the Pods would be updated sequentially in the order P3, P2, P1.
4. Once one of P1, P2 and P3 finishes updating, P0 will be updated immediately.
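
The steps above can be sketched as the following spec, shown in its canary state (steps 1 and 2). The `replicas` value mirrors the five Pods P0 to P4 of the example.

```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
  # ...
  replicas: 5
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 3   # step 1: at most three Pods unavailable
      partition: 4        # step 2: only P4 is updated; set to 0 for step 3
```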

**FEATURE STATE:** Kruise v0.10.0
### Paused update

To avoid creating all failure pods at once when a new CloneSet applied, a `maxUnavailable` field for scale strategy has been added since Kruise `v0.10.0`.
`paused` indicates that Pod updating is paused: the controller will not update Pods and will only maintain the number of replicas.

```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
# ...
replicas: 100
scaleStrategy:
maxUnavailable: 10% # percentage or absolute number
updateStrategy:
rollingUpdate:
paused: true
```

When this field has been set, Advanced StatefulSet will create pods with the guarantee that the number of unavailable pods during the update cannot exceed this value.

For example, the StatefulSet will firstly create 10 pods. After that, it will create one more pod only if one pod created has been running and ready.

Note that it can just be allowed to work with Parallel podManagementPolicy.

## PersistentVolumeClaim retention

**FEATURE STATE:** Kruise v1.1.0

If you have enabled the `StatefulSetAutoDeletePVC` feature-gate during [Kruise installation or upgrade](../installation#optional-feature-gate),
you can use `.spec.persistentVolumeClaimRetentionPolicy` field to control if and how PVCs are deleted during the lifecycle of a StatefulSet.

This is same to the upstream StatefulSet (K8s >= 1.23 [alpha]), please refer to [the upstream document for it](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#persistentvolumeclaim-retention).

## Lifecycle hook

**FEATURE STATE:** Kruise v0.8.0