From ed48970de70d2667661ed8f4f03b2a1419c752fb Mon Sep 17 00:00:00 2001 From: thiyyakat Date: Mon, 22 Sep 2025 15:15:10 +0530 Subject: [PATCH 01/11] Add proposal for preservation of failed machines --- docs/proposals/failed-machine-preservation.md | 103 ++++++++++++++++++ 1 file changed, 103 insertions(+) create mode 100644 docs/proposals/failed-machine-preservation.md diff --git a/docs/proposals/failed-machine-preservation.md b/docs/proposals/failed-machine-preservation.md new file mode 100644 index 000000000..684e59826 --- /dev/null +++ b/docs/proposals/failed-machine-preservation.md @@ -0,0 +1,103 @@ +# Preservation of Failed Machines + + + +- [Preservation of Failed Machines](#preservation-of-failed-machines) + - [Objective](#objective) + - [Solution Design](#solution-design) + - [State Machine](#state-machine) + - [Use Cases](#use-cases) + + + + +## Objective + +Currently, the Machine Controller Manager(MCM) moves Machines with errors to the `Unknown` phase, and after the configured `machineHealthTimeout` seconds, to the `Failed` phase. +`Failed` machines are swiftly moved to the `Terminating` phase during which the node is drained and the `Machine` object is deleted. This rapid cleanup prevents SRE/operators/support from conducting an analysis on the VM and makes finding root cause of failure more difficult. + +This document proposes enhancing MCM, such that: +* VMs of `Failed` machines are retained temporarily for analysis +* There is a configurable limit to the number of `Failed` machines that can be preserved +* There is a configurable limit to the duration for which such machines are preserved +* Users can specify which healthy machines they would like to preserve in case of failure +* Users can request MCM to delete a preserved `Failed` machine, even before the timeout expires + +## Solution Design + +In order to achieve the objectives mentioned, the following are proposed: +1. Enhance `machineControllerManager` configuration in the `ShootSpec`, to specify the max number of failed machines to be preserved, +and the time duration for which these machines will be preserved. + ``` + machineControllerManager: + failedMachinePreserveMax: 2 + failedMachinePreserveTimeout: 3h + ``` + * Since gardener worker pool can correspond to `1..N` MachineDeployments depending on number of zones, `failedMachinePreserveMax` will be distributed across N machine deployments. + * `failedMachinePreserveMax` must be chosen such that it can be appropriately distributed across the MachineDeployments. +2. Allow user/operator to explicitly request for preservation of a machine if it moves to `Failed` phase with the use of an annotation : `node.machine.sapcloud.io/preserve-when-failed=true`. +When such an annotated machine transitions from `Unknown` to `Failed`, it is prevented from moving to `Terminating` phase until `failedMachinePreserveTimeout` expires. + * A user/operator can request MCM to stop preserving a preserved `Failed` machine by adding/modifying the annotation: `node.machine.sapcloud.io/preserve-when-failed=false`. + * For a machine thus annotated, MCM will move it to `Terminating` phase even if `failedMachinePreserveTimeout` has not expired. +3. If an un-annotated machine moves to `Failed` phase, and the `failedMachinePreserveMax` has not been reached, MCM will auto-preserve this machine. +4. 
MCM will be modified to introduce a new stage in the `Failed` phase: `machineutils.PreserveFailed`, and a failed machine that is preserved by MCM will be transitioned to this stage after moving to `Failed`. + * In this new stage, pods can be evicted and scheduled on other healthy machines, and the user/operator can wait for the corresponding VM to potentially recover. If the machine moves to `Running` phase on recovery, new pods can be scheduled on it. It is yet to be determined whether this feature will be required. + + +## State Machine + +The behaviour described above can be summarised using the state machine below: + +``` +(Running Machine) +├── [User adds `node.machine.sapcloud.io/preserve-when-failed=true`] → (Running + Requested) +└── [Machine fails + capacity available] → (PreserveFailed) + +(Running + Requested) +├── [Machine fails + capacity available] → (PreserveFailed) +├── [Machine fails + no capacity] → Failed → Terminating +└── [User removes `node.machine.sapcloud.io/preserve-when-failed=true`] → (Running) + +(PreserveFailed) +├── [User adds `node.machine.sapcloud.io/preserve-when-failed=false`] → Terminating +└── [failedMachinePreserveTimeout expires] → Terminating + +``` +In the above state machine, the phase `Running` also includes machines that are in the process of creation for which no errors have been encountered yet. +The transition of moving a machine from `PreserveFailed` to `Running` has not been shown since we haven't determined whether it is in scope for the current iteration of this feature. + +## Use Cases: + +### Use Case 1: Proactive Preservation Request +**Scenario:** Operator suspects a machine might fail and wants to ensure preservation for analysis. +#### Steps: +1. Operator annotates node with `node.machine.sapcloud.io/preserve-when-failed=true` +2. Machine fails later +3. MCM preserves the machine (if capacity allows) +4. Operator analyzes the failed VM +5. Operator releases the failed machine by setting `node.machine.sapcloud.io/preserve-when-failed=false` on the node object + +### Use Case 2: Automatic Preservation +**Scenario:** Machine fails unexpectedly, no prior annotation. +#### Steps: +1. Machine transitions to Failed state +2. MCM checks preservation capacity +3. If capacity available, machine moved to `PreserveFailed` phase by MCM +4. After timeout, machine is terminated by MCM + +### Use Case 3: Capacity Management +**Scenario:** Multiple machines fail when preservation capacity is full. +#### Steps: +1. Machines M1, M2 already preserved (capacity = 2) +2. Machine M3 fails with annotation `node.machine.sapcloud.io/preserve-when-failed=true` set +3. MCM cannot preserve M3 due to capacity limits +4. M3 moved from `Failed` to `Terminating` by MCM, following which it is deleted + +### Use Case 4: Early Release +**Scenario:** Operator has performed his analysis and no longer requires machine to be preserved + +#### Steps: +1. Machine M1 is in `PreserveFailed` phase +2. Operator adds: `node.machine.sapcloud.io/preserve-when-failed=false` to node. +3. MCM transitions M1 to `Terminating` +4. Capacity becomes available for preserving future `Failed` machines. 
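For illustration, the preservation request and early release described above could be performed with plain node annotations (the node name is a placeholder):

```bash
# Request preservation of this machine's VM in case it moves to the Failed phase
kubectl annotate node <node-name> node.machine.sapcloud.io/preserve-when-failed=true

# Release a preserved Failed machine before failedMachinePreserveTimeout expires
kubectl annotate node <node-name> --overwrite node.machine.sapcloud.io/preserve-when-failed=false
```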
From 50e6fd15f3a0742ad45054c8e94165c3a84e8a4b Mon Sep 17 00:00:00 2001 From: thiyyakat Date: Tue, 23 Sep 2025 09:22:55 +0530 Subject: [PATCH 02/11] Add limitations --- docs/proposals/failed-machine-preservation.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/docs/proposals/failed-machine-preservation.md b/docs/proposals/failed-machine-preservation.md index 684e59826..5f1798439 100644 --- a/docs/proposals/failed-machine-preservation.md +++ b/docs/proposals/failed-machine-preservation.md @@ -42,6 +42,7 @@ When such an annotated machine transitions from `Unknown` to `Failed`, it is pre 3. If an un-annotated machine moves to `Failed` phase, and the `failedMachinePreserveMax` has not been reached, MCM will auto-preserve this machine. 4. MCM will be modified to introduce a new stage in the `Failed` phase: `machineutils.PreserveFailed`, and a failed machine that is preserved by MCM will be transitioned to this stage after moving to `Failed`. * In this new stage, pods can be evicted and scheduled on other healthy machines, and the user/operator can wait for the corresponding VM to potentially recover. If the machine moves to `Running` phase on recovery, new pods can be scheduled on it. It is yet to be determined whether this feature will be required. +5. Machines of a MachineDeployment in `PreserveFailed` stage will also be counted towards the replica count and the enforcement of maximum machines allowed for the MachineDeployment. ## State Machine @@ -101,3 +102,9 @@ The transition of moving a machine from `PreserveFailed` to `Running` has not be 2. Operator adds: `node.machine.sapcloud.io/preserve-when-failed=false` to node. 3. MCM transitions M1 to `Terminating` 4. Capacity becomes available for preserving future `Failed` machines. + +## Limitations + +1. During rolling updates we will NOT honor preserving Machines. The Machine will be replaced with a healthy one if it moves to Failed phase. +2. Since gardener worker pool can correspond to 1..N MachineDeployments depending on number of zones, we will need to distribute the `failedMachinePreserveMax` across N machine deployments. +So, even if there are no failed machines preserved in other zones, the max per zone would still be enforced. Hence, the value of `failedMachinePreserveMax` should be chosen appropriately. From 454422c7dc8f0f433114e29c191b748a1aaf68ea Mon Sep 17 00:00:00 2001 From: thiyyakat Date: Tue, 23 Sep 2025 16:02:38 +0530 Subject: [PATCH 03/11] Address review comments --- docs/proposals/failed-machine-preservation.md | 85 ++++++++++--------- 1 file changed, 47 insertions(+), 38 deletions(-) diff --git a/docs/proposals/failed-machine-preservation.md b/docs/proposals/failed-machine-preservation.md index 5f1798439..30e6788dd 100644 --- a/docs/proposals/failed-machine-preservation.md +++ b/docs/proposals/failed-machine-preservation.md @@ -4,16 +4,15 @@ - [Preservation of Failed Machines](#preservation-of-failed-machines) - [Objective](#objective) - - [Solution Design](#solution-design) + - [Proposal](#proposal) - [State Machine](#state-machine) - [Use Cases](#use-cases) - ## Objective -Currently, the Machine Controller Manager(MCM) moves Machines with errors to the `Unknown` phase, and after the configured `machineHealthTimeout` seconds, to the `Failed` phase. +Currently, the Machine Controller Manager(MCM) moves Machines with errors to the `Unknown` phase, and after the configured `machineHealthTimeout`, to the `Failed` phase. 
`Failed` machines are swiftly moved to the `Terminating` phase during which the node is drained and the `Machine` object is deleted. This rapid cleanup prevents SRE/operators/support from conducting an analysis on the VM and makes finding root cause of failure more difficult. This document proposes enhancing MCM, such that: @@ -21,9 +20,9 @@ This document proposes enhancing MCM, such that: * There is a configurable limit to the number of `Failed` machines that can be preserved * There is a configurable limit to the duration for which such machines are preserved * Users can specify which healthy machines they would like to preserve in case of failure -* Users can request MCM to delete a preserved `Failed` machine, even before the timeout expires +* Users can request MCM to release a preserved `Failed` machine, even before the timeout expires, so that MCM can transition the machine to `Terminating` phase and trigger its deletion. -## Solution Design +## Proposal In order to achieve the objectives mentioned, the following are proposed: 1. Enhance `machineControllerManager` configuration in the `ShootSpec`, to specify the max number of failed machines to be preserved, @@ -35,64 +34,70 @@ and the time duration for which these machines will be preserved. ``` * Since gardener worker pool can correspond to `1..N` MachineDeployments depending on number of zones, `failedMachinePreserveMax` will be distributed across N machine deployments. * `failedMachinePreserveMax` must be chosen such that it can be appropriately distributed across the MachineDeployments. -2. Allow user/operator to explicitly request for preservation of a machine if it moves to `Failed` phase with the use of an annotation : `node.machine.sapcloud.io/preserve-when-failed=true`. -When such an annotated machine transitions from `Unknown` to `Failed`, it is prevented from moving to `Terminating` phase until `failedMachinePreserveTimeout` expires. - * A user/operator can request MCM to stop preserving a preserved `Failed` machine by adding/modifying the annotation: `node.machine.sapcloud.io/preserve-when-failed=false`. +2. Allow user/operator to explicitly request for preservation of a specific machine with the use of an annotation : `node.machine.sapcloud.io/preserve-when-failed=true`, such that, if it moves to `Failed` phase, the machine is preserved by MCM, provided there is capacity. +3. MCM will be modified to introduce a new stage in the `Failed` phase: `machineutils.PreserveFailed`, and a failed machine that is preserved by MCM will be transitioned to this stage after moving to `Failed`. +4. A machine in `PreserveFailed` stage automatically moves to `Terminating` phase once `failedMachinePreserveTimeout` expires. + * A user/operator can request MCM to stop preserving a machine in `PreservedFailed` stage using the annotation: `node.machine.sapcloud.io/preserve-when-failed=false`. * For a machine thus annotated, MCM will move it to `Terminating` phase even if `failedMachinePreserveTimeout` has not expired. -3. If an un-annotated machine moves to `Failed` phase, and the `failedMachinePreserveMax` has not been reached, MCM will auto-preserve this machine. -4. MCM will be modified to introduce a new stage in the `Failed` phase: `machineutils.PreserveFailed`, and a failed machine that is preserved by MCM will be transitioned to this stage after moving to `Failed`. - * In this new stage, pods can be evicted and scheduled on other healthy machines, and the user/operator can wait for the corresponding VM to potentially recover. 
If the machine moves to `Running` phase on recovery, new pods can be scheduled on it. It is yet to be determined whether this feature will be required. -5. Machines of a MachineDeployment in `PreserveFailed` stage will also be counted towards the replica count and the enforcement of maximum machines allowed for the MachineDeployment. +5. If an un-annotated machine moves to `Failed` phase, and the `failedMachinePreserveMax` has not been reached, MCM will auto-preserve this machine. +6. Machines of a MachineDeployment in `PreserveFailed` stage will also be counted towards the replica count and the enforcement of maximum machines allowed for the MachineDeployment. +7. At any point in time `machines requested for preservation + machines in PreservedFailed <= failedMachinePreserveMax`. If `machines requested for preservation + machines in PreservedFailed` is at or exceeds `failedMachinePreserveMax` on annotating a machine, the annotation will be deleted by MCM. ## State Machine The behaviour described above can be summarised using the state machine below: +```mermaid +--- +config: + layout: elk +--- +stateDiagram + direction TBP + state "PreserveFailed + (node drained)" as PreserveFailed + state "Requested + (node & machine annotated)" + as Requested + [*] --> Running + Running --> Requested:annotated with value=true && max not breached + Running --> Running:annotated, but max breached + Requested --> PreserveFailed:on failure + Running --> PreserveFailed:on failure && max not breached + PreserveFailed --> Terminating:after timeout + PreserveFailed --> Terminating:annotated with value=false + Running --> Failed : on failure && max breached + PreserveFailed --> Running : VM recovers + Failed --> Terminating + Terminating --> [*] ``` -(Running Machine) -├── [User adds `node.machine.sapcloud.io/preserve-when-failed=true`] → (Running + Requested) -└── [Machine fails + capacity available] → (PreserveFailed) -(Running + Requested) -├── [Machine fails + capacity available] → (PreserveFailed) -├── [Machine fails + no capacity] → Failed → Terminating -└── [User removes `node.machine.sapcloud.io/preserve-when-failed=true`] → (Running) - -(PreserveFailed) -├── [User adds `node.machine.sapcloud.io/preserve-when-failed=false`] → Terminating -└── [failedMachinePreserveTimeout expires] → Terminating - -``` In the above state machine, the phase `Running` also includes machines that are in the process of creation for which no errors have been encountered yet. -The transition of moving a machine from `PreserveFailed` to `Running` has not been shown since we haven't determined whether it is in scope for the current iteration of this feature. ## Use Cases: ### Use Case 1: Proactive Preservation Request **Scenario:** Operator suspects a machine might fail and wants to ensure preservation for analysis. #### Steps: -1. Operator annotates node with `node.machine.sapcloud.io/preserve-when-failed=true` +1. Operator annotates node with `node.machine.sapcloud.io/preserve-when-failed=true`, provided `failedMachinePreserveMax` is not violated 2. Machine fails later -3. MCM preserves the machine (if capacity allows) +3. MCM preserves the machine 4. Operator analyzes the failed VM -5. Operator releases the failed machine by setting `node.machine.sapcloud.io/preserve-when-failed=false` on the node object ### Use Case 2: Automatic Preservation **Scenario:** Machine fails unexpectedly, no prior annotation. #### Steps: -1. Machine transitions to Failed state -2. MCM checks preservation capacity -3. 
If capacity available, machine moved to `PreserveFailed` phase by MCM -4. After timeout, machine is terminated by MCM +1. Machine transitions to `Failed` phase +2. If `failedMachinePreserveMax` is not breached, machine moved to `PreserveFailed` phase by MCM +3. After `failedMachinePreserveTimeout`, machine is terminated by MCM ### Use Case 3: Capacity Management **Scenario:** Multiple machines fail when preservation capacity is full. #### Steps: -1. Machines M1, M2 already preserved (capacity = 2) -2. Machine M3 fails with annotation `node.machine.sapcloud.io/preserve-when-failed=true` set -3. MCM cannot preserve M3 due to capacity limits -4. M3 moved from `Failed` to `Terminating` by MCM, following which it is deleted +1. Machines M1, M2 already preserved (failedMachinePreserveMax = 2) +2. Operator wishes to preserve M3 in case of failure. He increases `failedMachinePreserveMax` to 3, and annotates M3 with `node.machine.sapcloud.io/preserve-when-failed=true`. +3. If M3 fails, machine moved to `PreserveFailed` phase by MCM. ### Use Case 4: Early Release **Scenario:** Operator has performed his analysis and no longer requires machine to be preserved @@ -100,9 +105,13 @@ The transition of moving a machine from `PreserveFailed` to `Running` has not be #### Steps: 1. Machine M1 is in `PreserveFailed` phase 2. Operator adds: `node.machine.sapcloud.io/preserve-when-failed=false` to node. -3. MCM transitions M1 to `Terminating` +3. MCM transitions M1 to `Terminating` even though `failedMachinePreserveTimeout` has not expired 4. Capacity becomes available for preserving future `Failed` machines. +## Open Point + +How will MCM provide the user with the option to drain a node when it is in `PreserveFailed` stage? + ## Limitations 1. During rolling updates we will NOT honor preserving Machines. The Machine will be replaced with a healthy one if it moves to Failed phase. From 961692a8b5ac30e2fafebf6adb48a58e8ebc10d5 Mon Sep 17 00:00:00 2001 From: thiyyakat Date: Tue, 23 Sep 2025 16:10:41 +0530 Subject: [PATCH 04/11] Change mermaid layout from elk to default for github support --- docs/proposals/failed-machine-preservation.md | 4 ---- 1 file changed, 4 deletions(-) diff --git a/docs/proposals/failed-machine-preservation.md b/docs/proposals/failed-machine-preservation.md index 30e6788dd..e1dde5f67 100644 --- a/docs/proposals/failed-machine-preservation.md +++ b/docs/proposals/failed-machine-preservation.md @@ -48,10 +48,6 @@ and the time duration for which these machines will be preserved. The behaviour described above can be summarised using the state machine below: ```mermaid ---- -config: - layout: elk ---- stateDiagram direction TBP state "PreserveFailed From fc1093484cfdecfa887a9894c0cfbd2142d85220 Mon Sep 17 00:00:00 2001 From: thiyyakat Date: Wed, 1 Oct 2025 15:35:55 +0530 Subject: [PATCH 05/11] Improve clarity --- docs/proposals/failed-machine-preservation.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/proposals/failed-machine-preservation.md b/docs/proposals/failed-machine-preservation.md index e1dde5f67..7296bf23e 100644 --- a/docs/proposals/failed-machine-preservation.md +++ b/docs/proposals/failed-machine-preservation.md @@ -34,13 +34,13 @@ and the time duration for which these machines will be preserved. ``` * Since gardener worker pool can correspond to `1..N` MachineDeployments depending on number of zones, `failedMachinePreserveMax` will be distributed across N machine deployments. 
* `failedMachinePreserveMax` must be chosen such that it can be appropriately distributed across the MachineDeployments. -2. Allow user/operator to explicitly request for preservation of a specific machine with the use of an annotation : `node.machine.sapcloud.io/preserve-when-failed=true`, such that, if it moves to `Failed` phase, the machine is preserved by MCM, provided there is capacity. +2. Allow user/operator to request for preservation of a specific machine with the use of an annotation : `node.machine.sapcloud.io/preserve-when-failed=true`, such that, if the machine moves to `Failed` phase, it is preserved by MCM, provided there is capacity. 3. MCM will be modified to introduce a new stage in the `Failed` phase: `machineutils.PreserveFailed`, and a failed machine that is preserved by MCM will be transitioned to this stage after moving to `Failed`. 4. A machine in `PreserveFailed` stage automatically moves to `Terminating` phase once `failedMachinePreserveTimeout` expires. * A user/operator can request MCM to stop preserving a machine in `PreservedFailed` stage using the annotation: `node.machine.sapcloud.io/preserve-when-failed=false`. * For a machine thus annotated, MCM will move it to `Terminating` phase even if `failedMachinePreserveTimeout` has not expired. -5. If an un-annotated machine moves to `Failed` phase, and the `failedMachinePreserveMax` has not been reached, MCM will auto-preserve this machine. -6. Machines of a MachineDeployment in `PreserveFailed` stage will also be counted towards the replica count and the enforcement of maximum machines allowed for the MachineDeployment. +5. If an un-annotated machine moves to `Failed` phase, and the number of preserved failed machines is less than `failedMachinePreserveMax`, MCM will auto-preserve this machine. +6. Machines of a MachineDeployment in `PreserveFailed` stage will also be counted towards the replica count and in the enforcement of maximum machines allowed for the MachineDeployment. 7. At any point in time `machines requested for preservation + machines in PreservedFailed <= failedMachinePreserveMax`. If `machines requested for preservation + machines in PreservedFailed` is at or exceeds `failedMachinePreserveMax` on annotating a machine, the annotation will be deleted by MCM. From 309527fd26d2df5d380b35f869310b82639aeca8 Mon Sep 17 00:00:00 2001 From: thiyyakat Date: Fri, 3 Oct 2025 14:51:44 +0530 Subject: [PATCH 06/11] Change proposal as per discussions --- docs/proposals/failed-machine-preservation.md | 113 ++++++++---------- 1 file changed, 51 insertions(+), 62 deletions(-) diff --git a/docs/proposals/failed-machine-preservation.md b/docs/proposals/failed-machine-preservation.md index 7296bf23e..990ad1657 100644 --- a/docs/proposals/failed-machine-preservation.md +++ b/docs/proposals/failed-machine-preservation.md @@ -5,7 +5,6 @@ - [Preservation of Failed Machines](#preservation-of-failed-machines) - [Objective](#objective) - [Proposal](#proposal) - - [State Machine](#state-machine) - [Use Cases](#use-cases) @@ -15,68 +14,62 @@ Currently, the Machine Controller Manager(MCM) moves Machines with errors to the `Unknown` phase, and after the configured `machineHealthTimeout`, to the `Failed` phase. `Failed` machines are swiftly moved to the `Terminating` phase during which the node is drained and the `Machine` object is deleted. This rapid cleanup prevents SRE/operators/support from conducting an analysis on the VM and makes finding root cause of failure more difficult. 
+Moreover, in cases where a node seems healthy but all the workload on it are facing issues, there is a need for operators to be able to cordon/drain the node and conduct their analysis without the cluster-autoscaler scaling down the node. + This document proposes enhancing MCM, such that: -* VMs of `Failed` machines are retained temporarily for analysis -* There is a configurable limit to the number of `Failed` machines that can be preserved +* VMs of machines are retained temporarily for analysis +* There is a configurable limit to the number of machines that can be preserved * There is a configurable limit to the duration for which such machines are preserved * Users can specify which healthy machines they would like to preserve in case of failure -* Users can request MCM to release a preserved `Failed` machine, even before the timeout expires, so that MCM can transition the machine to `Terminating` phase and trigger its deletion. +* Users can request MCM to release a preserved machine, even before the timeout expires, so that MCM can transition the machine to either `Running` or `Terminating` phase, as the case may be. ## Proposal In order to achieve the objectives mentioned, the following are proposed: -1. Enhance `machineControllerManager` configuration in the `ShootSpec`, to specify the max number of failed machines to be preserved, +1. Enhance `machineControllerManager` configuration in the `ShootSpec`, to specify the max number of machines to be preserved, and the time duration for which these machines will be preserved. ``` machineControllerManager: - failedMachinePreserveMax: 2 - failedMachinePreserveTimeout: 3h + machinePreserveMax: 1 + machinePreserveTimeout: 72h ``` - * Since gardener worker pool can correspond to `1..N` MachineDeployments depending on number of zones, `failedMachinePreserveMax` will be distributed across N machine deployments. - * `failedMachinePreserveMax` must be chosen such that it can be appropriately distributed across the MachineDeployments. -2. Allow user/operator to request for preservation of a specific machine with the use of an annotation : `node.machine.sapcloud.io/preserve-when-failed=true`, such that, if the machine moves to `Failed` phase, it is preserved by MCM, provided there is capacity. -3. MCM will be modified to introduce a new stage in the `Failed` phase: `machineutils.PreserveFailed`, and a failed machine that is preserved by MCM will be transitioned to this stage after moving to `Failed`. -4. A machine in `PreserveFailed` stage automatically moves to `Terminating` phase once `failedMachinePreserveTimeout` expires. - * A user/operator can request MCM to stop preserving a machine in `PreservedFailed` stage using the annotation: `node.machine.sapcloud.io/preserve-when-failed=false`. - * For a machine thus annotated, MCM will move it to `Terminating` phase even if `failedMachinePreserveTimeout` has not expired. -5. If an un-annotated machine moves to `Failed` phase, and the number of preserved failed machines is less than `failedMachinePreserveMax`, MCM will auto-preserve this machine. -6. Machines of a MachineDeployment in `PreserveFailed` stage will also be counted towards the replica count and in the enforcement of maximum machines allowed for the MachineDeployment. -7. At any point in time `machines requested for preservation + machines in PreservedFailed <= failedMachinePreserveMax`. 
If `machines requested for preservation + machines in PreservedFailed` is at or exceeds `failedMachinePreserveMax` on annotating a machine, the annotation will be deleted by MCM. - - -## State Machine - -The behaviour described above can be summarised using the state machine below: -```mermaid -stateDiagram - direction TBP - state "PreserveFailed - (node drained)" as PreserveFailed - state "Requested - (node & machine annotated)" - as Requested - [*] --> Running - Running --> Requested:annotated with value=true && max not breached - Running --> Running:annotated, but max breached - Requested --> PreserveFailed:on failure - Running --> PreserveFailed:on failure && max not breached - PreserveFailed --> Terminating:after timeout - PreserveFailed --> Terminating:annotated with value=false - Running --> Failed : on failure && max breached - PreserveFailed --> Running : VM recovers - Failed --> Terminating - Terminating --> [*] - -``` - -In the above state machine, the phase `Running` also includes machines that are in the process of creation for which no errors have been encountered yet. + * This configuration will be set per worker pool. + * Since gardener worker pool can correspond to `1..N` MachineDeployments depending on number of zones, `machinePreserveMax` will be distributed across N machine deployments. + * `machinePreserveMax` must be chosen such that it can be appropriately distributed across the MachineDeployments. + * Example: if `machinePreserveMax` is set to 2, and the worker pool has 2 zones, then the maximum number of machines that will be preserved per zone is 1. +2. MCM will be modified to include a new phase `Preserved` to indicate that the machine has been preserved by MCM. +3. Allow user/operator to request for preservation of a specific machine/node with the use of annotations : `node.machine.sapcloud.io/preserve=now` and `node.machine.sapcloud.io/preserve=when-failed`. +4. When annotation `node.machine.sapcloud.io/preserve=now` is added to a `Running` machine, the following will take place: + - `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent CA from scaling it down. + - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime+machinePreserveTimeout$ + - The machine stage is changed to `Preserved` + - After timeout, the `node.machine.sapcloud.io/preserve=now` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` are deleted, the machine phase is changed to `Running` and the CA may delete the node. The `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`. + - Number of machines explicitly annotated will count towards enforcing `machinePreserveMax`. On breach, the annotation will be rejected. +5. When annotation `node.machine.sapcloud.io/preserve=when-failed` is added to a `Running` machine and the machine goes to `Failed`, the following will take place: + - The machine phase is changed to `Preserved`. + - Pods (other than daemonset pods) are drained. + - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime+machinePreserveTimeout$ + - After timeout, the `node.machine.sapcloud.io/preserve=when-failed` is deleted. The phase is changed to `Terminating`. + - Number of machines explicitly annotated will count towards enforcing `machinePreserveMax`. On breach, the annotation will be rejected. +6. 
When an un-annotated machine goes to `Failed` phase and $count(machinesAnnotatedForPreservation)+count(AutoPreservedMachines) < machinePreserveMax$, MCM will auto-preserve this machine.

From: thiyyakat
Date: Fri, 3 Oct 2025 15:08:08 +0530
Subject: [PATCH 07/11] Fix limitations

---
 docs/proposals/failed-machine-preservation.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/proposals/failed-machine-preservation.md b/docs/proposals/failed-machine-preservation.md
index 990ad1657..509e08ea7 100644
--- a/docs/proposals/failed-machine-preservation.md
+++ b/docs/proposals/failed-machine-preservation.md
@@ -101,4 +101,4 @@

 1. During rolling updates we will NOT honor preserving Machines. The Machine will be replaced with a healthy one if it moves to Failed phase.
 2. Since gardener worker pool can correspond to 1..N MachineDeployments depending on number of zones, we will need to distribute the `machinePreserveMax` across N machine deployments.
-So, even if there are no failed machines preserved in other zones, the max per zone would still be enforced. Hence, the value of `failedMachinePreserveMax` should be chosen appropriately.
+So, even if there are no failed machines preserved in other zones, the max per zone would still be enforced. Hence, the value of `machinePreserveMax` should be chosen appropriately.

From aa2ae8f7bfa75f41f2c28a840aca654075d16d7d Mon Sep 17 00:00:00 2001
From: thiyyakat
Date: Fri, 3 Oct 2025 16:31:17 +0530
Subject: [PATCH 08/11] Add state diagrams

---
 docs/proposals/failed-machine-preservation.md | 50 +++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/docs/proposals/failed-machine-preservation.md b/docs/proposals/failed-machine-preservation.md
index 509e08ea7..ea616ed82 100644
--- a/docs/proposals/failed-machine-preservation.md
+++ b/docs/proposals/failed-machine-preservation.md
@@ -5,6 +5,7 @@
 - [Preservation of Failed Machines](#preservation-of-failed-machines)
   - [Objective](#objective)
   - [Proposal](#proposal)
+  - [State Diagrams](#state-diagrams)
   - [Use Cases](#use-cases)

@@ -64,6 +65,55 @@ and the time duration for which these machines will be preserved.
 9. Machines of a MachineDeployment in `Preserved` stage will also be counted towards the replica count and in the enforcement of maximum machines allowed for the MachineDeployment.
 10. At any point in time $count(machinesAnnotatedForPreservation)+count(PreservedMachines)<=machinePreserveMax$.

+## State Diagrams:
+
+1. State Diagram for when a `Running` machine or its node is annotated with `node.machine.sapcloud.io/preserve=now`:
+```mermaid
+stateDiagram-v2
+direction TB
+    state "Running" as R
+    state "Preserved" as P
+    [*]-->R
+    R --> P: annotated with value=now && max not breached
+    P --> R: annotated with value=false or timeout occurs
+```
+
+2. State Diagram for when a `Running` machine or its node is annotated with `node.machine.sapcloud.io/preserve=when-failed`:
+```mermaid
+stateDiagram-v2
+    state "Running" as R
+    state "Running + Requested" as RR
+    state "Failed
+    (node drained)" as F
+    state "Preserved
+    (node drained)" as P
+    state "Terminating" as T
+    [*]-->R
+    R --> RR: annotated with value=when-failed && max not breached
+    RR --> F: on failure
+    F --> P
+    P --> T: on timeout or value=false
+    P --> R: if node Healthy before timeout
+    T --> [*]
+```
+
+3. State Diagram for when an un-annotated `Running` machine fails:
+```mermaid
+stateDiagram-v2
+direction TB
+    state "Running" as R
+    state "Failed" as F
+    state "Preserved" as P
+    state "Terminating" as T
+    [*] --> R
+    R-->F: on failure
+    F --> P: if max not breached
+    F --> T: if max breached
+    P --> T: on timeout or value=false
+    P --> R : if node Healthy before timeout
+    T --> [*]
+```

From 9462118bad59fadfaecf446bd155094120b3e589 Mon Sep 17 00:00:00 2001
From: thiyyakat
Date: Fri, 3 Oct 2025 16:34:33 +0530
Subject: [PATCH 09/11] Rename file and proposal

---
 ...failed-machine-preservation.md => machine-preservation.md} | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
 rename docs/proposals/{failed-machine-preservation.md => machine-preservation.md} (98%)

diff --git a/docs/proposals/failed-machine-preservation.md b/docs/proposals/machine-preservation.md
similarity index 98%
rename from docs/proposals/failed-machine-preservation.md
rename to docs/proposals/machine-preservation.md
index ea616ed82..ffce5bfa4 100644
--- a/docs/proposals/failed-machine-preservation.md
+++ b/docs/proposals/machine-preservation.md
@@ -1,8 +1,8 @@
-# Preservation of Failed Machines
+# Preservation of Machines

-- [Preservation of Failed Machines](#preservation-of-failed-machines)
+- [Preservation of Machines](#preservation-of-machines)
   - [Objective](#objective)
   - [Proposal](#proposal)
   - [State Diagrams](#state-diagrams)
   - [Use Cases](#use-cases)

From 849a99dd0da422affb2944555766d03c76b8985b Mon Sep 17 00:00:00 2001
From: thiyyakat
Date: Wed, 8 Oct 2025 19:55:18 +0530
Subject: [PATCH 10/11] Update proposal to reflect changes decided in meeting

---
 docs/proposals/machine-preservation.md | 135 +++++++++++--------------
 1 file changed, 57 insertions(+), 78 deletions(-)

diff --git a/docs/proposals/machine-preservation.md b/docs/proposals/machine-preservation.md
index ffce5bfa4..d31306456 100644
--- a/docs/proposals/machine-preservation.md
+++ b/docs/proposals/machine-preservation.md
@@ -15,140 +15,119 @@

Currently, the Machine Controller Manager (MCM) moves Machines with errors to the `Unknown` phase, and after the configured `machineHealthTimeout`, to the `Failed` phase.
`Failed` machines are swiftly moved to the `Terminating` phase during which the node is drained and the `Machine` object is deleted. This rapid cleanup prevents SRE/operators/support from conducting an analysis on the VM and makes finding the root cause of failure more difficult.
This document proposes enhancing MCM, such that: * VMs of machines are retained temporarily for analysis -* There is a configurable limit to the number of machines that can be preserved -* There is a configurable limit to the duration for which such machines are preserved -* Users can specify which healthy machines they would like to preserve in case of failure +* There is a configurable limit to the number of machines that can be preserved automatically on failure (auto-preservation) +* There is a configurable limit to the duration for which machines are preserved +* Users can specify which healthy machines they would like to preserve in case of failure, or for diagnoses in current state (prevent scale down by CA) * Users can request MCM to release a preserved machine, even before the timeout expires, so that MCM can transition the machine to either `Running` or `Terminating` phase, as the case may be. +Related Issue: https://github.com/gardener/machine-controller-manager/issues/1008 + ## Proposal In order to achieve the objectives mentioned, the following are proposed: -1. Enhance `machineControllerManager` configuration in the `ShootSpec`, to specify the max number of machines to be preserved, +1. Enhance `machineControllerManager` configuration in the `ShootSpec`, to specify the max number of machines to be auto-preserved, and the time duration for which these machines will be preserved. ``` machineControllerManager: - machinePreserveMax: 1 + autoPreserveFailedMax: 0 machinePreserveTimeout: 72h ``` * This configuration will be set per worker pool. * Since gardener worker pool can correspond to `1..N` MachineDeployments depending on number of zones, `machinePreserveMax` will be distributed across N machine deployments. * `machinePreserveMax` must be chosen such that it can be appropriately distributed across the MachineDeployments. * Example: if `machinePreserveMax` is set to 2, and the worker pool has 2 zones, then the maximum number of machines that will be preserved per zone is 1. -2. MCM will be modified to include a new phase `Preserved` to indicate that the machine has been preserved by MCM. -3. Allow user/operator to request for preservation of a specific machine/node with the use of annotations : `node.machine.sapcloud.io/preserve=now` and `node.machine.sapcloud.io/preserve=when-failed`. -4. When annotation `node.machine.sapcloud.io/preserve=now` is added to a `Running` machine, the following will take place: +2. MCM will be modified to include a new sub-phase `Preserved` to indicate that the machine has been preserved by MCM. +3. Allow user/operator to request for preservation of a specific machine/node with the use of annotation : `node.machine.sapcloud.io/preserve=true`. +4. When annotation `node.machine.sapcloud.io/preserve=true` is added to a `Running` machine, the following will take place: - `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent CA from scaling it down. - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime+machinePreserveTimeout$ - - The machine stage is changed to `Preserved` - - After timeout, the `node.machine.sapcloud.io/preserve=now` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` are deleted, the machine phase is changed to `Running` and the CA may delete the node. The `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`. - - Number of machines explicitly annotated will count towards enforcing `machinePreserveMax`. 
On breach, the annotation will be rejected. -5. When annotation `node.machine.sapcloud.io/preserve=when-failed` is added to a `Running` machine and the machine goes to `Failed`, the following will take place: - - The machine phase is changed to `Preserved`. - - Pods (other than daemonset pods) are drained. - - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime+machinePreserveTimeout$ - - After timeout, the `node.machine.sapcloud.io/preserve=when-failed` is deleted. The phase is changed to `Terminating`. - - Number of machines explicitly annotated will count towards enforcing `machinePreserveMax`. On breach, the annotation will be rejected. -6. When an un-annotated machine goes to `Failed` phase and the $count(machinesAnnotatedForPreservation)+count(AutoPreservedMachines)R - R --> P: annotated with value=now && max not breached - P --> R: annotated with value=false or timeout occurs + R --> RP: annotated with preserve=true + RP --> R: annotated with preserve=false or timeout occurs ``` -2. State Diagram for when a `Running` machine or its node is annotated with `node.machine.sapcloud.io/preserve=when-failed`: -```mermaid -stateDiagram-v2 - state "Running" as R - state "Running + Requested" as RR - state "Failed" as F - state "Preserved - (node drained)" as P - state "Terminating" as T - [*]-->R - R --> RR: annotated with value=when-failed && max not breached - RR --> F: on failure - F --> P - P --> T: on timeout or value=false - P --> R: if node Healthy before timeout - T --> [*] -``` - -3. State Diagram for when an un-annotated `Running` machine fails: +2. State Diagram for when an un-annotated `Running` machine fails (Auto-preservation): ```mermaid stateDiagram-v2 direction TBP state "Running" as R - state "Failed" as F - state "Preserved" as P + state "Failed + (node drained)" as F + state "Failed:Preserved" as FP state "Terminating" as T [*] --> R R-->F: on failure - F --> P: if max not breached - F --> T: if max breached - P --> T: on timeout or value=false - P --> R : if node Healthy before timeout + F --> FP: if autoPreserveFailedMax not breached + F --> T: if autoPreserveFailedMax breached + FP --> T: on timeout or value=false + FP --> R : if node Healthy before timeout T --> [*] ``` - ## Use Cases: ### Use Case 1: Proactive Preservation Request **Scenario:** Operator suspects a machine might fail and wants to ensure preservation for analysis. #### Steps: -1. Operator annotates node with `node.machine.sapcloud.io/preserve=when-failed`, provided `machinePreserveMax` is not violated -2. Machine fails later -3. MCM preserves the machine -4. Operator analyzes the failed VM +1. Operator annotates node with `node.machine.sapcloud.io/preserve=true` +2. MCM preserves the machine, and prevents CA from scaling it down +3. Operator analyzes the VM -### Use Case 2: Automatic Preservation +### Use Case 2: Auto-Preservation **Scenario:** Machine fails unexpectedly, no prior annotation. #### Steps: 1. Machine transitions to `Failed` phase -2. If `machinePreserveMax` is not breached, machine moved to `Preserved` phase by MCM -3. After `machinePreserveTimeout`, machine is terminated by MCM - -### Use Case 3: Preservation Request for Analysing Running Machine -**Scenario:** Workload on machine failing. Operator wishes to diagnose. -#### Steps: -1. Operator annotates node with `node.machine.sapcloud.io/preserve=now`, provided `machinePreserveMax` is not violated -2. MCM preserves machine and prevents CA from scaling it down -3. 
Operator analyzes the machine +2. Machine is drained +3. If `autoPreserveFailedMax` is not breached, machine moved to `Failed:Preserved` phase by MCM +4. After `machinePreserveTimeout`, machine is terminated by MCM -### Use Case 4: Early Release +### Use Case 3: Early Release **Scenario:** Operator has performed his analysis and no longer requires machine to be preserved #### Steps: -1. Machine is in `Preserved` phase +1. Machine is in `Running:Preserved` or `Failed:Preserved` phase 2. Operator adds: `node.machine.sapcloud.io/preserve=false` to node. -3. MCM transitions machine to `Running` or `Terminating`, depending on which phase it was in before moving to `Preserved`, even though `machinePreserveTimeout` has not expired -4. Capacity becomes available for preserving future annotated machines or for auto-preservation of `Failed` machines. +3. MCM transitions machine to `Running` or `Terminating`, for `Running:Preserved` or `Failed:Preserved` respectively, even though `machinePreserveTimeout` has not expired +4. If machine was in `Failed:Preserved`, capacity becomes available for auto-preservation. -## Limitations +## Points to Note 1. During rolling updates we will NOT honor preserving Machines. The Machine will be replaced with a healthy one if it moves to Failed phase. -2. Since gardener worker pool can correspond to 1..N MachineDeployments depending on number of zones, we will need to distribute the `machinePreserveMax` across N machine deployments. -So, even if there are no failed machines preserved in other zones, the max per zone would still be enforced. Hence, the value of `machinePreserveMax` should be chosen appropriately. +2. Hibernation policy would override machine preservation. +3. If Machine and Node annotation values differ for a particular annotation key (including `node.machine.sapcloud.io/preserve=true`), the Node annotation value will override the Machine annotation value. +4. If `autoPreserveFailedMax` is reduced in the Shoot Spec, older machines are moved to `Terminating` phase before newer ones. +5. In case of a scale down of an MCD's replica count, `Preserved` machines will be the last to be scaled down. Replica count will always be honoured. +6. If the value for annotation key `cluster-autoscaler.kubernetes.io/scale-down-disabled` for a machine in `Running:Preserved` is changed to `false` by a user, the value will be overwritten to `true` by MCM. +7. On increase/decrease of timeout- new value will only apply to machines that go into `Preserved` phase after the change. Operators can always edit `machine.CurrentStatus.PreserveExpiryTime` to prolong the expiry time of existing `Preserved` machines. + - can specify timeout +8. [Modify CA FAQ](https://github.com/gardener/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-prevent-cluster-autoscaler-from-scaling-down-a-particular-node) once feature is developed to use `node.machine.sapcloud.io/preserve=true` instead of the `cluster-autoscaler.kubernetes.io/scale-down-disabled=true` currently suggested. 
This would:
- harmonise machine flow
- shield from CA's internals
- make it generic and no longer CA specific
- allow a timeout to be specified
\ No newline at end of file

From 227b3cd9d221bf51e2ab47aafaa82b93d0e62be1 Mon Sep 17 00:00:00 2001
From: thiyyakat
Date: Thu, 9 Oct 2025 11:33:40 +0530
Subject: [PATCH 11/11] Modify proposal to support use case for `preserve=when-failed`

---
 docs/proposals/machine-preservation.md | 76 ++++++++++++++++++--------
 1 file changed, 52 insertions(+), 24 deletions(-)

diff --git a/docs/proposals/machine-preservation.md b/docs/proposals/machine-preservation.md
index d31306456..945cad115 100644
--- a/docs/proposals/machine-preservation.md
+++ b/docs/proposals/machine-preservation.md
@@ -41,38 +41,59 @@ and the time duration for which these machines will be preserved.
2. MCM will be modified to include a new sub-phase `Preserved` to indicate that the machine has been preserved by MCM.
3. Allow user/operator to request preservation of a specific machine/node with the use of the annotations `node.machine.sapcloud.io/preserve=now` and `node.machine.sapcloud.io/preserve=when-failed`.
4. When the annotation `node.machine.sapcloud.io/preserve=now` is added to a `Running` machine, the following will take place:
    - `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent CA from scaling it down.
    - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime+machinePreserveTimeout$
    - The machine's phase is changed to `Running:Preserved`
    - After timeout, the `node.machine.sapcloud.io/preserve=now` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` annotations are deleted, the machine phase is changed to `Running` and the CA may delete the node. The `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`.
5. When the annotation `node.machine.sapcloud.io/preserve=when-failed` is added to a `Running` machine and the machine goes to `Failed`, the following will take place:
    - The machine is drained of pods except for DaemonSet pods.
    - The machine phase is changed to `Failed:Preserved`.
    - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime+machinePreserveTimeout$
    - After timeout, the `node.machine.sapcloud.io/preserve=when-failed` annotation is deleted. The phase is changed to `Terminating`.
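    For illustration, such a preservation request could be made with a plain node annotation (the node name is a placeholder):

    ```bash
    kubectl annotate node <node-name> node.machine.sapcloud.io/preserve=when-failed
    ```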
6. When an un-annotated machine goes to `Failed` phase and `autoPreserveFailedMax` is not breached:
    - Pods (other than DaemonSet pods) are drained.
    - The machine's phase is changed to `Failed:Preserved`.
    - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime+machinePreserveTimeout$
    - After timeout, the phase is changed to `Terminating`.
    - The number of machines in the `Failed:Preserved` phase counts towards enforcing `autoPreserveFailedMax`.
7. If a failed machine is currently in `Failed:Preserved` and its VM/node is found to be healthy before the timeout, the machine will be moved to `Running`.
8. A user/operator can request MCM to stop preserving a machine/node in `Running:Preserved` or `Failed:Preserved` phase using the annotation `node.machine.sapcloud.io/preserve=false`.
    * MCM will move a machine thus annotated either to the `Running` phase or to `Terminating`, depending on the phase of the machine before it was preserved.
9. Machines of a MachineDeployment in the `Preserved` sub-phase will also be counted towards the replica count and in the enforcement of the maximum machines allowed for the MachineDeployment.
10. MCM will be modified to perform the drain in the `Failed` phase rather than in `Terminating`.

## State Diagrams:

1. State Diagram for when a `Running` machine or its node is annotated with `node.machine.sapcloud.io/preserve=now`:
```mermaid
stateDiagram-v2
direction TB
    state "Running" as R
    state "Running:Preserved" as RP
    [*]-->R
    R --> RP: annotated with preserve=now
    RP --> R: annotated with preserve=false or timeout occurs
```
2. State Diagram for when a `Running` machine or its node is annotated with `node.machine.sapcloud.io/preserve=when-failed`:
```mermaid
stateDiagram-v2
    state "Running" as R
    state "Running + Requested" as RR
    state "Failed
    (node drained)" as F
    state "Failed:Preserved" as P
    state "Terminating" as T
    [*]-->R
    R --> RR: annotated with preserve=when-failed
    RR --> F: on failure
    F --> P
    P --> T: on timeout or preserve=false
    P --> R: if node Healthy before timeout
    T --> [*]
```
3. State Diagram for when an un-annotated `Running` machine fails (Auto-preservation):
```mermaid
stateDiagram-v2
direction TB
    state "Running" as R
    state "Failed
    (node drained)" as F
    state "Failed:Preserved" as FP
    state "Terminating" as T
    [*] --> R
    R-->F: on failure
    F --> FP: if autoPreserveFailedMax not breached
    F --> T: if autoPreserveFailedMax breached
    FP --> T: on timeout or value=false
    FP --> R : if node Healthy before timeout
    T --> [*]
```

## Use Cases:

### Use Case 1: Preservation Request for Analysing Running Machine
**Scenario:** Workloads on a machine are failing. The operator wishes to diagnose the machine in its current state.
#### Steps:
1. Operator annotates the node with `node.machine.sapcloud.io/preserve=now`
2. MCM preserves the machine and prevents CA from scaling it down
3. Operator analyzes the VM
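As a sketch, the annotation in step 1 could be applied, and later released, with (the node name is a placeholder):

```bash
# Preserve the machine in its current state and block CA scale-down
kubectl annotate node <node-name> node.machine.sapcloud.io/preserve=now

# Release the machine before machinePreserveTimeout expires
kubectl annotate node <node-name> --overwrite node.machine.sapcloud.io/preserve=false
```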
### Use Case 2: Proactive Preservation Request
**Scenario:** Operator suspects a machine might fail and wants to ensure preservation for analysis.
#### Steps:
1. Operator annotates the node with `node.machine.sapcloud.io/preserve=when-failed`
2. Machine fails later
3. MCM preserves the machine
4. Operator analyzes the VM

### Use Case 3: Auto-Preservation
**Scenario:** Machine fails unexpectedly, with no prior annotation.
#### Steps:
1. Machine transitions to the `Failed` phase
2. Machine is drained
3. If `autoPreserveFailedMax` is not breached, the machine is moved to the `Failed:Preserved` phase by MCM
4. After `machinePreserveTimeout`, the machine is terminated by MCM

### Use Case 4: Early Release
**Scenario:** Operator has completed their analysis and no longer requires the machine to be preserved.
#### Steps:
1. Machine is in the `Running:Preserved` or `Failed:Preserved` phase
2. Operator adds `node.machine.sapcloud.io/preserve=false` to the node.
3. MCM transitions the machine to `Running` or `Terminating`, for `Running:Preserved` or `Failed:Preserved` respectively, even though `machinePreserveTimeout` has not expired
4. If the machine was in `Failed:Preserved`, capacity becomes available for auto-preservation.

## Points to Note

1. During rolling updates MCM will NOT honor preserving Machines. The Machine will be replaced with a healthy one if it moves to the `Failed` phase.
2. Hibernation policy will override machine preservation.
3. If Machine and Node annotation values differ for a particular annotation key, the Node annotation value will override the Machine annotation value.
4. If `autoPreserveFailedMax` is reduced in the Shoot Spec, older machines are moved to the `Terminating` phase before newer ones.
5. In case of a scale-down of a MachineDeployment's replica count, `Preserved` machines will be the last to be scaled down. The replica count will always be honoured.
6. If the value for the annotation key `cluster-autoscaler.kubernetes.io/scale-down-disabled` for a machine in `Running:Preserved` is changed to `false` by a user, the value will be overwritten to `true` by MCM.
7. On an increase/decrease of the timeout, the new value will only apply to machines that go into the `Preserved` phase after the change. Operators can always edit `machine.CurrentStatus.PreserveExpiryTime` to prolong the expiry time of existing `Preserved` machines.
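    As a purely illustrative sketch (it assumes the proposed field would be exposed via the Machine's status subresource as `currentStatus.preserveExpiryTime`, which is not part of the current MCM API), extending the expiry could look like:

    ```bash
    kubectl patch machine <machine-name> -n <namespace> --subresource=status --type=merge \
      -p '{"status":{"currentStatus":{"preserveExpiryTime":"2025-12-31T00:00:00Z"}}}'
    ```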
8. [Modify the CA FAQ](https://github.com/gardener/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-prevent-cluster-autoscaler-from-scaling-down-a-particular-node) once the feature is developed, to suggest `node.machine.sapcloud.io/preserve=now` instead of the `cluster-autoscaler.kubernetes.io/scale-down-disabled=true` annotation currently suggested. This would:
   - harmonise the machine flow
   - shield users from CA's internals
   - make it generic and no longer CA-specific
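For reference, a minimal sketch of how the configuration proposed above could surface in a Shoot manifest; the per-worker-pool placement follows Gardener's existing `machineControllerManager` settings block, and the two new fields are the ones proposed in this document:

```yaml
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
spec:
  provider:
    workers:
      - name: worker-pool-1
        machineControllerManager:
          autoPreserveFailedMax: 1    # proposed: max machines auto-preserved on failure, per pool
          machinePreserveTimeout: 72h # proposed: how long a preserved machine is retained
```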