If any clusters are offline, bundle status can get stuck #594

Open
philomory opened this issue Nov 5, 2021 · 7 comments

philomory commented Nov 5, 2021

If any clusters are offline/unavailable, the status of Bundles that get deployed to those clusters can get stuck with misleading/confusing error messages.

Steps to reproduce:

  1. Create a fleet workspace and add at least two clusters. Make one of them a single-node cluster (e.g. k3s) so that it is easy to start and stop. We'll call that cluster "A".
  2. Create a ClusterGroup that contains at least two of your clusters, including cluster A. We'll call this group "G".
  3. Create a git repository containing the following code:
    # test-bundle/fleet.yaml
    defaultNamespace: test-bundle
    helm:
      chart: ./chart
      releaseName: test-bundle
    
    # test-bundle/chart/Chart.yaml
    apiVersion: v2
    name: test-bundle
    version: 0.0.1
    
    # test-bundle/chart/templates/deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: test
    spec:
      selector:
        matchLabels:
          app: test
      template:
        metadata:
          labels:
            app: test
        spec:
          containers:
            - name: test
              image: paulbouwer/hello-kubernetes:1.10.1
              lifecycle:
                preStop:
                  exec:
                    command: ["sleep", "1"]
  4. Create a GitRepo object pointing to this repository that applies to ClusterGroup "G". We'll call this GitRepo "R". (Example manifests for steps 2 and 4 are sketched just after this list.)
  5. Let Fleet deploy the bundle from R to all clusters in G. Wait for them all to be Ready.
  6. Push the following change (which introduces an error) to the git repository R:
    --- a/test-bundle/chart/templates/deployment.yaml
    +++ b/test-bundle/chart/templates/deployment.yaml
    @@ -20,6 +20,6 @@ spec:
             - name: test
               image: paulbouwer/hello-kubernetes:1.10.1
               lifecycle:
    -            preStop:
    +            preStart:
                   exec:
                     command: ["sleep", "2"]
    
  7. Wait for Fleet to attempt to deploy this change. The GitRepo, Bundle, and all relevant BundleDeployments should end up in the ErrApplied state, with an error message similar to: error validating "": error validating data: ValidationError(Deployment.spec.template.spec.containers[0].lifecycle): unknown field "preStart" in io.k8s.api.core.v1.Lifecycle.
  8. Shut off cluster A (stop the only node in the cluster).
  9. Wait until Cluster A shows as offline under Cluster Management in Rancher.
  10. Push the following change (which neither fixes the error nor introduces new errors) to the git repository R:
    --- a/test-bundle/chart/templates/deployment.yaml
    +++ b/test-bundle/chart/templates/deployment.yaml
    @@ -18,7 +18,7 @@ spec:
         spec:
           containers:
             - name: test
    -          image: paulbouwer/hello-kubernetes:1.10.1
    +          image: paulbouwer/hello-kubernetes:1.10
               lifecycle:
                 preStart:
                   exec:
    
  11. Wait until the GitRepo UI in Fleet shows that it has picked up the new commit. The GitRepo, Bundle, and all relevant BundleDeployments will still be in the same error state as before.
  12. Push the following change (which fixes the initial error) to the git repository R:
    --- a/test-bundle/chart/templates/deployment.yaml
    +++ b/test-bundle/chart/templates/deployment.yaml
    @@ -20,6 +20,6 @@ spec:
             - name: test
               image: paulbouwer/hello-kubernetes:1.10
               lifecycle:
    -            preStart:
    +            preStop:
                   exec:
                     command: ["sleep", "2"]
    
  13. Wait for the GitRepo UI in Fleet to show that it has picked up the new commit. Shortly afterwards, all BundleDeployments except the one belonging to A will update to show that they are in a "Ready" state. However, the BundleDeployment for A will remain in the original error state, as will the Bundle and GitRepo.
  14. Observe that, in Fleet's UI, GitRepo R still displays an error message similar to Error validating "": error validating data: ValidationError(Deployment.spec.template.spec.containers[0].lifecycle): unknown field "preStart" in io.k8s.api.core.v1.Lifecycle, even though the actual repository no longer contains any reference to a preStart field.
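
For reference, minimal manifests for the ClusterGroup from step 2 and the GitRepo from step 4 might look roughly like the sketch below. Names, namespace, label and repository URL are placeholders rather than values taken from this report.

# ClusterGroup "G": selects the downstream clusters by label
apiVersion: fleet.cattle.io/v1alpha1
kind: ClusterGroup
metadata:
  name: g
  namespace: fleet-default
spec:
  selector:
    matchLabels:
      env: test        # placeholder label carried by the member clusters
---
# GitRepo "R": deploys the test-bundle path to every cluster in group "G"
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: r
  namespace: fleet-default
spec:
  repo: https://example.com/your-org/your-repo.git   # placeholder URL
  branch: main
  paths:
    - test-bundle
  targets:
    - clusterGroup: g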

It is worth noting that, if step 10 is skipped - so that the commit in step 12 (which fixes the error) is the first commit to the repo after cluster A goes offline - then in step 12 the BundleDeployment for A will go to a "Wait Applied" state rather than being stuck in the error state.

strowi commented Mar 13, 2023

It looks like we are running into this error two years later, with Rancher 2.7.1. Having one downed cluster block the whole process is exactly what we were trying to avoid by using Fleet. Any idea or timeline?
Currently I have to change a selector to remove the cluster from the group (see the sketch below).
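
For illustration only, that selector-based workaround could look something like the following, assuming the offline cluster carries a distinguishing label (names and label key/value are made up, not taken from this thread):

apiVersion: fleet.cattle.io/v1alpha1
kind: ClusterGroup
metadata:
  name: g
  namespace: fleet-default
spec:
  selector:
    matchExpressions:
      - key: cluster-name          # placeholder label key
        operator: NotIn
        values:
          - cluster-a              # the offline cluster, excluded from the group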

manno (Member) commented Apr 13, 2023

This might be related to default values in the rollout strategy. The defaults are documented in the fleet.yaml reference.

@kkaempf kkaempf modified the milestones: 2023-Q3-v2.7x, 2023-Q4-v2.8x Jul 24, 2023
@manno manno modified the milestones: 2023-Q4-v2.8x, 2024-Q1-2.8x Aug 24, 2023
@manno manno modified the milestones: 2024-Q1-2.8x, v2.9.0 Nov 27, 2023
@kkaempf kkaempf modified the milestones: v2.9.0, v2.9-Next1 Aug 15, 2024
manno (Member) commented Aug 21, 2024

Let's test if this still happens on 2.9.1

mmartin24 (Collaborator) commented:

It still seems to be happening on 2.9.1-rc3.

Some notes on how it is observed in this version:

At step 7, the state is either Not Ready or Modified. Nevertheless, an error message is displayed:

Modified(3) [Bundle repo-r-test-bundle]; deployment.apps test-bundle/test modified {"spec":{"template":{"spec":{"containers":[{"image":"paulbouwer/hello-kubernetes:1.10.1","imagePullPolicy":"IfNotPresent","lifecycle":{"preStart":{"exec":{"command":["sleep","2"]}}},"name":"test","resources":{},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File"}]}}}}

After step 10, the state goes directly to Wait Applied.

After the fix from step 12, the GitRepo state is Wait Applied, yet a result similar to the one originally described occurs:

WaitApplied(1) [Bundle repo-r-test-bundle]; deployment.apps test-bundle/test modified {"spec":{"template":{"spec":{"containers":[{"image":"paulbouwer/hello-kubernetes:1.10.1","imagePullPolicy":"IfNotPresent","lifecycle":{"preStart":{"exec":{"command":["sleep","2"]}}},"name":"test","resources":{},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File"}]}}}}

manno (Member) commented Sep 10, 2024

When working on this we should

  • document how state fields are propagated between resources, like we do for resource counts

@olblak olblak modified the milestones: v2.9.2, v2.9-Next1 Sep 11, 2024
@weyfonk weyfonk assigned weyfonk and unassigned mmartin24 Oct 1, 2024
weyfonk added a commit to weyfonk/fleet-experiments that referenced this issue Oct 1, 2024
This should help reproduce fleet#594 [1].

[1]: rancher/fleet#594
weyfonk added a commit to weyfonk/fleet-experiments that referenced this issue Oct 1, 2024
This is reproduction step 6 of fleet#594 [1].

[1]: rancher/fleet#594
weyfonk added a commit to weyfonk/fleet-experiments that referenced this issue Oct 1, 2024
This is reproduction step 10 of fleet#594 [1].

[1]: rancher/fleet#594
weyfonk added a commit to weyfonk/fleet-experiments that referenced this issue Oct 1, 2024
This reverts commit 1ffa225.
This is reproduction step 12 of fleet#594 [1].

[1]: rancher/fleet#594
weyfonk (Contributor) commented Oct 1, 2024

> This might be related to default values in the rollout strategy. The defaults are documented in the fleet.yaml reference.

Rollout strategy does not seem to be the culprit here, as setting maxUnavailablePartitions to 100% bundle-wide does not change anything.
Reproduced this with Fleet standalone (latest main).
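
For context, this setting lives in the bundle's fleet.yaml under rolloutStrategy. A sketch of the override described above, applied to the reproduction's fleet.yaml, might look like this:

# test-bundle/fleet.yaml
defaultNamespace: test-bundle
helm:
  chart: ./chart
  releaseName: test-bundle
rolloutStrategy:
  # With all partitions allowed to be unavailable, a single offline cluster
  # should not pause the rollout; per the comment above, this did not change
  # the stuck status.
  maxUnavailablePartitions: 100%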

@kkaempf kkaempf modified the milestones: v2.9.3, 2.9.4 Oct 2, 2024
weyfonk (Contributor) commented Oct 3, 2024

The issue here is that a BundleDeployment's status is updated by the agent running in that BundleDeployment's target cluster, so a BundleDeployment targeting a downstream cluster no longer has its status updated once that cluster goes offline.

With status data propagated upwards from BundleDeployments to Bundles and GitRepos, and then to Clusters and ClusterGroups, those resources keep showing the now-outdated Modified status.

A solution could be to watch a Fleet Cluster resource's agent lastSeen timestamp, which is also updated by the downstream cluster's agent and therefore stops advancing once that cluster is offline. Fleet would then update the statuses of all BundleDeployments on that cluster once more than $threshold has elapsed since the lastSeen timestamp, so that those status updates are propagated to the other resources.
$threshold could be hard-coded or made configurable, with a sensible default (e.g. 15 minutes or more).
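
For reference, the heartbeat mentioned above is visible on the Fleet Cluster resource; an illustrative (abridged) excerpt is shown below. The exact field path is an assumption based on current Fleet versions and should be checked against clusters.fleet.cattle.io on the installation at hand.

apiVersion: fleet.cattle.io/v1alpha1
kind: Cluster
metadata:
  name: cluster-a               # placeholder name for the offline cluster
  namespace: fleet-default
status:
  agent:
    lastSeen: "2024-10-03T09:15:42Z"   # stops advancing once the downstream agent is offline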
