
Fix active_server for multi-cluster deployments #205

Merged

Conversation

@eliasp (Contributor) commented Mar 19, 2024

Description

When addressing multiple clusters at once within a play, the `active_server` fact would only be determined once, when processing the very first cluster. The task that waits for all nodes and pods to be ready again would then be delegated to that `active_server`.

When all clusters are sized equally, this wouldn't cause any real trouble (although the check would still verify the wrong cluster's nodes), since the number of nodes in the first cluster matches the number of nodes in every other cluster being checked.

But as soon as a cluster with a different number of nodes is present, this causes failures and timeouts: the check waiting for all nodes to be up and running again never reaches the expected number.

This problem is prevented by not utilizing `run_once` and instead verifying, for the include of `first_server.yml`, whether `active_server` is a member of the current cluster.
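
For illustration, a minimal sketch of the `when`-based approach, assuming the group of servers is addressed through a `rke2_servers_group_name` variable (the role's actual variable names may differ):

```yaml
# Previously (simplified): run_once limits the include to a single host
# in the whole play, so active_server is derived from the first cluster only.
# - name: Include tasks for the first server
#   ansible.builtin.include_tasks: first_server.yml
#   run_once: true

# With a plain when condition, the include runs on the 1st server of
# each cluster, so every cluster determines its own active_server.
- name: Include tasks for the first server of each cluster
  ansible.builtin.include_tasks: first_server.yml
  when: inventory_hostname == groups[rke2_servers_group_name][0]
```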

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Small minor change not affecting the Ansible Role code (GitHub Actions Workflow, Documentation etc.)

How Has This Been Tested?

  • added multiple clusters with different numbers of nodes to my inventory (an example shape is sketched below)
  • ran the rke2 role successfully with these changes
  • saw the rke2 role fail without them
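
For reference, a multi-cluster inventory of this shape would reproduce the issue (group and host names here are hypothetical, not the role's required names):

```yaml
# Hypothetical two-cluster inventory with different node counts.
all:
  children:
    cluster_a:
      hosts:
        cluster-a-server-1:
        cluster-a-server-2:
        cluster-a-server-3:
    cluster_b:
      hosts:
        cluster-b-server-1:
        cluster-b-server-2:
```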

@MonolithProjects added the `bug` (Something isn't working) label on Mar 19, 2024
@eliasp (Contributor, Author) commented Mar 25, 2024

Found a few issues with the current approach. Reworking it and will update the PR as soon as possible.

eliasp added 4 commits on March 26, 2024:

  • Don't show the summary only for the first configured cluster, but for each. Instead of utilizing `run_once`, which limits the execution to once per play, just use the existing `when` condition to limit it to the 1st server of each cluster.
  • While `first_server.yml` would only be included for the 1st server of a cluster, `run_once` for setting the `active_server` fact also meant this would only happen once for all clusters instead of once per cluster. Use a loop across the members of each cluster and fact delegation instead, to set the `active_server` fact on each cluster member to the `inventory_hostname` of the server for which `first_server.yml` was included (see the first sketch after this list).
  • Instead of downloading only the kubeconfig of the 1st cluster processed, download it for each cluster.
  • Don't limit the waiting for the remaining nodes to be ready to the 1st cluster with `run_once`. Instead, check whether `inventory_hostname` matches either `active_server` or the 1st member of the group of servers (see the second sketch below).
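
As a rough illustration of the two patterns these commits describe, assuming `rke2_cluster_group_name` names the whole cluster group and `rke2_servers_group_name` its servers (the role's actual variable names and readiness check may differ):

```yaml
# Sketch 1 - inside first_server.yml: set active_server on every member
# of the current cluster via fact delegation, replacing the play-wide
# run_once. This task runs only on the cluster's 1st server.
- name: Set active_server fact on all cluster members
  ansible.builtin.set_fact:
    active_server: "{{ inventory_hostname }}"
  delegate_to: "{{ item }}"
  delegate_facts: true
  loop: "{{ groups[rke2_cluster_group_name] }}"

# Sketch 2 - wait until kubectl reports as many nodes as this cluster
# has members, running the check on the active server or, as a fallback,
# on the 1st server of the group.
- name: Wait for the remaining nodes to be ready
  ansible.builtin.command:
    cmd: kubectl get nodes --no-headers
  register: cluster_nodes
  until: cluster_nodes.stdout_lines | length == groups[rke2_cluster_group_name] | length
  retries: 60
  delay: 10
  changed_when: false
  when: >-
    inventory_hostname == active_server or
    inventory_hostname == groups[rke2_servers_group_name][0]
```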
@eliasp force-pushed the multiple-clusters-active_server branch from 944e220 to 1994af1 on March 26, 2024
@eliasp (Contributor, Author) commented Mar 26, 2024

Pushed a new iteration now, which has worked as expected in multiple local tests so far.

@MonolithProjects self-assigned this on Apr 5, 2024
@MonolithProjects (Collaborator) left a comment

LGTM. Thanks

@MonolithProjects merged commit 8b3a166 into lablabs:main on Apr 5, 2024
5 checks passed
@eliasp deleted the multiple-clusters-active_server branch on April 8, 2024