
Fix active_server for multi-cluster deployments #205

Merged

Conversation

@eliasp (Contributor) commented Mar 19, 2024

Description

When addressing multiple clusters at once within a play, the `active_server` fact would only be determined once, when processing the very first cluster. The task that waits for all nodes and pods to be ready again would then be delegated to that `active_server`.

When all clusters are sized equally, this wouldn't cause any real trouble (although the check would still verify the wrong cluster's nodes), since the number of nodes in the first cluster matches the number of nodes in every other cluster being checked.

But as soon as a cluster with a different number of nodes is present, this causes failures and timeouts: the check waiting for all nodes to be up and running again never reaches the expected number.

This problem is prevented by not utilizing `run_once` and instead verifying, for the include of `first_server.yml`, whether `active_server` is a member of the current cluster.
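
For illustration, a minimal sketch of the `when`-based approach, assuming the group of servers is addressed through a `rke2_servers_group_name` variable (the role's actual variable names may differ):

```yaml
# Previously (simplified): run_once limits the include to a single host
# in the whole play, so active_server is derived from the first cluster only.
# - name: Include tasks for the first server
#   ansible.builtin.include_tasks: first_server.yml
#   run_once: true

# With a plain when condition, the include runs on the 1st server of
# each cluster, so every cluster determines its own active_server.
- name: Include tasks for the first server of each cluster
  ansible.builtin.include_tasks: first_server.yml
  when: inventory_hostname == groups[rke2_servers_group_name][0]
```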

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Small minor change not affecting the Ansible Role code (GitHub Actions Workflow, Documentation etc.)

How Has This Been Tested?

  • added multiple clusters with different numbers of nodes to my inventory (an example shape is sketched below)
  • ran the rke2 role successfully with these changes
  • saw the rke2 role fail without them
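
For reference, a multi-cluster inventory of this shape would reproduce the issue (group and host names here are hypothetical, not the role's required names):

```yaml
# Hypothetical two-cluster inventory with different node counts.
all:
  children:
    cluster_a:
      hosts:
        cluster-a-server-1:
        cluster-a-server-2:
        cluster-a-server-3:
    cluster_b:
      hosts:
        cluster-b-server-1:
        cluster-b-server-2:
```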

@MonolithProjects added the `bug` (Something isn't working) label on Mar 19, 2024
@eliasp (Contributor, Author) commented Mar 25, 2024

Found a few issues with the current approach. Reworking it and will update the PR as soon as possible.

eliasp added 4 commits on March 26, 2024:

  • Don't show the summary only for the first configured cluster, but for each. Instead of utilizing `run_once`, which limits the execution to once per play, just use the existing `when` condition to limit it to the 1st server of each cluster.
  • While `first_server.yml` would only be included for the 1st server of a cluster, `run_once` for setting the `active_server` fact also meant this would only happen once for all clusters instead of once per cluster. Use a loop across the members of each cluster and fact delegation instead, to set the `active_server` fact on each cluster member to the `inventory_hostname` of the server for which `first_server.yml` was included (see the first sketch after this list).
  • Instead of downloading only the kubeconfig of the 1st cluster processed, download it for each cluster.
  • Don't limit the waiting for the remaining nodes to be ready to the 1st cluster with `run_once`. Instead, check whether `inventory_hostname` matches either `active_server` or the 1st member of the group of servers (see the second sketch below).
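
As a rough illustration of the two patterns these commits describe, assuming `rke2_cluster_group_name` names the whole cluster group and `rke2_servers_group_name` its servers (the role's actual variable names and readiness check may differ):

```yaml
# Sketch 1 - inside first_server.yml: set active_server on every member
# of the current cluster via fact delegation, replacing the play-wide
# run_once. This task runs only on the cluster's 1st server.
- name: Set active_server fact on all cluster members
  ansible.builtin.set_fact:
    active_server: "{{ inventory_hostname }}"
  delegate_to: "{{ item }}"
  delegate_facts: true
  loop: "{{ groups[rke2_cluster_group_name] }}"

# Sketch 2 - wait until kubectl reports as many nodes as this cluster
# has members, running the check on the active server or, as a fallback,
# on the 1st server of the group.
- name: Wait for the remaining nodes to be ready
  ansible.builtin.command:
    cmd: kubectl get nodes --no-headers
  register: cluster_nodes
  until: cluster_nodes.stdout_lines | length == groups[rke2_cluster_group_name] | length
  retries: 60
  delay: 10
  changed_when: false
  when: >-
    inventory_hostname == active_server or
    inventory_hostname == groups[rke2_servers_group_name][0]
```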
@eliasp force-pushed the multiple-clusters-active_server branch from 944e220 to 1994af1 on March 26, 2024
@eliasp (Contributor, Author) commented Mar 26, 2024

Pushed a new iteration now, which has worked as expected in multiple local tests so far.

@MonolithProjects self-assigned this on Apr 5, 2024
@MonolithProjects (Collaborator) left a comment

LGTM. Thanks

@MonolithProjects merged commit 8b3a166 into lablabs:main on Apr 5, 2024
5 checks passed
@eliasp deleted the multiple-clusters-active_server branch on April 8, 2024