Detect offline clusters #2933
Open
weyfonk wants to merge 14 commits into rancher:main from weyfonk:594-detect-offline-cluster
Conversation
0xavi0 reviewed on Oct 8, 2024:
I don't know if you want to merge this PR while some comments still hold open questions; it may be better to clarify those first.
Just a couple of observations, but overall LGTM.
weyfonk force-pushed the 594-detect-offline-cluster branch 2 times, most recently from 38a697b to 452e56e on October 8, 2024 09:51
weyfonk force-pushed the 594-detect-offline-cluster branch 3 times, most recently from a1c1b94 to 8a32810 on October 18, 2024 14:51
weyfonk force-pushed the 594-detect-offline-cluster branch 3 times, most recently from 235303e to 8f07ba4 on October 22, 2024 09:21
This needs more discussion around UI/UX.
This allows the Fleet controller to detect offline clusters and update statuses of bundle deployments targeting offline clusters.
Next to do:
* understand how bundle deployment status updates should be propagated to bundles (which currently simply appear as `Modified`) and further up
* write tests (e.g. integration tests, updating cluster status by hand?)
* set sensible defaults (e.g. a higher monitoring interval, and a threshold higher than the agent's cluster status reporting interval)
This enables that state to be reflected upwards in bundle, GitRepo, cluster and cluster group statuses.
This also provides unit tests for detecting offline clusters.
The frequency of cluster status checks is currently hard-coded to 1 minute, but could be made configurable. The threshold for considering a cluster offline now explicitly depends on how often agents report their statuses to the management cluster. Changes to that configured interval should impact the cluster status monitor, which would take the new value into account from its next run onwards.
This fixes an error and ignores a few others to make the linter happy.
This adds checks ensuring that, for offline clusters for which calls to update bundle deployment statuses are expected, those statuses contain `Ready` and `Monitored` conditions with status `False` and reasons reflecting the cluster's offline state.
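The expectation described in this commit can be sketched as follows. This is an illustrative model only: the `Condition` struct and the `offlineConditions` helper are hypothetical stand-ins, not Fleet's actual API types.

```go
package main

import "fmt"

// Condition is a simplified stand-in for a status condition on a bundle
// deployment; the field names here are illustrative, not Fleet's real types.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// offlineConditions sketches the checked expectation: for an offline cluster,
// both the Ready and Monitored conditions carry status False, with a reason
// reflecting the cluster's offline state.
func offlineConditions(clusterName string) []Condition {
	reason := fmt.Sprintf("cluster %s is offline", clusterName)
	return []Condition{
		{Type: "Ready", Status: "False", Reason: reason},
		{Type: "Monitored", Status: "False", Reason: reason},
	}
}

func main() {
	for _, c := range offlineConditions("downstream-1") {
		fmt.Printf("%s=%s (%s)\n", c.Type, c.Status, c.Reason)
	}
}
```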
This ensures that creating a new agent bundle fails with an agent check-in interval set to 0.
This adds a check on the agent check-in interval to cluster import, for consistency with agent bundle updates.
This enables users to determine how often the Fleet controller checks for offline clusters, and with which threshold. If the configured threshold is below three times the check-in interval, that tripled value is used instead.
This optimises updates to bundle deployments, running them only against clusters whose bundle deployments are not yet marked as offline.
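The optimisation amounts to filtering out clusters whose bundle deployments already carry offline conditions before issuing updates. A sketch with illustrative types (the real controller operates on Fleet's Cluster and BundleDeployment resources):

```go
package main

import "fmt"

// cluster is a minimal illustrative type, not Fleet's actual Cluster resource.
type cluster struct {
	name          string
	markedOffline bool // bundle deployments already carry offline conditions
}

// needsOfflineUpdate returns only the clusters whose bundle deployments are
// not yet marked offline, i.e. the ones still requiring a status update.
func needsOfflineUpdate(clusters []cluster) []string {
	var out []string
	for _, c := range clusters {
		if !c.markedOffline {
			out = append(out, c.name)
		}
	}
	return out
}

func main() {
	cs := []cluster{{"a", true}, {"b", false}, {"c", false}}
	fmt.Println(needsOfflineUpdate(cs)) // [b c]
}
```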
This better reflects what is then known about workloads running in such clusters than `False` would.
This should fix Fleet controller deployments complaining about the interval being 0, which should never happen.
Running one cluster status monitor per Fleet controller pod is not necessary and may cause conflicts in sharded setups.
Omitting the agent check-in interval when patching the `fleet-controller` config map would now lead to errors when setting up agents, as the check-in interval would bear the default duration value, i.e. 0s. That interval is now set to a hard-coded value in such tests; its exact value does not matter, as long as it is not zero.
weyfonk force-pushed the 594-detect-offline-cluster branch from 8f07ba4 to 9f8db1c on December 18, 2024 17:08
This adds a cluster status monitor to the Fleet controller, which checks when each cluster last saw its agent online. If more than the expected interval elapses, that cluster is considered offline, and the monitor updates its bundle deployments' statuses to reflect that. This will trigger status updates to bundles, `GitRepo`s, clusters and cluster groups.

Refers to #594.

Open points:
* `Ready` and `Monitored` conditions while clearing modified and non-ready statuses, to prevent outdated messages from appearing in a bundle deployment's display status and further up the chain of status updates (to bundles, then upwards to `GitRepo`s, etc.)