Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect offline clusters #2933

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

weyfonk
Copy link
Contributor

@weyfonk weyfonk commented Oct 7, 2024

This adds a cluster status monitor to the Fleet controller, which checks when each cluster last saw its agent online. If more than the expected interval elapses, that cluster is considered offline, and the monitor updates its bundle deployments' statuses to reflect that. This will trigger status updates to bundles, GitRepos, clusters and cluster groups.

Refers to #594.

Open points:

  • how far, and how fine-grained, do we want to make bundle deployment status updates for offline clusters? This currently takes a fairly basic approach, updating both Ready and Monitored conditions while clearing modified and non-ready statuses, to prevent outdated messages from appearing in a bundle deployment's display status and further up the chain of status updates (to bundles, then upwards to GitRepos, etc)
  • should we make the frequency of monitoring configurable?
  • do we have a way to exclude the local/management cluster from agent-last-seen checks?

@weyfonk weyfonk requested a review from a team as a code owner October 7, 2024 10:15
Copy link
Contributor

@0xavi0 0xavi0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if you want to merge this PR with the comments with open questions, maybe better to clarify first.

Just a couple of observations, but overall LGTM

internal/cmd/controller/clustermonitor/monitor.go Outdated Show resolved Hide resolved
internal/cmd/controller/clustermonitor/monitor_test.go Outdated Show resolved Hide resolved
@weyfonk weyfonk force-pushed the 594-detect-offline-cluster branch 2 times, most recently from 38a697b to 452e56e Compare October 8, 2024 09:51
@kkaempf kkaempf modified the milestones: v2.10.0, v2.9.4 Oct 8, 2024
@weyfonk weyfonk force-pushed the 594-detect-offline-cluster branch 3 times, most recently from a1c1b94 to 8a32810 Compare October 18, 2024 14:51
@weyfonk weyfonk marked this pull request as draft October 18, 2024 16:06
@weyfonk weyfonk force-pushed the 594-detect-offline-cluster branch 3 times, most recently from 235303e to 8f07ba4 Compare October 22, 2024 09:21
@weyfonk weyfonk marked this pull request as ready for review October 22, 2024 09:40
@weyfonk weyfonk marked this pull request as draft October 25, 2024 09:10
@weyfonk weyfonk modified the milestones: v2.9.4, v2.10.0 Oct 25, 2024
@weyfonk
Copy link
Contributor Author

weyfonk commented Oct 25, 2024

This needs more discussion around UI/UX.

@weyfonk weyfonk modified the milestones: v2.10.0, v2.11.0 Oct 30, 2024
This allows the Fleet controller to detect offline clusters and update
statuses of bundle deployments targeting offline clusters.

Next to do:
* understand how bundle deployment status updates should be propagated
  to bundles (which currently simply appear as `Modified`) and further
  up
* write tests (eg. integration tests, updating cluster status by hand?)
* set sensible defaults (eg. monitoring interval higher, and threshold
  higher than agent's cluster status reporting interval)
This enables that state to be reflected upwards in bundle, GitRepo,
cluster and cluster group statuses.
This also provides unit tests for detecting offline clusters.
The frequency of cluster status checks is currently hard-coded to 1
minute, but could be made configurable.

The threshold for considering a cluster offline now explicitly depends
on how often agents report their statuses to the management cluster.
Changes to that configured interval should impact the cluster status
monitor, which would take the new value into account from its next run
onwards.
This fixes an error and ignores a few others to make the linter happy.
This adds checks ensuring that for offline clusters, for which calls to
update bundle deployment statuses are expected, those statuses contain
`Ready` and `Monitored` conditions with status `False` and reasons
reflecting the cluster's offline status.
This ensures that creating a new agent bundle fails with an agent
check-in interval set to 0.
This adds a check on the agent check-in interval to cluster import, for
consistency with agent bundle updates.
This enables users to determine how often the Fleet controller will
check for offline clusters, and based on which threshold. If the
configured threshold is below the triple of the check-in interval, that
tripled value will be used instead.
This optimises updates to bundle deployments, running them only against
clusters which bundle deployments are not yet marked as offline.
This better reflects what is then known about workloads running in such
clusters than `False`.
This should fix Fleet controller deployments complaining about the
interval being 0 when it should never be.
Running one cluster status monitor per Fleet controller pod is not
necessary and may cause conflicts in sharded setups.
Omitting the agent check-in interval when patching the
`fleet-controller` config map would now lead to errors when setting up
agents with a check-in interval bearing the default value for a
duration, ie 0s.
That interval is now set with a hard-coded value, which is of no
importance for such tests, for the sake of not being zero.
@weyfonk weyfonk force-pushed the 594-detect-offline-cluster branch from 8f07ba4 to 9f8db1c Compare December 18, 2024 17:08
@weyfonk weyfonk marked this pull request as ready for review December 18, 2024 17:22
@manno manno removed this from the v2.11.0 milestone Jan 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants