Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The description of cluster paused and skipping remediation of MHC should be clearer #9026

Closed
haijianyang opened this issue Jul 21, 2023 · 13 comments · Fixed by #10817
Closed
Assignees
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@haijianyang
Copy link
Contributor

What steps did you take and what happened?

Original link https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking.html#skipping-remediation

What did you expect to happen?

Original description:
`Implicit skipping when the resource is paused (using cluster.x-k8s.io/paused annotation):

When a cluster is paused, none of the machines in that cluster are considered for remediation.
When a machine is paused, only that machine is not considered for remediation.
A cluster or a machine is usually paused automatically by Cluster API when it detects a migration.`

MHC Controller

// Return early if the object or Cluster is paused.
if annotations.IsPaused(cluster, m) {
	log.Info("Reconciliation is paused for this object")
	return ctrl.Result{}, nil
}

// IsPaused returns true if the Cluster is paused or the object has the `paused` annotation.
func IsPaused(cluster *clusterv1.Cluster, o metav1.Object) bool {
	if cluster.Spec.Paused {
		return true
	}
	return HasPaused(o)
}

According to the MHC code, only when cluster.Spec.Paused is true, none of the machines in that cluster are considered for remediation.

Cluster with cluster.x-k8s.io/paused annotation will only pause the reconciliation of the cluster, not the reconciliation of other resources including MHC.

Cluster API version

v1.4.4

Kubernetes version

No response

Anything else you would like to add?

No response

Label(s) to be applied

/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 21, 2023
@killianmuldoon
Copy link
Contributor

/triage accepted

You're right that this is confusing - would be a good idea to clean up the documentation around this.

/help

@k8s-ci-robot
Copy link
Contributor

@killianmuldoon:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/triage accepted

You're right that this is confusing - would be a good idea to clean up the documentation around this.

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 21, 2023
@SubhasmitaSw
Copy link
Contributor

/assign

@SubhasmitaSw
Copy link
Contributor

@killianmuldoon the controller spec definition has to be modified or just the documentation needs more clarity?

@killianmuldoon
Copy link
Contributor

For this we just need to clear up the documentation.

@SubhasmitaSw
Copy link
Contributor

cool, thanks @killianmuldoon

@antonblr
Copy link

antonblr commented Dec 6, 2023

Should IsPaused check for Cluster annotation as well? It is checking for cluster.spec.paused, but it is not even documented. Only annotations are documented, and MHC explicitly calling func named annotations.IsPaused func, not spec.IsPaused, hinting that it will check the annotations.

Something like:

// util/annotations/helpers.go

// IsPaused returns true if the Cluster or object is paused.
func IsPaused(cluster *clusterv1.Cluster, o metav1.Object) bool {
	if cluster.Spec.Paused {
		return true
	}

        // Cluster or object  has the `paused` annotation
	return HasPaused(cluster) || HasPaused(o)
}

@sbueringer
Copy link
Member

Not sure if we should introduce a second way to achieve the same thing (i.e. pausing a Cluster). If there is documentation missing we can update the documentation instead of changing the implementation.

@fabriziopandini
Copy link
Member

/priority backlog

@k8s-ci-robot k8s-ci-robot added the priority/backlog Higher priority than priority/awaiting-more-evidence. label Apr 11, 2024
@enxebre
Copy link
Member

enxebre commented May 29, 2024

@fabriziopandini shall we close this in favour of #10130?

@fabriziopandini
Copy link
Member

According to the MHC code, only when cluster.Spec.Paused is true, none of the machines in that cluster are considered for remediation.

AFAIK this is not true, or at least it is only partially true.

  • MHC does not remediate any machine in the cluster if the cluster is paused (pausing the cluster is like pausing all the MHC resources belonging to it).
  • A specific MHC resource does not remediate any target machine if the MHC instance is paused.
  • Paused machines are not remediated by MHC.

Given that, I suggest improving the note in https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking.html#skipping-remediation to

There are scenarios where remediation for a machine may be undesirable (eg. during cluster migration using clusterctl move). For such cases, MachineHealthCheck provides the following mechanisms to skip remediation.

- Users can skip remediation for a specific machine by setting the cluster.x-k8s.io/skip-remediation annotation on it.
- Paused Machines (Machines with the cluster.x-k8s.io/paused annotation) are not considered for remediation.
- If a specific MHC resource is paused (using cluster.x-k8s.io/paused annotation), it will stop to remediate the corresponding target machines.
- If the Cluster is paused (using the cluster.x-k8s.io/paused annotation or by setting cluster.spec.paused to true), all the MHC resources belonging to the Cluster will be implicitly paused, and thus stop remediating target machines.

Note: the last option (pausing the Cluster) is the one used by clusterctl move.

/good-first-issue

@fabriziopandini shall we close this in favour of #10130?

This issue is about improving the description of the MHC behaviuor in the book.
The linked issue instead is about surfacing what is paused, so I don't think they overlap.

@k8s-ci-robot
Copy link
Contributor

@fabriziopandini:
This request has been marked as suitable for new contributors.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

According to the MHC code, only when cluster.Spec.Paused is true, none of the machines in that cluster are considered for remediation.

AFAIK this is not true, or at least it is only partially true.

  • MHC does not remediate any machine in the cluster if the cluster is paused (pausing the cluster is like pausing all the MHC resources belonging to it).
  • A specific MHC resource does not remediate any target machine if the MHC instance is paused.
  • Paused machines are not remediated by MHC.

Given that, I suggest improving the note in https://cluster-api.sigs.k8s.io/tasks/automated-machine-management/healthchecking.html#skipping-remediation to

There are scenarios where remediation for a machine may be undesirable (eg. during cluster migration using clusterctl move). For such cases, MachineHealthCheck provides the following mechanisms to skip remediation.

- Users can skip remediation for a specific machine by setting the cluster.x-k8s.io/skip-remediation annotation on it.
- Paused Machines (Machines with the cluster.x-k8s.io/paused annotation) are not considered for remediation.
- If a specific MHC resource is paused (using cluster.x-k8s.io/paused annotation), it will stop to remediate the corresponding target machines.
- If the Cluster is paused (using the cluster.x-k8s.io/paused annotation or by setting cluster.spec.paused to true), all the MHC resources belonging to the Cluster will be implicitly paused, and thus stop remediating target machines.

Note: the last option (pausing the Cluster) is the one used by clusterctl move.

/good-first-issue

@fabriziopandini shall we close this in favour of #10130?

This issue is about improving the description of the MHC behaviuor in the book.
The linked issue instead is about surfacing what is paused, so I don't think they overlap.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. label May 30, 2024
@SD-13
Copy link
Contributor

SD-13 commented Jun 29, 2024

/assign

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
9 participants