🐛 Do not update MS status when unable to get workload cluster or machine node #10436

Open · wants to merge 1 commit into base: main
13 changes: 10 additions & 3 deletions internal/controllers/machineset/machineset_controller.go
@@ -187,6 +187,10 @@ func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (_ ctrl.Re
 	// Requeue if the reconcile failed because the ClusterCacheTracker was locked for
 	// the current cluster because of concurrent access.
 	if errors.Is(err, remote.ErrClusterLocked) {
+		if aggr, ok := err.(kerrors.Aggregate); ok && len(aggr.Errors()) > 1 {
+			// Print the errors if it's not only ErrClusterLocked.
+			log.Info(aggr.Error())
+		}
 		log.V(5).Info("Requeuing because another worker has the lock on the ClusterCacheTracker")
 		return ctrl.Result{RequeueAfter: time.Minute}, nil
 	}
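For context, here is a minimal, self-contained sketch of the pattern this hunk relies on; it is not CAPI code, and errClusterLocked is a hypothetical stand-in for remote.ErrClusterLocked. errors.Is looks inside a kerrors.Aggregate, so the requeue branch still fires when the lock error is bundled with other failures, and the added check only surfaces the aggregate when it carries more than the lock error.

package main

import (
	"errors"
	"fmt"

	kerrors "k8s.io/apimachinery/pkg/util/errors"
)

// errClusterLocked is a hypothetical stand-in for remote.ErrClusterLocked.
var errClusterLocked = errors.New("cluster is locked already")

func main() {
	// An aggregate carrying the lock error plus an unrelated failure,
	// as a reconcile that hit several problems might return.
	var err error = kerrors.NewAggregate([]error{
		errClusterLocked,
		errors.New("failed to update MachineSet status"),
	})

	// The aggregate implements Is, so errors.Is finds the sentinel inside it.
	if errors.Is(err, errClusterLocked) {
		// Only surface the other errors when the aggregate is not just the lock error.
		if aggr, ok := err.(kerrors.Aggregate); ok && len(aggr.Errors()) > 1 {
			fmt.Println("additional errors:", aggr.Error())
		}
		fmt.Println("requeuing after a minute")
	}
}

With a single-element aggregate the inner branch is skipped and only the quiet requeue path remains, which matches the intent of the added lines.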
@@ -852,7 +856,8 @@ func (r *Reconciler) shouldAdopt(ms *clusterv1.MachineSet) bool {
 }
 
 // updateStatus updates the Status field for the MachineSet
-// It checks for the current state of the replicas and updates the Status of the MachineSet.
+// It checks for the current state of the replicas and updates the Status field of the MachineSet.
+// When unable to retrieve the Node status, it returns an error and won't update the Status field of the MachineSet.
 func (r *Reconciler) updateStatus(ctx context.Context, cluster *clusterv1.Cluster, ms *clusterv1.MachineSet, filteredMachines []*clusterv1.Machine) error {
 	log := ctrl.LoggerFrom(ctx)
 	newStatus := ms.Status.DeepCopy()
@@ -890,8 +895,7 @@ func (r *Reconciler) updateStatus(ctx context.Context, cluster *clusterv1.Cluste
 
 	node, err := r.getMachineNode(ctx, cluster, machine)
 	if err != nil && machine.GetDeletionTimestamp().IsZero() {
-		log.Error(err, "Unable to retrieve Node status", "Node", klog.KObj(node))
-		continue
+		return errors.Wrapf(err, "unable to retrieve the status of Node %s", klog.KObj(node))
sbueringer (Member) commented on Apr 15, 2024:

I think this doesn't address:

> For MachineSet specifically: it definitely solves the "happy path", but it makes the "unhappy path" worse by freezing the status indefinitely. Without the changes, the ready and available replicas dropped to 0 when the communication broke down. It is probably a matter of preference whether, in case of a communication breakdown, we prefer to keep the ready and available replicas as they are or drop them to 0. But another side effect is also that we don't update ObservedGeneration anymore, which definitely seems wrong (similar for the conditions).

Specifically:

> "unhappy path" worse by freezing the status indefinitely

> But another side effect is also that we don't update ObservedGeneration anymore, which definitely seems wrong (similar for the conditions)

jessehu (Contributor Author) replied:

Thanks @sbueringer.

> "unhappy path" worse by freezing the status indefinitely

This PR does not update any fields of MS.Status when it is unable to get the workload cluster or the machine Node, whether due to ErrClusterLocked or any other error. ErrClusterLocked should recover soon after reconciling again, and an error getting the machine Node (e.g. a network issue or a temporarily unavailable apiserver) should likewise recover soon after reconciling again. Are there other "unhappy paths" you meant?

> But another side effect is also that we don't update ObservedGeneration anymore, which definitely seems wrong (similar for the conditions).

Since this PR returns an error at line 891, it won't update ms.Status at line 904 or newStatus.ObservedGeneration at line 924. This is as expected.


Member replied:

> an error getting the machine Node (e.g. a network issue or a temporarily unavailable apiserver) should likewise recover soon after reconciling again. Are there other "unhappy paths" you meant?

Yup, the apiserver being down for a longer period of time / indefinitely.

jessehu (Contributor Author) replied:

If the apiserver goes down for a longer period of time or indefinitely, the KCP controller should catch that and handle it correctly.

Member replied:

KCP won't always be able to recover (if that's what you meant)

jessehu (Contributor Author) commented on May 9, 2024:

Hi @fabriziopandini, as discussed with @sbueringer in the current thread, IMHO this PR provides a simple patch for the MS status update issue described in #10195, and there would be no need to introduce the suggested handling code (provided by @sbueringer and, IMHO, a little complex):

  1. I replied with my thoughts in 🐛 Do not update KCP and MS status when unable to get workload cluster #10229 (comment) and also removed the KCP status handling code as you suggested.
  2. If the apiserver goes down for a longer period of time or indefinitely, the MS status won't be updated, but the KCP controller should catch that and handle it correctly. In this case updating the MS status won't help.

BTW, I was on vacation last week, so sorry for the late reply. Thanks!

Member replied:

I will look into it as soon as possible, but we have already invested a lot of time in trying to find an acceptable way forward.

sbueringer (Member) commented on May 13, 2024:

I tried to express above that it's not acceptable for the MS status to permanently stop being updated when a workload cluster stays unreachable.

jessehu (Contributor Author) replied:

As I described above:
2) If the apiserver goes down for a longer period of time or indefinitely, the MS status won't be updated, but the KCP controller should catch that and handle it correctly. In this case updating the MS status won't help.
3) Even if we want to update the MS status in this special case, it would not be good to set the MS ready replicas to 0; instead, IMHO, a new UnknownReplicas status should be added (but that changes the current API definition and is not a trivial change).

Member replied:

Sure, updating the MS status won't contribute to KCP trying to remediate the problem, but it will actually keep the status up to date. The status is much more than just the replica fields.

 	}
 
 	if noderefutil.IsNodeReady(node) {
@@ -964,6 +968,9 @@ func (r *Reconciler) getMachineNode(ctx context.Context, cluster *clusterv1.Clus
 	}
 	node := &corev1.Node{}
 	if err := remoteClient.Get(ctx, client.ObjectKey{Name: machine.Status.NodeRef.Name}, node); err != nil {
+		if apierrors.IsNotFound(err) {
+			return nil, nil
+		}
 		return nil, errors.Wrapf(err, "error retrieving node %s for machine %s/%s", machine.Status.NodeRef.Name, machine.Namespace, machine.Name)
 	}
 	return node, nil
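Taken together with the updateStatus hunk above, the error contract after this change appears to be: a Node that no longer exists comes back as (nil, nil) rather than as a failure, so the caller can treat the Machine as not having a ready Node, while any other lookup failure propagates out of updateStatus so the previously reported status is left untouched for that reconcile. Below is a minimal, self-contained sketch of that contract; the getNode helper and the lookup callbacks are hypothetical stand-ins, not the real controller-runtime client.

package main

import (
	"errors"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// getNode mirrors the shape of getMachineNode after this PR: a missing Node is
// not an error (nil, nil); any other failure is handed back to the caller, which
// then skips the status update for this reconcile.
func getNode(name string, lookup func(string) (*corev1.Node, error)) (*corev1.Node, error) {
	node, err := lookup(name)
	if err != nil {
		if apierrors.IsNotFound(err) {
			return nil, nil
		}
		return nil, fmt.Errorf("error retrieving node %s: %w", name, err)
	}
	return node, nil
}

func main() {
	notFound := func(name string) (*corev1.Node, error) {
		return nil, apierrors.NewNotFound(schema.GroupResource{Resource: "nodes"}, name)
	}
	flaky := func(string) (*corev1.Node, error) {
		return nil, errors.New("the server was unable to return a response in time")
	}

	if node, err := getNode("worker-0", notFound); err == nil && node == nil {
		fmt.Println("node gone: not counted as ready, status update continues")
	}
	if _, err := getNode("worker-0", flaky); err != nil {
		fmt.Println("transient failure: status update aborted:", err)
	}
}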