AzureMachinePoolMachine marked as terminally failed when ProvisioningState is set to Failed #5303

Closed
@mweibel

/kind bug

What steps did you take and what happened:
When a VMSS VM has a provisioning state of Failed, the AzureMachinePoolMachine (AMPM) controller marks the AMPM as failed:

machineScope.SetFailureReason(capierrors.UpdateMachineError)
machineScope.SetFailureMessage(errors.Errorf("Azure VM state is %s", state))

It then stops reconciling the AMPM:

if machineScope.AzureMachinePool.Status.FailureReason != nil || machineScope.AzureMachinePool.Status.FailureMessage != nil {
	log.Info("Error state detected, skipping reconciliation")
	return reconcile.Result{}, nil
}
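
The combined effect of the two snippets above can be illustrated with a small, self-contained simulation. This is only a sketch: `poolMachine`, `reconcile`, and the reason string are hypothetical stand-ins and not the actual CAPZ types or controller code. It shows that once the failure fields are set, a later recovery of the VM to Succeeded is never observed.

```go
package main

import "fmt"

// poolMachine is a hypothetical, simplified stand-in for the AMPM status.
// It is not the real CAPZ type.
type poolMachine struct {
	FailureReason  *string
	FailureMessage *string
	Ready          bool
}

// reasonUpdateError is a placeholder reason string used for illustration only.
const reasonUpdateError = "UpdateError"

// reconcile mimics the flow described in the issue: a Failed provisioning
// state sets the failure fields, and any later call returns early without
// looking at the VM state again. It reports whether reconciliation ran.
func reconcile(m *poolMachine, provisioningState string) bool {
	if m.FailureReason != nil || m.FailureMessage != nil {
		// Error state detected, skipping reconciliation.
		return false
	}
	if provisioningState == "Failed" {
		reason := reasonUpdateError
		msg := fmt.Sprintf("Azure VM state is %s", provisioningState)
		m.FailureReason = &reason
		m.FailureMessage = &msg
		m.Ready = false
		return true
	}
	m.Ready = provisioningState == "Succeeded"
	return true
}

func main() {
	m := &poolMachine{}
	// The VM is healthy, fails temporarily, then recovers on its own.
	for _, state := range []string{"Succeeded", "Failed", "Succeeded"} {
		ran := reconcile(m, state)
		fmt.Printf("state=%s reconciled=%t ready=%t\n", state, ran, m.Ready)
	}
}
```

In this model the third iteration is skipped, so the machine stays not ready even though the simulated VM has returned to Succeeded.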

According to the Cluster API provider contract, CAPI will also stop reconciling this Machine once the failure has bubbled up to it:

Note: once any of failureReason or failureMessage surface on the machine who is referencing the infrastructureMachine object, they cannot be restored anymore (it is considered a terminal error; the only way to recover is to delete and recreate the machine). Also, if the machine is under control of a MachineHealthCheck instance, the machine will be automatically remediated.

(emphasis mine)

However, VMSS VMs may sometimes go into ProvisioningState Failed and then recover on their own. We just had several such cases. In the VMSS Activity Log they were marked as Health issues (PlatformInitiated Downtime).

What did you expect to happen:
I'd expect an AMPM to go into a terminal failed state only when it really can't recover.

Anything else you would like to add:
I'm happy to provide a PR for this, but I'd need some information first, for example whether there is a way to reliably determine that a Failed VMSS VM can't be recovered.
As a quick fix (and to test this), I'll probably resort to not setting FailureReason/Message in this case.
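
A rough sketch of what that quick fix could look like, again using hypothetical names (`machineStatus`, `reconcileNonTerminal`, `retryInterval`) rather than the real CAPZ scope and controller: a Failed provisioning state marks the machine as not ready and asks for a requeue, but leaves the failure fields untouched so that a later recovery is still picked up. The 30-second interval is an arbitrary assumption.

```go
package main

import (
	"fmt"
	"time"
)

// machineStatus is a hypothetical, simplified status for illustration.
// It is not the real AzureMachinePoolMachine status type.
type machineStatus struct {
	FailureReason     *string
	FailureMessage    *string
	Ready             bool
	LastObservedState string
}

// retryInterval is an assumed requeue delay, chosen arbitrarily.
const retryInterval = 30 * time.Second

// reconcileNonTerminal treats Failed like any other non-ready state:
// it records the observed state and requests a requeue, but never sets
// FailureReason or FailureMessage. A zero duration means no requeue.
func reconcileNonTerminal(s *machineStatus, provisioningState string) time.Duration {
	s.LastObservedState = provisioningState
	if provisioningState == "Succeeded" {
		s.Ready = true
		return 0
	}
	// Failed, Updating, Creating, ...: not ready yet, check again later.
	s.Ready = false
	return retryInterval
}

func main() {
	s := &machineStatus{}
	for _, state := range []string{"Succeeded", "Failed", "Succeeded"} {
		requeue := reconcileNonTerminal(s, state)
		fmt.Printf("state=%s ready=%t requeueAfter=%s terminal=%t\n",
			state, s.Ready, requeue, s.FailureReason != nil)
	}
}
```

The trade-off of this approach is that a VM that never recovers is also never marked as terminally failed, so detecting truly unrecoverable machines would have to come from somewhere else, for example a MachineHealthCheck.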

Environment:

  • cluster-api-provider-azure version: 1.17.2
  • Kubernetes version: (use kubectl version): 1.31.1
  • OS (e.g. from /etc/os-release): linux/windows

Labels

  • kind/bug: Categorizes issue or PR as related to a bug.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
