☂️ Enable MCM providers to force delete machines stuck in Terminal state #810

Open
1 of 6 tasks
himanshu-kun opened this issue May 4, 2023 · 4 comments
Labels
area/quality Output qualification (tests, checks, scans, automation in general, etc.) related area/robustness Robustness, reliability, resilience related kind/enhancement Enhancement, improvement, extension lifecycle/stale Nobody worked on this for 6 months (will further age) priority/3 Priority (lower number equals higher priority)

Comments

@himanshu-kun
Contributor

himanshu-kun commented May 4, 2023

How to categorize this issue?

/area quality
/area robustness
/kind enhancement
/priority 3

What would you like to be added:

MCM should be able to force delete machines that are stuck in a terminal state (an irrecoverable state where API calls other than force delete won't work).

Why is this needed:

We have seen issues (currently on Azure only) where the VM was stuck in a terminal state (refer to Live Ticket # 2946):

VM deletion failed due to - machine codes error: 
code = [Internal] message = [Code="OSProvisioningTimedOut" Message="
OS Provisioning failure has reached terminal state and is non-recoverable for VM 
'shoot--hc-can-az--prod-az-haas-hana-vsmp4-z2-8667b-5frj5'. 
Consider deleting and recreating this virtual machine. 

The above error in Azure can be reproduced possibly by

MCM fails to recover from this situation because, in Azure, we detach the disks first (an Update operation) and only then go for DeleteVM(). Since disk detachment is never triggered while the VM is in the terminal state, the situation becomes irrecoverable and the Delete flow of MCM keeps repeating.

Similar situations could be seen in other providers where normal Delete won't work and a force delete might be needed.

Example ticket canary # 4358
Proposal:

  • Have an alternate Force Delete flow, which is triggered if the normal Delete flow fails a threshold number of times.
    • The error returned by the provider needs to be checked first, as force delete shouldn't be triggered for errors that call for a backoff instead, e.g.:
      • API rate limits (often seen in CCloud)
      • invalid credentials
  • If an annotation is placed on the machine object, we can trigger a force delete, which might vary from provider to provider.

Providers:

@himanshu-kun himanshu-kun added the kind/enhancement Enhancement, improvement, extension label May 4, 2023
@gardener-robot gardener-robot added area/quality Output qualification (tests, checks, scans, automation in general, etc.) related area/robustness Robustness, reliability, resilience related priority/3 Priority (lower number equals higher priority) labels May 4, 2023
@himanshu-kun
Contributor Author

Post-grooming discussion

We'll consider this once the cascade delete feature is in, because with cascade delete the delete call is the one that is issued first. Cascade delete issue: gardener/machine-controller-manager-provider-azure#91

@unmarshall
Contributor

MOM

Attendees: @himanshu-kun , @rishabh-11 , @elankath , @unmarshall
MCServer today allows configuration of machine-creation-timeout but does not have a corresponding machine-deletion-timeout. We need a way to determine the right time to use the force delete option for machine resources (NIC, disks and VM). When we overhaul the state transitions for machines in MCM, we should think about defining a timeout after which a force deletion will happen.

@himanshu-kun himanshu-kun added the exp/beginner Issue that requires only basic skills label Sep 12, 2023
@unmarshall
Contributor

We must observe whether force deletion of the VM is still required, since we no longer detach the disks first but instead leverage cascade delete. Forceful deletion of VMs is a nice feature to have, but going forward it might be required only in rare situations. How rare, we can observe and ascertain.

@unmarshall
Contributor

unmarshall commented Nov 28, 2023

We (@himanshu-kun and I) noticed that, at least in Azure, there is no use case for force deletion of a VM whose ProvisioningState is set to Failed. A simple VM delete without force deletion is sufficient. If the disks and NIC have DeleteOption set to Detach, then triggering the delete will disassociate these resources from the VM. If the DeleteOption is set to Delete (cascade delete), then deleting the VM will trigger the deletion of the associated resources as well.
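
As a toy model of the two DeleteOption outcomes described above (the types below are local stand-ins written for this sketch, not the Azure SDK types, and the resource names are made up):

```go
package main

import "fmt"

// DeleteOption is a local stand-in for the per-resource setting discussed above.
type DeleteOption string

const (
	DeleteOptionDetach DeleteOption = "Detach"
	DeleteOptionDelete DeleteOption = "Delete"
)

// Resource models a disk or NIC attached to a VM (hypothetical type).
type Resource struct {
	Name         string
	DeleteOption DeleteOption
}

// survivingResources returns the names of resources that would remain,
// disassociated from the VM, after a plain VM delete.
func survivingResources(attached []Resource) []string {
	var kept []string
	for _, r := range attached {
		// Detach: the resource is only disassociated from the VM.
		// Delete: the resource is removed together with the VM (cascade delete).
		if r.DeleteOption == DeleteOptionDetach {
			kept = append(kept, r.Name)
		}
	}
	return kept
}

func main() {
	attached := []Resource{
		{Name: "os-disk", DeleteOption: DeleteOptionDelete},
		{Name: "data-disk", DeleteOption: DeleteOptionDetach},
		{Name: "nic", DeleteOption: DeleteOptionDetach},
	}
	fmt.Println(survivingResources(attached))
}
```

Resources left over with Detach would still need their own cleanup calls, whereas with Delete the single VM delete covers them.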

@himanshu-kun himanshu-kun removed the exp/beginner Issue that requires only basic skills label Dec 11, 2023
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Oct 28, 2024
3 participants