Fix/longhorn #1596

Merged
merged 4 commits from fix/longhorn into master on Dec 6, 2024

Conversation

@Despire (Contributor) commented on Dec 4, 2024

Closes #1570

This PR moves away from the current approach to volume replication, in which the replica count is temporarily increased and then decreased to force a replica onto another node while maintaining the replica count specified in the StorageClass. Instead, it uses the Longhorn setting block-for-eviction-if-last-replica, which has the following benefits:

  • Protects data by preventing the drain operation from completing until each volume has a healthy replica available on another node.
  • Automatically evicts replicas, so the user does not need to do it manually (through the UI).
  • The drain operation is only as slow and data-intensive as is necessary to protect data.

With this setting, Longhorn will try to maintain the replica count defined in the StorageClass (provided the required number of nodes exists). On node deletion, the setting blocks the deletion if the node being deleted holds the last healthy replica of a volume, until a new replica is created on another node, after which the deletion continues as expected.
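
For context, here is a minimal sketch of how such a drain policy could be switched on from Go via the dynamic client. This is illustrative, not the code in this PR; the GVR and the exact value string (recent Longhorn releases document the setting as node-drain-policy with the value block-for-eviction-if-contains-last-replica) should be verified against the Longhorn version in use:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: running out-of-cluster against the local kubeconfig.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Longhorn exposes its cluster settings as namespaced Setting custom
	// resources; the setting's value lives in a top-level "value" field.
	settings := schema.GroupVersionResource{
		Group:    "longhorn.io",
		Version:  "v1beta2",
		Resource: "settings",
	}
	patch := []byte(`{"value":"block-for-eviction-if-contains-last-replica"}`)

	s, err := dyn.Resource(settings).Namespace("longhorn-system").
		Patch(context.Background(), "node-drain-policy",
			types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("updated Longhorn setting:", s.GetName())
}
```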

Testing was done on both dynamic and static nodes: I deleted entire nodepools on which all of a volume's replicas lived, forcing the replicas to move to another available nodepool. From time to time the eviction of the replicas would stall for ~10-15 minutes, but it always eventually continued and never got permanently stuck.

Furthermore, I had to disable the concurrent cordoning of nodes: I encountered a deadlock during node deletion when every node hosting a volume's replicas was scheduled for deletion at once. The PR opts instead for cordoning and deleting nodes one by one; a sketch of this sequential flow follows below.
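
A minimal sketch of the one-by-one approach using the upstream kubectl drain helpers (the label selector and timeout are hypothetical; this is not the exact code in the PR):

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/kubectl/pkg/drain"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	helper := &drain.Helper{
		Ctx:                 context.Background(),
		Client:              client,
		IgnoreAllDaemonSets: true,
		DeleteEmptyDirData:  true,
		// Generous timeout: with a block-for-eviction-* policy the drain
		// legitimately waits until replicas are rebuilt elsewhere.
		Timeout: 30 * time.Minute,
		Out:     os.Stdout,
		ErrOut:  os.Stderr,
	}

	// Hypothetical selector for the nodepool being deleted.
	nodes, err := client.CoreV1().Nodes().List(context.Background(),
		metav1.ListOptions{LabelSelector: "claudie.io/nodepool=example-pool"})
	if err != nil {
		log.Fatal(err)
	}

	// Cordon, drain, and delete strictly one node at a time, so Longhorn
	// always has an uncordoned node left to receive the evicted replicas.
	for i := range nodes.Items {
		node := &nodes.Items[i]
		if err := drain.RunCordonOrUncordon(helper, node, true); err != nil {
			log.Fatal(err)
		}
		if err := drain.RunNodeDrain(helper, node.Name); err != nil {
			log.Fatal(err)
		}
		if err := client.CoreV1().Nodes().Delete(context.Background(),
			node.Name, metav1.DeleteOptions{}); err != nil {
			log.Fatal(err)
		}
	}
}
```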

Additionally, another bug was spotted where the StorageClasses for providers defined in the InputManifest were not correctly cleaned up. The Longhorn annotations were also moved to the PatchAnnotations step in the Kuber microservice, so there is a single place where annotations are applied.
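
To illustrate the single-point annotation flow, here is a sketch of applying a Longhorn node annotation with a strategic merge patch. The helper name and tag value are made up; the annotation key node.longhorn.io/default-node-tags comes from Longhorn's defaults-via-annotation feature and is only an assumed example of what gets applied:

```go
package longhorn

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// applyLonghornAnnotations sketches a single helper through which all
// Longhorn node annotations would be applied (assumed shape, not the
// PR's actual code). The tag value below is a made-up example.
func applyLonghornAnnotations(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	patch := []byte(`{"metadata":{"annotations":{` +
		`"node.longhorn.io/default-node-tags":"[\"example-pool\"]"}}}`)
	_, err := client.CoreV1().Nodes().Patch(ctx, nodeName,
		types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}
```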

Despire and others added 4 commits December 3, 2024 15:14
Merge-commit conflicts resolved in:
  • services/kuber/server/domain/utils/longhorn/longhorn.go
  • services/kuber/server/domain/utils/nodes/delete.go
@JKBGIT1 (Contributor) left a comment:


LGTM 👍

@Despire added this pull request to the merge queue on Dec 6, 2024
Merged via the merge queue into master with commit 956d889 on Dec 6, 2024
@Despire deleted the fix/longhorn branch on December 6, 2024 at 14:48
Closes issue #1570: Bug: Kuber fails to delete node.