
Controller endlessly creates BEST_POSSIBLE znodes during WAGED migration #2949

Open
GrantPSpencer opened this issue Oct 15, 2024 · 0 comments
Labels: bug

Describe the bug

Under specific circumstances, WAGED migration will cause the controller to endlessly write new znodes under BEST_POSSIBLE.

To Reproduce

Some test code to reproduce the issue:
GrantPSpencer#45

  1. Have CRUSHED resources
  2. Disable a resource's partition on all instances in the cluster (essentially making no valid assignment possible)
  3. Switch the resource to WAGED
  4. Observe the behavior (a rough sketch of these steps follows below)
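
For reference, a minimal sketch of these steps using HelixAdmin. This assumes a running cluster; the ZK address, cluster name, resource name, and partition name are placeholders, and resource_0 is assumed to be an existing single-partition CRUSHED (FULL_AUTO) resource.

```java
import java.util.Collections;

import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;

public class WagedMigrationRepro {
  public static void main(String[] args) {
    String zkAddr = "localhost:2181";   // placeholder ZK address
    String cluster = "TEST_CLUSTER";    // placeholder cluster with CRUSHED resources
    String resource = "resource_0";     // placeholder single-partition CRUSHED resource
    String partition = "resource_0_0";  // placeholder: the resource's only partition

    ZKHelixAdmin admin = new ZKHelixAdmin(zkAddr);

    // Step 2: disable the partition on every instance so no valid assignment exists
    for (String instance : admin.getInstancesInCluster(cluster)) {
      admin.enablePartition(false, cluster, instance, resource,
          Collections.singletonList(partition));
    }

    // Step 3: switch the resource to WAGED by pointing its IdealState at the
    // WAGED rebalancer class
    IdealState idealState = admin.getResourceIdealState(cluster, resource);
    idealState.setRebalancerClassName(
        "org.apache.helix.controller.rebalancer.waged.WagedRebalancer");
    admin.setResourceIdealState(cluster, resource, idealState);

    // Step 4: watch the BEST_POSSIBLE path in the assignment metadata store;
    // new znodes keep getting written
    admin.close();
  }
}
```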

Expected behavior

The rebalance should fail, and new BEST_POSSIBLE znodes should not be endlessly created.

The fix is not straightforward. See below for my thoughts on what triggers this. I think the true solution might be to have partialRebalance correctly fail to generate a mapping, rather than tweaking our persist-assignment logic.

Additional context

Say we have resource_0 and resource_1, both currently using CRUSHED. We have disabled resource_0's only partition on all instances in the cluster, so while using CRUSHED, resource_0's current state is OFFLINE on every instance. Then we switch the resource to WAGED:

The WAGED rebalancer's emergencyRebalance calls _assignmentManager.getBestPossibleAssignment(...), which looks at the in-memory (or ZK-stored) assignment and fills in any missing resources with their current states (resource_0's current states are still all OFFLINE). A best possible assignment containing resource_0's and resource_1's current states is then stored in ZK and in memory.
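
For illustration, here is a simplified sketch of that backfill behavior. This is not the actual Helix implementation; currentStateAsAssignment is a hypothetical helper standing in for the current-state conversion.

```java
// Simplified illustration only, not the real Helix code.
Map<String, ResourceAssignment> getBestPossibleAssignment(
    Map<String, ResourceAssignment> storedBestPossible,  // from memory or ZK
    Collection<String> resourcesToRebalance) {
  Map<String, ResourceAssignment> bestPossible = new HashMap<>(storedBestPossible);
  for (String resource : resourcesToRebalance) {
    // resource_0 is missing from the stored map, so it is backfilled from its
    // current states, which are all OFFLINE because every replica is disabled.
    bestPossible.putIfAbsent(resource, currentStateAsAssignment(resource));
  }
  return bestPossible;
}
```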

We then trigger a partial rebalance. When partialRebalance calls WagedRebalanceUtil.calculateAssignment, it passes in a clusterModel that has resource_0 removed from the replica map, so toBeAssignedReplicas does not include resource_0. It then produces a result it considers valid that only includes resource_1. We then call _assignmentMetadataStore.asyncUpdateBestPossibleAssignmentCache, which stores the result in memory and triggers an onDemand rebalance.

We start the onDemand rebalance, and our in-memory best possible only has an assignment for resource_1. emergencyRebalance once again calls _assignmentManager.getBestPossibleAssignment(...), which takes the best possible that only has resource_1 and fills in the blanks for resource_0 by combining it with the current state, creating a new map with assignments for resource_0 and resource_1 where all of resource_0's assignments are OFFLINE.
We then call persistBestPossibleAssignment, which writes the mapping to ZooKeeper and stores it in memory.

We then do another partial rebalance, which computes a mapping with only resource_1, caches that in memory, and triggers another onDemand rebalance. I believe this will occur endlessly until a node goes down, which leads to calling calculateAssignment(...) instead of getBestPossibleAssignment(...) and persisting an assignment that only includes resource_1 to ZK (rather than only in memory, as before).
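
To summarize the cycle I believe we are seeing, here is a pseudocode sketch using the method names mentioned above in simplified form (not actual Helix code):

```java
// Pseudocode summary of the loop described above.
while (controllerIsRunning) {
  // emergencyRebalance: backfill resource_0 from current state (all OFFLINE)
  // and persist the combined map, writing another BEST_POSSIBLE znode.
  bestPossible = getBestPossibleAssignment(inMemoryOrZkAssignment, currentStates);
  persistBestPossibleAssignment(bestPossible);

  // partialRebalance: resource_0 has no assignable replicas, so the computed
  // mapping only contains resource_1; it is cached in memory only, and an
  // on-demand rebalance is scheduled, which starts the next iteration.
  partialResult = calculateAssignment(clusterModelWithoutResource0);
  asyncUpdateBestPossibleAssignmentCache(partialResult);
  triggerOnDemandRebalance();
}
```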
