Describe the bug
Under specific circumstances, WAGED migration will cause the controller to endlessly write a new znode under BEST_POSSIBLE.

To Reproduce
1. Disable a resource's only partition on all instances in the cluster (essentially making no valid assignment).
2. Switch the resource to WAGED.
3. Observe the behavior.

Some test code to reproduce the issue: GrantPSpencer#45
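For illustration, here is a rough sketch of those steps using the Java HelixAdmin API. The cluster, instance, and ZK addresses are placeholders, and resource_0 is assumed to have a single partition named resource_0_0:

```java
import java.util.Collections;
import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;

public class WagedMigrationRepro {
  public static void main(String[] args) {
    String cluster = "TestCluster";       // placeholder cluster name
    String resource = "resource_0";       // the resource whose only partition we disable
    String partition = "resource_0_0";    // assumed partition name
    HelixAdmin admin = new ZKHelixAdmin("localhost:2181");

    // 1. Disable the resource's partition on every instance, leaving no valid assignment.
    for (String instance : admin.getInstancesInCluster(cluster)) {
      admin.enablePartition(false, cluster, instance, resource,
          Collections.singletonList(partition));
    }

    // 2. Switch the resource from CRUSHED to WAGED by changing its rebalancer class.
    IdealState idealState = admin.getResourceIdealState(cluster, resource);
    idealState.setRebalancerClassName(
        "org.apache.helix.controller.rebalancer.waged.WagedRebalancer");
    admin.setResourceIdealState(cluster, resource, idealState);

    // 3. Observe the controller: new znodes keep appearing under BEST_POSSIBLE.
  }
}
```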
Expected behavior
The rebalance should fail, and new best possible znodes should not be endlessly created.
The fix is not straightforward. See below for my thoughts on what triggers this. I think the true solution might be to have partialRebalance correctly fail to generate a mapping, rather than tweaking our persist-assignment logic.
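As a purely hypothetical illustration of that direction (the class, method, and exception choices below are mine, not the actual Helix code), the partial rebalance could compare the resources it was asked to assign against the resources present in the computed assignment, and fail instead of caching a partial result:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical guard, not actual Helix code: fail the partial rebalance when the computed
// assignment silently drops resources, instead of caching and persisting the partial result.
final class PartialRebalanceGuard {
  static void validateCoverage(Set<String> requestedResources, Map<String, ?> computedAssignment) {
    Set<String> missing = new HashSet<>(requestedResources);
    missing.removeAll(computedAssignment.keySet());
    if (!missing.isEmpty()) {
      // In Helix proper this would more likely surface as a HelixRebalanceException.
      throw new IllegalStateException(
          "Partial rebalance failed to produce an assignment for: " + missing);
    }
  }
}
```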
Additional context
Say we have resource_0 and resource_1, both currently using CRUSHED, and we have disabled resource_0's only partition on all instances in the cluster. Under CRUSHED, the current state for resource_0 is therefore OFFLINE on every instance. We then switch to WAGED:
The WAGED rebalancer's emergencyRebalance calls _assignmentManager.getBestPossibleAssignment(...), which looks at the in-memory (or ZK-stored) assignment and fills in any missing resources with their current states (resource_0's current states are still all OFFLINE).
A best possible assignment built from resource_0's and resource_1's current states is stored in ZK and in memory.
We then trigger a partial rebalance. When partialRebalance calls WagedRebalanceUtil.calculateAssignment, it passes in a clusterModel that has removed resource_0 from the replica map, so toBeAssignedReplicas does not include resource_0. It then produces a result it considers valid that includes only resource_1. We then call _assignmentMetadataStore.asyncUpdateBestPossibleAssignmentCache, which stores the result in memory and triggers an onDemand rebalance.
We start the onDemand rebalance, and our in-memory best possible only has an assignment for resource_1. During the emergency rebalance, emergencyRebalance once again calls _assignmentManager.getBestPossibleAssignment(...). It takes the best possible that only has resource_1 and fills in the blanks for resource_0 by combining it with the current state, so it creates a new map with assignments for resource_0 and resource_1 where all resource_0 assignments are OFFLINE.
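To make that fill-in step concrete, here is a simplified, self-contained model of the merge (not the Helix implementation; the nested map shape and names are illustrative). Anything missing from the stored best possible is backfilled from current state, so resource_0 always comes back as all-OFFLINE:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the "fill in missing resources from current state" step.
// Shape: resource -> (partition -> (instance -> state)).
public class FillInFromCurrentState {
  static Map<String, Map<String, Map<String, String>>> fill(
      Map<String, Map<String, Map<String, String>>> bestPossible,
      Map<String, Map<String, Map<String, String>>> currentState) {
    Map<String, Map<String, Map<String, String>>> merged = new HashMap<>(bestPossible);
    currentState.forEach(merged::putIfAbsent);  // backfill whatever the best possible lacks
    return merged;
  }

  public static void main(String[] args) {
    Map<String, Map<String, Map<String, String>>> bestPossible = new HashMap<>();
    bestPossible.put("resource_1", Map.of("resource_1_0", Map.of("instance_0", "MASTER")));

    Map<String, Map<String, Map<String, String>>> currentState = new HashMap<>();
    currentState.put("resource_0",
        Map.of("resource_0_0", Map.of("instance_0", "OFFLINE", "instance_1", "OFFLINE")));
    currentState.put("resource_1", Map.of("resource_1_0", Map.of("instance_0", "MASTER")));

    // resource_0 re-enters the merged map as all-OFFLINE even though the partial rebalance
    // result only contained resource_1.
    System.out.println(fill(bestPossible, currentState));
  }
}
```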
We then call persistBestPossibleAssignment, which writes the mapping to zookeeper and stores it in memory.
We then do another partial rebalance, which computes a mapping with only resource_1, persists that into memory, and triggers an onDemand rebalance. I believe this will occur endlessly until a node goes down, which leads to calling calculateAssignment(...) instead of getBestPossibleAssignment(...) and persisting an assignment that only includes resource_1 to ZK (rather than just in memory, as before).
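One rough way to observe the symptom from the outside, assuming the assignment metadata store writes under a path like /<cluster>/ASSIGNMENT_METADATA/BEST_POSSIBLE (the exact layout may vary by Helix version), is to sample that znode's children twice and watch the count keep growing:

```java
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

// Rough external check for the endless-write symptom. The path below is an assumption
// about where the WAGED assignment metadata store keeps its best possible buckets.
public class BestPossibleWriteCheck {
  public static void main(String[] args) throws Exception {
    String path = "/TestCluster/ASSIGNMENT_METADATA/BEST_POSSIBLE";  // assumed layout
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });
    try {
      List<String> before = zk.getChildren(path, false);
      Thread.sleep(10_000);  // let the controller run a few rebalance cycles
      List<String> after = zk.getChildren(path, false);
      System.out.printf("children before=%d, after=%d%n", before.size(), after.size());
      // A converged cluster stays flat; with this bug the count keeps climbing.
    } finally {
      zk.close();
    }
  }
}
```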