
Controller endlessly creates BEST_POSSIBLE znodes during WAGED migration #2949

Open
GrantPSpencer opened this issue Oct 15, 2024 · 0 comments
Labels: bug

Describe the bug

Under specific circumstances, WAGED migration will cause the controller to endlessly write new znodes under BEST_POSSIBLE.

To Reproduce

Some test code to reproduce the issue:
GrantPSpencer#45

  1. Have CRUSHED resources
  2. Disable a resource's partition on all instances in the cluster (essentially making no valid assignment possible)
  3. Switch the resource to WAGED
  4. Observe the behavior (a rough sketch of these steps follows below)
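
For reference, a minimal sketch of these steps using HelixAdmin. This assumes a running cluster; the ZK address, cluster name, resource name, and partition name are placeholders, and resource_0 is assumed to be an existing single-partition CRUSHED (FULL_AUTO) resource.

```java
import java.util.Collections;

import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;

public class WagedMigrationRepro {
  public static void main(String[] args) {
    String zkAddr = "localhost:2181";   // placeholder ZK address
    String cluster = "TEST_CLUSTER";    // placeholder cluster with CRUSHED resources
    String resource = "resource_0";     // placeholder single-partition CRUSHED resource
    String partition = "resource_0_0";  // placeholder: the resource's only partition

    ZKHelixAdmin admin = new ZKHelixAdmin(zkAddr);

    // Step 2: disable the partition on every instance so no valid assignment exists
    for (String instance : admin.getInstancesInCluster(cluster)) {
      admin.enablePartition(false, cluster, instance, resource,
          Collections.singletonList(partition));
    }

    // Step 3: switch the resource to WAGED by pointing its IdealState at the
    // WAGED rebalancer class
    IdealState idealState = admin.getResourceIdealState(cluster, resource);
    idealState.setRebalancerClassName(
        "org.apache.helix.controller.rebalancer.waged.WagedRebalancer");
    admin.setResourceIdealState(cluster, resource, idealState);

    // Step 4: watch the BEST_POSSIBLE path in the assignment metadata store;
    // new znodes keep getting written
    admin.close();
  }
}
```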

Expected behavior

The rebalance should fail, and new BEST_POSSIBLE znodes should not be endlessly created.

The fix is not straightforward. See below for my thoughts on what triggers this. I think the true solution might be to have partialRebalance correctly fail to generate a mapping, rather than tweaking our persist-assignment logic.

Additional context

Say we have resource_0 and resource_1, both currently using CRUSHED. We have disabled resource_0's only partition on all instances in the cluster, so while using CRUSHED, resource_0's current state is OFFLINE on every instance. Then we switch the resource to WAGED:

The WAGED rebalancer's emergencyRebalance calls _assignmentManager.getBestPossibleAssignment(...), which looks at the in-memory (or ZK-stored) assignment and fills in any missing resources with their current states (resource_0's current states are still all OFFLINE). A best possible assignment containing resource_0's and resource_1's current states is then stored in ZK and in memory.
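
For illustration, here is a simplified sketch of that backfill behavior. This is not the actual Helix implementation; currentStateAsAssignment is a hypothetical helper standing in for the current-state conversion.

```java
// Simplified illustration only, not the real Helix code.
Map<String, ResourceAssignment> getBestPossibleAssignment(
    Map<String, ResourceAssignment> storedBestPossible,  // from memory or ZK
    Collection<String> resourcesToRebalance) {
  Map<String, ResourceAssignment> bestPossible = new HashMap<>(storedBestPossible);
  for (String resource : resourcesToRebalance) {
    // resource_0 is missing from the stored map, so it is backfilled from its
    // current states, which are all OFFLINE because every replica is disabled.
    bestPossible.putIfAbsent(resource, currentStateAsAssignment(resource));
  }
  return bestPossible;
}
```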

We then trigger a partial rebalance. When partialRebalance calls WagedRebalanceUtil.calculateAssignment, it passes in a clusterModel that has resource_0 removed from the replica map, so toBeAssignedReplicas does not include resource_0. It then produces a result it considers valid that only includes resource_1. We then call _assignmentMetadataStore.asyncUpdateBestPossibleAssignmentCache, which stores the result in memory and triggers an onDemand rebalance.

We start the onDemand rebalance, and our in-memory best possible only has an assignment for resource_1. emergencyRebalance once again calls _assignmentManager.getBestPossibleAssignment(...), which takes the best possible that only has resource_1 and fills in the blanks for resource_0 by combining it with the current state, creating a new map with assignments for resource_0 and resource_1 where all of resource_0's assignments are OFFLINE.
We then call persistBestPossibleAssignment, which writes the mapping to ZooKeeper and stores it in memory.

We then do another partial rebalance, which computes a mapping with only resource_1, caches that in memory, and triggers another onDemand rebalance. I believe this will occur endlessly until a node goes down, which leads to calling calculateAssignment(...) instead of getBestPossibleAssignment(...) and persisting an assignment that only includes resource_1 to ZK (rather than only in memory, as before).
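
To summarize the cycle I believe we are seeing, here is a pseudocode sketch using the method names mentioned above in simplified form (not actual Helix code):

```java
// Pseudocode summary of the loop described above.
while (controllerIsRunning) {
  // emergencyRebalance: backfill resource_0 from current state (all OFFLINE)
  // and persist the combined map, writing another BEST_POSSIBLE znode.
  bestPossible = getBestPossibleAssignment(inMemoryOrZkAssignment, currentStates);
  persistBestPossibleAssignment(bestPossible);

  // partialRebalance: resource_0 has no assignable replicas, so the computed
  // mapping only contains resource_1; it is cached in memory only, and an
  // on-demand rebalance is scheduled, which starts the next iteration.
  partialResult = calculateAssignment(clusterModelWithoutResource0);
  asyncUpdateBestPossibleAssignmentCache(partialResult);
  triggerOnDemandRebalance();
}
```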
