Update DisableNodeChecker to check replica set membership #3341
base: main
Conversation
This commit updates the DisableNodeChecker to check that a worker node is not contained in any partition replica set. This ensures that a node can only be removed after it has been removed from all partition replica sets.
Pull Request Overview
This PR updates the DisableNodeChecker to ensure that a worker node is not disabled while still being part of any partition replica set. Key changes include:
- Adding an asynchronous connection method (open_connection) to handle node connectivity.
- Updating the remove_nodes command to incorporate the partition table and metadata client when creating the DisableNodeChecker.
- Expanding the DisableNodeChecker with an async safe_to_disable_worker method that checks for replica set membership using concurrent tasks.
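For context, a minimal sketch of what such a concurrent membership check can look like. This is not the PR's implementation: aside from `safe_to_disable_worker` and `DisableWorkerError`, all names here (including the `fetch_replica_set` helper standing in for the real epoch-metadata lookup) are illustrative assumptions.

```rust
use tokio::task::JoinSet;

#[derive(Debug)]
enum DisableWorkerError {
    // Hypothetical variant: the node is still part of this partition's replica set.
    StillInReplicaSet(u32),
}

// Sketch: spawn one task per partition and fail if any replica set still
// contains the node we want to disable.
async fn safe_to_disable_worker(
    node_id: u32,
    partition_ids: Vec<u32>,
) -> Result<(), DisableWorkerError> {
    let mut replica_membership = JoinSet::new();

    for partition_id in partition_ids {
        replica_membership.spawn(async move {
            let replica_set = fetch_replica_set(partition_id).await;
            if replica_set.contains(&node_id) {
                Err(DisableWorkerError::StillInReplicaSet(partition_id))
            } else {
                Ok(())
            }
        });
    }

    // Drain the tasks; the `?` surfaces the first membership violation.
    while let Some(result) = replica_membership.join_next().await {
        result.expect("check replica membership not to panic")?;
    }

    Ok(())
}

// Stand-in for reading the partition's epoch metadata from the metadata store.
async fn fetch_replica_set(_partition_id: u32) -> Vec<u32> {
    Vec::new()
}
```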
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
File | Description
---|---
tools/restatectl/src/connection.rs | Introduces a new async connection method to check node status
tools/restatectl/src/commands/node/remove_nodes.rs | Updates node removal to integrate partition state and metadata
tools/restatectl/src/commands/node/disable_node_checker.rs | Adds async worker disable checks to ensure no active replica set membership
```rust
while let Some(result) = replica_membership.join_next().await {
    result.expect("check replica membership not to panic")?;
}
```
Using 'expect' here could lead to a panic if an async task fails unexpectedly. Consider propagating the error to provide more graceful error handling.
Suggested change:

```diff
- result.expect("check replica membership not to panic")?;
+ match result {
+     Ok(Ok(())) => continue,
+     Ok(Err(err)) => return Err(err),
+     Err(join_error) => return Err(DisableWorkerError::JoinError(join_error)),
+ }
```
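This suggestion presupposes a `JoinError` variant on `DisableWorkerError`. A minimal sketch of what such a variant could look like, assuming the crate uses `thiserror` (the error message is illustrative):

```rust
use tokio::task::JoinError;

#[derive(Debug, thiserror::Error)]
enum DisableWorkerError {
    // Hypothetical variant wrapping task join failures so they can be
    // propagated instead of panicking.
    #[error("replica membership check task failed: {0}")]
    JoinError(#[from] JoinError),
}
```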
This works great, thanks @tillrohrmann!
```diff
  nodes_configuration: &'a NodesConfiguration,
- logs: &'b Logs,
+ logs: &'a Logs,
```
If everything has the same lifetime, do you still need the named lifetime `'a`?
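For illustration (with stub types standing in for the real ones): lifetime elision does not apply to struct definitions, so one named lifetime is still required even after `'b` is unified into `'a`.

```rust
// Stub types standing in for the real restate types.
struct NodesConfiguration;
struct Logs;

// A single named lifetime now covers both borrowed fields; elision does not
// apply to struct definitions, so `'a` itself cannot be dropped.
struct DisableNodeChecker<'a> {
    nodes_configuration: &'a NodesConfiguration,
    logs: &'a Logs,
}
```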
```rust
WorkerState::Active => {
    return Err(DisableWorkerError::Active);
}
```
Technically, the safety depends on the replication property of the partitions located on this worker node, so this check could be stricter than needed. That said, it's best to get a feel for how it works in practice before we relax it. So, +1 to keeping it as is for the first iteration.
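A hedged sketch of the relaxation hinted at here (all names and the `min_replicas` notion are illustrative assumptions, not Restate types): disabling could be allowed whenever each affected replica set would still satisfy its replication property without this node.

```rust
// Returns true if the replica set would still meet its replication
// requirement after removing `node_id` from it.
fn satisfies_replication_without(replica_set: &[u32], node_id: u32, min_replicas: usize) -> bool {
    replica_set.iter().filter(|&&n| n != node_id).count() >= min_replicas
}
```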
👍
```rust
async move {
    // todo replace with multi-get when available
    let epoch_metadata = metadata_client
```
I'm a little anxious about doing this on restatectl's side. I understand that epoch-metadata is the source of truth, but would using the view from `PartitionReplicaSetStates` be a possible alternative? We can expose those replica sets via datafusion (I'm already on it).
It's #3356, right? I'll look into how to use the new table.
It's a bit more work, but wouldn't it be even better to put the remove-node operation in the `ClusterCtrlSvc`? Then we could use it from places like the BYOC deployment controller. The source of truth behind that can still be the DF, but we also have access to locally-cached metadata on the server.
@pcholakov this might be helpful for building the fargate controller.