Update DisableNodeChecker to check replica set membership #3341


Open: wants to merge 1 commit into main

Conversation

tillrohrmann (Contributor):

This commit updates the DisableNodeChecker to check that a worker node is not contained in any partition replica set. This ensures that a node can only be removed once it has been removed from all partition replica sets.

@pcholakov this might be helpful for building the fargate controller.


Copilot AI left a comment:

Pull Request Overview

This PR updates the DisableNodeChecker to ensure that a worker node is not disabled while still being part of any partition replica set. Key changes include:

  • Adding an asynchronous connection method (open_connection) to handle node connectivity.
  • Updating the remove_nodes command to incorporate the partition table and metadata client when creating the DisableNodeChecker.
  • Expanding the DisableNodeChecker with an async safe_to_disable_worker method that checks for replica set membership using concurrent tasks.
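The membership check described above can be illustrated with a greatly simplified, synchronous sketch (the actual PR performs the checks as concurrent async tasks; the type names and the `InReplicaSet` error variant below are hypothetical stand-ins, not the real Restate types):

```rust
use std::collections::HashMap;

type NodeId = u32;
type PartitionId = u16;

#[derive(Debug, PartialEq)]
enum DisableWorkerError {
    /// Hypothetical variant: the node is still a member of this
    /// partition's replica set, so it must not be disabled yet.
    InReplicaSet(PartitionId),
}

/// A node may only be disabled if it is absent from the replica set
/// of every partition.
fn safe_to_disable_worker(
    node: NodeId,
    replica_sets: &HashMap<PartitionId, Vec<NodeId>>,
) -> Result<(), DisableWorkerError> {
    for (partition, members) in replica_sets {
        if members.contains(&node) {
            return Err(DisableWorkerError::InReplicaSet(*partition));
        }
    }
    Ok(())
}
```

In the PR itself the per-partition lookups run concurrently and their results are drained with `join_next`, but the safety condition is the same: any remaining membership blocks the disable.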

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tools/restatectl/src/connection.rs | Introduces a new async connection method to check node status |
| tools/restatectl/src/commands/node/remove_nodes.rs | Updates node removal to integrate partition state and metadata |
| tools/restatectl/src/commands/node/disable_node_checker.rs | Adds async worker disable checks to ensure no active replica set membership |

```rust
}

while let Some(result) = replica_membership.join_next().await {
    result.expect("check replica membership not to panic")?;
```
Copilot AI commented on Jun 2, 2025:

Using `expect` here could lead to a panic if an async task fails unexpectedly. Consider propagating the error to provide more graceful error handling.

Suggested change

```diff
-result.expect("check replica membership not to panic")?;
+match result {
+    Ok(Ok(())) => continue,
+    Ok(Err(err)) => return Err(err),
+    Err(join_error) => return Err(DisableWorkerError::JoinError(join_error)),
+}
```


pcholakov (Contributor) left a comment:

This works great, thanks @tillrohrmann!

```diff
 nodes_configuration: &'a NodesConfiguration,
-logs: &'b Logs,
+logs: &'a Logs,
```
Contributor commented:
If everything has the same lifetime, do you still need the named lifetime `'a`?
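As an aside, the point behind the question can be shown with a minimal sketch, using `&str` fields as stand-ins for the actual `NodesConfiguration` and `Logs` references: two named lifetimes are only needed when the two borrows may live independently long; when they always borrow for the same duration, one lifetime parameter suffices.

```rust
// Two lifetime parameters let the borrows have independent durations:
struct CheckerTwo<'a, 'b> {
    nodes_configuration: &'a str, // stand-in for &'a NodesConfiguration
    logs: &'b str,                // stand-in for &'b Logs
}

// If both borrows always live equally long, a single lifetime suffices:
struct CheckerOne<'a> {
    nodes_configuration: &'a str,
    logs: &'a str,
}

// With one lifetime, a method can freely return either borrow:
fn shortest_field<'a>(c: &CheckerOne<'a>) -> &'a str {
    if c.nodes_configuration.len() < c.logs.len() {
        c.nodes_configuration
    } else {
        c.logs
    }
}
```

The single-lifetime form is slightly more restrictive (both references must outlive the struct together), which is usually exactly what a short-lived checker wants.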

Comment on lines +118 to +120

```rust
WorkerState::Active => {
    return Err(DisableWorkerError::Active);
}
```

Contributor commented:
Technically, the safety depends on the replication property of the partitions located on this worker node, so this check could be stricter than needed. That said, it is best to get a feel for how it works in practice before we relax it. So, +1 to keeping it as is for the first iteration.

tillrohrmann (Author):

👍


```rust
async move {
    // todo replace with multi-get when available
    let epoch_metadata = metadata_client
```
Contributor commented:
I'm a little anxious about doing this on restatectl's side. I understand that epoch metadata is the source of truth, but would using the view from PartitionReplicaSetStates be a possible alternative? We can expose those replica sets via datafusion (I'm already on it).

tillrohrmann (Author):

It's #3356, right? I'll look into how to use the new table.

Contributor commented:

It's a bit more work, but wouldn't it be even better to put the remove-node operation in the ClusterCtrlSvc? Then we could use it from places like the BYOC deployment controller. The source of truth behind it can still be the DF, but we also have access to locally cached metadata on the server.

Labels: None yet
Projects: None yet
3 participants