Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[VirtualCluster] Clean up after removing node instances from a virtual cluster #441

Merged
merged 9 commits into from
Jan 9, 2025

Conversation

Chong-Li
Copy link
Collaborator

@Chong-Li Chong-Li commented Jan 2, 2025

Why are these changes needed?

When gcs is removing a node instance from a virtual cluster, it has to drain the node firstly. Otherwise, the pending tasks will be staying at the node.

On the other hand, nodes' resource view are pulled to gcs periodically. So when gcs is figuring out which nodes should be removed from the virtual cluster, it may use an outdated resource view. Therefore, the drain node operation has to take care of the running tasks at the node.

Related issue number

#409

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Chong-Li Chong-Li requested a review from wumuzi520 January 2, 2025 10:08
@Chong-Li Chong-Li changed the title [VirtualCluster] Drain before remove node instances [VirtualCluster] Drain before removing node instances from a virtual cluster Jan 2, 2025
@Chong-Li Chong-Li changed the title [VirtualCluster] Drain before removing node instances from a virtual cluster [VirtualCluster] Clean up after removing node instances from a virtual cluster Jan 6, 2025
@@ -38,6 +38,13 @@ bool VirtualClusterManager::UpdateVirtualCluster(
return false;
}

// The local node is removed from a (indivisible) virtual cluster, we have to clean up
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now the virtual_cluster_data is only removed from virtual_clusters_ when virtual_cluster_data.is_removed(), but the raylet may not receive the removed publish event, so that the virtual_cluster_data(removed from gcs) will never be erased at raylet side.

std::string local_node_instance_id_;
std::function<void()> local_node_cleanup_fn_;
/// The (indivisible) virtual cluster to which the local node belongs.
std::string virtual_cluster_id_;
Copy link
Collaborator

@wumuzi520 wumuzi520 Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

local_virtual_cluster_id_

@@ -32,19 +32,40 @@ bool VirtualClusterManager::UpdateVirtualCluster(
const auto &virtual_cluster_id = virtual_cluster_data.id();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (local_cluster_data.contains(local_node)) {
    if(virtual_cluster_data.contains(local_node)) {
         if (virtual_cluster_data.removed()) {
                clean();
            } else {
                if (virtual_cluster_data.cluster_id() != local_cluster_data.cluster_id()) {
                    clean();
                }
            }
    } else {
        // normal shrink.
        clean();
    }
}

Signed-off-by: Chong Li <[email protected]>
virtual_clusters_.erase(input_virtual_cluster_id);
} else {
// Whether the local node in the input data.
bool local_node_in_data =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bool local_node_in_data =
bool input_virtual_cluster_contains_local_node =

if (!virtual_cluster_id_set.contains(curr_iter->first)) {
auto virtual_cluster_data = curr_iter->second;
virtual_cluster_data.set_is_removed(true);
updated_subscribe(curr_iter->first, std::move(virtual_cluster_data));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

invoke subscribe directly.

Copy link
Collaborator

@wumuzi520 wumuzi520 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants