-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[VirtualCluster] Clean up after removing node instances from a virtual cluster #441
Conversation
@@ -38,6 +38,13 @@ bool VirtualClusterManager::UpdateVirtualCluster( | |||
return false; | |||
} | |||
|
|||
// The local node is removed from a (indivisible) virtual cluster, we have to clean up |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now the virtual_cluster_data
is only removed from virtual_clusters_ when virtual_cluster_data.is_removed()
, but the raylet may not receive the removed publish event, so that the virtual_cluster_data(removed from gcs) will never be erased at raylet side.
std::string local_node_instance_id_; | ||
std::function<void()> local_node_cleanup_fn_; | ||
/// The (indivisible) virtual cluster to which the local node belongs. | ||
std::string virtual_cluster_id_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
local_virtual_cluster_id_
@@ -32,19 +32,40 @@ bool VirtualClusterManager::UpdateVirtualCluster( | |||
const auto &virtual_cluster_id = virtual_cluster_data.id(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (local_cluster_data.contains(local_node)) {
if(virtual_cluster_data.contains(local_node)) {
if (virtual_cluster_data.removed()) {
clean();
} else {
if (virtual_cluster_data.cluster_id() != local_cluster_data.cluster_id()) {
clean();
}
}
} else {
// normal shrink.
clean();
}
}
Signed-off-by: Chong Li <[email protected]>
virtual_clusters_.erase(input_virtual_cluster_id); | ||
} else { | ||
// Whether the local node in the input data. | ||
bool local_node_in_data = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool local_node_in_data = | |
bool input_virtual_cluster_contains_local_node = |
if (!virtual_cluster_id_set.contains(curr_iter->first)) { | ||
auto virtual_cluster_data = curr_iter->second; | ||
virtual_cluster_data.set_is_removed(true); | ||
updated_subscribe(curr_iter->first, std::move(virtual_cluster_data)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
invoke subscribe
directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Why are these changes needed?
When gcs is removing a node instance from a virtual cluster, it has to drain the node firstly. Otherwise, the pending tasks will be staying at the node.
On the other hand, nodes' resource view are pulled to gcs periodically. So when gcs is figuring out which nodes should be removed from the virtual cluster, it may use an outdated resource view. Therefore, the
drain node
operation has to take care of the running tasks at the node.Related issue number
#409
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.