-
Notifications
You must be signed in to change notification settings - Fork 547
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Azure] Fix to sync NSG status while opening ports #3844
[Azure] Fix to sync NSG status while opening ports #3844
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @landscapepainter! Left two comments. : )
sky/provision/azure/instance.py
Outdated
@@ -714,62 +717,83 @@ def open_ports( | |||
subscription_id = provider_config['subscription_id'] | |||
resource_group = provider_config['resource_group'] | |||
network_client = azure.get_client('network', subscription_id) | |||
nsg_name = _NSG_NAME.format(cluster_name_on_cloud=cluster_name_on_cloud) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we avoid assuming the nsg name for a cluster to be more robust to future changes, such #3845? We can just check all the NSG in the same resource group as we are using one resource group per cluster.
By doing that, let's keep the while
and for
loop in the same order as the original.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Michaelvll Thanks! I have two questions to address:
-
If I understand correctly, it doesn't seem like we can keep the original order of
while
andfor
loop. It seems necessary to calllist_network_security_groups(resource_group)
within the while loop (while not nsg_created:
) so that the NSG status is updated on each iteration. This ensures that the checkif nsg.provisioning_state not in ['Creating', 'Updating']:
accurately reflects the latest state within the for loop (for nsg in nsg_list:
). However, I might be overlooking something. Please let me know if that’s the case. -
I can remove the check for specific nsg name from this PR, but this was also intended to be used for [Azure] Allow resource group specifiation for Azure instance provisioning #3764 as well. As you mentioned, we currently use one resource group per cluster, but this changes when resource group specification is used with
sky serve
. Serve controller and all the replicas would be created under the user specified resource group, and in this case, there would be multiple NSGs under the resource group. This was causing issues as opening ports while launching replica 4 was checking on NSG creation for replica 5 and got stuck.If losing track of the NSG name for the cluster throughout the
launch
cycle is your concern, we can keep on track of the NSG name while passing the name as a parameter to our ARM template instead of directly creating it from the ARM("nsgName": "[concat('sky-', parameters('clusterId'), '-nsg')]"
) and perhaps have that stored inprovider_config
as well, so we can read from it instead of using_NSG_NAME.format()
.2-1. Is it possible that we may need multiple NSGs for a single cluster in near future? We don't seem to do this now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For 1, can we check if there is an Azure API that get a single nsg
instead of listing all of them, so that we can keep getting that nsg within the while loop? The current loop order is unintuitive, looping through the nsg list within a retry loop. It would be more readable to do the retry loop for each NSG (although we only select a single NSG).
For 2, it makes sense when a resource group can be specified. In that case, how about we just create a function in azure/config.py
that generates the NSG name, so it can be reused in both config.py
and here? We have to be careful with backward compatibility then, if we do the exact match of the NSG. How about we allow a list of NSG names to support legacy names, such as ray-{clusterId}-...
and the one that will be added with #3848
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Michaelvll Following your review, I found an API method .get
to directly retrieve the nsg object. But there's an issue with using this with multiple possible NSG names(legacy names). Attempting to get an nsg instance with the following snippet takes about ~15s to return a ResourceNotFoundError
when given a non-existent nsg name:
nsg = network_client.network_security_groups.get(resource_group_name, nsg_name)
And we have to provide non-existent nsg name at most twice with two legacy nsg names, which adds ~30s, to provision VM. So, I instead kept the listing method to iterate through the NSG list to find a matching name among our legacy and current nsg names. This is done in a separate method, _get_cluster_nsg
, so it also resolves your concern on readability.
Also, following your comment, I created a method to create NSG name with our current naming convention and used it within config.py::bootstrap_instances
and instance.py::_get_cluster_nsg
, which is called by instance.py::open_ports
.
These are done at befcbc1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @landscapepainter! Mostly looks good to me with some minor fixes for the function structure
Co-authored-by: Zhanghao Wu <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>
Co-authored-by: Zhanghao Wu <[email protected]>
Thanks @Michaelvll! This is ready for another look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested performance and basic sky operations. LGTM!
This resolves #3843.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py --azure
besides the ones failing frommaster
branch.pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh