
[cudo] Provisioning failing with Invalid value for count_vm_available #3829

Open

romilbhardwaj opened this issue Aug 13, 2024 · 4 comments

@romilbhardwaj (Collaborator)
`sky launch` is failing with an error from the Cudo API. Also reported by users on Slack.

(base) ➜  ~ sky launch -c vm --cloud cudo --gpus A6000
I 08-13 07:49:33 optimizer.py:691] == Optimizer ==
I 08-13 07:49:33 optimizer.py:714] Estimated cost: $0.8 / hour
I 08-13 07:49:33 optimizer.py:714]
I 08-13 07:49:33 optimizer.py:839] Considered resources (1 node):
I 08-13 07:49:33 optimizer.py:909] ----------------------------------------------------------------------------------------------------------
I 08-13 07:49:33 optimizer.py:909]  CLOUD   INSTANCE                      vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN
I 08-13 07:49:33 optimizer.py:909] ----------------------------------------------------------------------------------------------------------
I 08-13 07:49:33 optimizer.py:909]  Cudo    epyc-rome-rtx-a6000_4x1v2gb   2       4         RTXA6000:1     no-luster-1   0.80          ✔
I 08-13 07:49:33 optimizer.py:909] ----------------------------------------------------------------------------------------------------------
I 08-13 07:49:33 optimizer.py:909]
I 08-13 07:49:33 optimizer.py:927] Multiple Cudo instances satisfy RTXA6000:1. The cheapest Cudo(epyc-rome-rtx-a6000_4x1v2gb, {'RTXA6000': 1}) is considered among:
I 08-13 07:49:33 optimizer.py:927] ['epyc-rome-rtx-a6000_4x1v2gb', 'epyc-rome-rtx-a6000_8x1v4gb', 'epyc-rome-rtx-a6000_24x1v12gb', 'epyc-rome-rtx-a6000_48x1v24gb'].
I 08-13 07:49:33 optimizer.py:927]
I 08-13 07:49:33 optimizer.py:933] To list more details, run 'sky show-gpus RTXA6000'.
Launching a new cluster 'vm'. Proceed? [Y/n]:
I 08-13 07:49:34 cloud_vm_ray_backend.py:4354] Creating a new cluster: 'vm' [1x Cudo(epyc-rome-rtx-a6000_4x1v2gb, {'RTXA6000': 1})].
I 08-13 07:49:34 cloud_vm_ray_backend.py:4354] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 08-13 07:49:34 cloud_vm_ray_backend.py:1314] To view detailed progress: tail -n100 -f /Users/romilb/sky_logs/sky-2024-08-13-07-49-33-642961/provision.log
I 08-13 07:49:34 provisioner.py:65] Launching on Cudo no-luster-1 (all zones)
W 08-13 07:49:36 instance.py:89] run_instances: Invalid value for `count_vm_available`, must not be `None`

Full stack trace:

D 08-13 07:52:38 provisioner.py:171] Failed to provision 'vm' on Cudo (all zones).
D 08-13 07:52:38 provisioner.py:173] bulk_provision for 'vm' failed. Stacktrace:
D 08-13 07:52:38 provisioner.py:173] Traceback (most recent call last):
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/provisioner.py", line 165, in bulk_provision
D 08-13 07:52:38 provisioner.py:173]     return _bulk_provision(cloud, region, zones, cluster_name,
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/provisioner.py", line 87, in _bulk_provision
D 08-13 07:52:38 provisioner.py:173]     provision_record = provision.run_instances(provider_name,
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/__init__.py", line 47, in _wrapper
D 08-13 07:52:38 provisioner.py:173]     return impl(*args, **kwargs)
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/cudo/instance.py", line 86, in run_instances
D 08-13 07:52:38 provisioner.py:173]     cudo_wrapper.vm_available(to_start_count, gpu_count, gpu_model, region,
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/provision/cudo/cudo_wrapper.py", line 133, in vm_available
D 08-13 07:52:38 provisioner.py:173]     types = api.list_vm_machine_types(mem,
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/api/virtual_machines_api.py", line 1423, in list_vm_machine_types
D 08-13 07:52:38 provisioner.py:173]     (data) = self.list_vm_machine_types_with_http_info(memory_gib, vcpu, **kwargs)  # noqa: E501
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/api/virtual_machines_api.py", line 1523, in list_vm_machine_types_with_http_info
D 08-13 07:52:38 provisioner.py:173]     return self.api_client.call_api(
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/api_client.py", line 326, in call_api
D 08-13 07:52:38 provisioner.py:173]     return self.__call_api(resource_path, method,
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/api_client.py", line 170, in __call_api
D 08-13 07:52:38 provisioner.py:173]     return_data = self.deserialize(response_data, response_type)
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/api_client.py", line 242, in deserialize
D 08-13 07:52:38 provisioner.py:173]     return self.__deserialize(data, response_type)
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/api_client.py", line 281, in __deserialize
D 08-13 07:52:38 provisioner.py:173]     return self.__deserialize_model(data, klass)
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/api_client.py", line 627, in __deserialize_model
D 08-13 07:52:38 provisioner.py:173]     instance = klass(**kwargs)
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/models/list_vm_machine_types_response.py", line 76, in __init__
D 08-13 07:52:38 provisioner.py:173]     self.count_vm_available = count_vm_available
D 08-13 07:52:38 provisioner.py:173]   File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/cudo_compute/models/list_vm_machine_types_response.py", line 105, in count_vm_available
D 08-13 07:52:38 provisioner.py:173]     raise ValueError("Invalid value for `count_vm_available`, must not be `None`")  # noqa: E501
D 08-13 07:52:38 provisioner.py:173] ValueError: Invalid value for `count_vm_available`, must not be `None`
D 08-13 07:52:38 provisioner.py:173] 
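The failure is client-side deserialization rather than a provisioning error per se: the generated `cudo_compute` response model validates fields in its property setters, so a response missing `countVmAvailable` raises before `run_instances` can inspect availability. A minimal, self-contained sketch of that validation pattern (the class below is a hypothetical stand-in, not the real `cudo_compute` code):

```python
class ListVmMachineTypesResponse:
    """Hypothetical stand-in for cudo_compute's generated response model."""

    def __init__(self, count_vm_available=None):
        self._count_vm_available = None
        # Generated OpenAPI models assign through the validating property,
        # so a null/absent field in the JSON payload raises immediately.
        self.count_vm_available = count_vm_available

    @property
    def count_vm_available(self):
        return self._count_vm_available

    @count_vm_available.setter
    def count_vm_available(self, value):
        if value is None:
            raise ValueError(
                "Invalid value for `count_vm_available`, must not be `None`")
        self._count_vm_available = value


# A payload that omits the field reproduces the error in the trace above:
try:
    ListVmMachineTypesResponse(count_vm_available=None)
except ValueError as e:
    print(e)  # Invalid value for `count_vm_available`, must not be `None`
```

This is why the error surfaces inside `api_client.py`'s `__deserialize_model` rather than as a clear API-level failure.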
@romilbhardwaj (Collaborator, Author)

cc @JungleCatSW were there any recent changes to the cudo API which may be causing this?

@sarath7974

I’m also experiencing this issue when launching VMs on Cudo Cloud. Is there any progress on resolving this?

@Michaelvll (Collaborator)

@sarath7974 Would you be able to test with this branch: #3841?

If it works for you, we will merge it soon :)

@sarath7974

@Michaelvll Tested PR #3841, but encountered a different error.

ValueError: Invalid value for `network_id`, must not be `None`

During handling of the above exception, another exception occurred:

TypeError: NetworksApi.delete_network() missing 1 required positional argument: 'network_id'

The above exception was the direct cause of the following exception:

sky.provision.common.StopFailoverError: During provisioner's failover, terminating 'vm' failed. This can cause resource leakage. Please check the failure and the cluster status on the cloud, and manually terminate the cluster. Details: [TypeError] NetworksApi.delete_network() missing 1 required positional argument: 'network_id'
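For context on the chained traceback above: during failover SkyPilot catches the first error and attempts to tear down the cluster's network, but the cleanup call apparently omits the required `network_id` argument, so a `TypeError` is raised while handling the original `ValueError`. A rough, self-contained illustration of that chaining (class and function names here are hypothetical, not SkyPilot's or cudo_compute's actual code):

```python
class NetworksApi:
    """Hypothetical stand-in for cudo_compute's NetworksApi."""

    def delete_network(self, network_id):
        return f"deleted {network_id}"


def failover_cleanup():
    api = NetworksApi()
    try:
        # First failure, as in the report above.
        raise ValueError(
            "Invalid value for `network_id`, must not be `None`")
    except ValueError:
        # Bug being illustrated: the required `network_id` argument is
        # missing, so cleanup itself raises a TypeError while handling the
        # original ValueError ("During handling of the above exception,
        # another exception occurred").
        api.delete_network()


try:
    failover_cleanup()
except TypeError as e:
    print(e)  # ...missing 1 required positional argument: 'network_id'
```

Because the cleanup itself fails, the provisioner cannot guarantee the VM was terminated, which is what `StopFailoverError` reports as potential resource leakage.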
