[Core] Fix A10 GPU on Azure #3707
Conversation
To discuss: should we put the template into a separate file, like https://github.com/skypilot-org/skypilot/blob/master/sky/skylet/providers/azure/azure-config-template.json?
return

# Configure driver extension for A10 GPUs
create_result = poller.result().as_dict()
We should merge the A10 parameters into the original parameters to avoid two create_or_update calls, which can cause significant overhead. Please help profile it.
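As a rough illustration of this suggestion, here is a minimal sketch (hypothetical helper name; the ARM resource names, API version, and extension version are assumptions, not this PR's exact code) that appends the driver extension to the main ARM template so a single begin_create_or_update covers both the VM and the extension:

```python
# Hypothetical sketch: add the A10 driver extension to the main ARM template
# so the VM and the extension are created in one deployment, instead of
# issuing a second create_or_update afterwards.
import copy


def merge_a10_driver_extension(base_template: dict) -> dict:
    """Return a copy of the ARM template with the GPU driver extension added."""
    template = copy.deepcopy(base_template)
    template['resources'].append({
        'type': 'Microsoft.Compute/virtualMachines/extensions',
        'apiVersion': '2022-03-01',  # illustrative API version
        'name': "[concat(parameters('vmName'), '/NvidiaGpuDriverLinux')]",
        'location': "[parameters('location')]",
        'dependsOn': [
            "[resourceId('Microsoft.Compute/virtualMachines', parameters('vmName'))]"
        ],
        'properties': {
            'publisher': 'Microsoft.HpcCompute',
            'type': 'NvidiaGpuDriverLinux',
            'typeHandlerVersion': '1.9',  # illustrative version
            'autoUpgradeMinorVersion': True,
        },
    })
    return template
```

With the merged template, only one deployment call is needed (e.g. `poller = resource_client.deployments.begin_create_or_update(...)` followed by `create_result = poller.result().as_dict()`, as in the diff context above).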
Good point! I benchmarked it, and merging reduces the provision time from 15m15s to 14m36s.
Since we now have more quota, I also tested the YAML in the issue and it works end-to-end as well 🫡
Thanks for the update @cblmemo!
accs = azure_catalog.get_accelerators_from_instance_type(instance_type)
if accs is not None and "A10" in accs:
Is there a way to avoid using azure_catalog in the node_provider? That seems like an abstraction leak, but I am OK with keeping it as is for now.
Actually, how about passing a need_nvidia_driver_extension: true variable to azure-ray.yaml from clouds/azure.py::make_deployable_variables? There we can check the accelerators and get rid of the use of the catalog here.
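A minimal sketch of that idea, with hypothetical function and variable names rather than SkyPilot's actual signatures: compute the flag from the requested accelerators in clouds/azure.py and hand it to the template, so the node provider no longer needs the catalog.

```python
# Hypothetical sketch: decide in clouds/azure.py whether the GPU driver
# extension is needed, and pass the flag to the azure-ray.yaml template.
from typing import Any, Dict, Optional


def make_deploy_variables(instance_type: str,
                          accelerators: Optional[Dict[str, int]]) -> Dict[str, Any]:
    # A10 GPUs need the GRID driver, installed via the
    # NvidiaGpuDriverLinux extension.
    need_nvidia_driver_extension = (accelerators is not None
                                    and 'A10' in accelerators)
    return {
        'instance_type': instance_type,
        'need_nvidia_driver_extension': need_nvidia_driver_extension,
    }
```

The node provider would then read need_nvidia_driver_extension from its provider config instead of calling azure_catalog.get_accelerators_from_instance_type.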
Done! PTAL 🫡
Thanks for the update @cblmemo! Please check the comment above. We can get rid of the use of catalog in the node_provider
with that. Other parts look good to me.
Tried it out and it seems to work well. Thanks @cblmemo!
The following takes 13 minutes on my end:
sky launch --cloud azure --gpus A10 --down nvidia-smi
Commits:
* init
* works. todo: only do this for A10 VMs
* only install for A10 instances
* merge into one template
* Update sky/skylet/providers/azure/node_provider.py (Co-authored-by: Zhanghao Wu <[email protected]>)
* add warning
* apply suggestions from code review
* Update sky/clouds/azure.py (Co-authored-by: Zhanghao Wu <[email protected]>)

Co-authored-by: Zhanghao Wu <[email protected]>
Fixes #3651. A10 GPUs require a special type of driver, the GRID driver, which is recommended to be installed by enabling the NvidiaGpuDriverLinux extension. This PR adds this extension for A10 instances.
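For illustration, a hedged sketch (using the azure-mgmt-compute Python SDK; the subscription, resource group, VM name, region, and extension version are placeholders/assumptions) of what enabling the NvidiaGpuDriverLinux extension on a Linux VM looks like. Per the discussion above, the PR itself installs the extension as part of the ARM deployment template rather than via a separate SDK call.

```python
# Hypothetical example of enabling the NVIDIA GPU driver extension on an
# existing Azure Linux VM; not this PR's code path, which adds the extension
# through the ARM deployment template instead.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.compute.models import VirtualMachineExtension

SUBSCRIPTION_ID = '<subscription-id>'  # placeholder
RESOURCE_GROUP = '<resource-group>'    # placeholder
VM_NAME = '<vm-name>'                  # placeholder

compute_client = ComputeManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)
poller = compute_client.virtual_machine_extensions.begin_create_or_update(
    RESOURCE_GROUP,
    VM_NAME,
    'NvidiaGpuDriverLinux',
    VirtualMachineExtension(
        location='<region>',  # placeholder
        publisher='Microsoft.HpcCompute',
        type_properties_type='NvidiaGpuDriverLinux',
        type_handler_version='1.9',  # illustrative version
        auto_upgrade_minor_version=True,
    ),
)
poller.result()  # wait for the extension (and the GRID driver) to install
```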
Tested (run the relevant ones):
* `bash format.sh`
* Verified `nvidia-smi` works well; ran `sky launch --cloud azure -c az-wo-a10` and checked that the driver is not installed
* `pytest tests/test_smoke.py`
* `pytest tests/test_smoke.py::test_fill_in_the_name`
* `conda deactivate; bash -i tests/backward_compatibility_tests.sh`