[FluidStack] Add NVLINK GPUs #3954
* Add NVLINK GPUs as distinct gpus
Thanks for adding this feature, @mjibril! I tried sky show-gpus --cloud fluidstack -a
and the NVLink GPU types are not shown in the table. Is that expected?
$ sky show-gpus --cloud fluidstack -a
COMMON_GPU AVAILABLE_QUANTITIES
A100 4
A100-80GB 1, 2, 4, 8
H100 1, 2, 4, 8
V100 1, 2
OTHER_GPU AVAILABLE_QUANTITIES
A40 1, 2
L40 1, 2, 4, 8
RTX4000 1, 2
RTX5000 1, 2
RTXA4000 1, 2, 4, 8, 10
RTXA5000 1, 2, 4, 8
RTXA6000 1, 2, 4, 8
GPU QTY CLOUD INSTANCE_TYPE DEVICE_MEM vCPUs HOST_MEM HOURLY_PRICE HOURLY_SPOT_PRICE
A100 4 Fluidstack recUiB2e6s3XDxwE9 40GB 60 440GB $ 5.880 -
GPU QTY CLOUD INSTANCE_TYPE DEVICE_MEM vCPUs HOST_MEM HOURLY_PRICE HOURLY_SPOT_PRICE
A100-80GB 1 Fluidstack recdpfEZCYapXrX5TbSFNMUUi 80GB 32 120GB $ 2.483 -
A100-80GB 2 Fluidstack recE2ZDQmqR9HBKYs5xSnjtPw 80GB 64 240GB $ 4.956 -
A100-80GB 4 Fluidstack recWGm4oJ9AB3XVPxzRaujgbx 80GB 126 480GB $ 9.895 -
A100-80GB 8 Fluidstack recUYj6oGJCvAvCXC7KQo5Fc7 80GB 252 960GB $ 19.786 -
GPU QTY CLOUD INSTANCE_TYPE DEVICE_MEM vCPUs HOST_MEM HOURLY_PRICE HOURLY_SPOT_PRICE
A40 1 Fluidstack custom:0:49CA322D5074468E99EF80A20EB47838 48GB 8 64GB $ 1.692 -
A40 2 Fluidstack custom:1:49CA322D5074468E99EF80A20EB47838 48GB 16 128GB $ 3.338 -
GPU QTY CLOUD INSTANCE_TYPE DEVICE_MEM vCPUs HOST_MEM HOURLY_PRICE HOURLY_SPOT_PRICE
H100 1 Fluidstack recgdWtkWDJq2qMHRJehMFG2C 80GB 32 180GB $ 5.290 -
H100 2 Fluidstack rec49QTUzoTUX7PtMkkFJxRn8 80GB 64 360GB $ 10.576 -
H100 4 Fluidstack recUBDR2d46CWuYeW5uWWuTpx 80GB 126 720GB $ 21.141 -
H100 8 Fluidstack recA75DjBbrFwoD8bT3GBZDCa 80GB 252 1440GB $ 42.284 -
GPU QTY CLOUD INSTANCE_TYPE DEVICE_MEM vCPUs HOST_MEM HOURLY_PRICE HOURLY_SPOT_PRICE
L40 1 Fluidstack recVcAEL8UwVgZWP5WNrQJN8r 48GB 32 60GB $ 1.761 -
L40 2 Fluidstack recSCaaKigbSg5MQPPVNoH9nG 48GB 64 120GB $ 3.508 -
L40 4 Fluidstack reciAySCoQSubyQ2atsxqFRxK 48GB 126 240GB $ 7.002 -
L40 8 Fluidstack recBNHGKPfgVSjm7hhThid8wu 48GB 252 480GB $ 14.001 -
GPU QTY CLOUD INSTANCE_TYPE DEVICE_MEM vCPUs HOST_MEM HOURLY_PRICE HOURLY_SPOT_PRICE
RTX4000 1 Fluidstack custom:0:6B224766C0EF48A9A7E5E342DD771D26 8GB 8 64GB $ 0.702 -
RTX4000 2 Fluidstack custom:1:6B224766C0EF48A9A7E5E342DD771D26 8GB 16 128GB $ 1.358 -
GPU QTY CLOUD INSTANCE_TYPE DEVICE_MEM vCPUs HOST_MEM HOURLY_PRICE HOURLY_SPOT_PRICE
RTX5000 1 Fluidstack custom:0:3243C00DDFFA449F872F81FBD068D8A7 16GB 8 64GB $ 1.012 -
RTX5000 2 Fluidstack custom:1:3243C00DDFFA449F872F81FBD068D8A7 16GB 16 128GB $ 1.978 -
GPU QTY CLOUD INSTANCE_TYPE DEVICE_MEM vCPUs HOST_MEM HOURLY_PRICE HOURLY_SPOT_PRICE
RTXA4000 1 Fluidstack rec3pUyh6pNkIjCaL 16GB 6 24GB $ 0.641 -
RTXA4000 1 Fluidstack custom:0:36F6353DC62E4E2397950DE5EC40BD26 16GB 8 64GB $ 1.052 -
RTXA4000 2 Fluidstack recD36aFY7yDpZoGt 16GB 12 48GB $ 1.277 -
RTXA4000 2 Fluidstack custom:1:36F6353DC62E4E2397950DE5EC40BD26 16GB 16 128GB $ 2.058 -
RTXA4000 4 Fluidstack recyJRy1LdC46X6Bq 16GB 24 96GB $ 2.551 -
RTXA4000 8 Fluidstack recWmWiuQ9RGSGHHZ 16GB 48 192GB $ 5.098 -
RTXA4000 10 Fluidstack rec71JCVNQJ7LrtJq 16GB 64 240GB $ 6.409 -
GPU QTY CLOUD INSTANCE_TYPE DEVICE_MEM vCPUs HOST_MEM HOURLY_PRICE HOURLY_SPOT_PRICE
RTXA5000 1 Fluidstack recnEsfRtKtjJtP89 24GB 8 30GB $ 0.802 -
RTXA5000 1 Fluidstack custom:0:5D62B4BCE9D4417B9D6714D02A017219 24GB 8 64GB $ 1.202 -
RTXA5000 2 Fluidstack recpQJ16RVGg82H1U 24GB 16 60GB $ 1.599 -
RTXA5000 2 Fluidstack custom:1:5D62B4BCE9D4417B9D6714D02A017219 24GB 16 128GB $ 2.358 -
RTXA5000 4 Fluidstack recGGyBSNFR6E4HwX 24GB 32 120GB $ 3.195 -
RTXA5000 8 Fluidstack recCsiQJWg1pO7JiG 24GB 64 240GB $ 6.386 -
GPU QTY CLOUD INSTANCE_TYPE DEVICE_MEM vCPUs HOST_MEM HOURLY_PRICE HOURLY_SPOT_PRICE
RTXA6000 1 Fluidstack recBHQpdbSJmZJmFk 48GB 6 55GB $ 0.790 -
RTXA6000 1 Fluidstack recY0EqEFid9a5Yqf 48GB 16 59GB $ 1.562 -
RTXA6000 2 Fluidstack recQTIlaAazJRDuUt 48GB 12 110GB $ 1.580 -
RTXA6000 1 Fluidstack custom:0:A5AF79B0E3C54438B8AF7F3098FC6341 48GB 8 64GB $ 1.692 -
RTXA6000 2 Fluidstack recsGZmsxe35V5HK3 48GB 32 119GB $ 3.124 -
RTXA6000 2 Fluidstack custom:1:A5AF79B0E3C54438B8AF7F3098FC6341 48GB 16 128GB $ 3.338 -
RTXA6000 4 Fluidstack recDsMwZj5HbYY6Tg 48GB 24 220GB $ 3.400 -
RTXA6000 4 Fluidstack rec3HRxyysDOzxikh 48GB 64 238GB $ 6.244 -
RTXA6000 8 Fluidstack rec67A0ZCdJgLfQf4 48GB 128 480GB $ 12.408 -
GPU QTY CLOUD INSTANCE_TYPE DEVICE_MEM vCPUs HOST_MEM HOURLY_PRICE HOURLY_SPOT_PRICE
V100 1 Fluidstack custom:0:A3E9735B4E2841E3982622EE4DBB41DA 16GB 8 64GB $ 1.232 -
V100 2 Fluidstack custom:1:A3E9735B4E2841E3982622EE4DBB41DA 16GB 16 128GB $ 2.418 -
@@ -298,7 +298,7 @@ def query_instances(
         'pending': status_lib.ClusterStatus.INIT,
         'stopped': status_lib.ClusterStatus.STOPPED,
         'running': status_lib.ClusterStatus.UP,
-        'unhealthy': status_lib.ClusterStatus.INIT,
+        'failed': status_lib.ClusterStatus.INIT,
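A minimal sketch of the lookup this hunk touches, for illustration only. ClusterStatus here is a stand-in enum for status_lib.ClusterStatus, and STATUS_MAP and to_cluster_status are hypothetical names, not the code in this PR. It shows that with 'failed' in place of 'unhealthy', the old string is no longer recognized by the mapping.

```python
import enum


class ClusterStatus(enum.Enum):
    """Stand-in for status_lib.ClusterStatus (hypothetical, for illustration)."""
    INIT = 'INIT'
    UP = 'UP'
    STOPPED = 'STOPPED'


# Provider status string -> cluster status, mirroring the entries in the hunk.
STATUS_MAP = {
    'pending': ClusterStatus.INIT,
    'stopped': ClusterStatus.STOPPED,
    'running': ClusterStatus.UP,
    'failed': ClusterStatus.INIT,
}


def to_cluster_status(provider_status: str):
    """Return the mapped status, or None for an unrecognized string."""
    return STATUS_MAP.get(provider_status)


print(to_cluster_status('failed'))
print(to_cluster_status('unhealthy'))
```

Returning None for an unknown string is one possible design choice; raising an error instead would surface unexpected provider statuses sooner.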
Why change this? Should we add a new entry instead?
The "unhealthy" status is no longer used; it has been refactored to "failed".
Thanks for the review, @cblmemo!
As for H100-NVLink not showing in the list of GPUs: recall that the catalog is fetched from the SkyPilot catalog repository, which is itself generated by the code currently in SkyPilot's main branch. That code does not contain the new mapping, so the GPU does not show up.
To view the new GPU locally, fetch the catalog from FluidStack using the code from the forked repo:
python3 sky/clouds/service_catalog/data_fetchers/fetch_fluidstack.py
cp fluidstack/vms.csv ~/.sky/catalogs/v5/fluidstack/vms.csv
sky show-gpus --cloud fluidstack -a
Before fetching the catalog from the FluidStack API, the FluidStack API key (obtainable from the dashboard) must also be saved to ~/.fluidstack/api_key.
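After copying the regenerated vms.csv into place, a quick way to confirm that the NVLink entries made it into the catalog is to scan the file for them. The following is only a sketch: it assumes the catalog CSV has an AcceleratorName column, the helper name nvlink_accelerators is hypothetical, and the sample rows are made up rather than real catalog data.

```python
import csv
import io


def nvlink_accelerators(csv_text: str, column: str = 'AcceleratorName'):
    """Return the sorted, de-duplicated accelerator names containing 'NVLINK'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    names = {
        row[column]
        for row in reader
        # Rows with a missing or empty accelerator column are skipped.
        if 'NVLINK' in (row.get(column) or '').upper()
    }
    return sorted(names)


# Made-up sample rows, not real catalog data.
sample = (
    'InstanceType,AcceleratorName,AcceleratorCount\n'
    'example-a,H100,1\n'
    'example-b,A100-80GB-NVLINK,8\n'
    'example-c,A100-80GB-NVLINK,4\n'
)
print(nvlink_accelerators(sample))
```

To run it against the local catalog, read ~/.sky/catalogs/v5/fluidstack/vms.csv (expanding the path with os.path.expanduser) and pass its contents as csv_text. An empty result would suggest the catalog was not regenerated from the forked code.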
Hi @mjibril! Thanks for fixing this. I tested with the catalog and it works well. However, when I tried to launch an A100-80GB-NVLINK cluster, my instance stayed in the pending status (there was a 10-minute countdown, but nothing happened when it reached 0). Could you double-check whether this is due to an availability issue? Other than that, it looks great to me ;)
Yes, @cblmemo, this is due to low stock.
Thanks for confirming, @mjibril! LGTM.
sky launch --gpus A100-80GB-NVLINK:8 --cloud fluidstack
python sky/clouds/service_catalog/data_fetchers/fetch_fluidstack.py
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh