[FluidStack] Add NVLINK GPUs #3954

Open
wants to merge 1 commit into
base: master

mjibril
Contributor

@mjibril mjibril commented Sep 17, 2024

* Add NVLINK GPUs as distinct gpus

sky launch --gpus A100-80GB-NVLINK:8 --cloud fluidstack
python sky/clouds/service_catalog/data_fetchers/fetch_fluidstack.py
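
As a rough illustration of what treating NVLINK GPUs as distinct types could look like in a catalog fetcher, the sketch below maps provider-side GPU identifiers to catalog accelerator names. The keys, the `GPU_NAME_MAP` table, and `to_catalog_name` are hypothetical and are not taken from `fetch_fluidstack.py`; only the names `A100-80GB` and `A100-80GB-NVLINK` come from this conversation.

```python
from typing import Optional

# Hypothetical provider identifiers (keys) -> catalog accelerator names (values).
# The keys are invented for illustration; they are not the real FluidStack API values.
GPU_NAME_MAP = {
    'A100_PCIE_80GB': 'A100-80GB',
    'A100_NVLINK_80GB': 'A100-80GB-NVLINK',
}


def to_catalog_name(provider_gpu_type: str) -> Optional[str]:
    """Return the catalog name for a provider GPU type, or None if unmapped."""
    return GPU_NAME_MAP.get(provider_gpu_type)


if __name__ == '__main__':
    # The NVLINK variant resolves to its own catalog name instead of
    # being folded into the plain A100-80GB entry.
    print(to_catalog_name('A100_NVLINK_80GB'))
```

With a separate catalog name, a request such as `--gpus A100-80GB-NVLINK:8` can be matched against NVLINK instances only.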

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

    * Add NVLINK GPUs as distinct gpus
@Michaelvll Michaelvll requested a review from cblmemo September 19, 2024 17:04
Collaborator

@cblmemo cblmemo left a comment


Thanks for adding this feature, @mjibril! I tried sky show-gpus --cloud fluidstack -a, and the NVLINK GPU types are not shown in the table. Is that expected?

$ sky show-gpus --cloud fluidstack -a
COMMON_GPU  AVAILABLE_QUANTITIES  
A100        4                     
A100-80GB   1, 2, 4, 8            
H100        1, 2, 4, 8            
V100        1, 2                  

OTHER_GPU  AVAILABLE_QUANTITIES  
A40        1, 2                  
L40        1, 2, 4, 8            
RTX4000    1, 2                  
RTX5000    1, 2                  
RTXA4000   1, 2, 4, 8, 10        
RTXA5000   1, 2, 4, 8            
RTXA6000   1, 2, 4, 8            

GPU   QTY  CLOUD       INSTANCE_TYPE      DEVICE_MEM  vCPUs  HOST_MEM  HOURLY_PRICE  HOURLY_SPOT_PRICE  
A100  4    Fluidstack  recUiB2e6s3XDxwE9  40GB        60     440GB     $ 5.880       -                  

GPU        QTY  CLOUD       INSTANCE_TYPE              DEVICE_MEM  vCPUs  HOST_MEM  HOURLY_PRICE  HOURLY_SPOT_PRICE  
A100-80GB  1    Fluidstack  recdpfEZCYapXrX5TbSFNMUUi  80GB        32     120GB     $ 2.483       -                  
A100-80GB  2    Fluidstack  recE2ZDQmqR9HBKYs5xSnjtPw  80GB        64     240GB     $ 4.956       -                  
A100-80GB  4    Fluidstack  recWGm4oJ9AB3XVPxzRaujgbx  80GB        126    480GB     $ 9.895       -                  
A100-80GB  8    Fluidstack  recUYj6oGJCvAvCXC7KQo5Fc7  80GB        252    960GB     $ 19.786      -                  

GPU  QTY  CLOUD       INSTANCE_TYPE                              DEVICE_MEM  vCPUs  HOST_MEM  HOURLY_PRICE  HOURLY_SPOT_PRICE  
A40  1    Fluidstack  custom:0:49CA322D5074468E99EF80A20EB47838  48GB        8      64GB      $ 1.692       -                  
A40  2    Fluidstack  custom:1:49CA322D5074468E99EF80A20EB47838  48GB        16     128GB     $ 3.338       -                  

GPU   QTY  CLOUD       INSTANCE_TYPE              DEVICE_MEM  vCPUs  HOST_MEM  HOURLY_PRICE  HOURLY_SPOT_PRICE  
H100  1    Fluidstack  recgdWtkWDJq2qMHRJehMFG2C  80GB        32     180GB     $ 5.290       -                  
H100  2    Fluidstack  rec49QTUzoTUX7PtMkkFJxRn8  80GB        64     360GB     $ 10.576      -                  
H100  4    Fluidstack  recUBDR2d46CWuYeW5uWWuTpx  80GB        126    720GB     $ 21.141      -                  
H100  8    Fluidstack  recA75DjBbrFwoD8bT3GBZDCa  80GB        252    1440GB    $ 42.284      -                  

GPU  QTY  CLOUD       INSTANCE_TYPE              DEVICE_MEM  vCPUs  HOST_MEM  HOURLY_PRICE  HOURLY_SPOT_PRICE  
L40  1    Fluidstack  recVcAEL8UwVgZWP5WNrQJN8r  48GB        32     60GB      $ 1.761       -                  
L40  2    Fluidstack  recSCaaKigbSg5MQPPVNoH9nG  48GB        64     120GB     $ 3.508       -                  
L40  4    Fluidstack  reciAySCoQSubyQ2atsxqFRxK  48GB        126    240GB     $ 7.002       -                  
L40  8    Fluidstack  recBNHGKPfgVSjm7hhThid8wu  48GB        252    480GB     $ 14.001      -                  

GPU      QTY  CLOUD       INSTANCE_TYPE                              DEVICE_MEM  vCPUs  HOST_MEM  HOURLY_PRICE  HOURLY_SPOT_PRICE  
RTX4000  1    Fluidstack  custom:0:6B224766C0EF48A9A7E5E342DD771D26  8GB         8      64GB      $ 0.702       -                  
RTX4000  2    Fluidstack  custom:1:6B224766C0EF48A9A7E5E342DD771D26  8GB         16     128GB     $ 1.358       -                  

GPU      QTY  CLOUD       INSTANCE_TYPE                              DEVICE_MEM  vCPUs  HOST_MEM  HOURLY_PRICE  HOURLY_SPOT_PRICE  
RTX5000  1    Fluidstack  custom:0:3243C00DDFFA449F872F81FBD068D8A7  16GB        8      64GB      $ 1.012       -                  
RTX5000  2    Fluidstack  custom:1:3243C00DDFFA449F872F81FBD068D8A7  16GB        16     128GB     $ 1.978       -                  

GPU       QTY  CLOUD       INSTANCE_TYPE                              DEVICE_MEM  vCPUs  HOST_MEM  HOURLY_PRICE  HOURLY_SPOT_PRICE  
RTXA4000  1    Fluidstack  rec3pUyh6pNkIjCaL                          16GB        6      24GB      $ 0.641       -                  
RTXA4000  1    Fluidstack  custom:0:36F6353DC62E4E2397950DE5EC40BD26  16GB        8      64GB      $ 1.052       -                  
RTXA4000  2    Fluidstack  recD36aFY7yDpZoGt                          16GB        12     48GB      $ 1.277       -                  
RTXA4000  2    Fluidstack  custom:1:36F6353DC62E4E2397950DE5EC40BD26  16GB        16     128GB     $ 2.058       -                  
RTXA4000  4    Fluidstack  recyJRy1LdC46X6Bq                          16GB        24     96GB      $ 2.551       -                  
RTXA4000  8    Fluidstack  recWmWiuQ9RGSGHHZ                          16GB        48     192GB     $ 5.098       -                  
RTXA4000  10   Fluidstack  rec71JCVNQJ7LrtJq                          16GB        64     240GB     $ 6.409       -                  

GPU       QTY  CLOUD       INSTANCE_TYPE                              DEVICE_MEM  vCPUs  HOST_MEM  HOURLY_PRICE  HOURLY_SPOT_PRICE  
RTXA5000  1    Fluidstack  recnEsfRtKtjJtP89                          24GB        8      30GB      $ 0.802       -                  
RTXA5000  1    Fluidstack  custom:0:5D62B4BCE9D4417B9D6714D02A017219  24GB        8      64GB      $ 1.202       -                  
RTXA5000  2    Fluidstack  recpQJ16RVGg82H1U                          24GB        16     60GB      $ 1.599       -                  
RTXA5000  2    Fluidstack  custom:1:5D62B4BCE9D4417B9D6714D02A017219  24GB        16     128GB     $ 2.358       -                  
RTXA5000  4    Fluidstack  recGGyBSNFR6E4HwX                          24GB        32     120GB     $ 3.195       -                  
RTXA5000  8    Fluidstack  recCsiQJWg1pO7JiG                          24GB        64     240GB     $ 6.386       -                  

GPU       QTY  CLOUD       INSTANCE_TYPE                              DEVICE_MEM  vCPUs  HOST_MEM  HOURLY_PRICE  HOURLY_SPOT_PRICE  
RTXA6000  1    Fluidstack  recBHQpdbSJmZJmFk                          48GB        6      55GB      $ 0.790       -                  
RTXA6000  1    Fluidstack  recY0EqEFid9a5Yqf                          48GB        16     59GB      $ 1.562       -                  
RTXA6000  2    Fluidstack  recQTIlaAazJRDuUt                          48GB        12     110GB     $ 1.580       -                  
RTXA6000  1    Fluidstack  custom:0:A5AF79B0E3C54438B8AF7F3098FC6341  48GB        8      64GB      $ 1.692       -                  
RTXA6000  2    Fluidstack  recsGZmsxe35V5HK3                          48GB        32     119GB     $ 3.124       -                  
RTXA6000  2    Fluidstack  custom:1:A5AF79B0E3C54438B8AF7F3098FC6341  48GB        16     128GB     $ 3.338       -                  
RTXA6000  4    Fluidstack  recDsMwZj5HbYY6Tg                          48GB        24     220GB     $ 3.400       -                  
RTXA6000  4    Fluidstack  rec3HRxyysDOzxikh                          48GB        64     238GB     $ 6.244       -                  
RTXA6000  8    Fluidstack  rec67A0ZCdJgLfQf4                          48GB        128    480GB     $ 12.408      -                  

GPU   QTY  CLOUD       INSTANCE_TYPE                              DEVICE_MEM  vCPUs  HOST_MEM  HOURLY_PRICE  HOURLY_SPOT_PRICE  
V100  1    Fluidstack  custom:0:A3E9735B4E2841E3982622EE4DBB41DA  16GB        8      64GB      $ 1.232       -                  
V100  2    Fluidstack  custom:1:A3E9735B4E2841E3982622EE4DBB41DA  16GB        16     128GB     $ 2.418       -                  

@@ -298,7 +298,7 @@ def query_instances(
'pending': status_lib.ClusterStatus.INIT,
'stopped': status_lib.ClusterStatus.STOPPED,
'running': status_lib.ClusterStatus.UP,
'unhealthy': status_lib.ClusterStatus.INIT,
'failed': status_lib.ClusterStatus.INIT,
Collaborator


Why change this? Should we add a new entry instead?

Contributor Author


"unhealthy" is no longer used; it was refactored to "failed".
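
A minimal sketch of the status mapping discussed in this thread, assuming the rename described in the reply above. `ClusterStatus` here is a local stand-in for `status_lib.ClusterStatus`, and `to_cluster_status` is a hypothetical helper, not a function from the SkyPilot codebase.

```python
import enum
from typing import Optional


class ClusterStatus(enum.Enum):
    """Local stand-in for status_lib.ClusterStatus (only the members seen in the diff)."""
    INIT = 'INIT'
    UP = 'UP'
    STOPPED = 'STOPPED'


# Entries follow the diff context above. 'unhealthy' is left out on the
# assumption that the provider now reports 'failed' instead.
STATUS_MAP = {
    'pending': ClusterStatus.INIT,
    'stopped': ClusterStatus.STOPPED,
    'running': ClusterStatus.UP,
    'failed': ClusterStatus.INIT,
}


def to_cluster_status(provider_status: str) -> Optional[ClusterStatus]:
    """Map a provider status string to a cluster status; None if unknown."""
    return STATUS_MAP.get(provider_status)
```

Returning None for unknown strings is a design choice of this sketch; the real `query_instances` may handle unmapped statuses differently.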

Contributor Author

@mjibril mjibril Oct 17, 2024


Thanks for the review @cblmemo !

As for H100-NVLink not showing in the list of GPUs: the catalog is fetched from the SkyPilot catalog repository, which is itself generated by the code currently on SkyPilot's main branch. That code does not contain the new mapping yet, so the GPU does not show up.

To view this new GPU locally, we need to fetch the catalog from FluidStack using the code from the forked repo.

python3 sky/clouds/service_catalog/data_fetchers/fetch_fluidstack.py
cp fluidstack/vms.csv ~/.sky/catalogs/v5/fluidstack/vms.csv 
sky show-gpus --cloud fluidstack -a

Before fetching the catalog from the FluidStack API, we also need to store the FluidStack API key (obtainable from the dashboard) in ~/.fluidstack/api_key.
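
The local check described above can also be scripted. The sketch below verifies that an API key file exists and lists the NVLINK accelerator names found in a copied catalog. The helper names are hypothetical, and the `AcceleratorName` column is an assumption about the catalog CSV layout; pass a different `column` if the file uses another header.

```python
import csv
from pathlib import Path
from typing import List

# Paths taken from the commands and note above.
DEFAULT_CATALOG = Path.home() / '.sky' / 'catalogs' / 'v5' / 'fluidstack' / 'vms.csv'
DEFAULT_API_KEY = Path.home() / '.fluidstack' / 'api_key'


def api_key_present(path: Path = DEFAULT_API_KEY) -> bool:
    """True if the key file exists and contains non-whitespace text."""
    return path.is_file() and bool(path.read_text().strip())


def nvlink_accelerators(catalog_path: Path = DEFAULT_CATALOG,
                        column: str = 'AcceleratorName') -> List[str]:
    """Return the sorted, de-duplicated accelerator names containing 'NVLINK'."""
    with open(catalog_path, newline='') as f:
        # Missing or empty cells are treated as empty strings.
        names = {row.get(column) or '' for row in csv.DictReader(f)}
    return sorted(name for name in names if 'NVLINK' in name.upper())
```

An empty list from `nvlink_accelerators()` would suggest the local catalog was generated without the new mapping.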

Collaborator

@cblmemo cblmemo left a comment


Hi @mjibril! Thanks for fixing this. I tested with the catalog and it works well. However, when I tried to launch an A100-80GB-NVLINK cluster, my instance stayed in this pending status (there was a 10-minute countdown, but nothing happened when it reached 0). Could you double-check whether this is due to an availability issue? Other than this, it looks great to me ;)

image

Contributor Author

mjibril commented Oct 29, 2024

> Hi @mjibril! Thanks for fixing this. I tested with the catalog and it works well. However, when I tried to launch an A100-80GB-NVLINK cluster, my instance stayed in this pending status (there was a 10-minute countdown, but nothing happened when it reached 0). Could you double-check whether this is due to an availability issue? Other than this, it looks great to me ;)

Yes, @cblmemo. This is due to low stock.

Collaborator

@cblmemo cblmemo left a comment


Thanks for confirming, @mjibril! LGTM.
