-
Notifications
You must be signed in to change notification settings - Fork 548
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[k8s] Realtime GPU availability of kubernetes cluster in sky show-gpus
#3499
Conversation
Ran some tests + updated cli docs. Ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the support for this @romilbhardwaj! Mostly looks good to me with minor nits. Just tried it on a k8s cluster with GPUs and it seems working well.
return list_accelerators_realtime(gpus_only, name_filter, region_filter, | ||
quantity_filter, case_sensitive, | ||
all_regions, require_price)[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will calling this adds additional overhead to the list_accelerators
? Since we are relying on the list_accelerators
to generate the optimization candidate resources, which will be called multiple times during the failover process. Would be nice to make sure this does not add overhead. : )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point.. the overhead compared to a the previous implementation isn't much different since the previous implementation was also invoking the kubernetes API:
This branch:
multitime -n 5 sky launch --dryrun -y --gpus T4:1
===> multitime results
1: sky launch --dryrun -y --gpus T4:1
Mean Std.Dev. Min Median Max
real 3.883 0.064 3.782 3.883 3.982
user 2.775 0.081 2.654 2.766 2.871
sys 3.136 0.285 2.676 3.268 3.448
Master:
multitime -n 5 sky launch --dryrun -y --gpus T4:1
1: sky launch --dryrun -y --gpus T4:1
Mean Std.Dev. Min Median Max
real 3.863 0.032 3.829 3.860 3.917
user 2.713 0.023 2.670 2.716 2.735
sys 3.438 0.097 3.267 3.471 3.535
That said, we should put a lru cache with a time-to-live (TTL) to expire based on time. Added a TODO.
dd02ab9
to
a263365
Compare
…o k8s_show_gpus_availability # Conflicts: # sky/cli.py
Thanks @Michaelvll! Ready for another look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @romilbhardwaj! LGTM. IIRC, we may want to have a separate section for the k8s table in sky show-gpus
without any argument, so that it can be easier to distinguish those "on-prem" GPUs.
Also, it seems sky show-gpus t4
does not contain the kubernetes cluster, although sky show-gpus --cloud kubernetes
does show the T4 GPUs. Can we show the k8s section in sky show-gpus t4
as well?
…o k8s_show_gpus_availability
Thanks @Michaelvll - I've made some updates:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @romilbhardwaj!
I found having kubernetes GPUs mixed with the cloud tables a bit weird in sky show-gpus t4
.
One idea: we just have two sections, one for clouds, and one for k8s? For the k8s section, we just show the real-time availability table.
Similarly for sky show-gpus
, we can have two sections, each with a title, e.g., Clouds
, Kubernetes
(similar to our sky status
with three sections for clusters, jobs, and services).
We can have the Kubernetes section at the top so as to make all the cloud tables more connected together : )
Thanks @Michaelvll - here's the latest behavior to help review:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @romilbhardwaj for updating this! It works great! LGTM!
A minor point: for |
Thanks @Michaelvll! Moved the hint to the top for |
Closes #2839 and and #3448. Shows realtime availability of GPUs on the cluster when
--cloud kubernetes
is passed tosky show-gpus
.Examples
On a kubernetes cluster with the following configuration:
Tested (run the relevant ones):
bash format.sh
pytest tests/test_list_accelerators.py