[k8s] On-demand single-host TPU support on GKE (#3947)
* initial version of TPU support on GKE
* revert unnecessary change
* revert
* use TPU_LABEL_KEY constant
* nit
* nit
* update detect_gpu_label_formatter() to use match_label_key()
* tidy get_gpu_label_key_value
* nit
* update method name
* update get_gke_accelerator_name to support TPU
* add support for get_label_keys method due to TPU label key
* syntax
* update get_tpu_topology_label_key_value
* nit
* refactor error-surfacing methods to work with TPU support
* update toleration comment
* support listing available TPUs and show-gpus for TPUs
* nit
* update help message
* Update /tmp/tpu_logs dir's write permission
* nit
* nit
* update comment on error handling for insufficient TPU resources
* Update to use global constant instead of hard coded string of nvidia.com/gpu and google.com/tpu
* add smoke test and make exec work on TPU pods
* update smoke test to check if TPU is reachable.
* add comment
* nit
* Comment on number of requested TPU chips for multi- and single-host TPU slices.
* update method to check GKE supported TPU name
* nit
* move is_tpu_pod_slice to kubernetes_utils
* update get_accelerator_from_label_value to use is_tpu_pod_slice method
* nit
* format
* nit
* check acc count support
* preemptive TPU check
* update check_tpu_fits
* error msg update
* merge get_tpu_topology_label_key_value into get_gpu_label_key_value
* Update sky/provision/kubernetes/utils.py
Co-authored-by: Tian Xia <[email protected]>
* nit fixes
* format
* nit
* Implement method for reading acc counts from node/pod object
* assertion update for is_tpu_vm
* Exclude multi-host TPUs from being displayed in show-gpus
* Notify users in 'sky show-gpus' that multi-host TPUs are not supported
* format
* nit
* display warning message from show-gpus conditionally
* update sky show-gpus
* update get_accelerator_label_key_value
* format
* format
* format
* update comment
* resolve review comments
* update tpuvm_mnist.yaml
* resolve comments
* update display message for show-gpus
* format
---------
Co-authored-by: Tian Xia <[email protected]>