Commit
[k8s] On-demand single-host TPU support on GKE (#3947)
* initial version of TPU support on GKE
* revert unnecessary change
* revert
* use TPU_LABEL_KEY constant
* nit
* nit
* update detect_gpu_label_formatter() to use match_label_key()
* tidy get_gpu_label_key_value
* nit
* update method name
* update get_gke_accelerator_name to support TPU
* add support for get_label_keys method due to TPU label key
* syntax
* update get_tpu_topology_label_key_value
* nit
* refactor error surfacing methods to have it work with TPU support
* update toleration comment
* support listing available TPUs and show-gpus for TPUs
* nit
* update help message
* Update /tmp/tpu_logs dir's write permission
* nit
* nit
* comment update on TPU resource shortage error handling
* Update to use global constant instead of hard coded string of nvidia.com/gpu and google.com/tpu
* add smoke test and make exec work on TPU pods
* update smoke test to check if TPU is reachable.
* add comment
* nit
* Comment on number of requested TPU chips for multi- and single-host TPU slice.
* update method to check GKE supported TPU name
* nit
* move is_tpu_pod_slice to kubernetes_utils
* update get_accelerator_from_label_value to use is_tpu_pod_slice method
* nit
* format
* nit
* check acc count support
* preemptive TPU check
* update check_tpu_fits
* error msg update
* merge get_tpu_topology_label_key_value into get_gpu_label_key_value
* Update sky/provision/kubernetes/utils.py
  Co-authored-by: Tian Xia <[email protected]>
* nit fixes
* format
* nit
* Implement method for reading acc counts from node/pod object
* assertion update for is_tpu_vm
* Exclude multi-host TPUs from being displayed in show-gpus
* Notify users that multi-host TPUs are not supported from 'sky show-gpus'
* format
* nit
* display warning message from show-gpus conditionally
* update sky show-gpus
* update get_accelerator_label_key_value
* format
* format
* format
* update comment
* resolve review comments
* update tpuvm_mnist.yaml
* resolve comments
* update display message for show-gpus
* format
---------
Co-authored-by: Tian Xia <[email protected]>
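For readers unfamiliar with the resource keys the message mentions, here is a minimal sketch of how a pod-level accelerator request might switch between `nvidia.com/gpu` and `google.com/tpu`. It is not the code from this commit: the function names `is_tpu_label_value` and `build_accelerator_request` are hypothetical, and the GKE node label keys and the `tpu-` prefix check are assumptions based on GKE's published node labels, not taken from the diff.

```python
from typing import Any, Dict, Optional

# Resource keys named in the commit message.
GPU_RESOURCE_KEY = 'nvidia.com/gpu'
TPU_RESOURCE_KEY = 'google.com/tpu'

# Assumption: GKE node label keys for accelerators (not shown in this commit page).
GKE_GPU_LABEL_KEY = 'cloud.google.com/gke-accelerator'
GKE_TPU_LABEL_KEY = 'cloud.google.com/gke-tpu-accelerator'
GKE_TPU_TOPOLOGY_LABEL_KEY = 'cloud.google.com/gke-tpu-topology'


def is_tpu_label_value(label_value: str) -> bool:
    """Hypothetical helper: treat label values starting with 'tpu-' as TPUs."""
    return label_value.startswith('tpu-')


def build_accelerator_request(label_value: str,
                              count: int,
                              tpu_topology: Optional[str] = None
                              ) -> Dict[str, Any]:
    """Hypothetical helper: build the nodeSelector and resources fragments."""
    if is_tpu_label_value(label_value):
        if tpu_topology is None:
            raise ValueError('A TPU request needs a topology, e.g. "2x2".')
        node_selector = {
            GKE_TPU_LABEL_KEY: label_value,
            GKE_TPU_TOPOLOGY_LABEL_KEY: tpu_topology,
        }
        resource_key = TPU_RESOURCE_KEY
    else:
        node_selector = {GKE_GPU_LABEL_KEY: label_value}
        resource_key = GPU_RESOURCE_KEY
    return {
        'nodeSelector': node_selector,
        'resources': {
            'requests': {resource_key: count},
            'limits': {resource_key: count},
        },
    }


if __name__ == '__main__':
    # A single-host slice with 4 chips; the values are illustrative only.
    print(build_accelerator_request('tpu-v5-lite-podslice', 4, '2x2'))
```

Keeping the resource key in one constant per accelerator type, as the message describes, means the GPU and TPU paths differ only in the selector labels and the key passed to `requests` and `limits`.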
1 parent 2030398 commit eea13cc
Showing 12 changed files with 691 additions and 267 deletions.