[TPU VM] Attaching & Mounting Persistent Disk #3497
Closed
Issue
Reference: #2778
When launching a TPU VM with
sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 200
the resulting VM is still initialized with the default disk size of 100 GB. Because the boot disk of a TPU VM is not resizable, users have to attach a persistent disk to expand their local disk capacity.
tpu_vm.yaml:
Solution
We currently use the Cloud TPU API to manage TPU VMs (e.g. create_instance, set_labels, and delete_instance). However, this API has no support for disk attachment, so this PR uses the GCP CLI to attach a persistent disk to TPU VMs (Documentation).
Test 1:
Launch the TPU VM with a specified disk size:
sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 200
sky stop mycluster
Restart the TPU VM with the same disk size:
sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 200
Verified that an extra 100 GB disk has been created and attached to the TPU VM.
Ensured that the disk is mounted under the path
/mnt/disks/persist
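The attach-and-mount flow exercised by Test 1 can be sketched roughly as follows. This is an illustrative sketch, not the code in this PR: the helper names `gcloud_disk_commands` and `mount_commands`, the device path `/dev/sdb`, and the ext4 filesystem are assumptions, and the `gcloud alpha compute tpus tpu-vm attach-disk` invocation follows the Cloud TPU documentation as I understand it. The functions only build the commands; they do not run them.

```python
import shlex
from typing import List

MOUNT_PATH = "/mnt/disks/persist"  # mount point reported in Test 1


def gcloud_disk_commands(tpu_name: str, disk_name: str, zone: str,
                         size_gb: int) -> List[List[str]]:
    """Build (but do not run) gcloud calls that create a disk and attach it."""
    create = ["gcloud", "compute", "disks", "create", disk_name,
              f"--size={size_gb}GB", f"--zone={zone}"]
    # The Cloud TPU API has no attach call, hence the CLI (alpha track assumed).
    attach = ["gcloud", "alpha", "compute", "tpus", "tpu-vm", "attach-disk",
              tpu_name, f"--zone={zone}", f"--disk={disk_name}",
              "--mode=read-write"]
    return [create, attach]


def mount_commands(device: str = "/dev/sdb",
                   mount_path: str = MOUNT_PATH) -> List[str]:
    """Shell commands to run on the TPU VM; the device path is an assumption."""
    dev, path = shlex.quote(device), shlex.quote(mount_path)
    return [
        # Format only when no filesystem is detected, so a restart keeps data.
        f"sudo blkid {dev} || sudo mkfs.ext4 -F {dev}",
        f"sudo mkdir -p {path}",
        # Skip the mount if the path is already a mount point.
        f"mountpoint -q {path} || sudo mount {dev} {path}",
    ]
```

The commands could be passed to `subprocess.run` (for the gcloud calls) or executed over SSH on the VM (for the mount steps). Guarding the format and mount steps is what makes the stop-and-restart sequence in Test 1 safe to repeat.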
Test 2:
Disk creation failed: The resource 'projects/project_name/zones/zone_name/disks/mycluster-d9a3-tpu-extra-disk' already exists
Test 3:
Launch the TPU VM with a disk size smaller than 100 GB:
sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 80
sky stop mycluster
Restart the TPU VM with the same disk size:
sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 80
Verified that no extra disk has been created
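The sizing behaviour observed in Tests 1 and 3 is consistent with a rule like the one below. This is a minimal sketch inferred from the test results, not the PR's actual implementation: the function name `extra_disk_size_gb` is hypothetical, and the behaviour at exactly 100 GB is an assumption.

```python
DEFAULT_BOOT_DISK_GB = 100  # default TPU VM boot disk size stated in this PR


def extra_disk_size_gb(requested_gb: int,
                       boot_gb: int = DEFAULT_BOOT_DISK_GB) -> int:
    """Return the size of the extra persistent disk, or 0 if none is needed."""
    # The boot disk cannot be resized, so only the shortfall is provisioned.
    return max(0, requested_gb - boot_gb)
```

With `--disk-size 200` this yields a 100 GB extra disk (Test 1), and with `--disk-size 80` it yields 0, meaning no extra disk is created (Test 3).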
Test 4:
pytest tests/test_smoke.py --tpu
Note: