[TPU VM] Attaching & Mounting Persistent Disk #3497

Closed
jackyk02 wants to merge 6 commits

@jackyk02 jackyk02 commented Apr 29, 2024

Issue
Reference: #2778

When a TPU VM is launched with sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 200, the resulting VM still has the default disk size of 100 GB. Because the boot disk of a TPU VM cannot be resized, users must attach a persistent disk to get more local disk capacity.

tpu_vm.yaml:

resources:
   accelerators: tpu-v2-8
   accelerator_args:
      runtime_version: tpu-vm-base

Solution
We currently use the Cloud TPU API to manage TPU VMs (e.g. create_instance, set_labels, and delete_instance). However, this API does not support disk attachment, so this PR uses the GCP CLI to attach a persistent disk to TPU VMs (Documentation).
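
A minimal sketch of the two CLI invocations this approach relies on, assuming one gcloud command creates the disk and a second attaches it to the TPU VM. The helper name build_disk_commands, the TPU node name, and the zone are hypothetical and not taken from this PR. The attach-disk subcommand is shown in its alpha form, and the exact flags should be checked against the linked documentation.

```python
from typing import List


def build_disk_commands(disk_name: str, tpu_name: str, zone: str,
                        size_gb: int) -> List[List[str]]:
    """Return the argv lists for creating a disk and attaching it to a TPU VM.

    Nothing is executed here; the caller decides how to run the commands
    (for example with subprocess.run).
    """
    create_cmd = [
        'gcloud', 'compute', 'disks', 'create', disk_name,
        f'--size={size_gb}GB',
        f'--zone={zone}',
    ]
    # Assumption: the alpha track of the TPU VM attach-disk command is used.
    attach_cmd = [
        'gcloud', 'alpha', 'compute', 'tpus', 'tpu-vm', 'attach-disk',
        tpu_name,
        f'--zone={zone}',
        f'--disk={disk_name}',
        '--mode=read-write',
    ]
    return [create_cmd, attach_cmd]


if __name__ == '__main__':
    # Hypothetical TPU node name and zone, for illustration only.
    for cmd in build_disk_commands('mycluster-d9a3-tpu-extra-disk',
                                   'mycluster-d9a3-tpu', 'us-central1-b', 100):
        print(' '.join(cmd))
```

Returning argv lists instead of shell strings avoids quoting issues and keeps the command construction easy to unit test without touching GCP.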

Test 1:

  1. Launch the TPU VM with a specified disk size, then stop it:
    sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 200
    sky stop mycluster

  2. Restart the TPU VM with the same disk size:
    sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 200

  3. Verified that an extra 100 GB disk has been created and attached to the TPU VM

  4. Ensured that the disk is mounted at /mnt/disks/persist
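
For reference, the mount step checked in Test 1 could look roughly like the sketch below. It assumes the attached disk shows up as /dev/sdb (the real device path depends on the VM) and that formatting should happen only when the disk has no filesystem yet. The helper name mount_commands is hypothetical and not part of this PR.

```python
import shlex
from typing import List


def mount_commands(device: str, mount_point: str) -> List[str]:
    """Return shell commands that format (if needed) and mount a disk."""
    dev = shlex.quote(device)
    path = shlex.quote(mount_point)
    return [
        # blkid exits non-zero when no filesystem is found, so mkfs runs
        # only on a fresh disk and existing data survives restarts.
        f'sudo blkid {dev} || sudo mkfs.ext4 -m 0 {dev}',
        f'sudo mkdir -p {path}',
        f'sudo mount -o discard,defaults {dev} {path}',
    ]


if __name__ == '__main__':
    # Assumption: /dev/sdb is the attached persistent disk.
    for line in mount_commands('/dev/sdb', '/mnt/disks/persist'):
        print(line)
```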

Test 2:

  1. Relaunch the TPU VM multiple times
  2. Received Error: Disk creation failed: The resource 'projects/project_name/zones/zone_name/disks/mycluster-d9a3-tpu-extra-disk' already exists
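
One possible way to handle the error seen in Test 2 is to treat an "already exists" response as a signal to reuse the disk instead of failing. The sketch below is an assumption about how that could be done, not what this PR implements. The function ensure_disk and the injected run callable are hypothetical names.

```python
from typing import Callable, List, Tuple

# A runner takes an argv list and returns (return code, stderr text).
Runner = Callable[[List[str]], Tuple[int, str]]


def ensure_disk(create_cmd: List[str], run: Runner) -> bool:
    """Create the disk, tolerating the case where it already exists.

    Returns True if the disk was created and False if it already existed.
    Any other failure is raised to the caller.
    """
    returncode, stderr = run(create_cmd)
    if returncode == 0:
        return True
    if 'already exists' in stderr:
        # A previous launch created the disk; reuse it.
        return False
    raise RuntimeError(f'Disk creation failed: {stderr}')


if __name__ == '__main__':
    # Fake runner that mimics the error message from Test 2.
    def fake_run(cmd: List[str]) -> Tuple[int, str]:
        return 1, "The resource 'projects/project_name/zones/zone_name/disks/mycluster-d9a3-tpu-extra-disk' already exists"

    print(ensure_disk(['gcloud', 'compute', 'disks', 'create'], fake_run))
```

Injecting the runner keeps the sketch testable without calling gcloud; in real use it could wrap subprocess.run.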

Test 3:

  1. Launch the TPU VM with a disk size that is less than 100 GB, then stop it:
    sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 80
    sky stop mycluster

  2. Restart the TPU VM with the same disk size:
    sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 80

  3. Verified that no extra disk has been created
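
The behavior observed in Tests 1 and 3 suggests a sizing rule along these lines: the extra disk covers only the capacity requested beyond the 100 GB boot disk. This is a sketch inferred from the test results, not code from this PR. The function name is hypothetical, and treating a request of exactly 100 GB as needing no extra disk is an assumption.

```python
from typing import Optional

# Default boot disk size of a TPU VM, as described in this PR.
DEFAULT_BOOT_DISK_GB = 100


def extra_disk_size_gb(requested_gb: int) -> Optional[int]:
    """Return the size of the extra persistent disk, or None if not needed."""
    if requested_gb <= DEFAULT_BOOT_DISK_GB:
        # Assumption: requests at or below the boot disk size need no disk.
        return None
    return requested_gb - DEFAULT_BOOT_DISK_GB


if __name__ == '__main__':
    print(extra_disk_size_gb(200))  # Test 1 case
    print(extra_disk_size_gb(80))   # Test 3 case
```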

Test 4:
pytest tests/test_smoke.py --tpu

Note:

  1. Disk attachment takes effect only after the cluster is restarted.

This PR is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Aug 28, 2024

github-actions bot commented Sep 8, 2024

This PR was closed because it has been stalled for 10 days with no activity.

@github-actions github-actions bot closed this Sep 8, 2024