Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New provisioner GCP #2681

Merged
merged 145 commits into from
Jan 1, 2024
Merged

New provisioner GCP #2681

merged 145 commits into from
Jan 1, 2024

Conversation

suquark
Copy link
Collaborator

@suquark suquark commented Oct 8, 2023

Draft implementation for new provisioner for GCP.

For this PR, we only support non-tpuvm instances. We will support TPUVM in a later PR.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@suquark suquark changed the title [WIP] New provisioner GCP New provisioner GCP Oct 20, 2023
@suquark suquark force-pushed the new_provisioner_gcp branch from 91c00ad to 71ce79c Compare October 20, 2023 22:27
@suquark suquark requested a review from Michaelvll October 20, 2023 22:32
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/provision/gcp/instance.py Outdated Show resolved Hide resolved
sky/provision/gcp/instance.py Show resolved Hide resolved
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @suquark! It looks great! Left several comments.

I just tried the following and it seems working great:

  1. sky launch -c test-gcp --cloud gcp (2m5s)
  2. time sky launch -y -c test-gcp-32 --cloud gcp --num-nodes 32 --cpus 2 (3m24.091s with the two threadpool comments applied)

I found some issues:

  1. When I ctrl-c a sky down for the 32-node cluster above, and try to sky down again, sometimes the following error occurs:
  File "/home/gcpuser/skypilot/sky/provision/gcp/instance.py", line 430, in terminate_instances
    _wait_for_operations(operations, project_id, zone)
  File "/home/gcpuser/skypilot/sky/provision/gcp/instance.py", line 87, in _wait_for_operations
    if handler.wait_for_operation(operation, project_id, zone):
  File "/home/gcpuser/skypilot/sky/provision/gcp/instance_utils.py", line 414, in wait_for_operation
    raise Exception(result['error'])
Exception: {'errors': [{'code': 'RESOURCE_NOT_FOUND', 'message': "The resource 'projects/skypilot-375900/zones/us-central1-a/instances/test-gcp-new-32-2-2514-worker-6a81764d-compute' was not found"}]}

sky/provision/gcp/instance.py Outdated Show resolved Hide resolved
sky/provision/gcp/instance.py Outdated Show resolved Hide resolved
sky/provision/gcp/instance.py Show resolved Hide resolved
sky/provision/gcp/instance.py Outdated Show resolved Hide resolved
sky/provision/gcp/instance.py Outdated Show resolved Hide resolved
sky/provision/gcp/config.py Outdated Show resolved Hide resolved
sky/provision/gcp/instance_utils.py Outdated Show resolved Hide resolved
sky/provision/gcp/instance.py Outdated Show resolved Hide resolved
sky/provision/gcp/config.py Show resolved Hide resolved
sky/provision/gcp/config.py Outdated Show resolved Hide resolved
@Michaelvll
Copy link
Collaborator

Thanks for the comment @concretevitamin ! I just updated the logging to the cleaner version as suggested. It is a bit tricky to keep the same partially dimmed version. I suppose it should be fine to keep the whole line dimmed.

For normal VM:
image

For TPU VM:
image

@Michaelvll
Copy link
Collaborator

Michaelvll commented Dec 28, 2023

Tested (b2bcbe8):

  • pytest tests/test_smoke.py
  • pytest tests/test_smoke.py --gcp
  • pytest tests/test_smoke.py --aws
  • bash -i tests/backward_compatibility_tests.sh
  • pytest tests/test_smoke.py --gcp with private IP and VPC (on a GCP machine, the regions with NAT is added in the ~/.sky/config.yaml)
  • pytest tests/test_smoke.py --gcp with private IP and VPC with jump server (on a AWS machine, except for serve and docker tests, as they do not support proxy)

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Show resolved Hide resolved
sky/provision/gcp/instance_utils.py Outdated Show resolved Hide resolved
sky/provision/gcp/instance_utils.py Outdated Show resolved Hide resolved
@Michaelvll
Copy link
Collaborator

Michaelvll commented Jan 1, 2024

Tested:

  • master: sky launch -c test-tpuvm-pod --gpus tpu-v2-32 echo hi; this PR: sky exec test-tpuvm-pod echo hi; sky launch -c test-tpuvm-pod --gpus tpu-v2-32 echo hi
  • commit before Set TPU VM the default option #1758 (8f18a6d) sky launch -c test-tpu-node test.yaml; this PR: sky exec test-tpu-node test-with-acc-args.yaml; sky launch -c test-tpu-node test-with-acc-args.yaml
    resources: 
      accelerators: tpu-v2-8
    
    run: |
      echo "Hello World"
    

@Michaelvll Michaelvll merged commit 318553b into master Jan 1, 2024
19 checks passed
@Michaelvll Michaelvll deleted the new_provisioner_gcp branch January 1, 2024 01:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants