From 778d666f8633f37cc0be949bab03fa52ef8a9dd2 Mon Sep 17 00:00:00 2001 From: Zhanghao Wu Date: Tue, 21 Nov 2023 03:44:58 +0800 Subject: [PATCH] [Docs] Docs for multiple resources interface (#2803) * Add docs for multiple resources * Add docs for multiple resources * Update docs/source/examples/auto-failover.rst Co-authored-by: Romil Bhardwaj * Update docs/source/examples/auto-failover.rst Co-authored-by: Romil Bhardwaj * Update docs/source/examples/auto-failover.rst Co-authored-by: Romil Bhardwaj * Update docs/source/examples/auto-failover.rst Co-authored-by: Romil Bhardwaj * Address comments * reword * move --------- Co-authored-by: Romil Bhardwaj --- docs/source/examples/auto-failover.rst | 148 ++++++++++++++++++------- 1 file changed, 108 insertions(+), 40 deletions(-) diff --git a/docs/source/examples/auto-failover.rst b/docs/source/examples/auto-failover.rst index e612efcea08..e31b3033e46 100644 --- a/docs/source/examples/auto-failover.rst +++ b/docs/source/examples/auto-failover.rst @@ -32,7 +32,7 @@ A common high-end GPU to use in deep learning is a NVIDIA V100 GPU. These GPUs are often in high demand and hard to get. Let's see how SkyPilot's auto-failover provisioner handles such a request: -.. code-block:: +.. code-block:: console $ sky launch -c gpu --gpus V100 ... # optimizer output @@ -42,13 +42,7 @@ provisioner handles such a request: I 02-11 21:17:43 cloud_vm_ray_backend.py:624] I 02-11 21:17:43 cloud_vm_ray_backend.py:624] Launching on GCP us-central1 (us-central1-a) W 02-11 21:17:56 cloud_vm_ray_backend.py:358] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-a (message: The zone 'projects/intercloud-320520/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.) 
- I 02-11 21:17:56 cloud_vm_ray_backend.py:624] - I 02-11 21:17:56 cloud_vm_ray_backend.py:624] Launching on GCP us-central1 (us-central1-b) - W 02-11 21:18:10 cloud_vm_ray_backend.py:358] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-b (message: The zone 'projects/intercloud-320520/zones/us-central1-b' does not have enough resources available to fulfill the request. Try a different zone, or try again later.) - I 02-11 21:18:10 cloud_vm_ray_backend.py:624] - I 02-11 21:18:10 cloud_vm_ray_backend.py:624] Launching on GCP us-central1 (us-central1-c) - W 02-11 21:18:24 cloud_vm_ray_backend.py:358] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-c (message: The zone 'projects/intercloud-320520/zones/us-central1-c' does not have enough resources available to fulfill the request. Try a different zone, or try again later.) - I 02-11 21:18:24 cloud_vm_ray_backend.py:624] + ... I 02-11 21:18:24 cloud_vm_ray_backend.py:624] Launching on GCP us-central1 (us-central1-f) W 02-11 21:18:38 cloud_vm_ray_backend.py:358] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-f (message: The zone 'projects/intercloud-320520/zones/us-central1-f' does not have enough resources available to fulfill the request. Try a different zone, or try again later.) I 02-11 21:18:38 cloud_vm_ray_backend.py:624] @@ -66,7 +60,7 @@ Here is an example of cross-cloud failover when requesting 8x V100 GPUs. All regions in GCP failed to provide the resource, so the provisioner switched to AWS, where it succeeded after two regions: -.. code-block:: +.. code-block:: console $ sky launch -c v100-8 --gpus V100:8 ... 
# optimizer output @@ -76,41 +70,115 @@ AWS, where it succeeded after two regions: I 02-23 16:39:59 cloud_vm_ray_backend.py:668] I 02-23 16:39:59 cloud_vm_ray_backend.py:668] Launching on GCP us-central1 (us-central1-a) W 02-23 16:40:17 cloud_vm_ray_backend.py:403] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-a (message: The zone 'projects/intercloud-320520/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.) - I 02-23 16:40:17 cloud_vm_ray_backend.py:668] - I 02-23 16:40:17 cloud_vm_ray_backend.py:668] Launching on GCP us-central1 (us-central1-b) - W 02-23 16:40:35 cloud_vm_ray_backend.py:403] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-b (message: The zone 'projects/intercloud-320520/zones/us-central1-b' does not have enough resources available to fulfill the request. Try a different zone, or try again later.) - I 02-23 16:40:35 cloud_vm_ray_backend.py:668] - I 02-23 16:40:35 cloud_vm_ray_backend.py:668] Launching on GCP us-central1 (us-central1-c) - W 02-23 16:40:55 cloud_vm_ray_backend.py:403] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-c (message: The zone 'projects/intercloud-320520/zones/us-central1-c' does not have enough resources available to fulfill the request. Try a different zone, or try again later.) - I 02-23 16:40:55 cloud_vm_ray_backend.py:668] - I 02-23 16:40:55 cloud_vm_ray_backend.py:668] Launching on GCP us-central1 (us-central1-f) - W 02-23 16:41:13 cloud_vm_ray_backend.py:403] Got QUOTA_EXCEEDED in us-central1-f (message: Quota 'NVIDIA_V100_GPUS' exceeded. Limit: 1.0 in region us-central1.) - I 02-23 16:41:13 cloud_vm_ray_backend.py:668] - I 02-23 16:41:13 cloud_vm_ray_backend.py:668] Launching on GCP us-west1 (us-west1-a) - W 02-23 16:41:31 cloud_vm_ray_backend.py:403] Got QUOTA_EXCEEDED in us-west1-a (message: Quota 'NVIDIA_V100_GPUS' exceeded. Limit: 1.0 in region us-west1.) 
- I 02-23 16:41:31 cloud_vm_ray_backend.py:668] - I 02-23 16:41:31 cloud_vm_ray_backend.py:668] Launching on GCP us-east1 (us-east1-c) - W 02-23 16:41:50 cloud_vm_ray_backend.py:403] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-east1-c (message: The zone 'projects/intercloud-320520/zones/us-east1-c' does not have enough resources available to fulfill the request. Try a different zone, or try again later.) - E 02-23 16:41:50 cloud_vm_ray_backend.py:746] Failed to acquire resources in all regions/zones (requested GCP(n1-highmem-8, {'V100': 8.0})). Try changing resource requirements or use another cloud. - W 02-23 16:41:50 cloud_vm_ray_backend.py:891] - W 02-23 16:41:50 cloud_vm_ray_backend.py:891] Provision failed for GCP(n1-highmem-8, {'V100': 8.0}). Trying other launchable resources (if any)... ... - I 02-23 16:41:50 cloud_vm_ray_backend.py:668] Launching on AWS us-east-1 (us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f) - W 02-23 16:42:15 cloud_vm_ray_backend.py:477] Got error(s) in all zones of us-east-1: - W 02-23 16:42:15 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-1a). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1b, us-east-1d, us-east-1f., retrying. - W 02-23 16:42:15 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-1b). Our system will be working on provisioning additional capacity. 
You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1d, us-east-1f., retrying. - W 02-23 16:42:15 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (Unsupported) when calling the RunInstances operation: Your requested instance type (p3.16xlarge) is not supported in your requested Availability Zone (us-east-1c). Please retry your request by not specifying an Availability Zone or choosing us-east-1a, us-east-1b, us-east-1d, us-east-1f., retrying. - W 02-23 16:42:15 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b, us-east-1f., retrying. - W 02-23 16:42:15 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (Unsupported) when calling the RunInstances operation: Your requested instance type (p3.16xlarge) is not supported in your requested Availability Zone (us-east-1e). Please retry your request by not specifying an Availability Zone or choosing us-east-1a, us-east-1b, us-east-1d, us-east-1f., retrying. - W 02-23 16:42:15 cloud_vm_ray_backend.py:479] botocore.exceptions.ClientError: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-1f). Our system will be working on provisioning additional capacity. 
You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b, us-east-1d. - I 02-23 16:42:15 cloud_vm_ray_backend.py:668] I 02-23 16:42:15 cloud_vm_ray_backend.py:668] Launching on AWS us-east-2 (us-east-2a,us-east-2b,us-east-2c) W 02-23 16:42:26 cloud_vm_ray_backend.py:477] Got error(s) in all zones of us-east-2: W 02-23 16:42:26 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-2a). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2b., retrying. - W 02-23 16:42:26 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-2b). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2a., retrying. - W 02-23 16:42:26 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (Unsupported) when calling the RunInstances operation: Your requested instance type (p3.16xlarge) is not supported in your requested Availability Zone (us-east-2c). Please retry your request by not specifying an Availability Zone or choosing us-east-2a, us-east-2b., retrying. 
- W 02-23 16:42:26 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-2a). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2b., retrying. - W 02-23 16:42:26 cloud_vm_ray_backend.py:479] botocore.exceptions.ClientError: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-2b). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2a. + ... I 02-23 16:42:26 cloud_vm_ray_backend.py:668] I 02-23 16:42:26 cloud_vm_ray_backend.py:668] Launching on AWS us-west-2 (us-west-2a,us-west-2b,us-west-2c,us-west-2d) I 02-23 16:47:04 cloud_vm_ray_backend.py:740] Successfully provisioned or found existing VM. Setup completed. + + +Multiple Candidate GPUs +------------------------- + +If a task can be run on different GPUs, the user can specify multiple candidate GPUs, +and SkyPilot will automatically find the cheapest available GPU. + +To allow SkyPilot to choose any of the candidate GPUs, specify a set of candidate GPUs in the task yaml: + +.. code-block:: yaml + + resources: + accelerators: {A10:1, L4:1, A10g:1} + +In the above example, SkyPilot will try to provision the cheapest available GPU from the set of +A10, L4, and A10g GPUs when launched with :code:`sky launch task.yaml`. + +.. code-block:: console + + $ sky launch task.yaml + ...
+ I 11-19 08:07:45 optimizer.py:910] ----------------------------------------------------------------------------------------------------- + I 11-19 08:07:45 optimizer.py:910] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN + I 11-19 08:07:45 optimizer.py:910] ----------------------------------------------------------------------------------------------------- + I 11-19 08:07:45 optimizer.py:910] Azure Standard_NV6ads_A10_v5 6 55 A10:1 eastus 0.45 ✔ + I 11-19 08:07:45 optimizer.py:910] GCP g2-standard-4 4 16 L4:1 us-east4-a 0.70 + I 11-19 08:07:45 optimizer.py:910] AWS g5.xlarge 4 16 A10G:1 us-east-1 1.01 + I 11-19 08:07:45 optimizer.py:910] ----------------------------------------------------------------------------------------------------- + + + +To specify a preference order, use a list of candidate GPUs in the task yaml: + +.. code-block:: yaml + + resources: + accelerators: [A10:1, A10g:1, L4:1] + +In the above example, SkyPilot will first try to provision an A10 GPU, then an A10g GPU, and finally an L4 GPU. + + +(**Advanced**) Multiple Candidate Resources +-------------------------------------------- + +To specify multiple candidate resources (not only GPUs) for a task, the user can provide a list of candidate resources with a preference annotation: + + +.. code-block:: yaml + + resources: + ordered: # Candidate resources in a preference order + - cloud: gcp + accelerators: A100-80GB + - instance_type: g5.xlarge + - cloud: azure + region: eastus + accelerators: A100 + +To let the candidate resources be chosen in any order, use :code:`any_of` instead: + +.. code-block:: yaml + + resources: + any_of: # Candidate resources that can be chosen in any order + - cloud: gcp + accelerators: A100-80GB + - instance_type: g5.xlarge + - cloud: azure + region: eastus + accelerators: A100 + +.. tip:: + + The list items are specified with a leading prefix :code:`-`, and each item is a dictionary that + includes the fields of a candidate resource.
:code:`ordered` and :code:`any_of` indicate whether the candidate resources should be tried in the given order, or chosen in any order. + +The following example allows only a specific set of regions/clouds for launching the required resource: + +.. code-block:: yaml + + resources: + any_of: + - cloud: aws + region: us-east-2 + accelerators: A100:8 + - cloud: gcp + region: us-central1 + accelerators: A100:8 + +This will generate the following output: + +.. code-block:: console + + $ sky launch -c mycluster task.yaml + ... + I 11-20 14:06:24 optimizer.py:910] ---------------------------------------------------------------------------------------------- + I 11-20 14:06:24 optimizer.py:910] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN + I 11-20 14:06:24 optimizer.py:910] ---------------------------------------------------------------------------------------------- + I 11-20 14:06:24 optimizer.py:910] GCP a2-highgpu-8g 96 680 A100:8 us-central1-a 29.39 ✔ + I 11-20 14:06:24 optimizer.py:910] AWS p4d.24xlarge 96 1152 A100:8 us-east-2 32.77 + I 11-20 14:06:24 optimizer.py:910] ---------------------------------------------------------------------------------------------- + I 11-20 14:06:24 optimizer.py:910] + Launching a new cluster 'mycluster'. Proceed? [Y/n]:
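The difference between the two preference annotations can be sketched with a toy selection model. This is an illustrative sketch only: the helper functions, costs, and availability sets below are hypothetical and are not SkyPilot's actual implementation or API.

```python
# Toy model of candidate-resource selection (hypothetical; not SkyPilot code).
# Each candidate is a (name, hourly_cost) pair; `available` simulates which
# resources a cloud can currently provision.

def pick_any_of(candidates, available):
    """any_of: provision the cheapest candidate that is currently available."""
    usable = [c for c in candidates if c[0] in available]
    return min(usable, key=lambda c: c[1])[0] if usable else None

def pick_ordered(candidates, available):
    """ordered: try candidates in the listed order; take the first available."""
    for name, _cost in candidates:
        if name in available:
            return name
    return None

# Hypothetical hourly costs, loosely mirroring the optimizer table above.
candidates = [("A10", 0.45), ("A10g", 1.01), ("L4", 0.70)]

print(pick_any_of(candidates, {"A10", "A10g", "L4"}))  # cheapest wins: A10
print(pick_ordered(candidates, {"A10g", "L4"}))        # first available: A10g
```

In this sketch, :code:`any_of` mirrors the cost-ranked failover shown in the optimizer tables above, while :code:`ordered` respects the user-given preference order even when a cheaper candidate is available later in the list.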