[Docs] Docs for multiple resources interface (#2803)
* Add docs for multiple resources

* Update docs/source/examples/auto-failover.rst

* Address comments

* reword

* move

---------

Co-authored-by: Romil Bhardwaj <[email protected]>
Michaelvll and romilbhardwaj authored Nov 20, 2023
1 parent feacc9f commit 778d666
Showing 1 changed file with 108 additions and 40 deletions.
148 changes: 108 additions & 40 deletions docs/source/examples/auto-failover.rst
A common high-end GPU to use in deep learning is an NVIDIA V100 GPU. These GPUs
are often in high demand and hard to get. Let's see how SkyPilot's auto-failover
provisioner handles such a request:

.. code-block:: console

   $ sky launch -c gpu --gpus V100
   ... # optimizer output
   I 02-11 21:17:43 cloud_vm_ray_backend.py:624]
   I 02-11 21:17:43 cloud_vm_ray_backend.py:624] Launching on GCP us-central1 (us-central1-a)
   W 02-11 21:17:56 cloud_vm_ray_backend.py:358] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-a (message: The zone 'projects/intercloud-320520/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.)
   ...
   I 02-11 21:18:24 cloud_vm_ray_backend.py:624] Launching on GCP us-central1 (us-central1-f)
   W 02-11 21:18:38 cloud_vm_ray_backend.py:358] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-f (message: The zone 'projects/intercloud-320520/zones/us-central1-f' does not have enough resources available to fulfill the request. Try a different zone, or try again later.)
   I 02-11 21:18:38 cloud_vm_ray_backend.py:624]
Here is an example of cross-cloud failover when requesting 8x V100 GPUs. All
regions in GCP failed to provide the resource, so the provisioner switched to
AWS, where it succeeded after two regions:

.. code-block:: console

   $ sky launch -c v100-8 --gpus V100:8
   ... # optimizer output
   I 02-23 16:39:59 cloud_vm_ray_backend.py:668]
   I 02-23 16:39:59 cloud_vm_ray_backend.py:668] Launching on GCP us-central1 (us-central1-a)
   W 02-23 16:40:17 cloud_vm_ray_backend.py:403] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-a (message: The zone 'projects/intercloud-320520/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.)
   I 02-23 16:40:17 cloud_vm_ray_backend.py:668]
   I 02-23 16:40:17 cloud_vm_ray_backend.py:668] Launching on GCP us-central1 (us-central1-b)
   W 02-23 16:40:35 cloud_vm_ray_backend.py:403] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-b (message: The zone 'projects/intercloud-320520/zones/us-central1-b' does not have enough resources available to fulfill the request. Try a different zone, or try again later.)
   I 02-23 16:40:35 cloud_vm_ray_backend.py:668]
   I 02-23 16:40:35 cloud_vm_ray_backend.py:668] Launching on GCP us-central1 (us-central1-c)
   W 02-23 16:40:55 cloud_vm_ray_backend.py:403] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-c (message: The zone 'projects/intercloud-320520/zones/us-central1-c' does not have enough resources available to fulfill the request. Try a different zone, or try again later.)
   I 02-23 16:40:55 cloud_vm_ray_backend.py:668]
   I 02-23 16:40:55 cloud_vm_ray_backend.py:668] Launching on GCP us-central1 (us-central1-f)
   W 02-23 16:41:13 cloud_vm_ray_backend.py:403] Got QUOTA_EXCEEDED in us-central1-f (message: Quota 'NVIDIA_V100_GPUS' exceeded. Limit: 1.0 in region us-central1.)
   I 02-23 16:41:13 cloud_vm_ray_backend.py:668]
   I 02-23 16:41:13 cloud_vm_ray_backend.py:668] Launching on GCP us-west1 (us-west1-a)
   W 02-23 16:41:31 cloud_vm_ray_backend.py:403] Got QUOTA_EXCEEDED in us-west1-a (message: Quota 'NVIDIA_V100_GPUS' exceeded. Limit: 1.0 in region us-west1.)
   I 02-23 16:41:31 cloud_vm_ray_backend.py:668]
   I 02-23 16:41:31 cloud_vm_ray_backend.py:668] Launching on GCP us-east1 (us-east1-c)
   W 02-23 16:41:50 cloud_vm_ray_backend.py:403] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-east1-c (message: The zone 'projects/intercloud-320520/zones/us-east1-c' does not have enough resources available to fulfill the request. Try a different zone, or try again later.)
   E 02-23 16:41:50 cloud_vm_ray_backend.py:746] Failed to acquire resources in all regions/zones (requested GCP(n1-highmem-8, {'V100': 8.0})). Try changing resource requirements or use another cloud.
   W 02-23 16:41:50 cloud_vm_ray_backend.py:891]
   W 02-23 16:41:50 cloud_vm_ray_backend.py:891] Provision failed for GCP(n1-highmem-8, {'V100': 8.0}). Trying other launchable resources (if any)...
   ...
   I 02-23 16:41:50 cloud_vm_ray_backend.py:668] Launching on AWS us-east-1 (us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f)
   W 02-23 16:42:15 cloud_vm_ray_backend.py:477] Got error(s) in all zones of us-east-1:
   W 02-23 16:42:15 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-1a). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1b, us-east-1d, us-east-1f., retrying.
   W 02-23 16:42:15 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-1b). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1d, us-east-1f., retrying.
   W 02-23 16:42:15 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (Unsupported) when calling the RunInstances operation: Your requested instance type (p3.16xlarge) is not supported in your requested Availability Zone (us-east-1c). Please retry your request by not specifying an Availability Zone or choosing us-east-1a, us-east-1b, us-east-1d, us-east-1f., retrying.
   W 02-23 16:42:15 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b, us-east-1f., retrying.
   W 02-23 16:42:15 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (Unsupported) when calling the RunInstances operation: Your requested instance type (p3.16xlarge) is not supported in your requested Availability Zone (us-east-1e). Please retry your request by not specifying an Availability Zone or choosing us-east-1a, us-east-1b, us-east-1d, us-east-1f., retrying.
   W 02-23 16:42:15 cloud_vm_ray_backend.py:479] botocore.exceptions.ClientError: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-1f). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b, us-east-1d.
   I 02-23 16:42:15 cloud_vm_ray_backend.py:668]
   I 02-23 16:42:15 cloud_vm_ray_backend.py:668] Launching on AWS us-east-2 (us-east-2a,us-east-2b,us-east-2c)
   W 02-23 16:42:26 cloud_vm_ray_backend.py:477] Got error(s) in all zones of us-east-2:
   W 02-23 16:42:26 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-2a). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2b., retrying.
   W 02-23 16:42:26 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-2b). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2a., retrying.
   W 02-23 16:42:26 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (Unsupported) when calling the RunInstances operation: Your requested instance type (p3.16xlarge) is not supported in your requested Availability Zone (us-east-2c). Please retry your request by not specifying an Availability Zone or choosing us-east-2a, us-east-2b., retrying.
   W 02-23 16:42:26 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-2a). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2b., retrying.
   W 02-23 16:42:26 cloud_vm_ray_backend.py:479] botocore.exceptions.ClientError: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-2b). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2a.
   ...
   I 02-23 16:42:26 cloud_vm_ray_backend.py:668]
   I 02-23 16:42:26 cloud_vm_ray_backend.py:668] Launching on AWS us-west-2 (us-west-2a,us-west-2b,us-west-2c,us-west-2d)
   I 02-23 16:47:04 cloud_vm_ray_backend.py:740] Successfully provisioned or found existing VM. Setup completed.

Multiple Candidate GPUs
-------------------------

If a task can run on different GPUs, you can specify multiple candidate GPUs,
and SkyPilot will automatically find the cheapest available one.

To allow SkyPilot to choose any of the candidate GPUs, specify a set of candidate GPUs in the task YAML:

.. code-block:: yaml

   resources:
     accelerators: {A10:1, L4:1, A10g:1}

In the above example, running :code:`sky launch task.yaml` makes SkyPilot try to
provision the cheapest available GPU among A10, L4, and A10g.

.. code-block:: console

   $ sky launch task.yaml
   ...
   I 11-19 08:07:45 optimizer.py:910] -----------------------------------------------------------------------------------------------------
   I 11-19 08:07:45 optimizer.py:910]  CLOUD   INSTANCE                 vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN
   I 11-19 08:07:45 optimizer.py:910] -----------------------------------------------------------------------------------------------------
   I 11-19 08:07:45 optimizer.py:910]  Azure   Standard_NV6ads_A10_v5   6       55        A10:1          eastus        0.45          ✔
   I 11-19 08:07:45 optimizer.py:910]  GCP     g2-standard-4            4       16        L4:1           us-east4-a    0.70
   I 11-19 08:07:45 optimizer.py:910]  AWS     g5.xlarge                4       16        A10G:1         us-east-1     1.01
   I 11-19 08:07:45 optimizer.py:910] -----------------------------------------------------------------------------------------------------

To specify a preference order, use a list of candidate GPUs in the task YAML:

.. code-block:: yaml

   resources:
     accelerators: [A10:1, A10g:1, L4:1]

In the above example, SkyPilot will first try to provision an A10 GPU, then an A10g GPU, and finally an L4 GPU.
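To make this concrete, a complete task file using the preference list might look like the following sketch; the :code:`setup` and :code:`run` commands here are hypothetical placeholders, not taken from the example above:

.. code-block:: yaml

   # Hypothetical task.yaml: try A10 first, then A10g, then L4.
   resources:
     accelerators: [A10:1, A10g:1, L4:1]

   # Placeholder commands for illustration only.
   setup: |
     pip install -r requirements.txt

   run: |
     python train.py

Launching this with :code:`sky launch task.yaml` would walk down the list until one of the GPUs can be provisioned.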


(**Advanced**) Multiple Candidate Resources
--------------------------------------------

To specify multiple candidate resources (not only GPUs), list the candidate resources under a preference annotation:


.. code-block:: yaml

   resources:
     ordered:  # Candidate resources in a preference order
       - cloud: gcp
         accelerators: A100-80GB
       - instance_type: g5.xlarge
       - cloud: azure
         region: eastus
         accelerators: A100

.. code-block:: yaml

   resources:
     any_of:  # Candidate resources that can be chosen in any order
       - cloud: gcp
         accelerators: A100-80GB
       - instance_type: g5.xlarge
       - cloud: azure
         region: eastus
         accelerators: A100

.. tip::

   Each list item starts with a leading :code:`-` and is a dictionary of fields
   describing one candidate resource. Use :code:`ordered` to try the candidates
   in the listed order, or :code:`any_of` to let SkyPilot choose any of them
   (typically the cheapest available).
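As a further sketch (an assumed combination, not shown above), the list items can also vary fields such as :code:`use_spot`, letting SkyPilot fall back between spot and on-demand instances; this assumes that fields outside the list apply to every candidate:

.. code-block:: yaml

   resources:
     accelerators: A100:1
     any_of:
       # Let SkyPilot choose whichever candidate is cheapest and available.
       - use_spot: true   # spot instance
       - use_spot: false  # on-demand instance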

The following example restricts provisioning to a specific set of regions/clouds for the required resource:

.. code-block:: yaml

   resources:
     any_of:
       - cloud: aws
         region: us-east-2
         accelerators: A100:8
       - cloud: gcp
         region: us-central1
         accelerators: A100:8

This generates the following output:

.. code-block:: console

   $ sky launch -c mycluster task.yaml
   ...
   I 11-20 14:06:24 optimizer.py:910] ----------------------------------------------------------------------------------------------
   I 11-20 14:06:24 optimizer.py:910]  CLOUD   INSTANCE        vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN
   I 11-20 14:06:24 optimizer.py:910] ----------------------------------------------------------------------------------------------
   I 11-20 14:06:24 optimizer.py:910]  GCP     a2-highgpu-8g   96      680       A100:8         us-central1-a   29.39         ✔
   I 11-20 14:06:24 optimizer.py:910]  AWS     p4d.24xlarge    96      1152      A100:8         us-east-2       32.77
   I 11-20 14:06:24 optimizer.py:910] ----------------------------------------------------------------------------------------------
   I 11-20 14:06:24 optimizer.py:910]
   Launching a new cluster 'mycluster'. Proceed? [Y/n]:
