[Docs] Docs for multiple resources interface (#2803)
* Add docs for multiple resources

* Update docs/source/examples/auto-failover.rst

* Address comments

* reword

* move

---------

Co-authored-by: Romil Bhardwaj <[email protected]>
Michaelvll and romilbhardwaj authored Nov 20, 2023
1 parent feacc9f commit 778d666
Showing 1 changed file with 108 additions and 40 deletions.
148 changes: 108 additions & 40 deletions docs/source/examples/auto-failover.rst
A common high-end GPU to use in deep learning is an NVIDIA V100 GPU. These GPUs
are often in high demand and hard to get. Let's see how SkyPilot's auto-failover
provisioner handles such a request:

.. code-block:: console

   $ sky launch -c gpu --gpus V100
   ... # optimizer output
   I 02-11 21:17:43 cloud_vm_ray_backend.py:624]
   I 02-11 21:17:43 cloud_vm_ray_backend.py:624] Launching on GCP us-central1 (us-central1-a)
   W 02-11 21:17:56 cloud_vm_ray_backend.py:358] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-a (message: The zone 'projects/intercloud-320520/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.)
   ...
   I 02-11 21:18:24 cloud_vm_ray_backend.py:624] Launching on GCP us-central1 (us-central1-f)
   W 02-11 21:18:38 cloud_vm_ray_backend.py:358] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-f (message: The zone 'projects/intercloud-320520/zones/us-central1-f' does not have enough resources available to fulfill the request. Try a different zone, or try again later.)
   I 02-11 21:18:38 cloud_vm_ray_backend.py:624]
Here is an example of cross-cloud failover when requesting 8x V100 GPUs. All
regions in GCP failed to provide the resource, so the provisioner switched to
AWS, where it succeeded after two regions:

.. code-block:: console

   $ sky launch -c v100-8 --gpus V100:8
   ... # optimizer output
   I 02-23 16:39:59 cloud_vm_ray_backend.py:668]
   I 02-23 16:39:59 cloud_vm_ray_backend.py:668] Launching on GCP us-central1 (us-central1-a)
   W 02-23 16:40:17 cloud_vm_ray_backend.py:403] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-a (message: The zone 'projects/intercloud-320520/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.)
   I 02-23 16:40:17 cloud_vm_ray_backend.py:668]
   I 02-23 16:40:17 cloud_vm_ray_backend.py:668] Launching on GCP us-central1 (us-central1-b)
   W 02-23 16:40:35 cloud_vm_ray_backend.py:403] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-b (message: The zone 'projects/intercloud-320520/zones/us-central1-b' does not have enough resources available to fulfill the request. Try a different zone, or try again later.)
   I 02-23 16:40:35 cloud_vm_ray_backend.py:668]
   I 02-23 16:40:35 cloud_vm_ray_backend.py:668] Launching on GCP us-central1 (us-central1-c)
   W 02-23 16:40:55 cloud_vm_ray_backend.py:403] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-c (message: The zone 'projects/intercloud-320520/zones/us-central1-c' does not have enough resources available to fulfill the request. Try a different zone, or try again later.)
   I 02-23 16:40:55 cloud_vm_ray_backend.py:668]
   I 02-23 16:40:55 cloud_vm_ray_backend.py:668] Launching on GCP us-central1 (us-central1-f)
   W 02-23 16:41:13 cloud_vm_ray_backend.py:403] Got QUOTA_EXCEEDED in us-central1-f (message: Quota 'NVIDIA_V100_GPUS' exceeded. Limit: 1.0 in region us-central1.)
   I 02-23 16:41:13 cloud_vm_ray_backend.py:668]
   I 02-23 16:41:13 cloud_vm_ray_backend.py:668] Launching on GCP us-west1 (us-west1-a)
   W 02-23 16:41:31 cloud_vm_ray_backend.py:403] Got QUOTA_EXCEEDED in us-west1-a (message: Quota 'NVIDIA_V100_GPUS' exceeded. Limit: 1.0 in region us-west1.)
   I 02-23 16:41:31 cloud_vm_ray_backend.py:668]
   I 02-23 16:41:31 cloud_vm_ray_backend.py:668] Launching on GCP us-east1 (us-east1-c)
   W 02-23 16:41:50 cloud_vm_ray_backend.py:403] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-east1-c (message: The zone 'projects/intercloud-320520/zones/us-east1-c' does not have enough resources available to fulfill the request. Try a different zone, or try again later.)
   E 02-23 16:41:50 cloud_vm_ray_backend.py:746] Failed to acquire resources in all regions/zones (requested GCP(n1-highmem-8, {'V100': 8.0})). Try changing resource requirements or use another cloud.
   W 02-23 16:41:50 cloud_vm_ray_backend.py:891]
   W 02-23 16:41:50 cloud_vm_ray_backend.py:891] Provision failed for GCP(n1-highmem-8, {'V100': 8.0}). Trying other launchable resources (if any)...
   ...
   I 02-23 16:41:50 cloud_vm_ray_backend.py:668] Launching on AWS us-east-1 (us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f)
   W 02-23 16:42:15 cloud_vm_ray_backend.py:477] Got error(s) in all zones of us-east-1:
   W 02-23 16:42:15 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-1a). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1b, us-east-1d, us-east-1f., retrying.
   W 02-23 16:42:15 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-1b). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1d, us-east-1f., retrying.
   W 02-23 16:42:15 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (Unsupported) when calling the RunInstances operation: Your requested instance type (p3.16xlarge) is not supported in your requested Availability Zone (us-east-1c). Please retry your request by not specifying an Availability Zone or choosing us-east-1a, us-east-1b, us-east-1d, us-east-1f., retrying.
   W 02-23 16:42:15 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b, us-east-1f., retrying.
   W 02-23 16:42:15 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (Unsupported) when calling the RunInstances operation: Your requested instance type (p3.16xlarge) is not supported in your requested Availability Zone (us-east-1e). Please retry your request by not specifying an Availability Zone or choosing us-east-1a, us-east-1b, us-east-1d, us-east-1f., retrying.
   W 02-23 16:42:15 cloud_vm_ray_backend.py:479] botocore.exceptions.ClientError: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-1f). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b, us-east-1d.
   I 02-23 16:42:15 cloud_vm_ray_backend.py:668]
   I 02-23 16:42:15 cloud_vm_ray_backend.py:668] Launching on AWS us-east-2 (us-east-2a,us-east-2b,us-east-2c)
   W 02-23 16:42:26 cloud_vm_ray_backend.py:477] Got error(s) in all zones of us-east-2:
   W 02-23 16:42:26 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-2a). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2b., retrying.
   W 02-23 16:42:26 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-2b). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2a., retrying.
   W 02-23 16:42:26 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (Unsupported) when calling the RunInstances operation: Your requested instance type (p3.16xlarge) is not supported in your requested Availability Zone (us-east-2c). Please retry your request by not specifying an Availability Zone or choosing us-east-2a, us-east-2b., retrying.
   W 02-23 16:42:26 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-2a). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2b., retrying.
   W 02-23 16:42:26 cloud_vm_ray_backend.py:479] botocore.exceptions.ClientError: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-2b). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2a.
   ...
   I 02-23 16:42:26 cloud_vm_ray_backend.py:668]
   I 02-23 16:42:26 cloud_vm_ray_backend.py:668] Launching on AWS us-west-2 (us-west-2a,us-west-2b,us-west-2c,us-west-2d)
   I 02-23 16:47:04 cloud_vm_ray_backend.py:740] Successfully provisioned or found existing VM. Setup completed.

Multiple Candidate GPUs
-------------------------

If a task can run on different GPUs, you can specify multiple candidate GPUs,
and SkyPilot will automatically find the cheapest available one.

To allow SkyPilot to choose any of the candidate GPUs, specify a set of candidate GPUs in the task YAML:

.. code-block:: yaml

   resources:
     accelerators: {A10:1, L4:1, A10g:1}

In the above example, running :code:`sky launch task.yaml` makes SkyPilot try to
provision the cheapest available GPU among A10, L4, and A10g.

.. code-block:: console

   $ sky launch task.yaml
   ...
   I 11-19 08:07:45 optimizer.py:910] -----------------------------------------------------------------------------------------------------
   I 11-19 08:07:45 optimizer.py:910]  CLOUD   INSTANCE                 vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN
   I 11-19 08:07:45 optimizer.py:910] -----------------------------------------------------------------------------------------------------
   I 11-19 08:07:45 optimizer.py:910]  Azure   Standard_NV6ads_A10_v5   6       55        A10:1          eastus        0.45          ✔
   I 11-19 08:07:45 optimizer.py:910]  GCP     g2-standard-4            4       16        L4:1           us-east4-a    0.70
   I 11-19 08:07:45 optimizer.py:910]  AWS     g5.xlarge                4       16        A10G:1         us-east-1     1.01
   I 11-19 08:07:45 optimizer.py:910] -----------------------------------------------------------------------------------------------------

To specify a preference order, use a list of candidate GPUs in the task YAML:

.. code-block:: yaml

   resources:
     accelerators: [A10:1, A10g:1, L4:1]

In the above example, SkyPilot will first try to provision an A10 GPU, then an A10g GPU, and finally an L4 GPU.
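To make this concrete, a complete task file using the preference list might look like the following sketch; the :code:`setup` and :code:`run` commands here are hypothetical placeholders, not taken from the example above:

.. code-block:: yaml

   # Hypothetical task.yaml: try A10 first, then A10g, then L4.
   resources:
     accelerators: [A10:1, A10g:1, L4:1]

   # Placeholder commands for illustration only.
   setup: |
     pip install -r requirements.txt

   run: |
     python train.py

Launching this with :code:`sky launch task.yaml` would walk down the list until one of the GPUs can be provisioned.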


(**Advanced**) Multiple Candidate Resources
--------------------------------------------

To specify multiple candidate resources (not only GPUs), list the candidate resources under a preference annotation:


.. code-block:: yaml

   resources:
     ordered:  # Candidate resources in a preference order
       - cloud: gcp
         accelerators: A100-80GB
       - instance_type: g5.xlarge
       - cloud: azure
         region: eastus
         accelerators: A100

.. code-block:: yaml

   resources:
     any_of:  # Candidate resources that can be chosen in any order
       - cloud: gcp
         accelerators: A100-80GB
       - instance_type: g5.xlarge
       - cloud: azure
         region: eastus
         accelerators: A100

.. tip::

   Each list item starts with a leading :code:`-` and is a dictionary of fields
   describing one candidate resource. Use :code:`ordered` to try the candidates
   in the listed order, or :code:`any_of` to let SkyPilot choose any of them
   (typically the cheapest available).
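As a further sketch (an assumed combination, not shown above), the list items can also vary fields such as :code:`use_spot`, letting SkyPilot fall back between spot and on-demand instances; this assumes that fields outside the list apply to every candidate:

.. code-block:: yaml

   resources:
     accelerators: A100:1
     any_of:
       # Let SkyPilot choose whichever candidate is cheapest and available.
       - use_spot: true   # spot instance
       - use_spot: false  # on-demand instance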

The following example restricts provisioning to a specific set of regions/clouds for the required resource:

.. code-block:: yaml

   resources:
     any_of:
       - cloud: aws
         region: us-east-2
         accelerators: A100:8
       - cloud: gcp
         region: us-central1
         accelerators: A100:8

This generates the following output:

.. code-block:: console

   $ sky launch -c mycluster task.yaml
   ...
   I 11-20 14:06:24 optimizer.py:910] ----------------------------------------------------------------------------------------------
   I 11-20 14:06:24 optimizer.py:910]  CLOUD   INSTANCE        vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN
   I 11-20 14:06:24 optimizer.py:910] ----------------------------------------------------------------------------------------------
   I 11-20 14:06:24 optimizer.py:910]  GCP     a2-highgpu-8g   96      680       A100:8         us-central1-a   29.39         ✔
   I 11-20 14:06:24 optimizer.py:910]  AWS     p4d.24xlarge    96      1152      A100:8         us-east-2       32.77
   I 11-20 14:06:24 optimizer.py:910] ----------------------------------------------------------------------------------------------
   I 11-20 14:06:24 optimizer.py:910]
   Launching a new cluster 'mycluster'. Proceed? [Y/n]:
