[UX] A new look of SkyPilot console outputs (#4023)
* [UX] default to minimal logging (no module/line number/timestamp).

* Fix mypy.

* Fix typing

* Update sky/utils/env_options.py

Co-authored-by: Tian Xia <[email protected]>

* Update sky/utils/env_options.py

Co-authored-by: Tian Xia <[email protected]>

* Account for debug flag.

* Remove prefixes from docs.

* wip

* Optimize the output

* optimize logging

* format

* Update the ux

* fix options

* Fix logs ux for controller

* Add job starting title

* fixes

* keep align

* fix indent

* UX v3

* Format

* UX for launching

* Add UX for setup and mounts

* Fix setup and file mounts

* Fix output

* Refactor output

* Fix

* update

* Change to ⚙️

* New alternative

* cyan for spinner

* address comments

* format

* format

* refactor and fix

* format

* format

* controller logs

* fix serve ux

* Updated serve UX

* Fix serve ux

* format

* Fix backward compat job log

* Fix streaming for old clusters

* Fix nested status

* fix status

* Add looking for resources spinner

* Fix azure logging

* format

* format

* Fix old provisioner

* add a new internal IP for Lambda

* fix multi-worker for old provisioner

* Avoid error out for refresh in teardown

* format

* Fix k8s output

* Fixes

* fix

* Fix

* format

* Fix smoke minimal

* Fix validating minimal

* fix managed job tests

* address comments

* dim indent and green finish line

* Fix optimizer output

* Fix nested rich status

* format

* reducing refreshing frequency

* remove accidentally added file

* update docs

* update docs

* increase initial delay for smoke test

* A diff icon

* increase refresh frequency

* minor

* fix

* fix message

* Update sky/backends/cloud_vm_ray_backend.py

Co-authored-by: Zongheng Yang <[email protected]>

* fix

* fix the smoke test yaml

* fix

* Add docstr

* fix

* shorten style / fore

* rename class

* move constants

* Add indent symbol for instance up

* Update controller setup

* format

* rename env_key

* minor move

* format

---------

Co-authored-by: Zongheng Yang <[email protected]>
Co-authored-by: Tian Xia <[email protected]>
3 people authored Oct 12, 2024
1 parent fdd68b2 commit d63497c
Showing 39 changed files with 1,004 additions and 644 deletions.
107 changes: 58 additions & 49 deletions docs/source/examples/auto-failover.rst
@@ -53,26 +53,26 @@ Cross-region failover

The provisioner first retries across all regions within a task's chosen cloud.

A common high-end GPU to use in deep learning is a NVIDIA V100 GPU. These GPUs
A common high-end GPU to use in AI is a NVIDIA A100 GPU. These GPUs
are often in high demand and hard to get. Let's see how SkyPilot's auto-failover
provisioner handles such a request:

.. code-block:: console
$ sky launch -c gpu --gpus V100
... # optimizer output
I 02-11 21:17:43 cloud_vm_ray_backend.py:1034] Creating a new cluster: "gpu" [1x GCP(n1-highmem-8, {'V100': 1.0})].
I 02-11 21:17:43 cloud_vm_ray_backend.py:1034] Tip: to reuse an existing cluster, specify --cluster-name (-c) in the CLI or use sky.launch(.., cluster_name=..) in the Python API. Run `sky status` to see existing clusters.
I 02-11 21:17:43 cloud_vm_ray_backend.py:614] To view detailed progress: tail -n100 -f sky_logs/sky-2022-02-11-21-17-43-171661/provision.log
I 02-11 21:17:43 cloud_vm_ray_backend.py:624]
I 02-11 21:17:43 cloud_vm_ray_backend.py:624] Launching on GCP us-central1 (us-central1-a)
W 02-11 21:17:56 cloud_vm_ray_backend.py:358] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-a (message: The zone 'projects/intercloud-320520/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.)
$ sky launch -c gpu --gpus A100
...
Launching a new cluster 'gpu'. Proceed? [Y/n]:
⚙️ Launching on GCP us-central1 (us-central1-a).
W 10-11 18:25:57 instance_utils.py:112] Got return codes 'VM_MIN_COUNT_NOT_REACHED', 'ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS' in us-central1-a: 'Requested minimum count of 1 VMs could not be created'; "The zone 'projects/xxxxxx/zones/us-central1-a' does not have enough resources available to fulfill the request. '(resource type:compute)'"
...
⚙️ Launching on GCP us-central1 (us-central1-f)
...
I 02-11 21:18:24 cloud_vm_ray_backend.py:624] Launching on GCP us-central1 (us-central1-f)
W 02-11 21:18:38 cloud_vm_ray_backend.py:358] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-f (message: The zone 'projects/intercloud-320520/zones/us-central1-f' does not have enough resources available to fulfill the request. Try a different zone, or try again later.)
I 02-11 21:18:38 cloud_vm_ray_backend.py:624]
I 02-11 21:18:38 cloud_vm_ray_backend.py:624] Launching on GCP us-west1 (us-west1-a)
Successfully connected to 35.230.120.87.
⚙️ Launching on GCP us-west1 (us-west1-a)
...
✓ Cluster launched: a100-8. View logs at: ~/sky_logs/sky-2024-10-11-18-32-48-894132/provision.log
GCP was chosen as the best cloud to run the task. There was no capacity in any of the regions in US Central, so the auto-failover provisioner moved to US West instead, allowing for our instance to be successfully provisioned.

@@ -81,28 +81,37 @@ Cross-cloud failover
If all regions within the chosen cloud failed, the provisioner retries on the next
cheapest cloud.

Here is an example of cross-cloud failover when requesting 8x V100 GPUs. All
regions in GCP failed to provide the resource, so the provisioner switched to
AWS, where it succeeded after two regions:
Here is an example of cross-cloud failover when requesting 8x A100 GPUs. All
regions in Azure failed to provide the resource, so the provisioner switched to
GCP, where it succeeded after one region:

.. code-block:: console
$ sky launch -c v100-8 --gpus V100:8
... # optimizer output
I 02-23 16:39:59 cloud_vm_ray_backend.py:1010] Creating a new cluster: "v100-8" [1x GCP(n1-highmem-8, {'V100': 8.0})].
I 02-23 16:39:59 cloud_vm_ray_backend.py:1010] Tip: to reuse an existing cluster, specify --cluster-name (-c) in the CLI or use sky.launch(.., cluster_name=..) in the Python API. Run `sky status` to see existing clusters.
I 02-23 16:39:59 cloud_vm_ray_backend.py:658] To view detailed progress: tail -n100 -f sky_logs/sky-2022-02-23-16-39-58-577551/provision.log
I 02-23 16:39:59 cloud_vm_ray_backend.py:668]
I 02-23 16:39:59 cloud_vm_ray_backend.py:668] Launching on GCP us-central1 (us-central1-a)
W 02-23 16:40:17 cloud_vm_ray_backend.py:403] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-a (message: The zone 'projects/intercloud-320520/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.)
$ sky launch -c a100-8 --gpus A100:8
Considered resources (1 node):
----------------------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
----------------------------------------------------------------------------------------------------
Azure Standard_ND96asr_v4 96 900 A100:8 eastus 27.20 ✔
GCP a2-highgpu-8g 96 680 A100:8 us-central1-a 29.39
AWS p4d.24xlarge 96 1152 A100:8 us-east-1 32.77
----------------------------------------------------------------------------------------------------
Launching a new cluster 'a100-8'. Proceed? [Y/n]:
...
⚙️ Launching on Azure eastus.
E 10-11 18:24:59 instance.py:457] Failed to create instances: [azure.core.exceptions.HttpResponseError] (InvalidTemplateDeployment)
sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in eastus
...
I 02-23 16:42:15 cloud_vm_ray_backend.py:668] Launching on AWS us-east-2 (us-east-2a,us-east-2b,us-east-2c)
W 02-23 16:42:26 cloud_vm_ray_backend.py:477] Got error(s) in all zones of us-east-2:
W 02-23 16:42:26 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-2a). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2b., retrying.
⚙️ Launching on GCP us-central1 (us-central1-a).
W 10-11 18:25:57 instance_utils.py:112] Got return codes 'VM_MIN_COUNT_NOT_REACHED', 'ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS' in us-central1-a: 'Requested minimum count of 1 VMs could not be created'; "The zone 'projects/xxxxxx/zones/us-central1-a' does not have enough resources available to fulfill the request. '(resource type:compute)'"
...
I 02-23 16:42:26 cloud_vm_ray_backend.py:668]
I 02-23 16:42:26 cloud_vm_ray_backend.py:668] Launching on AWS us-west-2 (us-west-2a,us-west-2b,us-west-2c,us-west-2d)
I 02-23 16:47:04 cloud_vm_ray_backend.py:740] Successfully provisioned or found existing VM. Setup completed.
⚙️ Launching on GCP us-central1 (us-central1-b).
Instance is up.
✓ Cluster launched: a100-8. View logs at: ~/sky_logs/sky-2024-10-11-18-24-14-357884/provision.log
Multiple Candidate GPUs
@@ -125,13 +134,13 @@ A10, L4, and A10g GPUs, using :code:`sky launch task.yaml`.
$ sky launch task.yaml
...
I 11-19 08:07:45 optimizer.py:910] -----------------------------------------------------------------------------------------------------
I 11-19 08:07:45 optimizer.py:910] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 11-19 08:07:45 optimizer.py:910] -----------------------------------------------------------------------------------------------------
I 11-19 08:07:45 optimizer.py:910] Azure Standard_NV6ads_A10_v5 6 55 A10:1 eastus 0.45 ✔
I 11-19 08:07:45 optimizer.py:910] GCP g2-standard-4 4 16 L4:1 us-east4-a 0.70
I 11-19 08:07:45 optimizer.py:910] AWS g5.xlarge 4 16 A10G:1 us-east-1 1.01
I 11-19 08:07:45 optimizer.py:910] -----------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
-----------------------------------------------------------------------------------------------------
Azure Standard_NV6ads_A10_v5 6 55 A10:1 eastus 0.45 ✔
GCP g2-standard-4 4 16 L4:1 us-east4-a 0.70
AWS g5.xlarge 4 16 A10G:1 us-east-1 1.01
-----------------------------------------------------------------------------------------------------
@@ -212,15 +221,15 @@ This will generate the following output:
$ sky launch -c mycluster task.yaml
...
I 12-20 23:55:56 optimizer.py:717]
I 12-20 23:55:56 optimizer.py:840] Considered resources (1 node):
I 12-20 23:55:56 optimizer.py:910] ---------------------------------------------------------------------------------------------
I 12-20 23:55:56 optimizer.py:910] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 12-20 23:55:56 optimizer.py:910] ---------------------------------------------------------------------------------------------
I 12-20 23:55:56 optimizer.py:910] GCP g2-standard-96 96 384 L4:8 us-east4-a 7.98 ✔
I 12-20 23:55:56 optimizer.py:910] AWS g5.48xlarge 192 768 A10G:8 us-east-1 16.29
I 12-20 23:55:56 optimizer.py:910] GCP a2-highgpu-8g 96 680 A100:8 us-east1-b 29.39
I 12-20 23:55:56 optimizer.py:910] AWS p4d.24xlarge 96 1152 A100:8 us-east-1 32.77
I 12-20 23:55:56 optimizer.py:910] ---------------------------------------------------------------------------------------------
I 12-20 23:55:56 optimizer.py:910]
Considered resources (1 node):
---------------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
---------------------------------------------------------------------------------------------
GCP g2-standard-96 96 384 L4:8 us-east4-a 7.98 ✔
AWS g5.48xlarge 192 768 A10G:8 us-east-1 16.29
GCP a2-highgpu-8g 96 680 A100:8 us-east1-b 29.39
AWS p4d.24xlarge 96 1152 A100:8 us-east-1 32.77
---------------------------------------------------------------------------------------------
Launching a new cluster 'mycluster'. Proceed? [Y/n]:
4 changes: 3 additions & 1 deletion sky/adaptors/azure.py
@@ -20,7 +20,9 @@
azure = common.LazyImport(
'azure',
import_error_message=('Failed to import dependencies for Azure.'
'Try pip install "skypilot[azure]"'))
'Try pip install "skypilot[azure]"'),
set_loggers=lambda: logging.getLogger('azure.identity').setLevel(logging.
ERROR))
Client = Any
sky_logger = sky_logging.init_logger(__name__)

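The azure adaptor now passes a `set_loggers` callback to `LazyImport` (the hook is added in `sky/adaptors/common.py` below) so the chatty `azure.identity` logger is quieted as soon as the Azure SDK is actually imported. A standalone sketch of what that lambda does, using only the standard library; the sample messages are illustrative:

```python
import logging

# Same effect as the set_loggers callback above: raise the threshold of the
# 'azure.identity' logger so anything below ERROR is filtered out.
logging.getLogger('azure.identity').setLevel(logging.ERROR)

azure_identity_logger = logging.getLogger('azure.identity')
azure_identity_logger.warning('routine credential chatter')  # filtered out
azure_identity_logger.error('real failure')                  # still emitted
```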
8 changes: 6 additions & 2 deletions sky/adaptors/common.py
@@ -1,7 +1,7 @@
"""Lazy import for modules to avoid import error when not used."""
import functools
import importlib
from typing import Any, Optional, Tuple
from typing import Any, Callable, Optional, Tuple


class LazyImport:
@@ -18,15 +18,19 @@ class LazyImport:

def __init__(self,
module_name: str,
import_error_message: Optional[str] = None):
import_error_message: Optional[str] = None,
set_loggers: Optional[Callable] = None):
self._module_name = module_name
self._module = None
self._import_error_message = import_error_message
self._set_loggers = set_loggers

def load_module(self):
if self._module is None:
try:
self._module = importlib.import_module(self._module_name)
if self._set_loggers is not None:
self._set_loggers()
except ImportError as e:
if self._import_error_message is not None:
raise ImportError(self._import_error_message) from e
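With this change, `LazyImport` accepts an optional `set_loggers` callable that runs once, right after the wrapped module is first imported successfully. A usage sketch under the same pattern as the azure adaptor above; the wrapped module and logger names here are illustrative, not part of the change:

```python
import logging

from sky.adaptors import common

# Hypothetical adaptor: lazily wrap 'requests' and quiet urllib3 once loaded.
requests_lib = common.LazyImport(
    'requests',
    import_error_message='Failed to import requests. Try: pip install requests',
    set_loggers=lambda: logging.getLogger('urllib3').setLevel(logging.ERROR))

# Nothing is imported at definition time; the callback fires on first load.
requests_lib.load_module()
```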
13 changes: 9 additions & 4 deletions sky/backends/backend.py
@@ -4,7 +4,9 @@

import sky
from sky.usage import usage_lib
from sky.utils import rich_utils
from sky.utils import timeline
from sky.utils import ux_utils

if typing.TYPE_CHECKING:
from sky import resources
@@ -54,8 +56,9 @@ def provision(
cluster_name = sky.backends.backend_utils.generate_cluster_name()
usage_lib.record_cluster_name_for_current_operation(cluster_name)
usage_lib.messages.usage.update_actual_task(task)
return self._provision(task, to_provision, dryrun, stream_logs,
cluster_name, retry_until_up)
with rich_utils.safe_status(ux_utils.spinner_message('Launching')):
return self._provision(task, to_provision, dryrun, stream_logs,
cluster_name, retry_until_up)

@timeline.event
@usage_lib.messages.usage.update_runtime('sync_workdir')
@@ -76,7 +79,8 @@ def sync_file_mounts(
@usage_lib.messages.usage.update_runtime('setup')
def setup(self, handle: _ResourceHandleType, task: 'task_lib.Task',
detach_setup: bool) -> None:
return self._setup(handle, task, detach_setup)
with rich_utils.safe_status(ux_utils.spinner_message('Running setup')):
return self._setup(handle, task, detach_setup)

def add_storage_objects(self, task: 'task_lib.Task') -> None:
raise NotImplementedError
@@ -96,7 +100,8 @@ def execute(self,
usage_lib.record_cluster_name_for_current_operation(
handle.get_cluster_name())
usage_lib.messages.usage.update_actual_task(task)
return self._execute(handle, task, detach_run, dryrun)
with rich_utils.safe_status(ux_utils.spinner_message('Submitting job')):
return self._execute(handle, task, detach_run, dryrun)

@timeline.event
def post_execute(self, handle: _ResourceHandleType, down: bool) -> None:
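The backend entry points (`provision`, `setup`, `execute`) now render a spinner for their long-running phases by wrapping the work in `rich_utils.safe_status` with a `ux_utils.spinner_message`. A minimal sketch of that pattern; the message text and the placeholder body are illustrative:

```python
from sky.utils import rich_utils
from sky.utils import ux_utils


def run_setup_step():
    # While the block executes, the console shows a single spinner line with
    # this message instead of raw log output.
    with rich_utils.safe_status(ux_utils.spinner_message('Running setup')):
        ...  # placeholder for the actual long-running work
```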
29 changes: 13 additions & 16 deletions sky/backends/backend_utils.py
@@ -70,9 +70,6 @@
SKY_REMOTE_PATH = '~/.sky/wheels'
SKY_USER_FILE_PATH = '~/.sky/generated'

BOLD = '\033[1m'
RESET_BOLD = '\033[0m'

# Do not use /tmp because it gets cleared on VM restart.
_SKY_REMOTE_FILE_MOUNTS_DIR = '~/.sky/file_mounts/'

@@ -1171,7 +1168,8 @@ def wait_until_ray_cluster_ready(
runner = command_runner.SSHCommandRunner(node=(head_ip, 22),
**ssh_credentials)
with rich_utils.safe_status(
'[bold cyan]Waiting for workers...') as worker_status:
ux_utils.spinner_message('Waiting for workers',
log_path=log_path)) as worker_status:
while True:
rc, output, stderr = runner.run(
instance_setup.RAY_STATUS_WITH_SKY_RAY_PORT_COMMAND,
@@ -1187,9 +1185,11 @@
ready_head, ready_workers = _count_healthy_nodes_from_ray(
output, is_local_cloud=is_local_cloud)

worker_status.update('[bold cyan]'
f'{ready_workers} out of {num_nodes - 1} '
'workers ready')
worker_status.update(
ux_utils.spinner_message(
f'{ready_workers} out of {num_nodes - 1} '
'workers ready',
log_path=log_path))

# In the local case, ready_head=0 and ready_workers=num_nodes. This
# is because there is no matching regex for _LAUNCHED_HEAD_PATTERN.
@@ -1304,7 +1304,6 @@ def parallel_data_transfer_to_nodes(
stream_logs: bool; Whether to stream logs to stdout
source_bashrc: bool; Source bashrc before running the command.
"""
fore = colorama.Fore
style = colorama.Style

origin_source = source
@@ -1341,12 +1340,10 @@ def _sync_node(runner: 'command_runner.CommandRunner') -> None:

num_nodes = len(runners)
plural = 's' if num_nodes > 1 else ''
message = (f'{fore.CYAN}{action_message} (to {num_nodes} node{plural})'
f': {style.BRIGHT}{origin_source}{style.RESET_ALL} -> '
f'{style.BRIGHT}{target}{style.RESET_ALL}')
message = (f' {style.DIM}{action_message} (to {num_nodes} node{plural})'
f': {origin_source} -> {target}{style.RESET_ALL}')
logger.info(message)
with rich_utils.safe_status(f'[bold cyan]{action_message}[/]'):
subprocess_utils.run_in_parallel(_sync_node, runners)
subprocess_utils.run_in_parallel(_sync_node, runners)


def check_local_gpus() -> bool:
@@ -2488,9 +2485,9 @@ def get_clusters(
progress = rich_progress.Progress(transient=True,
redirect_stdout=False,
redirect_stderr=False)
task = progress.add_task(
f'[bold cyan]Refreshing status for {len(records)} cluster{plural}[/]',
total=len(records))
task = progress.add_task(ux_utils.spinner_message(
f'Refreshing status for {len(records)} cluster{plural}'),
total=len(records))

def _refresh_cluster(cluster_name):
try:
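Inside a running `safe_status` block the message can also be refreshed in place, which is how the worker-wait loop above reports progress. A small sketch of that update pattern; the worker count and loop are illustrative:

```python
from sky.utils import rich_utils
from sky.utils import ux_utils

total_workers = 3
with rich_utils.safe_status(
        ux_utils.spinner_message('Waiting for workers')) as status:
    for ready in range(total_workers + 1):
        # Re-render the single spinner line as more workers come up.
        status.update(
            ux_utils.spinner_message(
                f'{ready} out of {total_workers} workers ready'))
```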