[UX] A new look of SkyPilot console outputs (#4023)
* [UX] default to minimal logging (no module/line number/timestamp).

* Fix mypy.

* Fix typing

* Update sky/utils/env_options.py

Co-authored-by: Tian Xia <[email protected]>

* Update sky/utils/env_options.py

Co-authored-by: Tian Xia <[email protected]>

* Account for debug flag.

* Remove prefixes from docs.

* wip

* Optimize the output

* optimize logging

* format

* Update the ux

* fix options

* Fix logs ux for controller

* Add job starting title

* fixes

* keep align

* fix indent

* UX v3

* Format

* UX for launching

* Add UX for setup and mounts

* Fix setup and file mounts

* Fix output

* Refactor output

* Fix

* update

* Change to ⚙️

* New alternative

* cyan for spinner

* address comments

* format

* format

* refactor and fix

* format

* format

* controller logs

* fix serve ux

* Updated serve UX

* Fix serve ux

* format

* Fix backward compat job log

* Fix streaming for old clusters

* Fix nested status

* fix status

* Add looking for resources spinner

* Fix azure logging

* format

* format

* Fix old provisioner

* add a new internal IP for Lambda

* fix multi-worker for old provisioner

* Avoid error out for refresh in teardown

* format

* Fix k8s output

* Fixes

* fix

* Fix

* format

* Fix smoke minimal

* Fix validating minimal

* fix managed job tests

* address comments

* dim indent and green finish line

* Fix optimizer output

* Fix nested rich status

* format

* reducing refreshing frequency

* remove accidentally added file

* update docs

* update docs

* increase initial delay for smoke test

* A diff icon

* increase refresh frequency

* minor

* fix

* fix message

* Update sky/backends/cloud_vm_ray_backend.py

Co-authored-by: Zongheng Yang <[email protected]>

* fix

* fix the smoke test yaml

* fix

* Add docstr

* fix

* shorten style / fore

* rename class

* move constants

* Add indent symbol for instance up

* Update controller setup

* format

* rename env_key

* minor move

* format

---------

Co-authored-by: Zongheng Yang <[email protected]>
Co-authored-by: Tian Xia <[email protected]>
3 people authored Oct 12, 2024
1 parent fdd68b2 commit d63497c
Showing 39 changed files with 1,004 additions and 644 deletions.
107 changes: 58 additions & 49 deletions docs/source/examples/auto-failover.rst
@@ -53,26 +53,26 @@ Cross-region failover

The provisioner first retries across all regions within a task's chosen cloud.

A common high-end GPU to use in deep learning is a NVIDIA V100 GPU. These GPUs
A common high-end GPU to use in AI is a NVIDIA A100 GPU. These GPUs
are often in high demand and hard to get. Let's see how SkyPilot's auto-failover
provisioner handles such a request:

.. code-block:: console
$ sky launch -c gpu --gpus V100
... # optimizer output
I 02-11 21:17:43 cloud_vm_ray_backend.py:1034] Creating a new cluster: "gpu" [1x GCP(n1-highmem-8, {'V100': 1.0})].
I 02-11 21:17:43 cloud_vm_ray_backend.py:1034] Tip: to reuse an existing cluster, specify --cluster-name (-c) in the CLI or use sky.launch(.., cluster_name=..) in the Python API. Run `sky status` to see existing clusters.
I 02-11 21:17:43 cloud_vm_ray_backend.py:614] To view detailed progress: tail -n100 -f sky_logs/sky-2022-02-11-21-17-43-171661/provision.log
I 02-11 21:17:43 cloud_vm_ray_backend.py:624]
I 02-11 21:17:43 cloud_vm_ray_backend.py:624] Launching on GCP us-central1 (us-central1-a)
W 02-11 21:17:56 cloud_vm_ray_backend.py:358] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-a (message: The zone 'projects/intercloud-320520/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.)
$ sky launch -c gpu --gpus A100
...
Launching a new cluster 'gpu'. Proceed? [Y/n]:
⚙️ Launching on GCP us-central1 (us-central1-a).
W 10-11 18:25:57 instance_utils.py:112] Got return codes 'VM_MIN_COUNT_NOT_REACHED', 'ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS' in us-central1-a: 'Requested minimum count of 1 VMs could not be created'; "The zone 'projects/xxxxxx/zones/us-central1-a' does not have enough resources available to fulfill the request. '(resource type:compute)'"
...
⚙️ Launching on GCP us-central1 (us-central1-f)
...
I 02-11 21:18:24 cloud_vm_ray_backend.py:624] Launching on GCP us-central1 (us-central1-f)
W 02-11 21:18:38 cloud_vm_ray_backend.py:358] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-f (message: The zone 'projects/intercloud-320520/zones/us-central1-f' does not have enough resources available to fulfill the request. Try a different zone, or try again later.)
I 02-11 21:18:38 cloud_vm_ray_backend.py:624]
I 02-11 21:18:38 cloud_vm_ray_backend.py:624] Launching on GCP us-west1 (us-west1-a)
Successfully connected to 35.230.120.87.
⚙️ Launching on GCP us-west1 (us-west1-a)
...
✓ Cluster launched: a100-8. View logs at: ~/sky_logs/sky-2024-10-11-18-32-48-894132/provision.log
GCP was chosen as the best cloud to run the task. There was no capacity in any of the regions in US Central, so the auto-failover provisioner moved to US West instead, allowing for our instance to be successfully provisioned.

@@ -81,28 +81,37 @@ Cross-cloud failover
If all regions within the chosen cloud failed, the provisioner retries on the next
cheapest cloud.

Here is an example of cross-cloud failover when requesting 8x V100 GPUs. All
regions in GCP failed to provide the resource, so the provisioner switched to
AWS, where it succeeded after two regions:
Here is an example of cross-cloud failover when requesting 8x A100 GPUs. All
regions in Azure failed to provide the resource, so the provisioner switched to
GCP, where it succeeded after one region:

.. code-block:: console
$ sky launch -c v100-8 --gpus V100:8
... # optimizer output
I 02-23 16:39:59 cloud_vm_ray_backend.py:1010] Creating a new cluster: "v100-8" [1x GCP(n1-highmem-8, {'V100': 8.0})].
I 02-23 16:39:59 cloud_vm_ray_backend.py:1010] Tip: to reuse an existing cluster, specify --cluster-name (-c) in the CLI or use sky.launch(.., cluster_name=..) in the Python API. Run `sky status` to see existing clusters.
I 02-23 16:39:59 cloud_vm_ray_backend.py:658] To view detailed progress: tail -n100 -f sky_logs/sky-2022-02-23-16-39-58-577551/provision.log
I 02-23 16:39:59 cloud_vm_ray_backend.py:668]
I 02-23 16:39:59 cloud_vm_ray_backend.py:668] Launching on GCP us-central1 (us-central1-a)
W 02-23 16:40:17 cloud_vm_ray_backend.py:403] Got ZONE_RESOURCE_POOL_EXHAUSTED in us-central1-a (message: The zone 'projects/intercloud-320520/zones/us-central1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later.)
$ sky launch -c a100-8 --gpus A100:8
Considered resources (1 node):
----------------------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
----------------------------------------------------------------------------------------------------
Azure Standard_ND96asr_v4 96 900 A100:8 eastus 27.20 ✔
GCP a2-highgpu-8g 96 680 A100:8 us-central1-a 29.39
AWS p4d.24xlarge 96 1152 A100:8 us-east-1 32.77
----------------------------------------------------------------------------------------------------
Launching a new cluster 'a100-8'. Proceed? [Y/n]:
...
⚙️ Launching on Azure eastus.
E 10-11 18:24:59 instance.py:457] Failed to create instances: [azure.core.exceptions.HttpResponseError] (InvalidTemplateDeployment)
sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in eastus
...
I 02-23 16:42:15 cloud_vm_ray_backend.py:668] Launching on AWS us-east-2 (us-east-2a,us-east-2b,us-east-2c)
W 02-23 16:42:26 cloud_vm_ray_backend.py:477] Got error(s) in all zones of us-east-2:
W 02-23 16:42:26 cloud_vm_ray_backend.py:479] create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient p3.16xlarge capacity in the Availability Zone you requested (us-east-2a). Our system will be working on provisioning additional capacity. You can currently get p3.16xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2b., retrying.
⚙️ Launching on GCP us-central1 (us-central1-a).
W 10-11 18:25:57 instance_utils.py:112] Got return codes 'VM_MIN_COUNT_NOT_REACHED', 'ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS' in us-central1-a: 'Requested minimum count of 1 VMs could not be created'; "The zone 'projects/xxxxxx/zones/us-central1-a' does not have enough resources available to fulfill the request. '(resource type:compute)'"
...
I 02-23 16:42:26 cloud_vm_ray_backend.py:668]
I 02-23 16:42:26 cloud_vm_ray_backend.py:668] Launching on AWS us-west-2 (us-west-2a,us-west-2b,us-west-2c,us-west-2d)
I 02-23 16:47:04 cloud_vm_ray_backend.py:740] Successfully provisioned or found existing VM. Setup completed.
⚙️ Launching on GCP us-central1 (us-central1-b).
Instance is up.
✓ Cluster launched: a100-8. View logs at: ~/sky_logs/sky-2024-10-11-18-24-14-357884/provision.log
Multiple Candidate GPUs
@@ -125,13 +134,13 @@ A10, L4, and A10g GPUs, using :code:`sky launch task.yaml`.
$ sky launch task.yaml
...
I 11-19 08:07:45 optimizer.py:910] -----------------------------------------------------------------------------------------------------
I 11-19 08:07:45 optimizer.py:910] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 11-19 08:07:45 optimizer.py:910] -----------------------------------------------------------------------------------------------------
I 11-19 08:07:45 optimizer.py:910] Azure Standard_NV6ads_A10_v5 6 55 A10:1 eastus 0.45 ✔
I 11-19 08:07:45 optimizer.py:910] GCP g2-standard-4 4 16 L4:1 us-east4-a 0.70
I 11-19 08:07:45 optimizer.py:910] AWS g5.xlarge 4 16 A10G:1 us-east-1 1.01
I 11-19 08:07:45 optimizer.py:910] -----------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
-----------------------------------------------------------------------------------------------------
Azure Standard_NV6ads_A10_v5 6 55 A10:1 eastus 0.45 ✔
GCP g2-standard-4 4 16 L4:1 us-east4-a 0.70
AWS g5.xlarge 4 16 A10G:1 us-east-1 1.01
-----------------------------------------------------------------------------------------------------
@@ -212,15 +221,15 @@ This will generate the following output:
$ sky launch -c mycluster task.yaml
...
I 12-20 23:55:56 optimizer.py:717]
I 12-20 23:55:56 optimizer.py:840] Considered resources (1 node):
I 12-20 23:55:56 optimizer.py:910] ---------------------------------------------------------------------------------------------
I 12-20 23:55:56 optimizer.py:910] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 12-20 23:55:56 optimizer.py:910] ---------------------------------------------------------------------------------------------
I 12-20 23:55:56 optimizer.py:910] GCP g2-standard-96 96 384 L4:8 us-east4-a 7.98 ✔
I 12-20 23:55:56 optimizer.py:910] AWS g5.48xlarge 192 768 A10G:8 us-east-1 16.29
I 12-20 23:55:56 optimizer.py:910] GCP a2-highgpu-8g 96 680 A100:8 us-east1-b 29.39
I 12-20 23:55:56 optimizer.py:910] AWS p4d.24xlarge 96 1152 A100:8 us-east-1 32.77
I 12-20 23:55:56 optimizer.py:910] ---------------------------------------------------------------------------------------------
I 12-20 23:55:56 optimizer.py:910]
Considered resources (1 node):
---------------------------------------------------------------------------------------------
CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
---------------------------------------------------------------------------------------------
GCP g2-standard-96 96 384 L4:8 us-east4-a 7.98 ✔
AWS g5.48xlarge 192 768 A10G:8 us-east-1 16.29
GCP a2-highgpu-8g 96 680 A100:8 us-east1-b 29.39
AWS p4d.24xlarge 96 1152 A100:8 us-east-1 32.77
---------------------------------------------------------------------------------------------
Launching a new cluster 'mycluster'. Proceed? [Y/n]:
4 changes: 3 additions & 1 deletion sky/adaptors/azure.py
@@ -20,7 +20,9 @@
azure = common.LazyImport(
'azure',
import_error_message=('Failed to import dependencies for Azure.'
'Try pip install "skypilot[azure]"'))
'Try pip install "skypilot[azure]"'),
set_loggers=lambda: logging.getLogger('azure.identity').setLevel(logging.
ERROR))
Client = Any
sky_logger = sky_logging.init_logger(__name__)

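The azure adaptor now passes a `set_loggers` callback to `LazyImport` (the hook is added in `sky/adaptors/common.py` below) so the chatty `azure.identity` logger is quieted as soon as the Azure SDK is actually imported. A standalone sketch of what that lambda does, using only the standard library; the sample messages are illustrative:

```python
import logging

# Same effect as the set_loggers callback above: raise the threshold of the
# 'azure.identity' logger so anything below ERROR is filtered out.
logging.getLogger('azure.identity').setLevel(logging.ERROR)

azure_identity_logger = logging.getLogger('azure.identity')
azure_identity_logger.warning('routine credential chatter')  # filtered out
azure_identity_logger.error('real failure')                  # still emitted
```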
8 changes: 6 additions & 2 deletions sky/adaptors/common.py
@@ -1,7 +1,7 @@
"""Lazy import for modules to avoid import error when not used."""
import functools
import importlib
from typing import Any, Optional, Tuple
from typing import Any, Callable, Optional, Tuple


class LazyImport:
@@ -18,15 +18,19 @@ class LazyImport:

def __init__(self,
module_name: str,
import_error_message: Optional[str] = None):
import_error_message: Optional[str] = None,
set_loggers: Optional[Callable] = None):
self._module_name = module_name
self._module = None
self._import_error_message = import_error_message
self._set_loggers = set_loggers

def load_module(self):
if self._module is None:
try:
self._module = importlib.import_module(self._module_name)
if self._set_loggers is not None:
self._set_loggers()
except ImportError as e:
if self._import_error_message is not None:
raise ImportError(self._import_error_message) from e
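With this change, `LazyImport` accepts an optional `set_loggers` callable that runs once, right after the wrapped module is first imported successfully. A usage sketch under the same pattern as the azure adaptor above; the wrapped module and logger names here are illustrative, not part of the change:

```python
import logging

from sky.adaptors import common

# Hypothetical adaptor: lazily wrap 'requests' and quiet urllib3 once loaded.
requests_lib = common.LazyImport(
    'requests',
    import_error_message='Failed to import requests. Try: pip install requests',
    set_loggers=lambda: logging.getLogger('urllib3').setLevel(logging.ERROR))

# Nothing is imported at definition time; the callback fires on first load.
requests_lib.load_module()
```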
13 changes: 9 additions & 4 deletions sky/backends/backend.py
@@ -4,7 +4,9 @@

import sky
from sky.usage import usage_lib
from sky.utils import rich_utils
from sky.utils import timeline
from sky.utils import ux_utils

if typing.TYPE_CHECKING:
from sky import resources
@@ -54,8 +56,9 @@ def provision(
cluster_name = sky.backends.backend_utils.generate_cluster_name()
usage_lib.record_cluster_name_for_current_operation(cluster_name)
usage_lib.messages.usage.update_actual_task(task)
return self._provision(task, to_provision, dryrun, stream_logs,
cluster_name, retry_until_up)
with rich_utils.safe_status(ux_utils.spinner_message('Launching')):
return self._provision(task, to_provision, dryrun, stream_logs,
cluster_name, retry_until_up)

@timeline.event
@usage_lib.messages.usage.update_runtime('sync_workdir')
@@ -76,7 +79,8 @@ def sync_file_mounts(
@usage_lib.messages.usage.update_runtime('setup')
def setup(self, handle: _ResourceHandleType, task: 'task_lib.Task',
detach_setup: bool) -> None:
return self._setup(handle, task, detach_setup)
with rich_utils.safe_status(ux_utils.spinner_message('Running setup')):
return self._setup(handle, task, detach_setup)

def add_storage_objects(self, task: 'task_lib.Task') -> None:
raise NotImplementedError
@@ -96,7 +100,8 @@ def execute(self,
usage_lib.record_cluster_name_for_current_operation(
handle.get_cluster_name())
usage_lib.messages.usage.update_actual_task(task)
return self._execute(handle, task, detach_run, dryrun)
with rich_utils.safe_status(ux_utils.spinner_message('Submitting job')):
return self._execute(handle, task, detach_run, dryrun)

@timeline.event
def post_execute(self, handle: _ResourceHandleType, down: bool) -> None:
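The backend entry points (`provision`, `setup`, `execute`) now render a spinner for their long-running phases by wrapping the work in `rich_utils.safe_status` with a `ux_utils.spinner_message`. A minimal sketch of that pattern; the message text and the placeholder body are illustrative:

```python
from sky.utils import rich_utils
from sky.utils import ux_utils


def run_setup_step():
    # While the block executes, the console shows a single spinner line with
    # this message instead of raw log output.
    with rich_utils.safe_status(ux_utils.spinner_message('Running setup')):
        ...  # placeholder for the actual long-running work
```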
29 changes: 13 additions & 16 deletions sky/backends/backend_utils.py
@@ -70,9 +70,6 @@
SKY_REMOTE_PATH = '~/.sky/wheels'
SKY_USER_FILE_PATH = '~/.sky/generated'

BOLD = '\033[1m'
RESET_BOLD = '\033[0m'

# Do not use /tmp because it gets cleared on VM restart.
_SKY_REMOTE_FILE_MOUNTS_DIR = '~/.sky/file_mounts/'

@@ -1171,7 +1168,8 @@ def wait_until_ray_cluster_ready(
runner = command_runner.SSHCommandRunner(node=(head_ip, 22),
**ssh_credentials)
with rich_utils.safe_status(
'[bold cyan]Waiting for workers...') as worker_status:
ux_utils.spinner_message('Waiting for workers',
log_path=log_path)) as worker_status:
while True:
rc, output, stderr = runner.run(
instance_setup.RAY_STATUS_WITH_SKY_RAY_PORT_COMMAND,
@@ -1187,9 +1185,11 @@
ready_head, ready_workers = _count_healthy_nodes_from_ray(
output, is_local_cloud=is_local_cloud)

worker_status.update('[bold cyan]'
f'{ready_workers} out of {num_nodes - 1} '
'workers ready')
worker_status.update(
ux_utils.spinner_message(
f'{ready_workers} out of {num_nodes - 1} '
'workers ready',
log_path=log_path))

# In the local case, ready_head=0 and ready_workers=num_nodes. This
# is because there is no matching regex for _LAUNCHED_HEAD_PATTERN.
@@ -1304,7 +1304,6 @@ def parallel_data_transfer_to_nodes(
stream_logs: bool; Whether to stream logs to stdout
source_bashrc: bool; Source bashrc before running the command.
"""
fore = colorama.Fore
style = colorama.Style

origin_source = source
@@ -1341,12 +1340,10 @@ def _sync_node(runner: 'command_runner.CommandRunner') -> None:

num_nodes = len(runners)
plural = 's' if num_nodes > 1 else ''
message = (f'{fore.CYAN}{action_message} (to {num_nodes} node{plural})'
f': {style.BRIGHT}{origin_source}{style.RESET_ALL} -> '
f'{style.BRIGHT}{target}{style.RESET_ALL}')
message = (f' {style.DIM}{action_message} (to {num_nodes} node{plural})'
f': {origin_source} -> {target}{style.RESET_ALL}')
logger.info(message)
with rich_utils.safe_status(f'[bold cyan]{action_message}[/]'):
subprocess_utils.run_in_parallel(_sync_node, runners)
subprocess_utils.run_in_parallel(_sync_node, runners)


def check_local_gpus() -> bool:
@@ -2488,9 +2485,9 @@ def get_clusters(
progress = rich_progress.Progress(transient=True,
redirect_stdout=False,
redirect_stderr=False)
task = progress.add_task(
f'[bold cyan]Refreshing status for {len(records)} cluster{plural}[/]',
total=len(records))
task = progress.add_task(ux_utils.spinner_message(
f'Refreshing status for {len(records)} cluster{plural}'),
total=len(records))

def _refresh_cluster(cluster_name):
try:
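Inside a running `safe_status` block the message can also be refreshed in place, which is how the worker-wait loop above reports progress. A small sketch of that update pattern; the worker count and loop are illustrative:

```python
from sky.utils import rich_utils
from sky.utils import ux_utils

total_workers = 3
with rich_utils.safe_status(
        ux_utils.spinner_message('Waiting for workers')) as status:
    for ready in range(total_workers + 1):
        # Re-render the single spinner line as more workers come up.
        status.update(
            ux_utils.spinner_message(
                f'{ready} out of {total_workers} workers ready'))
```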