diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md index ae0a28b52ed..c7c31960482 100644 --- a/.github/pull_request_template.md +++ b/.github/pull_request_template.md @@ -11,4 +11,4 @@ Tested (run the relevant ones): - [ ] Any manual or new tests for this PR (please specify below) - [ ] All smoke tests: `pytest tests/test_smoke.py` - [ ] Relevant individual smoke tests: `pytest tests/test_smoke.py::test_fill_in_the_name` -- [ ] Backward compatibility tests: `bash tests/backward_comaptibility_tests.sh` +- [ ] Backward compatibility tests: `conda deactivate; bash -i tests/backward_compatibility_tests.sh` diff --git a/.github/workflows/pytest.yml b/.github/workflows/pytest.yml index 87d8cea9f16..bf84bea4d50 100644 --- a/.github/workflows/pytest.yml +++ b/.github/workflows/pytest.yml @@ -27,7 +27,7 @@ jobs: - tests/test_optimizer_random_dag.py - tests/test_storage.py - tests/test_wheels.py - - tests/test_spot_serve.py + - tests/test_jobs_and_serve.py - tests/test_yaml_parser.py runs-on: ubuntu-latest steps: diff --git a/.gitignore b/.gitignore index f1dbf59a52f..efa74dd744b 100644 --- a/.gitignore +++ b/.gitignore @@ -11,4 +11,4 @@ sky_logs/ sky/clouds/service_catalog/data_fetchers/*.csv .vscode/ .idea/ - +.env diff --git a/README.md b/README.md index bff69464175..a2704df3643 100644 --- a/README.md +++ b/README.md @@ -27,22 +27,25 @@ ---- :fire: *News* :fire: +- [Jun, 2024] Reproduce **GPT** with [llm.c](https://github.com/karpathy/llm.c/discussions/481) on any cloud: [**guide**](./llm/gpt-2/) +- [Apr, 2024] Serve and finetune [**Llama 3**](https://skypilot.readthedocs.io/en/latest/gallery/llms/llama-3.html) on any cloud or Kubernetes: [**example**](./llm/llama-3/) +- [Apr, 2024] Serve [**Qwen-110B**](https://qwenlm.github.io/blog/qwen1.5-110b/) on your infra: [**example**](./llm/qwen/) - [Apr, 2024] Using [**Ollama**](https://github.com/ollama/ollama) to deploy quantized LLMs on CPUs and GPUs: [**example**](./llm/ollama/) -- [Mar, 2024] Serve and deploy [**Databricks DBRX**](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) on your infra: [**example**](./llm/dbrx/) - [Feb, 2024] Deploying and scaling [**Gemma**](https://blog.google/technology/developers/gemma-open-models/) with SkyServe: [**example**](./llm/gemma/) -- [Feb, 2024] Speed up your LLM deployments with [**SGLang**](https://github.com/sgl-project/sglang) for 5x throughput on SkyServe: [**example**](./llm/sglang/) - [Feb, 2024] Serving [**Code Llama 70B**](https://ai.meta.com/blog/code-llama-large-language-model-coding/) with vLLM and SkyServe: [**example**](./llm/codellama/) -- [Dec, 2023] Using [**LoRAX**](https://github.com/predibase/lorax) to serve 1000s of finetuned LLMs on a single instance in the cloud: [**example**](./llm/lorax/) - [Dec, 2023] [**Mixtral 8x7B**](https://mistral.ai/news/mixtral-of-experts/), a high quality sparse mixture-of-experts model, was released by Mistral AI! Deploy via SkyPilot on any cloud: [**example**](./llm/mixtral/) - [Nov, 2023] Using [**Axolotl**](https://github.com/OpenAccess-AI-Collective/axolotl) to finetune Mistral 7B on the cloud (on-demand and spot): [**example**](./llm/axolotl/) -- [Sep, 2023] [**Mistral 7B**](https://mistral.ai/news/announcing-mistral-7b/), a high-quality open LLM, was released! 
Deploy via SkyPilot on any cloud: [**Mistral docs**](https://docs.mistral.ai/self-deployment/skypilot) - [Sep, 2023] Case study: [**Covariant**](https://covariant.ai/) transformed AI development on the cloud using SkyPilot, delivering models 4x faster cost-effectively: [**read the case study**](https://blog.skypilot.co/covariant/) -- [Aug, 2023] Cookbook: Finetuning Llama 2 in your own cloud environment, privately: [**example**](./llm/vicuna-llama-2/), [**blog post**](https://blog.skypilot.co/finetuning-llama2-operational-guide/) +- [Aug, 2023] **Finetuning Cookbook**: Finetuning Llama 2 in your own cloud environment, privately: [**example**](./llm/vicuna-llama-2/), [**blog post**](https://blog.skypilot.co/finetuning-llama2-operational-guide/) - [June, 2023] Serving LLM 24x Faster On the Cloud [**with vLLM**](https://vllm.ai/) and SkyPilot: [**example**](./llm/vllm/), [**blog post**](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/)
Archived +- [Mar, 2024] Serve and deploy [**Databricks DBRX**](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) on your infra: [**example**](./llm/dbrx/) +- [Feb, 2024] Speed up your LLM deployments with [**SGLang**](https://github.com/sgl-project/sglang) for 5x throughput on SkyServe: [**example**](./llm/sglang/) +- [Dec, 2023] Using [**LoRAX**](https://github.com/predibase/lorax) to serve 1000s of finetuned LLMs on a single instance in the cloud: [**example**](./llm/lorax/) +- [Sep, 2023] [**Mistral 7B**](https://mistral.ai/news/announcing-mistral-7b/), a high-quality open LLM, was released! Deploy via SkyPilot on any cloud: [**Mistral docs**](https://docs.mistral.ai/self-deployment/skypilot) - [July, 2023] Self-Hosted **Llama-2 Chatbot** on Any Cloud: [**example**](./llm/llama-2/) - [April, 2023] [SkyPilot YAMLs](./llm/vicuna/) for finetuning & serving the [Vicuna LLM](https://lmsys.org/blog/2023-03-30-vicuna/) with a single command! @@ -151,6 +154,9 @@ To learn more, see our [Documentation](https://skypilot.readthedocs.io/en/latest Runnable examples: - LLMs on SkyPilot + - [GPT-2 via `llm.c`](./llm/gpt-2/) + - [Llama 3](./llm/llama-3/) + - [Qwen](./llm/qwen/) - [Databricks DBRX](./llm/dbrx/) - [Gemma](./llm/gemma/) - [Mixtral 8x7B](./llm/mixtral/); [Mistral 7B](https://docs.mistral.ai/self-deployment/skypilot/) (from official Mistral team) @@ -168,7 +174,7 @@ Runnable examples: - [LocalGPT](./llm/localgpt) - [Falcon](./llm/falcon) - Add yours here & see more in [`llm/`](./llm)! -- Framework examples: [PyTorch DDP](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml), [DeepSpeed](./examples/deepspeed-multinode/sky.yaml), [JAX/Flax on TPU](https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml), [Stable Diffusion](https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion), [Detectron2](https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml), [Distributed](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py) [TensorFlow](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml), [Ray Train](examples/distributed_ray_train/ray_train.yaml), [NeMo](https://github.com/skypilot-org/skypilot/blob/master/examples/nemo/nemo.yaml), [programmatic grid search](https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py), [Docker](https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml), [Cog](https://github.com/skypilot-org/skypilot/blob/master/examples/cog/), [Unsloth](https://github.com/skypilot-org/skypilot/blob/master/examples/unsloth/unsloth.yaml), [Ollama](https://github.com/skypilot-org/skypilot/blob/master/llm/ollama) and [many more (`examples/`)](./examples). 
+- Framework examples: [PyTorch DDP](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml), [DeepSpeed](./examples/deepspeed-multinode/sky.yaml), [JAX/Flax on TPU](https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml), [Stable Diffusion](https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion), [Detectron2](https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml), [Distributed](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py) [TensorFlow](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml), [Ray Train](examples/distributed_ray_train/ray_train.yaml), [NeMo](https://github.com/skypilot-org/skypilot/blob/master/examples/nemo/nemo.yaml), [programmatic grid search](https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py), [Docker](https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml), [Cog](https://github.com/skypilot-org/skypilot/blob/master/examples/cog/), [Unsloth](https://github.com/skypilot-org/skypilot/blob/master/examples/unsloth/unsloth.yaml), [Ollama](https://github.com/skypilot-org/skypilot/blob/master/llm/ollama), [llm.c](https://github.com/skypilot-org/skypilot/tree/master/llm/gpt-2) and [many more (`examples/`)](./examples). Follow updates: - [Twitter](https://twitter.com/skypilot_org) @@ -179,6 +185,7 @@ Read the research: - [SkyPilot paper](https://www.usenix.org/system/files/nsdi23-yang-zongheng.pdf) and [talk](https://www.usenix.org/conference/nsdi23/presentation/yang-zongheng) (NSDI 2023) - [Sky Computing whitepaper](https://arxiv.org/abs/2205.07147) - [Sky Computing vision paper](https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s02-stoica.pdf) (HotOS 2021) +- [Policy for Managed Spot Jobs](https://www.usenix.org/conference/nsdi24/presentation/wu-zhanghao) (NSDI 2024) ## Support and Questions We are excited to hear your feedback! diff --git a/docs/source/_gallery_original/index.rst b/docs/source/_gallery_original/index.rst index b38d53cad52..e8a540c883c 100644 --- a/docs/source/_gallery_original/index.rst +++ b/docs/source/_gallery_original/index.rst @@ -1,3 +1,5 @@ +.. 
_ai-gallery: + AI Gallery ==================== @@ -36,6 +38,8 @@ Contents Mistral 7B (Mistral AI) DBRX (Databricks) Llama-2 (Meta) + Llama-3 (Meta) + Qwen (Alibaba) CodeLlama (Meta) Gemma (Google) diff --git a/docs/source/_gallery_original/llms/llama-3.md b/docs/source/_gallery_original/llms/llama-3.md new file mode 120000 index 00000000000..54133a94f28 --- /dev/null +++ b/docs/source/_gallery_original/llms/llama-3.md @@ -0,0 +1 @@ +../../../../llm/llama-3/README.md \ No newline at end of file diff --git a/docs/source/_gallery_original/llms/qwen.md b/docs/source/_gallery_original/llms/qwen.md new file mode 120000 index 00000000000..2a6d513503f --- /dev/null +++ b/docs/source/_gallery_original/llms/qwen.md @@ -0,0 +1 @@ +../../../../llm/qwen/README.md \ No newline at end of file diff --git a/docs/source/_static/custom.js b/docs/source/_static/custom.js index 56bae76ec61..5630793d8ff 100644 --- a/docs/source/_static/custom.js +++ b/docs/source/_static/custom.js @@ -5,7 +5,7 @@ document.addEventListener('DOMContentLoaded', function () { script.setAttribute('data-project-name', 'SkyPilot'); script.setAttribute('data-project-color', '#4C4C4D'); script.setAttribute('data-project-logo', 'https://avatars.githubusercontent.com/u/109387420?s=100&v=4'); - script.setAttribute('data-modal-disclaimer', 'Results are automatically generated and may be inaccurate or contain inappropriate information. Do not include any sensitive information in your query.'); + script.setAttribute('data-modal-disclaimer', 'Results are automatically generated and may be inaccurate or contain inappropriate information. Do not include any sensitive information in your query.\n**To get further assistance, you can chat directly with the development team** by joining the [SkyPilot Slack](https://slack.skypilot.co/).'); script.setAttribute('data-modal-title', 'SkyPilot Docs AI - Ask a Question.'); script.setAttribute('data-button-position-bottom', '85px'); script.async = true; @@ -26,9 +26,11 @@ document.addEventListener('DOMContentLoaded', () => { // New items: const newItems = [ { selector: '.caption-text', text: 'SkyServe: Model Serving' }, + { selector: '.toctree-l1 > a', text: 'Managed Jobs' }, { selector: '.toctree-l1 > a', text: 'Running on Kubernetes' }, - { selector: '.toctree-l1 > a', text: 'DBRX (Databricks)' }, { selector: '.toctree-l1 > a', text: 'Ollama' }, + { selector: '.toctree-l1 > a', text: 'Llama-3 (Meta)' }, + { selector: '.toctree-l1 > a', text: 'Qwen (Alibaba)' }, ]; newItems.forEach(({ selector, text }) => { document.querySelectorAll(selector).forEach((el) => { diff --git a/docs/source/cloud-setup/cloud-permissions/index.rst b/docs/source/cloud-setup/cloud-permissions/index.rst index 873cbf339fc..e2a1aaf16ae 100644 --- a/docs/source/cloud-setup/cloud-permissions/index.rst +++ b/docs/source/cloud-setup/cloud-permissions/index.rst @@ -20,3 +20,4 @@ Table of Contents aws gcp vsphere + kubernetes diff --git a/docs/source/cloud-setup/cloud-permissions/kubernetes.rst b/docs/source/cloud-setup/cloud-permissions/kubernetes.rst new file mode 100644 index 00000000000..34d29b49f0a --- /dev/null +++ b/docs/source/cloud-setup/cloud-permissions/kubernetes.rst @@ -0,0 +1,333 @@ +.. _cloud-permissions-kubernetes: + +Kubernetes +========== + +When running outside your Kubernetes cluster, SkyPilot uses your local ``~/.kube/config`` file +for authentication and creating resources on your Kubernetes cluster. 
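+
+As a quick sanity check, you can confirm which context and cluster SkyPilot will pick up from your kubeconfig using standard ``kubectl`` commands (nothing SkyPilot-specific; adjust as needed for your setup):
+
+.. code-block:: console
+
+   $ # Show the context SkyPilot will use by default
+   $ kubectl config current-context
+   $ # Confirm connectivity and basic permissions on the cluster
+   $ kubectl get nodes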
+
+When running inside your Kubernetes cluster (e.g., as a Spot controller or Serve controller),
+SkyPilot can operate using any of the following three authentication methods:
+
+1. **Using your local kubeconfig file**: In this case, SkyPilot will
+   copy your local ``~/.kube/config`` file to the controller pod and use it for
+   authentication. This is the default method when running inside the cluster,
+   and no additional configuration is required.
+
+   .. note::
+
+       If your cluster uses exec-based authentication in your ``~/.kube/config`` file
+       (e.g., GKE uses exec auth by default), SkyPilot may not be able to authenticate using this method. In this case,
+       consider using the service account methods below.
+
+2. **Creating a service account**: SkyPilot can automatically create the service
+   account and roles for itself to manage resources in the Kubernetes cluster.
+   To use this method, set ``remote_identity: SERVICE_ACCOUNT`` in your
+   Kubernetes configuration in the :ref:`~/.sky/config.yaml ` file:
+
+   .. code-block:: yaml
+
+       kubernetes:
+         remote_identity: SERVICE_ACCOUNT
+
+   For details on the permissions that are granted to the service account,
+   refer to the `Minimum Permissions Required for SkyPilot`_ section below.
+
+3. **Using a custom service account**: If you have a custom service account
+   with the `necessary permissions `__, you can configure
+   SkyPilot to use it by adding this to your :ref:`~/.sky/config.yaml ` file:
+
+   .. code-block:: yaml
+
+       kubernetes:
+         remote_identity: your-service-account-name
+
+.. note::
+
+    Service account-based authentication applies only when the remote SkyPilot
+    cluster (including the spot and serve controllers) is launched inside the
+    Kubernetes cluster. When running outside the cluster (e.g., on AWS),
+    SkyPilot will use the local ``~/.kube/config`` file for authentication.
+
+Below are the permissions required by SkyPilot and an example service account YAML that you can use to create a service account with the necessary permissions.
+
+.. _k8s-permissions:
+
+Minimum Permissions Required for SkyPilot
+-----------------------------------------
+
+SkyPilot requires permissions equivalent to the following roles to be able to manage the resources in the Kubernetes cluster:
+
+.. code-block:: yaml
+
+    # Namespaced role for the service account
+    # Required for creating pods, services and other necessary resources in the namespace.
+    # Note these permissions only apply in the namespace where SkyPilot is deployed, and the namespace can be changed below.
+    kind: Role
+    apiVersion: rbac.authorization.k8s.io/v1
+    metadata:
+      name: sky-sa-role  # Can be changed if needed
+      namespace: default  # Change to your namespace if using a different one.
+    rules:
+      - apiGroups: ["*"]
+        resources: ["*"]
+        verbs: ["*"]
+    ---
+    # ClusterRole for accessing cluster-wide resources. Details for each resource below:
+    kind: ClusterRole
+    apiVersion: rbac.authorization.k8s.io/v1
+    metadata:
+      name: sky-sa-cluster-role  # Can be changed if needed
+      namespace: default  # Change to your namespace if using a different one.
+      labels:
+        parent: skypilot
+    rules:
+      - apiGroups: [""]
+        resources: ["nodes"]  # Required for getting node resources.
+        verbs: ["get", "list", "watch"]
+      - apiGroups: ["node.k8s.io"]
+        resources: ["runtimeclasses"]  # Required for autodetecting the runtime class of the nodes.
+        verbs: ["get", "list", "watch"]
+
+
+.. tip::
+
+    If you are using a different namespace than ``default``, make sure to change the namespace in the above manifests.
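+
+For example, once the roles above are bound, you can sanity-check the grants with ``kubectl auth can-i`` by impersonating the account. This is a minimal sketch that assumes a service account named ``sky-sa`` in the ``default`` namespace (adjust the names to your setup):
+
+.. code-block:: console
+
+   $ # Namespaced permissions (pods, services, etc.)
+   $ kubectl auth can-i create pods --as=system:serviceaccount:default:sky-sa -n default
+   $ # Cluster-wide read access to nodes
+   $ kubectl auth can-i list nodes --as=system:serviceaccount:default:sky-sa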
+ +These roles must apply to both the user account configured in the kubeconfig file and the service account used by SkyPilot (if configured). + +If your tasks use object store mounting or require access to ingress resources, you will need to grant additional permissions as described below. + +Permissions for Object Store Mounting +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +If your tasks use object store mounting (e.g., S3, GCS, etc.), SkyPilot will need to run a DaemonSet to expose the FUSE device as a Kubernetes resource to SkyPilot pods. + +To allow this, you will need to also create a ``skypilot-system`` namespace which will run the DaemonSet and grant the necessary permissions to the service account in that namespace. + + +.. code-block:: yaml + + # Required only if using object store mounting + # Create namespace for SkyPilot system + apiVersion: v1 + kind: Namespace + metadata: + name: skypilot-system # Do not change this + labels: + parent: skypilot + --- + # Role for the skypilot-system namespace to create FUSE device manager and + # any other system components required by SkyPilot. + # This role must be bound in the skypilot-system namespace to the service account used for SkyPilot. + kind: Role + apiVersion: rbac.authorization.k8s.io/v1 + metadata: + name: skypilot-system-service-account-role # Can be changed if needed + namespace: skypilot-system # Do not change this namespace + labels: + parent: skypilot + rules: + - apiGroups: ["*"] + resources: ["*"] + verbs: ["*"] + + +Permissions for using Ingress +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +If your tasks use :ref:`Ingress ` for exposing ports, you will need to grant the necessary permissions to the service account in the ``ingress-nginx`` namespace. + +.. code-block:: yaml + + # Required only if using ingresses + # Role for accessing ingress service IP + apiVersion: rbac.authorization.k8s.io/v1 + kind: Role + metadata: + namespace: ingress-nginx # Do not change this + name: sky-sa-role-ingress-nginx # Can be changed if needed + rules: + - apiGroups: [""] + resources: ["services"] + verbs: ["list", "get"] + + +.. _k8s-sa-example: + +Example using Custom Service Account +------------------------------------ + +To create a service account that has all necessary permissions for SkyPilot (including for accessing object stores), you can use the following YAML. + +.. tip:: + + In this example, the service account is named ``sky-sa`` and is created in the ``default`` namespace. + Change the namespace and service account name as needed. + + +.. code-block:: yaml + :linenos: + + # create-sky-sa.yaml + kind: ServiceAccount + apiVersion: v1 + metadata: + name: sky-sa # Change to your service account name + namespace: default # Change to your namespace if using a different one. + labels: + parent: skypilot + --- + # Role for the service account + kind: Role + apiVersion: rbac.authorization.k8s.io/v1 + metadata: + name: sky-sa-role # Can be changed if needed + namespace: default # Change to your namespace if using a different one. + labels: + parent: skypilot + rules: + - apiGroups: ["*"] # Required for creating pods, services, secrets and other necessary resources in the namespace. + resources: ["*"] + verbs: ["*"] + --- + # RoleBinding for the service account + kind: RoleBinding + apiVersion: rbac.authorization.k8s.io/v1 + metadata: + name: sky-sa-rb # Can be changed if needed + namespace: default # Change to your namespace if using a different one. 
+ labels: + parent: skypilot + subjects: + - kind: ServiceAccount + name: sky-sa # Change to your service account name + roleRef: + kind: Role + name: sky-sa-role # Use the same name as the role at line 14 + apiGroup: rbac.authorization.k8s.io + --- + # ClusterRole for the service account + kind: ClusterRole + apiVersion: rbac.authorization.k8s.io/v1 + metadata: + name: sky-sa-cluster-role # Can be changed if needed + namespace: default # Change to your namespace if using a different one. + labels: + parent: skypilot + rules: + - apiGroups: [""] + resources: ["nodes"] # Required for getting node resources. + verbs: ["get", "list", "watch"] + - apiGroups: ["node.k8s.io"] + resources: ["runtimeclasses"] # Required for autodetecting the runtime class of the nodes. + verbs: ["get", "list", "watch"] + - apiGroups: ["networking.k8s.io"] # Required for exposing services through ingresses + resources: ["ingressclasses"] + verbs: ["get", "list", "watch"] + --- + # ClusterRoleBinding for the service account + apiVersion: rbac.authorization.k8s.io/v1 + kind: ClusterRoleBinding + metadata: + name: sky-sa-cluster-role-binding # Can be changed if needed + namespace: default # Change to your namespace if using a different one. + labels: + parent: skypilot + subjects: + - kind: ServiceAccount + name: sky-sa # Change to your service account name + namespace: default # Change to your namespace if using a different one. + roleRef: + kind: ClusterRole + name: sky-sa-cluster-role # Use the same name as the cluster role at line 43 + apiGroup: rbac.authorization.k8s.io + --- + # Optional: If using object store mounting, create the skypilot-system namespace + apiVersion: v1 + kind: Namespace + metadata: + name: skypilot-system # Do not change this + labels: + parent: skypilot + --- + # Optional: If using object store mounting, create role in the skypilot-system + # namespace to create FUSE device manager. + kind: Role + apiVersion: rbac.authorization.k8s.io/v1 + metadata: + name: skypilot-system-service-account-role # Can be changed if needed + namespace: skypilot-system # Do not change this namespace + labels: + parent: skypilot + rules: + - apiGroups: ["*"] + resources: ["*"] + verbs: ["*"] + --- + # Optional: If using object store mounting, create rolebinding in the skypilot-system + # namespace to create FUSE device manager. 
+ apiVersion: rbac.authorization.k8s.io/v1 + kind: RoleBinding + metadata: + name: sky-sa-skypilot-system-role-binding + namespace: skypilot-system # Do not change this namespace + labels: + parent: skypilot + subjects: + - kind: ServiceAccount + name: sky-sa # Change to your service account name + namespace: default # Change this to the namespace where the service account is created + roleRef: + kind: Role + name: skypilot-system-service-account-role # Use the same name as the role at line 88 + apiGroup: rbac.authorization.k8s.io + --- + # Optional: Role for accessing ingress resources + apiVersion: rbac.authorization.k8s.io/v1 + kind: Role + metadata: + name: sky-sa-role-ingress-nginx # Can be changed if needed + namespace: ingress-nginx # Do not change this namespace + labels: + parent: skypilot + rules: + - apiGroups: [""] + resources: ["services"] + verbs: ["list", "get", "watch"] + - apiGroups: ["rbac.authorization.k8s.io"] + resources: ["roles", "rolebindings"] + verbs: ["list", "get", "watch"] + --- + # Optional: RoleBinding for accessing ingress resources + apiVersion: rbac.authorization.k8s.io/v1 + kind: RoleBinding + metadata: + name: sky-sa-rolebinding-ingress-nginx # Can be changed if needed + namespace: ingress-nginx # Do not change this namespace + labels: + parent: skypilot + subjects: + - kind: ServiceAccount + name: sky-sa # Change to your service account name + namespace: default # Change this to the namespace where the service account is created + roleRef: + kind: Role + name: sky-sa-role-ingress-nginx # Use the same name as the role at line 119 + apiGroup: rbac.authorization.k8s.io + +Create the service account using the following command: + +.. code-block:: bash + + $ kubectl apply -f create-sky-sa.yaml + +After creating the service account, the cluster admin may distribute kubeconfigs with the ``sky-sa`` service account to users who need to access the cluster. + +Users should also configure SkyPilot to use the ``sky-sa`` service account through ``~/.sky/config.yaml``: + +.. code-block:: yaml + + # ~/.sky/config.yaml + kubernetes: + remote_identity: sky-sa # Or your service account name diff --git a/docs/source/docs/index.rst b/docs/source/docs/index.rst index d0608f4c0e9..57efa26acbc 100644 --- a/docs/source/docs/index.rst +++ b/docs/source/docs/index.rst @@ -69,6 +69,9 @@ Runnable examples: * **LLMs on SkyPilot** + * `GPT-2 via llm.c `_ + * `Llama 3 `_ + * `Qwen `_ * `Databricks DBRX `_ * `Gemma `_ * `Mixtral 8x7B `_; `Mistral 7B `_ (from official Mistral team) @@ -87,7 +90,7 @@ Runnable examples: * `Falcon `_ * Add yours here & see more in `llm/ `_! -* Framework examples: `PyTorch DDP `_, `DeepSpeed `_, `JAX/Flax on TPU `_, `Stable Diffusion `_, `Detectron2 `_, `Distributed `_ `TensorFlow `_, `NeMo `_, `programmatic grid search `_, `Docker `_, `Cog `_, `Unsloth `_, `Ollama `_ and `many more `_. +* Framework examples: `PyTorch DDP `_, `DeepSpeed `_, `JAX/Flax on TPU `_, `Stable Diffusion `_, `Detectron2 `_, `Distributed `_ `TensorFlow `_, `NeMo `_, `programmatic grid search `_, `Docker `_, `Cog `_, `Unsloth `_, `Ollama `_, `llm.c `__ and `many more `_. Follow updates: @@ -112,14 +115,14 @@ Contents ../getting-started/installation ../getting-started/quickstart ../getting-started/tutorial - ../examples/gpu-jupyter + ../examples/interactive-development .. 
toctree:: :maxdepth: 1 :caption: Running Jobs - ../examples/spot-jobs + ../examples/managed-jobs ../reference/job-queue ../examples/auto-failover ../reference/kubernetes/index @@ -137,7 +140,7 @@ Contents :maxdepth: 1 :caption: Cutting Cloud Costs - ../examples/spot-jobs + Managed Spot Jobs <../examples/spot-jobs> ../reference/auto-stop ../reference/benchmark/index diff --git a/docs/source/examples/docker-containers.rst b/docs/source/examples/docker-containers.rst index 9fe835d6b9a..8bc7ae16837 100644 --- a/docs/source/examples/docker-containers.rst +++ b/docs/source/examples/docker-containers.rst @@ -8,6 +8,10 @@ SkyPilot can run a container either as a task, or as the runtime environment of * If the container image is invocable / has an entrypoint: run it :ref:`as a task `. * If the container image is to be used as a runtime environment (e.g., ``ubuntu``, ``nvcr.io/nvidia/pytorch:23.10-py3``, etc.) and if you have extra commands to run inside the container: run it :ref:`as a runtime environment `. +.. note:: + + Running docker containers is `not supported on RunPod `_. To use RunPod, use ``setup`` and ``run`` to configure your environment. See `GitHub issue `_ for more. + .. _docker-containers-as-tasks: Running Containers as Tasks diff --git a/docs/source/examples/gpu-jupyter.rst b/docs/source/examples/gpu-jupyter.rst deleted file mode 100644 index 7f4c593e02f..00000000000 --- a/docs/source/examples/gpu-jupyter.rst +++ /dev/null @@ -1,98 +0,0 @@ -GPU-backed Jupyter Notebooks -============================ - -Jupyter notebooks are a useful tool for interactive development, debugging, and -visualization. SkyPilot makes the process of running a GPU-backed Jupyter notebook -simple by automatically managing provisioning and port forwarding. - -To get a machine with a GPU attached, use: - -.. code-block:: bash - - # Launch a VM with 1 NVIDIA GPU and forward port 8888 to localhost - sky launch -c jupyter-vm --gpus K80:1 - ssh -L 8888:localhost:8888 jupyter-vm - -.. note:: - - View the supported GPUs with the :code:`sky show-gpus` command. - -Use ``ssh jupyter-vm`` to SSH into the VM. Inside the VM, you can run the -following commands to start a Jupyter session: - -.. code-block:: bash - - pip install jupyter - jupyter notebook - -In your local browser, you should now be able to access :code:`localhost:8888` and see the following screen: - -.. image:: ../images/jupyter-auth.png - :width: 100% - :alt: Jupyter authentication window - -Enter the password or token and you will be directed to a page where you can create a new notebook. - -.. image:: ../images/jupyter-create.png - :width: 100% - :alt: Create a new Jupyter notebook - -You can verify that this notebook is running on the GPU-backed instance using :code:`nvidia-smi`. - -.. image:: ../images/jupyter-gpu.png - :width: 100% - :alt: nvidia-smi in notebook - -The GPU node is a normal SkyPilot cluster, so you can use the usual CLI commands on it. For example, run ``sky down/stop`` to terminate or stop it, and ``sky exec`` to execute a task. - -Notebooks in SkyPilot tasks ---------------------------- -Jupyter notebooks can also be used in SkyPilot tasks, allowing access to the full -range of SkyPilot's features including :ref:`mounted storage ` and -:ref:`autostop `. - -The following :code:`jupyter.yaml` is an example of a task specification that can launch notebooks with SkyPilot. - -.. 
code:: yaml
-
-   # jupyter.yaml
-
-   name: jupyter
-
-   resources:
-     accelerators: K80:1
-
-   file_mounts:
-     /covid:
-       source: s3://fah-public-data-covid19-cryptic-pockets
-       mode: MOUNT
-
-   setup: |
-     pip install --upgrade pip
-     conda init bash
-     conda create -n jupyter python=3.9 -y
-     conda activate jupyter
-     pip install jupyter
-
-   run: |
-     cd ~/sky_workdir
-     conda activate jupyter
-     jupyter notebook --port 8888
-
-Launch the GPU-backed Jupyter notebook:
-
-.. code:: bash
-
-   sky launch -c jupyter jupyter.yaml
-
-To access the notebook locally, use SSH port forwarding.
-
-.. code:: bash
-
-   ssh -L 8888:localhost:8888 jupyter
-
-You can verify that this notebook has access to the mounted storage bucket.
-
-.. image:: ../images/jupyter-covid.png
-   :width: 100%
-   :alt: accessing covid data from notebook
diff --git a/docs/source/examples/interactive-development.rst b/docs/source/examples/interactive-development.rst
new file mode 100644
index 00000000000..f9c4b5c4803
--- /dev/null
+++ b/docs/source/examples/interactive-development.rst
@@ -0,0 +1,209 @@
+Start a Development Cluster
+===========================
+
+
+SkyPilot makes interactive development easy on Kubernetes or cloud VMs. It helps you:
+
+#. :ref:`Launch `: Quickly get a cluster with a GPU or other required resources with a single command.
+#. :ref:`Autostop `: Automatically stop the cluster after some idle time for cost savings.
+#. :ref:`Connect `: Easily connect to the cluster using the cluster name:
+
+   - :ref:`SSH `
+   - :ref:`VSCode `
+   - :ref:`Jupyter Notebooks `
+
+.. _dev-launch:
+
+Launch
+------
+
+To launch a cluster with a cheap GPU for development:
+
+.. code-block:: bash
+
+   # Launch a cluster with 1 NVIDIA GPU and sync the local working directory to the
+   # cluster.
+   sky launch -c dev --gpus T4 --workdir .
+
+This can be launched as a pod in your Kubernetes cluster or a VM on any cloud.
+
+.. tab-set::
+
+    .. tab-item:: Kubernetes
+
+        .. figure:: ../images/k8s-pod.png
+            :align: center
+            :width: 80%
+            :alt: Launch a cluster as a pod in Kubernetes
+
+            Launch a cluster as a pod in Kubernetes
+
+    .. tab-item:: Cloud
+
+        .. figure:: ../images/gcp-vm.png
+            :align: center
+            :width: 80%
+            :alt: Launch a cluster as a VM on GCP
+
+            Launch a cluster as a VM on GCP
+
+.. note::
+
+   View the supported GPUs with the :code:`sky show-gpus` command.
+
+.. _dev-autostop:
+
+Autostop
+--------
+
+SkyPilot allows you to automatically stop the cluster after a period of idle time to save costs. You can set the autostop time with a single command:
+
+.. code-block:: bash
+
+   # Autostop the cluster after 5 hours (300 minutes) of idleness
+   sky autostop -i 300 dev
+
+Or add the :code:`-i` flag during launch:
+
+.. code-block:: bash
+
+   # Launch a cluster that autostops after 5 idle hours
+   sky launch -c dev --gpus T4 --workdir . -i 300
+
+For more details on autostop, see :ref:`auto-stop`. This feature prevents idle
+clusters from incurring unnecessary costs, ensuring your cluster
+stops automatically, whether overnight or over the weekend.
+
+
+.. _dev-connect:
+
+Connect
+-------
+
+You can easily connect to the cluster using your method of choice:
+
+.. _dev-ssh:
+
+SSH
+~~~
+
+SkyPilot automatically configures SSH settings for a cluster, so you can connect to it with just the cluster name:
+
+.. code-block:: bash
+
+   ssh dev
+
+
+.. 
_dev-vscode: + +VSCode +~~~~~~ + +A common use case for interactive development is to connect a local IDE to a remote cluster and directly edit code that lives on the cluster. +This is supported by simply connecting VSCode to the cluster with the cluster name: + +#. Click on the top bar, type: :code:`> remote-ssh`, and select :code:`Remote-SSH: Connect Current Window to Host...` +#. Select the cluster name (e.g., ``dev``) from the list of hosts. +#. Open folder: :code:`sky_workdir` and find your code. + +For more details, please refer to the `VSCode documentation `__. + +.. image:: https://imgur.com/8mKfsET.gif + :align: center + :alt: Connect to the cluster with VSCode + +.. _dev-notebooks: + +Jupyter Notebooks +~~~~~~~~~~~~~~~~~ + +Jupyter notebooks are a useful tool for interactive development, debugging, and +visualization. + +Connect to the machine and forward the port used by jupyter notebook: + +.. code-block:: bash + + ssh -L 8888:localhost:8888 dev + +Inside the cluster, you can run the following commands to start a Jupyter session: + +.. code-block:: bash + + pip install jupyter + jupyter notebook + +In your local browser, you should now be able to access :code:`localhost:8888` and see the following screen: + +.. image:: ../images/jupyter-auth.png + :width: 100% + :alt: Jupyter authentication window + +Enter the password or token and you will be directed to a page where you can create a new notebook. + +.. image:: ../images/jupyter-create.png + :width: 100% + :alt: Create a new Jupyter notebook + +You can verify that this notebook is running on the GPU-backed instance using :code:`nvidia-smi`. + +.. image:: ../images/jupyter-gpu.png + :width: 100% + :alt: nvidia-smi in notebook + +The GPU node is a normal SkyPilot cluster, so you can use the usual CLI commands on it. For example, run ``sky down/stop`` to terminate or stop it, and ``sky exec`` to execute a task. + +Notebooks in SkyPilot tasks +^^^^^^^^^^^^^^^^^^^^^^^^^^^ +Jupyter notebooks can also be used in SkyPilot tasks, allowing access to the full +range of SkyPilot's features including :ref:`mounted storage ` and +:ref:`autostop `. + +The following :code:`jupyter.yaml` is an example of a task specification that can launch notebooks with SkyPilot. + +.. code:: yaml + + # jupyter.yaml + + name: jupyter + + resources: + accelerators: T4:1 + + file_mounts: + /covid: + source: s3://fah-public-data-covid19-cryptic-pockets + mode: MOUNT + + setup: | + pip install --upgrade pip + conda init bash + conda create -n jupyter python=3.9 -y + conda activate jupyter + pip install jupyter + + run: | + cd ~/sky_workdir + conda activate jupyter + jupyter notebook --port 8888 & + +Launch the GPU-backed Jupyter notebook: + +.. code:: bash + + sky launch -c jupyter jupyter.yaml + +To access the notebook locally, use SSH port forwarding. + +.. code:: bash + + ssh -L 8888:localhost:8888 jupyter + +You can verify that this notebook has access to the mounted storage bucket. + +.. image:: ../images/jupyter-covid.png + :width: 100% + :alt: accessing covid data from notebook + + + diff --git a/docs/source/examples/managed-jobs.rst b/docs/source/examples/managed-jobs.rst new file mode 100644 index 00000000000..a47b4345b9f --- /dev/null +++ b/docs/source/examples/managed-jobs.rst @@ -0,0 +1,485 @@ +.. _managed-jobs: + +Managed Jobs +============ + +.. tip:: + + This feature is great for scaling out: running a single job for long durations, or running many jobs (pipelines). 
+
+SkyPilot supports **managed jobs** (:code:`sky jobs`), which can automatically recover from any spot preemptions or hardware failures.
+It can be used in three modes:
+
+#. :ref:`Managed Spot Jobs `: Jobs run on auto-recovering spot instances. This can **save significant costs** (e.g., up to 70\% for GPU VMs) by making preemptible spot instances useful for long-running jobs.
+#. :ref:`On-demand `: Jobs run on auto-recovering on-demand instances. This is useful for jobs that require guaranteed resources.
+#. :ref:`Pipelines `: Run pipelines that contain multiple tasks (which can have different resource requirements and ``setup``/``run`` commands). This is useful for running a sequence of tasks that depend on each other, e.g., data processing, training a model, and then running inference on it.
+
+
+.. _spot-jobs:
+
+Managed Spot Jobs
+-----------------
+
+In this mode, :code:`sky jobs launch --use-spot` is used to launch a managed spot job. SkyPilot automatically finds available spot resources across regions and clouds to maximize availability.
+Any spot preemptions are automatically handled by SkyPilot without user intervention.
+
+
+A quick comparison of *unmanaged spot clusters* vs. *managed spot jobs*:
+
+.. list-table::
+   :widths: 30 18 12 35
+   :header-rows: 1
+
+   * - Command
+     - Managed?
+     - SSH-able?
+     - Best for
+   * - :code:`sky launch --use-spot`
+     - Unmanaged spot cluster
+     - Yes
+     - Interactive dev on spot instances (especially for hardware with low preemption rates)
+   * - :code:`sky jobs launch --use-spot`
+     - Managed spot job (auto-recovery)
+     - No
+     - Scaling out long-running jobs (e.g., data processing, training, batch inference)
+
+Here is an example of a BERT training job failing over different regions across AWS and GCP.
+
+.. image:: https://i.imgur.com/Vteg3fK.gif
+   :width: 600
+   :alt: GIF for BERT training on Spot V100
+
+.. image:: ../images/spot-training.png
+   :width: 600
+   :alt: Static plot, BERT training on Spot V100
+
+To use managed spot jobs, there are two requirements:
+
+#. :ref:`Job YAML `: Managed Spot requires a YAML describing the job, which should first be tested with :code:`sky launch`.
+#. :ref:`Checkpointing ` (optional): For job recovery due to preemptions, the user application code can checkpoint its progress periodically to a :ref:`mounted cloud bucket `. The program can reload the latest checkpoint when restarted.
+
+
+.. _job-yaml:

+
+Job YAML
+~~~~~~~~
+
+To launch a managed job, you can simply reuse your job YAML (recommended to test it with :code:`sky launch` first).
+For example, say we have verified that the BERT fine-tuning YAML below works with :code:`sky launch`, and now want to
+launch it with SkyPilot managed spot jobs.
+
+We can launch it with the following:
+
+.. code-block:: console
+
+   $ sky jobs launch -n bert-qa bert_qa.yaml
+
+
+.. code-block:: yaml
+
+   # bert_qa.yaml
+   name: bert-qa
+
+   resources:
+     accelerators: V100:1
+     # Use spot instances to save cost.
+     use_spot: true
+
+   # Assume your working directory is under `~/transformers`.
+   # To make this example work, please run the following command:
+   # git clone https://github.com/huggingface/transformers.git ~/transformers -b v4.30.1
+   workdir: ~/transformers
+
+   setup: |
+     # Fill in your wandb key: copy from https://wandb.ai/authorize
+     # Alternatively, you can use `--env WANDB_API_KEY=$WANDB_API_KEY`
+     # to pass the key in the command line, during `sky jobs launch`.
+     echo export WANDB_API_KEY=[YOUR-WANDB-API-KEY] >> ~/.bashrc
+
+     pip install -e .
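+     # Move into the QA example and install its dependencies (including a pinned CUDA 11.3 torch build).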
+     cd examples/pytorch/question-answering/
+     pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
+     pip install wandb
+
+   run: |
+     cd ./examples/pytorch/question-answering/
+     python run_qa.py \
+     --model_name_or_path bert-base-uncased \
+     --dataset_name squad \
+     --do_train \
+     --do_eval \
+     --per_device_train_batch_size 12 \
+     --learning_rate 3e-5 \
+     --num_train_epochs 50 \
+     --max_seq_length 384 \
+     --doc_stride 128 \
+     --report_to wandb
+
+
+.. note::
+
+   :ref:`workdir ` and :ref:`file mounts with local files ` will be automatically uploaded to a
+   :ref:`cloud bucket `. The bucket will be created while the job is running, and cleaned up after the job
+   finishes.
+
+SkyPilot will launch and start monitoring the job. When a spot preemption or any machine failure happens, SkyPilot will automatically
+search for resources across regions and clouds to re-launch the job.
+
+In this example, the job will be restarted from scratch after each preemption recovery.
+To resume the job from previous states, the user's application needs to implement checkpointing and recovery.
+
+
+.. _checkpointing:
+
+Checkpointing and Recovery
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To allow job recovery, a cloud bucket is typically needed to store the job's states (e.g., model checkpoints).
+Below is an example of mounting a bucket to :code:`/checkpoint`.
+
+.. code-block:: yaml
+
+   file_mounts:
+     /checkpoint:
+       name: # NOTE: Fill in your bucket name
+       mode: MOUNT
+
+The :code:`MOUNT` mode in :ref:`SkyPilot bucket mounting ` ensures the checkpoints written to :code:`/checkpoint` are automatically synced to a persistent bucket.
+Note that the application code should save program checkpoints periodically and reload those states when the job is restarted.
+This is typically achieved by reloading the latest checkpoint at the beginning of your program.
+
+.. _spot-jobs-end-to-end:
+
+An End-to-End Example
+~~~~~~~~~~~~~~~~~~~~~
+
+Below we show an `example `_ for fine-tuning a BERT model on a question-answering task with HuggingFace.
+
+.. code-block:: yaml
+   :emphasize-lines: 13-16,43-46
+
+   # bert_qa.yaml
+   name: bert-qa
+
+   resources:
+     accelerators: V100:1
+     use_spot: true
+
+   # Assume your working directory is under `~/transformers`.
+   # To make this example work, please run the following command:
+   # git clone https://github.com/huggingface/transformers.git ~/transformers -b v4.30.1
+   workdir: ~/transformers
+
+   file_mounts:
+     /checkpoint:
+       name: # NOTE: Fill in your bucket name
+       mode: MOUNT
+
+   setup: |
+     # Fill in your wandb key: copy from https://wandb.ai/authorize
+     # Alternatively, you can use `--env WANDB_API_KEY=$WANDB_API_KEY`
+     # to pass the key in the command line, during `sky jobs launch`.
+     echo export WANDB_API_KEY=[YOUR-WANDB-API-KEY] >> ~/.bashrc
+
+     pip install -e .
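+     # Move into the QA example and install its dependencies, plus wandb for tracking.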
+     cd examples/pytorch/question-answering/
+     pip install -r requirements.txt
+     pip install wandb
+
+   run: |
+     cd ./examples/pytorch/question-answering/
+     python run_qa.py \
+     --model_name_or_path bert-base-uncased \
+     --dataset_name squad \
+     --do_train \
+     --do_eval \
+     --per_device_train_batch_size 12 \
+     --learning_rate 3e-5 \
+     --num_train_epochs 50 \
+     --max_seq_length 384 \
+     --doc_stride 128 \
+     --report_to wandb \
+     --run_name $SKYPILOT_TASK_ID \
+     --output_dir /checkpoint/bert_qa/ \
+     --save_total_limit 10 \
+     --save_steps 1000
+
+
+
+As HuggingFace has built-in support for periodic checkpointing, we only need to pass the highlighted arguments to set up
+the output directory and frequency of checkpointing (see more
+on the `HuggingFace API `_).
+You may also refer to another example `here `__ for periodically checkpointing with PyTorch.
+
+We also set :code:`--run_name` to :code:`$SKYPILOT_TASK_ID` so that the logs for all recoveries of the same job will be saved
+to the same run in Weights & Biases.
+
+.. note::
+   The environment variable :code:`$SKYPILOT_TASK_ID` (example: "sky-managed-2022-10-06-05-17-09-750781_bert-qa_8-0") can be used to identify the same job, i.e., it is kept identical across all
+   recoveries of the job.
+   It can be accessed in the task's :code:`run` commands or directly in the program itself (e.g., access
+   via :code:`os.environ` and pass to Weights & Biases for tracking purposes in your training script). It is made available to
+   the task whenever it is invoked.
+
+With the highlighted changes, the managed spot job can now resume training after preemption! We can enjoy the benefits of
+cost savings from spot instances without worrying about preemption or losing progress.
+
+.. code-block:: console
+
+   $ sky jobs launch -n bert-qa bert_qa.yaml
+
+.. tip::
+
+   Try copy-pasting this example and adapting it to your own job.
+
+
+
+Real-World Examples
+~~~~~~~~~~~~~~~~~~~
+
+* `Vicuna `_ LLM chatbot: `instructions `_, `YAML `__
+* BERT (shown above): `YAML `__
+* PyTorch DDP, ResNet: `YAML `__
+* PyTorch Lightning DDP, CIFAR-10: `YAML `__
+
+
+.. _on-demand:
+
+Using On-Demand Instances
+-------------------------
+
+The same ``sky jobs launch`` and YAML interfaces can run jobs on auto-recovering
+on-demand instances. This is useful when you want SkyPilot to monitor any underlying
+machine failures and transparently recover the job.
+
+To do so, simply set :code:`use_spot: false` in the :code:`resources` section, or override it with :code:`--use-spot false` in the CLI.
+
+.. code-block:: console
+
+   $ sky jobs launch -n bert-qa bert_qa.yaml --use-spot false
+
+.. tip::
+
+   It is useful to think of ``sky jobs launch`` as a "serverless" managed job
+   interface, while ``sky launch`` is a cluster interface (that you can launch
+   tasks on, albeit not managed).
+
+Either Spot or On-Demand
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can use ``any_of`` to specify either spot or on-demand instances as
+candidate resources for a job. See documentation :ref:`here
+` for more details.
+
+.. code-block:: yaml
+
+   resources:
+     accelerators: A100:8
+     any_of:
+       - use_spot: true
+       - use_spot: false
+
+In this example, SkyPilot will perform cost optimizations to select the resource to use, which will almost certainly
+be spot instances. If spot instances are not available, SkyPilot will fall back to launching on-demand instances.
+
+More advanced policies for resource selection, such as the `Can't Be Late
+`__ (NSDI'24)
+paper, may be supported in the future.
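+
+For example, assuming the ``any_of`` snippet above is saved into a job YAML (a hypothetical ``train.yaml``), it is launched like any other managed job:
+
+.. code-block:: console
+
+   $ # SkyPilot tries spot first and falls back to on-demand if spot is unavailable.
+   $ sky jobs launch -n train-any train.yaml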
+ +Useful CLIs +----------- + +Here are some commands for managed jobs. Check :code:`sky jobs --help` and :ref:`CLI reference ` for more details. + +See all managed jobs: + +.. code-block:: console + + $ sky jobs queue + +.. code-block:: console + + Fetching managed job statuses... + Managed jobs: + ID NAME RESOURCES SUBMITTED TOT. DURATION JOB DURATION #RECOVERIES STATUS + 2 roberta 1x [A100:8][Spot] 2 hrs ago 2h 47m 18s 2h 36m 18s 0 RUNNING + 1 bert-qa 1x [V100:1][Spot] 4 hrs ago 4h 24m 26s 4h 17m 54s 0 RUNNING + +Stream the logs of a running managed job: + +.. code-block:: console + + $ sky jobs logs -n bert-qa # by name + $ sky jobs logs 2 # by job ID + +Cancel a managed job: + +.. code-block:: console + + $ sky jobs cancel -n bert-qa # by name + $ sky jobs cancel 2 # by job ID + +.. note:: + If any failure happens for a managed job, you can check :code:`sky jobs queue -a` for the brief reason + of the failure. For more details, it would be helpful to check :code:`sky jobs logs --controller `. + + +.. _pipeline: + +Job Pipelines +------------- + +A pipeline is a managed job that contains a sequence of tasks running one after another. + +This is useful for running a sequence of tasks that depend on each other, e.g., training a model and then running inference on it. +Different tasks can have different resource requirements to use appropriate per-task resources, which saves costs, while keeping the burden of managing the tasks off the user. + +.. note:: + In other words, a managed job is either a single task or a pipeline of tasks. All managed jobs are submitted by :code:`sky jobs launch`. + +To run a pipeline, specify the sequence of tasks in a YAML file. Here is an example: + +.. code-block:: yaml + + name: pipeline + + --- + + name: train + + resources: + accelerators: V100:8 + any_of: + - use_spot: true + - use_spot: false + + file_mounts: + /checkpoint: + name: train-eval # NOTE: Fill in your bucket name + mode: MOUNT + + setup: | + echo setup for training + + run: | + echo run for training + echo save checkpoints to /checkpoint + + --- + + name: eval + + resources: + accelerators: T4:1 + use_spot: false + + file_mounts: + /checkpoint: + name: train-eval # NOTE: Fill in your bucket name + mode: MOUNT + + setup: | + echo setup for eval + + run: | + echo load trained model from /checkpoint + echo eval model on test set + + +The YAML above defines a pipeline with two tasks. The first :code:`name: +pipeline` names the pipeline. The first task has name :code:`train` and the +second task has name :code:`eval`. The tasks are separated by a line with three +dashes :code:`---`. Each task has its own :code:`resources`, :code:`setup`, and +:code:`run` sections. Tasks are executed sequentially. + +To submit the pipeline, the same command :code:`sky jobs launch` is used. The pipeline will be automatically launched and monitored by SkyPilot. You can check the status of the pipeline with :code:`sky jobs queue` or :code:`sky jobs dashboard`. + +.. code-block:: console + + $ sky jobs launch -n pipeline pipeline.yaml + $ sky jobs queue + Fetching managed job statuses... + Managed jobs + In progress jobs: 1 RECOVERING + ID TASK NAME RESOURCES SUBMITTED TOT. DURATION JOB DURATION #RECOVERIES STATUS + 8 pipeline - 50 mins ago 47m 45s - 1 RECOVERING + ↳ 0 train 1x [V100:8][Spot|On-demand] 50 mins ago 47m 45s - 1 RECOVERING + ↳ 1 eval 1x [T4:1] - - - 0 PENDING + +.. note:: + + The :code:`$SKYPILOT_TASK_ID` environment variable is also available in the :code:`run` section of each task. 
It is unique for each task in the pipeline.
+   For example, the :code:`$SKYPILOT_TASK_ID` for the :code:`eval` task above is:
+   "sky-managed-2022-10-06-05-17-09-750781_pipeline_eval_8-1".
+
+
+
+Dashboard
+---------
+
+Use ``sky jobs dashboard`` to open a dashboard to see all jobs:
+
+.. code-block:: console
+
+   $ sky jobs dashboard
+
+This automatically opens a browser tab to show the dashboard:
+
+.. image:: ../images/job-dashboard.png
+
+The UI shows the same information as the CLI ``sky jobs queue -a``. The UI is
+especially useful when there are many in-progress jobs to monitor, which the
+terminal-based CLI may need more than one page to display.
+
+
+Concept: Jobs Controller
+------------------------
+
+The jobs controller is a small on-demand CPU VM running in the cloud that manages all jobs of a user.
+It is automatically launched when the first managed job is submitted, and it is autostopped after it has been idle for 10 minutes (i.e., after all managed jobs finish and no new managed job is submitted in that duration).
+Thus, **no user action is needed** to manage its lifecycle.
+
+You can see the controller with :code:`sky status` and refresh its status by using the :code:`-r/--refresh` flag.
+
+While the cost of the jobs controller is negligible (~$0.4/hour when running and less than $0.004/hour when stopped),
+you can still tear it down manually with
+:code:`sky down `, where the ```` can be found in the output of :code:`sky status`.
+
+.. note::
+   Tearing down the jobs controller loses all logs and status information for the finished managed jobs. It is only allowed when there are no in-progress managed jobs to ensure no resource leakage.
+
+Customizing Job Controller Resources
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You may want to customize the resources of the jobs controller for several reasons:
+
+#. Changing the maximum number of jobs that can be run concurrently, which is 2x the vCPUs of the controller. (Default: 16)
+#. Using a lower-cost controller (if you have a low number of concurrent managed jobs).
+#. Running the jobs controller in a specific location. (Default: cheapest location)
+#. Changing the disk_size of the jobs controller to store more logs. (Default: 50GB)
+
+To achieve the above, you can specify custom configs in :code:`~/.sky/config.yaml` with the following fields:
+
+.. code-block:: yaml
+
+   jobs:
+     # NOTE: these settings only take effect for a new jobs controller, not if
+     # you have an existing one.
+     controller:
+       resources:
+         # All configs below are optional.
+         # Specify the location of the jobs controller.
+         cloud: gcp
+         region: us-central1
+         # Specify the maximum number of managed jobs that can be run concurrently.
+         cpus: 4+  # number of vCPUs, max concurrent jobs = 2 * cpus
+         # Specify the disk_size in GB of the jobs controller.
+         disk_size: 100
+
+The :code:`resources` field has the same spec as a normal SkyPilot job; see `here `__.
+
+.. note::
+   These settings will not take effect if you have an existing controller (either
+   stopped or live). For them to take effect, tear down the existing controller
+   first, which requires all in-progress jobs to finish or be canceled.
+
diff --git a/docs/source/examples/spot-jobs.rst b/docs/source/examples/spot-jobs.rst
index 5940e404bb3..2b3df600425 100644
--- a/docs/source/examples/spot-jobs.rst
+++ b/docs/source/examples/spot-jobs.rst
@@ -1,389 +1,23 @@
-.. _spot-jobs:
-
 Managed Spot Jobs
-================================================
-
-.. 
tip:: - - This feature is great for scaling out: running a single job for long durations, or running many jobs. - -SkyPilot supports managed spot jobs that can **automatically recover from preemptions**. -This feature **saves significant cost** (e.g., up to 70\% for GPU VMs) by making preemptible spot instances practical for long-running jobs. - -SkyPilot automatically finds available spot resources across regions and clouds to maximize availability. -Here is an example of a BERT training job failing over different regions across AWS and GCP. - -.. image:: https://i.imgur.com/Vteg3fK.gif - :width: 600 - :alt: GIF for BERT training on Spot V100 - -.. image:: ../images/spot-training.png - :width: 600 - :alt: Static plot, BERT training on Spot V100 - -To use managed spot jobs, there are two requirements: - -#. **Task YAML**: Managed Spot requires a YAML to describe the job, tested with :code:`sky launch`. -#. **Checkpointing** (optional): For job recovery due to preemptions, the user application code can checkpoint its progress periodically to a :ref:`mounted cloud bucket `. The program can reload the latest checkpoint when restarted. - - -Task YAML ---------- - -To launch a spot job, you can simply reuse your task YAML (recommended to test it with :code:`sky launch` first). -For example, we found the BERT fine-tuning YAML works with :code:`sky launch`, and want to -launch it with SkyPilot managed spot jobs. - -We can launch it with the following: - -.. code-block:: console - - $ sky spot launch -n bert-qa bert_qa.yaml - - -.. code-block:: yaml - - # bert_qa.yaml - name: bert-qa - - resources: - accelerators: V100:1 - - # Assume your working directory is under `~/transformers`. - # To make this example work, please run the following command: - # git clone https://github.com/huggingface/transformers.git ~/transformers -b v4.30.1 - workdir: ~/transformers - - setup: | - # Fill in your wandb key: copy from https://wandb.ai/authorize - # Alternatively, you can use `--env WANDB_API_KEY=$WANDB_API_KEY` - # to pass the key in the command line, during `sky spot launch`. - echo export WANDB_API_KEY=[YOUR-WANDB-API-KEY] >> ~/.bashrc - - pip install -e . - cd examples/pytorch/question-answering/ - pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113 - pip install wandb - - run: | - cd ./examples/pytorch/question-answering/ - python run_qa.py \ - --model_name_or_path bert-base-uncased \ - --dataset_name squad \ - --do_train \ - --do_eval \ - --per_device_train_batch_size 12 \ - --learning_rate 3e-5 \ - --num_train_epochs 50 \ - --max_seq_length 384 \ - --doc_stride 128 \ - --report_to wandb - - -.. note:: - - :ref:`workdir ` and :ref:`file mounts with local files ` will be automatically uploaded to a - :ref:`cloud bucket `. The bucket will be created during the job running time, and cleaned up after the job - finishes. - -SkyPilot will launch and start monitoring the spot job. When a preemption happens, SkyPilot will automatically -search for resources across regions and clouds to re-launch the job. - -In this example, the job will be restarted from scratch after each preemption recovery. -To resume the job from previous states, user's application needs to implement checkpointing and recovery. - - -Checkpointing and recovery --------------------------- - -To allow spot recovery, a cloud bucket is typically needed to store the job's states (e.g., model checkpoints). -Below is an example of mounting a bucket to :code:`/checkpoint`. - -.. 
code-block:: yaml - - file_mounts: - /checkpoint: - name: # NOTE: Fill in your bucket name - mode: MOUNT - -The :code:`MOUNT` mode in :ref:`SkyPilot bucket mounting ` ensures the checkpoints outputted to :code:`/checkpoint` are automatically synced to a persistent bucket. -Note that the application code should save program checkpoints periodically and reload those states when the job is restarted. -This is typically achieved by reloading the latest checkpoint at the beginning of your program. - -.. _spot-jobs-end-to-end: - -An end-to-end example ---------------------- - -Below we show an `example `_ for fine-tuning a BERT model on a question-answering task with HuggingFace. - -.. code-block:: yaml - :emphasize-lines: 12-15,41-44 - - # bert_qa.yaml - name: bert-qa - - resources: - accelerators: V100:1 - - # Assume your working directory is under `~/transformers`. - # To make this example work, please run the following command: - # git clone https://github.com/huggingface/transformers.git ~/transformers -b v4.30.1 - workdir: ~/transformers - - file_mounts: - /checkpoint: - name: # NOTE: Fill in your bucket name - mode: MOUNT - - setup: | - # Fill in your wandb key: copy from https://wandb.ai/authorize - # Alternatively, you can use `--env WANDB_API_KEY=$WANDB_API_KEY` - # to pass the key in the command line, during `sky spot launch`. - echo export WANDB_API_KEY=[YOUR-WANDB-API-KEY] >> ~/.bashrc - - pip install -e . - cd examples/pytorch/question-answering/ - pip install -r requirements.txt - pip install wandb - - run: | - cd ./examples/pytorch/question-answering/ - python run_qa.py \ - --model_name_or_path bert-base-uncased \ - --dataset_name squad \ - --do_train \ - --do_eval \ - --per_device_train_batch_size 12 \ - --learning_rate 3e-5 \ - --num_train_epochs 50 \ - --max_seq_length 384 \ - --doc_stride 128 \ - --report_to wandb \ - --run_name $SKYPILOT_TASK_ID \ - --output_dir /checkpoint/bert_qa/ \ - --save_total_limit 10 \ - --save_steps 1000 - - - -As HuggingFace has built-in support for periodically checkpointing, we only need to pass the highlighted arguments for setting up -the output directory and frequency of checkpointing (see more -on `Huggingface API `_). -You may also refer to another example `here `__ for periodically checkpointing with PyTorch. - -We also set :code:`--run_name` to :code:`$SKYPILOT_TASK_ID` so that the logs for all recoveries of the same job will be saved -to the same run in Weights & Biases. - -.. note:: - The environment variable :code:`$SKYPILOT_TASK_ID` (example: "sky-managed-2022-10-06-05-17-09-750781_pipeline_eval_8-1") can be used to identify the same job, i.e., it is kept identical across all - recoveries of the job. - It can be accessed in the task's :code:`run` commands or directly in the program itself (e.g., access - via :code:`os.environ` and pass to Weights & Biases for tracking purposes in your training script). It is made available to - the task whenever it is invoked. - -With the highlighted changes, the managed spot job can now resume training after preemption with ``sky spot launch``! We can enjoy the benefits of -cost savings from spot instances without worrying about preemption or losing progress. - -.. code-block:: console - - $ sky spot launch -n bert-qa bert_qa.yaml - -.. tip:: - - Try copy-paste this example and adapt it to your own job. - - -Useful CLIs ------------ - -Here are some commands for managed spot jobs. Check :code:`sky spot --help` for more details. - -See all spot jobs: - -.. 
code-block:: console - - $ sky spot queue - -.. code-block:: console - - Fetching managed spot job statuses... - Managed spot jobs: - ID NAME RESOURCES SUBMITTED TOT. DURATION JOB DURATION #RECOVERIES STATUS - 2 roberta 1x [A100:8] 2 hrs ago 2h 47m 18s 2h 36m 18s 0 RUNNING - 1 bert-qa 1x [V100:1] 4 hrs ago 4h 24m 26s 4h 17m 54s 0 RUNNING - -Stream the logs of a running spot job: - -.. code-block:: console - - $ sky spot logs -n bert-qa # by name - $ sky spot logs 2 # by job ID - -Cancel a spot job: - -.. code-block:: console - - $ sky spot cancel -n bert-qa # by name - $ sky spot cancel 2 # by job ID - -.. note:: - If any failure happens for a spot job, you can check :code:`sky spot queue -a` for the brief reason - of the failure. For more details, it would be helpful to check :code:`sky spot logs --controller `. - -Dashboard ------------ - -Use ``sky spot dashboard`` to open a dashboard to see all jobs: - -.. code-block:: console - - $ sky spot dashboard - -This automatically opens a browser tab to show the dashboard: - -.. image:: ../images/spot-dashboard.png - -The UI shows the same information as the CLI ``sky spot queue -a``. The UI is -especially useful when there are many in-progress jobs to monitor, which the -terminal-based CLI may need more than one page to display. - -Real-world examples -------------------------- - -* `Vicuna `_ LLM chatbot: `instructions `_, `YAML `__ -* BERT (shown above): `YAML `__ -* PyTorch DDP, ResNet: `YAML `__ -* PyTorch Lightning DDP, CIFAR-10: `YAML `__ - -Spot controller -------------------------------- - -The spot controller is a small on-demand CPU VM running in the cloud that manages all spot jobs of a user. -It is automatically launched when the first managed spot job is submitted, and it is autostopped after it has been idle for 10 minutes (i.e., after all spot jobs finish and no new spot job is submitted in that duration). -Thus, **no user action is needed** to manage its lifecycle. - -You can see the controller with :code:`sky status` and refresh its status by using the :code:`-r/--refresh` flag. - -While the cost of the spot controller is negligible (~$0.4/hour when running and less than $0.004/hour when stopped), -you can still tear it down manually with -:code:`sky down `, where the ```` can be found in the output of :code:`sky status`. - -.. note:: - Tearing down the spot controller loses all logs and status information for the finished spot jobs. It is only allowed when there are no in-progress spot jobs to ensure no resource leakage. - -Customizing spot controller resources -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You may want to customize the resources of the spot controller for several reasons: - -1. Use a lower-cost controller (if you have a low number of concurrent spot jobs). -2. Enforcing the spot controller to run on a specific location. (Default: cheapest location) -3. Changing the maximum number of spot jobs that can be run concurrently, which is 2x the vCPUs of the controller. (Default: 16) -4. Changing the disk_size of the spot controller to store more logs. (Default: 50GB) - -To achieve the above, you can specify custom configs in :code:`~/.sky/config.yaml` with the following fields: - -.. code-block:: yaml - - spot: - # NOTE: these settings only take effect for a new spot controller, not if - # you have an existing one. - controller: - resources: - # All configs below are optional. - # Specify the location of the spot controller. 
- cloud: gcp - region: us-central1 - # Specify the maximum number of spot jobs that can be run concurrently. - cpus: 4+ # number of vCPUs, max concurrent spot jobs = 2 * cpus - # Specify the disk_size in GB of the spot controller. - disk_size: 100 - -The :code:`resources` field has the same spec as a normal SkyPilot job; see `here `__. - -.. note:: - These settings will not take effect if you have an existing controller (either - stopped or live). For them to take effect, tear down the existing controller - first, which requires all in-progress spot jobs to finish or be canceled. - - -Spot Pipeline -------------------------- - -Spot Pipeline is a feature that allows you to submit a spot job that contains a sequence of spot tasks running one after another. -This is useful for running a sequence of jobs that depend on each other, e.g., training a model and then running inference on it. -This allows the multiple tasks to have different resource requirements to fully utilize the resources and save cost, while keeping the burden of managing the tasks off the user. - -.. note:: - A spot job is either a single task or a pipeline of tasks. A spot job is submitted by :code:`sky spot launch`. - - All tasks in a pipeline will be run on spot instances. - -To use Spot Pipeline, you can specify the sequence of jobs in a YAML file. Here is an example: - -.. code-block:: yaml - - name: pipeline - - --- - - name: train - - resources: - accelerators: V100:8 - - file_mounts: - /checkpoint: - name: train-eval # NOTE: Fill in your bucket name - mode: MOUNT - - setup: | - echo setup for training - - run: | - echo run for training - echo save checkpoints to /checkpoint - - --- - - name: eval - - resources: - accelerators: T4:1 - - file_mounts: - /checkpoint: - name: train-eval # NOTE: Fill in your bucket name - mode: MOUNT - - setup: | - echo setup for eval - - run: | - echo load trained model from /checkpoint - echo eval model on test set - - -The above YAML file defines a pipeline with two tasks. The first :code:`name: pipeline` names the pipeline. The first task has name :code:`train` and the second task has name :code:`eval`. The tasks are separated by a line with three dashes :code:`---`. Each task has its own :code:`resources`, :code:`setup`, and :code:`run` sections. The :code:`setup` and :code:`run` sections are executed sequentially. - -To submit the pipeline, the same command :code:`sky spot launch` is used. The pipeline will be automatically launched and monitored by SkyPilot. You can check the status of the pipeline with :code:`sky spot queue` or :code:`sky spot dashboard`. - -.. note:: - - The :code:`$SKYPILOT_TASK_ID` environment variable is also available in the :code:`run` section of each task. It is unique for each task in the pipeline. - For example, the :code:`$SKYPILOT_TASK_ID` for the :code:`eval` task above is: - "sky-managed-2022-10-06-05-17-09-750781_pipeline_eval_8-1". - -.. code-block:: console - - $ sky spot launch -n pipeline pipeline.yaml - $ sky spot queue - Fetching managed spot job statuses... - Managed spot jobs - In progress tasks: 1 PENDING, 1 RECOVERING - ID TASK NAME RESOURCES SUBMITTED TOT. DURATION JOB DURATION #RECOVERIES STATUS - 8 pipeline - 50 mins ago 47m 45s - 1 RECOVERING - ↳ 0 train 1x [V100:8] 50 mins ago 47m 45s - 1 RECOVERING - ↳ 1 eval 1x [T4:1] - - - 0 PENDING - +================== + +.. 
 raw:: html + + diff --git a/docs/source/getting-started/installation.rst b/docs/source/getting-started/installation.rst index 490d2d76311..e5b318d4f87 100644 --- a/docs/source/getting-started/installation.rst +++ b/docs/source/getting-started/installation.rst @@ -164,6 +164,10 @@ section :ref:`below `. If your clouds show ``enabled`` --- |:tada:| |:tada:| **Congratulations!** |:tada:| |:tada:| You can now head over to :ref:`Quickstart ` to get started with SkyPilot. +.. tip:: + + To check credentials only for specific clouds, pass the clouds as arguments: :code:`sky check aws gcp` + .. _cloud-account-setup: Cloud account setup @@ -308,9 +312,10 @@ Cudo Compute ~~~~~~~~~~~~~~~~~~ `Cudo Compute `__ GPU cloud provides low-cost GPUs powered by green energy. - -1. Create an API Key by following `this guide `__. -2. Download and install the `cudoctl `__ command line tool +1. Create a billing account by following `this guide `__. +2. Create a project ``__. +3. Create an API Key by following `this guide `__. +4. Download and install the `cudoctl `__ command line tool -3. Run :code:`cudoctl init`: +5. Run :code:`cudoctl init`: .. code-block:: shell @@ -322,7 +327,7 @@ Cudo Compute ✔ context: default config file saved ~/.config/cudo/cudo.yml - pip install "cudocompute>=0.1.8" + pip install "cudo-compute>=0.1.10" If you want to use SkyPilot with a different Cudo Compute account or project, just run :code:`cudoctl init` again. diff --git a/docs/source/images/gcp-vm.png b/docs/source/images/gcp-vm.png new file mode 100644 index 00000000000..262a47318fd Binary files /dev/null and b/docs/source/images/gcp-vm.png differ diff --git a/docs/source/images/job-dashboard.png b/docs/source/images/job-dashboard.png new file mode 100644 index 00000000000..6c25d7d83bc Binary files /dev/null and b/docs/source/images/job-dashboard.png differ diff --git a/docs/source/images/k8s-pod.png b/docs/source/images/k8s-pod.png new file mode 100644 index 00000000000..d87eb168d5b Binary files /dev/null and b/docs/source/images/k8s-pod.png differ diff --git a/docs/source/images/managed-jobs-arch.png b/docs/source/images/managed-jobs-arch.png new file mode 100644 index 00000000000..c78f0680331 Binary files /dev/null and b/docs/source/images/managed-jobs-arch.png differ diff --git a/docs/source/images/spot-controller.png b/docs/source/images/spot-controller.png deleted file mode 100644 index fce9a1d8cc2..00000000000 Binary files a/docs/source/images/spot-controller.png and /dev/null differ diff --git a/docs/source/images/spot-dashboard.png b/docs/source/images/spot-dashboard.png deleted file mode 100644 index 2b322418350..00000000000 Binary files a/docs/source/images/spot-dashboard.png and /dev/null differ diff --git a/docs/source/reference/cli.rst b/docs/source/reference/cli.rst index d4c45ce7b82..985f63482b6 100644 --- a/docs/source/reference/cli.rst +++ b/docs/source/reference/cli.rst @@ -3,8 +3,8 @@ Command Line Interface ====================== -Core CLI ---------- +Cluster CLI +----------- .. _sky-launch: .. click:: sky.cli:launch :prog: sky launch :nested: full @@ -41,9 +41,6 @@ Core CLI :prog: sky autostop :nested: full -Job Queue CLI --------------- - .. _sky-queue: .. click:: sky.cli:queue :prog: sky queue :nested: full @@ -59,7 +56,31 @@ Job Queue CLI :prog: sky cancel :nested: full -Sky Serve CLI +Managed (Spot) Jobs CLI +--------------------------- + +.. _sky-job-launch: +.. click:: sky.cli:jobs_launch + :prog: sky jobs launch + :nested: full + +.. _sky-job-queue: +.. click:: sky.cli:jobs_queue + :prog: sky jobs queue + :nested: full + +.. _sky-job-cancel: +.. 
click:: sky.cli:jobs_cancel + :prog: sky jobs cancel + :nested: full + +.. _sky-job-logs: +.. click:: sky.cli:jobs_logs + :prog: sky jobs logs + :nested: full + + +SkyServe CLI ------------- .. click:: sky.cli:serve_up @@ -82,28 +103,6 @@ Sky Serve CLI :prog: sky serve update :nested: full -Managed Spot Jobs CLI ---------------------------- - -.. _sky-spot-launch: -.. click:: sky.cli:spot_launch - :prog: sky spot launch - :nested: full - -.. _sky-spot-queue: -.. click:: sky.cli:spot_queue - :prog: sky spot queue - :nested: full - -.. _sky-spot-cancel: -.. click:: sky.cli:spot_cancel - :prog: sky spot cancel - :nested: full - -.. _sky-spot-logs: -.. click:: sky.cli:spot_logs - :prog: sky spot logs - :nested: full Storage CLI ------------ diff --git a/docs/source/reference/config.rst b/docs/source/reference/config.rst index 3e1214d958d..74cd2c01092 100644 --- a/docs/source/reference/config.rst +++ b/docs/source/reference/config.rst @@ -14,12 +14,12 @@ Available fields and semantics: .. code-block:: yaml - # Custom spot controller resources (optional). + # Custom managed jobs controller resources (optional). # - # These take effects only when a spot controller does not already exist. + # These take effects only when a managed jobs controller does not already exist. # - # Ref: https://skypilot.readthedocs.io/en/latest/examples/spot-jobs.html#customizing-spot-controller-resources - spot: + # Ref: https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html#customizing-job-controller-resources + jobs: controller: resources: # same spec as 'resources' in a task YAML cloud: gcp @@ -27,6 +27,19 @@ Available fields and semantics: cpus: 4+ # number of vCPUs, max concurrent spot jobs = 2 * cpus disk_size: 100 + # Allow list for clouds to be used in `sky check` + # + # This field is used to restrict the clouds that SkyPilot will check and use + # when running `sky check`. Any cloud already enabled but not specified here + # will be disabled on the next `sky check` run. + # If this field is not set, SkyPilot will check and use all supported clouds. + # + # Default: null (use all supported clouds). + allowed_clouds: + - aws + - gcp + - kubernetes + # Advanced AWS configurations (optional). # Apply to all new instances but not existing ones. aws: @@ -36,7 +49,7 @@ Available fields and semantics: # # Users should guarantee that these key-values are valid AWS tags, otherwise # errors from the cloud provider will be surfaced. - instance_tags: + labels: # (Example) AWS Migration Acceleration Program (MAP). This tag enables the # program's discounts. # Ref: https://docs.aws.amazon.com/mgn/latest/ug/map-program-tagging.html @@ -109,24 +122,39 @@ Available fields and semantics: # permission to create a security group. security_group_name: my-security-group - # Identity to use for all AWS instances (optional). + # Identity to use for AWS instances (optional). # # LOCAL_CREDENTIALS: The user's local credential files will be uploaded to # AWS instances created by SkyPilot. They are used for accessing cloud # resources (e.g., private buckets) or launching new instances (e.g., for - # spot/serve controllers). + # jobs/serve controllers). # # SERVICE_ACCOUNT: Local credential files are not uploaded to AWS # instances. SkyPilot will auto-create and reuse a service account (IAM # role) for AWS instances. # + # Customized service account (IAM role): or + # - : apply the service account with the specified name to all instances. 
+ # Example: + # remote_identity: my-service-account-name + # - : A list of single-element dict mapping from the cluster name (pattern) + # to the service account name to use. The matching of the cluster name is done in the same order + # as the list. + # NOTE: If none of the wildcard expressions in the dict match the cluster name, LOCAL_CREDENTIALS will be used. + # To specify your default, use "*" as the wildcard expression. + # Example: + # remote_identity: + # - my-cluster-name: my-service-account-1 + # - sky-serve-controller-*: my-service-account-2 + # - "*": my-default-service-account + # # Two caveats of SERVICE_ACCOUNT for multicloud users: # # - This only affects AWS instances. Local AWS credentials will still be # uploaded to non-AWS instances (since those instances may need to access # AWS resources). - # - If the SkyPilot spot/serve controller is on AWS, this setting will make - # non-AWS managed spot jobs / non-AWS service replicas fail to access any + # - If the SkyPilot jobs/serve controller is on AWS, this setting will make + # non-AWS managed jobs / non-AWS service replicas fail to access any # resources on AWS (since the controllers don't have AWS credential # files to assign to these non-AWS instances). # @@ -142,9 +170,9 @@ Available fields and semantics: # # Users should guarantee that these key-values are valid GCP labels, otherwise # errors from the cloud provider will be surfaced. - instance_tags: + labels: Owner: user-unique-name - my-tag: my-value + my-label: my-value # VPC to use (optional). # @@ -190,21 +218,21 @@ Available fields and semantics: # Reserved capacity (optional). - # + # # Whether to prioritize reserved instance types/locations (considered as 0 # cost) in the optimizer. - # + # # If you have "automatically consumed" reservations in your GCP project: # Setting this to true guarantees the optimizer will pick any matching # reservation and GCP will auto consume your reservation, and setting to # false means optimizer uses regular, non-zero pricing in optimization (if # by chance any matching reservation is selected, GCP still auto consumes # the reservation). - # + # # If you have "specifically targeted" reservations (set by the # `specific_reservations` field below): This field will automatically be set # to true. - # + # # Default: false. prioritize_reservations: false # @@ -248,7 +276,7 @@ Available fields and semantics: # LOCAL_CREDENTIALS: The user's local credential files will be uploaded to # GCP instances created by SkyPilot. They are used for accessing cloud # resources (e.g., private buckets) or launching new instances (e.g., for - # spot/serve controllers). + # jobs/serve controllers). # # SERVICE_ACCOUNT: Local credential files are not uploaded to GCP # instances. SkyPilot will auto-create and reuse a service account for GCP @@ -259,8 +287,8 @@ Available fields and semantics: # - This only affects GCP instances. Local GCP credentials will still be # uploaded to non-GCP instances (since those instances may need to access # GCP resources). - # - If the SkyPilot spot/serve controller is on GCP, this setting will make - # non-GCP managed spot jobs / non-GCP service replicas fail to access any + # - If the SkyPilot jobs/serve controller is on GCP, this setting will make + # non-GCP managed jobs / non-GCP service replicas fail to access any # resources on GCP (since the controllers don't have GCP credential # files to assign to these non-GCP instances). 
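 + # + # For example, to avoid uploading local GCP credentials and rely on an + # auto-created service account instead (illustrative): + # remote_identity: SERVICE_ACCOUNT 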
# @@ -286,8 +314,7 @@ Available fields and semantics: # The mode to use for opening ports on Kubernetes # - # This must be either: 'ingress' or 'loadbalancer'. If not specified, - # defaults to 'loadbalancer'. + # This must be either: 'loadbalancer', 'ingress' or 'podip'. # # loadbalancer: Creates services of type `LoadBalancer` to expose ports. # See https://skypilot.readthedocs.io/en/latest/reference/kubernetes/kubernetes-setup.html#loadbalancer-service. @@ -298,8 +325,40 @@ Available fields and semantics: # Requires an Nginx ingress controller to be configured on the Kubernetes cluster. # Refer to https://skypilot.readthedocs.io/en/latest/reference/kubernetes/kubernetes-setup.html#nginx-ingress # for details on deploying the NGINX ingress controller. + # + # podip: Directly returns the IP address of the pod. This mode does not + # create any Kubernetes services and is a lightweight way to expose ports. + # NOTE - ports exposed with podip mode are not accessible from outside the + # Kubernetes cluster. This mode is useful for hosting internal services + # that need to be accessed only by other pods in the same cluster. + # + # Default: loadbalancer ports: loadbalancer + # Identity to use for all Kubernetes pods (optional). + # + # LOCAL_CREDENTIALS: The user's local ~/.kube/config will be uploaded to the + # Kubernetes pods created by SkyPilot. They are used for authenticating with + # the Kubernetes API server and launching new pods (e.g., for + # spot/serve controllers). + # + # SERVICE_ACCOUNT: Local ~/.kube/config is not uploaded to Kubernetes pods. + # SkyPilot will auto-create and reuse a service account with necessary roles + # in the user's namespace. + # + # : The name of a service account to use for all Kubernetes pods. + # This service account must exist in the user's namespace and have all + # necessary permissions. Refer to https://skypilot.readthedocs.io/en/latest/cloud-setup/cloud-permissions/kubernetes.html + # for details on the roles required by the service account. + # + # Using SERVICE_ACCOUNT or a custom service account only affects Kubernetes + # instances. Local ~/.kube/config will still be uploaded to non-Kubernetes + # instances (e.g., a serve controller on GCP or AWS may need to provision + # Kubernetes resources). + # + # Default: 'SERVICE_ACCOUNT'. + remote_identity: my-k8s-service-account + # Attach custom metadata to Kubernetes objects created by SkyPilot # # Uses the same schema as Kubernetes metadata object: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/#objectmeta-v1-meta @@ -330,6 +389,25 @@ Available fields and semantics: # Default: 10 seconds provision_timeout: 10 + # Autoscaler configured in the Kubernetes cluster (optional) + # + # This field informs SkyPilot about the cluster autoscaler used in the + # Kubernetes cluster. Setting this field disables pre-launch checks for + # GPU capacity in the cluster and SkyPilot relies on the autoscaler to + # provision nodes with the required GPU capacity. + # + # Remember to set provision_timeout accordingly when using an autoscaler. 
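 + # + # For example, an autoscaled GKE cluster might pair this hint with a + # longer provisioning timeout (both values below are illustrative): + # autoscaler: gke + # provision_timeout: 900 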
+ # + # Supported values: gke, karpenter, generic + # gke: uses cloud.google.com/gke-accelerator label to identify GPUs on nodes + # karpenter: uses karpenter.k8s.aws/instance-gpu-name label to identify GPUs on nodes + # generic: uses skypilot.co/accelerator labels to identify GPUs on nodes + # Refer to https://skypilot.readthedocs.io/en/latest/reference/kubernetes/kubernetes-setup.html#setting-up-gpu-support + # for more details on setting up labels for GPU support. + # + # Default: null (no autoscaler, autodetect label format for GPU nodes) + autoscaler: gke + # Additional fields to override the pod fields used by SkyPilot (optional) # # Any key:value pairs added here would get added to the pod spec used to diff --git a/docs/source/reference/job-queue.rst b/docs/source/reference/job-queue.rst index 47be2012365..6397c7bbbb6 100644 --- a/docs/source/reference/job-queue.rst +++ b/docs/source/reference/job-queue.rst @@ -1,7 +1,7 @@ .. _job-queue: -Job Queue -========= +Cluster Job Queue +================= SkyPilot's **job queue** allows multiple jobs to be scheduled on a cluster. diff --git a/docs/source/reference/kubernetes/index.rst b/docs/source/reference/kubernetes/index.rst index 1edfde01240..4087fd968be 100644 --- a/docs/source/reference/kubernetes/index.rst +++ b/docs/source/reference/kubernetes/index.rst @@ -3,247 +3,107 @@ Running on Kubernetes ============================= -.. note:: - Kubernetes support is under active development. `Please share your feedback `_ - or `directly reach out to the development team `_ - for feature requests and more. - SkyPilot tasks can be run on your private on-prem or cloud Kubernetes clusters. The Kubernetes cluster gets added to the list of "clouds" in SkyPilot and SkyPilot tasks can be submitted to your Kubernetes cluster just like any other cloud provider. -**Benefits of using SkyPilot to run jobs on your Kubernetes cluster:** - -* Get SkyPilot features (setup management, job execution, queuing, logging, SSH access) on your Kubernetes resources -* Replace complex Kubernetes manifests with simple SkyPilot tasks -* Seamlessly "burst" jobs to the cloud if your Kubernetes cluster is congested -* Retain observability and control over your cluster with your existing Kubernetes tools - -**Supported Kubernetes deployments:** - -* Hosted Kubernetes services (EKS, GKE) -* On-prem clusters (Kubeadm, Rancher) -* Local development clusters (KinD, minikube) - - -Kubernetes Cluster Requirements +Why use SkyPilot on Kubernetes? ------------------------------- -To connect and use a Kubernetes cluster, SkyPilot needs: - -* An existing Kubernetes cluster running Kubernetes v1.20 or later. -* A `Kubeconfig `_ file containing access credentials and namespace to be used. - -In a typical workflow: - -1. A cluster administrator sets up a Kubernetes cluster. Detailed admin guides for - different deployment environments (Amazon EKS, Google GKE, On-Prem and local debugging) are included in the :ref:`Kubernetes cluster setup guide `. - -2. Users who want to run SkyPilot tasks on this cluster are issued Kubeconfig - files containing their credentials (`kube-context `_). - SkyPilot reads this Kubeconfig file to communicate with the cluster. - -Submitting SkyPilot tasks to Kubernetes Clusters ------------------------------------------------- -.. _kubernetes-instructions: - -Once your cluster administrator has :ref:`setup a Kubernetes cluster ` and provided you with a kubeconfig file: - -0. 
Make sure `kubectl `_, ``socat`` and ``nc`` (netcat) are installed on your local machine. - - .. code-block:: console - - $ # MacOS - $ brew install kubectl socat netcat - - $ # Linux (may have socat already installed) - $ sudo apt-get install kubectl socat netcat - - -1. Place your kubeconfig file at ``~/.kube/config``. - - .. code-block:: console - - $ mkdir -p ~/.kube - $ cp /path/to/kubeconfig ~/.kube/config - - You can verify your credentials are setup correctly by running :code:`kubectl get pods`. - -2. Run :code:`sky check` and verify that Kubernetes is enabled in SkyPilot. - - .. code-block:: console - - $ sky check - - Checking credentials to enable clouds for SkyPilot. - ... - Kubernetes: enabled - ... - - - .. note:: - :code:`sky check` will also check if GPU support is available on your cluster. If GPU support is not available, it - will show the reason. - To setup GPU support on the cluster, refer to the :ref:`Kubernetes cluster setup guide `. - -4. You can now run any SkyPilot task on your Kubernetes cluster. - - .. code-block:: console +.. tab-set:: - $ sky launch --cpus 2+ task.yaml - == Optimizer == - Target: minimizing cost - Estimated cost: $0.0 / hour + .. tab-item:: For AI Developers + :sync: why-ai-devs-tab - Considered resources (1 node): - --------------------------------------------------------------------------------------------------- - CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN - --------------------------------------------------------------------------------------------------- - Kubernetes 2CPU--2GB 2 2 - kubernetes 0.00 ✔ - AWS m6i.large 2 8 - us-east-1 0.10 - Azure Standard_D2s_v5 2 8 - eastus 0.10 - GCP n2-standard-2 2 8 - us-central1 0.10 - IBM bx2-8x32 8 32 - us-east 0.38 - Lambda gpu_1x_a10 30 200 A10:1 us-east-1 0.60 - ---------------------------------------------------------------------------------------------------. + .. grid:: 2 + :gutter: 3 + .. grid-item-card:: ✅ Ease of use + :text-align: center -.. note:: - SkyPilot will use the cluster and namespace set in the ``current-context`` in the - kubeconfig file. To manage your ``current-context``: + .. + TODO(romilb): We should have a comparison of a popular Kubernetes manifest vs a SkyPilot YAML in terms of LoC in a mini blog and link it here. - .. code-block:: console + No complex kubernetes manifests - write a simple SkyPilot YAML and run with one command ``sky launch``. - $ # See current context - $ kubectl config current-context + .. grid-item-card:: 📋 Interactive development on Kubernetes + :text-align: center - $ # Switch current-context - $ kubectl config use-context mycontext + :ref:`SSH access to pods `, :ref:`VSCode integration `, :ref:`job management `, :ref:`autodown idle pods ` and more. - $ # Set a specific namespace to be used in the current-context - $ kubectl config set-context --current --namespace=mynamespace + .. grid-item-card:: ☁️ Burst to the cloud + :text-align: center + Kubernetes cluster is full? SkyPilot :ref:`seamlessly gets resources on the cloud ` to get your job running sooner. -Using Custom Images -------------------- -By default, we use and maintain a SkyPilot container image that has conda and a few other basic tools installed. + .. grid-item-card:: 🖼 Run popular models on Kubernetes + :text-align: center -To use your own image, add :code:`image_id: docker:` to the :code:`resources` section of your task YAML. + Train and serve `Llama-3 `_, `Mixtral `_, and more on your Kubernetes with ready-to-use recipes from the :ref:`AI gallery `. -.. 
code-block:: yaml - resources: - image_id: docker:myrepo/myimage:latest - ... + .. tab-item:: For Infrastructure Admins + :sync: why-admins-tab -Your image must satisfy the following requirements: + .. grid:: 2 + :gutter: 3 -* Image must be **debian-based** and must have the apt package manager installed. -* The default user in the image must have root privileges or passwordless sudo access. + .. grid-item-card:: ☁️ Unified platform for all Infrastructure + :text-align: center -.. note:: + Scale beyond your Kubernetes cluster to capacity on :ref:`across clouds and regions ` without manual intervention. - If your cluster runs on non-x86_64 architecture (e.g., Apple Silicon), your image must be built natively for that architecture. Otherwise, your job may get stuck at :code:`Start streaming logs ...`. See `GitHub issue `_ for more. + .. grid-item-card:: 🚯️ Minimize resource wastage + :text-align: center -Using Images from Private Repositories -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -To use images from private repositories (e.g., Private DockerHub, Amazon ECR, Google Container Registry), create a `secret `_ in your Kubernetes cluster and edit your :code:`~/.sky/config.yaml` to specify the secret like so: + SkyPilot can run with your custom pod scheduler and automatically terminate idle pods to free up resources for other users. -.. code-block:: yaml + .. grid-item-card:: 👀 Observability + :text-align: center - kubernetes: - pod_config: - spec: - imagePullSecrets: - - name: your-secret-here + Works with your existing observability and monitoring tools, such as the :ref:`Kubernetes Dashboard `. -.. tip:: + .. grid-item-card:: 🍽️ Self-serve infra for your teams + :text-align: center - If you use Amazon ECR, your secret credentials may expire every 12 hours. Consider using `k8s-ecr-login-renew `_ to automatically refresh your secrets. + Reduce operational overhead by letting your teams provision their own resources, while you retain control over the Kubernetes cluster. -Opening Ports -------------- +Table of Contents +----------------- -Opening ports on SkyPilot clusters running on Kubernetes is supported through two modes: +.. grid:: 1 1 3 3 + :gutter: 3 -1. `LoadBalancer services `_ (default) -2. `Nginx IngressController `_ + .. grid-item-card:: 👋 Get Started + :link: kubernetes-getting-started + :link-type: ref + :text-align: center -One of these modes must be supported and configured on your cluster. Refer to the :ref:`setting up ports on Kubernetes guide ` on how to do this. + Already have a kubeconfig? Launch your first SkyPilot task on Kubernetes - it's as simple as ``sky launch``. -.. tip:: + .. grid-item-card:: ⚙️ Cluster Configuration + :link: kubernetes-setup + :link-type: ref + :text-align: center - On Google GKE, Amazon EKS or other cloud-hosted Kubernetes services, the default LoadBalancer services mode is supported out of the box and no additional configuration is needed. + Are you a cluster admin? Find cluster deployment guides and setup instructions here. -Once your cluster is configured, launch a task which exposes services on a port by adding :code:`ports` to the :code:`resources` section of your task YAML. + .. grid-item-card:: 🔍️ Troubleshooting + :link: kubernetes-troubleshooting + :link-type: ref + :text-align: center -.. code-block:: yaml + Running into problems with SkyPilot on your Kubernetes cluster? Find common issues and solutions here. 
- # task.yaml - resources: - ports: 8888 - run: | - python -m http.server 8888 - -After launching the cluster with :code:`sky launch -c myclus task.yaml`, you can get the URL to access the port using :code:`sky status --endpoints myclus`. - -.. code-block:: bash - - # List all ports exposed by the cluster - $ sky status --endpoints myclus - 8888: 34.173.13.241:8888 - - # curl a specific port's endpoint - $ curl $(sky status --endpoint 8888 myclus) - ... - -.. tip:: - - To learn more about opening ports in SkyPilot tasks, see :ref:`Opening Ports `. - -FAQs ----- - -* **Are autoscaling Kubernetes clusters supported?** - - To run on an autoscaling cluster, you may need to adjust the resource provisioning timeout (:code:`Kubernetes.TIMEOUT` in `clouds/kubernetes.py`) to a large value to give enough time for the cluster to autoscale. We are working on a better interface to adjust this timeout - stay tuned! - -* **Can SkyPilot provision a Kubernetes cluster for me? Will SkyPilot add more nodes to my Kubernetes clusters?** - - The goal of Kubernetes support is to run SkyPilot tasks on an existing Kubernetes cluster. It does not provision any new Kubernetes clusters or add new nodes to an existing Kubernetes cluster. - -* **I have multiple users in my organization who share the same Kubernetes cluster. How do I provide isolation for their SkyPilot workloads?** - - For isolation, you can create separate Kubernetes namespaces and set them in the kubeconfig distributed to users. SkyPilot will use the namespace set in the kubeconfig for running all tasks. - -* **How can I specify custom configuration for the pods created by SkyPilot?** - - You can override the pod configuration used by SkyPilot by setting the :code:`pod_config` key in :code:`~/.sky/config.yaml`. - The value of :code:`pod_config` should be a dictionary that follows the `Kubernetes Pod API `_. - - For example, to set custom environment variables and attach a volume on your pods, you can add the following to your :code:`~/.sky/config.yaml` file: - - .. code-block:: yaml +.. toctree:: + :hidden: - kubernetes: - pod_config: - spec: - containers: - - env: - - name: MY_ENV_VAR - value: MY_ENV_VALUE - volumeMounts: # Custom volume mounts for the pod - - mountPath: /foo - name: example-volume - volumes: - - name: example-volume - hostPath: - path: /tmp - type: Directory + kubernetes-getting-started + kubernetes-setup + kubernetes-troubleshooting - For more details refer to :ref:`config-yaml`. Features and Roadmap -------------------- @@ -256,11 +116,4 @@ Kubernetes support is under active development. Some features are in progress an * Multi-node tasks - ✅ Available * Custom images - ✅ Available * Opening ports and exposing services - ✅ Available -* Multiple Kubernetes Clusters - 🚧 In progress - - -.. toctree:: - :hidden: - - kubernetes-setup - kubernetes-troubleshooting +* Multiple Kubernetes Clusters - 🚧 In progress \ No newline at end of file diff --git a/docs/source/reference/kubernetes/kubernetes-getting-started.rst b/docs/source/reference/kubernetes/kubernetes-getting-started.rst new file mode 100644 index 00000000000..99a777ce1c0 --- /dev/null +++ b/docs/source/reference/kubernetes/kubernetes-getting-started.rst @@ -0,0 +1,235 @@ +.. _kubernetes-getting-started: + +Getting Started on Kubernetes +============================= + +Prerequisites +------------- + +To connect and use a Kubernetes cluster, SkyPilot needs: + +* An existing Kubernetes cluster running Kubernetes v1.20 or later. 
+* A `Kubeconfig `_ file containing access credentials and namespace to be used. + +**Supported Kubernetes deployments:** + +* Hosted Kubernetes services (EKS, GKE) +* On-prem clusters (Kubeadm, Rancher, K3s) +* Local development clusters (KinD, minikube) + +In a typical workflow: + +1. A cluster administrator sets up a Kubernetes cluster. Refer to admin guides for + :ref:`Kubernetes cluster setup ` for different deployment environments (Amazon EKS, Google GKE, On-Prem and local debugging) and :ref:`required permissions `. + +2. Users who want to run SkyPilot tasks on this cluster are issued Kubeconfig + files containing their credentials (`kube-context `_). + SkyPilot reads this Kubeconfig file to communicate with the cluster. + +Launching your first task +------------------------- +.. _kubernetes-instructions: + +Once your cluster administrator has :ref:`setup a Kubernetes cluster ` and provided you with a kubeconfig file: + +0. Make sure `kubectl `_, ``socat`` and ``nc`` (netcat) are installed on your local machine. + + .. code-block:: console + + $ # MacOS + $ brew install kubectl socat netcat + + $ # Linux (may have socat already installed) + $ sudo apt-get install kubectl socat netcat + + +1. Place your kubeconfig file at ``~/.kube/config``. + + .. code-block:: console + + $ mkdir -p ~/.kube + $ cp /path/to/kubeconfig ~/.kube/config + + You can verify your credentials are setup correctly by running :code:`kubectl get pods`. + +2. Run :code:`sky check` and verify that Kubernetes is enabled in SkyPilot. + + .. code-block:: console + + $ sky check + + Checking credentials to enable clouds for SkyPilot. + ... + Kubernetes: enabled + ... + + + .. note:: + :code:`sky check` will also check if GPU support is available on your cluster. If GPU support is not available, it + will show the reason. + To setup GPU support on the cluster, refer to the :ref:`Kubernetes cluster setup guide `. + +.. _kubernetes-optimizer-table: + +4. You can now run any SkyPilot task on your Kubernetes cluster. + + .. code-block:: console + + $ sky launch --cpus 2+ task.yaml + == Optimizer == + Target: minimizing cost + Estimated cost: $0.0 / hour + + Considered resources (1 node): + --------------------------------------------------------------------------------------------------- + CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN + --------------------------------------------------------------------------------------------------- + Kubernetes 2CPU--2GB 2 2 - kubernetes 0.00 ✔ + AWS m6i.large 2 8 - us-east-1 0.10 + Azure Standard_D2s_v5 2 8 - eastus 0.10 + GCP n2-standard-2 2 8 - us-central1 0.10 + IBM bx2-8x32 8 32 - us-east 0.38 + Lambda gpu_1x_a10 30 200 A10:1 us-east-1 0.60 + ---------------------------------------------------------------------------------------------------. + + +.. note:: + SkyPilot will use the cluster and namespace set in the ``current-context`` in the + kubeconfig file. To manage your ``current-context``: + + .. code-block:: console + + $ # See current context + $ kubectl config current-context + + $ # Switch current-context + $ kubectl config use-context mycontext + + $ # Set a specific namespace to be used in the current-context + $ kubectl config set-context --current --namespace=mynamespace + + +Using Custom Images +------------------- +By default, we use and maintain a SkyPilot container image that has conda and a few other basic tools installed. + +To use your own image, add :code:`image_id: docker:` to the :code:`resources` section of your task YAML. + +.. 
code-block:: yaml + + resources: + image_id: docker:myrepo/myimage:latest + ... + +Your image must satisfy the following requirements: + +* Image must be **debian-based** and must have the apt package manager installed. +* The default user in the image must have root privileges or passwordless sudo access. + +.. note:: + + If your cluster runs on non-x86_64 architecture (e.g., Apple Silicon), your image must be built natively for that architecture. Otherwise, your job may get stuck at :code:`Start streaming logs ...`. See `GitHub issue `_ for more. + +Using Images from Private Repositories +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +To use images from private repositories (e.g., Private DockerHub, Amazon ECR, Google Container Registry), create a `secret `_ in your Kubernetes cluster and edit your :code:`~/.sky/config.yaml` to specify the secret like so: + +.. code-block:: yaml + + kubernetes: + pod_config: + spec: + imagePullSecrets: + - name: your-secret-here + +.. tip:: + + If you use Amazon ECR, your secret credentials may expire every 12 hours. Consider using `k8s-ecr-login-renew `_ to automatically refresh your secrets. + + +Opening Ports +------------- + +Opening ports on SkyPilot clusters running on Kubernetes is supported through two modes: + +1. `LoadBalancer services `_ (default) +2. `Nginx IngressController `_ + +One of these modes must be supported and configured on your cluster. Refer to the :ref:`setting up ports on Kubernetes guide ` on how to do this. + +.. tip:: + + On Google GKE, Amazon EKS or other cloud-hosted Kubernetes services, the default LoadBalancer services mode is supported out of the box and no additional configuration is needed. + +Once your cluster is configured, launch a task which exposes services on a port by adding :code:`ports` to the :code:`resources` section of your task YAML. + +.. code-block:: yaml + + # task.yaml + resources: + ports: 8888 + + run: | + python -m http.server 8888 + +After launching the cluster with :code:`sky launch -c myclus task.yaml`, you can get the URL to access the port using :code:`sky status --endpoints myclus`. + +.. code-block:: bash + + # List all ports exposed by the cluster + $ sky status --endpoints myclus + 8888: 34.173.13.241:8888 + + # curl a specific port's endpoint + $ curl $(sky status --endpoint 8888 myclus) + ... + +.. tip:: + + To learn more about opening ports in SkyPilot tasks, see :ref:`Opening Ports `. + +FAQs +---- + +* **Are autoscaling Kubernetes clusters supported?** + + To run on an autoscaling cluster, you may need to adjust the resource provisioning timeout (:code:`Kubernetes.TIMEOUT` in `clouds/kubernetes.py`) to a large value to give enough time for the cluster to autoscale. We are working on a better interface to adjust this timeout - stay tuned! + +* **Can SkyPilot provision a Kubernetes cluster for me? Will SkyPilot add more nodes to my Kubernetes clusters?** + + The goal of Kubernetes support is to run SkyPilot tasks on an existing Kubernetes cluster. It does not provision any new Kubernetes clusters or add new nodes to an existing Kubernetes cluster. + +* **I have multiple users in my organization who share the same Kubernetes cluster. How do I provide isolation for their SkyPilot workloads?** + + For isolation, you can create separate Kubernetes namespaces and set them in the kubeconfig distributed to users. SkyPilot will use the namespace set in the kubeconfig for running all tasks. 
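 + + For example, a cluster admin could create one namespace per team and issue + kubeconfigs scoped to it (the namespace name below is illustrative): + + .. code-block:: console + + $ # Create an isolated namespace for one team + $ kubectl create namespace team-a + + $ # Pin the team's kubeconfig context to that namespace + $ kubectl config set-context --current --namespace=team-a 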
+ +* **How do I view the pods created by SkyPilot on my Kubernetes cluster?** + + You can use your existing observability tools to filter resources with the label :code:`parent=skypilot` (:code:`kubectl get pods -l 'parent=skypilot'`). As an example, follow the instructions :ref:`here ` to deploy the Kubernetes Dashboard on your cluster. + +* **How can I specify custom configuration for the pods created by SkyPilot?** + + You can override the pod configuration used by SkyPilot by setting the :code:`pod_config` key in :code:`~/.sky/config.yaml`. + The value of :code:`pod_config` should be a dictionary that follows the `Kubernetes Pod API `_. + + For example, to set custom environment variables and attach a volume on your pods, you can add the following to your :code:`~/.sky/config.yaml` file: + + .. code-block:: yaml + + kubernetes: + pod_config: + spec: + containers: + - env: + - name: MY_ENV_VAR + value: MY_ENV_VALUE + volumeMounts: # Custom volume mounts for the pod + - mountPath: /foo + name: example-volume + volumes: + - name: example-volume + hostPath: + path: /tmp + type: Directory + + For more details refer to :ref:`config-yaml`. diff --git a/docs/source/reference/kubernetes/kubernetes-setup.rst b/docs/source/reference/kubernetes/kubernetes-setup.rst index fca4d539327..4acf271bdca 100644 --- a/docs/source/reference/kubernetes/kubernetes-setup.rst +++ b/docs/source/reference/kubernetes/kubernetes-setup.rst @@ -18,7 +18,7 @@ SkyPilot's Kubernetes support is designed to work with most Kubernetes distribut To connect to a Kubernetes cluster, SkyPilot needs: * An existing Kubernetes cluster running Kubernetes v1.20 or later. -* A `Kubeconfig `_ file containing access credentials and namespace to be used. +* A `Kubeconfig `_ file containing access credentials and namespace to be used. To reduce the permissions for a user, check :ref:`required permissions guide`. Deployment Guides @@ -382,7 +382,7 @@ To use this mode: # ingress-nginx-controller LoadBalancer 10.24.4.254 35.202.58.117 80:31253/TCP,443:32699/TCP .. note:: - If the ``EXTERNAL-IP`` field is ````, you must manually assign an External IP. + If the ``EXTERNAL-IP`` field is ````, you may manually assign it an External IP. This can be done by patching the service with an IP that can be accessed from outside the cluster. If the service type is ``NodePort``, you can set the ``EXTERNAL-IP`` to any node's IP address: @@ -395,6 +395,22 @@ To use this mode: If the ``EXTERNAL-IP`` field is left as ````, SkyPilot will use ``localhost`` as the external IP for the Ingress, and the endpoint may not be accessible from outside the cluster. +.. note:: + If you cannot update the ``EXTERNAL-IP`` field of the service, you can also + specify the Ingress IP or hostname through the ``skypilot.co/external-ip`` + annotation on the ``ingress-nginx-controller`` service. In this case, + having a valid ``EXTERNAL-IP`` field is not required. + + For example, if your ``ingress-nginx-controller`` service is ``NodePort``: + + .. code-block:: bash + + # Add skypilot.co/external-ip annotation to the nginx ingress service. + # Replace in the following command with the IP you select. + # Can be any node's IP if using NodePort service type. + $ kubectl annotate service ingress-nginx-controller skypilot.co/external-ip= -n ingress-nginx + + 3. Update the :ref:`SkyPilot config ` at :code:`~/.sky/config` to use the ingress mode. .. 
code-block:: yaml diff --git a/docs/source/reference/yaml-spec.rst b/docs/source/reference/yaml-spec.rst index 782fd9fd208..1e56240989c 100644 --- a/docs/source/reference/yaml-spec.rst +++ b/docs/source/reference/yaml-spec.rst @@ -92,10 +92,21 @@ Available fields: # If unspecified, defaults to False (on-demand instances). use_spot: False - # The recovery strategy for spot jobs (optional). - # `use_spot` must be True for this to have any effect. For now, only - # `FAILOVER` strategy is supported. - spot_recovery: none + # The recovery strategy for managed jobs (optional). + # + # In effect for managed jobs. Possible values are `FAILOVER` and `EAGER_NEXT_REGION`. + # + # If `FAILOVER` is specified, the job will be restarted in the same region + # if the node fails, and go to the next region if no available resources + # are found in the same region. + # + # If `EAGER_NEXT_REGION` is specified, the job will go to the next region + # directly if the node fails. This is useful for spot instances, as in + # practice, preemptions in a region usually indicate a shortage of resources + # in that region. + # + # default: EAGER_NEXT_REGION + job_recovery: none # Disk size in GB to allocate for OS (mounted at /). Increase this if you # have a large working directory or tasks that write out large outputs. @@ -208,6 +219,20 @@ Available fields: # To use a more limited but easier to manage tool: # https://github.com/IBM/vpc-img-inst + # Labels to apply to the instances (optional). + # + # If specified, these labels will be applied to the VMs or pods created + # by SkyPilot. These are useful for assigning metadata that may be + # used by external tools. Implementation depends on the chosen cloud - + # On AWS, labels map to instance tags. On GCP, labels map to instance + # labels. On Kubernetes, labels map to pod labels. On other clouds, + # labels are not supported and will be ignored. + # + # Note: Labels are applied only on the first launch of the cluster. They + # are not updated on subsequent launches. + labels: + my-label: my-value + # Candidate resources (optional). If specified, SkyPilot will only use # these candidate resources to launch the cluster. The fields specified # outside of `any_of`, `ordered` will be used as the default values for diff --git a/docs/source/running-jobs/environment-variables.rst b/docs/source/running-jobs/environment-variables.rst index 16502f70818..2f3427c1bf5 100644 --- a/docs/source/running-jobs/environment-variables.rst +++ b/docs/source/running-jobs/environment-variables.rst @@ -12,6 +12,20 @@ You can specify environment variables to be made available to a task in two ways - The ``envs`` field (dict) in a :ref:`task YAML ` - The ``--env`` flag in the ``sky launch/exec`` :ref:`CLI ` (takes precedence over the above) +.. tip:: + + If an environment variable is required to be specified with `--env` during + ``sky launch/exec``, you can set it to ``null`` in task YAML to raise an + error when it is forgotten to be specified. For example, the ``WANDB_API_KEY`` + and ``HF_TOKEN`` in the following task YAML: + + .. code-block:: yaml + + envs: + WANDB_API_KEY: + HF_TOKEN: null + MYVAR: val + The ``file_mounts``, ``setup``, and ``run`` sections of a task YAML can access the variables via the ``${MYVAR}`` syntax. Using in ``file_mounts`` diff --git a/docs/source/serving/auth.rst b/docs/source/serving/auth.rst new file mode 100644 index 00000000000..91e02a64b07 --- /dev/null +++ b/docs/source/serving/auth.rst @@ -0,0 +1,123 @@ +.. 
 _serve-auth: + +Authorization +============= + +SkyServe provides robust authorization capabilities at the replica level, allowing you to control access to service endpoints with API keys. + +Setup API Keys +-------------- + +SkyServe relies on the authorization of the service running on the underlying service replicas, e.g., the inference engine. We take the vLLM inference engine as an example; it supports static API key authorization via its :code:`--api-key` argument. + +We define a SkyServe service spec for serving a Llama-3 chatbot with vLLM and an API key. In the example YAML below, the authorization token is defined as an environment variable, :code:`AUTH_TOKEN`, and passed both to the service section, so the :code:`readiness_probe` can access the replicas, and to the vLLM entrypoint, which starts the service on the replicas with the API key. + +.. code-block:: yaml + :emphasize-lines: 5,10-11,28 + + # auth.yaml + envs: + MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. + AUTH_TOKEN: # TODO: Fill with your own auth token (a random string), or use --env to pass. + + service: + readiness_probe: + path: /v1/models + headers: + Authorization: Bearer $AUTH_TOKEN + replicas: 2 + + resources: + accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} + ports: 8000 + + setup: | + pip install vllm==0.4.0.post1 flash-attn==2.5.7 gradio openai + # python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')" + + run: | + export PATH=$PATH:/sbin + python -m vllm.entrypoints.openai.api_server \ + --model $MODEL_NAME --trust-remote-code \ + --gpu-memory-utilization 0.95 \ + --host 0.0.0.0 --port 8000 \ + --api-key $AUTH_TOKEN + +To deploy the service, run the following command: + +.. code-block:: bash + + HF_TOKEN=xxx AUTH_TOKEN=yyy sky serve up auth.yaml -n auth --env HF_TOKEN --env AUTH_TOKEN + +To send a request to the service endpoint, a service client needs to include the static API key in the request's header: + +.. code-block:: bash + :emphasize-lines: 5 + + $ ENDPOINT=$(sky serve status --endpoint auth) + $ AUTH_TOKEN=yyy + $ curl http://$ENDPOINT/v1/chat/completions \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer $AUTH_TOKEN" \ + -d '{ + "model": "meta-llama/Meta-Llama-3-8B-Instruct", + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "Who are you?" + } + ], + "stop_token_ids": [128009, 128001] + }' | jq + +.. raw:: html + +
+ + Example output + + +.. code-block:: console + + { + "id": "cmpl-cad2c1a2a6ee44feabed0b28be294d6f", + "object": "chat.completion", + "created": 1716819147, + "model": "meta-llama/Meta-Llama-3-8B-Instruct", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "I'm so glad you asked! I'm LLaMA, an AI assistant developed by Meta AI that can understand and respond to human input in a conversational manner. I'm here to help you with any questions, tasks, or topics you'd like to discuss. I can provide information on a wide range of subjects, from science and history to entertainment and culture. I can also assist with language-related tasks such as language translation, text summarization, and even writing and proofreading. My goal is to provide accurate and helpful responses to your inquiries, while also being friendly and engaging. So, what's on your mind? How can I assist you today?" + }, + "logprobs": null, + "finish_reason": "stop", + "stop_reason": 128009 + } + ], + "usage": { + "prompt_tokens": 26, + "total_tokens": 160, + "completion_tokens": 134 + } + } + +.. raw:: html + +
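 + +You can also call the authorized endpoint from Python. Below is a minimal sketch using the OpenAI Python SDK (``pip install openai``); the ``yyy`` token mirrors the curl example above and is a placeholder: + +.. code-block:: python + + import subprocess + + from openai import OpenAI + + # Fetch the service endpoint, same as `sky serve status --endpoint auth`. + endpoint = subprocess.check_output( + ['sky', 'serve', 'status', '--endpoint', 'auth'], text=True).strip() + + # The SDK sends the key as an `Authorization: Bearer` header. + client = OpenAI(base_url=f'http://{endpoint}/v1', api_key='yyy') + + response = client.chat.completions.create( + model='meta-llama/Meta-Llama-3-8B-Instruct', + messages=[ + {'role': 'system', 'content': 'You are a helpful assistant.'}, + {'role': 'user', 'content': 'Who are you?'}, + ], + ) + print(response.choices[0].message.content) 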
+ +A service client without an API key will not be able to access the service and get a :code:`401 Unauthorized` error: + +.. code-block:: bash + + $ curl http://$ENDPOINT/v1/models + {"error": "Unauthorized"} + + $ curl http://$ENDPOINT/v1/models -H "Authorization: Bearer random-string" + {"error": "Unauthorized"} diff --git a/docs/source/serving/sky-serve.rst b/docs/source/serving/sky-serve.rst index eb2daa6ffb8..c00fa427bd6 100644 --- a/docs/source/serving/sky-serve.rst +++ b/docs/source/serving/sky-serve.rst @@ -22,7 +22,7 @@ Why SkyServe? How it works: -- Each service gets an endpoint that automatically redirects requests to its replicas. +- Each service gets an endpoint that automatically distributes requests to its replicas. - Replicas of the same service can run in different regions and clouds — reducing cloud costs and increasing availability. - SkyServe handles the load balancing, recovery, and autoscaling of the replicas. @@ -127,7 +127,7 @@ Run :code:`sky serve up service.yaml` to deploy the service with automatic price If you see the :code:`STATUS` column becomes :code:`READY`, then the service is ready to accept traffic! -Simply ``curl -L`` the service endpoint, which automatically load-balances across the two replicas: +Simply ``curl`` the service endpoint, which automatically load-balances across the two replicas: .. tab-set:: @@ -136,7 +136,7 @@ Simply ``curl -L`` the service endpoint, which automatically load-balances acros .. code-block:: console - $ curl -L 3.84.15.251:30001/v1/chat/completions \ + $ curl 3.84.15.251:30001/v1/chat/completions \ -X POST \ -d '{"model": "mistralai/Mixtral-8x7B-Instruct-v0.1", "messages": [{"role": "user", "content": "Who are you?"}]}' \ -H 'Content-Type: application/json' @@ -149,7 +149,7 @@ Simply ``curl -L`` the service endpoint, which automatically load-balances acros .. code-block:: console - $ curl -L 44.211.131.51:30001/generate \ + $ curl 44.211.131.51:30001/generate \ -X POST \ -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \ -H 'Content-Type: application/json' @@ -240,7 +240,7 @@ Under the hood, :code:`sky serve up`: #. Launches a controller which handles autoscaling, monitoring and load balancing; #. Returns a Service Endpoint which will be used to accept traffic; #. Meanwhile, the controller provisions replica VMs which later run the services; -#. Once any replica is ready, the requests sent to the Service Endpoint will be **HTTP-redirect** to one of the endpoint replicas. +#. Once any replica is ready, the requests sent to the Service Endpoint will be distributed to one of the endpoint replicas. After the controller is provisioned, you'll see the following in :code:`sky serve status` output: @@ -264,7 +264,7 @@ sending requests to :code:`` (e.g., ``44.201.119.3:30001``): .. code-block:: console - $ curl -L + $ curl My First SkyServe Service @@ -274,12 +274,6 @@ sending requests to :code:`` (e.g., ``44.201.119.3:30001``): -.. note:: - - Since we are using HTTP-redirect, we need to use :code:`curl -L - `. The :code:`curl` command by default won't follow the - redirect. - Tutorial: Serve a Chatbot LLM! ------------------------------ @@ -308,11 +302,12 @@ Let's bring up a real LLM chat service with FastChat + Vicuna. We'll use the `Vi conda activate chatbot echo 'Starting controller...' - python -u -m fastchat.serve.controller > ~/controller.log 2>&1 & + python -u -m fastchat.serve.controller --host 127.0.0.1 > ~/controller.log 2>&1 & sleep 10 echo 'Starting model worker...' 
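 + # Binding to 127.0.0.1 keeps the worker reachable only from within the replica. 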
 python -u -m fastchat.serve.model_worker \ --model-path lmsys/vicuna-${MODEL_SIZE}b-v1.3 2>&1 \ + --host 127.0.0.1 \ | tee model_worker.log & echo 'Waiting for model worker to start...' @@ -367,7 +362,7 @@ Send a request using the following cURL command: .. code-block:: console - $ curl -L http:///v1/chat/completions \ + $ curl http:///v1/chat/completions \ -X POST \ -d '{"model":"vicuna-7b-v1.3","messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Who are you?"}],"temperature":0}' \ -H 'Content-Type: application/json' @@ -449,6 +444,11 @@ Autoscaling See :ref:`Autoscaling ` for more information. +Authorization +------------- + +See :ref:`Authorization ` for more information. + SkyServe Architecture --------------------- @@ -467,7 +467,7 @@ SkyServe has a centralized controller VM that manages the deployment of your ser It is composed of the following components: #. **Controller**: The controller will monitor the status of the replicas and re-launch a new replica if one of them fails. It also autoscales the number of replicas if autoscaling config is set (see :ref:`Service YAML spec ` for more information). -#. **Load Balancer**: The load balancer will route the traffic to all ready replicas. It is a lightweight HTTP server that listens on the service endpoint and **HTTP-redirects** the requests to one of the replicas. +#. **Load Balancer**: The load balancer will route the traffic to all ready replicas. It is a lightweight HTTP server that listens on the service endpoint and distributes the requests to one of the replicas. All of these processes share a single controller VM. The controller VM will be launched in the cloud with the best price/performance ratio. You can also :ref:`customize the controller resources ` based on your needs. diff --git a/docs/source/serving/spot-policy.rst b/docs/source/serving/spot-policy.rst new file mode 100644 index 00000000000..1c03dbe7ba4 --- /dev/null +++ b/docs/source/serving/spot-policy.rst @@ -0,0 +1,160 @@ +.. _spot_policy: + +Using Spot Instances for Serving +================================ + +SkyServe supports serving models on a mixture of spot and on-demand replicas with two options: :code:`base_ondemand_fallback_replicas` and :code:`dynamic_ondemand_fallback`. + + +Base on-demand Fallback +----------------------- + +:code:`base_ondemand_fallback_replicas` sets the number of on-demand replicas to keep running at all times, ensuring that some capacity remains available even when no spot replicas can be provisioned. :code:`use_spot` should be set to :code:`true` to enable spot replicas. + +.. code-block:: yaml + + service: + readiness_probe: /health + replica_policy: + min_replicas: 2 + max_replicas: 3 + target_qps_per_replica: 1 + # Ensures that one of the replicas runs on on-demand instances + base_ondemand_fallback_replicas: 1 + + resources: + ports: 8081 + cpus: 2+ + use_spot: true + + workdir: examples/serve/http_server + + run: python3 server.py + + +.. tip:: + + Kubernetes instances are considered on-demand instances. You can use the :code:`base_ondemand_fallback_replicas` option to have some replicas run on Kubernetes, while others run on cloud spot instances. + + +Dynamic on-demand Fallback +-------------------------- + +SkyServe supports dynamically falling back to on-demand replicas when spot replicas are not available. +This is enabled by setting :code:`dynamic_ondemand_fallback` to :code:`true`. 
+This is useful for maintaining the required replica capacity in the case of spot instance interruptions. +When spot replicas are available, SkyServe will automatically switch back to using spot replicas to maximize cost savings. + +.. code-block:: yaml + + service: + readiness_probe: /health + replica_policy: + min_replicas: 2 + max_replicas: 3 + target_qps_per_replica: 1 + # Allows replicas to be run on on-demand instances if spot instances are not available + dynamic_ondemand_fallback: true + + resources: + ports: 8081 + cpus: 2+ + use_spot: true + + workdir: examples/serve/http_server + + run: python3 server.py + + +.. tip:: + + SkyServe supports specifying both :code:`base_ondemand_fallback_replicas` and :code:`dynamic_ondemand_fallback`. Specifying both will set a base number of on-demand replicas and dynamically fall back to on-demand replicas when spot replicas are not available. + +Example +------- + +The following example demonstrates how to use spot replicas with SkyServe, with dynamic fallback enabled. The example is a simple HTTP server that listens on port 8081 with :code:`dynamic_ondemand_fallback: true`. To run: + +.. code-block:: console + + $ sky serve up examples/serve/spot_policy/dynamic_on_demand_fallback.yaml -n http-server + +When the service is up, we can check the status of the service and the replicas using the following command. Initially, we will see: + +.. code-block:: console + + $ sky serve status http-server + + Services + NAME VERSION UPTIME STATUS REPLICAS ENDPOINT + http-server 1 1m 17s NO_REPLICA 0/4 54.227.229.217:30001 + + Service Replicas + SERVICE_NAME ID VERSION ENDPOINT LAUNCHED RESOURCES STATUS REGION + http-server 1 1 - 1 min ago 1x GCP([Spot]vCPU=2) PROVISIONING us-east1 + http-server 2 1 - 1 min ago 1x GCP([Spot]vCPU=2) PROVISIONING us-central1 + http-server 3 1 - 1 min ago 1x GCP(vCPU=2) PROVISIONING us-east1 + http-server 4 1 - 1 min ago 1x GCP(vCPU=2) PROVISIONING us-central1 +When the required number of spot replicas is not available, SkyServe will provision the number of on-demand replicas needed to meet the target number of replicas. For example, when the target number is 2 and only 1 spot replica is ready, SkyServe will provision 1 on-demand replica to meet the target number of replicas. + +.. code-block:: console + + $ sky serve status http-server + + Services + NAME VERSION UPTIME STATUS REPLICAS ENDPOINT + http-server 1 1m 17s READY 2/4 54.227.229.217:30001 + + Service Replicas + SERVICE_NAME ID VERSION ENDPOINT LAUNCHED RESOURCES STATUS REGION + http-server 1 1 http://34.23.22.160:8081 3 mins ago 1x GCP([Spot]vCPU=2) READY us-east1 + http-server 2 1 http://34.68.226.193:8081 3 mins ago 1x GCP([Spot]vCPU=2) READY us-central1 + http-server 3 1 - 3 mins ago 1x GCP(vCPU=2) SHUTTING_DOWN us-east1 + http-server 4 1 - 3 mins ago 1x GCP(vCPU=2) SHUTTING_DOWN us-central1 +When the spot replicas are ready, SkyServe will automatically scale down on-demand replicas to maximize cost savings. + +.. code-block:: console + + $ sky serve status http-server + + Services + NAME VERSION UPTIME STATUS REPLICAS ENDPOINT + http-server 1 3m 59s READY 2/2 54.227.229.217:30001 + + Service Replicas + SERVICE_NAME ID VERSION ENDPOINT LAUNCHED RESOURCES STATUS REGION + http-server 1 1 http://34.23.22.160:8081 4 mins ago 1x GCP([Spot]vCPU=2) READY us-east1 + http-server 2 1 http://34.68.226.193:8081 4 mins ago 1x GCP([Spot]vCPU=2) READY us-central1 +In the event of spot instance interruptions (e.g. 
replica 1), SkyServe will automatically fall back to on-demand replicas (e.g. launch one on-demand replica) to meet the required capacity of replicas. SkyServe will keep trying to provision one spot replica, so that it can switch back once spot capacity is available again. Note that SkyServe will try different regions and clouds to maximize the chance of successfully provisioning spot instances. + +.. code-block:: console + + $ sky serve status http-server + + Services + NAME VERSION UPTIME STATUS REPLICAS ENDPOINT + http-server 1 7m 2s READY 1/3 54.227.229.217:30001 + + Service Replicas + SERVICE_NAME ID VERSION ENDPOINT LAUNCHED RESOURCES STATUS REGION + http-server 2 1 http://34.68.226.193:8081 7 mins ago 1x GCP([Spot]vCPU=2) READY us-central1 + http-server 5 1 - 13 secs ago 1x GCP([Spot]vCPU=2) PROVISIONING us-central1 + http-server 6 1 - 13 secs ago 1x GCP(vCPU=2) PROVISIONING us-central1 + +Eventually, when spot capacity is available again, SkyServe will automatically scale down on-demand replicas. + +.. code-block:: console + + $ sky serve status http-server + + Services + NAME VERSION UPTIME STATUS REPLICAS ENDPOINT + http-server 1 10m 5s READY 2/3 54.227.229.217:30001 + + Service Replicas + SERVICE_NAME ID VERSION ENDPOINT LAUNCHED RESOURCES STATUS REGION + http-server 2 1 http://34.68.226.193:8081 10 mins ago 1x GCP([Spot]vCPU=2) READY us-central1 + http-server 5 1 http://34.121.49.94:8081 1 min ago 1x GCP([Spot]vCPU=2) READY us-central1 \ No newline at end of file diff --git a/docs/source/serving/user-guides.rst b/docs/source/serving/user-guides.rst index e6e63fd40a5..8b9cba92b45 100644 --- a/docs/source/serving/user-guides.rst +++ b/docs/source/serving/user-guides.rst @@ -5,3 +5,5 @@ Serving User Guides autoscaling update + auth + spot-policy diff --git a/examples/cog/README.md b/examples/cog/README.md index 4fa4890420f..b2193e2e18f 100644 --- a/examples/cog/README.md +++ b/examples/cog/README.md @@ -28,7 +28,7 @@ After the service is launched, access the deployment with the following: ```console ENDPOINT=$(sky serve status --endpoint cog) -curl -L http://$ENDPOINT/predictions -X POST \ +curl http://$ENDPOINT/predictions -X POST \ -H 'Content-Type: application/json' \ -d '{"input": {"image": "https://blog.skypilot.co/introducing-sky-serve/images/sky-serve-thumbnail.png"}}' \ | jq -r '.output | split(",")[1]' | base64 --decode > output.png diff --git a/examples/job_queue/job_docker.yaml b/examples/job_queue/job_docker.yaml index da604125865..37f23f3eef1 100644 --- a/examples/job_queue/job_docker.yaml +++ b/examples/job_queue/job_docker.yaml @@ -8,6 +8,9 @@ name: job_docker +envs: + TIME_TO_SLEEP: 180 + resources: accelerators: T4:0.5 image_id: docker:ubuntu:20.04 @@ -18,7 +21,7 @@ setup: | run: | timestamp=$(date +%s) conda env list - for i in {1..180}; do + for i in $(seq 1 $TIME_TO_SLEEP); do echo "$timestamp $i" sleep 1 done diff --git a/examples/managed_job.yaml b/examples/managed_job.yaml new file mode 100644 index 00000000000..30ad2db287a --- /dev/null +++ b/examples/managed_job.yaml @@ -0,0 +1,17 @@ +name: minimal + +setup: | + echo "running setup" + pip install tqdm + +run: | + conda env list + echo "start counting" + python -u - << EOF + import time + import tqdm + + for i in tqdm.trange(240): + time.sleep(1) + + EOF diff --git a/examples/managed_spot_with_storage.yaml b/examples/managed_job_with_storage.yaml similarity index 83% rename from examples/managed_spot_with_storage.yaml rename to examples/managed_job_with_storage.yaml index 1b81e459bb4..ecefccd8b3d 100644 --- 
a/examples/managed_spot_with_storage.yaml +++ b/examples/managed_job_with_storage.yaml @@ -3,13 +3,13 @@ # Runs a task that uses cloud buckets for uploading and accessing files. # # Usage: -# sky spot launch -c spot-storage examples/managed_spot_with_storage.yaml +# sky spot launch -c spot-storage examples/managed_job_with_storage.yaml # sky down spot-storage resources: cloud: aws use_spot: true - spot_recovery: failover + job_recovery: failover workdir: ./examples @@ -41,8 +41,8 @@ file_mounts: run: | set -ex - ls ~/sky_workdir/managed_spot_with_storage.yaml - ls ~/bucket_workdir/managed_spot_with_storage.yaml + ls ~/sky_workdir/managed_job_with_storage.yaml + ls ~/bucket_workdir/managed_job_with_storage.yaml ls -l /imagenet-image/datasets diff --git a/examples/managed_spot.yaml b/examples/managed_spot.yaml index 4bfcb63f40a..712819eb9ca 100644 --- a/examples/managed_spot.yaml +++ b/examples/managed_spot.yaml @@ -1,5 +1,8 @@ name: minimal +resources: + use_spot: true + setup: | echo "running setup" pip install tqdm diff --git a/examples/serve/gorilla/gorilla.yaml b/examples/serve/gorilla/gorilla.yaml index ee46aa94568..e3072d816fb 100644 --- a/examples/serve/gorilla/gorilla.yaml +++ b/examples/serve/gorilla/gorilla.yaml @@ -35,11 +35,12 @@ run: | conda activate chatbot echo 'Starting controller...' - python -u -m fastchat.serve.controller > ~/controller.log 2>&1 & + python -u -m fastchat.serve.controller --host 127.0.0.1 > ~/controller.log 2>&1 & sleep 10 echo 'Starting model worker...' python -u -m fastchat.serve.model_worker \ --model-path gorilla-llm/gorilla-falcon-7b-hf-v0 2>&1 \ + --host 127.0.0.1 \ | tee model_worker.log & echo 'Waiting for model worker to start...' diff --git a/examples/serve/llama2/llama2.yaml b/examples/serve/llama2/llama2.yaml index 5eaaea449d0..42c82ea0cc9 100644 --- a/examples/serve/llama2/llama2.yaml +++ b/examples/serve/llama2/llama2.yaml @@ -25,7 +25,7 @@ resources: envs: MODEL_SIZE: 7 - HF_TOKEN: # TODO: Replace with huggingface token + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. setup: | conda activate chatbot diff --git a/examples/serve/misc/cancel/README.md b/examples/serve/misc/cancel/README.md index 65b88c2d540..61c24383909 100644 --- a/examples/serve/misc/cancel/README.md +++ b/examples/serve/misc/cancel/README.md @@ -1,6 +1,6 @@ # SkyServe cancel example -This example demonstrates the redirect support canceling a request. +This example demonstrates the SkyServe load balancer's support for canceling a request. ## Running the example @@ -33,7 +33,7 @@ Client disconnected, stopping computation. You can also run ```bash -curl -L http:/// +curl http:/// ``` and manually Ctrl + C to cancel the request and see logs. 
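+
+To script the disconnect instead of pressing Ctrl + C by hand, here is a minimal sketch (assuming `$ENDPOINT` holds your service endpoint and GNU coreutils' `timeout` is available):
+
+```bash
+# Disconnect the client after 2 seconds; the replica log should then show
+# "Client disconnected, stopping computation."
+timeout 2 curl http://$ENDPOINT/
+```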
diff --git a/examples/serve/stable_diffusion_service.yaml b/examples/serve/stable_diffusion_service.yaml index 86ef257e7ca..2adaf6e4ca6 100644 --- a/examples/serve/stable_diffusion_service.yaml +++ b/examples/serve/stable_diffusion_service.yaml @@ -18,7 +18,7 @@ file_mounts: /stable_diffusion: examples/stable_diffusion setup: | - sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose + sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose # -L kept: GitHub release downloads redirect sudo chmod +x /usr/local/bin/docker-compose cd stable-diffusion-webui-docker sudo rm -r stable-diffusion-webui-docker diff --git a/examples/serve/vicuna-v1.5.yaml b/examples/serve/vicuna-v1.5.yaml index c94115ea3d7..0f659e85697 100644 --- a/examples/serve/vicuna-v1.5.yaml +++ b/examples/serve/vicuna-v1.5.yaml @@ -34,11 +34,12 @@ run: | conda activate chatbot echo 'Starting controller...' - python -u -m fastchat.serve.controller > ~/controller.log 2>&1 & + python -u -m fastchat.serve.controller --host 127.0.0.1 > ~/controller.log 2>&1 & sleep 10 echo 'Starting model worker...' python -u -m fastchat.serve.model_worker \ --model-path lmsys/vicuna-${MODEL_SIZE}b-v1.5 2>&1 \ + --host 127.0.0.1 \ | tee model_worker.log & echo 'Waiting for model worker to start...' diff --git a/examples/spot_pipeline/bert_qa_train_eval.yaml b/examples/spot_pipeline/bert_qa_train_eval.yaml index 32fb526ca91..62bd34c3b76 100644 --- a/examples/spot_pipeline/bert_qa_train_eval.yaml +++ b/examples/spot_pipeline/bert_qa_train_eval.yaml @@ -42,7 +42,7 @@ run: | echo Model saved to /checkpoint/bert_qa/$SKYPILOT_TASK_ID envs: - WANDB_API_KEY: # NOTE: Fill in your wandb key + WANDB_API_KEY: # TODO: Fill with your own WANDB_API_KEY, or use --env to pass. --- @@ -84,4 +84,4 @@ run: | --save_steps 1000 envs: - WANDB_API_KEY: # NOTE: Fill in your wandb key + WANDB_API_KEY: # TODO: Fill with your own WANDB_API_KEY, or use --env to pass. diff --git a/examples/stable_diffusion/README.md b/examples/stable_diffusion/README.md index 9bec3354848..2a4383f1347 100644 --- a/examples/stable_diffusion/README.md +++ b/examples/stable_diffusion/README.md @@ -8,7 +8,7 @@ 4. Run `ssh -L 7860:localhost:7860 stable-diffusion` -5. Open [`http://localhost:7860/`](http://localhost:7860/) in browser. +5. Open [`http://localhost:7860/`](http://localhost:7860/) in browser. If the page doesn't load, try again in a few minutes to allow the container to start. 6. Type in text prompt and click "Generate". diff --git a/examples/stable_diffusion/stable_diffusion_docker.yaml b/examples/stable_diffusion/stable_diffusion_docker.yaml index 9c07790ba6b..1a830241f14 100644 --- a/examples/stable_diffusion/stable_diffusion_docker.yaml +++ b/examples/stable_diffusion/stable_diffusion_docker.yaml @@ -7,8 +7,6 @@ file_mounts: /stable_diffusion: . 
setup: | - sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose - sudo chmod +x /usr/local/bin/docker-compose cd stable-diffusion-webui-docker sudo rm -r stable-diffusion-webui-docker git clone https://github.com/AbdBarho/stable-diffusion-webui-docker.git @@ -17,9 +15,16 @@ setup: | wget https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/resolve/main/sd-v1-4.ckpt -P models mv models/sd-v1-4.ckpt models/model.ckpt docker pull berkeleyskypilot/stable-diffusion - rm docker-compose.yml - cp /stable_diffusion/docker-compose.yml . run: | cd stable-diffusion-webui-docker - docker-compose up + docker run -d \ + --name model \ + --restart on-failure \ + -p 7860:7860 \ + -v $(pwd)/cache:/cache \ + -v $(pwd)/output:/output \ + -v $(pwd)/models:/models \ + -e CLI_ARGS='--extra-models-cpu --optimized-turbo' \ + --gpus all \ + berkeleyskypilot/stable-diffusion diff --git a/examples/tpu/tpuvm_mnist.yaml b/examples/tpu/tpuvm_mnist.yaml index 5d40089329f..d4e119119e0 100644 --- a/examples/tpu/tpuvm_mnist.yaml +++ b/examples/tpu/tpuvm_mnist.yaml @@ -5,7 +5,7 @@ resources: # The setup command. Will be run under the working directory. setup: | - git clone https://github.com/google/flax.git --branch v0.6.11 + git clone https://github.com/google/flax.git --branch v0.8.2 conda activate flax if [ $? -eq 0 ]; then @@ -14,8 +14,8 @@ setup: | conda create -n flax python=3.10 -y conda activate flax # Make sure to install TPU related packages in a conda env to avoid package conflicts. - pip install "jax[tpu]==0.4.23" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html - pip install --upgrade clu + pip install "jax[tpu]==0.4.26" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html + pip install clu pip install -e flax pip install tensorflow tensorflow-datasets fi @@ -24,8 +24,9 @@ setup: | # The command to run. Will be run under the working directory. run: | conda activate flax + pip install clu cd flax/examples/mnist python3 main.py --workdir=/tmp/mnist \ - --config=configs/default.py \ - --config.learning_rate=0.05 \ - --config.num_epochs=10 + --config=configs/default.py \ + --config.learning_rate=0.05 \ + --config.num_epochs=10 diff --git a/llm/axolotl/axolotl-docker.yaml b/llm/axolotl/axolotl-docker.yaml new file mode 100644 index 00000000000..b883ebdde46 --- /dev/null +++ b/llm/axolotl/axolotl-docker.yaml @@ -0,0 +1,29 @@ +# Usage: +# HF_TOKEN=abc sky launch -c axolotl axolotl-docker.yaml --env HF_TOKEN -y -i30 --down + +name: axolotl + +resources: + accelerators: L4:1 + cloud: gcp # optional + +workdir: mistral + +setup: | + docker pull winglian/axolotl:main-py3.10-cu118-2.0.1 + +run: | + docker run --gpus all \ + -v ~/sky_workdir:/sky_workdir \ + -v /root/.cache:/root/.cache \ + winglian/axolotl:main-py3.10-cu118-2.0.1 \ + huggingface-cli login --token ${HF_TOKEN} + + docker run --gpus all \ + -v ~/sky_workdir:/sky_workdir \ + -v /root/.cache:/root/.cache \ + winglian/axolotl:main-py3.10-cu118-2.0.1 \ + accelerate launch -m axolotl.cli.train /sky_workdir/qlora.yaml + +envs: + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. 
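+
+# An expanded sketch of the usage line at the top (the token value is a
+# placeholder): `--env HF_TOKEN` forwards the token into `envs` above, `-y`
+# skips the confirmation prompt, and `-i 30 --down` tears the cluster down
+# after 30 idle minutes.
+#
+#   HF_TOKEN=hf_xxx sky launch -c axolotl llm/axolotl/axolotl-docker.yaml \
+#     --env HF_TOKEN -y -i 30 --down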
diff --git a/llm/axolotl/axolotl-spot.yaml b/llm/axolotl/axolotl-spot.yaml index b6c81b742c9..4832fa72c04 100644 --- a/llm/axolotl/axolotl-spot.yaml +++ b/llm/axolotl/axolotl-spot.yaml @@ -12,6 +12,7 @@ resources: accelerators: A100:1 cloud: gcp # optional use_spot: True + image_id: docker:winglian/axolotl:main-py3.10-cu118-2.0.1 workdir: mistral @@ -20,29 +21,13 @@ file_mounts: name: ${BUCKET} mode: MOUNT -setup: | - docker pull winglian/axolotl:main-py3.10-cu118-2.0.1 - run: | - docker run --gpus all \ - -v ~/sky_workdir:/sky_workdir \ - -v /root/.cache:/root/.cache \ - winglian/axolotl:main-py3.10-cu118-2.0.1 \ - huggingface-cli login --token ${HF_TOKEN} + huggingface-cli login --token ${HF_TOKEN} - docker run --gpus all \ - -v ~/sky_workdir:/sky_workdir \ - -v /root/.cache:/root/.cache \ - -v /sky-notebook:/sky-notebook \ - winglian/axolotl:main-py3.10-cu118-2.0.1 \ - accelerate launch -m axolotl.cli.train /sky_workdir/qlora-checkpoint.yaml + accelerate launch -m axolotl.cli.train qlora-checkpoint.yaml envs: - HF_TOKEN: # TODO: Replace with huggingface token - BUCKET: - - - - - + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. + BUCKET: # TODO: Fill with your unique bucket name, or use --env to pass. diff --git a/llm/axolotl/axolotl.yaml b/llm/axolotl/axolotl.yaml index d9cfd91aa6d..b43127c9361 100644 --- a/llm/axolotl/axolotl.yaml +++ b/llm/axolotl/axolotl.yaml @@ -5,31 +5,14 @@ name: axolotl resources: accelerators: L4:1 - cloud: gcp # optional + image_id: docker:winglian/axolotl:main-py3.10-cu118-2.0.1 workdir: mistral -setup: | - docker pull winglian/axolotl:main-py3.10-cu118-2.0.1 - run: | - docker run --gpus all \ - -v ~/sky_workdir:/sky_workdir \ - -v /root/.cache:/root/.cache \ - winglian/axolotl:main-py3.10-cu118-2.0.1 \ - huggingface-cli login --token ${HF_TOKEN} + huggingface-cli login --token ${HF_TOKEN} - docker run --gpus all \ - -v ~/sky_workdir:/sky_workdir \ - -v /root/.cache:/root/.cache \ - winglian/axolotl:main-py3.10-cu118-2.0.1 \ - accelerate launch -m axolotl.cli.train /sky_workdir/qlora.yaml + accelerate launch -m axolotl.cli.train qlora.yaml envs: - HF_TOKEN: # TODO: Replace with huggingface token - - - - - - + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. 
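+
+# For the spot variant (axolotl-spot.yaml above), both envs need values at
+# launch time; one way to pass them (names/values here are placeholders, and
+# BUCKET must be a globally unique bucket name used for checkpoint recovery):
+#
+#   HF_TOKEN=hf_xxx BUCKET=my-axolotl-ckpts \
+#     sky spot launch llm/axolotl/axolotl-spot.yaml --env HF_TOKEN --env BUCKET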
diff --git a/llm/axolotl/mistral/qlora-checkpoint.yaml b/llm/axolotl/mistral/qlora-checkpoint.yaml index 278a5d72b9a..1f1cc67446c 100644 --- a/llm/axolotl/mistral/qlora-checkpoint.yaml +++ b/llm/axolotl/mistral/qlora-checkpoint.yaml @@ -71,6 +71,7 @@ warmup_steps: 10 eval_steps: 0.05 eval_table_size: eval_table_max_new_tokens: 128 +eval_sample_packing: false save_steps: 2 ## increase based on your dataset save_strategy: steps debug: @@ -81,4 +82,4 @@ fsdp_config: special_tokens: bos_token: "" eos_token: "" - unk_token: "" \ No newline at end of file + unk_token: "" diff --git a/llm/axolotl/mistral/qlora.yaml b/llm/axolotl/mistral/qlora.yaml index 42c3742b52d..39b2c55b1ce 100644 --- a/llm/axolotl/mistral/qlora.yaml +++ b/llm/axolotl/mistral/qlora.yaml @@ -69,6 +69,7 @@ warmup_steps: 10 eval_steps: 0.05 eval_table_size: eval_table_max_new_tokens: 128 +eval_sample_packing: false save_steps: debug: deepspeed: @@ -78,4 +79,4 @@ fsdp_config: special_tokens: bos_token: "" eos_token: "" - unk_token: "" \ No newline at end of file + unk_token: "" diff --git a/llm/codellama/README.md b/llm/codellama/README.md index 1ed02e301d1..8e5025d22b5 100644 --- a/llm/codellama/README.md +++ b/llm/codellama/README.md @@ -68,7 +68,7 @@ Launching a cluster 'code-llama'. Proceed? [Y/n]: ```bash IP=$(sky status --ip code-llama) -curl -L http://$IP:8000/v1/completions \ +curl http://$IP:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "codellama/CodeLlama-70b-Instruct-hf", @@ -131,7 +131,7 @@ availability of the service while minimizing the cost. ```bash ENDPOINT=$(sky serve status --endpoint code-llama) -curl -L http://$ENDPOINT/v1/completions \ +curl http://$ENDPOINT/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "codellama/CodeLlama-70b-Instruct-hf", @@ -146,7 +146,7 @@ We can also access the Code Llama service with the openAI Chat API. ```bash ENDPOINT=$(sky serve status --endpoint code-llama) -curl -L http://$ENDPOINT/v1/chat/completions \ +curl http://$ENDPOINT/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "codellama/CodeLlama-70b-Instruct-hf", diff --git a/llm/codellama/gui.yaml b/llm/codellama/gui.yaml index 486e1409aa9..acd11e46871 100644 --- a/llm/codellama/gui.yaml +++ b/llm/codellama/gui.yaml @@ -38,7 +38,9 @@ run: | "model_name": "codellama/CodeLlama-70b-Instruct-hf", "api_base": "http://${ENDPOINT}/v1", "api_key": "empty", - "model_path": "codellama/CodeLlama-70b-Instruct-hf" + "model_path": "codellama/CodeLlama-70b-Instruct-hf", + "anony_only": false, + "api_type": "openai" } } EOF @@ -47,4 +49,4 @@ run: | echo 'Starting gradio server...' python -u -m fastchat.serve.gradio_web_server --share \ - --register-openai-compatible-models ~/model_info.json | tee ~/gradio.log + --register ~/model_info.json | tee ~/gradio.log diff --git a/llm/dbrx/README.md b/llm/dbrx/README.md index b546a323f6f..3011af9d4e6 100644 --- a/llm/dbrx/README.md +++ b/llm/dbrx/README.md @@ -22,7 +22,7 @@ In this recipe, you will serve `databricks/dbrx-instruct` on your own infra -- ```yaml envs: MODEL_NAME: databricks/dbrx-instruct - HF_TOKEN: # Change to your own huggingface token, or use --env to pass. + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. service: replicas: 2 @@ -171,7 +171,7 @@ Wait until the model is ready (this can take 10+ minutes), as indicated by these ... 
(task, pid=17433) INFO 03-28 04:32:50 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0% ``` -:tada: **Congratulations!** :tada: You have now launched the DBRX Instruct LLM on your infra. +🎉 **Congratulations!** 🎉 You have now launched the DBRX Instruct LLM on your infra. You can play with the model via - Standard OpenAI-compatible endpoints (e.g., `/v1/chat/completions`) @@ -256,7 +256,7 @@ ENDPOINT=$(sky serve status --endpoint dbrx) To curl the endpoint: ```console -curl -L $ENDPOINT/v1/chat/completions \ +curl $ENDPOINT/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "databricks/dbrx-instruct", diff --git a/llm/dbrx/dbrx.yaml b/llm/dbrx/dbrx.yaml index ffa777ab86d..0c9abd06d30 100644 --- a/llm/dbrx/dbrx.yaml +++ b/llm/dbrx/dbrx.yaml @@ -31,7 +31,7 @@ envs: MODEL_NAME: databricks/dbrx-instruct - HF_TOKEN: # Change to your own huggingface token, or use --env to pass. + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. service: replicas: 2 diff --git a/llm/falcon/falcon.yaml b/llm/falcon/falcon.yaml index 256d936d61b..b752db5256b 100644 --- a/llm/falcon/falcon.yaml +++ b/llm/falcon/falcon.yaml @@ -7,7 +7,7 @@ workdir: . envs: MODEL_NAME: tiiuae/falcon-7b # [ybelkada/falcon-7b-sharded-bf16, tiiuae/falcon-7b, tiiuae/falcon-40b] - WANDB_API_KEY: $WANDB_KEY # Change to your own wandb key + WANDB_API_KEY: # TODO: Fill with your own WANDB_API_KEY, or use --env to pass. OUTPUT_BUCKET_NAME: # Set a unique name for the bucket which will store model weights file_mounts: @@ -39,4 +39,4 @@ run: | --bnb_4bit_compute_dtype bfloat16 \ --max_steps 500 \ --dataset_name timdettmers/openassistant-guanaco \ - --output_dir /results \ No newline at end of file + --output_dir /results diff --git a/llm/gemma/README.md b/llm/gemma/README.md index 2801c4fd6f3..ef5027b2807 100644 --- a/llm/gemma/README.md +++ b/llm/gemma/README.md @@ -37,7 +37,7 @@ After the cluster is launched, we can access the model with the following comman ```bash IP=$(sky status --ip gemma) -curl -L http://$IP:8000/v1/completions \ +curl http://$IP:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "google/gemma-7b-it", @@ -50,7 +50,7 @@ Chat API is also supported: ```bash IP=$(sky status --ip gemma) -curl -L http://$IP:8000/v1/chat/completions \ +curl http://$IP:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "google/gemma-7b-it", @@ -78,7 +78,7 @@ After the cluster is launched, we can access the model with the following comman ```bash ENDPOINT=$(sky serve status --endpoint gemma) -curl -L http://$ENDPOINT/v1/completions \ +curl http://$ENDPOINT/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "google/gemma-7b-it", @@ -89,7 +89,7 @@ curl -L http://$ENDPOINT/v1/completions \ Chat API is also supported: ```bash -curl -L http://$ENDPOINT/v1/chat/completions \ +curl http://$ENDPOINT/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "google/gemma-7b-it", diff --git a/llm/gemma/serve.yaml b/llm/gemma/serve.yaml index 73f5b9c2b5d..4c5a2c984c5 100644 --- a/llm/gemma/serve.yaml +++ b/llm/gemma/serve.yaml @@ -17,7 +17,7 @@ service: envs: MODEL_NAME: google/gemma-7b-it - HF_TOKEN: # TODO: Replace with huggingface token + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. 
resources: accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} diff --git a/llm/gpt-2/README.md b/llm/gpt-2/README.md new file mode 100644 index 00000000000..bc9893fec5b --- /dev/null +++ b/llm/gpt-2/README.md @@ -0,0 +1,119 @@ +# Run GPT-2 in llm.c on any cloud with SkyPilot + +This is a reproducible package of llm.c's GPT-2 (124M) training by @karpathy (https://github.com/karpathy/llm.c/discussions/481). +With SkyPilot, you can run GPT-2 (124M) training on any cloud. SkyPilot looks for the cheapest resources available on the clouds enabled for a user, then launches and manages the whole data processing and training pipeline, bringing the cost close to the ~\$20 target that @karpathy mentioned in the discussion. + +## Prerequisites + +1. Install [SkyPilot](https://github.com/skypilot-org/skypilot): +```bash +pip install "skypilot-nightly[aws,gcp,azure,kubernetes,lambda,fluidstack]" # Choose the clouds you want to enable +``` +2. Enable clouds for SkyPilot: +```bash +sky check +``` +Please check the instructions for enabling clouds in the [SkyPilot docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html). + +3. Download the YAML for starting the training: +```bash +wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/gpt-2/gpt2.yaml +``` + +## Run GPT-2 training + +Run the following command to start GPT-2 (124M) training on a GPU VM with 8 A100 GPUs: + +```bash +sky launch -c gpt2 gpt2.yaml +``` + +![GPT-2 training with 8 A100 GPUs](https://imgur.com/v8SGpsF.png) + +Or, you can train the model with a single A100 by adding `--gpus A100`: +```bash +sky launch -c gpt2 gpt2.yaml --gpus A100 +``` + +![GPT-2 training with a single A100](https://imgur.com/hN65g4r.png) + + +It is also possible to speed up training by using 8 H100s (2.3x more tok/s than 8x A100s): +```bash +sky launch -c gpt2 gpt2.yaml --gpus H100:8 +``` + +![GPT-2 training with 8 H100](https://imgur.com/STbi80b.png) + +### Download logs and visualizations + +After the training is finished, you can download the logs and visualizations with the following command: +```bash +scp -r gpt2:~/llm.c/log124M . +``` +We can visualize the training progress with the notebook provided in [llm.c](https://github.com/karpathy/llm.c/blob/master/dev/vislog.ipynb). (Note: we cut off training after 10K steps, which already achieves a validation loss similar to the OpenAI GPT-2 checkpoint.) + 
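+One possible way to run that notebook against the downloaded logs locally (a sketch, assuming Jupyter is installed and that the notebook picks up `log124M/` from the repo root):
+
+```bash
+git clone https://github.com/karpathy/llm.c.git
+cp -r log124M llm.c/
+jupyter notebook llm.c/dev/vislog.ipynb
+```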
+ +
+ +> Yes! We are able to reproduce the training of GPT-2 (124M) on any cloud with SkyPilot. + + + +## Advanced: Run GPT-2 training in two stages + +The data processing for GPT-2 training is CPU-bound, while the training is GPU-bound. Having the data processing on a GPU VM is not cost-effective. With SkyPilot, you can easily +separate the data processing and training into two stages and execute them sequentially yourself, or let SkyPilot manage the dependencies between the two stages. + +With this split, data processing can run on cheaper CPU VMs (e.g., ~\$0.4/hour), while training runs on more expensive GPU VMs (e.g., ~\$1.3-\$3.6/hour for a single A100 GPU, or \$10.3-\$32.8/hour for 8 A100 GPUs). + +We can run the data processing on a CPU VM and store the processed data in a cloud bucket. Then, we can run the training on a GPU VM with the processed data. + +```bash +wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/gpt-2/gpt2-data.yaml +wget https://raw.githubusercontent.com/skypilot-org/skypilot/master/llm/gpt-2/gpt2-train.yaml +``` + +### Run two stages manually +#### Data processing + +Run the following command to process the training data on a CPU VM and store it in a cloud bucket for future use (replace `your-bucket-name` with your bucket name): + +```bash +sky launch -c gpt2-data gpt2-data.yaml --env BUCKET_NAME=your-bucket-name +``` + + +#### Training + +After the data is processed, you can then train the model on a GPU VM with 8 A100 GPUs (replace `your-bucket-name` with your bucket name): + +```bash +sky launch -c gpt2-train --detach-setup gpt2-train.yaml --env BUCKET_NAME=your-bucket-name +``` + +Or, you can train the model with a single A100, by adding `--gpus A100`: +```bash +sky launch -c gpt2-train --detach-setup gpt2-train.yaml --gpus A100 --env BUCKET_NAME=your-bucket-name +``` + + +### Run in a Pipeline + +We can also combine the two steps into a single SkyPilot job and let SkyPilot handle the dependencies between the two steps. Here is an example of how to do this (replace `your-bucket-name` with your bucket name): +```bash +sky jobs launch -n gpt2 gpt2-pipeline.yaml --env BUCKET_NAME=your-bucket-name +``` + +> Note: the pipeline yaml can be retrieved with the following command: +```bash +cat gpt2-data.yaml > gpt2-pipeline.yaml; echo "---" >> gpt2-pipeline.yaml; cat gpt2-train.yaml >> gpt2-pipeline.yaml +``` + +SkyPilot will first download and process the dataset on a CPU VM and store the +processed data in a cloud bucket. Then, it will launch a GPT-2 training job on a +GPU VM. The training job will train GPT-2 (124M) on the processed data. + + + diff --git a/llm/gpt-2/gpt2-data.yaml b/llm/gpt-2/gpt2-data.yaml new file mode 100644 index 00000000000..fc7bb02bf95 --- /dev/null +++ b/llm/gpt-2/gpt2-data.yaml @@ -0,0 +1,34 @@ +name: gpt2-data + +envs: + BUCKET_NAME: # TODO: Fill in your bucket name + BUCKET_STORE: s3 # Can be s3, gcs, or r2. 
+ +resources: + cpus: 8+ + +file_mounts: + /cache: + name: $BUCKET_NAME + store: $BUCKET_STORE + mode: MOUNT + +setup: | + pip install tqdm tiktoken requests datasets + git clone https://github.com/karpathy/llm.c.git || true + # Pin llm.c to the commit this recipe was written against. + (cd llm.c && git checkout ed37d9261ba13ef212c01e2de8b309cbb46a2aa7) + + # Adding revision to fix the dataset version, as the latest fineweb + # dataset removed the samples, causing error: + # Please pass `features` or at least one example when writing data + sed -i 's/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train")/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train", revision="9767af12bf8f0f7d3c91e0345b89bc6b9cbe1a94")/' llm.c/dev/data/fineweb.py + + +run: | + cd llm.c + # tokenize the FineWeb dataset 10B tokens sample (takes ~1 hour, get lunch?) + # writes ~19GB of raw GPT-2 tokens to dev/data/fineweb10B + # and ~46GB in ~/.cache/huggingface/datasets/HuggingFaceFW___fineweb + python dev/data/fineweb.py --version 10B + + rsync -Pavz --exclude "datasets/downloads/" ~/.cache/huggingface /cache/ + rsync -Pavz dev/data/fineweb10B /cache/ diff --git a/llm/gpt-2/gpt2-pipeline.yaml b/llm/gpt-2/gpt2-pipeline.yaml new file mode 100644 index 00000000000..e5ea05f7948 --- /dev/null +++ b/llm/gpt-2/gpt2-pipeline.yaml @@ -0,0 +1,129 @@ +name: gpt2-data + +envs: + BUCKET_NAME: # TODO: Fill in your bucket name + BUCKET_STORE: s3 # Can be s3, gcs, or r2. + +resources: + cpus: 8+ + +file_mounts: + /cache: + name: $BUCKET_NAME + store: $BUCKET_STORE + mode: MOUNT + +setup: | + pip install tqdm tiktoken requests datasets + git clone https://github.com/karpathy/llm.c.git || true + # Pin llm.c to the commit this recipe was written against. + (cd llm.c && git checkout ed37d9261ba13ef212c01e2de8b309cbb46a2aa7) + + # Adding revision to fix the dataset version, as the latest fineweb + # dataset removed the samples, causing error: + # Please pass `features` or at least one example when writing data + sed -i 's/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train")/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train", revision="9767af12bf8f0f7d3c91e0345b89bc6b9cbe1a94")/' llm.c/dev/data/fineweb.py + + +run: | + cd llm.c + # tokenize the FineWeb dataset 10B tokens sample (takes ~1 hour, get lunch?) + # writes ~19GB of raw GPT-2 tokens to dev/data/fineweb10B + # and ~46GB in ~/.cache/huggingface/datasets/HuggingFaceFW___fineweb + python dev/data/fineweb.py --version 10B + + rsync -Pavz --exclude "datasets/downloads/" ~/.cache/huggingface /cache/ + rsync -Pavz dev/data/fineweb10B /cache/ +--- +name: gpt2-train + +envs: + BUCKET_NAME: # TODO: Fill in your bucket name + BUCKET_STORE: s3 # Can be s3, gcs, or r2. + +resources: + accelerators: A100:8 + # Use a docker image with a recent g++ to enable compiling llm.c. + image_id: docker:nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 + any_of: + # Avoid using the docker image for Lambda, because Docker is not supported + # on Lambda yet, but the base image works. 
- cloud: lambda + image_id: null + - cloud: aws + - cloud: gcp + - cloud: azure + - cloud: fluidstack + - cloud: kubernetes + +file_mounts: + ~/.cache/huggingface: + name: $BUCKET_NAME + store: $BUCKET_STORE + mode: COPY + +setup: | + cd ~ + + # install cudnn so we can use FlashAttention and run fast (optional) + # https://developer.nvidia.com/cudnn-downloads + # for me, CUDA 12 (run `nvcc --version`) running on Linux x86_64 Ubuntu 22.04 + if [ -f ./CUDNN_INSTALLED ]; then + echo "cudnn already installed" + else + system=$(lsb_release -si | tr '[:upper:]' '[:lower:]') + # Get version and remove the dot + version=$(lsb_release -sr | tr -d .) + export system_version="${system}${version}" + wget https://developer.download.nvidia.com/compute/cudnn/9.1.1/local_installers/cudnn-local-repo-${system_version}-9.1.1_1.0-1_amd64.deb -O cudnn-installer.deb + sudo dpkg -i cudnn-installer.deb + sudo cp /var/cudnn-local-repo-${system_version}-9.1.1/cudnn-*-keyring.gpg /usr/share/keyrings/ + # Remove problematic kubernetes.list source + sudo apt-get update --allow-releaseinfo-change || true + + sudo apt-get -y install cudnn-cuda-12 + + touch ./CUDNN_INSTALLED + fi + + # "install" cudnn-frontend to ~/ + sudo apt -y install git + git clone https://github.com/NVIDIA/cudnn-frontend.git || true + + # install MPI (optional, if you intend to use multiple GPUs) + # SkyPilot does not install MPI, as that requires NCCL, which needs to be + # installed manually. + sudo apt install -y openmpi-bin openmpi-doc libopenmpi-dev + # install nccl + pip install nvidia-nccl-cu12 + export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/nccl2/lib + export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/nccl2/include + + git clone https://github.com/karpathy/llm.c.git || true + cd llm.c + ln -s ~/.cache/huggingface/fineweb10B dev/data/ + # compile llm.c (mixed precision, with cuDNN flash-attention) + # first compilation is ~1 minute, mostly due to cuDNN + make train_gpt2cu USE_CUDNN=1 + + +run: | + cd ~/llm.c + # train on multiple GPUs + mpirun -np $SKYPILOT_NUM_GPUS_PER_NODE --allow-run-as-root ./train_gpt2cu \ + -i "dev/data/fineweb10B/fineweb_train_*.bin" \ + -j "dev/data/fineweb10B/fineweb_val_*.bin" \ + -o log124M \ + -e "d12" \ + -b 64 -t 1024 \ + -d 524288 \ + -r 1 \ + -z 1 \ + -c 0.1 \ + -l 0.0006 \ + -q 0.0 \ + -u 700 \ + -n 5000 \ + -v 250 -s 20000 \ + -h 1 + + # Upload the log and model to the bucket + rsync -Pavz log124M ~/.cache/huggingface diff --git a/llm/gpt-2/gpt2-train.yaml b/llm/gpt-2/gpt2-train.yaml new file mode 100644 index 00000000000..3a4e8c28d14 --- /dev/null +++ b/llm/gpt-2/gpt2-train.yaml @@ -0,0 +1,94 @@ +name: gpt2-train + +envs: + BUCKET_NAME: # TODO: Fill in your bucket name + BUCKET_STORE: s3 # Can be s3, gcs, or r2. + +resources: + accelerators: A100:8 + # Use a docker image with a recent g++ to enable compiling llm.c. + image_id: docker:nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 + any_of: + # Avoid using the docker image for Lambda, because Docker is not supported + # on Lambda yet, but the base image works. 
- cloud: lambda + image_id: null + - cloud: aws + - cloud: gcp + - cloud: azure + - cloud: fluidstack + - cloud: kubernetes + +file_mounts: + ~/.cache/huggingface: + name: $BUCKET_NAME + store: $BUCKET_STORE + mode: COPY + +setup: | + cd ~ + + # install cudnn so we can use FlashAttention and run fast (optional) + # https://developer.nvidia.com/cudnn-downloads + # for me, CUDA 12 (run `nvcc --version`) running on Linux x86_64 Ubuntu 22.04 + if [ -f ./CUDNN_INSTALLED ]; then + echo "cudnn already installed" + else + system=$(lsb_release -si | tr '[:upper:]' '[:lower:]') + # Get version and remove the dot + version=$(lsb_release -sr | tr -d .) + export system_version="${system}${version}" + wget https://developer.download.nvidia.com/compute/cudnn/9.1.1/local_installers/cudnn-local-repo-${system_version}-9.1.1_1.0-1_amd64.deb -O cudnn-installer.deb + sudo dpkg -i cudnn-installer.deb + sudo cp /var/cudnn-local-repo-${system_version}-9.1.1/cudnn-*-keyring.gpg /usr/share/keyrings/ + # Remove problematic kubernetes.list source + sudo apt-get update --allow-releaseinfo-change || true + + sudo apt-get -y install cudnn-cuda-12 + + touch ./CUDNN_INSTALLED + fi + + # "install" cudnn-frontend to ~/ + sudo apt -y install git + git clone https://github.com/NVIDIA/cudnn-frontend.git || true + + # install MPI (optional, if you intend to use multiple GPUs) + # SkyPilot does not install MPI, as that requires NCCL, which needs to be + # installed manually. + sudo apt install -y openmpi-bin openmpi-doc libopenmpi-dev + # install nccl + pip install nvidia-nccl-cu12 + export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/nccl2/lib + export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/nccl2/include + + git clone https://github.com/karpathy/llm.c.git || true + cd llm.c + ln -s ~/.cache/huggingface/fineweb10B dev/data/ + # compile llm.c (mixed precision, with cuDNN flash-attention) + # first compilation is ~1 minute, mostly due to cuDNN + make train_gpt2cu USE_CUDNN=1 + + +run: | + cd ~/llm.c + # train on multiple GPUs + mpirun -np $SKYPILOT_NUM_GPUS_PER_NODE --allow-run-as-root ./train_gpt2cu \ + -i "dev/data/fineweb10B/fineweb_train_*.bin" \ + -j "dev/data/fineweb10B/fineweb_val_*.bin" \ + -o log124M \ + -e "d12" \ + -b 64 -t 1024 \ + -d 524288 \ + -r 1 \ + -z 1 \ + -c 0.1 \ + -l 0.0006 \ + -q 0.0 \ + -u 700 \ + -n 5000 \ + -v 250 -s 20000 \ + -h 1 + + # Upload the log and model to the bucket + rsync -Pavz log124M ~/.cache/huggingface diff --git a/llm/gpt-2/gpt2.yaml b/llm/gpt-2/gpt2.yaml new file mode 100644 index 00000000000..8e203772128 --- /dev/null +++ b/llm/gpt-2/gpt2.yaml @@ -0,0 +1,95 @@ +name: train + +resources: + accelerators: A100:8 + # Use a docker image with a recent g++ to enable compiling llm.c. + image_id: docker:nvidia/cuda:12.4.1-cudnn-devel-ubuntu22.04 + any_of: + # Avoid using the docker image for Lambda, because Docker is not supported + # on Lambda yet, but the base image works. 
- cloud: lambda + image_id: null + - cloud: aws + - cloud: gcp + - cloud: azure + - cloud: fluidstack + - cloud: kubernetes + + +setup: | + cd ~ + pip install tqdm tiktoken requests datasets + + # Training dependencies + # install cudnn so we can use FlashAttention and run fast (optional) + # https://developer.nvidia.com/cudnn-downloads + # for me, CUDA 12 (run `nvcc --version`) running on Linux x86_64 Ubuntu 22.04 + if [ -f ./CUDNN_INSTALLED ]; then + echo "cudnn already installed" + else + system=$(lsb_release -si | tr '[:upper:]' '[:lower:]') + # Get version and remove the dot + version=$(lsb_release -sr | tr -d .) + export system_version="${system}${version}" + wget https://developer.download.nvidia.com/compute/cudnn/9.1.1/local_installers/cudnn-local-repo-${system_version}-9.1.1_1.0-1_amd64.deb -O cudnn-installer.deb + sudo dpkg -i cudnn-installer.deb + sudo cp /var/cudnn-local-repo-${system_version}-9.1.1/cudnn-*-keyring.gpg /usr/share/keyrings/ + # Remove problematic kubernetes.list source + sudo apt-get update --allow-releaseinfo-change || true + + sudo apt-get -y install cudnn-cuda-12 + + touch ./CUDNN_INSTALLED + fi + + # "install" cudnn-frontend to ~/ + sudo apt -y install git + git clone https://github.com/NVIDIA/cudnn-frontend.git || true + + # install MPI (optional, if you intend to use multiple GPUs) + # SkyPilot does not install MPI, as that requires NCCL, which needs to be + # installed manually. + sudo apt install -y openmpi-bin openmpi-doc libopenmpi-dev + # install nccl + pip install nvidia-nccl-cu12 + export LIBRARY_PATH=$LIBRARY_PATH:/usr/local/nccl2/lib + export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/usr/local/nccl2/include + + git clone https://github.com/karpathy/llm.c.git || true + cd llm.c + + # add revision to fix the dataset version, as the latest fineweb + # dataset removed the samples, causing error: + # Please pass `features` or at least one example when writing data + sed -i 's/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train")/fw = load_dataset("HuggingFaceFW\/fineweb", name=remote_name, split="train", revision="9767af12bf8f0f7d3c91e0345b89bc6b9cbe1a94")/' dev/data/fineweb.py + + # compile llm.c (mixed precision, with cuDNN flash-attention) + # first compilation is ~1 minute, mostly due to cuDNN + make train_gpt2cu USE_CUDNN=1 + + +run: | + cd ~/llm.c + # Processing data + # tokenize the FineWeb dataset 10B tokens sample (takes ~1 hour, get lunch?) + # writes ~19GB of raw GPT-2 tokens to dev/data/fineweb10B + # and ~46GB in ~/.cache/huggingface/datasets/HuggingFaceFW___fineweb + python dev/data/fineweb.py --version 10B + + # Start training on multiple GPUs + mpirun -np $SKYPILOT_NUM_GPUS_PER_NODE --allow-run-as-root ./train_gpt2cu \ + -i "dev/data/fineweb10B/fineweb_train_*.bin" \ + -j "dev/data/fineweb10B/fineweb_val_*.bin" \ + -o log124M \ + -e "d12" \ + -b 64 -t 1024 \ + -d 524288 \ + -r 1 \ + -z 1 \ + -c 0.1 \ + -l 0.0006 \ + -q 0.0 \ + -u 700 \ + -n 5000 \ + -v 250 -s 20000 \ + -h 1 diff --git a/llm/llama-2/README.md b/llm/llama-2/README.md index 7b20ea4aed7..d8f8151572e 100644 --- a/llm/llama-2/README.md +++ b/llm/llama-2/README.md @@ -33,7 +33,7 @@ Fill the access token in the [chatbot-hf.yaml](https://github.com/skypilot-org/s ```yaml envs: MODEL_SIZE: 7 - HF_TOKEN: + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. 
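+  # (A read-access token created at https://huggingface.co/settings/tokens works here.)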
``` diff --git a/llm/llama-2/chatbot-hf.yaml b/llm/llama-2/chatbot-hf.yaml index 4c0132e4dd4..ee9d0281296 100644 --- a/llm/llama-2/chatbot-hf.yaml +++ b/llm/llama-2/chatbot-hf.yaml @@ -6,7 +6,7 @@ resources: envs: MODEL_SIZE: 7 - HF_TOKEN: # TODO: Replace with huggingface token + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. setup: | conda activate chatbot @@ -24,12 +24,13 @@ run: | conda activate chatbot echo 'Starting controller...' - python -u -m fastchat.serve.controller > ~/controller.log 2>&1 & + python -u -m fastchat.serve.controller --host 127.0.0.1 > ~/controller.log 2>&1 & sleep 10 echo 'Starting model worker...' python -u -m fastchat.serve.model_worker \ --model-path meta-llama/Llama-2-${MODEL_SIZE}b-chat-hf \ --num-gpus $SKYPILOT_NUM_GPUS_PER_NODE 2>&1 \ + --host 127.0.0.1 \ | tee model_worker.log & echo 'Waiting for model worker to start...' diff --git a/llm/llama-2/chatbot-meta.yaml b/llm/llama-2/chatbot-meta.yaml index a0481fe760f..733a2a867d2 100644 --- a/llm/llama-2/chatbot-meta.yaml +++ b/llm/llama-2/chatbot-meta.yaml @@ -6,7 +6,7 @@ resources: envs: MODEL_SIZE: 7 - HF_TOKEN: # TODO: Replace with huggingface token + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. setup: | set -ex diff --git a/llm/llama-3/README.md b/llm/llama-3/README.md new file mode 100644 index 00000000000..d0c28dc93c6 --- /dev/null +++ b/llm/llama-3/README.md @@ -0,0 +1,354 @@ + +# Scale Serving Llama-3 on Any Cloud or Kubernetes with SkyPilot + + + + +

+*(image: Llama-3 x SkyPilot)* +

+ +[Llama-3](https://github.com/meta-llama/llama3) is the latest top open-source LLM from Meta. It has been released with a license that authorizes commercial use. You can deploy a private Llama-3 chatbot with SkyPilot in your own cloud with just one simple command. + +* [Llama-3 release](https://github.com/meta-llama/llama3) +* [Llama-3 blog](https://ai.meta.com/blog/meta-llama-3/) + + + + + +## Why use SkyPilot vs. commercial hosted solutions? + +* No lock-in: run on any supported cloud - AWS, Azure, GCP, Lambda Cloud, IBM, Samsung, OCI +* Everything stays in your cloud account (your VMs & buckets) +* No one else sees your chat history +* Pay absolute minimum — no managed solution markups +* Freely choose your own model size, GPU type, number of GPUs, etc., based on scale and budget. + +…and you get all of this with 1 click — let SkyPilot automate the infra. + + +## Prerequisites + +- Go to the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) and request access to the model `meta-llama/Meta-Llama-3-70B-Instruct`. +- Check that you have installed SkyPilot ([docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)). +- Check that `sky check` shows clouds or Kubernetes are enabled. + +## SkyPilot YAML + 
+Click to see the full recipe YAML + +```yaml + +envs: + MODEL_NAME: meta-llama/Meta-Llama-3-70B-Instruct + # MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. + +service: + replicas: 2 + # An actual request for readiness probe. + readiness_probe: + path: /v1/chat/completions + post_data: + model: $MODEL_NAME + messages: + - role: user + content: Hello! What is your name? + max_tokens: 1 + +resources: + accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8} + # accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model. + cpus: 32+ + use_spot: True + disk_size: 512 # Ensure model checkpoints can fit. + disk_tier: best + ports: 8081 # Expose to internet traffic. + +setup: | + conda activate vllm + if [ $? -ne 0 ]; then + conda create -n vllm python=3.10 -y + conda activate vllm + fi + + pip install vllm==0.4.2 + # Install Gradio for web UI. + pip install gradio openai + pip install flash-attn==2.5.9.post1 + + +run: | + conda activate vllm + echo 'Starting vllm api server...' + + # https://github.com/vllm-project/vllm/issues/3098 + export PATH=$PATH:/sbin + + # NOTE: --gpu-memory-utilization 0.95 needed for 4-GPU nodes. + python -u -m vllm.entrypoints.openai.api_server \ + --port 8081 \ + --model $MODEL_NAME \ + --trust-remote-code --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ + --gpu-memory-utilization 0.95 \ + --max-num-seqs 64 \ + 2>&1 | tee api_server.log & + + while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do + echo 'Waiting for vllm api server to start...' + sleep 5 + done + + echo 'Starting gradio server...' + git clone https://github.com/vllm-project/vllm.git || true + python vllm/examples/gradio_openai_chatbot_webserver.py \ + -m $MODEL_NAME \ + --port 8811 \ + --model-url http://localhost:8081/v1 \ + --stop-token-ids 128009,128001 + +``` + +
+ +You can also get the full YAML file [here](https://github.com/skypilot-org/skypilot/blob/master/llm/llama-3/llama3.yaml). + +## Serving Llama-3: single instance + +Launch a single spot instance to serve Llama-3 on your infra: +```console +HF_TOKEN=xxx sky launch llama3.yaml -c llama3 --env HF_TOKEN +``` + +
+Example outputs: + +```console +... +I 04-18 16:31:30 optimizer.py:693] == Optimizer == +I 04-18 16:31:30 optimizer.py:704] Target: minimizing cost +I 04-18 16:31:30 optimizer.py:716] Estimated cost: $1.2 / hour +I 04-18 16:31:30 optimizer.py:716] +I 04-18 16:31:30 optimizer.py:839] Considered resources (1 node): +I 04-18 16:31:30 optimizer.py:909] ----------------------------------------------------------------------------------------------------------------- +I 04-18 16:31:30 optimizer.py:909] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN +I 04-18 16:31:30 optimizer.py:909] ----------------------------------------------------------------------------------------------------------------- +I 04-18 16:31:30 optimizer.py:909] Azure Standard_NC48ads_A100_v4[Spot] 48 440 A100-80GB:2 eastus 1.22 ✔ +I 04-18 16:31:30 optimizer.py:909] AWS g6.48xlarge[Spot] 192 768 L4:8 us-east-1b 1.43 +I 04-18 16:31:30 optimizer.py:909] Azure Standard_NC96ads_A100_v4[Spot] 96 880 A100-80GB:4 eastus 2.44 +I 04-18 16:31:30 optimizer.py:909] AWS g5.48xlarge[Spot] 192 768 A10G:8 us-east-2b 2.45 +I 04-18 16:31:30 optimizer.py:909] GCP g2-standard-96[Spot] 96 384 L4:8 asia-east1-a 2.49 +I 04-18 16:31:30 optimizer.py:909] Azure Standard_ND96asr_v4[Spot] 96 900 A100:8 eastus 4.82 +I 04-18 16:31:30 optimizer.py:909] GCP a2-highgpu-4g[Spot] 48 340 A100:4 europe-west4-a 4.82 +I 04-18 16:31:30 optimizer.py:909] AWS p4d.24xlarge[Spot] 96 1152 A100:8 us-east-2b 4.90 +I 04-18 16:31:30 optimizer.py:909] Azure Standard_ND96amsr_A100_v4[Spot] 96 1924 A100-80GB:8 southcentralus 5.17 +I 04-18 16:31:30 optimizer.py:909] GCP a2-ultragpu-4g[Spot] 48 680 A100-80GB:4 us-east4-c 7.39 +I 04-18 16:31:30 optimizer.py:909] GCP a2-highgpu-8g[Spot] 96 680 A100:8 europe-west4-a 9.65 +I 04-18 16:31:30 optimizer.py:909] GCP a2-ultragpu-8g[Spot] 96 1360 A100-80GB:8 us-east4-c 14.79 +I 04-18 16:31:30 optimizer.py:909] ----------------------------------------------------------------------------------------------------------------- +I 04-18 16:31:30 optimizer.py:909] +... +``` + +
+ +To run on Kubernetes or use an on-demand instance, pass `--no-use-spot` to the above command. + +
+Example outputs with Kubernetes / on-demand instances: + +```console +$ HF_TOKEN=xxx sky launch llama3.yaml -c llama3 --env HF_TOKEN --no-use-spot +... +I 04-18 16:34:13 optimizer.py:693] == Optimizer == +I 04-18 16:34:13 optimizer.py:704] Target: minimizing cost +I 04-18 16:34:13 optimizer.py:716] Estimated cost: $5.0 / hour +I 04-18 16:34:13 optimizer.py:716] +I 04-18 16:34:13 optimizer.py:839] Considered resources (1 node): +I 04-18 16:34:13 optimizer.py:909] ------------------------------------------------------------------------------------------------------------------ +I 04-18 16:34:13 optimizer.py:909] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN +I 04-18 16:34:13 optimizer.py:909] ------------------------------------------------------------------------------------------------------------------ +I 04-18 16:34:13 optimizer.py:909] Kubernetes 32CPU--512GB--8A100 32 512 A100:8 kubernetes 0.00 ✔ +I 04-18 16:34:13 optimizer.py:909] Fluidstack recE2ZDQmqR9HBKYs5xSnjtPw 64 240 A100-80GB:2 generic_1_canada 4.96 +I 04-18 16:34:13 optimizer.py:909] Fluidstack recUiB2e6s3XDxwE9 60 440 A100:4 calgary_1_canada 5.88 +I 04-18 16:34:13 optimizer.py:909] Azure Standard_NC48ads_A100_v4 48 440 A100-80GB:2 eastus 7.35 +I 04-18 16:34:13 optimizer.py:909] GCP g2-standard-96 96 384 L4:8 us-east4-a 7.98 +I 04-18 16:34:13 optimizer.py:909] Fluidstack recWGm4oJ9AB3XVPxzRaujgbx 126 480 A100-80GB:4 generic_1_canada 9.89 +I 04-18 16:34:13 optimizer.py:909] Paperspace A100-80Gx4 46 320 A100-80GB:4 East Coast (NY2) 12.72 +I 04-18 16:34:13 optimizer.py:909] AWS g6.48xlarge 192 768 L4:8 us-east-1 13.35 +I 04-18 16:34:13 optimizer.py:909] GCP a2-highgpu-4g 48 340 A100:4 us-central1-a 14.69 +I 04-18 16:34:13 optimizer.py:909] Azure Standard_NC96ads_A100_v4 96 880 A100-80GB:4 eastus 14.69 +I 04-18 16:34:13 optimizer.py:909] AWS g5.48xlarge 192 768 A10G:8 us-east-1 16.29 +I 04-18 16:34:13 optimizer.py:909] Fluidstack recUYj6oGJCvAvCXC7KQo5Fc7 252 960 A100-80GB:8 generic_1_canada 19.79 +I 04-18 16:34:13 optimizer.py:909] GCP a2-ultragpu-4g 48 680 A100-80GB:4 us-central1-a 20.11 +I 04-18 16:34:13 optimizer.py:909] Paperspace A100-80Gx8 96 640 A100-80GB:8 East Coast (NY2) 25.44 +I 04-18 16:34:13 optimizer.py:909] Azure Standard_ND96asr_v4 96 900 A100:8 eastus 27.20 +I 04-18 16:34:13 optimizer.py:909] GCP a2-highgpu-8g 96 680 A100:8 us-central1-a 29.39 +I 04-18 16:34:13 optimizer.py:909] Azure Standard_ND96amsr_A100_v4 96 1924 A100-80GB:8 eastus 32.77 +I 04-18 16:34:13 optimizer.py:909] AWS p4d.24xlarge 96 1152 A100:8 us-east-1 32.77 +I 04-18 16:34:13 optimizer.py:909] GCP a2-ultragpu-8g 96 1360 A100-80GB:8 us-central1-a 40.22 +I 04-18 16:34:13 optimizer.py:909] AWS p4de.24xlarge 96 1152 A100-80GB:8 us-east-1 40.97 +I 04-18 16:34:13 optimizer.py:909] ------------------------------------------------------------------------------------------------------------------ +... +``` + +
+ +Wait until the model is ready (this can take 10+ minutes), as indicated by these lines: +```console +... +(task, pid=17433) Waiting for vllm api server to start... +... +(task, pid=17433) INFO: Started server process [20621] +(task, pid=17433) INFO: Waiting for application startup. +(task, pid=17433) INFO: Application startup complete. +(task, pid=17433) INFO: Uvicorn running on http://0.0.0.0:8081 (Press CTRL+C to quit) +... +(task, pid=17433) Running on local URL: http://127.0.0.1:8811 +(task, pid=17433) Running on public URL: https://xxxxxxxxxx.gradio.live +... +(task, pid=17433) INFO 03-28 04:32:50 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0% +``` +🎉 **Congratulations!** 🎉 You have now launched the Llama-3 Instruct LLM on your infra. + +You can play with the model via +- Standard OpenAI-compatible endpoints (e.g., `/v1/chat/completions`) +- Gradio UI (automatically launched) + +To curl `/v1/chat/completions`: +```console +ENDPOINT=$(sky status --endpoint 8081 llama3) + +# We need to manually specify the stop_token_ids to make sure the model finishes +# on <|eot_id|>. +curl http://$ENDPOINT/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "meta-llama/Meta-Llama-3-70B-Instruct", + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "Who are you?" + } + ], + "stop_token_ids": [128009, 128001] + }' +``` + +To use the Gradio UI, open the URL shown in the logs: +```console +(task, pid=17433) Running on public URL: https://xxxxxxxxxx.gradio.live +``` + 
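+To extract just the assistant's reply from the JSON response, one option is to pipe through `jq` (assuming it is installed; this reuses `$ENDPOINT` from above):
+
+```console
+curl http://$ENDPOINT/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+        "model": "meta-llama/Meta-Llama-3-70B-Instruct",
+        "messages": [{"role": "user", "content": "Who are you?"}],
+        "stop_token_ids": [128009, 128001]
+      }' | jq -r '.choices[0].message.content'
+```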

+*(image: Gradio UI serving Llama-3)* +

+ +To stop the instance: +```console +sky stop llama3 +``` + +To shut down all resources: +```console +sky down llama3 +``` + +**Note**: If you would like to try the 8B model, you can use the following accelerators: +```yaml +resources: + accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} +``` + +## Serving Llama-3: scaling up with SkyServe + +After playing with the model, you can deploy the model with autoscaling and load-balancing using SkyServe. + +With no change to the YAML, launch a fully managed service on your infra: +```console +HF_TOKEN=xxx sky serve up llama3.yaml -n llama3 --env HF_TOKEN +``` + +Wait until the service is ready: +```console +watch -n10 sky serve status llama3 +``` + +
+Example outputs: + +```console +Services +NAME VERSION UPTIME STATUS REPLICAS ENDPOINT +llama3 1 35s READY 2/2 xx.yy.zz.100:30001 + +Service Replicas +SERVICE_NAME ID VERSION IP LAUNCHED RESOURCES STATUS REGION +llama3 1 1 xx.yy.zz.121 18 mins ago 1x GCP([Spot]{'A100-80GB': 4}) READY us-east4 +llama3 2 1 xx.yy.zz.245 18 mins ago 1x GCP([Spot]{'A100-80GB': 4}) READY us-east4 +``` +
+ + +Get a single endpoint that load-balances across replicas: +```console +ENDPOINT=$(sky serve status --endpoint llama3) +``` + +> **Tip:** SkyServe fully manages the lifecycle of your replicas. For example, if a spot replica is preempted, the controller will automatically replace it. This significantly reduces the operational burden while saving costs. + +To curl the endpoint: +```console +curl $ENDPOINT/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "meta-llama/Meta-Llama-3-70B-Instruct", + "messages": [ + { + "role": "system", + "content": "You are a helpful assistant." + }, + { + "role": "user", + "content": "Who are you?" + } + ] + }' +``` + +To shut down all resources: +```console +sky serve down llama3 +``` + +See more details in [SkyServe docs](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html). + + +### **Optional**: Connect a GUI to your Llama-3 endpoint + + + +It is also possible to access the Llama-3 service with a separate GUI frontend, so that user requests sent to the GUI are load-balanced across replicas. + +1. Start the chat web UI: +```bash +sky launch -c llama3-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint llama3) +``` + +2. Then, we can access the GUI at the returned gradio link: +``` +| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live +``` + + + +## Finetuning Llama-3 + +You can finetune Llama-3 on your own data. We have a tutorial for finetuning Llama-2 into Vicuna on SkyPilot, which can be adapted for Llama-3. You can find the tutorial [here](https://skypilot.readthedocs.io/en/latest/gallery/tutorials/finetuning.html) and a detailed blog post [here](https://blog.skypilot.co/finetuning-llama2-operational-guide/). diff --git a/llm/llama-3/gui.yaml b/llm/llama-3/gui.yaml new file mode 100644 index 00000000000..a97d9f26a62 --- /dev/null +++ b/llm/llama-3/gui.yaml @@ -0,0 +1,46 @@ +# Starts a GUI server that connects to the Llama-3 OpenAI API server. +# +# This works with the llama3.yaml, please refer to llm/llama-3/README.md +# for more details. +# +# Usage: +# +# 1. If you have an endpoint started on a cluster (sky launch): +# `sky launch -c llama3-gui ./gui.yaml --env ENDPOINT=$(sky status --endpoint 8081 llama3)` +# 2. If you have a SkyPilot Service started (sky serve up) called llama3: +# `sky launch -c llama3-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint llama3)` +# +# After the GUI server is started, you will see a gradio link in the output and +# you can click on it to open the GUI. + +envs: + MODEL_NAME: meta-llama/Meta-Llama-3-70B-Instruct + ENDPOINT: x.x.x.x:3031 # Address of the API server running llama3. + +resources: + cpus: 2 + +setup: | + conda activate llama3 + if [ $? -ne 0 ]; then + conda create -n llama3 python=3.10 -y + conda activate llama3 + fi + + # Install Gradio for web UI. + pip install gradio openai + +run: | + conda activate llama3 + export PATH=$PATH:/sbin + WORKER_IP=$(hostname -I | cut -d' ' -f1) + CONTROLLER_PORT=21001 + WORKER_PORT=21002 + + echo 'Starting gradio server...' 
+  git clone https://github.com/vllm-project/vllm.git || true
+  python vllm/examples/gradio_openai_chatbot_webserver.py \
+    -m $MODEL_NAME \
+    --port 8811 \
+    --model-url http://$ENDPOINT/v1 \
+    --stop-token-ids 128009,128001 | tee ~/gradio.log
diff --git a/llm/llama-3/llama3.yaml b/llm/llama-3/llama3.yaml
new file mode 100644
index 00000000000..ff9dc4967ac
--- /dev/null
+++ b/llm/llama-3/llama3.yaml
@@ -0,0 +1,126 @@
+# Serving Meta Llama-3 on your own infra.
+#
+# Usage:
+#
+#  HF_TOKEN=xxx sky launch llama3.yaml -c llama3 --env HF_TOKEN
+#
+# curl /v1/chat/completions:
+#
+#  ENDPOINT=$(sky status --endpoint 8081 llama3)
+#
+#  # We need to manually specify the stop_token_ids to make sure the model
+#  # finishes on <|eot_id|>.
+#  curl http://$ENDPOINT/v1/chat/completions \
+#    -H "Content-Type: application/json" \
+#    -d '{
+#      "model": "meta-llama/Meta-Llama-3-8B-Instruct",
+#      "messages": [
+#        {
+#          "role": "system",
+#          "content": "You are a helpful assistant."
+#        },
+#        {
+#          "role": "user",
+#          "content": "Who are you?"
+#        }
+#      ],
+#      "stop_token_ids": [128009, 128001]
+#    }'
+#
+# Chat with the model via the Gradio UI:
+#
+#   Running on local URL:  http://127.0.0.1:8811
+#   Running on public URL: https://.gradio.live
+#
+# Scale up with SkyServe:
+#  HF_TOKEN=xxx sky serve up llama3.yaml -n llama3 --env HF_TOKEN
+#
+# curl /v1/chat/completions:
+#
+#  ENDPOINT=$(sky serve status --endpoint llama3)
+#  curl -L $ENDPOINT/v1/models
+#  curl -L http://$ENDPOINT/v1/chat/completions \
+#    -H "Content-Type: application/json" \
+#    -d '{
+#      "model": "meta-llama/Meta-Llama-3-70B-Instruct",
+#      "messages": [
+#        {
+#          "role": "system",
+#          "content": "You are a helpful assistant."
+#        },
+#        {
+#          "role": "user",
+#          "content": "Who are you?"
+#        }
+#      ]
+#    }'
+
+
+envs:
+  MODEL_NAME: meta-llama/Meta-Llama-3-70B-Instruct
+  # MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
+  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
+
+service:
+  replicas: 2
+  # An actual request for readiness probe.
+  readiness_probe:
+    path: /v1/chat/completions
+    post_data:
+      model: $MODEL_NAME
+      messages:
+        - role: user
+          content: Hello! What is your name?
+      max_tokens: 1
+
+resources:
+  accelerators: {L4:8, A10g:8, A10:8, A100:4, A100:8, A100-80GB:2, A100-80GB:4, A100-80GB:8}
+  # accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
+  cpus: 32+
+  use_spot: True
+  disk_size: 512  # Ensure model checkpoints can fit.
+  disk_tier: best
+  ports: 8081  # Expose to internet traffic.
+
+setup: |
+  conda activate vllm
+  if [ $? -ne 0 ]; then
+    conda create -n vllm python=3.10 -y
+    conda activate vllm
+  fi
+
+  pip install vllm==0.4.2
+  # Install Gradio for web UI.
+  pip install gradio openai
+  pip install flash-attn==2.5.9.post1
+
+
+run: |
+  conda activate vllm
+  echo 'Starting vllm api server...'
+
+  # https://github.com/vllm-project/vllm/issues/3098
+  export PATH=$PATH:/sbin
+
+  # NOTE: --gpu-memory-utilization 0.95 needed for 4-GPU nodes.
+  python -u -m vllm.entrypoints.openai.api_server \
+    --port 8081 \
+    --model $MODEL_NAME \
+    --trust-remote-code --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+    --gpu-memory-utilization 0.95 \
+    --max-num-seqs 64 \
+    2>&1 | tee api_server.log &
+
+  while ! grep -q 'Uvicorn running on' api_server.log; do
+    echo 'Waiting for vllm api server to start...'
+    sleep 5
+  done
+
+  echo 'Starting gradio server...'
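+  # vLLM is cloned below only for its example Gradio chat client. The stop
+  # token ids 128009 (<|eot_id|>) and 128001 (<|end_of_text|>) are Llama-3's
+  # terminators, matching the stop_token_ids in the curl example above.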
+ git clone https://github.com/vllm-project/vllm.git || true + python vllm/examples/gradio_openai_chatbot_webserver.py \ + -m $MODEL_NAME \ + --port 8811 \ + --model-url http://localhost:8081/v1 \ + --stop-token-ids 128009,128001 + diff --git a/llm/mixtral/README.md b/llm/mixtral/README.md index 208b40ca14b..0bddb77c665 100644 --- a/llm/mixtral/README.md +++ b/llm/mixtral/README.md @@ -53,7 +53,7 @@ We can now access the model through the OpenAI API with the IP and port: ```bash IP=$(sky status --ip mixtral) -curl -L http://$IP:8000/v1/completions \ +curl http://$IP:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "mistralai/Mixtral-8x7B-Instruct-v0.1", @@ -66,7 +66,7 @@ Chat API is also supported: ```bash IP=$(sky status --ip mixtral) -curl -L http://$IP:8000/v1/chat/completions \ +curl http://$IP:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "mistralai/Mixtral-8x7B-Instruct-v0.1", @@ -119,7 +119,7 @@ After the `sky serve up` command, there will be a single endpoint for the servic ```bash ENDPOINT=$(sky serve status --endpoint mixtral) -curl -L http://$ENDPOINT/v1/completions \ +curl http://$ENDPOINT/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "mistralai/Mixtral-8x7B-Instruct-v0.1", @@ -132,7 +132,7 @@ Chat API is also supported: ```bash ENDPOINT=$(sky serve status --endpoint mixtral) -curl -L http://$ENDPOINT/v1/chat/completions \ +curl http://$ENDPOINT/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "mistralai/Mixtral-8x7B-Instruct-v0.1", diff --git a/llm/qwen/README.md b/llm/qwen/README.md index dac1852ff79..6a76af71287 100644 --- a/llm/qwen/README.md +++ b/llm/qwen/README.md @@ -1,8 +1,15 @@ -# Serving Qwen1.5 on Your Own Cloud +# Serving Qwen2 on Your Own Cloud -[Qwen1.5](https://github.com/QwenLM/Qwen1.5) is one of the top open LLMs. -As of Feb 2024, Qwen1.5-72B-Chat is ranked higher than Mixtral-8x7b-Instruct-v0.1 on the LMSYS Chatbot Arena Leaderboard. +[Qwen2](https://github.com/QwenLM/Qwen2) is one of the top open LLMs. +As of Jun 2024, Qwen1.5-110B-Chat is ranked higher than GPT-4-0613 on the [LMSYS Chatbot Arena Leaderboard](https://chat.lmsys.org/?leaderboard). +📰 **Update (26 April 2024) -** SkyPilot now also supports the [**Qwen1.5-110B**](https://qwenlm.github.io/blog/qwen1.5-110b/) model! It performs competitively with Llama-3-70B across a [series of evaluations](https://qwenlm.github.io/blog/qwen1.5-110b/#model-quality). Use [serve-110b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-110b.yaml) to serve the 110B model. + +📰 **Update (6 Jun 2024) -** SkyPilot now also supports the [**Qwen2**](https://qwenlm.github.io/blog/qwen2/) model! It further improves the competitive model, Qwen1.5. + +

+ qwen +

## References * [Qwen docs](https://qwen.readthedocs.io/en/latest/) @@ -20,19 +27,19 @@ As of Feb 2024, Qwen1.5-72B-Chat is ranked higher than Mixtral-8x7b-Instruct-v0. After [installing SkyPilot](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html), run your own Qwen model on vLLM with SkyPilot in 1-click: -1. Start serving Qwen 72B on a single instance with any available GPU in the list specified in [serve-72b.yaml](serve-72b.yaml) with a vLLM powered OpenAI-compatible endpoint (You can also switch to [serve-7b.yaml](serve-7b.yaml) for a smaller model): +1. Start serving Qwen 110B on a single instance with any available GPU in the list specified in [serve-110b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-110b.yaml) with a vLLM powered OpenAI-compatible endpoint (You can also switch to [serve-72b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-72b.yaml) or [serve-7b.yaml](https://github.com/skypilot-org/skypilot/blob/master/llm/qwen/serve-7b.yaml) for a smaller model): ```console -sky launch -c qwen serve-72b.yaml +sky launch -c qwen serve-110b.yaml ``` 2. Send a request to the endpoint for completion: ```bash IP=$(sky status --ip qwen) -curl -L http://$IP:8000/v1/completions \ +curl http://$IP:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ - "model": "Qwen/Qwen1.5-72B-Chat", + "model": "Qwen/Qwen1.5-110B-Chat", "prompt": "My favorite food is", "max_tokens": 512 }' | jq -r '.choices[0].text' @@ -40,10 +47,10 @@ curl -L http://$IP:8000/v1/completions \ 3. Send a request for chat completion: ```bash -curl -L http://$IP:8000/v1/chat/completions \ +curl http://$IP:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ - "model": "Qwen/Qwen1.5-72B-Chat", + "model": "Qwen/Qwen1.5-110B-Chat", "messages": [ { "role": "system", @@ -87,14 +94,14 @@ As shown, the service is now backed by 2 replicas, one on Azure and one on GCP, type is chosen to be **the cheapest available one** on the clouds. That said, it maximizes the availability of the service while minimizing the cost. -3. To access the model, we use a `curl -L` command (`-L` to follow redirect) to send the request to the endpoint: +3. To access the model, we use a `curl` command to send the request to the endpoint: ```bash ENDPOINT=$(sky serve status --endpoint qwen) -curl -L http://$ENDPOINT/v1/chat/completions \ +curl http://$ENDPOINT/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ - "model": "Qwen/Qwen1.5-72B-Chat", + "model": "Qwen/Qwen2-72B-Instruct", "messages": [ { "role": "system", @@ -110,13 +117,13 @@ curl -L http://$ENDPOINT/v1/chat/completions \ ``` -## **Optional:** Accessing Code Llama with Chat GUI +## **Optional:** Accessing Qwen with Chat GUI -It is also possible to access the Code Llama service with a GUI using [FastChat](https://github.com/lm-sys/FastChat). +It is also possible to access the Qwen service with a GUI using [vLLM](https://github.com/vllm-project/vllm). -1. Start the chat web UI: +1. Start the chat web UI (change the `--env` flag to the model you are running): ```bash -sky launch -c qwen-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint qwen) +sky launch -c qwen-gui ./gui.yaml --env MODEL_NAME='Qwen/Qwen2-72B-Instruct' --env ENDPOINT=$(sky serve status --endpoint qwen) ``` 2. 
Then, we can access the GUI at the returned gradio link: @@ -124,5 +131,3 @@ sky launch -c qwen-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint q | INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live ``` -Note that you may get better results to use a higher temperature and top_p value. - diff --git a/llm/qwen/gui.yaml b/llm/qwen/gui.yaml index 86e2fe27220..bf3c01abf76 100644 --- a/llm/qwen/gui.yaml +++ b/llm/qwen/gui.yaml @@ -13,7 +13,8 @@ # you can click on it to open the GUI. envs: - ENDPOINT: x.x.x.x:3031 # Address of the API server running qwen. + ENDPOINT: x.x.x.x:3031 # Address of the API server running qwen. + MODEL_NAME: Qwen/Qwen1.5-72B-Chat resources: cpus: 2 @@ -25,8 +26,8 @@ setup: | conda activate qwen fi - pip install "fschat[model_worker,webui]" - pip install "openai<1" + # Install Gradio for web UI. + pip install gradio openai run: | conda activate qwen @@ -35,19 +36,9 @@ run: | CONTROLLER_PORT=21001 WORKER_PORT=21002 - cat < ~/model_info.json - { - "Qwen/Qwen1.5-72B-Chat": { - "model_name": "Qwen/Qwen1.5-72B-Chat", - "api_base": "http://${ENDPOINT}/v1", - "api_key": "empty", - "model_path": "Qwen/Qwen1.5-72B-Chat" - } - } - EOF - - python3 -m fastchat.serve.controller --host 0.0.0.0 --port ${CONTROLLER_PORT} > ~/controller.log 2>&1 & - echo 'Starting gradio server...' - python -u -m fastchat.serve.gradio_web_server --share \ - --register-openai-compatible-models ~/model_info.json | tee ~/gradio.log + git clone https://github.com/vllm-project/vllm.git || true + python vllm/examples/gradio_openai_chatbot_webserver.py \ + -m $MODEL_NAME \ + --port 8811 \ + --model-url http://$ENDPOINT/v1 | tee ~/gradio.log diff --git a/llm/qwen/serve-110b.yaml b/llm/qwen/serve-110b.yaml new file mode 100644 index 00000000000..1e98bd254e9 --- /dev/null +++ b/llm/qwen/serve-110b.yaml @@ -0,0 +1,43 @@ +envs: + MODEL_NAME: Qwen/Qwen1.5-110B-Chat + +service: + # Specifying the path to the endpoint to check the readiness of the replicas. + readiness_probe: + path: /v1/chat/completions + post_data: + model: $MODEL_NAME + messages: + - role: user + content: Hello! What is your name? + max_tokens: 1 + initial_delay_seconds: 1200 + # How many replicas to manage. + replicas: 2 + + +resources: + accelerators: {A100:8, A100-80GB:4, A100-80GB:8} + disk_size: 1024 + disk_tier: best + memory: 32+ + ports: 8000 + +setup: | + conda activate qwen + if [ $? -ne 0 ]; then + conda create -n qwen python=3.10 -y + conda activate qwen + fi + pip install vllm==0.4.2 + pip install flash-attn==2.5.9.post1 + +run: | + conda activate qwen + export PATH=$PATH:/sbin + python -u -m vllm.entrypoints.openai.api_server \ + --host 0.0.0.0 \ + --model $MODEL_NAME \ + --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ + --max-num-seqs 16 | tee ~/openai_api_server.log + diff --git a/llm/qwen/serve-72b.yaml b/llm/qwen/serve-72b.yaml index 86248011bbf..34e3e348f2f 100644 --- a/llm/qwen/serve-72b.yaml +++ b/llm/qwen/serve-72b.yaml @@ -1,5 +1,5 @@ envs: - MODEL_NAME: Qwen/Qwen1.5-72B-Chat + MODEL_NAME: Qwen/Qwen2-72B-Instruct service: # Specifying the path to the endpoint to check the readiness of the replicas. 
@@ -29,8 +29,8 @@ setup: | conda create -n qwen python=3.10 -y conda activate qwen fi - pip install -U vllm==0.3.2 - pip install -U transformers==4.38.0 + pip install vllm==0.4.2 + pip install flash-attn==2.5.9.post1 run: | conda activate qwen diff --git a/llm/qwen/serve-7b.yaml b/llm/qwen/serve-7b.yaml index a1ec7ee3f2b..f33adcdd2cd 100644 --- a/llm/qwen/serve-7b.yaml +++ b/llm/qwen/serve-7b.yaml @@ -1,5 +1,5 @@ envs: - MODEL_NAME: Qwen/Qwen1.5-7B-Chat + MODEL_NAME: Qwen/Qwen2-7B-Instruct service: # Specifying the path to the endpoint to check the readiness of the replicas. @@ -27,8 +27,8 @@ setup: | conda create -n qwen python=3.10 -y conda activate qwen fi - pip install -U vllm==0.3.2 - pip install -U transformers==4.38.0 + pip install vllm==0.4.2 + pip install flash-attn==2.5.9.post1 run: | conda activate qwen diff --git a/llm/sglang/README.md b/llm/sglang/README.md index 3ffcc2f484b..fc79529148a 100644 --- a/llm/sglang/README.md +++ b/llm/sglang/README.md @@ -68,7 +68,7 @@ ENDPOINT=$(sky serve status --endpoint sglang-llava) ```bash -curl -L $ENDPOINT/v1/chat/completions \ +curl $ENDPOINT/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "liuhaotian/llava-v1.6-vicuna-7b", @@ -149,7 +149,7 @@ ENDPOINT=$(sky serve status --endpoint sglang-llama2) 4. Once it status is `READY`, you can use the endpoint to interact with the model: ```bash -curl -L $ENDPOINT/v1/chat/completions \ +curl $ENDPOINT/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "meta-llama/Llama-2-7b-chat-hf", diff --git a/llm/sglang/llama2.yaml b/llm/sglang/llama2.yaml index 08427ab2001..8b58c4365d6 100644 --- a/llm/sglang/llama2.yaml +++ b/llm/sglang/llama2.yaml @@ -6,7 +6,7 @@ service: envs: MODEL_NAME: meta-llama/Llama-2-7b-chat-hf - HF_TOKEN: # Change to your own huggingface token + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. resources: accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1} diff --git a/llm/tgi/README.md b/llm/tgi/README.md index 8fb68222d68..8c8360d0465 100644 --- a/llm/tgi/README.md +++ b/llm/tgi/README.md @@ -17,7 +17,7 @@ A user can access the model with the following command: ```bash ENDPOINT=$(sky status --endpoint 8080 tgi) -curl -L $(sky serve status tgi --endpoint)/generate \ +curl $(sky serve status tgi --endpoint)/generate \ -H 'Content-Type: application/json' \ -d '{ "inputs": "What is Deep Learning?", @@ -51,7 +51,7 @@ After the service is launched, we can access the model with the following comman ```bash ENDPOINT=$(sky serve status --endpoint tgi) -curl -L $ENDPOINT/generate \ +curl $ENDPOINT/generate \ -H 'Content-Type: application/json' \ -d '{ "inputs": "What is Deep Learning?", diff --git a/llm/vicuna-llama-2/README.md b/llm/vicuna-llama-2/README.md index 0fc5da6c4ba..899792c299d 100644 --- a/llm/vicuna-llama-2/README.md +++ b/llm/vicuna-llama-2/README.md @@ -31,7 +31,7 @@ cd skypilot/llm/vicuna-llama-2 Paste the access token into [train.yaml](https://github.com/skypilot-org/skypilot/tree/master/llm/vicuna-llama-2/train.yaml): ```yaml envs: - HF_TOKEN: # Change to your own huggingface token + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. ``` ## Train your own Vicuna on Llama-2 diff --git a/llm/vicuna-llama-2/serve.yaml b/llm/vicuna-llama-2/serve.yaml index 0a98dab5d26..69f89f2fc28 100644 --- a/llm/vicuna-llama-2/serve.yaml +++ b/llm/vicuna-llama-2/serve.yaml @@ -27,11 +27,12 @@ run: | conda activate chatbot echo 'Starting controller...' 
- python -u -m fastchat.serve.controller > ~/controller.log 2>&1 & + python -u -m fastchat.serve.controller --host 127.0.0.1 > ~/controller.log 2>&1 & sleep 10 echo 'Starting model worker...' python -u -m fastchat.serve.model_worker \ --model-path /skypilot-vicuna 2>&1 \ + --host 127.0.0.1 \ | tee model_worker.log & echo 'Waiting for model worker to start...' diff --git a/llm/vicuna-llama-2/train.yaml b/llm/vicuna-llama-2/train.yaml index e23d5797e76..8d35c2dff85 100644 --- a/llm/vicuna-llama-2/train.yaml +++ b/llm/vicuna-llama-2/train.yaml @@ -1,7 +1,7 @@ envs: - HF_TOKEN: # Change to your own huggingface token - ARTIFACT_BUCKET_NAME: YOUR_OWN_BUCKET_NAME # Change to your own bucket name - WANDB_API_KEY: "" # Change to your own wandb api key + HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass. + ARTIFACT_BUCKET_NAME: # TODO: Fill with your unique bucket name, or use --env to pass. + WANDB_API_KEY: # TODO: Fill with your own WANDB_API_KEY, or use --env to pass. MODEL_SIZE: 7 USE_XFORMERS: 1 diff --git a/llm/vicuna/serve-openai-api-endpoint.yaml b/llm/vicuna/serve-openai-api-endpoint.yaml index 247043ee3c2..639dfadc6d6 100644 --- a/llm/vicuna/serve-openai-api-endpoint.yaml +++ b/llm/vicuna/serve-openai-api-endpoint.yaml @@ -19,11 +19,12 @@ run: | conda activate chatbot echo 'Starting controller...' - python -u -m fastchat.serve.controller > ~/controller.log 2>&1 & + python -u -m fastchat.serve.controller --host 127.0.0.1 > ~/controller.log 2>&1 & sleep 10 echo 'Starting model worker...' python -u -m fastchat.serve.model_worker \ --model-path lmsys/vicuna-${MODEL_SIZE}b-v1.3 2>&1 \ + --host 127.0.0.1 \ | tee model_worker.log & echo 'Waiting for model worker to start...' diff --git a/llm/vicuna/serve.yaml b/llm/vicuna/serve.yaml index d458112a42f..49185fcea20 100644 --- a/llm/vicuna/serve.yaml +++ b/llm/vicuna/serve.yaml @@ -19,11 +19,12 @@ run: | conda activate chatbot echo 'Starting controller...' - python -u -m fastchat.serve.controller > ~/controller.log 2>&1 & + python -u -m fastchat.serve.controller --host 127.0.0.1 > ~/controller.log 2>&1 & sleep 10 echo 'Starting model worker...' python -u -m fastchat.serve.model_worker \ --model-path lmsys/vicuna-${MODEL_SIZE}b-v1.3 2>&1 \ + --host 127.0.0.1 \ | tee model_worker.log & echo 'Waiting for model worker to start...' diff --git a/llm/vicuna/train.yaml b/llm/vicuna/train.yaml index c577561e858..a2121aaf8fd 100644 --- a/llm/vicuna/train.yaml +++ b/llm/vicuna/train.yaml @@ -1,3 +1,10 @@ +envs: + MODEL_SIZE: 7 + SEQ_LEN: 2048 + GC_SCALE: 4 + USE_FLASH_ATTN: 0 + WANDB_API_KEY: # TODO: Fill with your own WANDB_API_KEY, or use --env to pass. + resources: accelerators: A100-80GB:8 disk_size: 1000 @@ -109,10 +116,3 @@ run: | gsutil -m rsync -r -x 'checkpoint-*' $LOCAL_CKPT_PATH/ $CKPT_PATH/ exit $returncode - -envs: - MODEL_SIZE: 7 - SEQ_LEN: 2048 - GC_SCALE: 4 - USE_FLASH_ATTN: 0 - WANDB_API_KEY: "" diff --git a/llm/vllm/README.md b/llm/vllm/README.md index 568b8ff70bd..61932cd8571 100644 --- a/llm/vllm/README.md +++ b/llm/vllm/README.md @@ -154,7 +154,7 @@ ENDPOINT=$(sky serve status --endpoint vllm-llama2) 4. Once it status is `READY`, you can use the endpoint to interact with the model: ```bash -curl -L $ENDPOINT/v1/chat/completions \ +curl $ENDPOINT/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "meta-llama/Llama-2-7b-chat-hf", @@ -171,7 +171,7 @@ curl -L $ENDPOINT/v1/chat/completions \ }' ``` -Notice that it is the same with previously curl command, except for thr `-L` argument. 
You should get a similar response as the following:
+Notice that it is the same as the previous curl command. You should get a similar response as the following:

```console
{
diff --git a/llm/vllm/serve-openai-api.yaml b/llm/vllm/serve-openai-api.yaml
index 9ddf7b280ba..a68f476edc7 100644
--- a/llm/vllm/serve-openai-api.yaml
+++ b/llm/vllm/serve-openai-api.yaml
@@ -1,6 +1,6 @@
envs:
  MODEL_NAME: meta-llama/Llama-2-7b-chat-hf
-  HF_TOKEN: # Change to your own huggingface token
+  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

resources:
  accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1}
diff --git a/llm/vllm/service-with-auth.yaml b/llm/vllm/service-with-auth.yaml
new file mode 100644
index 00000000000..0a40df29293
--- /dev/null
+++ b/llm/vllm/service-with-auth.yaml
@@ -0,0 +1,42 @@
+# service.yaml
+# The newly-added `service` section to the `serve-openai-api.yaml` file.
+service:
+  # Specifying the path to the endpoint to check the readiness of the service.
+  readiness_probe:
+    path: /v1/models
+    # Set authorization headers here if needed.
+    headers:
+      Authorization: Bearer $AUTH_TOKEN
+  # How many replicas to manage.
+  replicas: 1
+
+# Fields below are the same as in `serve-openai-api.yaml`.
+envs:
+  MODEL_NAME: meta-llama/Llama-2-7b-chat-hf
+  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.
+  AUTH_TOKEN: # TODO: Fill with your own auth token (a random string), or use --env to pass.
+
+resources:
+  accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1}
+  ports: 8000
+
+setup: |
+  conda activate vllm
+  if [ $? -ne 0 ]; then
+    conda create -n vllm python=3.10 -y
+    conda activate vllm
+  fi
+
+  pip install transformers==4.38.0
+  pip install vllm==0.3.2
+
+  python -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"
+
+
+run: |
+  conda activate vllm
+  echo 'Starting vllm openai api server...'
+  python -m vllm.entrypoints.openai.api_server \
+    --model $MODEL_NAME --tokenizer hf-internal-testing/llama-tokenizer \
+    --host 0.0.0.0 --port 8000 --api-key $AUTH_TOKEN
+
diff --git a/llm/vllm/service.yaml b/llm/vllm/service.yaml
index 335f8a50650..1e5d92a60e5 100644
--- a/llm/vllm/service.yaml
+++ b/llm/vllm/service.yaml
@@ -9,7 +9,7 @@ service:
# Fields below are the same with `serve-openai-api.yaml`.
envs:
  MODEL_NAME: meta-llama/Llama-2-7b-chat-hf
-  HF_TOKEN: # Change to your own huggingface token
+  HF_TOKEN: # TODO: Fill with your own huggingface token, or use --env to pass.

resources:
  accelerators: {L4:1, A10G:1, A10:1, A100:1, A100-80GB:1}
diff --git a/sky/__init__.py b/sky/__init__.py
index d25c8297ea5..a077fb8966a 100644
--- a/sky/__init__.py
+++ b/sky/__init__.py
@@ -102,16 +102,16 @@ def set_proxy_env_var(proxy_var: str, urllib_var: Optional[str]):
from sky.data import StoreType
from sky.execution import exec  # pylint: disable=redefined-builtin
from sky.execution import launch
+# TODO (zhwu): These imports are for backward compatibility, and spot APIs
+# should be called with `sky.jobs.xxx` instead. Remove in release 0.8.0
+from sky.jobs.core import spot_cancel
+from sky.jobs.core import spot_launch
+from sky.jobs.core import spot_queue
+from sky.jobs.core import spot_tail_logs
from sky.optimizer import Optimizer
from sky.optimizer import OptimizeTarget
from sky.resources import Resources
from sky.skylet.job_lib import JobStatus
-# TODO (zhwu): These imports are for backward compatibility, and spot APIs
-# should be called with `sky.spot.xxx` instead.
Remove in release 0.7.0 -from sky.spot.core import spot_cancel -from sky.spot.core import spot_launch -from sky.spot.core import spot_queue -from sky.spot.core import spot_tail_logs from sky.status_lib import ClusterStatus from sky.task import Task diff --git a/sky/adaptors/cloudflare.py b/sky/adaptors/cloudflare.py index 2a49dc6fff0..864248614f3 100644 --- a/sky/adaptors/cloudflare.py +++ b/sky/adaptors/cloudflare.py @@ -23,6 +23,7 @@ R2_PROFILE_NAME = 'r2' _INDENT_PREFIX = ' ' NAME = 'Cloudflare' +SKY_CHECK_NAME = 'Cloudflare (for R2 object store)' @contextlib.contextmanager diff --git a/sky/adaptors/gcp.py b/sky/adaptors/gcp.py index 6465709d42c..9f63bec87ee 100644 --- a/sky/adaptors/gcp.py +++ b/sky/adaptors/gcp.py @@ -21,8 +21,9 @@ def build(service_name: str, version: str, *args, **kwargs): service_name: GCP service name (e.g., 'compute', 'storagetransfer'). version: Service version (e.g., 'v1'). """ - from googleapiclient import discovery - return discovery.build(service_name, version, *args, **kwargs) + + return googleapiclient.discovery.build(service_name, version, *args, + **kwargs) @common.load_lazy_modules(_LAZY_MODULES) diff --git a/sky/adaptors/kubernetes.py b/sky/adaptors/kubernetes.py index c6b4e8bce0b..7cdb3ff3059 100644 --- a/sky/adaptors/kubernetes.py +++ b/sky/adaptors/kubernetes.py @@ -21,6 +21,8 @@ _networking_api = None _custom_objects_api = None _node_api = None +_apps_api = None +_api_client = None # Timeout to use for API calls API_TIMEOUT = 5 @@ -54,9 +56,9 @@ def _load_config(): ' If you were running a local Kubernetes ' 'cluster, run `sky local up` to start the cluster.') else: - err_str = ( - 'Failed to load Kubernetes configuration. ' - f'Please check if your kubeconfig file is valid.{suffix}') + err_str = ('Failed to load Kubernetes configuration. ' + 'Please check if your kubeconfig file exists at ' + f'~/.kube/config and is valid.{suffix}') err_str += '\nTo disable Kubernetes for SkyPilot: run `sky check`.' with ux_utils.print_exception_no_traceback(): raise ValueError(err_str) from None @@ -108,6 +110,24 @@ def node_api(): return _node_api +def apps_api(): + global _apps_api + if _apps_api is None: + _load_config() + _apps_api = kubernetes.client.AppsV1Api() + + return _apps_api + + +def api_client(): + global _api_client + if _api_client is None: + _load_config() + _api_client = kubernetes.client.ApiClient() + + return _api_client + + def api_exception(): return kubernetes.client.rest.ApiException diff --git a/sky/authentication.py b/sky/authentication.py index 7b5699d3337..966dad670c5 100644 --- a/sky/authentication.py +++ b/sky/authentication.py @@ -15,7 +15,7 @@ The local machine's public key should not be uploaded to the `~/.ssh/sky-key.pub` on the remote VM, because it will cause private/public key pair mismatch when the user tries to launch new VM from that remote VM -using SkyPilot, e.g., the node is used as a spot controller. (Lambda cloud +using SkyPilot, e.g., the node is used as a jobs controller. (Lambda cloud is an exception, due to the limitation of the cloud provider. See the comments in setup_lambda_authentication) """ @@ -408,7 +408,7 @@ def setup_kubernetes_authentication(config: Dict[str, Any]) -> Dict[str, Any]: # Add the user's public key to the SkyPilot cluster. 
public_key_path = os.path.expanduser(PUBLIC_SSH_KEY_PATH) secret_name = clouds.Kubernetes.SKY_SSH_KEY_SECRET_NAME - secret_field_name = clouds.Kubernetes.SKY_SSH_KEY_SECRET_FIELD_NAME + secret_field_name = clouds.Kubernetes().ssh_key_secret_field_name namespace = kubernetes_utils.get_current_kube_config_context_namespace() k8s = kubernetes.kubernetes with open(public_key_path, 'r', encoding='utf-8') as f: diff --git a/sky/backends/backend_utils.py b/sky/backends/backend_utils.py index 0da2bd9ef0b..03f644930f4 100644 --- a/sky/backends/backend_utils.py +++ b/sky/backends/backend_utils.py @@ -1,6 +1,7 @@ """Util constants/functions for the backends.""" from datetime import datetime import enum +import fnmatch import functools import os import pathlib @@ -47,7 +48,9 @@ from sky.utils import common_utils from sky.utils import controller_utils from sky.utils import env_options +from sky.utils import resources_utils from sky.utils import rich_utils +from sky.utils import schemas from sky.utils import subprocess_utils from sky.utils import timeline from sky.utils import ux_utils @@ -108,6 +111,9 @@ # Remote dir that holds our runtime files. _REMOTE_RUNTIME_FILES_DIR = '~/.sky/.runtime_files' +_ENDPOINTS_RETRY_MESSAGE = ('If the cluster was recently started, ' + 'please retry after a while.') + # Include the fields that will be used for generating tags that distinguishes # the cluster in ray, to avoid the stopped cluster being discarded due to # updates in the yaml template. @@ -327,7 +333,7 @@ def wrap_file_mount(cls, path: str) -> str: def make_safe_symlink_command(cls, *, source: str, target: str) -> str: """Returns a command that safely symlinks 'source' to 'target'. - All intermediate directories of 'source' will be owned by $USER, + All intermediate directories of 'source' will be owned by $(whoami), excluding the root directory (/). 'source' must be an absolute path; both 'source' and 'target' must not @@ -354,17 +360,17 @@ def make_safe_symlink_command(cls, *, source: str, target: str) -> str: target) # Below, use sudo in case the symlink needs sudo access to create. # Prepare to create the symlink: - # 1. make sure its dir(s) exist & are owned by $USER. + # 1. make sure its dir(s) exist & are owned by $(whoami). dir_of_symlink = os.path.dirname(source) commands = [ # mkdir, then loop over '/a/b/c' as /a, /a/b, /a/b/c. For each, - # chown $USER on it so user can use these intermediate dirs + # chown $(whoami) on it so user can use these intermediate dirs # (excluding /). f'sudo mkdir -p {dir_of_symlink}', # p: path so far ('(p=""; ' f'for w in $(echo {dir_of_symlink} | tr "/" " "); do ' - 'p=${p}/${w}; sudo chown $USER $p; done)') + 'p=${p}/${w}; sudo chown $(whoami) $p; done)') ] # 2. remove any existing symlink (ln -f may throw 'cannot # overwrite directory', if the link exists and points to a @@ -380,7 +386,7 @@ def make_safe_symlink_command(cls, *, source: str, target: str) -> str: # Link. f'sudo ln -s {target} {source}', # chown. -h to affect symlinks only. 
- f'sudo chown -h $USER {source}', + f'sudo chown -h $(whoami) {source}', ] return ' && '.join(commands) @@ -797,8 +803,14 @@ def write_cluster_config( assert cluster_name is not None excluded_clouds = [] remote_identity = skypilot_config.get_nested( - (str(cloud).lower(), 'remote_identity'), 'LOCAL_CREDENTIALS') - if remote_identity == 'SERVICE_ACCOUNT': + (str(cloud).lower(), 'remote_identity'), + schemas.get_default_remote_identity(str(cloud).lower())) + if remote_identity is not None and not isinstance(remote_identity, str): + for profile in remote_identity: + if fnmatch.fnmatchcase(cluster_name, list(profile.keys())[0]): + remote_identity = list(profile.values())[0] + break + if remote_identity != schemas.RemoteIdentityOptions.LOCAL_CREDENTIALS.value: if not cloud.supports_service_account_on_remote(): raise exceptions.InvalidCloudConfigs( 'remote_identity: SERVICE_ACCOUNT is specified in ' @@ -840,13 +852,20 @@ def write_cluster_config( ssh_proxy_command = ssh_proxy_command_config[region_name] logger.debug(f'Using ssh_proxy_command: {ssh_proxy_command!r}') - # User-supplied instance tags. - instance_tags = {} - instance_tags = skypilot_config.get_nested( - (str(cloud).lower(), 'instance_tags'), {}) - # instance_tags is a dict, which is guaranteed by the type check in + # User-supplied global instance tags from ~/.sky/config.yaml. + labels = skypilot_config.get_nested((str(cloud).lower(), 'labels'), {}) + # Deprecated: instance_tags have been replaced by labels. For backward + # compatibility, we support them and the schema allows them only if + # `labels` are not specified. This should be removed after 0.7.0. + labels = skypilot_config.get_nested((str(cloud).lower(), 'instance_tags'), + labels) + # labels is a dict, which is guaranteed by the type check in # schemas.py - assert isinstance(instance_tags, dict), instance_tags + assert isinstance(labels, dict), labels + + # Get labels from resources and override from the labels to_provision. + if to_provision.labels: + labels.update(to_provision.labels) # Dump the Ray ports to a file for Ray job submission dump_port_command = ( @@ -879,8 +898,10 @@ def write_cluster_config( 'vpc_name': skypilot_config.get_nested( (str(cloud).lower(), 'vpc_name'), None), - # User-supplied instance tags. - 'instance_tags': instance_tags, + # User-supplied labels. + 'labels': labels, + # User-supplied remote_identity + 'remote_identity': remote_identity, # The reservation pools that specified by the user. This is # currently only used by GCP. 'specific_reservations': specific_reservations, @@ -904,7 +925,14 @@ def write_cluster_config( 'dump_port_command': dump_port_command, # Sky-internal constants. 'sky_ray_cmd': constants.SKY_RAY_CMD, - 'sky_pip_cmd': constants.SKY_PIP_CMD, + # pip install needs to have python env activated to make sure + # installed packages are within the env path. + 'sky_pip_cmd': f'{constants.SKY_PIP_CMD}', + # Activate the SkyPilot runtime environment when starting ray + # cluster, so that ray autoscaler can access cloud SDK and CLIs + # on remote + 'sky_activate_python_env': + constants.ACTIVATE_SKY_REMOTE_PYTHON_ENV, 'ray_version': constants.SKY_REMOTE_RAY_VERSION, # Command for waiting ray cluster to be ready on head. 
'ray_head_wait_initialized_command': @@ -951,7 +979,11 @@ def write_cluster_config( with open(tmp_yaml_path, 'w', encoding='utf-8') as f: f.write(restored_yaml_content) - config_dict['cluster_name_on_cloud'] = cluster_name_on_cloud + # Read the cluster name from the tmp yaml file, to take the backward + # compatbility restortion above into account. + # TODO: remove this after 2 minor releases, 0.8.0. + yaml_config = common_utils.read_yaml(tmp_yaml_path) + config_dict['cluster_name_on_cloud'] = yaml_config['cluster_name'] # Optimization: copy the contents of source files in file_mounts to a # special dir, and upload that as the only file_mount instead. Delay @@ -1059,7 +1091,7 @@ def get_ready_nodes_counts(pattern, output): def get_docker_user(ip: str, cluster_config_file: str) -> str: """Find docker container username.""" ssh_credentials = ssh_credential_from_yaml(cluster_config_file) - runner = command_runner.SSHCommandRunner(ip, port=22, **ssh_credentials) + runner = command_runner.SSHCommandRunner(node=(ip, 22), **ssh_credentials) container_name = constants.DEFAULT_DOCKER_CONTAINER_NAME whoami_returncode, whoami_stdout, whoami_stderr = runner.run( f'sudo docker exec {container_name} whoami', @@ -1092,7 +1124,7 @@ def wait_until_ray_cluster_ready( try: head_ip = _query_head_ip_with_retries( cluster_config_file, max_attempts=WAIT_HEAD_NODE_IP_MAX_ATTEMPTS) - except exceptions.FetchIPError as e: + except exceptions.FetchClusterInfoError as e: logger.error(common_utils.format_exception(e)) return False, None # failed @@ -1108,8 +1140,7 @@ def wait_until_ray_cluster_ready( ssh_credentials = ssh_credential_from_yaml(cluster_config_file, docker_user) last_nodes_so_far = 0 start = time.time() - runner = command_runner.SSHCommandRunner(head_ip, - port=22, + runner = command_runner.SSHCommandRunner(node=(head_ip, 22), **ssh_credentials) with rich_utils.safe_status( '[bold cyan]Waiting for workers...') as worker_status: @@ -1215,7 +1246,7 @@ def ssh_credential_from_yaml( def parallel_data_transfer_to_nodes( - runners: List[command_runner.SSHCommandRunner], + runners: List[command_runner.CommandRunner], source: Optional[str], target: str, cmd: Optional[str], @@ -1225,32 +1256,36 @@ def parallel_data_transfer_to_nodes( # Advanced options. log_path: str = os.devnull, stream_logs: bool = False, + source_bashrc: bool = False, ): """Runs a command on all nodes and optionally runs rsync from src->dst. Args: - runners: A list of SSHCommandRunner objects that represent multiple nodes. + runners: A list of CommandRunner objects that represent multiple nodes. source: Optional[str]; Source for rsync on local node target: str; Destination on remote node for rsync cmd: str; Command to be executed on all nodes action_message: str; Message to be printed while the command runs log_path: str; Path to the log file stream_logs: bool; Whether to stream logs to stdout + source_bashrc: bool; Source bashrc before running the command. """ fore = colorama.Fore style = colorama.Style origin_source = source - def _sync_node(runner: 'command_runner.SSHCommandRunner') -> None: + def _sync_node(runner: 'command_runner.CommandRunner') -> None: if cmd is not None: rc, stdout, stderr = runner.run(cmd, log_path=log_path, stream_logs=stream_logs, - require_outputs=True) + require_outputs=True, + source_bashrc=source_bashrc) err_msg = ('Failed to run command before rsync ' f'{origin_source} -> {target}. ' - 'Ensure that the network is stable, then retry.') + 'Ensure that the network is stable, then retry. 
' + f'{cmd}') if log_path != os.devnull: err_msg += f' See logs in {log_path}' subprocess_utils.handle_returncode(rc, @@ -1315,7 +1350,7 @@ def _query_head_ip_with_retries(cluster_yaml: str, """Returns the IP of the head node by querying the cloud. Raises: - exceptions.FetchIPError: if we failed to get the head IP. + exceptions.FetchClusterInfoError: if we failed to get the head IP. """ backoff = common_utils.Backoff(initial_backoff=5, max_backoff_factor=5) for i in range(max_attempts): @@ -1344,8 +1379,8 @@ def _query_head_ip_with_retries(cluster_yaml: str, break except subprocess.CalledProcessError as e: if i == max_attempts - 1: - raise exceptions.FetchIPError( - reason=exceptions.FetchIPError.Reason.HEAD) from e + raise exceptions.FetchClusterInfoError( + reason=exceptions.FetchClusterInfoError.Reason.HEAD) from e # Retry if the cluster is not up yet. logger.debug('Retrying to get head ip.') time.sleep(backoff.current_backoff()) @@ -1370,7 +1405,7 @@ def get_node_ips(cluster_yaml: str, IPs. Raises: - exceptions.FetchIPError: if we failed to get the IPs. e.reason is + exceptions.FetchClusterInfoError: if we failed to get the IPs. e.reason is HEAD or WORKER. """ ray_config = common_utils.read_yaml(cluster_yaml) @@ -1391,11 +1426,12 @@ def get_node_ips(cluster_yaml: str, 'Failed to get cluster info for ' f'{ray_config["cluster_name"]} from the new provisioner ' f'with {common_utils.format_exception(e)}.') - raise exceptions.FetchIPError( - exceptions.FetchIPError.Reason.HEAD) from e + raise exceptions.FetchClusterInfoError( + exceptions.FetchClusterInfoError.Reason.HEAD) from e if len(metadata.instances) < expected_num_nodes: # Simulate the exception when Ray head node is not up. - raise exceptions.FetchIPError(exceptions.FetchIPError.Reason.HEAD) + raise exceptions.FetchClusterInfoError( + exceptions.FetchClusterInfoError.Reason.HEAD) return metadata.get_feasible_ips(get_internal_ips) if get_internal_ips: @@ -1425,8 +1461,8 @@ def get_node_ips(cluster_yaml: str, break except subprocess.CalledProcessError as e: if retry_cnt == worker_ip_max_attempts - 1: - raise exceptions.FetchIPError( - exceptions.FetchIPError.Reason.WORKER) from e + raise exceptions.FetchClusterInfoError( + exceptions.FetchClusterInfoError.Reason.WORKER) from e # Retry if the ssh is not ready for the workers yet. backoff_time = backoff.current_backoff() logger.debug('Retrying to get worker ip ' @@ -1451,8 +1487,8 @@ def get_node_ips(cluster_yaml: str, f'detected IP(s): {worker_ips[-n:]}.') worker_ips = worker_ips[-n:] else: - raise exceptions.FetchIPError( - exceptions.FetchIPError.Reason.WORKER) + raise exceptions.FetchClusterInfoError( + exceptions.FetchClusterInfoError.Reason.WORKER) else: worker_ips = [] return head_ip_list + worker_ips @@ -1516,7 +1552,7 @@ def check_owner_identity(cluster_name: str) -> None: for i, (owner, current) in enumerate(zip(owner_identity, current_user_identity)): - # Clean up the owner identiy for the backslash and newlines, caused + # Clean up the owner identity for the backslash and newlines, caused # by the cloud CLI output, e.g. gcloud. owner = owner.replace('\n', '').replace('\\', '') if owner == current: @@ -1739,14 +1775,11 @@ def _update_cluster_status_no_lock( def run_ray_status_to_check_ray_cluster_healthy() -> bool: try: - # TODO(zhwu): This function cannot distinguish transient network - # error in ray's get IPs vs. ray runtime failing. - # NOTE: fetching the IPs is very slow as it calls into # `ray get head-ip/worker-ips`. 
Using cached IPs is safe because # in the worst case we time out in the `ray status` SSH command # below. - external_ips = handle.cached_external_ips + runners = handle.get_command_runners(force_cached=True) # This happens when user interrupt the `sky launch` process before # the first time resources handle is written back to local database. # This is helpful when user interrupt after the provision is done @@ -1754,27 +1787,13 @@ def run_ray_status_to_check_ray_cluster_healthy() -> bool: # helps keep the cluster status to INIT after `sky status -r`, so # user will be notified that any auto stop/down might not be # triggered. - if external_ips is None or len(external_ips) == 0: + if not runners: logger.debug(f'Refreshing status ({cluster_name!r}): No cached ' f'IPs found. Handle: {handle}') - raise exceptions.FetchIPError( - reason=exceptions.FetchIPError.Reason.HEAD) - - # Potentially refresh the external SSH ports, in case the existing - # cluster before #2491 was launched without external SSH ports - # cached. - external_ssh_ports = handle.external_ssh_ports() - head_ssh_port = external_ssh_ports[0] - - # Check if ray cluster status is healthy. - ssh_credentials = ssh_credential_from_yaml(handle.cluster_yaml, - handle.docker_user, - handle.ssh_user) - - runner = command_runner.SSHCommandRunner(external_ips[0], - **ssh_credentials, - port=head_ssh_port) - rc, output, stderr = runner.run( + raise exceptions.FetchClusterInfoError( + reason=exceptions.FetchClusterInfoError.Reason.HEAD) + head_runner = runners[0] + rc, output, stderr = head_runner.run( instance_setup.RAY_STATUS_WITH_SKY_RAY_PORT_COMMAND, stream_logs=False, require_outputs=True, @@ -1794,7 +1813,7 @@ def run_ray_status_to_check_ray_cluster_healthy() -> bool: f'Refreshing status ({cluster_name!r}): ray status not showing ' f'all nodes ({ready_head + ready_workers}/' f'{total_nodes}); output: {output}; stderr: {stderr}') - except exceptions.FetchIPError: + except exceptions.FetchClusterInfoError: logger.debug( f'Refreshing status ({cluster_name!r}) failed to get IPs.') except RuntimeError as e: @@ -2232,18 +2251,18 @@ def check_cluster_available( # TODO(tian): Refactor to controller_utils. Current blocker: circular import. def is_controller_accessible( - controller_type: controller_utils.Controllers, + controller: controller_utils.Controllers, stopped_message: str, non_existent_message: Optional[str] = None, exit_if_not_accessible: bool = False, ) -> 'backends.CloudVmRayResourceHandle': - """Check if the spot/serve controller is up. + """Check if the jobs/serve controller is up. The controller is accessible when it is in UP or INIT state, and the ssh connection is successful. It can be used to check if the controller is accessible (since the autostop - is set for the controller) before the spot/serve commands interact with the + is set for the controller) before the jobs/serve commands interact with the controller. ClusterNotUpError will be raised whenever the controller cannot be accessed. @@ -2267,10 +2286,8 @@ def is_controller_accessible( failed to be connected. 
""" if non_existent_message is None: - non_existent_message = ( - controller_type.value.default_hint_if_non_existent) - cluster_name = controller_type.value.cluster_name - controller_name = controller_type.value.name.replace(' controller', '') + non_existent_message = controller.value.default_hint_if_non_existent + cluster_name = controller.value.cluster_name need_connection_check = False controller_status, handle = None, None try: @@ -2292,7 +2309,7 @@ def is_controller_accessible( # will not start the controller manually from the cloud console. # # The acquire_lock_timeout is set to 0 to avoid hanging the command when - # multiple spot.launch commands are running at the same time. Our later + # multiple jobs.launch commands are running at the same time. Our later # code will check if the controller is accessible by directly checking # the ssh connection to the controller, if it fails to get accurate # status of the controller. @@ -2304,6 +2321,7 @@ def is_controller_accessible( # We do not catch the exceptions related to the cluster owner identity # mismatch, please refer to the comment in # `backend_utils.check_cluster_available`. + controller_name = controller.value.name.replace(' controller', '') logger.warning( 'Failed to get the status of the controller. It is not ' f'fatal, but {controller_name} commands/calls may hang or return ' @@ -2329,18 +2347,18 @@ def is_controller_accessible( elif (controller_status == status_lib.ClusterStatus.INIT or need_connection_check): # Check ssh connection if (1) controller is in INIT state, or (2) we failed to fetch the - # status, both of which can happen when controller's status lock is held by another `sky spot launch` or + # status, both of which can happen when controller's status lock is held by another `sky jobs launch` or # `sky serve up`. If we have controller's head_ip available and it is ssh-reachable, # we can allow access to the controller. ssh_credentials = ssh_credential_from_yaml(handle.cluster_yaml, handle.docker_user, handle.ssh_user) - runner = command_runner.SSHCommandRunner(handle.head_ip, - **ssh_credentials, - port=handle.head_ssh_port) + runner = command_runner.SSHCommandRunner(node=(handle.head_ip, + handle.head_ssh_port), + **ssh_credentials) if not runner.check_connection(): - error_msg = controller_type.value.connection_error_hint + error_msg = controller.value.connection_error_hint else: assert controller_status == status_lib.ClusterStatus.UP, handle @@ -2379,7 +2397,7 @@ def get_clusters( of the clusters. Args: - include_controller: Whether to include controllers, e.g. spot controller + include_controller: Whether to include controllers, e.g. jobs controller or sky serve controller. refresh: Whether to refresh the status of the clusters. (Refreshing will set the status to STOPPED if the cluster cannot be pinged.) @@ -2539,8 +2557,8 @@ def get_task_demands_dict(task: 'task_lib.Task') -> Dict[str, float]: optionally accelerator demands. """ # TODO: Custom CPU and other memory resources are not supported yet. - # For sky spot/serve controller task, we set the CPU resource to a smaller - # value to support a larger number of spot jobs and services. + # For sky jobs/serve controller task, we set the CPU resource to a smaller + # value to support a larger number of managed jobs and services. 
resources_dict = { 'CPU': (constants.CONTROLLER_PROCESS_CPU_DEMAND if task.is_controller_task() else DEFAULT_TASK_CPU_DEMAND) @@ -2557,41 +2575,58 @@ def get_task_demands_dict(task: 'task_lib.Task') -> Dict[str, float]: return resources_dict -def get_task_resources_str(task: 'task_lib.Task') -> str: +def get_task_resources_str(task: 'task_lib.Task', + is_managed_job: bool = False) -> str: """Returns the resources string of the task. The resources string is only used as a display purpose, so we only show the accelerator demands (if any). Otherwise, the CPU demand is shown. """ - task_cpu_demand = (constants.CONTROLLER_PROCESS_CPU_DEMAND if - task.is_controller_task() else DEFAULT_TASK_CPU_DEMAND) + spot_str = '' + task_cpu_demand = (str(constants.CONTROLLER_PROCESS_CPU_DEMAND) + if task.is_controller_task() else + str(DEFAULT_TASK_CPU_DEMAND)) if task.best_resources is not None: accelerator_dict = task.best_resources.accelerators + if is_managed_job: + if task.best_resources.use_spot: + spot_str = '[Spot]' + task_cpu_demand = task.best_resources.cpus if accelerator_dict is None: resources_str = f'CPU:{task_cpu_demand}' else: resources_str = ', '.join( f'{k}:{v}' for k, v in accelerator_dict.items()) - elif len(task.resources) == 1: - resources_dict = list(task.resources)[0].accelerators - if resources_dict is None: - resources_str = f'CPU:{task_cpu_demand}' - else: - resources_str = ', '.join( - f'{k}:{v}' for k, v in resources_dict.items()) else: resource_accelerators = [] + min_cpus = float('inf') + spot_type: Set[str] = set() for resource in task.resources: + task_cpu_demand = '1+' + if resource.cpus is not None: + task_cpu_demand = resource.cpus + min_cpus = min(min_cpus, float(task_cpu_demand.strip('+ '))) + if resource.use_spot: + spot_type.add('Spot') + else: + spot_type.add('On-demand') + if resource.accelerators is None: continue for k, v in resource.accelerators.items(): resource_accelerators.append(f'{k}:{v}') + if is_managed_job: + if len(task.resources) > 1: + task_cpu_demand = f'{min_cpus}+' + if 'Spot' in spot_type: + spot_str = '|'.join(sorted(spot_type)) + spot_str = f'[{spot_str}]' if resource_accelerators: resources_str = ', '.join(set(resource_accelerators)) else: resources_str = f'CPU:{task_cpu_demand}' - resources_str = f'{task.num_nodes}x [{resources_str}]' + resources_str = f'{task.num_nodes}x[{resources_str}]{spot_str}' return resources_str @@ -2619,27 +2654,6 @@ def stop_handler(signum, frame): raise KeyboardInterrupt(exceptions.SIGTSTP_CODE) -def run_command_and_handle_ssh_failure(runner: command_runner.SSHCommandRunner, - command: str, - failure_message: str) -> str: - """Runs command remotely and returns output with proper error handling.""" - rc, stdout, stderr = runner.run(command, - require_outputs=True, - stream_logs=False) - if rc == 255: - # SSH failed - raise RuntimeError( - f'SSH with user {runner.ssh_user} and key {runner.ssh_private_key} ' - f'to {runner.ip} failed. This is most likely due to incorrect ' - 'credentials or incorrect permissions for the key file. Check ' - 'your credentials and try again.') - subprocess_utils.handle_returncode(rc, - command, - failure_message, - stderr=stderr) - return stdout - - def check_rsync_installed() -> None: """Checks if rsync is installed. 
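A minimal usage sketch of the `get_endpoints` helper that the next hunk introduces; the cluster name and port here are hypothetical, and the return semantics follow the function's docstring:
```python
from sky.backends import backend_utils

# All exposed ports -> endpoint URLs for a cluster named 'llama3'.
endpoints = backend_utils.get_endpoints('llama3')

# A single port: per the docstring, an empty dict is returned until the
# cloud provider actually exposes the port, so callers should retry.
url = backend_utils.get_endpoints('llama3', port=8081).get(8081)
```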
@@ -2681,3 +2695,115 @@ def check_stale_runtime_on_remote(returncode: int, stderr: str, f'not interrupted): {colorama.Style.BRIGHT}sky start -f -y ' f'{cluster_name}{colorama.Style.RESET_ALL}' f'\n--- Details ---\n{stderr.strip()}\n') + + +def get_endpoints(cluster: str, + port: Optional[Union[int, str]] = None, + skip_status_check: bool = False) -> Dict[int, str]: + """Gets the endpoint for a given cluster and port number (endpoint). + + Args: + cluster: The name of the cluster. + port: The port number to get the endpoint for. If None, endpoints + for all ports are returned. + skip_status_check: Whether to skip the status check for the cluster. + This is useful when the cluster is known to be in a INIT state + and the caller wants to query the endpoints. Used by serve + controller to query endpoints during cluster launch when multiple + services may be getting launched in parallel (and as a result, + the controller may be in INIT status due to a concurrent launch). + + Returns: A dictionary of port numbers to endpoints. If endpoint is None, + the dictionary will contain all ports:endpoints exposed on the cluster. + If the endpoint is not exposed yet (e.g., during cluster launch or + waiting for cloud provider to expose the endpoint), an empty dictionary + is returned. + + Raises: + ValueError: if the port is invalid or the cloud provider does not + support querying endpoints. + exceptions.ClusterNotUpError: if the cluster is not in UP status. + """ + # Cast endpoint to int if it is not None + if port is not None: + try: + port = int(port) + except ValueError: + with ux_utils.print_exception_no_traceback(): + raise ValueError(f'Invalid endpoint {port!r}.') from None + cluster_records = get_clusters(include_controller=True, + refresh=False, + cluster_names=[cluster]) + cluster_record = cluster_records[0] + if (not skip_status_check and + cluster_record['status'] != status_lib.ClusterStatus.UP): + with ux_utils.print_exception_no_traceback(): + raise exceptions.ClusterNotUpError( + f'Cluster {cluster_record["name"]!r} ' + 'is not in UP status.', cluster_record['status']) + handle = cluster_record['handle'] + if not isinstance(handle, backends.CloudVmRayResourceHandle): + with ux_utils.print_exception_no_traceback(): + raise ValueError('Querying IP address is not supported ' + f'for cluster {cluster!r} with backend ' + f'{get_backend_from_handle(handle).NAME}.') + + launched_resources = handle.launched_resources + cloud = launched_resources.cloud + try: + cloud.check_features_are_supported( + launched_resources, {clouds.CloudImplementationFeatures.OPEN_PORTS}) + except exceptions.NotSupportedError: + with ux_utils.print_exception_no_traceback(): + raise ValueError('Querying endpoints is not supported ' + f'for cluster {cluster!r} on {cloud}.') from None + + config = common_utils.read_yaml(handle.cluster_yaml) + port_details = provision_lib.query_ports(repr(cloud), + handle.cluster_name_on_cloud, + handle.launched_resources.ports, + head_ip=handle.head_ip, + provider_config=config['provider']) + + # Validation before returning the endpoints + if port is not None: + # If the requested endpoint was not to be exposed + port_set = resources_utils.port_ranges_to_set( + handle.launched_resources.ports) + if port not in port_set: + logger.warning(f'Port {port} is not exposed on ' + f'cluster {cluster!r}.') + return {} + # If the user requested a specific port endpoint, check if it is exposed + if port not in port_details: + error_msg = (f'Port {port} not exposed yet. 
' + f'{_ENDPOINTS_RETRY_MESSAGE} ') + if handle.launched_resources.cloud.is_same_cloud( + clouds.Kubernetes()): + # Add Kubernetes specific debugging info + error_msg += (kubernetes_utils.get_endpoint_debug_message()) + logger.warning(error_msg) + return {} + return {port: port_details[port][0].url()} + else: + if not port_details: + # If cluster had no ports to be exposed + if handle.launched_resources.ports is None: + logger.warning(f'Cluster {cluster!r} does not have any ' + 'ports to be exposed.') + return {} + # Else ports have not been exposed even though they exist. + # In this case, ask the user to retry. + else: + error_msg = (f'No endpoints exposed yet. ' + f'{_ENDPOINTS_RETRY_MESSAGE} ') + if handle.launched_resources.cloud.is_same_cloud( + clouds.Kubernetes()): + # Add Kubernetes specific debugging info + error_msg += \ + kubernetes_utils.get_endpoint_debug_message() + logger.warning(error_msg) + return {} + return { + port_num: urls[0].url() for port_num, urls in port_details.items() + } diff --git a/sky/backends/cloud_vm_ray_backend.py b/sky/backends/cloud_vm_ray_backend.py index 44ade8c9c5e..f0b5db6e2ba 100644 --- a/sky/backends/cloud_vm_ray_backend.py +++ b/sky/backends/cloud_vm_ray_backend.py @@ -1,7 +1,7 @@ """Backend: runs on cloud virtual machines, managed by Ray.""" -import base64 import copy import enum +import functools import getpass import inspect import json @@ -9,6 +9,7 @@ import os import pathlib import re +import shlex import signal import subprocess import sys @@ -28,12 +29,12 @@ from sky import clouds from sky import exceptions from sky import global_user_state +from sky import jobs as managed_jobs from sky import optimizer from sky import provision as provision_lib from sky import resources as resources_lib from sky import serve as serve_lib from sky import sky_logging -from sky import spot as spot_lib from sky import status_lib from sky import task as task_lib from sky.backends import backend_utils @@ -135,6 +136,19 @@ pathlib.Path(sky.__file__).resolve().parent / 'backends' / 'monkey_patches' / 'monkey_patch_ray_up.py') +# The maximum size of a command line arguments is 128 KB, i.e. the command +# executed with /bin/sh should be less than 128KB. +# https://github.com/torvalds/linux/blob/master/include/uapi/linux/binfmts.h +# +# If a user have very long run or setup commands, the generated command may +# exceed the limit, as we directly include scripts in job submission commands. +# If the command is too long, we instead write it to a file, rsync and execute +# it. +# +# We use 120KB as a threshold to be safe for other arguments that +# might be added during ssh. +_MAX_INLINE_SCRIPT_LENGTH = 120 * 1024 + def _get_cluster_config_template(cloud): cloud_to_template = { @@ -268,7 +282,15 @@ def get_or_fail(futures, pg) -> List[int]: \"\"\"Wait for tasks, if any fails, cancel all unready.\"\"\" returncodes = [1] * len(futures) # Wait for 1 task to be ready. - ready, unready = ray.wait(futures) + ready = [] + # Keep invoking ray.wait if ready is empty. This is because + # ray.wait with timeout=None will only wait for 10**6 seconds, + # which will cause tasks running for more than 12 days to return + # before becoming ready. + # (Such tasks are common in serving jobs.) 
+ # Reference: https://github.com/ray-project/ray/blob/ray-2.9.3/python/ray/_private/worker.py#L2845-L2846 + while not ready: + ready, unready = ray.wait(futures) idx = futures.index(ready[0]) returncodes[idx] = ray.get(ready[0]) while unready: @@ -1470,10 +1492,12 @@ def _retry_zones( if zones and len(zones) == 1: launched_resources = launched_resources.copy(zone=zones[0].name) - prev_cluster_ips, prev_ssh_ports = None, None + prev_cluster_ips, prev_ssh_ports, prev_cluster_info = (None, None, + None) if prev_handle is not None: prev_cluster_ips = prev_handle.stable_internal_external_ips prev_ssh_ports = prev_handle.stable_ssh_ports + prev_cluster_info = prev_handle.cached_cluster_info # Record early, so if anything goes wrong, 'sky status' will show # the cluster name and users can appropriately 'sky down'. It also # means a second 'sky launch -c ' will attempt to reuse. @@ -1492,7 +1516,9 @@ def _retry_zones( # optimize the case where the cluster is restarted, i.e., no # need to query IPs and ports from the cloud provider. stable_internal_external_ips=prev_cluster_ips, - stable_ssh_ports=prev_ssh_ports) + stable_ssh_ports=prev_ssh_ports, + cluster_info=prev_cluster_info, + ) usage_lib.messages.usage.update_final_cluster_status( status_lib.ClusterStatus.INIT) @@ -1573,14 +1599,14 @@ def _retry_zones( # manually or by the cloud provider. # Optimize the case where the cluster's head IPs can be parsed # from the output of 'ray up'. - kwargs = {} if handle.launched_nodes == 1: - kwargs = { - 'internal_ips': [head_internal_ip], - 'external_ips': [head_external_ip] - } - handle.update_cluster_ips(max_attempts=_FETCH_IP_MAX_ATTEMPTS, - **kwargs) + handle.update_cluster_ips( + max_attempts=_FETCH_IP_MAX_ATTEMPTS, + internal_ips=[head_internal_ip], + external_ips=[head_external_ip]) + else: + handle.update_cluster_ips( + max_attempts=_FETCH_IP_MAX_ATTEMPTS) handle.update_ssh_ports(max_attempts=_FETCH_IP_MAX_ATTEMPTS) if cluster_exists: # Guard against the case where there's an existing cluster @@ -1983,9 +2009,20 @@ def provision_with_retries( cloud_user = None else: cloud_user = to_provision.cloud.get_current_user_identity() + + requested_features = self._requested_features.copy() + # Skip stop feature for Kubernetes controllers. + if (isinstance(to_provision.cloud, clouds.Kubernetes) and + controller_utils.Controllers.from_name(cluster_name) + is not None): + assert (clouds.CloudImplementationFeatures.STOP + in requested_features), requested_features + requested_features.remove( + clouds.CloudImplementationFeatures.STOP) + # Skip if to_provision.cloud does not support requested features to_provision.cloud.check_features_are_supported( - to_provision, self._requested_features) + to_provision, requested_features) config_dict = self._retry_zones( to_provision, @@ -2106,7 +2143,7 @@ class CloudVmRayResourceHandle(backends.backend.ResourceHandle): """ # Bump if any fields get added/removed/changed, and add backward # compaitibility logic in __setstate__. - _VERSION = 7 + _VERSION = 8 def __init__( self, @@ -2119,6 +2156,7 @@ def __init__( stable_internal_external_ips: Optional[List[Tuple[str, str]]] = None, stable_ssh_ports: Optional[List[int]] = None, + cluster_info: Optional[provision_common.ClusterInfo] = None, # The following 2 fields are deprecated. SkyPilot new provisioner # API handles the TPU node creation/deletion. # Backward compatibility for TPU nodes created before #2943. @@ -2135,10 +2173,10 @@ def __init__( # internal or external ips, depending on the use_internal_ips flag. 
         self.stable_internal_external_ips = stable_internal_external_ips
         self.stable_ssh_ports = stable_ssh_ports
+        self.cached_cluster_info = cluster_info
         self.launched_nodes = launched_nodes
         self.launched_resources = launched_resources
         self.docker_user: Optional[str] = None
-        self.ssh_user: Optional[str] = None
         # Deprecated. SkyPilot new provisioner API handles the TPU node
         # creation/deletion.
         # Backward compatibility for TPU nodes created before #2943.
@@ -2202,13 +2240,8 @@ def update_ssh_ports(self, max_attempts: int = 1) -> None:
         Use this method to use any cloud-specific port fetching logic.
         """
         del max_attempts  # Unused.
-        if isinstance(self.launched_resources.cloud, clouds.RunPod):
-            cluster_info = provision_lib.get_cluster_info(
-                str(self.launched_resources.cloud).lower(),
-                region=self.launched_resources.region,
-                cluster_name_on_cloud=self.cluster_name_on_cloud,
-                provider_config=None)
-            self.stable_ssh_ports = cluster_info.get_ssh_ports()
+        if self.cached_cluster_info is not None:
+            self.stable_ssh_ports = self.cached_cluster_info.get_ssh_ports()
             return

         head_ssh_port = 22
@@ -2216,11 +2249,49 @@ def update_ssh_ports(self, max_attempts: int = 1) -> None:
             [head_ssh_port] + [22] *
             (self.num_ips_per_node * self.launched_nodes - 1))

+    def _update_cluster_info(self):
+        # When a cluster is on a cloud that does not support the new
+        # provisioner, we should skip updating cluster_info.
+        if (self.launched_resources.cloud.PROVISIONER_VERSION >=
+                clouds.ProvisionerVersion.SKYPILOT):
+            provider_name = str(self.launched_resources.cloud).lower()
+            config = {}
+            if os.path.exists(self.cluster_yaml):
+                # It is possible that the cluster yaml is not available when
+                # the handle is unpickled for service replicas from the
+                # controller with an older version.
+                config = common_utils.read_yaml(self.cluster_yaml)
+            try:
+                cluster_info = provision_lib.get_cluster_info(
+                    provider_name,
+                    region=self.launched_resources.region,
+                    cluster_name_on_cloud=self.cluster_name_on_cloud,
+                    provider_config=config.get('provider', None))
+            except Exception as e:  # pylint: disable=broad-except
+                # This could happen when the VM is not fully launched, and a
+                # user is trying to terminate it with `sky down`.
+                logger.debug('Failed to get cluster info for '
+                             f'{self.cluster_name} from the new provisioner '
+                             f'with {common_utils.format_exception(e)}.')
+                raise exceptions.FetchClusterInfoError(
+                    exceptions.FetchClusterInfoError.Reason.HEAD) from e
+            if cluster_info.num_instances != self.launched_nodes:
+                logger.debug(
+                    f'Available nodes in the cluster {self.cluster_name} '
+                    'do not match the number of nodes requested ('
+                    f'{cluster_info.num_instances} != '
+                    f'{self.launched_nodes}).')
+                raise exceptions.FetchClusterInfoError(
+                    exceptions.FetchClusterInfoError.Reason.HEAD)
+            self.cached_cluster_info = cluster_info
+
     def update_cluster_ips(
             self,
             max_attempts: int = 1,
             internal_ips: Optional[List[Optional[str]]] = None,
-            external_ips: Optional[List[Optional[str]]] = None) -> None:
+            external_ips: Optional[List[Optional[str]]] = None,
+            cluster_info: Optional[provision_common.ClusterInfo] = None
+    ) -> None:
         """Updates the cluster IPs cached in the handle.

         We cache the cluster IPs in the handle to avoid having to retrieve
@@ -2246,61 +2317,74 @@ def update_cluster_ips(
             external IPs from the cloud provider.

         Raises:
-            exceptions.FetchIPError: if we failed to get the IPs. e.reason is
-                HEAD or WORKER.
+            exceptions.FetchClusterInfoError: if we failed to get the cluster
+                info. e.reason is HEAD or WORKER.
""" - - def is_provided_ips_valid(ips: Optional[List[Optional[str]]]) -> bool: - return (ips is not None and - len(ips) == self.num_ips_per_node * self.launched_nodes and - all(ip is not None for ip in ips)) - - use_internal_ips = self._use_internal_ips() - - # cluster_feasible_ips is the list of IPs of the nodes in the cluster - # which can be used to connect to the cluster. It is a list of external - # IPs if the cluster is assigned public IPs, otherwise it is a list of - # internal IPs. - cluster_feasible_ips: List[str] - if is_provided_ips_valid(external_ips): - logger.debug(f'Using provided external IPs: {external_ips}') - cluster_feasible_ips = typing.cast(List[str], external_ips) + if cluster_info is not None: + self.cached_cluster_info = cluster_info + use_internal_ips = self._use_internal_ips() + cluster_feasible_ips = self.cached_cluster_info.get_feasible_ips( + use_internal_ips) + cluster_internal_ips = self.cached_cluster_info.get_feasible_ips( + force_internal_ips=True) else: - cluster_feasible_ips = backend_utils.get_node_ips( - self.cluster_yaml, - self.launched_nodes, - head_ip_max_attempts=max_attempts, - worker_ip_max_attempts=max_attempts, - get_internal_ips=use_internal_ips) - - if self.cached_external_ips == cluster_feasible_ips: - logger.debug('Skipping the fetching of internal IPs as the cached ' - 'external IPs matches the newly fetched ones.') - # Optimization: If the cached external IPs are the same as the - # retrieved feasible IPs, then we can skip retrieving internal - # IPs since the cached IPs are up-to-date. - return - logger.debug( - 'Cached external IPs do not match with the newly fetched ones: ' - f'cached ({self.cached_external_ips}), new ({cluster_feasible_ips})' - ) + # For clouds that do not support the SkyPilot Provisioner API. + # TODO(zhwu): once all the clouds are migrated to SkyPilot + # Provisioner API, we should remove this else block + def is_provided_ips_valid( + ips: Optional[List[Optional[str]]]) -> bool: + return (ips is not None and len(ips) + == self.num_ips_per_node * self.launched_nodes and + all(ip is not None for ip in ips)) + + use_internal_ips = self._use_internal_ips() + + # cluster_feasible_ips is the list of IPs of the nodes in the + # cluster which can be used to connect to the cluster. It is a list + # of external IPs if the cluster is assigned public IPs, otherwise + # it is a list of internal IPs. + if is_provided_ips_valid(external_ips): + logger.debug(f'Using provided external IPs: {external_ips}') + cluster_feasible_ips = typing.cast(List[str], external_ips) + else: + cluster_feasible_ips = backend_utils.get_node_ips( + self.cluster_yaml, + self.launched_nodes, + head_ip_max_attempts=max_attempts, + worker_ip_max_attempts=max_attempts, + get_internal_ips=use_internal_ips) + + if self.cached_external_ips == cluster_feasible_ips: + logger.debug( + 'Skipping the fetching of internal IPs as the cached ' + 'external IPs matches the newly fetched ones.') + # Optimization: If the cached external IPs are the same as the + # retrieved feasible IPs, then we can skip retrieving internal + # IPs since the cached IPs are up-to-date. + return - if use_internal_ips: - # Optimization: if we know use_internal_ips is True (currently - # only exposed for AWS and GCP), then our provisioner is guaranteed - # to not assign public IPs, thus the first list of IPs returned - # above are already private IPs. So skip the second query. 
- cluster_internal_ips = list(cluster_feasible_ips) - elif is_provided_ips_valid(internal_ips): - logger.debug(f'Using provided internal IPs: {internal_ips}') - cluster_internal_ips = typing.cast(List[str], internal_ips) - else: - cluster_internal_ips = backend_utils.get_node_ips( - self.cluster_yaml, - self.launched_nodes, - head_ip_max_attempts=max_attempts, - worker_ip_max_attempts=max_attempts, - get_internal_ips=True) + logger.debug( + 'Cached external IPs do not match with the newly fetched ones: ' + f'cached ({self.cached_external_ips}), new ' + f'({cluster_feasible_ips})') + + if use_internal_ips: + # Optimization: if we know use_internal_ips is True (currently + # only exposed for AWS and GCP), then our provisioner is + # guaranteed to not assign public IPs, thus the first list of + # IPs returned above are already private IPs. So skip the second + # query. + cluster_internal_ips = list(cluster_feasible_ips) + elif is_provided_ips_valid(internal_ips): + logger.debug(f'Using provided internal IPs: {internal_ips}') + cluster_internal_ips = typing.cast(List[str], internal_ips) + else: + cluster_internal_ips = backend_utils.get_node_ips( + self.cluster_yaml, + self.launched_nodes, + head_ip_max_attempts=max_attempts, + worker_ip_max_attempts=max_attempts, + get_internal_ips=True) assert len(cluster_feasible_ips) == len(cluster_internal_ips), ( f'Cluster {self.cluster_name!r}:' @@ -2319,6 +2403,39 @@ def is_provided_ips_valid(ips: Optional[List[Optional[str]]]) -> bool: internal_external_ips[1:], key=lambda x: x[1]) self.stable_internal_external_ips = stable_internal_external_ips + @functools.lru_cache() + @timeline.event + def get_command_runners(self, + force_cached: bool = False, + avoid_ssh_control: bool = False + ) -> List[command_runner.CommandRunner]: + """Returns a list of command runners for the cluster.""" + ssh_credentials = backend_utils.ssh_credential_from_yaml( + self.cluster_yaml, self.docker_user, self.ssh_user) + if avoid_ssh_control: + ssh_credentials.pop('ssh_control_name', None) + if (clouds.ProvisionerVersion.RAY_PROVISIONER_SKYPILOT_TERMINATOR >= + self.launched_resources.cloud.PROVISIONER_VERSION): + ip_list = (self.cached_external_ips + if force_cached else self.external_ips()) + if ip_list is None: + return [] + # Potentially refresh the external SSH ports, in case the existing + # cluster before #2491 was launched without external SSH ports + # cached. + port_list = self.external_ssh_ports() + runners = command_runner.SSHCommandRunner.make_runner_list( + zip(ip_list, port_list), **ssh_credentials) + return runners + if self.cached_cluster_info is None: + assert not force_cached, 'cached_cluster_info is None.' + self._update_cluster_info() + assert self.cached_cluster_info is not None, self + runners = provision_lib.get_command_runners( + self.cached_cluster_info.provider_name, self.cached_cluster_info, + **ssh_credentials) + return runners + @property def cached_internal_ips(self) -> Optional[List[str]]: if self.stable_internal_external_ips is not None: @@ -2384,6 +2501,16 @@ def setup_docker_user(self, cluster_config_file: str): def cluster_yaml(self): return os.path.expanduser(self._cluster_yaml) + @property + def ssh_user(self): + if self.cached_cluster_info is not None: + # Overload ssh_user with the user stored in cluster_info, which is + # useful for kubernetes case, where the ssh_user can depend on the + # container image used. For those clusters launched with ray + # autoscaler, we directly use the ssh_user in yaml config. 
+            return self.cached_cluster_info.ssh_user
+        return None
+
     @property
     def head_ip(self):
         external_ips = self.cached_external_ips
@@ -2429,8 +2556,8 @@ def __setstate__(self, state):
         if version < 6:
             state['cluster_name_on_cloud'] = state['cluster_name']

-        if version < 7:
-            self.ssh_user = None
+        if version < 8:
+            self.cached_cluster_info = None

         self.__dict__.update(state)

@@ -2440,7 +2567,7 @@ def __setstate__(self, state):
         if version < 3 and head_ip is not None:
             try:
                 self.update_cluster_ips()
-            except exceptions.FetchIPError:
+            except exceptions.FetchClusterInfoError:
                 # This occurs when an old cluster from was autostopped,
                 # so the head IP in the database is not updated.
                 pass

         self._update_cluster_region()

+        if version < 8:
+            try:
+                self._update_cluster_info()
+            except exceptions.FetchClusterInfoError:
+                # This occurs when an old cluster was autostopped,
+                # so the head IP in the database is not updated.
+                pass
+

 class CloudVmRayBackend(backends.Backend['CloudVmRayResourceHandle']):
     """Backend: runs on cloud virtual machines, managed by Ray.
@@ -2561,6 +2696,17 @@ def check_resources_fit_cluster(
                             f'{example_resource.zone!r},'
                             'but the existing cluster '
                             f'{zone_str}')
+            if (example_resource.requires_fuse and
+                    not launched_resources.requires_fuse):
+                # Will not be reached for non-k8s case since the
+                # less_demanding_than only fails fuse requirement when
+                # the cloud is Kubernetes AND the cluster doesn't have fuse.
+                with ux_utils.print_exception_no_traceback():
+                    raise exceptions.ResourcesMismatchError(
+                        'Task requires FUSE support for mounting object '
+                        'stores, but the existing cluster with '
+                        f'{launched_resources!r} does not support FUSE '
+                        f'mounting. Launch a new cluster to run this task.')
         requested_resource_str = ', '.join(requested_resource_list)
         if isinstance(task.resources, list):
             requested_resource_str = f'[{requested_resource_str}]'
@@ -2729,22 +2875,17 @@ def _provision(
                     provision_record=provision_record,
                     custom_resource=resources_vars.get('custom_resources'),
                     log_dir=self.log_dir)
-                # We must query the IPs from the cloud provider, when the
-                # provisioning is done, to make sure the cluster IPs are
-                # up-to-date.
+                # We use the IPs from cluster_info to update the cluster IPs
+                # when the provisioning is done, to make sure the cluster IPs
+                # are up-to-date.
                 # The staled IPs may be caused by the node being restarted
                 # manually or by the cloud provider.
                 # Optimize the case where the cluster's IPs can be retrieved
                 # from cluster_info.
-                internal_ips, external_ips = zip(*cluster_info.ip_tuples())
-                if not cluster_info.has_external_ips():
-                    external_ips = internal_ips
+                handle.docker_user = cluster_info.docker_user
                 handle.update_cluster_ips(max_attempts=_FETCH_IP_MAX_ATTEMPTS,
-                                          internal_ips=list(internal_ips),
-                                          external_ips=list(external_ips))
+                                          cluster_info=cluster_info)
                 handle.update_ssh_ports(max_attempts=_FETCH_IP_MAX_ATTEMPTS)
-                handle.docker_user = cluster_info.docker_user
-                handle.ssh_user = cluster_info.ssh_user

                 # Update launched resources.
handle.launched_resources = handle.launched_resources.copy( @@ -2776,11 +2917,7 @@ def _provision( handle.launched_resources.cloud.get_zone_shell_cmd()) # zone is None for Azure if get_zone_cmd is not None: - ssh_credentials = backend_utils.ssh_credential_from_yaml( - handle.cluster_yaml, handle.docker_user, - handle.ssh_user) - runners = command_runner.SSHCommandRunner.make_runner_list( - ip_list, port_list=ssh_port_list, **ssh_credentials) + runners = handle.get_command_runners() def _get_zone(runner): retry_count = 0 @@ -2821,8 +2958,11 @@ def _get_zone(runner): logger.debug('Checking if skylet is running on the head node.') with rich_utils.safe_status( '[bold cyan]Preparing SkyPilot runtime'): + # We need to source bashrc for skylet to make sure the autostop + # event can access the path to the cloud CLIs. self.run_on_head(handle, - instance_setup.MAYBE_SKYLET_RESTART_CMD) + instance_setup.MAYBE_SKYLET_RESTART_CMD, + source_bashrc=True) self._update_after_cluster_provisioned( handle, to_provision_config.prev_handle, task, @@ -2921,7 +3061,6 @@ def _sync_workdir(self, handle: CloudVmRayResourceHandle, fore = colorama.Fore style = colorama.Style ip_list = handle.external_ips() - port_list = handle.external_ssh_ports() assert ip_list is not None, 'external_ips is not cached in handle' full_workdir = os.path.abspath(os.path.expanduser(workdir)) @@ -2946,14 +3085,10 @@ def _sync_workdir(self, handle: CloudVmRayResourceHandle, log_path = os.path.join(self.log_dir, 'workdir_sync.log') - ssh_credentials = backend_utils.ssh_credential_from_yaml( - handle.cluster_yaml, handle.docker_user, handle.ssh_user) - # TODO(zhwu): refactor this with backend_utils.parallel_cmd_with_rsync - runners = command_runner.SSHCommandRunner.make_runner_list( - ip_list, port_list=port_list, **ssh_credentials) + runners = handle.get_command_runners() - def _sync_workdir_node(runner: command_runner.SSHCommandRunner) -> None: + def _sync_workdir_node(runner: command_runner.CommandRunner) -> None: runner.rsync( source=workdir, target=SKY_REMOTE_WORKDIR, @@ -3000,47 +3135,51 @@ def _setup(self, handle: CloudVmRayResourceHandle, task: task_lib.Task, return setup = task.setup # Sync the setup script up and run it. - ip_list = handle.external_ips() internal_ips = handle.internal_ips() - port_list = handle.external_ssh_ports() - assert ip_list is not None, 'external_ips is not cached in handle' - ssh_credentials = backend_utils.ssh_credential_from_yaml( - handle.cluster_yaml, handle.docker_user, handle.ssh_user) - # Disable connection sharing for setup script to avoid old - # connections being reused, which may cause stale ssh agent - # forwarding. 
- ssh_credentials.pop('ssh_control_name', None) - remote_setup_file_name = f'/tmp/sky_setup_{self.run_timestamp}' # Need this `-i` option to make sure `source ~/.bashrc` work setup_cmd = f'/bin/bash -i {remote_setup_file_name} 2>&1' + runners = handle.get_command_runners(avoid_ssh_control=True) def _setup_node(node_id: int) -> None: setup_envs = task.envs.copy() setup_envs.update(self._skypilot_predefined_env_vars(handle)) setup_envs['SKYPILOT_SETUP_NODE_IPS'] = '\n'.join(internal_ips) setup_envs['SKYPILOT_SETUP_NODE_RANK'] = str(node_id) - runner = command_runner.SSHCommandRunner(ip_list[node_id], - port=port_list[node_id], - **ssh_credentials) + runner = runners[node_id] setup_script = log_lib.make_task_bash_script(setup, env_vars=setup_envs) - with tempfile.NamedTemporaryFile('w', prefix='sky_setup_') as f: - f.write(setup_script) - f.flush() - setup_sh_path = f.name - runner.rsync(source=setup_sh_path, - target=remote_setup_file_name, - up=True, - stream_logs=False) + encoded_script = shlex.quote(setup_script) + if (detach_setup or + len(encoded_script) > _MAX_INLINE_SCRIPT_LENGTH): + with tempfile.NamedTemporaryFile('w', prefix='sky_setup_') as f: + f.write(setup_script) + f.flush() + setup_sh_path = f.name + runner.rsync(source=setup_sh_path, + target=remote_setup_file_name, + up=True, + stream_logs=False) + create_script_code = 'true' + else: + create_script_code = (f'{{ echo {encoded_script} > ' + f'{remote_setup_file_name}; }}') + if detach_setup: return setup_log_path = os.path.join(self.log_dir, - f'setup-{runner.ip}.log') + f'setup-{runner.node_id}.log') returncode = runner.run( - setup_cmd, + f'{create_script_code} && {setup_cmd}', log_path=setup_log_path, process_stream=False, + # We do not source bashrc for setup, since bashrc is sourced + # in the script already. + # Skip an empty line and two lines due to the /bin/bash -i and + # source ~/.bashrc in the setup_cmd. + # bash: cannot set terminal process group (7398): Inappropriate ioctl for device # pylint: disable=line-too-long + # bash: no job control in this shell + skip_lines=3, ) def error_message() -> str: @@ -3070,7 +3209,7 @@ def error_message() -> str: command=setup_cmd, error_msg=error_message) - num_nodes = len(ip_list) + num_nodes = len(runners) plural = 's' if num_nodes > 1 else '' if not detach_setup: logger.info(f'{fore.CYAN}Running setup on {num_nodes} node{plural}.' 
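
The `_MAX_INLINE_SCRIPT_LENGTH` policy used for the setup script above, and again for job submission in the next hunk, can be summarized in isolation as follows. This is a simplified sketch (the helper function is illustrative; the real code also handles detached setup and rsyncs via the command runner):

import shlex

_MAX_INLINE_SCRIPT_LENGTH = 120 * 1024  # Stay below the ~128 KB argv limit.


def build_create_script_code(script: str, remote_path: str) -> str:
    """Return shell code that materializes `script` at `remote_path`."""
    encoded = shlex.quote(script)
    if len(encoded) > _MAX_INLINE_SCRIPT_LENGTH:
        # Too long to inline on a command line; the caller should rsync a
        # temp file to `remote_path` instead, so emit a no-op here.
        return 'true'
    # Short enough: write the script remotely via a safely quoted echo.
    return f'{{ echo {encoded} > {remote_path}; }}'


print(build_create_script_code('echo hello', '/tmp/sky_setup_demo'))
# { echo 'echo hello' > /tmp/sky_setup_demo; }
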
@@ -3096,7 +3235,7 @@ def _exec_code_on_head( codegen: str, job_id: int, detach_run: bool = False, - spot_dag: Optional['dag.Dag'] = None, + managed_job_dag: Optional['dag.Dag'] = None, ) -> None: """Executes generated code on the head node.""" style = colorama.Style @@ -3110,10 +3249,8 @@ def _exec_code_on_head( mkdir_code = (f'{cd} && mkdir -p {remote_log_dir} && ' f'touch {remote_log_path}') - encoded_script = base64.b64encode( - codegen.encode('utf-8')).decode('utf-8') - create_script_code = (f'{{ echo "{encoded_script}" | base64 --decode > ' - f'{script_path}; }}') + encoded_script = shlex.quote(codegen) + create_script_code = (f'{{ echo {encoded_script} > {script_path}; }}') job_submit_cmd = ( f'RAY_DASHBOARD_PORT=$({constants.SKY_PYTHON_CMD} -c "from sky.skylet import job_lib; print(job_lib.get_job_submission_port())" 2> /dev/null || echo 8265);' # pylint: disable=line-too-long f'{cd} && {constants.SKY_RAY_CMD} job submit ' @@ -3125,23 +3262,41 @@ def _exec_code_on_head( code = job_lib.JobLibCodeGen.queue_job(job_id, job_submit_cmd) job_submit_cmd = ' && '.join([mkdir_code, create_script_code, code]) - - if spot_dag is not None: - # Add the spot job to spot queue table. - spot_codegen = spot_lib.SpotCodeGen() - spot_code = spot_codegen.set_pending(job_id, spot_dag) - # Set the spot job to PENDING state to make sure that this spot - # job appears in the `sky spot queue`, when there are already 16 - # controller process jobs running on the controller VM with 8 - # CPU cores. - # The spot job should be set to PENDING state *after* the + if len(job_submit_cmd) > _MAX_INLINE_SCRIPT_LENGTH: + runners = handle.get_command_runners() + head_runner = runners[0] + with tempfile.NamedTemporaryFile('w', prefix='sky_app_') as fp: + fp.write(codegen) + fp.flush() + script_path = os.path.join(SKY_REMOTE_APP_DIR, + f'sky_job_{job_id}') + # We choose to sync code + exec, because the alternative of 'ray + # submit' may not work as it may use system python (python2) to + # execute the script. Happens for AWS. + head_runner.rsync(source=fp.name, + target=script_path, + up=True, + stream_logs=False) + job_submit_cmd = f'{mkdir_code} && {code}' + + if managed_job_dag is not None: + # Add the managed job to job queue database. + managed_job_codegen = managed_jobs.ManagedJobCodeGen() + managed_job_code = managed_job_codegen.set_pending( + job_id, managed_job_dag) + # Set the managed job to PENDING state to make sure that this + # managed job appears in the `sky jobs queue`, when there are + # already 2x vCPU controller processes running on the controller VM, + # e.g., 16 controller processes running on a controller with 8 + # vCPUs. + # The managed job should be set to PENDING state *after* the # controller process job has been queued, as our skylet on spot - # controller will set the spot job in FAILED state if the + # controller will set the managed job in FAILED state if the # controller process job does not exist. - # We cannot set the spot job to PENDING state in the codegen for + # We cannot set the managed job to PENDING state in the codegen for # the controller process job, as it will stay in the job pending # table and not be executed until there is an empty slot. 
-            job_submit_cmd = job_submit_cmd + ' && ' + spot_code
+            job_submit_cmd = job_submit_cmd + ' && ' + managed_job_code

         returncode, stdout, stderr = self.run_on_head(handle,
                                                       job_submit_cmd,
@@ -3162,8 +3317,9 @@
         try:
             if not detach_run:
-                if handle.cluster_name == spot_lib.SPOT_CONTROLLER_NAME:
-                    self.tail_spot_logs(handle, job_id)
+                if (handle.cluster_name in controller_utils.Controllers.
+                        JOBS_CONTROLLER.value.candidate_cluster_names):
+                    self.tail_managed_job_logs(handle, job_id)
                 else:
                     # Sky logs. Not using subprocess.run since it will make the
                     # ssh keep connected after ctrl-c.
@@ -3171,24 +3327,24 @@
         finally:
             name = handle.cluster_name
             controller = controller_utils.Controllers.from_name(name)
-            if controller == controller_utils.Controllers.SPOT_CONTROLLER:
+            if controller == controller_utils.Controllers.JOBS_CONTROLLER:
                 logger.info(
-                    f'{fore.CYAN}Spot Job ID: '
+                    f'{fore.CYAN}Managed Job ID: '
                     f'{style.BRIGHT}{job_id}{style.RESET_ALL}'
                     '\nTo cancel the job:\t\t'
-                    f'{backend_utils.BOLD}sky spot cancel {job_id}'
+                    f'{backend_utils.BOLD}sky jobs cancel {job_id}'
                     f'{backend_utils.RESET_BOLD}'
                     '\nTo stream job logs:\t\t'
-                    f'{backend_utils.BOLD}sky spot logs {job_id}'
+                    f'{backend_utils.BOLD}sky jobs logs {job_id}'
                     f'{backend_utils.RESET_BOLD}'
                     f'\nTo stream controller logs:\t'
-                    f'{backend_utils.BOLD}sky spot logs --controller {job_id}'
+                    f'{backend_utils.BOLD}sky jobs logs --controller {job_id}'
                     f'{backend_utils.RESET_BOLD}'
-                    '\nTo view all spot jobs:\t\t'
-                    f'{backend_utils.BOLD}sky spot queue'
+                    '\nTo view all managed jobs:\t'
+                    f'{backend_utils.BOLD}sky jobs queue'
                     f'{backend_utils.RESET_BOLD}'
-                    '\nTo view the spot job dashboard:\t'
-                    f'{backend_utils.BOLD}sky spot dashboard'
+                    '\nTo view managed job dashboard:\t'
+                    f'{backend_utils.BOLD}sky jobs dashboard'
                     f'{backend_utils.RESET_BOLD}')
             elif controller is None:
                 logger.info(f'{fore.CYAN}Job ID: '
@@ -3476,15 +3632,7 @@ sync_down_logs(
             logger.info(f'{fore.CYAN}Job {job_id} logs: {log_dir}'
                         f'{style.RESET_ALL}')

-        ip_list = handle.external_ips()
-        assert ip_list is not None, 'external_ips is not cached in handle'
-        ssh_port_list = handle.external_ssh_ports()
-        assert ssh_port_list is not None, 'external_ssh_ports is not cached ' \
-            'in handle'
-        ssh_credentials = backend_utils.ssh_credential_from_yaml(
-            handle.cluster_yaml, handle.docker_user, handle.ssh_user)
-        runners = command_runner.SSHCommandRunner.make_runner_list(
-            ip_list, port_list=ssh_port_list, **ssh_credentials)
+        runners = handle.get_command_runners()

         def _rsync_down(args) -> None:
             """Rsync down logs from remote nodes.
@@ -3496,7 +3644,10 @@ def _rsync_down(args) -> None:
             try:
                 os.makedirs(local_log_dir, exist_ok=True)
                 runner.rsync(
-                    source=f'{remote_log_dir}/*',
+                    # Require a `/` at the end to make sure the parent dir
+                    # is not created locally. We do not add an additional '*'
+                    # as Kubernetes's rsync does not work with an ending '*'.
+                    source=f'{remote_log_dir}/',
                     target=local_log_dir,
                     up=False,
                     stream_logs=False)
@@ -3505,7 +3656,7 @@ def _rsync_down(args) -> None:
                 if e.returncode == exceptions.RSYNC_FILE_NOT_FOUND_CODE:
                     # Raised by rsync_down. Remote log dir may not exist, since
                     # the job can be run on some part of the nodes.
- logger.debug(f'{runner.ip} does not have the tasks/*.') + logger.debug(f'{runner.node_id} does not have the tasks/*.') else: raise @@ -3518,12 +3669,20 @@ def _rsync_down(args) -> None: def tail_logs(self, handle: CloudVmRayResourceHandle, job_id: Optional[int], - spot_job_id: Optional[int] = None, + managed_job_id: Optional[int] = None, follow: bool = True) -> int: + """Tail the logs of a job. + + Args: + handle: The handle to the cluster. + job_id: The job ID to tail the logs of. + managed_job_id: The managed job ID for display purpose only. + follow: Whether to follow the logs. + """ code = job_lib.JobLibCodeGen.tail_logs(job_id, - spot_job_id=spot_job_id, + managed_job_id=managed_job_id, follow=follow) - if job_id is None and spot_job_id is None: + if job_id is None and managed_job_id is None: logger.info( 'Job ID not provided. Streaming the logs of the latest job.') @@ -3550,17 +3709,16 @@ def tail_logs(self, returncode = e.code return returncode - def tail_spot_logs(self, - handle: CloudVmRayResourceHandle, - job_id: Optional[int] = None, - job_name: Optional[str] = None, - follow: bool = True) -> None: + def tail_managed_job_logs(self, + handle: CloudVmRayResourceHandle, + job_id: Optional[int] = None, + job_name: Optional[str] = None, + controller: bool = False, + follow: bool = True) -> None: # if job_name is not None, job_id should be None assert job_name is None or job_id is None, (job_name, job_id) - if job_name is not None: - code = spot_lib.SpotCodeGen.stream_logs_by_name(job_name, follow) - else: - code = spot_lib.SpotCodeGen.stream_logs_by_id(job_id, follow) + code = managed_jobs.ManagedJobCodeGen.stream_logs( + job_name, job_id, follow, controller) # With the stdin=subprocess.DEVNULL, the ctrl-c will not directly # kill the process, so we need to handle it manually here. @@ -3676,7 +3834,7 @@ def teardown_no_lock(self, # even when the command was executed successfully. self.run_on_head(handle, f'{constants.SKY_RAY_CMD} stop --force') - except exceptions.FetchIPError: + except exceptions.FetchClusterInfoError: # This error is expected if the previous cluster IP is # failed to be found, # i.e., the cluster is already stopped/terminated. @@ -3953,6 +4111,14 @@ def post_teardown_cleanup(self, pass except exceptions.PortDoesNotExistError: logger.debug('Ports do not exist. Skipping cleanup.') + except Exception as e: # pylint: disable=broad-except + if purge: + logger.warning( + f'Failed to cleanup ports. Skipping since purge is ' + f'set. Details: ' + f'{common_utils.format_exception(e, use_bracket=True)}') + else: + raise # The cluster file must exist because the cluster_yaml will only # be removed after the cluster entry in the database is removed. @@ -3991,6 +4157,16 @@ def set_autostop(self, # The core.autostop() function should have already checked that the # cloud and resources support requested autostop. if idle_minutes_to_autostop is not None: + # Skip auto-stop for Kubernetes clusters. + if (isinstance(handle.launched_resources.cloud, clouds.Kubernetes) + and not down and idle_minutes_to_autostop >= 0): + # We should hit this code path only for the controllers on + # Kubernetes clusters. + assert (controller_utils.Controllers.from_name( + handle.cluster_name) is not None), handle.cluster_name + logger.info('Auto-stop is not supported for Kubernetes ' + 'clusters. 
Skipping.')
+                return

             # Check if we're stopping spot
             assert (handle.launched_resources is not None and
@@ -4053,6 +4229,7 @@ def run_on_head(
             require_outputs: bool = False,
             separate_stderr: bool = False,
             process_stream: bool = True,
+            source_bashrc: bool = False,
             **kwargs,
     ) -> Union[int, Tuple[int, str, str]]:
         """Runs 'cmd' on the cluster's head node.
@@ -4078,6 +4255,9 @@ def run_on_head(
             process_stream: Whether to post-process the stdout/stderr of the
                 command, such as replacing or skipping lines on the fly. If
                 enabled, lines are printed only when '\r' or '\n' is found.
+            source_bashrc: Whether to source bashrc when running the command
+                on the VM. For user-related commands, it is always good to
+                source bashrc to make sure the env vars are set.

         Returns:
             returncode
@@ -4085,24 +4265,17 @@ def run_on_head(
             A tuple of (returncode, stdout, stderr).

         Raises:
-            exceptions.FetchIPError: If the head node IP cannot be fetched.
+            exceptions.FetchClusterInfoError: If the cluster info cannot be
+                fetched.
         """
         # This will try to fetch the head node IP if it is not cached.
-        external_ips = handle.external_ips(max_attempts=_FETCH_IP_MAX_ATTEMPTS)
-        head_ip = external_ips[0]
-        external_ssh_ports = handle.external_ssh_ports(
-            max_attempts=_FETCH_IP_MAX_ATTEMPTS)
-        head_ssh_port = external_ssh_ports[0]
-        ssh_credentials = backend_utils.ssh_credential_from_yaml(
-            handle.cluster_yaml, handle.docker_user, handle.ssh_user)
-        runner = command_runner.SSHCommandRunner(head_ip,
-                                                 port=head_ssh_port,
-                                                 **ssh_credentials)
+        runners = handle.get_command_runners()
+        head_runner = runners[0]
         if under_remote_workdir:
             cmd = f'cd {SKY_REMOTE_WORKDIR} && {cmd}'

-        return runner.run(
+        return head_runner.run(
             cmd,
             port_forward=port_forward,
             log_path=log_path,
@@ -4111,6 +4284,7 @@ def run_on_head(
             ssh_mode=ssh_mode,
             require_outputs=require_outputs,
             separate_stderr=separate_stderr,
+            source_bashrc=source_bashrc,
             **kwargs,
         )

@@ -4254,13 +4428,7 @@ def _execute_file_mounts(self, handle: CloudVmRayResourceHandle,
         style = colorama.Style
         logger.info(f'{fore.CYAN}Processing file mounts.{style.RESET_ALL}')
         start = time.time()
-        ip_list = handle.external_ips()
-        port_list = handle.external_ssh_ports()
-        assert ip_list is not None, 'external_ips is not cached in handle'
-        ssh_credentials = backend_utils.ssh_credential_from_yaml(
-            handle.cluster_yaml, handle.docker_user, handle.ssh_user)
-        runners = command_runner.SSHCommandRunner.make_runner_list(
-            ip_list, port_list=port_list, **ssh_credentials)
+        runners = handle.get_command_runners()
         log_path = os.path.join(self.log_dir, 'file_mounts.log')

         # Check the files and warn
@@ -4356,12 +4524,21 @@ def _execute_file_mounts(self, handle: CloudVmRayResourceHandle,
                 action_message='Syncing',
                 log_path=log_path,
                 stream_logs=False,
+                # Need to source bashrc, as the cloud specific CLI or SDK may
+                # require PATH in bashrc.
+                source_bashrc=True,
             )
         # (2) Run the commands to create symlinks on all the nodes.
         symlink_command = ' && '.join(symlink_commands)
         if symlink_command:
-
-            def _symlink_node(runner: command_runner.SSHCommandRunner):
+            # ALIAS_SUDO_TO_EMPTY_FOR_ROOT_CMD sets sudo to an empty string
+            # for root. We need this as we do not source bashrc for the
+            # command (for better performance), and our sudo handling is only
+            # in bashrc.
+ symlink_command = ( + f'{command_runner.ALIAS_SUDO_TO_EMPTY_FOR_ROOT_CMD} && ' + f'{symlink_command}') + + def _symlink_node(runner: command_runner.CommandRunner): returncode = runner.run(symlink_command, log_path=log_path) subprocess_utils.handle_returncode( returncode, symlink_command, @@ -4401,13 +4578,7 @@ def _execute_storage_mounts( logger.info(f'{fore.CYAN}Processing {len(storage_mounts)} ' f'storage mount{plural}.{style.RESET_ALL}') start = time.time() - ip_list = handle.external_ips() - port_list = handle.external_ssh_ports() - assert ip_list is not None, 'external_ips is not cached in handle' - ssh_credentials = backend_utils.ssh_credential_from_yaml( - handle.cluster_yaml, handle.docker_user, handle.ssh_user) - runners = command_runner.SSHCommandRunner.make_runner_list( - ip_list, port_list=port_list, **ssh_credentials) + runners = handle.get_command_runners() log_path = os.path.join(self.log_dir, 'storage_mounts.log') for dst, storage_obj in storage_mounts.items(): @@ -4438,6 +4609,9 @@ def _execute_storage_mounts( run_rsync=False, action_message='Mounting', log_path=log_path, + # Need to source bashrc, as the cloud specific CLI or SDK + # may require PATH in bashrc. + source_bashrc=True, ) except exceptions.CommandError as e: if e.returncode == exceptions.MOUNT_PATH_NON_EMPTY_CODE: @@ -4546,8 +4720,8 @@ def _get_task_env_vars(self, task: task_lib.Task, job_id: int, handle: CloudVmRayResourceHandle) -> Dict[str, str]: """Returns the environment variables for the task.""" env_vars = task.envs.copy() - # If it is a managed spot job, the TASK_ID_ENV_VAR will have been - # already set by the controller. + # If it is a managed job, the TASK_ID_ENV_VAR will have been already set + # by the controller. if constants.TASK_ID_ENV_VAR not in env_vars: env_vars[ constants.TASK_ID_ENV_VAR] = common_utils.get_global_job_id( @@ -4599,7 +4773,7 @@ def _execute_task_one_node(self, handle: CloudVmRayResourceHandle, codegen.build(), job_id, detach_run=detach_run, - spot_dag=task.spot_dag) + managed_job_dag=task.managed_job_dag) def _execute_task_n_nodes(self, handle: CloudVmRayResourceHandle, task: task_lib.Task, job_id: int, @@ -4654,4 +4828,4 @@ def _execute_task_n_nodes(self, handle: CloudVmRayResourceHandle, codegen.build(), job_id, detach_run=detach_run, - spot_dag=task.spot_dag) + managed_job_dag=task.managed_job_dag) diff --git a/sky/benchmark/benchmark_utils.py b/sky/benchmark/benchmark_utils.py index 9fd0c529453..e1323bb714a 100644 --- a/sky/benchmark/benchmark_utils.py +++ b/sky/benchmark/benchmark_utils.py @@ -172,7 +172,7 @@ def _create_benchmark_bucket() -> Tuple[str, str]: raise_if_no_cloud_access=True) # Already checked by raise_if_no_cloud_access=True. assert enabled_clouds - bucket_type = data.StoreType.from_cloud(enabled_clouds[0]) + bucket_type = data.StoreType.from_cloud(enabled_clouds[0]).value # Create a benchmark bucket. 
logger.info(f'Creating a bucket {bucket_name} to save the benchmark logs.') diff --git a/sky/check.py b/sky/check.py index 9c28c43ec53..e8a61317d63 100644 --- a/sky/check.py +++ b/sky/check.py @@ -1,26 +1,33 @@ """Credential checks: check cloud credentials and enable clouds.""" import traceback -from typing import Dict, Iterable, List, Optional, Tuple +from types import ModuleType +from typing import Dict, Iterable, List, Optional, Tuple, Union import click import colorama import rich -from sky import clouds +from sky import clouds as sky_clouds from sky import exceptions from sky import global_user_state +from sky import skypilot_config from sky.adaptors import cloudflare from sky.utils import ux_utils -# TODO(zhwu): add check for a single cloud to improve performance -def check(quiet: bool = False, verbose: bool = False) -> None: +def check( + quiet: bool = False, + verbose: bool = False, + clouds: Optional[Iterable[str]] = None, +) -> None: echo = (lambda *_args, **_kwargs: None) if quiet else click.echo echo('Checking credentials to enable clouds for SkyPilot.') - enabled_clouds = [] + disabled_clouds = [] - def check_one_cloud(cloud_tuple: Tuple[str, clouds.Cloud]) -> None: + def check_one_cloud( + cloud_tuple: Tuple[str, Union[sky_clouds.Cloud, + ModuleType]]) -> None: cloud_repr, cloud = cloud_tuple echo(f' Checking {cloud_repr}...', nl=False) try: @@ -43,52 +50,117 @@ def check_one_cloud(cloud_tuple: Tuple[str, clouds.Cloud]) -> None: if reason is not None: echo(f' Hint: {reason}') else: + disabled_clouds.append(cloud_repr) echo(f' Reason: {reason}') + def get_cloud_tuple( + cloud_name: str) -> Tuple[str, Union[sky_clouds.Cloud, ModuleType]]: + # Validates cloud_name and returns a tuple of the cloud's name and + # the cloud object. Includes special handling for Cloudflare. + if cloud_name.lower().startswith('cloudflare'): + return cloudflare.SKY_CHECK_NAME, cloudflare + else: + cloud_obj = sky_clouds.CLOUD_REGISTRY.from_str(cloud_name) + assert cloud_obj is not None, f'Cloud {cloud_name!r} not found' + return repr(cloud_obj), cloud_obj + + def get_all_clouds(): + return tuple([repr(c) for c in sky_clouds.CLOUD_REGISTRY.values()] + + [cloudflare.SKY_CHECK_NAME]) + + if clouds is not None: + cloud_list = clouds + else: + cloud_list = get_all_clouds() + clouds_to_check = [get_cloud_tuple(c) for c in cloud_list] + + # Use allowed_clouds from config if it exists, otherwise check all clouds. + # Also validate names with get_cloud_tuple. + config_allowed_cloud_names = [ + get_cloud_tuple(c)[0] for c in skypilot_config.get_nested( + ['allowed_clouds'], get_all_clouds()) + ] + # Use disallowed_cloud_names for logging the clouds that will be disabled + # because they are not included in allowed_clouds in config.yaml. + disallowed_cloud_names = [ + c for c in get_all_clouds() if c not in config_allowed_cloud_names + ] + # Check only the clouds which are allowed in the config. clouds_to_check = [ - (repr(cloud), cloud) for cloud in clouds.CLOUD_REGISTRY.values() + c for c in clouds_to_check if c[0] in config_allowed_cloud_names ] - clouds_to_check.append(('Cloudflare, for R2 object store', cloudflare)) for cloud_tuple in sorted(clouds_to_check): check_one_cloud(cloud_tuple) - # Cloudflare is not a real cloud in clouds.CLOUD_REGISTRY, and should not be - # inserted into the DB (otherwise `sky launch` and other code would error - # out when it's trying to look it up in the registry). 
- enabled_clouds = [ + # Cloudflare is not a real cloud in sky_clouds.CLOUD_REGISTRY, and should + # not be inserted into the DB (otherwise `sky launch` and other code would + # error out when it's trying to look it up in the registry). + enabled_clouds_set = { cloud for cloud in enabled_clouds if not cloud.startswith('Cloudflare') - ] - global_user_state.set_enabled_clouds(enabled_clouds) - - if len(enabled_clouds) == 0: + } + disabled_clouds_set = { + cloud for cloud in disabled_clouds if not cloud.startswith('Cloudflare') + } + config_allowed_clouds_set = { + cloud for cloud in config_allowed_cloud_names + if not cloud.startswith('Cloudflare') + } + previously_enabled_clouds_set = { + repr(cloud) for cloud in global_user_state.get_cached_enabled_clouds() + } + + # Determine the set of enabled clouds: (previously enabled clouds + newly + # enabled clouds - newly disabled clouds) intersected with + # config_allowed_clouds, if specified in config.yaml. + # This means that if a cloud is already enabled and is not included in + # allowed_clouds in config.yaml, it will be disabled. + all_enabled_clouds = (config_allowed_clouds_set & ( + (previously_enabled_clouds_set | enabled_clouds_set) - + disabled_clouds_set)) + global_user_state.set_enabled_clouds(list(all_enabled_clouds)) + + disallowed_clouds_hint = None + if disallowed_cloud_names: + disallowed_clouds_hint = ( + '\nNote: The following clouds were disabled because they were not ' + 'included in allowed_clouds in ~/.sky/config.yaml: ' + f'{", ".join([c for c in disallowed_cloud_names])}') + if len(all_enabled_clouds) == 0: echo( click.style( 'No cloud is enabled. SkyPilot will not be able to run any ' 'task. Run `sky check` for more info.', fg='red', bold=True)) + if disallowed_clouds_hint: + echo(click.style(disallowed_clouds_hint, dim=True)) raise SystemExit() else: + clouds_arg = (' ' + + ' '.join(disabled_clouds) if clouds is not None else '') echo( click.style( '\nTo enable a cloud, follow the hints above and rerun: ', - dim=True) + click.style('sky check', bold=True) + '\n' + - click.style( + dim=True) + click.style(f'sky check{clouds_arg}', bold=True) + + '\n' + click.style( 'If any problems remain, refer to detailed docs at: ' 'https://skypilot.readthedocs.io/en/latest/getting-started/installation.html', # pylint: disable=line-too-long dim=True)) + if disallowed_clouds_hint: + echo(click.style(disallowed_clouds_hint, dim=True)) + # Pretty print for UX. if not quiet: enabled_clouds_str = '\n :heavy_check_mark: '.join( - [''] + sorted(enabled_clouds)) + [''] + sorted(all_enabled_clouds)) rich.print('\n[green]:tada: Enabled clouds :tada:' f'{enabled_clouds_str}[/green]') def get_cached_enabled_clouds_or_refresh( - raise_if_no_cloud_access: bool = False) -> List[clouds.Cloud]: + raise_if_no_cloud_access: bool = False) -> List[sky_clouds.Cloud]: """Returns cached enabled clouds and if no cloud is enabled, refresh. This function will perform a refresh if no public cloud is enabled. @@ -120,7 +192,8 @@ def get_cached_enabled_clouds_or_refresh( def get_cloud_credential_file_mounts( - excluded_clouds: Optional[Iterable[clouds.Cloud]]) -> Dict[str, str]: + excluded_clouds: Optional[Iterable[sky_clouds.Cloud]] +) -> Dict[str, str]: """Returns the files necessary to access all enabled clouds. 
Returns a dictionary that will be added to a task's file mounts @@ -130,7 +203,7 @@ def get_cloud_credential_file_mounts( file_mounts = {} for cloud in enabled_clouds: if (excluded_clouds is not None and - clouds.cloud_in_list(cloud, excluded_clouds)): + sky_clouds.cloud_in_iterable(cloud, excluded_clouds)): continue cloud_file_mounts = cloud.get_credential_file_mounts() file_mounts.update(cloud_file_mounts) diff --git a/sky/cli.py b/sky/cli.py index 72667cffc97..db5291d949c 100644 --- a/sky/cli.py +++ b/sky/cli.py @@ -47,14 +47,13 @@ import sky from sky import backends from sky import check as sky_check -from sky import clouds +from sky import clouds as sky_clouds from sky import core from sky import exceptions from sky import global_user_state -from sky import provision as provision_lib +from sky import jobs as managed_jobs from sky import serve as serve_lib from sky import sky_logging -from sky import spot as spot_lib from sky import status_lib from sky.adaptors import common as adaptors_common from sky.backends import backend_utils @@ -91,9 +90,9 @@ provision a new cluster with that name. Otherwise provision a new cluster with an autogenerated name.""" -# The maximum number of in-progress spot jobs to show in the status +# The maximum number of in-progress managed jobs to show in the status # command. -_NUM_SPOT_JOBS_TO_SHOW_IN_STATUS = 5 +_NUM_MANAGED_JOBS_TO_SHOW_IN_STATUS = 5 _STATUS_PROPERTY_CLUSTER_NUM_ERROR_MESSAGE = ( '{cluster_num} cluster{plural} {verb}. Please specify {cause} ' @@ -103,7 +102,7 @@ 'please retry after a while.') _DAG_NOT_SUPPORTED_MESSAGE = ('YAML specifies a DAG which is only supported by ' - '`sky spot launch`. `{command}` supports a ' + '`sky jobs launch`. `{command}` supports a ' 'single task only.') @@ -480,7 +479,7 @@ def _parse_override_params( if cloud.lower() == 'none': override_params['cloud'] = None else: - override_params['cloud'] = clouds.CLOUD_REGISTRY.from_str(cloud) + override_params['cloud'] = sky_clouds.CLOUD_REGISTRY.from_str(cloud) if region is not None: if region.lower() == 'none': override_params['region'] = None @@ -566,9 +565,10 @@ def _launch_with_confirm( raise_if_no_cloud_access=True) except exceptions.NoCloudAccessError as e: # Catch the exception where the public cloud is not enabled, and - # only print the error message without the error type. - click.secho(e, fg='yellow') - sys.exit(1) + # make it yellow for better visibility. + with ux_utils.print_exception_no_traceback(): + raise RuntimeError(f'{colorama.Fore.YELLOW}{e}' + f'{colorama.Style.RESET_ALL}') from e dag = sky.optimize(dag) task = dag.tasks[0] @@ -688,7 +688,7 @@ def _pop_and_ignore_fields_in_override_params( def _make_task_or_dag_from_entrypoint_with_overrides( - entrypoint: List[str], + entrypoint: Tuple[str, ...], *, entrypoint_name: str = 'Task', name: Optional[str] = None, @@ -708,8 +708,8 @@ def _make_task_or_dag_from_entrypoint_with_overrides( ports: Optional[Tuple[str]] = None, env: Optional[List[Tuple[str, str]]] = None, field_to_ignore: Optional[List[str]] = None, - # spot launch specific - spot_recovery: Optional[str] = None, + # job launch specific + job_recovery: Optional[str] = None, ) -> Union[sky.Task, sky.Dag]: """Creates a task or a dag from an entrypoint with overrides. @@ -772,14 +772,16 @@ def _make_task_or_dag_from_entrypoint_with_overrides( else: task = sky.Task(name='sky-cmd', run=entrypoint) task.set_resources({sky.Resources()}) + # env update has been done for DAG in load_chain_dag_from_yaml for YAML. + task.update_envs(env) # Override. 
if workdir is not None: task.workdir = workdir - # Spot launch specific. - if spot_recovery is not None: - override_params['spot_recovery'] = spot_recovery + # job launch specific. + if job_recovery is not None: + override_params['job_recovery'] = job_recovery task.set_resources_override(override_params) @@ -787,7 +789,6 @@ def _make_task_or_dag_from_entrypoint_with_overrides( task.num_nodes = num_nodes if name is not None: task.name = name - task.update_envs(env) return task @@ -816,13 +817,30 @@ def get_help(self, ctx): return super().get_help(ctx) -def _with_deprecation_warning(f, original_name, alias_name): +def _with_deprecation_warning( + f, + original_name: str, + alias_name: str, + override_command_argument: Optional[Dict[str, Any]] = None): @functools.wraps(f) def wrapper(self, *args, **kwargs): + override_str = '' + if override_command_argument is not None: + overrides = [] + for k, v in override_command_argument.items(): + if isinstance(v, bool): + if v: + overrides.append(f'--{k}') + else: + overrides.append(f'--no-{k}') + else: + overrides.append(f'--{k.replace("_", "-")}={v}') + override_str = ' with additional arguments ' + ' '.join(overrides) click.secho( - f'WARNING: `{alias_name}` is deprecated and will be removed in a ' - f'future release. Please use `{original_name}` instead.\n', + f'WARNING: `{alias_name}` has been renamed to `{original_name}` ' + f'and will be removed in a future release. Please use the ' + f'latter{override_str} instead.\n', err=True, fg='yellow') return f(self, *args, **kwargs) @@ -830,17 +848,49 @@ def wrapper(self, *args, **kwargs): return wrapper -def _add_command_alias_to_group(group, command, name, hidden): +def _override_arguments(callback, override_command_argument: Dict[str, Any]): + + def wrapper(*args, **kwargs): + logger.info(f'Overriding arguments: {override_command_argument}') + kwargs.update(override_command_argument) + return callback(*args, **kwargs) + + return wrapper + + +def _add_command_alias( + group: click.Group, + command: click.Command, + hidden: bool = False, + new_group: Optional[click.Group] = None, + new_command_name: Optional[str] = None, + override_command_argument: Optional[Dict[str, Any]] = None, + with_warning: bool = True, +) -> None: """Add a alias of a command to a group.""" + if new_group is None: + new_group = group + if new_command_name is None: + new_command_name = command.name + if new_group == group and new_command_name == command.name: + raise ValueError('Cannot add an alias to the same command.') new_command = copy.deepcopy(command) new_command.hidden = hidden - new_command.name = name + new_command.name = new_command_name + + if override_command_argument: + new_command.callback = _override_arguments(new_command.callback, + override_command_argument) orig = f'sky {group.name} {command.name}' - alias = f'sky {group.name} {name}' - new_command.invoke = _with_deprecation_warning(new_command.invoke, orig, - alias) - group.add_command(new_command, name=name) + alias = f'sky {new_group.name} {new_command_name}' + if with_warning: + new_command.invoke = _with_deprecation_warning( + new_command.invoke, + orig, + alias, + override_command_argument=override_command_argument) + new_group.add_command(new_command, name=new_command_name) def _deprecate_and_hide_command(group, command_to_deprecate, @@ -980,7 +1030,7 @@ def cli(): 'the same data on the boot disk as an existing cluster.')) @usage_lib.entrypoint def launch( - entrypoint: List[str], + entrypoint: Tuple[str, ...], cluster: Optional[str], dryrun: bool, 
detach_setup: bool, @@ -1082,11 +1132,19 @@ def launch( @cli.command(cls=_DocumentedCodeCommand) @click.argument('cluster', - required=True, + required=False, type=str, **_get_shell_complete_args(_complete_cluster_name)) +@click.option( + '--cluster', + '-c', + 'cluster_option', + hidden=True, + type=str, + help='This is the same as the positional argument, just for consistency.', + **_get_shell_complete_args(_complete_cluster_name)) @click.argument('entrypoint', - required=True, + required=False, type=str, nargs=-1, **_get_shell_complete_args(_complete_file_name)) @@ -1101,8 +1159,9 @@ def launch( @usage_lib.entrypoint # pylint: disable=redefined-builtin def exec( - cluster: str, - entrypoint: List[str], + cluster: Optional[str], + cluster_option: Optional[str], + entrypoint: Tuple[str, ...], detach_run: bool, name: Optional[str], cloud: Optional[str], @@ -1180,6 +1239,17 @@ def exec( sky exec mycluster --env WANDB_API_KEY python train_gpu.py """ + if cluster_option is None and cluster is None: + raise click.UsageError('Missing argument \'[CLUSTER]\' and ' + '\'[ENTRYPOINT]...\'') + if cluster_option is not None: + if cluster is not None: + entrypoint = (cluster,) + entrypoint + cluster = cluster_option + if not entrypoint: + raise click.UsageError('Missing argument \'[ENTRYPOINT]...\'') + assert cluster is not None, (cluster, cluster_option, entrypoint) + env = _merge_env_vars(env_file, env) controller_utils.check_cluster_name_not_controller( cluster, operation_str='Executing task on it') @@ -1219,30 +1289,30 @@ def exec( sky.exec(task, backend=backend, cluster_name=cluster, detach_run=detach_run) -def _get_spot_jobs( +def _get_managed_jobs( refresh: bool, skip_finished: bool, show_all: bool, limit_num_jobs_to_show: bool = False, is_called_by_user: bool = False) -> Tuple[Optional[int], str]: - """Get the in-progress spot jobs. + """Get the in-progress managed jobs. Args: - refresh: Query the latest statuses, restarting the spot controller if + refresh: Query the latest statuses, restarting the jobs controller if stopped. skip_finished: Show only in-progress jobs. - show_all: Show all information of each spot job (e.g., region, price). + show_all: Show all information of each job (e.g., region, price). limit_num_jobs_to_show: If True, limit the number of jobs to show to - _NUM_SPOT_JOBS_TO_SHOW_IN_STATUS, which is mainly used by + _NUM_MANAGED_JOBS_TO_SHOW_IN_STATUS, which is mainly used by `sky status`. is_called_by_user: If this function is called by user directly, or an internal call. Returns: A tuple of (num_in_progress_jobs, msg). If num_in_progress_jobs is None, - it means there is an error when querying the spot jobs. In this case, + it means there is an error when querying the managed jobs. In this case, msg contains the error message. Otherwise, msg contains the formatted - spot job table. + managed job table. 
""" num_in_progress_jobs = None try: @@ -1250,32 +1320,51 @@ def _get_spot_jobs( usage_lib.messages.usage.set_internal() with sky_logging.silent(): # Make the call silent - spot_jobs = spot_lib.queue(refresh=refresh, - skip_finished=skip_finished) - num_in_progress_jobs = len(spot_jobs) + managed_jobs_ = managed_jobs.queue(refresh=refresh, + skip_finished=skip_finished) + num_in_progress_jobs = len(set(job['job_id'] for job in managed_jobs_)) except exceptions.ClusterNotUpError as e: controller_status = e.cluster_status msg = str(e) if controller_status is None: - msg += (f' (See: {colorama.Style.BRIGHT}sky spot -h' + msg += (f' (See: {colorama.Style.BRIGHT}sky jobs -h' f'{colorama.Style.RESET_ALL})') elif (controller_status == status_lib.ClusterStatus.STOPPED and is_called_by_user): - msg += (f' (See finished jobs: {colorama.Style.BRIGHT}' - f'sky spot queue --refresh{colorama.Style.RESET_ALL})') + msg += (f' (See finished managed jobs: {colorama.Style.BRIGHT}' + f'sky jobs queue --refresh{colorama.Style.RESET_ALL})') except RuntimeError as e: - msg = ('Failed to query spot jobs due to connection ' - 'issues. Try again later. ' - f'Details: {common_utils.format_exception(e, use_bracket=True)}') + msg = '' + try: + # Check the controller status again, as the RuntimeError is likely + # due to the controller being autostopped when querying the jobs. + controller_type = controller_utils.Controllers.JOBS_CONTROLLER + record = backend_utils.refresh_cluster_record( + controller_type.value.cluster_name, + cluster_status_lock_timeout=0) + if (record is None or + record['status'] == status_lib.ClusterStatus.STOPPED): + msg = controller_type.value.default_hint_if_non_existent + except Exception: # pylint: disable=broad-except + # This is to an best effort to find the latest controller status to + # print more helpful message, so we can ignore any exception to + # print the original error. + pass + if not msg: + msg = ( + 'Failed to query managed jobs due to connection ' + 'issues. Try again later. ' + f'Details: {common_utils.format_exception(e, use_bracket=True)}' + ) except Exception as e: # pylint: disable=broad-except - msg = ('Failed to query spot jobs: ' + msg = ('Failed to query managed jobs: ' f'{common_utils.format_exception(e, use_bracket=True)}') else: - max_jobs_to_show = (_NUM_SPOT_JOBS_TO_SHOW_IN_STATUS + max_jobs_to_show = (_NUM_MANAGED_JOBS_TO_SHOW_IN_STATUS if limit_num_jobs_to_show else None) - msg = spot_lib.format_job_table(spot_jobs, - show_all=show_all, - max_jobs=max_jobs_to_show) + msg = managed_jobs.format_job_table(managed_jobs_, + show_all=show_all, + max_jobs=max_jobs_to_show) return num_in_progress_jobs, msg @@ -1314,9 +1403,27 @@ def _get_services(service_names: Optional[List[str]], msg += (f' (See: {colorama.Style.BRIGHT}sky serve -h' f'{colorama.Style.RESET_ALL})') except RuntimeError as e: - msg = ('Failed to fetch service statuses due to connection issues. ' - 'Please try again later. Details: ' - f'{common_utils.format_exception(e, use_bracket=True)}') + msg = '' + try: + # Check the controller status again, as the RuntimeError is likely + # due to the controller being autostopped when querying the + # services. 
+            controller_type = controller_utils.Controllers.SKY_SERVE_CONTROLLER
+            record = backend_utils.refresh_cluster_record(
+                controller_type.value.cluster_name,
+                cluster_status_lock_timeout=0)
+            if (record is None or
+                    record['status'] == status_lib.ClusterStatus.STOPPED):
+                msg = controller_type.value.default_hint_if_non_existent
+        except Exception:  # pylint: disable=broad-except
+            # This is a best-effort attempt to find the latest controller
+            # status to print a more helpful message; we ignore any exception
+            # and print the original error.
+            pass
+        if not msg:
+            msg = ('Failed to fetch service statuses due to connection issues. '
+                   'Please try again later. Details: '
+                   f'{common_utils.format_exception(e, use_bracket=True)}')
    except Exception as e:  # pylint: disable=broad-except
        msg = ('Failed to fetch service statuses: '
               f'{common_utils.format_exception(e, use_bracket=True)}')
@@ -1380,11 +1487,11 @@ def _get_services(service_names: Optional[List[str]],
              type=int,
              help=('Get the endpoint URL for the specified port number on the '
                    'cluster. This option will override all other options.'))
-@click.option('--show-spot-jobs/--no-show-spot-jobs',
+@click.option('--show-managed-jobs/--no-show-managed-jobs',
              default=True,
              is_flag=True,
              required=False,
-              help='Also show recent in-progress spot jobs, if any.')
+              help='Also show recent in-progress managed jobs, if any.')
@click.option('--show-services/--no-show-services',
              default=True,
              is_flag=True,
@@ -1398,8 +1505,8 @@ def _get_services(service_names: Optional[List[str]],
@usage_lib.entrypoint
# pylint: disable=redefined-builtin
def status(all: bool, refresh: bool, ip: bool, endpoints: bool,
-           endpoint: Optional[int], show_spot_jobs: bool, show_services: bool,
-           clusters: List[str]):
+           endpoint: Optional[int], show_managed_jobs: bool,
+           show_services: bool, clusters: List[str]):
    # NOTE(dev): Keep the docstring consistent between the Python API and CLI.
    """Show clusters.
@@ -1458,19 +1565,20 @@ def status(all: bool, refresh: bool, ip: bool, endpoints: bool,
     or for autostop-enabled clusters, use ``--refresh`` to query the latest
     cluster statuses from the cloud providers.
     """
-    # Using a pool with 2 worker to run the spot job query and sky serve service
-    # query in parallel to speed up. The pool provides a AsyncResult object that
-    # can be used as a future.
+    # Using a pool with 2 workers to run the managed job query and sky serve
+    # service query in parallel to speed up. The pool provides an AsyncResult
+    # object that can be used as a future.
     with multiprocessing.Pool(2) as pool:
-        # Do not show spot queue if user specifies clusters, and if user
+        # Do not show job queue if user specifies clusters, and if user
         # specifies --ip or --endpoint(s).
-        show_spot_jobs = show_spot_jobs and not any([clusters, ip, endpoints])
+        show_managed_jobs = show_managed_jobs and not any(
+            [clusters, ip, endpoints])
         show_endpoints = endpoints or endpoint is not None
         show_single_endpoint = endpoint is not None
-        if show_spot_jobs:
-            # Run the spot job query in parallel to speed up the status query.
-            spot_jobs_future = pool.apply_async(
-                _get_spot_jobs,
+        if show_managed_jobs:
+            # Run the managed job query in parallel to speed up the status
+            # query.
+ managed_jobs_future = pool.apply_async( + _get_managed_jobs, kwds=dict(refresh=False, skip_finished=True, show_all=False, @@ -1563,71 +1671,28 @@ def status(all: bool, refresh: bool, ip: bool, endpoints: bool, head_ip = handle.external_ips()[0] if show_endpoints: - launched_resources = handle.launched_resources - cloud = launched_resources.cloud - try: - cloud.check_features_are_supported( - launched_resources, - {clouds.CloudImplementationFeatures.OPEN_PORTS}) - except exceptions.NotSupportedError: - with ux_utils.print_exception_no_traceback(): - raise ValueError('Querying endpoints is not supported ' - f'for {cloud}.') from None - - config = common_utils.read_yaml(handle.cluster_yaml) - port_details = provision_lib.query_ports( - repr(cloud), handle.cluster_name_on_cloud, - handle.launched_resources.ports, config['provider']) - - if endpoint is not None: - # If cluster had no ports to be exposed - ports_set = resources_utils.port_ranges_to_set( - handle.launched_resources.ports) - if endpoint not in ports_set: - with ux_utils.print_exception_no_traceback(): - raise ValueError(f'Port {endpoint} is not exposed ' - 'on cluster ' - f'{cluster_record["name"]!r}.') - # If the user requested a specific port endpoint - if endpoint not in port_details: - error_msg = (f'Port {endpoint} not exposed yet. ' - f'{_ENDPOINTS_RETRY_MESSAGE} ') - if handle.launched_resources.cloud.is_same_cloud( - clouds.Kubernetes()): - # Add Kubernetes specific debugging info - error_msg += ( - kubernetes_utils.get_endpoint_debug_message()) - with ux_utils.print_exception_no_traceback(): - raise RuntimeError(error_msg) - click.echo(port_details[endpoint][0].url(ip=head_ip)) - return - - if not port_details: - # If cluster had no ports to be exposed - if handle.launched_resources.ports is None: - with ux_utils.print_exception_no_traceback(): - raise ValueError('Cluster does not have any ports ' - 'to be exposed.') - # Else wait for the ports to be exposed - else: - error_msg = (f'No endpoints exposed yet. 
' - f'{_ENDPOINTS_RETRY_MESSAGE} ') - if handle.launched_resources.cloud.is_same_cloud( - clouds.Kubernetes()): - # Add Kubernetes specific debugging info - error_msg += \ - kubernetes_utils.get_endpoint_debug_message() - with ux_utils.print_exception_no_traceback(): - raise RuntimeError(error_msg) - - for port, urls in port_details.items(): - click.echo( - f'{colorama.Fore.BLUE}{colorama.Style.BRIGHT}{port}' - f'{colorama.Style.RESET_ALL}: ' - f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' - f'{urls[0].url(ip=head_ip)}{colorama.Style.RESET_ALL}') + if endpoint: + cluster_endpoint = core.endpoints(cluster_record['name'], + endpoint).get( + endpoint, None) + if not cluster_endpoint: + raise click.Abort( + f'Endpoint {endpoint} not found for cluster ' + f'{cluster_record["name"]!r}.') + click.echo(cluster_endpoint) + else: + cluster_endpoints = core.endpoints(cluster_record['name']) + assert isinstance(cluster_endpoints, dict) + if not cluster_endpoints: + raise click.Abort(f'No endpoint found for cluster ' + f'{cluster_record["name"]!r}.') + for port, port_endpoint in cluster_endpoints.items(): + click.echo( + f'{colorama.Fore.BLUE}{colorama.Style.BRIGHT}{port}' + f'{colorama.Style.RESET_ALL}: ' + f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' + f'{port_endpoint}{colorama.Style.RESET_ALL}') return - click.echo(head_ip) return hints = [] @@ -1655,16 +1720,16 @@ def _try_get_future_result(future) -> Tuple[bool, Any]: interrupted = True return interrupted, result - spot_jobs_query_interrupted = False - if show_spot_jobs: + managed_jobs_query_interrupted = False + if show_managed_jobs: click.echo(f'\n{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' - f'Managed spot jobs{colorama.Style.RESET_ALL}') - with rich_utils.safe_status('[cyan]Checking spot jobs[/]'): - spot_jobs_query_interrupted, result = _try_get_future_result( - spot_jobs_future) - if spot_jobs_query_interrupted: + f'Managed jobs{colorama.Style.RESET_ALL}') + with rich_utils.safe_status('[cyan]Checking managed jobs[/]'): + managed_jobs_query_interrupted, result = _try_get_future_result( + managed_jobs_future) + if managed_jobs_query_interrupted: # Set to -1, so that the controller is not considered - # down, and the hint for showing sky spot queue + # down, and the hint for showing sky jobs queue # will still be shown. num_in_progress_jobs = -1 msg = 'KeyboardInterrupt' @@ -1673,29 +1738,30 @@ def _try_get_future_result(future) -> Tuple[bool, Any]: click.echo(msg) if num_in_progress_jobs is not None: - # spot controller is UP. + # jobs controller is UP. job_info = '' if num_in_progress_jobs > 0: plural_and_verb = ' is' if num_in_progress_jobs > 1: plural_and_verb = 's are' job_info = ( - f'{num_in_progress_jobs} spot job{plural_and_verb} ' + f'{num_in_progress_jobs} managed job{plural_and_verb} ' 'in progress') - if num_in_progress_jobs > _NUM_SPOT_JOBS_TO_SHOW_IN_STATUS: + if (num_in_progress_jobs > + _NUM_MANAGED_JOBS_TO_SHOW_IN_STATUS): job_info += ( - f' ({_NUM_SPOT_JOBS_TO_SHOW_IN_STATUS} latest ones ' - 'shown)') + f' ({_NUM_MANAGED_JOBS_TO_SHOW_IN_STATUS} latest ' + 'ones shown)') job_info += '. ' hints.append( - controller_utils.Controllers.SPOT_CONTROLLER.value. + controller_utils.Controllers.JOBS_CONTROLLER.value. in_progress_hint.format(job_info=job_info)) if show_services: click.echo(f'\n{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' f'Services{colorama.Style.RESET_ALL}') num_services = None - if spot_jobs_query_interrupted: + if managed_jobs_query_interrupted: # The pool is terminated, so we cannot run the service query. 
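
[Editor's note] The endpoint branch above replaces the inline provisioning-layer logic with calls to `core.endpoints()`, which, as the diff suggests, returns a `{port: url}` mapping whether or not a specific port is requested. A hedged usage sketch under that assumption; the cluster name is made up:

```python
from sky import core

# All exposed ports on the cluster, e.g. {8080: '3.21.44.10:8080'}.
endpoints = core.endpoints('mycluster')
for port, url in endpoints.items():
    print(f'{port}: {url}')

# A single port; .get() yields None if port 8080 is not exposed.
url = core.endpoints('mycluster', 8080).get(8080)
```
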
msg = 'KeyboardInterrupt' else: @@ -1712,7 +1778,7 @@ def _try_get_future_result(future) -> Tuple[bool, Any]: hints.append(controller_utils.Controllers.SKY_SERVE_CONTROLLER. value.in_progress_hint) - if show_spot_jobs or show_services: + if show_managed_jobs or show_services: try: pool.close() pool.join() @@ -1983,7 +2049,7 @@ def logs( help='Skip confirmation prompt.') @click.argument('jobs', required=False, type=int, nargs=-1) @usage_lib.entrypoint -def cancel(cluster: str, all: bool, jobs: List[int], yes: bool): # pylint: disable=redefined-builtin +def cancel(cluster: str, all: bool, jobs: List[int], yes: bool): # pylint: disable=redefined-builtin, redefined-outer-name # NOTE(dev): Keep the docstring consistent between the Python API and CLI. """Cancel job(s). @@ -2030,16 +2096,16 @@ def cancel(cluster: str, all: bool, jobs: List[int], yes: bool): # pylint: disa try: core.cancel(cluster, all=all, job_ids=job_ids_to_cancel) - except exceptions.NotSupportedError: + except exceptions.NotSupportedError as e: controller = controller_utils.Controllers.from_name(cluster) assert controller is not None, cluster - click.echo(controller.value.decline_cancel_hint) - sys.exit(1) + with ux_utils.print_exception_no_traceback(): + raise click.UsageError(controller.value.decline_cancel_hint) from e except ValueError as e: raise click.UsageError(str(e)) - except exceptions.ClusterNotUpError as e: - click.echo(str(e)) - sys.exit(1) + except exceptions.ClusterNotUpError: + with ux_utils.print_exception_no_traceback(): + raise @cli.command(cls=_DocumentedCodeCommand) @@ -2382,7 +2448,7 @@ def start( if not to_start: return - # Checks for controller clusters (spot controller / sky serve controller). + # Checks for controller clusters (jobs controller / sky serve controller). controllers, normal_clusters = [], [] for name in to_start: if controller_utils.Controllers.from_name(name) is not None: @@ -2501,14 +2567,26 @@ def down( purge=purge) -def _hint_or_raise_for_down_spot_controller(controller_name: str): +def _hint_or_raise_for_down_jobs_controller(controller_name: str): + """Helper function to check job controller status before tearing it down. + + Raises helpful exceptions and errors if the controller is not in a safe + state to be torn down. + + Raises: + RuntimeError: if failed to get the job queue. + exceptions.NotSupportedError: if the controller is not in a safe state + to be torn down (e.g., because it has jobs running or + it is in init state) + """ controller = controller_utils.Controllers.from_name(controller_name) assert controller is not None, controller_name with rich_utils.safe_status( - '[bold cyan]Checking for in-progress spot jobs[/]'): + '[bold cyan]Checking for in-progress managed jobs[/]'): try: - spot_jobs = spot_lib.queue(refresh=False, skip_finished=True) + managed_jobs_ = managed_jobs.queue(refresh=False, + skip_finished=True) except exceptions.ClusterNotUpError as e: if controller.value.connection_error_hint in str(e): with ux_utils.print_exception_no_traceback(): @@ -2517,21 +2595,21 @@ def _hint_or_raise_for_down_spot_controller(controller_name: str): decline_down_when_failed_to_fetch_status_hint) if e.cluster_status is None: click.echo( - 'Managed spot controller has already been torn down.') + 'Managed jobs controller has already been torn down.') sys.exit(0) - # At this point, the spot jobs are failed to be fetched due to the - # controller being STOPPED or being firstly launched, i.e., there is - # no in-prgress spot jobs. 
-            spot_jobs = []
+            # At this point, fetching the managed jobs failed because the
+            # controller is STOPPED or is being launched for the first
+            # time, i.e., there are no in-progress managed jobs.
+            managed_jobs_ = []
    msg = (f'{colorama.Fore.YELLOW}WARNING: Tearing down the managed '
-           'spot controller. Please be aware of the following:'
+           'jobs controller. Please be aware of the following:'
           f'{colorama.Style.RESET_ALL}'
-           '\n * All logs and status information of the spot '
-           'jobs (output of `sky spot queue`) will be lost.')
+           '\n * All logs and status information of the managed '
+           'jobs (output of `sky jobs queue`) will be lost.')
    click.echo(msg)
-    if spot_jobs:
-        job_table = spot_lib.format_job_table(spot_jobs, show_all=False)
+    if managed_jobs_:
+        job_table = managed_jobs.format_job_table(managed_jobs_, show_all=False)
        msg = controller.value.decline_down_for_dirty_controller_hint
        # Add prefix to each line to align with the bullet point.
        msg += '\n'.join(
@@ -2539,11 +2617,22 @@ def _hint_or_raise_for_down_spot_controller(controller_name: str):
        with ux_utils.print_exception_no_traceback():
            raise exceptions.NotSupportedError(msg)
    else:
-        click.echo(' * No in-progress spot jobs found. It should be safe to '
+        click.echo(' * No in-progress managed jobs found. It should be safe to '
                   'terminate (see caveats above).')


def _hint_or_raise_for_down_sky_serve_controller(controller_name: str):
+    """Helper function to check serve controller status before tearing it down.
+
+    Raises helpful exceptions and errors if the controller is not in a safe
+    state to be torn down.
+
+    Raises:
+        RuntimeError: if failed to get the service status.
+        exceptions.NotSupportedError: if the controller is not in a safe state
+            to be torn down (e.g., because it has services running or
+            it is in init state)
+    """
    controller = controller_utils.Controllers.from_name(controller_name)
    assert controller is not None, controller_name
    with rich_utils.safe_status('[bold cyan]Checking for live services[/]'):
@@ -2575,8 +2664,8 @@ def _hint_or_raise_for_down_sky_serve_controller(controller_name: str):


_CONTROLLER_TO_HINT_OR_RAISE = {
-    controller_utils.Controllers.SPOT_CONTROLLER:
-        (_hint_or_raise_for_down_spot_controller),
+    controller_utils.Controllers.JOBS_CONTROLLER:
+        (_hint_or_raise_for_down_jobs_controller),
    controller_utils.Controllers.SKY_SERVE_CONTROLLER:
        (_hint_or_raise_for_down_sky_serve_controller),
}
@@ -2591,9 +2680,9 @@ def _down_or_stop_clusters(
        idle_minutes_to_autostop: Optional[int] = None) -> None:
    """Tears down or (auto-)stops a cluster (or all clusters).

-    Controllers (spot controller and sky serve controller) can only be
+    Controllers (jobs controller and sky serve controller) can only be
    terminated if the cluster name is explicitly and uniquely specified (not
-    via glob) and purge is set to True.
+    via glob).
    """
    if down:
        command = 'down'
@@ -2662,12 +2751,13 @@ def _down_or_stop_clusters(
                    # TODO(zhwu): This hint or raise is not transactional, which
                    # means even if it passed the check with no in-progress spot
                    # or service and prompt the confirmation for termination,
-                    # a user could still do a `sky spot launch` or a
+                    # a user could still do a `sky jobs launch` or a
                    # `sky serve up` before typing the delete, causing a leaked
-                    # spot job or service. We should make this check atomic with
-                    # the termination.
+                    # managed job or service. We should make this check atomic
+                    # with the termination.
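
[Editor's note] Both `_hint_or_raise_for_down_*_controller` helpers implement the same teardown guard: list in-flight work, raise `NotSupportedError` with a hint if any is found, otherwise print an all-clear. As the TODO above notes, the check is not atomic with the termination (a classic time-of-check/time-of-use gap), so it fails fast rather than guaranteeing safety. A condensed sketch; `list_in_progress` is a hypothetical stand-in for `managed_jobs.queue(refresh=False, skip_finished=True)`:

```python
from typing import Callable, List


class NotSupportedError(Exception):
    pass


def hint_or_raise_before_down(list_in_progress: Callable[[], List[dict]]):
    jobs = list_in_progress()
    if jobs:
        raise NotSupportedError(
            f'{len(jobs)} in-progress managed job(s) found. Cancel them '
            'before tearing down the controller.')
    print(' * No in-progress managed jobs found. It should be safe to '
          'terminate (see caveats above).')
```
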
hint_or_raise(controller_name) - except exceptions.ClusterOwnerIdentityMismatchError as e: + except (exceptions.ClusterOwnerIdentityMismatchError, + RuntimeError) as e: if purge: click.echo(common_utils.format_exception(e)) else: @@ -2794,24 +2884,38 @@ def _down_or_stop(name: str): progress.refresh() -@cli.command() +@cli.command(cls=_DocumentedCodeCommand) +@click.argument('clouds', required=False, type=str, nargs=-1) @click.option('--verbose', '-v', is_flag=True, default=False, help='Show the activated account for each cloud.') @usage_lib.entrypoint -def check(verbose: bool): +def check(clouds: Tuple[str], verbose: bool): """Check which clouds are available to use. This checks access credentials for all clouds supported by SkyPilot. If a cloud is detected to be inaccessible, the reason and correction steps will be shown. + If CLOUDS are specified, checks credentials for only those clouds. + The enabled clouds are cached and form the "search space" to be considered for each task. + + Examples: + + .. code-block:: bash + + # Check credentials for all supported clouds. + sky check + \b + # Check only specific clouds - AWS and GCP. + sky check aws gcp """ - sky_check.check(verbose=verbose) + clouds_arg = clouds if len(clouds) > 0 else None + sky_check.check(verbose=verbose, clouds=clouds_arg) @cli.command() @@ -2863,6 +2967,15 @@ def show_gpus( To show all regions for a specified accelerator, use ``sky show-gpus --all-regions``. + If ``--region`` or ``--all-regions`` is not specified, the price displayed + for each instance type is the lowest across all regions for both on-demand + and spot instances. There may be multiple regions with the same lowest + price. + + If ``--cloud kubernetes`` is specified, it will show the maximum quantities + of the GPU available on a single node and the real-time availability of + the GPU across all nodes in the Kubernetes cluster. + Definitions of certain fields: * ``DEVICE_MEM``: Memory of a single device; does not depend on the device @@ -2870,10 +2983,15 @@ def show_gpus( * ``HOST_MEM``: Memory of the host instance (VM). - If ``--region`` or ``--all-regions`` is not specified, the price displayed - for each instance type is the lowest across all regions for both on-demand - and spot instances. There may be multiple regions with the same lowest - price. + * ``QTY_PER_NODE`` (Kubernetes only): GPU quantities that can be requested + on a single node. + + * ``TOTAL_GPUS`` (Kubernetes only): Total number of GPUs available in the + Kubernetes cluster. + + * ``TOTAL_FREE_GPUS`` (Kubernetes only): Number of currently free GPUs + in the Kubernetes cluster. This is fetched in real-time and may change + when other users are using the cluster. """ # validation for the --region flag if region is not None and cloud is None: @@ -2890,15 +3008,70 @@ def show_gpus( '--all-regions and --region flags cannot be used simultaneously.') # This will validate 'cloud' and raise if not found. 
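
[Editor's note] The new optional CLOUDS argument to `sky check` maps directly onto the `clouds=` parameter this diff adds to `sky_check.check()`; `None` keeps the old check-everything behavior. A hedged sketch of the equivalent Python calls, assuming the signature shown in the diff:

```python
from sky import check as sky_check

sky_check.check(clouds=['aws', 'gcp'], verbose=True)  # only AWS and GCP
sky_check.check(clouds=None, verbose=False)           # all supported clouds
```
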
-    cloud_obj = clouds.CLOUD_REGISTRY.from_str(cloud)
+    cloud_obj = sky_clouds.CLOUD_REGISTRY.from_str(cloud)
    service_catalog.validate_region_zone(region, None, clouds=cloud)
    show_all = all
    if show_all and accelerator_str is not None:
        raise click.UsageError('--all is only allowed without a GPU name.')

+    # Kubernetes specific bools
+    cloud_is_kubernetes = isinstance(cloud_obj, sky_clouds.Kubernetes)
+    kubernetes_autoscaling = kubernetes_utils.get_autoscaler_type() is not None
+    kubernetes_is_enabled = sky_clouds.cloud_in_iterable(
+        sky_clouds.Kubernetes(), global_user_state.get_cached_enabled_clouds())
+
+    if cloud_is_kubernetes and region is not None:
+        raise click.UsageError(
+            'The --region flag cannot be set with --cloud kubernetes.')
+
    def _list_to_str(lst):
        return ', '.join([str(e) for e in lst])

+    def _get_kubernetes_realtime_gpu_table(
+            name_filter: Optional[str] = None,
+            quantity_filter: Optional[int] = None):
+        if quantity_filter:
+            qty_header = 'QTY_FILTER'
+            free_header = 'FILTERED_FREE_GPUS'
+        else:
+            qty_header = 'QTY_PER_NODE'
+            free_header = 'TOTAL_FREE_GPUS'
+        realtime_gpu_table = log_utils.create_table(
+            ['GPU', qty_header, 'TOTAL_GPUS', free_header])
+        counts, capacity, available = service_catalog.list_accelerator_realtime(
+            gpus_only=True,
+            clouds='kubernetes',
+            name_filter=name_filter,
+            region_filter=region,
+            quantity_filter=quantity_filter,
+            case_sensitive=False)
+        assert (set(counts.keys()) == set(capacity.keys()) == set(
+            available.keys())), (f'Keys of counts ({list(counts.keys())}), '
+                                 f'capacity ({list(capacity.keys())}), '
+                                 f'and available ({list(available.keys())}) '
+                                 'must be the same.')
+        if len(counts) == 0:
+            err_msg = 'No GPUs found in Kubernetes cluster. '
+            debug_msg = 'To further debug, run: sky check '
+            if name_filter is not None:
+                gpu_info_msg = f' {name_filter!r}'
+                if quantity_filter is not None:
+                    gpu_info_msg += (' with requested quantity'
+                                     f' {quantity_filter}')
+                err_msg = (f'Resources{gpu_info_msg} not found '
+                           'in Kubernetes cluster. ')
+                debug_msg = ('To show available accelerators on kubernetes,'
+                             ' run: sky show-gpus --cloud kubernetes ')
+            full_err_msg = (err_msg + kubernetes_utils.NO_GPU_HELP_MESSAGE +
+                            debug_msg)
+            raise ValueError(full_err_msg)
+        for gpu, _ in sorted(counts.items()):
+            realtime_gpu_table.add_row([
+                gpu,
+                _list_to_str(counts.pop(gpu)), capacity[gpu], available[gpu]
+            ])
+        return realtime_gpu_table

    def _output():
        gpu_table = log_utils.create_table(
            ['COMMON_GPU', 'AVAILABLE_QUANTITIES'])
@@ -2909,29 +3082,69 @@ def _output():

        name, quantity = None, None

+        # Optimization - do not poll the Kubernetes API when fetching common
+        # GPUs here, because Kubernetes data is fetched later for its own
+        # table, which comes after the common GPUs.
+        clouds_to_list = cloud
+        if cloud is None:
+            clouds_to_list = [
+                c for c in service_catalog.ALL_CLOUDS if c != 'kubernetes'
+            ]
+
+        k8s_messages = ''
        if accelerator_str is None:
+            # Collect k8s related messages in k8s_messages and print them at
+            # the end.
+            print_section_titles = False
+            # If cloud is kubernetes, we want to show real-time capacity
+            if kubernetes_is_enabled and (cloud is None or cloud_is_kubernetes):
+                try:
+                    # If --cloud kubernetes is not specified, we want to catch
+                    # the case where no GPUs are available on the cluster and
+                    # print the warning at the end.
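
[Editor's note] `_get_kubernetes_realtime_gpu_table()` above leans on `service_catalog.list_accelerator_realtime()` returning three dicts keyed by the same GPU names: requestable per-node quantities, total cluster capacity, and currently-free units. A toy illustration of how those aligned dicts become table rows; the data is invented:

```python
counts = {'A100': [1, 2, 4], 'T4': [1, 2]}  # QTY_PER_NODE options
capacity = {'A100': 8, 'T4': 4}             # TOTAL_GPUS
available = {'A100': 5, 'T4': 4}            # TOTAL_FREE_GPUS

# The three dicts must cover exactly the same GPU names.
assert set(counts) == set(capacity) == set(available)
for gpu in sorted(counts):
    qty = ', '.join(str(c) for c in counts[gpu])
    print(f'{gpu:6} qty/node: {qty:10} total: {capacity[gpu]} '
          f'free: {available[gpu]}')
```
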
+ k8s_realtime_table = _get_kubernetes_realtime_gpu_table() + except ValueError as e: + if not cloud_is_kubernetes: + # Make it a note if cloud is not kubernetes + k8s_messages += 'Note: ' + k8s_messages += str(e) + else: + print_section_titles = True + yield (f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' + f'Kubernetes GPUs{colorama.Style.RESET_ALL}\n') + yield from k8s_realtime_table.get_string() + if kubernetes_autoscaling: + k8s_messages += ( + '\n' + kubernetes_utils.KUBERNETES_AUTOSCALER_NOTE) + if cloud_is_kubernetes: + # Do not show clouds if --cloud kubernetes is specified + if not kubernetes_is_enabled: + yield ('Kubernetes is not enabled. To fix, run: ' + 'sky check kubernetes ') + yield k8s_messages + return + + # For show_all, show the k8s message at the start since output is + # long and the user may not scroll to the end. + if show_all and k8s_messages: + yield k8s_messages + yield '\n\n' + result = service_catalog.list_accelerator_counts( gpus_only=True, - clouds=cloud, + clouds=clouds_to_list, region_filter=region, ) - if (len(result) == 0 and cloud_obj is not None and - cloud_obj.is_same_cloud(clouds.Kubernetes())): - yield kubernetes_utils.NO_GPU_ERROR_MESSAGE - return + if print_section_titles: + # If section titles were printed above, print again here + yield '\n\n' + yield (f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' + f'Cloud GPUs{colorama.Style.RESET_ALL}\n') # "Common" GPUs - # If cloud is kubernetes, we want to show all GPUs here, even if - # they are not listed as common in SkyPilot. - if (cloud_obj is not None and - cloud_obj.is_same_cloud(clouds.Kubernetes())): - for gpu, _ in sorted(result.items()): + for gpu in service_catalog.get_common_gpus(): + if gpu in result: gpu_table.add_row([gpu, _list_to_str(result.pop(gpu))]) - else: - for gpu in service_catalog.get_common_gpus(): - if gpu in result: - gpu_table.add_row([gpu, _list_to_str(result.pop(gpu))]) yield from gpu_table.get_string() # Google TPUs @@ -2952,6 +3165,9 @@ def _output(): else: yield ('\n\nHint: use -a/--all to see all accelerators ' '(including non-common ones) and pricing.') + if k8s_messages: + yield '\n' + yield k8s_messages return else: # Parse accelerator string @@ -2975,25 +3191,82 @@ def _output(): else: name, quantity = accelerator_str, None + print_section_titles = False + if (kubernetes_is_enabled and (cloud is None or cloud_is_kubernetes) and + not show_all): + # Print section title if not showing all and instead a specific + # accelerator is requested + print_section_titles = True + yield (f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' + f'Kubernetes GPUs{colorama.Style.RESET_ALL}\n') + try: + k8s_realtime_table = _get_kubernetes_realtime_gpu_table( + name_filter=name, quantity_filter=quantity) + yield from k8s_realtime_table.get_string() + except ValueError as e: + # In the case of a specific accelerator, show the error message + # immediately (e.g., "Resources H100 not found ...") + yield str(e) + if kubernetes_autoscaling: + k8s_messages += ('\n' + + kubernetes_utils.KUBERNETES_AUTOSCALER_NOTE) + yield k8s_messages + if cloud_is_kubernetes: + # Do not show clouds if --cloud kubernetes is specified + if not kubernetes_is_enabled: + yield ('Kubernetes is not enabled. 
To fix, run: ' + 'sky check kubernetes ') + return + + # For clouds other than Kubernetes, get the accelerator details # Case-sensitive result = service_catalog.list_accelerators(gpus_only=True, name_filter=name, quantity_filter=quantity, region_filter=region, - clouds=cloud, + clouds=clouds_to_list, case_sensitive=False, all_regions=all_regions) + # Import here to save module load speed. + # pylint: disable=import-outside-toplevel,line-too-long + from sky.clouds.service_catalog import common + + # For each gpu name (count not included): + # - Group by cloud + # - Sort within each group by prices + # - Sort groups by each cloud's (min price, min spot price) + new_result = {} + for i, (gpu, items) in enumerate(result.items()): + df = pd.DataFrame([t._asdict() for t in items]) + # Determine the minimum prices for each cloud. + min_price_df = df.groupby('cloud').agg(min_price=('price', 'min'), + min_spot_price=('spot_price', + 'min')) + df = df.merge(min_price_df, on='cloud') + # Sort within each cloud by price. + df = df.groupby('cloud', group_keys=False).apply( + lambda x: x.sort_values(by=['price', 'spot_price'])) + # Sort across groups (clouds). + df = df.sort_values(by=['min_price', 'min_spot_price']) + df = df.drop(columns=['min_price', 'min_spot_price']) + sorted_dataclasses = [ + common.InstanceTypeInfo(*row) + for row in df.to_records(index=False) + ] + new_result[gpu] = sorted_dataclasses + result = new_result - if len(result) == 0: - if cloud == 'kubernetes': - yield kubernetes_utils.NO_GPU_ERROR_MESSAGE - return + if print_section_titles and not show_all: + yield '\n\n' + yield (f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' + f'Cloud GPUs{colorama.Style.RESET_ALL}\n') + if len(result) == 0: quantity_str = (f' with requested quantity {quantity}' if quantity else '') - yield f'Resources \'{name}\'{quantity_str} not found. ' - yield 'Try \'sky show-gpus --all\' ' - yield 'to show available accelerators.' + cloud_str = f' on {cloud_obj}.' if cloud else ' in cloud catalogs.' + yield f'Resources \'{name}\'{quantity_str} not found{cloud_str} ' + yield 'To show available accelerators, run: sky show-gpus --all' return for i, (gpu, items) in enumerate(result.items()): @@ -3016,13 +3289,14 @@ def _output(): instance_type_str = item.instance_type if not pd.isna( item.instance_type) else '(attachable)' cpu_count = item.cpu_count - if pd.isna(cpu_count): - cpu_str = '-' - elif isinstance(cpu_count, (float, int)): + if not pd.isna(cpu_count) and isinstance( + cpu_count, (float, int)): if int(cpu_count) == cpu_count: cpu_str = str(int(cpu_count)) else: cpu_str = f'{cpu_count:.1f}' + else: + cpu_str = '-' device_memory_str = (f'{item.device_memory:.0f}GB' if not pd.isna(item.device_memory) else '-') host_memory_str = f'{item.memory:.0f}GB' if not pd.isna( @@ -3147,12 +3421,12 @@ def bench(): @cli.group(cls=_NaturalOrderGroup) -def spot(): - """Managed Spot CLI (spot instances with auto-recovery).""" +def jobs(): + """Managed Jobs CLI (jobs with auto-recovery).""" pass -@spot.command('launch', cls=_DocumentedCodeCommand) +@jobs.command('launch', cls=_DocumentedCodeCommand) @click.argument('entrypoint', required=True, type=str, @@ -3160,10 +3434,16 @@ def spot(): **_get_shell_complete_args(_complete_file_name)) # TODO(zhwu): Add --dryrun option to test the launch command. 
@_add_click_options(_TASK_OPTIONS_WITH_NAME + _EXTRA_RESOURCES_OPTIONS)
-@click.option('--spot-recovery',
+@click.option('--cluster',
+              '-c',
              default=None,
              type=str,
-              help='Spot recovery strategy to use for the managed spot task.')
+              hidden=True,
+              help=('Alias for --name, the name of the managed job.'))
+@click.option('--job-recovery',
+              default=None,
+              type=str,
+              help='Recovery strategy to use for managed jobs.')
@click.option(
    '--detach-run',
    '-d',
@@ -3181,8 +3461,8 @@ def spot():
    '(Default: True; this flag is deprecated and will be removed in a '
    'future release.) Whether to retry provisioning infinitely until the '
    'cluster is up, if unavailability errors are encountered. This '  # pylint: disable=bad-docstring-quotes
-    'applies to launching the spot clusters (both the initial and any '
-    'recovery attempts), not the spot controller.'))
+    'applies to launching all managed jobs (both the initial and '
+    'any recovery attempts), not the jobs controller.'))
@click.option('--yes',
              '-y',
              is_flag=True,
@@ -3191,9 +3471,10 @@ def spot():
              help='Skip confirmation prompt.')
@timeline.event
@usage_lib.entrypoint
-def spot_launch(
-    entrypoint: List[str],
+def jobs_launch(
+    entrypoint: Tuple[str, ...],
    name: Optional[str],
+    cluster: Optional[str],
    workdir: Optional[str],
    cloud: Optional[str],
    region: Optional[str],
@@ -3205,7 +3486,7 @@ def spot_launch(
    num_nodes: Optional[int],
    use_spot: Optional[bool],
    image_id: Optional[str],
-    spot_recovery: Optional[str],
+    job_recovery: Optional[str],
    env_file: Optional[Dict[str, str]],
    env: List[Tuple[str, str]],
    disk_size: Optional[int],
@@ -3215,7 +3496,7 @@ def spot_launch(
    retry_until_up: bool,
    yes: bool,
):
-    """Launch a managed spot job from a YAML or a command.
+    """Launch a managed job from a YAML or a command.

    If ENTRYPOINT points to a valid YAML file, it is read in as the task
    specification. Otherwise, it is interpreted as a bash command.
@@ -3225,10 +3506,15 @@ def spot_launch(
    .. code-block:: bash

        # You can use normal task YAMLs.
-        sky spot launch task.yaml
+        sky jobs launch task.yaml

-        sky spot launch 'echo hello!'
+        sky jobs launch 'echo hello!'
    """
+    if cluster is not None:
+        if name is not None and name != cluster:
+            raise click.UsageError('Cannot specify both --name and --cluster. '
+                                   'Use only one of the flags; they are '
+                                   'aliases.')
+        name = cluster
    env = _merge_env_vars(env_file, env)
    task_or_dag = _make_task_or_dag_from_entrypoint_with_overrides(
        entrypoint,
@@ -3248,16 +3534,17 @@ def spot_launch(
        disk_size=disk_size,
        disk_tier=disk_tier,
        ports=ports,
-        spot_recovery=spot_recovery,
+        job_recovery=job_recovery,
    )
-    # Deprecation.
+    # Deprecation. We set the default behavior to be retry until up, and the
+    # flag `--retry-until-up` is deprecated. We can remove the flag in 0.8.0.
    if retry_until_up is not None:
        flag_str = '--retry-until-up'
        if not retry_until_up:
            flag_str = '--no-retry-until-up'
        click.secho(
            f'Flag {flag_str} is deprecated and will be removed in a '
-            'future release (managed spot jobs will always be retried). '
+            'future release (managed jobs will always be retried). '
            'Please file an issue if this does not work for you.',
            fg='yellow')
    else:
@@ -3275,27 +3562,26 @@ def spot_launch(
    dag.name = name

    dag_utils.maybe_infer_and_fill_dag_and_task_names(dag)
-    dag_utils.fill_default_spot_config_in_dag_for_spot_launch(dag)
+    dag_utils.fill_default_config_in_dag_for_job_launch(dag)

-    click.secho(
-        f'Managed spot job {dag.name!r} will be launched on (estimated):',
-        fg='yellow')
+    click.secho(f'Managed job {dag.name!r} will be launched on (estimated):',
+                fg='yellow')
    dag = sky.optimize(dag)

    if not yes:
-        prompt = f'Launching the spot job {dag.name!r}. Proceed?'
+        prompt = f'Launching a managed job {dag.name!r}. Proceed?'
        if prompt is not None:
            click.confirm(prompt, default=True, abort=True, show_default=True)

    common_utils.check_cluster_name_is_valid(name)

-    spot_lib.launch(dag,
-                    name,
-                    detach_run=detach_run,
-                    retry_until_up=retry_until_up)
+    managed_jobs.launch(dag,
+                        name,
+                        detach_run=detach_run,
+                        retry_until_up=retry_until_up)


-@spot.command('queue', cls=_DocumentedCodeCommand)
+@jobs.command('queue', cls=_DocumentedCodeCommand)
@click.option('--all',
              '-a',
              default=False,
@@ -3308,7 +3594,7 @@ def spot_launch(
    default=False,
    is_flag=True,
    required=False,
-    help='Query the latest statuses, restarting the spot controller if stopped.'
+    help='Query the latest statuses, restarting the jobs controller if stopped.'
)
@click.option('--skip-finished',
              '-s',
@@ -3318,21 +3604,21 @@ def spot_launch(
              help='Show only pending/running jobs\' information.')
@usage_lib.entrypoint
# pylint: disable=redefined-builtin
-def spot_queue(all: bool, refresh: bool, skip_finished: bool):
-    """Show statuses of managed spot jobs.
+def jobs_queue(all: bool, refresh: bool, skip_finished: bool):
+    """Show statuses of managed jobs.

-    Each spot job can have one of the following statuses:
+    Each managed job can have one of the following statuses:

-    - ``PENDING``: Job is waiting for a free slot on the spot controller to be
+    - ``PENDING``: Job is waiting for a free slot on the jobs controller to be
      accepted.

-    - ``SUBMITTED``: Job is submitted to and accepted by the spot controller.
+    - ``SUBMITTED``: Job is submitted to and accepted by the jobs controller.

-    - ``STARTING``: Job is starting (provisioning a spot cluster).
+    - ``STARTING``: Job is starting (provisioning a cluster for the job).

    - ``RUNNING``: Job is running.

-    - ``RECOVERING``: The spot cluster is recovering from a preemption.
+    - ``RECOVERING``: The cluster of the job is recovering from a preemption.

    - ``SUCCEEDED``: Job succeeded.

@@ -3355,12 +3641,12 @@ def spot_queue(all: bool, refresh: bool, skip_finished: bool):
    - ``FAILED_CONTROLLER``: Job failed due to an unexpected error in the spot
      controller.

-    If the job failed, either due to user code or spot unavailability, the
-    error log can be found with ``sky spot logs --controller``, e.g.:
+    If the job failed, either due to user code or resource unavailability, the
+    error log can be found with ``sky jobs logs --controller``, e.g.:

    .. code-block:: bash

-        sky spot logs --controller job_id
+        sky jobs logs --controller job_id

    This also shows the logs for provisioning and any preemption and recovery
    attempts.

@@ -3369,37 +3655,37 @@ def spot_queue(all: bool, refresh: bool, skip_finished: bool):

    ..
code-block:: bash - watch -n60 sky spot queue + watch -n60 sky jobs queue """ - click.secho('Fetching managed spot job statuses...', fg='yellow') - with rich_utils.safe_status('[cyan]Checking spot jobs[/]'): - _, msg = _get_spot_jobs(refresh=refresh, - skip_finished=skip_finished, - show_all=all, - is_called_by_user=True) + click.secho('Fetching managed job statuses...', fg='yellow') + with rich_utils.safe_status('[cyan]Checking managed jobs[/]'): + _, msg = _get_managed_jobs(refresh=refresh, + skip_finished=skip_finished, + show_all=all, + is_called_by_user=True) if not skip_finished: in_progress_only_hint = '' else: in_progress_only_hint = ' (showing in-progress jobs only)' click.echo(f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}' - f'Managed spot jobs{colorama.Style.RESET_ALL}' + f'Managed jobs{colorama.Style.RESET_ALL}' f'{in_progress_only_hint}\n{msg}') -@spot.command('cancel', cls=_DocumentedCodeCommand) +@jobs.command('cancel', cls=_DocumentedCodeCommand) @click.option('--name', '-n', required=False, type=str, - help='Managed spot job name to cancel.') + help='Managed job name to cancel.') @click.argument('job_ids', default=None, type=int, required=False, nargs=-1) @click.option('--all', '-a', is_flag=True, default=False, required=False, - help='Cancel all managed spot jobs.') + help='Cancel all managed jobs.') @click.option('--yes', '-y', is_flag=True, @@ -3408,8 +3694,8 @@ def spot_queue(all: bool, refresh: bool, skip_finished: bool): help='Skip confirmation prompt.') @usage_lib.entrypoint # pylint: disable=redefined-builtin -def spot_cancel(name: Optional[str], job_ids: Tuple[int], all: bool, yes: bool): - """Cancel managed spot jobs. +def jobs_cancel(name: Optional[str], job_ids: Tuple[int], all: bool, yes: bool): + """Cancel managed jobs. You can provide either a job name or a list of job IDs to be cancelled. They are exclusive options. @@ -3418,15 +3704,15 @@ def spot_cancel(name: Optional[str], job_ids: Tuple[int], all: bool, yes: bool): .. code-block:: bash - # Cancel managed spot job with name 'my-job' - $ sky spot cancel -n my-job + # Cancel managed job with name 'my-job' + $ sky jobs cancel -n my-job \b - # Cancel managed spot jobs with IDs 1, 2, 3 - $ sky spot cancel 1 2 3 + # Cancel managed jobs with IDs 1, 2, 3 + $ sky jobs cancel 1 2 3 """ backend_utils.is_controller_accessible( - controller_type=controller_utils.Controllers.SPOT_CONTROLLER, - stopped_message='All managed spot jobs should have finished.', + controller=controller_utils.Controllers.JOBS_CONTROLLER, + stopped_message='All managed jobs should have finished.', exit_if_not_accessible=True) job_id_str = ','.join(map(str, job_ids)) @@ -3439,24 +3725,24 @@ def spot_cancel(name: Optional[str], job_ids: Tuple[int], all: bool, yes: bool): f'Provided {argument_str!r}.') if not yes: - job_identity_str = (f'managed spot jobs with IDs {job_id_str}' + job_identity_str = (f'managed jobs with IDs {job_id_str}' if job_ids else repr(name)) if all: - job_identity_str = 'all managed spot jobs' + job_identity_str = 'all managed jobs' click.confirm(f'Cancelling {job_identity_str}. 
Proceed?', default=True, abort=True, show_default=True) - spot_lib.cancel(job_ids=job_ids, name=name, all=all) + managed_jobs.cancel(job_ids=job_ids, name=name, all=all) -@spot.command('logs', cls=_DocumentedCodeCommand) +@jobs.command('logs', cls=_DocumentedCodeCommand) @click.option('--name', '-n', required=False, type=str, - help='Managed spot job name.') + help='Managed job name.') @click.option( '--follow/--no-follow', is_flag=True, @@ -3471,22 +3757,20 @@ def spot_cancel(name: Optional[str], job_ids: Tuple[int], all: bool, yes: bool): 'launching/recoveries, etc.')) @click.argument('job_id', required=False, type=int) @usage_lib.entrypoint -def spot_logs(name: Optional[str], job_id: Optional[int], follow: bool, +def jobs_logs(name: Optional[str], job_id: Optional[int], follow: bool, controller: bool): - """Tail the log of a managed spot job.""" + """Tail the log of a managed job.""" try: - if controller: - core.tail_logs(spot_lib.SPOT_CONTROLLER_NAME, - job_id=job_id, - follow=follow) - else: - spot_lib.tail_logs(name=name, job_id=job_id, follow=follow) - except exceptions.ClusterNotUpError as e: - click.echo(e) - sys.exit(1) + managed_jobs.tail_logs(name=name, + job_id=job_id, + follow=follow, + controller=controller) + except exceptions.ClusterNotUpError: + with ux_utils.print_exception_no_traceback(): + raise -@spot.command('dashboard', cls=_DocumentedCodeCommand) +@jobs.command('dashboard', cls=_DocumentedCodeCommand) @click.option( '--port', '-p', @@ -3496,19 +3780,18 @@ def spot_logs(name: Optional[str], job_id: Optional[int], follow: bool, help=('Local port to use for the dashboard. If None, a free port is ' 'automatically chosen.')) @usage_lib.entrypoint -def spot_dashboard(port: Optional[int]): - """Opens a dashboard for spot jobs (needs controller to be UP).""" +def jobs_dashboard(port: Optional[int]): + """Opens a dashboard for managed jobs (needs controller to be UP).""" # TODO(zongheng): ideally, the controller/dashboard server should expose the # API perhaps via REST. Then here we would (1) not have to use SSH to try to # see if the controller is UP first, which is slow; (2) not have to run SSH # port forwarding first (we'd just launch a local dashboard which would make # REST API calls to the controller dashboard server). - click.secho('Checking if spot controller is up...', fg='yellow') - hint = ( - 'Dashboard is not available if spot controller is not up. Run a spot ' - 'job first.') + click.secho('Checking if jobs controller is up...', fg='yellow') + hint = ('Dashboard is not available if jobs controller is not up. Run a ' + 'managed job first.') backend_utils.is_controller_accessible( - controller_type=controller_utils.Controllers.SPOT_CONTROLLER, + controller=controller_utils.Controllers.JOBS_CONTROLLER, stopped_message=hint, non_existent_message=hint, exit_if_not_accessible=True) @@ -3519,8 +3802,9 @@ def spot_dashboard(port: Optional[int]): free_port = common_utils.find_free_port(remote_port) else: free_port = port - ssh_command = (f'ssh -qNL {free_port}:localhost:{remote_port} ' - f'{spot_lib.SPOT_CONTROLLER_NAME}') + ssh_command = ( + f'ssh -qNL {free_port}:localhost:{remote_port} ' + f'{controller_utils.Controllers.JOBS_CONTROLLER.value.cluster_name}') click.echo('Forwarding port: ', nl=False) click.secho(f'{ssh_command}', dim=True) @@ -3539,12 +3823,31 @@ def spot_dashboard(port: Optional[int]): try: os.killpg(os.getpgid(ssh_process.pid), signal.SIGTERM) except ProcessLookupError: - # This happens if spot controller is auto-stopped. 
+ # This happens if jobs controller is auto-stopped. pass finally: click.echo('Exiting.') +# TODO(zhwu): Backward compatibility for the old `sky spot launch` command. +# It is now renamed to `sky jobs launch` and the old command is deprecated. +# Remove in v0.8.0. +@cli.group(cls=_NaturalOrderGroup) +def spot(): + """Alias for Managed Jobs CLI (default to managed spot jobs).""" + pass + + +_add_command_alias(jobs, + jobs_launch, + new_group=spot, + override_command_argument={'use_spot': True}) +_add_command_alias(jobs, jobs_queue, new_group=spot) +_add_command_alias(jobs, jobs_logs, new_group=spot) +_add_command_alias(jobs, jobs_cancel, new_group=spot) +_add_command_alias(jobs, jobs_dashboard, new_group=spot) + + @cli.group(cls=_NaturalOrderGroup) def serve(): """SkyServe CLI (multi-region, multi-cloud serving).""" @@ -3553,7 +3856,7 @@ def serve(): def _generate_task_with_service( service_name: str, - service_yaml_args: List[str], + service_yaml_args: Tuple[str, ...], workdir: Optional[str], cloud: Optional[str], region: Optional[str], @@ -3658,7 +3961,7 @@ def _generate_task_with_service( @timeline.event @usage_lib.entrypoint def serve_up( - service_yaml: List[str], + service_yaml: Tuple[str, ...], service_name: Optional[str], workdir: Optional[str], cloud: Optional[str], @@ -3776,7 +4079,7 @@ def serve_up( @usage_lib.entrypoint def serve_update( service_name: str, - service_yaml: List[str], + service_yaml: Tuple[str, ...], workdir: Optional[str], cloud: Optional[str], region: Optional[str], @@ -4040,7 +4343,7 @@ def serve_down(service_names: List[str], all: bool, purge: bool, yes: bool): f'Provided {argument_str!r}.') backend_utils.is_controller_accessible( - controller_type=controller_utils.Controllers.SKY_SERVE_CONTROLLER, + controller=controller_utils.Controllers.SKY_SERVE_CONTROLLER, stopped_message='All services should have been terminated.', exit_if_not_accessible=True) @@ -4122,9 +4425,9 @@ def serve_logs( target=target_component, replica_id=replica_id, follow=follow) - except exceptions.ClusterNotUpError as e: - click.echo(e) - sys.exit(1) + except exceptions.ClusterNotUpError: + with ux_utils.print_exception_no_traceback(): + raise # ============================== @@ -4811,13 +5114,14 @@ def local_up(gpus: bool): f'exists.{style.RESET_ALL}\n' 'If you want to delete it instead, run: sky local down') else: - click.echo('Failed to create local cluster. ' - f'Full log: {log_path}' - f'\nError: {style.BRIGHT}{stderr}{style.RESET_ALL}') - sys.exit(1) + with ux_utils.print_exception_no_traceback(): + raise RuntimeError( + 'Failed to create local cluster. ' + f'Full log: {log_path}' + f'\nError: {style.BRIGHT}{stderr}{style.RESET_ALL}') # Run sky check with rich_utils.safe_status('[bold cyan]Running sky check...'): - sky_check.check(quiet=True) + sky_check.check(clouds=['kubernetes'], quiet=True) if cluster_created: # Prepare completion message which shows CPU and GPU count # Get number of CPUs @@ -4911,14 +5215,15 @@ def local_down(): elif returncode == 100: click.echo('\nLocal cluster does not exist.') else: - click.echo('Failed to create local cluster. ' - f'Stdout: {stdout}' - f'\nError: {style.BRIGHT}{stderr}{style.RESET_ALL}') - sys.exit(1) + with ux_utils.print_exception_no_traceback(): + raise RuntimeError( + 'Failed to create local cluster. 
' + f'Stdout: {stdout}' + f'\nError: {style.BRIGHT}{stderr}{style.RESET_ALL}') if cluster_removed: # Run sky check with rich_utils.safe_status('[bold cyan]Running sky check...'): - sky_check.check(quiet=True) + sky_check.check(clouds=['kubernetes'], quiet=True) click.echo( f'{colorama.Fore.GREEN}Local cluster removed.{style.RESET_ALL}') diff --git a/sky/clouds/__init__.py b/sky/clouds/__init__.py index 6f3cb642542..af8fb0f3032 100644 --- a/sky/clouds/__init__.py +++ b/sky/clouds/__init__.py @@ -1,7 +1,7 @@ """Clouds in Sky.""" from sky.clouds.cloud import Cloud -from sky.clouds.cloud import cloud_in_list +from sky.clouds.cloud import cloud_in_iterable from sky.clouds.cloud import CloudImplementationFeatures from sky.clouds.cloud import ProvisionerVersion from sky.clouds.cloud import Region @@ -47,5 +47,5 @@ 'StatusVersion', 'Fluidstack', # Utility functions - 'cloud_in_list', + 'cloud_in_iterable', ] diff --git a/sky/clouds/aws.py b/sky/clouds/aws.py index 6008c4d080d..b2b55e14d5e 100644 --- a/sky/clouds/aws.py +++ b/sky/clouds/aws.py @@ -37,7 +37,7 @@ # It has the following purposes: # - make all nodes (any cloud) able to access private S3 buckets # - make some remote nodes able to launch new nodes on AWS (i.e., makes -# AWS head node able to launch AWS workers, or any-cloud spot controller +# AWS head node able to launch AWS workers, or any-cloud jobs controller # able to launch spot clusters on AWS). # # If we detect the current user identity is AWS SSO, we will not upload this @@ -81,6 +81,8 @@ class AWSIdentityType(enum.Enum): IAM_ROLE = 'iam-role' + CONTAINER_ROLE = 'container-role' + # Name Value Type Location # ---- ----- ---- -------- # profile None None @@ -338,9 +340,6 @@ def get_egress_cost(self, num_gigabytes: float): cost += 0.0 return cost - def is_same_cloud(self, other: clouds.Cloud): - return isinstance(other, AWS) - @classmethod def get_default_instance_type( cls, @@ -541,10 +540,16 @@ def check_credentials(cls) -> Tuple[bool, Optional[str]]: elif identity_type == AWSIdentityType.IAM_ROLE: # When using an IAM role, the credentials may not exist in the # ~/.aws/credentials file. So we don't check for the existence of the - # file. This will happen when the user is on a VM (or spot-controller) - # created by an SSO account, i.e. the VM will be assigned the IAM - # role: skypilot-v1. + # file. This will happen when the user is on a VM (or + # jobs-controller) created by an SSO account, i.e. the VM will be + # assigned the IAM role: skypilot-v1. hints = f'AWS IAM role is set.{single_cloud_hint}' + elif identity_type == AWSIdentityType.CONTAINER_ROLE: + # Similar to the IAM ROLE, an ECS container may not store credentials + # in the~/.aws/credentials file. So we don't check for the existence of + # the file. i.e. the container will be assigned the IAM role of the + # task: skypilot-v1. + hints = f'AWS container-role is set.{single_cloud_hint}' else: # This file is required because it is required by the VMs launched on # other clouds to access private s3 buckets and resources like EC2. @@ -604,6 +609,8 @@ def _is_access_key_of_type(type_str: str) -> bool: return AWSIdentityType.SSO elif _is_access_key_of_type(AWSIdentityType.IAM_ROLE.value): return AWSIdentityType.IAM_ROLE + elif _is_access_key_of_type(AWSIdentityType.CONTAINER_ROLE.value): + return AWSIdentityType.CONTAINER_ROLE elif _is_access_key_of_type(AWSIdentityType.ENV.value): return AWSIdentityType.ENV else: @@ -745,7 +752,7 @@ def get_credential_file_mounts(self) -> Dict[str, str]: # credentials. 
We need to define a mechanism to find out the cloud # provider of the cluster to be launched in this function and make sure # the cluster will not be used for launching clusters in other clouds, - # e.g. spot controller. + # e.g. jobs controller. if self._current_identity_type( ) != AWSIdentityType.SHARED_CREDENTIALS_FILE: return {} @@ -962,3 +969,22 @@ def delete_image(cls, image_id: str, region: Optional[str]) -> None: error_msg=f'Failed to delete image {image_id!r} on {region}.', stderr=stderr, stream_logs=True) + + @classmethod + def is_label_valid(cls, label_key: str, + label_value: str) -> Tuple[bool, Optional[str]]: + key_regex = re.compile(r'^[^aws:][\S]{0,127}$') + value_regex = re.compile(r'^[\S]{0,255}$') + key_valid = bool(key_regex.match(label_key)) + value_valid = bool(value_regex.match(label_value)) + error_msg = None + if not key_valid: + error_msg = (f'Invalid tag key {label_key} for AWS. ' + 'Key must start with any character except \'aws:\' ' + 'and must be 128 characters or fewer in length.') + if not value_valid: + error_msg = (f'Invalid tag value {label_value} for AWS. ' + 'Value must be 256 characters or fewer in length.') + if not key_valid or not value_valid: + return False, error_msg + return True, None diff --git a/sky/clouds/azure.py b/sky/clouds/azure.py index a0e16c83471..4df1cd4a4bf 100644 --- a/sky/clouds/azure.py +++ b/sky/clouds/azure.py @@ -134,9 +134,6 @@ def get_egress_cost(self, num_gigabytes: float): cost += 0.0 return cost - def is_same_cloud(self, other): - return isinstance(other, Azure) - @classmethod def get_image_size(cls, image_id: str, region: Optional[str]) -> float: if region is None: @@ -163,7 +160,7 @@ def get_image_size(cls, image_id: str, region: Optional[str]) -> float: try: image = compute_client.virtual_machine_images.get( region, publisher, offer, sku, version) - except azure.exceptions().ResourceNotFoundError() as e: + except azure.exceptions().ResourceNotFoundError as e: with ux_utils.print_exception_no_traceback(): raise ValueError(f'Image not found: {image_id}') from e if image.os_disk_image is None: @@ -294,7 +291,8 @@ def make_deploy_resources_variables( else: custom_resources = None - if resources.image_id is None: + if (resources.image_id is None or + resources.extract_docker_image() is not None): # pylint: disable=import-outside-toplevel from sky.clouds.service_catalog import azure_catalog gen_version = azure_catalog.get_gen_version_from_instance_type( diff --git a/sky/clouds/cloud.py b/sky/clouds/cloud.py index 34000fb9a16..c5ff78e1c79 100644 --- a/sky/clouds/cloud.py +++ b/sky/clouds/cloud.py @@ -42,7 +42,8 @@ class CloudImplementationFeatures(enum.Enum): CUSTOM_DISK_TIER = 'custom_disk_tier' OPEN_PORTS = 'open_ports' STORAGE_MOUNTING = 'storage_mounting' - HOST_CONTROLLERS = 'host_controllers' # Can run spot/serve controllers + HOST_CONTROLLERS = 'host_controllers' # Can run jobs/serve controllers + AUTO_TERMINATE = 'auto_terminate' # Pod/VM can stop or down itself class Region(collections.namedtuple('Region', ['name'])): @@ -246,8 +247,8 @@ def get_egress_cost(self, num_gigabytes: float): """ raise NotImplementedError - def is_same_cloud(self, other: 'Cloud'): - raise NotImplementedError + def is_same_cloud(self, other: 'Cloud') -> bool: + return isinstance(other, self.__class__) def make_deploy_resources_variables( self, @@ -320,6 +321,25 @@ def is_image_tag_valid(cls, image_tag: str, region: Optional[str]) -> bool: region, clouds=cls._REPR.lower()) + @classmethod + def is_label_valid(cls, label_key: str, + label_value: 
str) -> Tuple[bool, Optional[str]]:
+        """Validates that the label key and value are valid for this cloud.
+
+        Labels can be implemented in different ways across clouds. For example,
+        on AWS we use instance tags, on GCP we use labels, and on Kubernetes we
+        use labels. This method should be implemented to validate the label
+        format for the cloud.
+
+        Returns:
+            A tuple of a boolean indicating whether the label is valid and an
+            optional string describing the reason if the label is invalid.
+        """
+        # If a cloud does not support labels, they are ignored. Only clouds
+        # that support labels implement this method.
+        del label_key, label_value
+        return True, None
+
    def get_feasible_launchable_resources(
            self,
            resources: 'resources_lib.Resources',
@@ -477,15 +497,16 @@ def validate_region_zone(
                                                  zone,
                                                  clouds=self._REPR.lower())

-    def need_cleanup_after_preemption(
+    def need_cleanup_after_preemption_or_failure(
            self, resources: 'resources_lib.Resources') -> bool:
-        """Returns whether a spot resource needs cleanup after preeemption.
+        """Whether a resource needs cleanup after preemption or failure.

        In most cases, spot resources do not need cleanup after preemption,
        as long as the cluster can be relaunched with the same name and tag,
        no matter the preemption behavior is to terminate or stop the cluster.
-        The only exception by far is GCP's Spot TPU VM. We override this method
-        in gcp.py.
+        The same applies to on-demand resources that go into maintenance
+        mode. The only exception so far is GCP's TPU VM. We override this
+        method in gcp.py.
        """
        del resources
        return False
@@ -746,6 +767,6 @@ def __getstate__(self):


# === Helper functions ===


-def cloud_in_list(cloud: Cloud, cloud_list: Iterable[Cloud]) -> bool:
+def cloud_in_iterable(cloud: Cloud, cloud_list: Iterable[Cloud]) -> bool:
    """Returns whether the cloud is in the given cloud list."""
    return any(cloud.is_same_cloud(c) for c in cloud_list)
diff --git a/sky/clouds/cudo.py b/sky/clouds/cudo.py
index ad7a22e6e03..1a32bb0bd2c 100644
--- a/sky/clouds/cudo.py
+++ b/sky/clouds/cudo.py
@@ -46,6 +46,15 @@ class Cudo(clouds.Cloud):
        'https://skypilot.readthedocs.io/en/latest/getting-started/installation.html'
    )

+    _PROJECT_HINT = (
+        'Create a project and then set it as the default project:\n'
+        f'{_INDENT_PREFIX}  $ cudoctl projects create my-project-name\n'
+        f'{_INDENT_PREFIX}  $ cudoctl init\n'
+        f'{_INDENT_PREFIX}For more info: '
+        # pylint: disable=line-too-long
+        'https://skypilot.readthedocs.io/en/latest/getting-started/installation.html'
+    )
+
    _CLOUD_UNSUPPORTED_FEATURES = {
        clouds.CloudImplementationFeatures.STOP: 'Stopping not supported.',
        clouds.CloudImplementationFeatures.SPOT_INSTANCE:
@@ -155,10 +164,6 @@ def get_egress_cost(self, num_gigabytes: float) -> float:
        # `return 0.0` is a good placeholder.)
        return 0.0

-    def is_same_cloud(self, other: clouds.Cloud) -> bool:
-        # Returns true if the two clouds are the same cloud type.
-        return isinstance(other, Cudo)
-
    @classmethod
    def get_default_instance_type(
            cls,
@@ -293,7 +298,9 @@ def check_credentials(cls) -> Tuple[bool, Optional[str]]:
        project_id, error = cudo_api.get_project_id()
        if error is not None:
            return False, (
-                f'Error getting project '
+                f'Default project is not set. 
' + f'{cls._PROJECT_HINT}\n' + f'{cls._INDENT_PREFIX}' f'{common_utils.format_exception(error, use_bracket=True)}') try: api = cudo_api.virtual_machines() diff --git a/sky/clouds/fluidstack.py b/sky/clouds/fluidstack.py index 30274c8b665..d7921a3f51a 100644 --- a/sky/clouds/fluidstack.py +++ b/sky/clouds/fluidstack.py @@ -50,6 +50,9 @@ class Fluidstack(clouds.Cloud): clouds.CloudImplementationFeatures.CUSTOM_DISK_TIER: 'Custom disk tiers' f' is not supported in {_REPR}.', + clouds.CloudImplementationFeatures.HOST_CONTROLLERS: + 'Host controllers' + f' are not supported in {_REPR}.', } # Using the latest SkyPilot provisioner API to provision and check status. PROVISIONER_VERSION = clouds.ProvisionerVersion.SKYPILOT @@ -137,10 +140,6 @@ def get_egress_cost(self, num_gigabytes: float) -> float: def __repr__(self): return 'Fluidstack' - def is_same_cloud(self, other: clouds.Cloud) -> bool: - # Returns true if the two clouds are the same cloud type. - return isinstance(other, Fluidstack) - @classmethod def get_default_instance_type( cls, diff --git a/sky/clouds/gcp.py b/sky/clouds/gcp.py index 4f5fefd9b10..7e7dacc539f 100644 --- a/sky/clouds/gcp.py +++ b/sky/clouds/gcp.py @@ -327,9 +327,6 @@ def get_egress_cost(self, num_gigabytes: float): else: return 0.08 * num_gigabytes - def is_same_cloud(self, other): - return isinstance(other, GCP) - @classmethod def _is_machine_image(cls, image_id: str) -> bool: find_machine = re.match(r'projects/.*/.*/machineImages/.*', image_id) @@ -473,8 +470,8 @@ def make_deploy_resources_variables( # CUDA driver version 535.86.10, CUDA Library 12.2 image_id = 'skypilot:gpu-debian-11' - if resources.image_id is not None and resources.extract_docker_image( - ) is None: + if (resources.image_id is not None and + resources.extract_docker_image() is None): if None in resources.image_id: image_id = resources.image_id[None] else: @@ -757,13 +754,13 @@ def check_credentials(cls) -> Tuple[bool, Optional[str]]: # pylint: disable=import-outside-toplevel,unused-import import google.auth - import googleapiclient.discovery # This takes user's credential info from "~/.config/gcloud/application_default_credentials.json". # pylint: disable=line-too-long credentials, project = google.auth.default() - crm = googleapiclient.discovery.build('cloudresourcemanager', - 'v1', - credentials=credentials) + crm = gcp.build('cloudresourcemanager', + 'v1', + credentials=credentials, + cache_discovery=False) gcp_minimal_permissions = gcp_utils.get_minimal_permissions() permissions = {'permissions': gcp_minimal_permissions} request = crm.projects().testIamPermissions(resource=project, @@ -859,13 +856,14 @@ def get_current_user_identity_str(cls) -> Optional[str]: def instance_type_exists(self, instance_type): return service_catalog.instance_type_exists(instance_type, 'gcp') - def need_cleanup_after_preemption(self, - resources: 'resources.Resources') -> bool: - """Returns whether a spot resource needs cleanup after preeemption.""" + def need_cleanup_after_preemption_or_failure( + self, resources: 'resources.Resources') -> bool: + """Whether a resource needs cleanup after preeemption or failure.""" # Spot TPU VMs require manual cleanup after preemption. # "If your Cloud TPU is preempted, # you must delete it and create a new one ..." # See: https://cloud.google.com/tpu/docs/preemptible#tpu-vm + # On-demand TPU VMs are likely to require manual cleanup as well. 
return gcp_utils.is_tpu_vm(resources) @@ -1087,3 +1085,25 @@ def delete_image(cls, image_id: str, region: Optional[str]) -> None: error_msg=f'Failed to delete image {image_name!r}', stderr=stderr, stream_logs=True) + + @classmethod + def is_label_valid(cls, label_key: str, + label_value: str) -> Tuple[bool, Optional[str]]: + key_regex = re.compile(r'^[a-z]([a-z0-9_-]{0,62})?$') + value_regex = re.compile(r'^[a-z0-9_-]{0,63}$') + key_valid = bool(key_regex.match(label_key)) + value_valid = bool(value_regex.match(label_value)) + error_msg = None + condition_msg = ('can include lowercase alphanumeric characters, ' + 'dashes, and underscores, with a total length of 63 ' + 'characters or less.') + if not key_valid: + error_msg = (f'Invalid label key {label_key} for GCP. ' + f'Key must start with a lowercase letter ' + f'and {condition_msg}') + if not value_valid: + error_msg = (f'Invalid label value {label_value} for GCP. Value ' + f'{condition_msg}') + if not key_valid or not value_valid: + return False, error_msg + return True, None diff --git a/sky/clouds/ibm.py b/sky/clouds/ibm.py index 880ad212e25..86e325a437b 100644 --- a/sky/clouds/ibm.py +++ b/sky/clouds/ibm.py @@ -165,9 +165,6 @@ def get_egress_cost(self, num_gigabytes: float): num_gigabytes -= (num_gigabytes - price_threshold['threshold']) return cost - def is_same_cloud(self, other): - return isinstance(other, IBM) - def make_deploy_resources_variables( self, resources: 'resources_lib.Resources', diff --git a/sky/clouds/kubernetes.py b/sky/clouds/kubernetes.py index e6b3b5bbe9e..140190d9fde 100644 --- a/sky/clouds/kubernetes.py +++ b/sky/clouds/kubernetes.py @@ -1,6 +1,7 @@ """Kubernetes.""" import json import os +import re import typing from typing import Dict, Iterator, List, Optional, Tuple @@ -13,6 +14,7 @@ from sky.provision.kubernetes import utils as kubernetes_utils from sky.utils import common_utils from sky.utils import resources_utils +from sky.utils import schemas if typing.TYPE_CHECKING: # Renaming to avoid shadowing variables. @@ -24,14 +26,17 @@ DEFAULT_KUBECONFIG_PATH = '~/.kube/config' CREDENTIAL_PATH = os.environ.get('KUBECONFIG', DEFAULT_KUBECONFIG_PATH) +# Namespace for SkyPilot resources shared across multiple tenants on the +# same cluster (even if they might be running in different namespaces). +# E.g., FUSE device manager daemonset is run in this namespace. +_SKYPILOT_SYSTEM_NAMESPACE = 'skypilot-system' + @clouds.CLOUD_REGISTRY.register class Kubernetes(clouds.Cloud): """Kubernetes.""" SKY_SSH_KEY_SECRET_NAME = 'sky-ssh-keys' - SKY_SSH_KEY_SECRET_FIELD_NAME = \ - f'ssh-publickey-{common_utils.get_user_hash()}' SKY_SSH_JUMP_NAME = 'sky-ssh-jump-pod' PORT_FORWARD_PROXY_CMD_TEMPLATE = \ 'kubernetes-port-forward-proxy-command.sh.j2' @@ -46,6 +51,15 @@ class Kubernetes(clouds.Cloud): timeout = skypilot_config.get_nested(['kubernetes', 'provision_timeout'], 10) + # Limit the length of the cluster name to avoid exceeding the limit of 63 + # characters for Kubernetes resources. We limit to 42 characters (63-21) to + # allow additional characters for creating ingress services to expose ports. + # These services are named as {cluster_name_on_cloud}--skypilot-svc--{port}, + # where the suffix is 21 characters long. 
+ _MAX_CLUSTER_NAME_LEN_LIMIT = 42 + + _SUPPORTS_SERVICE_ACCOUNT_ON_REMOTE = True + _DEFAULT_NUM_VCPUS = 2 _DEFAULT_MEMORY_CPU_RATIO = 1 _DEFAULT_MEMORY_CPU_RATIO_WITH_GPU = 4 # Allocate more memory for GPU tasks @@ -65,13 +79,6 @@ class Kubernetes(clouds.Cloud): 'tiers are not ' 'supported in ' 'Kubernetes.', - # Kubernetes may be using exec-based auth, which may not work by - # directly copying the kubeconfig file to the controller. - # Support for service accounts for auth will be added in #3377, which - # will allow us to support hosting controllers. - clouds.CloudImplementationFeatures.HOST_CONTROLLERS: 'Kubernetes can ' - 'not host ' - 'controllers.', } IMAGE_CPU = 'skypilot:cpu-ubuntu-2004' @@ -80,13 +87,34 @@ class Kubernetes(clouds.Cloud): PROVISIONER_VERSION = clouds.ProvisionerVersion.SKYPILOT STATUS_VERSION = clouds.StatusVersion.SKYPILOT + @property + def ssh_key_secret_field_name(self): + # Use a fresh user hash to avoid conflicts in the secret object naming. + # This can happen when the controller is reusing the same user hash + # through USER_ID_ENV_VAR but has a different SSH key. + fresh_user_hash = common_utils.get_user_hash(force_fresh_hash=True) + return f'ssh-publickey-{fresh_user_hash}' + @classmethod def _unsupported_features_for_resources( cls, resources: 'resources_lib.Resources' ) -> Dict[clouds.CloudImplementationFeatures, str]: unsupported_features = cls._CLOUD_UNSUPPORTED_FEATURES + is_exec_auth, message = kubernetes_utils.is_kubeconfig_exec_auth() + if is_exec_auth: + assert isinstance(message, str), message + # Controllers cannot spin up new pods with exec auth. + unsupported_features[ + clouds.CloudImplementationFeatures.HOST_CONTROLLERS] = message + # Pod does not have permissions to terminate itself with exec auth. 
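+            # (Exec-based auth shells out to a local credential plugin,
+            # which is unavailable from inside the pod.)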
+ unsupported_features[ + clouds.CloudImplementationFeatures.AUTO_TERMINATE] = message return unsupported_features + @classmethod + def max_cluster_name_length(cls) -> Optional[int]: + return cls._MAX_CLUSTER_NAME_LEN_LIMIT + @classmethod def regions(cls) -> List[clouds.Region]: return cls._regions @@ -123,9 +151,6 @@ def get_egress_cost(self, num_gigabytes: float) -> float: def __repr__(self): return self._REPR - def is_same_cloud(self, other: clouds.Cloud) -> bool: - return isinstance(other, Kubernetes) - @classmethod def get_port(cls, svc_name) -> int: ns = kubernetes_utils.get_current_kube_config_context_namespace() @@ -255,6 +280,27 @@ def make_deploy_resources_variables( port_mode = network_utils.get_port_mode(None) + remote_identity = skypilot_config.get_nested( + ('kubernetes', 'remote_identity'), + schemas.get_default_remote_identity('kubernetes')) + if (remote_identity == + schemas.RemoteIdentityOptions.LOCAL_CREDENTIALS.value): + # SA name doesn't matter since automounting credentials is disabled + k8s_service_account_name = 'default' + k8s_automount_sa_token = 'false' + elif (remote_identity == + schemas.RemoteIdentityOptions.SERVICE_ACCOUNT.value): + # Use the default service account + k8s_service_account_name = ( + kubernetes_utils.DEFAULT_SERVICE_ACCOUNT_NAME) + k8s_automount_sa_token = 'true' + else: + # User specified a custom service account + k8s_service_account_name = remote_identity + k8s_automount_sa_token = 'true' + + fuse_device_required = bool(resources.requires_fuse) + deploy_vars = { 'instance_type': resources.instance_type, 'custom_resources': custom_resources, @@ -271,6 +317,11 @@ def make_deploy_resources_variables( 'k8s_acc_label_value': k8s_acc_label_value, 'k8s_ssh_jump_name': self.SKY_SSH_JUMP_NAME, 'k8s_ssh_jump_image': ssh_jump_image, + 'k8s_service_account_name': k8s_service_account_name, + 'k8s_automount_sa_token': k8s_automount_sa_token, + 'k8s_fuse_device_required': fuse_device_required, + # Namespace to run the FUSE device manager in + 'k8s_skypilot_system_namespace': _SKYPILOT_SYSTEM_NAMESPACE, 'image_id': image_id, } @@ -326,30 +377,32 @@ def _make(instance_list): gpu_task_cpus, gpu_task_memory, acc_count, acc_type).name) # Check if requested instance type will fit in the cluster. - # TODO(romilb): This will fail early for autoscaling clusters. - fits, reason = kubernetes_utils.check_instance_fits( - chosen_instance_type) - if not fits: - logger.debug(f'Instance type {chosen_instance_type} does ' - 'not fit in the Kubernetes cluster. ' - f'Reason: {reason}') - return [], [] + autoscaler_type = kubernetes_utils.get_autoscaler_type() + if autoscaler_type is None: + # If autoscaler is not set, check if the instance type fits in the + # cluster. Else, rely on the autoscaler to provision the right + # instance type without running checks. Worst case, if autoscaling + # fails, the pod will be stuck in pending state until + # provision_timeout, after which failover will be triggered. + fits, reason = kubernetes_utils.check_instance_fits( + chosen_instance_type) + if not fits: + logger.debug(f'Instance type {chosen_instance_type} does ' + 'not fit in the Kubernetes cluster. 
'
+                                 f'Reason: {reason}')
+                    return [], []
 
         # No fuzzy lists for Kubernetes
         return _make([chosen_instance_type]), []
 
     @classmethod
     def check_credentials(cls) -> Tuple[bool, Optional[str]]:
-        if os.path.exists(os.path.expanduser(CREDENTIAL_PATH)):
-            # Test using python API
-            try:
-                return kubernetes_utils.check_credentials()
-            except Exception as e:  # pylint: disable=broad-except
-                return (False, 'Credential check failed: '
-                        f'{common_utils.format_exception(e)}')
-        else:
-            return (False, 'Credentials not found - '
-                    f'check if {CREDENTIAL_PATH} exists.')
+        # Test using python API
+        try:
+            return kubernetes_utils.check_credentials()
+        except Exception as e:  # pylint: disable=broad-except
+            return (False, 'Credential check failed: '
+                    f'{common_utils.format_exception(e)}')
 
     def get_credential_file_mounts(self) -> Dict[str, str]:
         if os.path.exists(os.path.expanduser(CREDENTIAL_PATH)):
@@ -388,3 +441,33 @@ def get_current_user_identity(cls) -> Optional[List[str]]:
             return [f'{cluster}_{user}_{namespace}']
         except k8s.config.config_exception.ConfigException:
             return None
+
+    @classmethod
+    def is_label_valid(cls, label_key: str,
+                       label_value: str) -> Tuple[bool, Optional[str]]:
+        # Kubernetes labels can be of the format <domain>/<key>: <value>
+        key_regex = re.compile(
+            # Look-ahead to ensure proper domain formatting up to a slash
+            r'^(?:(?=[a-z0-9]([-a-z0-9.]*[a-z0-9])?\/)'
+            # Match domain: starts and ends with alphanum up to 253 chars
+            # including a slash in the domain.
+            r'[a-z0-9]([-a-z0-9.]{0,251}[a-z0-9])?\/)?'
+            # Match key: starts and ends with alphanum, up to 63 chars.
+            r'[a-z0-9]([-a-z0-9_.]{0,61}[a-z0-9])?$')
+        value_regex = re.compile(
+            r'^([a-zA-Z0-9]([-a-zA-Z0-9_.]{0,61}[a-zA-Z0-9])?)?$')
+        key_valid = bool(key_regex.match(label_key))
+        value_valid = bool(value_regex.match(label_value))
+        error_msg = None
+        condition_msg = ('Value must consist of alphanumeric characters or '
+                         '\'-\', \'_\', \'.\', and must be no more than 63 '
+                         'characters in length.')
+        if not key_valid:
+            error_msg = (f'Invalid label key {label_key} for Kubernetes. '
+                         f'{condition_msg}')
+        if not value_valid:
+            error_msg = (f'Invalid label value {label_value} for Kubernetes. '
+                         f'{condition_msg}')
+        if not key_valid or not value_valid:
+            return False, error_msg
+        return True, None
diff --git a/sky/clouds/lambda_cloud.py b/sky/clouds/lambda_cloud.py
index 6f3e0bcea1d..979b4833354 100644
--- a/sky/clouds/lambda_cloud.py
+++ b/sky/clouds/lambda_cloud.py
@@ -45,6 +45,7 @@ class Lambda(clouds.Cloud):
         clouds.CloudImplementationFeatures.IMAGE_ID: f'Specifying image ID is not supported in {_REPR}.',
         clouds.CloudImplementationFeatures.CUSTOM_DISK_TIER: f'Custom disk tiers are not supported in {_REPR}.',
         clouds.CloudImplementationFeatures.OPEN_PORTS: f'Opening ports is currently not supported on {_REPR}.',
+        clouds.CloudImplementationFeatures.HOST_CONTROLLERS: f'Host controllers are not supported in {_REPR}.',
     }
 
     @classmethod
@@ -120,10 +121,6 @@ def get_egress_cost(self, num_gigabytes: float) -> float:
     def __repr__(self):
         return 'Lambda'
 
-    def is_same_cloud(self, other: clouds.Cloud) -> bool:
-        # Returns true if the two clouds are the same cloud type.
- return isinstance(other, Lambda) - @classmethod def get_default_instance_type( cls, diff --git a/sky/clouds/oci.py b/sky/clouds/oci.py index 03351fc4cf6..5fb0111bf01 100644 --- a/sky/clouds/oci.py +++ b/sky/clouds/oci.py @@ -160,10 +160,6 @@ def get_egress_cost(self, num_gigabytes: float) -> float: # return 0.0 return (num_gigabytes - 10 * 1024) * 0.0085 - def is_same_cloud(self, other: clouds.Cloud) -> bool: - # Returns true if the two clouds are the same cloud type. - return isinstance(other, OCI) - @classmethod def get_default_instance_type( cls, diff --git a/sky/clouds/paperspace.py b/sky/clouds/paperspace.py index f76772ab8b7..f67a9a27176 100644 --- a/sky/clouds/paperspace.py +++ b/sky/clouds/paperspace.py @@ -147,10 +147,6 @@ def get_egress_cost(self, num_gigabytes: float) -> float: def __repr__(self): return self._REPR - def is_same_cloud(self, other: clouds.Cloud) -> bool: - # Returns true if the two clouds are the same cloud type. - return isinstance(other, Paperspace) - @classmethod def get_default_instance_type( cls, diff --git a/sky/clouds/runpod.py b/sky/clouds/runpod.py index 527c45fafa0..c7a24e274dd 100644 --- a/sky/clouds/runpod.py +++ b/sky/clouds/runpod.py @@ -43,6 +43,8 @@ class RunPod(clouds.Cloud): ('Mounting object stores is not supported on RunPod. To read data ' 'from object stores on RunPod, use `mode: COPY` to copy the data ' 'to local disk.'), + clouds.CloudImplementationFeatures.HOST_CONTROLLERS: + ('Host controllers are not supported on RunPod.'), } _MAX_CLUSTER_NAME_LEN_LIMIT = 120 _regions: List[clouds.Region] = [] @@ -138,10 +140,6 @@ def accelerators_to_hourly_cost(self, def get_egress_cost(self, num_gigabytes: float) -> float: return 0.0 - def is_same_cloud(self, other: clouds.Cloud) -> bool: - # Returns true if the two clouds are the same cloud type. - return isinstance(other, RunPod) - @classmethod def get_default_instance_type( cls, diff --git a/sky/clouds/scp.py b/sky/clouds/scp.py index 1d6cb6cf20f..6a3daf2712a 100644 --- a/sky/clouds/scp.py +++ b/sky/clouds/scp.py @@ -144,10 +144,6 @@ def accelerators_to_hourly_cost(self, def get_egress_cost(self, num_gigabytes: float) -> float: return 0.0 - def is_same_cloud(self, other: clouds.Cloud) -> bool: - # Returns true if the two clouds are the same cloud type. - return isinstance(other, SCP) - @classmethod def get_default_instance_type( cls, diff --git a/sky/clouds/service_catalog/__init__.py b/sky/clouds/service_catalog/__init__.py index d380cce6757..7479cd77cf7 100644 --- a/sky/clouds/service_catalog/__init__.py +++ b/sky/clouds/service_catalog/__init__.py @@ -117,6 +117,46 @@ def list_accelerator_counts( return ret +def list_accelerator_realtime( + gpus_only: bool = True, + name_filter: Optional[str] = None, + region_filter: Optional[str] = None, + quantity_filter: Optional[int] = None, + clouds: CloudFilter = None, + case_sensitive: bool = True, +) -> Tuple[Dict[str, List[int]], Dict[str, int], Dict[str, int]]: + """List all accelerators offered by Sky with their realtime availability. + + Realtime availability is the total number of accelerators in the cluster + and number of accelerators available at the time of the call. + + Used for fixed size cluster settings, such as Kubernetes. + + Returns: + A tuple of three dictionaries mapping canonical accelerator names to: + - A list of available counts. (e.g., [1, 2, 4]) + - Total number of accelerators in the cluster (capacity). + - Number of accelerators available at the time of call (availability). 
+ """ + qtys_map, total_accelerators_capacity, total_accelerators_available = ( + _map_clouds_catalog(clouds, + 'list_accelerators_realtime', + gpus_only, + name_filter, + region_filter, + quantity_filter, + case_sensitive=case_sensitive, + all_regions=False, + require_price=False)) + accelerator_counts: Dict[str, List[int]] = collections.defaultdict(list) + for gpu, items in qtys_map.items(): + for item in items: + accelerator_counts[gpu].append(item.accelerator_count) + accelerator_counts[gpu] = sorted(accelerator_counts[gpu]) + return (accelerator_counts, total_accelerators_capacity, + total_accelerators_available) + + def instance_type_exists(instance_type: str, clouds: CloudFilter = None) -> bool: """Check the existence of a instance type.""" diff --git a/sky/clouds/service_catalog/data_fetchers/fetch_fluidstack.py b/sky/clouds/service_catalog/data_fetchers/fetch_fluidstack.py index 3dda75e2f07..5d50399ab89 100644 --- a/sky/clouds/service_catalog/data_fetchers/fetch_fluidstack.py +++ b/sky/clouds/service_catalog/data_fetchers/fetch_fluidstack.py @@ -4,7 +4,6 @@ python fetch_fluidstack_cloud.py """ -import copy import csv import json import os @@ -19,6 +18,8 @@ GPU_MAP = { 'H100_PCIE_80GB': 'H100', + 'H100_NVLINK_80GB': 'H100', + 'A100_NVLINK_80GB': 'A100-80GB', 'A100_SXM4_80GB': 'A100-80GB', 'A100_PCIE_80GB': 'A100-80GB', 'A100_SXM4_40GB': 'A100', @@ -40,13 +41,6 @@ 'RTX_3080_10GB': 'RTX3080', } -CUSTOM_PLANS_CONFIG = [ - dict(gpu_count=1, cpu_count=8, nvme_storage=750, ram=64), - dict(gpu_count=2, cpu_count=16, nvme_storage=1024, ram=128), - #dict(gpu_count=4, cpu_count=32, nvme_storage=1200, ram=160), - #dict(gpu_count=8, cpu_count=64, nvme_storage=1500, ram=200) -] - def get_regions(plans: List) -> dict: """Return a list of regions where the plan is available.""" @@ -57,35 +51,9 @@ def get_regions(plans: List) -> dict: return regions -def plans_from_custom_plan(plan: dict) -> List[dict]: - prices = dict(cpu=plan['price']['cpu']['hourly'], - ram=plan['price']['ram']['hourly'], - storage=plan['price']['storage']['hourly'], - gpu=plan['price']['gpu']['hourly']) - new_plans = [] - for i, config in enumerate(CUSTOM_PLANS_CONFIG): - new_plan = copy.deepcopy(plan) - price = (prices['cpu'] * - config['cpu_count']) + (prices['ram'] * config['ram']) + ( - prices['storage'] * config['nvme_storage']) + ( - prices['gpu'] * config['gpu_count']) - new_plan['price']['hourly'] = price / config['gpu_count'] - new_plan['configuration']['core_count'] = config['cpu_count'] - new_plan['configuration']['ram'] = config['ram'] - new_plan['configuration']['gpu_count'] = config['gpu_count'] - new_plan['configuration']['nvme_storage'] = config['nvme_storage'] - new_plan['plan_id'] = f'custom:{i}:{plan["plan_id"]}' - new_plans.append(new_plan) - return new_plans - - def create_catalog(output_dir: str) -> None: response = requests.get(ENDPOINT) plans = response.json() - custom_plans = [ - plan for plan in plans if plan['minimum_commitment'] == 'hourly' and - plan['type'] in ['custom'] and plan['gpu_type'] != 'NO GPU' - ] #plans = [plan for plan in plans if len(plan['regions']) > 0] plans = [ plan for plan in plans if plan['minimum_commitment'] == 'hourly' and @@ -93,9 +61,6 @@ def create_catalog(output_dir: str) -> None: plan['gpu_type'] not in ['NO GPU', 'RTX_3080_10GB', 'RTX_3090_24GB'] ] - plans = plans + [ - plan for plan in custom_plans for plan in plans_from_custom_plan(plan) - ] with open(os.path.join(output_dir, 'vms.csv'), mode='w', encoding='utf-8') as f: writer = csv.writer(f, delimiter=',', 
quotechar='"') @@ -114,7 +79,7 @@ def create_catalog(output_dir: str) -> None: try: gpu = GPU_MAP[plan['gpu_type']] except KeyError: - print(f'Could not map {plan["gpu_type"]}') + #print(f'Could not map {plan["gpu_type"]}') continue gpu_memory = int( str(plan['configuration']['gpu_memory']).replace('GB', diff --git a/sky/clouds/service_catalog/kubernetes_catalog.py b/sky/clouds/service_catalog/kubernetes_catalog.py index bd44847016e..a64aa8f72e9 100644 --- a/sky/clouds/service_catalog/kubernetes_catalog.py +++ b/sky/clouds/service_catalog/kubernetes_catalog.py @@ -3,6 +3,7 @@ Kubernetes does not require a catalog of instances, but we need an image catalog mapping SkyPilot image tags to corresponding container image tags. """ +import re import typing from typing import Dict, List, Optional, Set, Tuple @@ -46,38 +47,107 @@ def list_accelerators( case_sensitive: bool = True, all_regions: bool = False, require_price: bool = True) -> Dict[str, List[common.InstanceTypeInfo]]: + # TODO(romilb): We should consider putting a lru_cache() with TTL to + # avoid multiple calls to kubernetes API in a short period of time (e.g., + # from the optimizer). + return list_accelerators_realtime(gpus_only, name_filter, region_filter, + quantity_filter, case_sensitive, + all_regions, require_price)[0] + + +def list_accelerators_realtime( + gpus_only: bool, + name_filter: Optional[str], + region_filter: Optional[str], + quantity_filter: Optional[int], + case_sensitive: bool = True, + all_regions: bool = False, + require_price: bool = True +) -> Tuple[Dict[str, List[common.InstanceTypeInfo]], Dict[str, int], Dict[str, + int]]: del all_regions, require_price # Unused. k8s_cloud = Kubernetes() if not any( map(k8s_cloud.is_same_cloud, sky_check.get_cached_enabled_clouds_or_refresh()) ) or not kubernetes_utils.check_credentials()[0]: - return {} + return {}, {}, {} has_gpu = kubernetes_utils.detect_gpu_resource() if not has_gpu: - return {} + return {}, {}, {} label_formatter, _ = kubernetes_utils.detect_gpu_label_formatter() if not label_formatter: - return {} + return {}, {}, {} - accelerators: Set[Tuple[str, int]] = set() + accelerators_qtys: Set[Tuple[str, int]] = set() key = label_formatter.get_label_key() nodes = kubernetes_utils.get_kubernetes_nodes() + # Get the pods to get the real-time GPU usage + pods = kubernetes_utils.get_kubernetes_pods() + # Total number of GPUs in the cluster + total_accelerators_capacity: Dict[str, int] = {} + # Total number of GPUs currently available in the cluster + total_accelerators_available: Dict[str, int] = {} + min_quantity_filter = quantity_filter if quantity_filter else 1 + for node in nodes: if key in node.metadata.labels: + allocated_qty = 0 accelerator_name = label_formatter.get_accelerator_from_label_value( node.metadata.labels.get(key)) + + # Check if name_filter regex matches the accelerator_name + regex_flags = 0 if case_sensitive else re.IGNORECASE + if name_filter and not re.match( + name_filter, accelerator_name, flags=regex_flags): + continue + accelerator_count = int( node.status.allocatable.get('nvidia.com/gpu', 0)) + # Generate the GPU quantities for the accelerators if accelerator_name and accelerator_count > 0: for count in range(1, accelerator_count + 1): - accelerators.add((accelerator_name, count)) + accelerators_qtys.add((accelerator_name, count)) + + for pod in pods: + # Get all the pods running on the node + if (pod.spec.node_name == node.metadata.name and + pod.status.phase in ['Running', 'Pending']): + # Iterate over all the containers in the pod and 
sum the
+                    # GPU requests
+                    for container in pod.spec.containers:
+                        if container.resources.requests:
+                            allocated_qty += int(
+                                container.resources.requests.get(
+                                    'nvidia.com/gpu', 0))
+
+            accelerators_available = accelerator_count - allocated_qty
+
+            if accelerator_count >= min_quantity_filter:
+                quantized_count = (min_quantity_filter *
+                                   (accelerator_count // min_quantity_filter))
+                if accelerator_name not in total_accelerators_capacity:
+                    total_accelerators_capacity[
+                        accelerator_name] = quantized_count
+                else:
+                    total_accelerators_capacity[
+                        accelerator_name] += quantized_count
+
+            if accelerator_name not in total_accelerators_available:
+                total_accelerators_available[accelerator_name] = 0
+            if accelerators_available >= min_quantity_filter:
+                quantized_availability = min_quantity_filter * (
+                    accelerators_available // min_quantity_filter)
+                total_accelerators_available[
+                    accelerator_name] += quantized_availability
 
     result = []
-    for accelerator_name, accelerator_count in accelerators:
+
+    # Generate dataframe for common.list_accelerators_impl
+    for accelerator_name, accelerator_count in accelerators_qtys:
         result.append(
             common.InstanceTypeInfo(cloud='Kubernetes',
                                     instance_type=None,
@@ -98,9 +168,13 @@ def list_accelerators(
         ])
     df['GpuInfo'] = True
 
-    return common.list_accelerators_impl('Kubernetes', df, gpus_only,
-                                         name_filter, region_filter,
-                                         quantity_filter, case_sensitive)
+    # Use common.list_accelerators_impl to get InstanceTypeInfo objects used
+    # by sky show-gpus when cloud is not specified.
+    qtys_map = common.list_accelerators_impl('Kubernetes', df, gpus_only,
+                                             name_filter, region_filter,
+                                             quantity_filter, case_sensitive)
+
+    return qtys_map, total_accelerators_capacity, total_accelerators_available
 
 
 def validate_region_zone(
diff --git a/sky/clouds/service_catalog/oci_catalog.py b/sky/clouds/service_catalog/oci_catalog.py
index 2a34e739509..2561b913dcf 100644
--- a/sky/clouds/service_catalog/oci_catalog.py
+++ b/sky/clouds/service_catalog/oci_catalog.py
@@ -40,7 +40,7 @@ def _get_df() -> 'pd.DataFrame':
 
     df = common.read_catalog('oci/vms.csv')
     try:
-        oci_adaptor.oci
+        oci_adaptor.oci.load_module()
     except ImportError:
         _df = df
         return _df
diff --git a/sky/clouds/vsphere.py b/sky/clouds/vsphere.py
index 02a794d7d58..872b8df9d70 100644
--- a/sky/clouds/vsphere.py
+++ b/sky/clouds/vsphere.py
@@ -136,10 +136,6 @@ def get_egress_cost(self, num_gigabytes: float) -> float:
     def __repr__(self):
         return 'vSphere'
 
-    def is_same_cloud(self, other: clouds.Cloud) -> bool:
-        # Returns true if the two clouds are the same cloud type.
-        return isinstance(other, Vsphere)
-
     @classmethod
     def get_default_instance_type(
         cls,
diff --git a/sky/core.py b/sky/core.py
index c93a50f0b7d..b1006fe19ab 100644
--- a/sky/core.py
+++ b/sky/core.py
@@ -109,6 +109,26 @@ def status(cluster_names: Optional[Union[str, List[str]]] = None,
                                       cluster_names=cluster_names)
 
 
+def endpoints(cluster: str,
+              port: Optional[Union[int, str]] = None) -> Dict[int, str]:
+    """Gets the endpoint(s) for a given cluster and an optional port number.
+
+    Args:
+        cluster: The name of the cluster.
+        port: The port number to get the endpoint for. If None, endpoints
+            for all ports are returned.
+
+    Returns: A dictionary of port numbers to endpoints. If port is None,
+        the dictionary will contain all ports:endpoints exposed on the cluster.
+
+    Raises:
+        ValueError: if the cluster is not UP or the endpoint is not exposed.
+        RuntimeError: if the cluster has no ports to be exposed or no endpoints
+            are exposed yet.
+ """ + return backend_utils.get_endpoints(cluster=cluster, port=port) + + @usage_lib.entrypoint def cost_report() -> List[Dict[str, Any]]: # NOTE(dev): Keep the docstring consistent between the Python API and CLI. @@ -196,8 +216,6 @@ def _start( idle_minutes_to_autostop = ( constants.CONTROLLER_IDLE_MINUTES_TO_AUTOSTOP) - # NOTE: if spot_queue() calls _start() and hits here, that entrypoint - # would have a cluster name (the controller) filled in. usage_lib.record_cluster_name_for_current_operation(cluster_name) with dag.Dag(): @@ -264,7 +282,7 @@ def start( ValueError: argument values are invalid: (1) the specified cluster does not exist; (2) if ``down`` is set to True but ``idle_minutes_to_autostop`` is None; (3) if the specified cluster is - the managed spot controller, and either ``idle_minutes_to_autostop`` + the managed jobs controller, and either ``idle_minutes_to_autostop`` is not None or ``down`` is True (omit them to use the default autostop settings). sky.exceptions.NotSupportedError: if the cluster to restart was @@ -317,7 +335,7 @@ def stop(cluster_name: str, purge: bool = False) -> None: ValueError: the specified cluster does not exist. RuntimeError: failed to stop the cluster. sky.exceptions.NotSupportedError: if the specified cluster is a spot - cluster, or a TPU VM Pod cluster, or the managed spot controller. + cluster, or a TPU VM Pod cluster, or the managed jobs controller. """ if controller_utils.Controllers.from_name(cluster_name) is not None: raise exceptions.NotSupportedError( @@ -372,7 +390,7 @@ def down(cluster_name: str, purge: bool = False) -> None: ValueError: the specified cluster does not exist. RuntimeError: failed to tear down the cluster. sky.exceptions.NotSupportedError: the specified cluster is the managed - spot controller. + jobs controller. """ handle = global_user_state.get_handle_from_cluster_name(cluster_name) if handle is None: @@ -470,6 +488,19 @@ def autostop( f' {_stop_not_supported_message(handle.launched_resources)}.' ) from e + # Check if autodown is required and supported + if not is_cancel: + try: + cloud.check_features_are_supported( + handle.launched_resources, + {clouds.CloudImplementationFeatures.AUTO_TERMINATE}) + except exceptions.NotSupportedError as e: + raise exceptions.NotSupportedError( + f'{colorama.Fore.YELLOW}{operation} on cluster ' + f'{cluster_name!r}...skipped.{colorama.Style.RESET_ALL}\n' + f' Auto{option_str} is not supported on {cloud!r} - ' + f'see reason above.') from e + usage_lib.record_cluster_name_for_current_operation(cluster_name) backend.set_autostop(handle, idle_minutes, down) @@ -559,7 +590,7 @@ def cancel( Additional arguments: _try_cancel_if_cluster_is_init: (bool) whether to try cancelling the job even if the cluster is not UP, but the head node is still alive. - This is used by the spot controller to cancel the job when the + This is used by the jobs controller to cancel the job when the worker node is preempted in the spot cluster. 
Raises: diff --git a/sky/data/mounting_utils.py b/sky/data/mounting_utils.py index 2f4e37a1b66..d445d3d67c5 100644 --- a/sky/data/mounting_utils.py +++ b/sky/data/mounting_utils.py @@ -4,6 +4,7 @@ from typing import Optional from sky import exceptions +from sky.utils import command_runner # Values used to construct mounting commands _STAT_CACHE_TTL = '5s' @@ -11,7 +12,7 @@ _TYPE_CACHE_TTL = '5s' _RENAME_DIR_LIMIT = 10000 # https://github.com/GoogleCloudPlatform/gcsfuse/releases -GCSFUSE_VERSION = '1.3.0' +GCSFUSE_VERSION = '2.2.0' def get_s3_mount_install_cmd() -> str: @@ -129,6 +130,8 @@ def get_mounting_script( script = textwrap.dedent(f""" #!/usr/bin/env bash set -e + + {command_runner.ALIAS_SUDO_TO_EMPTY_FOR_ROOT_CMD} MOUNT_PATH={mount_path} MOUNT_BINARY={mount_binary} diff --git a/sky/data/storage.py b/sky/data/storage.py index 06cdcbca62c..e43406c3951 100644 --- a/sky/data/storage.py +++ b/sky/data/storage.py @@ -110,15 +110,24 @@ class StoreType(enum.Enum): IBM = 'IBM' @classmethod - def from_cloud(cls, cloud: clouds.Cloud) -> 'StoreType': - if isinstance(cloud, clouds.AWS): + def from_cloud(cls, cloud: str) -> 'StoreType': + if cloud.lower() == str(clouds.AWS()).lower(): return StoreType.S3 - elif isinstance(cloud, clouds.GCP): + elif cloud.lower() == str(clouds.GCP()).lower(): return StoreType.GCS - elif isinstance(cloud, clouds.Azure): - return StoreType.AZURE - elif isinstance(cloud, clouds.IBM): + elif cloud.lower() == str(clouds.IBM()).lower(): return StoreType.IBM + elif cloud.lower() == cloudflare.NAME.lower(): + return StoreType.R2 + elif cloud.lower() == str(clouds.Azure()).lower(): + with ux_utils.print_exception_no_traceback(): + raise ValueError('Azure Blob Storage is not supported yet.') + elif cloud.lower() == str(clouds.Lambda()).lower(): + with ux_utils.print_exception_no_traceback(): + raise ValueError('Lambda Cloud does not provide cloud storage.') + elif cloud.lower() == str(clouds.SCP()).lower(): + with ux_utils.print_exception_no_traceback(): + raise ValueError('SCP does not provide cloud storage.') raise ValueError(f'Unsupported cloud for StoreType: {cloud}') @@ -136,51 +145,28 @@ def from_store(cls, store: 'AbstractStore') -> 'StoreType': with ux_utils.print_exception_no_traceback(): raise ValueError(f'Unknown store type: {store}') + def store_prefix(self) -> str: + if self == StoreType.S3: + return 's3://' + elif self == StoreType.GCS: + return 'gs://' + elif self == StoreType.R2: + return 'r2://' + elif self == StoreType.IBM: + return 'cos://' + elif self == StoreType.AZURE: + with ux_utils.print_exception_no_traceback(): + raise ValueError('Azure Blob Storage is not supported yet.') + else: + with ux_utils.print_exception_no_traceback(): + raise ValueError(f'Unknown store type: {self}') + class StorageMode(enum.Enum): MOUNT = 'MOUNT' COPY = 'COPY' -def get_storetype_from_cloud(cloud: clouds.Cloud) -> StoreType: - if isinstance(cloud, clouds.AWS): - return StoreType.S3 - elif isinstance(cloud, clouds.GCP): - return StoreType.GCS - elif isinstance(cloud, clouds.IBM): - return StoreType.IBM - elif isinstance(cloud, clouds.Azure): - with ux_utils.print_exception_no_traceback(): - raise ValueError('Azure Blob Storage is not supported yet.') - elif isinstance(cloud, clouds.Lambda): - with ux_utils.print_exception_no_traceback(): - raise ValueError('Lambda Cloud does not provide cloud storage.') - elif isinstance(cloud, clouds.SCP): - with ux_utils.print_exception_no_traceback(): - raise ValueError('SCP does not provide cloud storage.') - else: - with 
ux_utils.print_exception_no_traceback(): - raise ValueError(f'Unknown cloud type: {cloud}') - - -def get_store_prefix(storetype: StoreType) -> str: - if storetype == StoreType.S3: - return 's3://' - elif storetype == StoreType.GCS: - return 'gs://' - # R2 storages use 's3://' as a prefix for various aws cli commands - elif storetype == StoreType.R2: - return 's3://' - elif storetype == StoreType.IBM: - return 'cos://' - elif storetype == StoreType.AZURE: - with ux_utils.print_exception_no_traceback(): - raise ValueError('Azure Blob Storage is not supported yet.') - else: - with ux_utils.print_exception_no_traceback(): - raise ValueError(f'Unknown store type: {storetype}') - - class AbstractStore: """AbstractStore abstracts away the different storage types exposed by different clouds. @@ -353,7 +339,7 @@ def _validate_existing_bucket(self): # bucket's URL as 'source'. if handle is None: with ux_utils.print_exception_no_traceback(): - store_prefix = get_store_prefix(StoreType.from_store(self)) + store_prefix = StoreType.from_store(self).store_prefix() raise exceptions.StorageSpecError( 'Attempted to mount a non-sky managed bucket ' f'{self.name!r} without specifying the storage source.' diff --git a/sky/exceptions.py b/sky/exceptions.py index 131b4675399..4fced20ce4e 100644 --- a/sky/exceptions.py +++ b/sky/exceptions.py @@ -52,11 +52,12 @@ class InvalidCloudConfigs(Exception): class ProvisionPrechecksError(Exception): - """Raised when a spot job fails prechecks before provision. + """Raised when a managed job fails prechecks before provision. + Developer note: For now this should only be used by managed - spot code path (technically, this can/should be raised by the + jobs code path (technically, this can/should be raised by the lower-level sky.launch()). Please refer to the docstring of - `spot.recovery_strategy._launch` for more details about when + `jobs.recovery_strategy._launch` for more details about when the error will be raised. Args: @@ -68,11 +69,11 @@ def __init__(self, reasons: List[Exception]) -> None: self.reasons = list(reasons) -class SpotJobReachedMaxRetriesError(Exception): - """Raised when a spot job fails to be launched after maximum retries. +class ManagedJobReachedMaxRetriesError(Exception): + """Raised when a managed job fails to be launched after maximum retries. - Developer note: For now this should only be used by managed spot code - path. Please refer to the docstring of `spot.recovery_strategy._launch` + Developer note: For now this should only be used by managed jobs code + path. Please refer to the docstring of `jobs.recovery_strategy._launch` for more details about when the error will be raised. 
""" pass @@ -189,8 +190,8 @@ class StorageExternalDeletionError(StorageBucketGetError): pass -class FetchIPError(Exception): - """Raised when fetching the IP fails.""" +class FetchClusterInfoError(Exception): + """Raised when fetching the cluster info fails.""" class Reason(enum.Enum): HEAD = 'HEAD' @@ -211,8 +212,8 @@ class ClusterStatusFetchingError(Exception): pass -class SpotUserCancelledError(Exception): - """Raised when a spot user cancels the job.""" +class ManagedJobUserCancelledError(Exception): + """Raised when a user cancels a managed job.""" pass diff --git a/sky/execution.py b/sky/execution.py index 25f0d8cc7a8..b1fb4ec4164 100644 --- a/sky/execution.py +++ b/sky/execution.py @@ -108,7 +108,7 @@ def _execute( clone_disk_from: Optional[str] = None, # Internal only: # pylint: disable=invalid-name - _is_launched_by_spot_controller: bool = False, + _is_launched_by_jobs_controller: bool = False, _is_launched_by_sky_serve_controller: bool = False, ) -> Tuple[Optional[int], Optional[backends.ResourceHandle]]: """Execute an entrypoint. @@ -160,11 +160,11 @@ def _execute( assert len(dag) == 1, f'We support 1 task for now. {dag}' task = dag.tasks[0] - if task.need_spot_recovery: + if any(r.job_recovery is not None for r in task.resources): with ux_utils.print_exception_no_traceback(): raise ValueError( - 'Spot recovery is specified in the task. To launch the ' - 'managed spot job, please use: sky spot launch') + 'Job recovery is specified in the task. To launch a ' + 'managed job, please use: sky jobs launch') cluster_exists = False if cluster_name is not None: @@ -207,8 +207,12 @@ def _execute( f'{colorama.Style.RESET_ALL}') idle_minutes_to_autostop = 1 stages.remove(Stage.DOWN) - if not down: - requested_features.add(clouds.CloudImplementationFeatures.STOP) + if idle_minutes_to_autostop >= 0: + requested_features.add( + clouds.CloudImplementationFeatures.AUTO_TERMINATE) + if not down: + requested_features.add( + clouds.CloudImplementationFeatures.STOP) # NOTE: in general we may not have sufficiently specified info # (cloud/resource) to check STOP_SPOT_INSTANCE here. This is checked in # the backend. @@ -225,10 +229,10 @@ def _execute( task) if not cluster_exists: - # If spot is launched by skyserve controller or managed spot controller, - # We don't need to print out the logger info. + # If spot is launched on serve or jobs controller, we don't need to + # print out the hint. if (Stage.PROVISION in stages and task.use_spot and - not _is_launched_by_spot_controller and + not _is_launched_by_jobs_controller and not _is_launched_by_sky_serve_controller): yellow = colorama.Fore.YELLOW bold = colorama.Style.BRIGHT @@ -236,9 +240,9 @@ def _execute( logger.info( f'{yellow}Launching an unmanaged spot task, which does not ' f'automatically recover from preemptions.{reset}\n{yellow}To ' - 'get automatic recovery, use managed spot instead: ' - f'{reset}{bold}sky spot launch{reset} {yellow}or{reset} ' - f'{bold}sky.spot.launch(){reset}.') + 'get automatic recovery, use managed job instead: ' + f'{reset}{bold}sky jobs launch{reset} {yellow}or{reset} ' + f'{bold}sky.jobs.launch(){reset}.') if Stage.OPTIMIZE in stages: if task.best_resources is None: @@ -318,10 +322,10 @@ def _execute( if controller is None and not _is_launched_by_sky_serve_controller: # UX: print live clusters to make users aware (to save costs). 
#
-            # Don't print if this job is launched by the spot controller,
-            # because spot jobs are serverless, there can be many of them, and
-            # users tend to continuously monitor spot jobs using `sky spot
-            # status`. Also don't print if this job is a skyserve controller
+            # Don't print if this job is launched by the jobs controller,
+            # because managed jobs are serverless, there can be many of them,
+            # and users tend to continuously monitor managed jobs using `sky
+            # jobs queue`. Also don't print if this job is a skyserve controller
             # job or launched by a skyserve controller job, because the
             # redirect for this subprocess.run won't success and it will
             # pollute the controller logs.
@@ -330,7 +334,7 @@ def _execute(
             env = dict(os.environ,
                        **{env_options.Options.DISABLE_LOGGING.value: '1'})
             subprocess_utils.run(
-                'sky status --no-show-spot-jobs --no-show-services', env=env)
+                'sky status --no-show-managed-jobs --no-show-services', env=env)
             print()
             print('\x1b[?25h', end='')  # Show cursor.
     return job_id, handle
@@ -354,7 +358,7 @@ def launch(
     clone_disk_from: Optional[str] = None,
     # Internal only:
     # pylint: disable=invalid-name
-    _is_launched_by_spot_controller: bool = False,
+    _is_launched_by_jobs_controller: bool = False,
     _is_launched_by_sky_serve_controller: bool = False,
     _disable_controller_check: bool = False,
 ) -> Tuple[Optional[int], Optional[backends.ResourceHandle]]:
@@ -464,7 +468,7 @@ def launch(
         idle_minutes_to_autostop=idle_minutes_to_autostop,
         no_setup=no_setup,
         clone_disk_from=clone_disk_from,
-        _is_launched_by_spot_controller=_is_launched_by_spot_controller,
+        _is_launched_by_jobs_controller=_is_launched_by_jobs_controller,
         _is_launched_by_sky_serve_controller=
         _is_launched_by_sky_serve_controller,
     )
diff --git a/sky/spot/README.md b/sky/jobs/README.md
similarity index 52%
rename from sky/spot/README.md
rename to sky/jobs/README.md
index ed20a77b46e..579f675a5f9 100644
--- a/sky/spot/README.md
+++ b/sky/jobs/README.md
@@ -1,11 +1,11 @@
-# SkyPilot Managed Spot
+# SkyPilot Managed Jobs
 
-This module is used for running user jobs on spot clusters, which automatically recovers the job from preemptions.
+This module is used for running and managing user jobs; it automatically recovers failed jobs from spot preemptions and/or machine failures.
 
 ## Concepts
 
-- Task: A task (sky.Task) is a unit of work. SkyPilot will launch a spot cluster to run the task, automatically recover the task from preemptions, and terminate the cluster when the task is done.
-- Job: A job in the context of SkyPilot managed spot, is equivalent to a SkyPilot DAG (sky.Dag). A job is a collection of tasks that are executed in a specific order based on the dependencies between the tasks. Each controller process will be in charge of the whole lifecycle of a job.
+- Task: A task (sky.Task) is a unit of work. SkyPilot will launch a cluster to run the task, automatically recover the task from preemptions, and terminate the cluster when the task is done.
+- Job: A job, in the context of SkyPilot managed jobs, is equivalent to a SkyPilot DAG (sky.Dag). A job is a collection of tasks that are executed in a specific order based on the dependencies between the tasks. Each controller process will be in charge of the whole lifecycle of a job.
 
 Note that for singleton (1-task) jobs, we will use the term "task" and "job" interchangeably.
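For reference, a minimal sketch of driving the renamed API from Python; the task name, command, and resources below are illustrative (not part of this diff), and `sky.jobs.launch` is the entrypoint exported by the new `sky/jobs/__init__.py` later in this diff:

```python
import sky

# A single-task managed job; the name and command are hypothetical.
task = sky.Task(name='train', run='python train.py')
task.set_resources(sky.Resources(accelerators='A100:1', use_spot=True))

# Formerly sky.spot.launch(); one controller process owns the whole
# lifecycle of this job, including recovery from preemptions/failures.
sky.jobs.launch(task, name='train')
```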
@@ -14,6 +14,6 @@ A job of n tasks (experimental; we support a pipeline of such tasks only): the j ## Architecture -![Architecture](../../docs/source/images/spot-controller.png) - +![Architecture](../../docs/source/images/managed-jobs-arch.png) + diff --git a/sky/jobs/__init__.py b/sky/jobs/__init__.py new file mode 100644 index 00000000000..922bb613ff7 --- /dev/null +++ b/sky/jobs/__init__.py @@ -0,0 +1,43 @@ +"""Managed jobs.""" +import pathlib + +from sky.jobs.constants import JOBS_CLUSTER_NAME_PREFIX_LENGTH +from sky.jobs.constants import JOBS_CONTROLLER_TEMPLATE +from sky.jobs.constants import JOBS_CONTROLLER_YAML_PREFIX +from sky.jobs.constants import JOBS_TASK_YAML_PREFIX +from sky.jobs.core import cancel +from sky.jobs.core import launch +from sky.jobs.core import queue +from sky.jobs.core import tail_logs +from sky.jobs.recovery_strategy import DEFAULT_RECOVERY_STRATEGY +from sky.jobs.recovery_strategy import RECOVERY_STRATEGIES +from sky.jobs.state import ManagedJobStatus +from sky.jobs.utils import dump_managed_job_queue +from sky.jobs.utils import format_job_table +from sky.jobs.utils import JOB_CONTROLLER_NAME +from sky.jobs.utils import load_managed_job_queue +from sky.jobs.utils import ManagedJobCodeGen + +pathlib.Path(JOBS_TASK_YAML_PREFIX).expanduser().parent.mkdir(parents=True, + exist_ok=True) +__all__ = [ + 'RECOVERY_STRATEGIES', + 'DEFAULT_RECOVERY_STRATEGY', + 'JOB_CONTROLLER_NAME', + # Constants + 'JOBS_CONTROLLER_TEMPLATE', + 'JOBS_CONTROLLER_YAML_PREFIX', + 'JOBS_TASK_YAML_PREFIX', + # Enums + 'ManagedJobStatus', + # Core + 'cancel', + 'launch', + 'queue', + 'tail_logs', + # utils + 'ManagedJobCodeGen', + 'format_job_table', + 'dump_managed_job_queue', + 'load_managed_job_queue', +] diff --git a/sky/jobs/constants.py b/sky/jobs/constants.py new file mode 100644 index 00000000000..d5f32908317 --- /dev/null +++ b/sky/jobs/constants.py @@ -0,0 +1,27 @@ +"""Constants used for Managed Jobs.""" + +JOBS_CONTROLLER_TEMPLATE = 'jobs-controller.yaml.j2' +JOBS_CONTROLLER_YAML_PREFIX = '~/.sky/jobs_controller' + +JOBS_TASK_YAML_PREFIX = '~/.sky/managed_jobs' + +# Resources as a dict for the jobs controller. +# Use default CPU instance type for jobs controller with >= 24GB, i.e. +# m6i.2xlarge (8vCPUs, 32 GB) for AWS, Standard_D8s_v4 (8vCPUs, 32 GB) +# for Azure, and n1-standard-8 (8 vCPUs, 32 GB) for GCP, etc. +# Based on profiling, memory should be at least 3x (in GB) as num vCPUs to avoid +# OOM (each vCPU can have 4 jobs controller processes as we set the CPU +# requirement to 0.25, and 3 GB is barely enough for 4 job processes). +# We use 50 GB disk size to reduce the cost. +CONTROLLER_RESOURCES = {'cpus': '8+', 'memory': '3x', 'disk_size': 50} + +# Max length of the cluster name for GCP is 35, the user hash to be attached is +# 4+1 chars, and we assume the maximum length of the job id is 4+1, so the max +# length of the cluster name prefix is 25 to avoid the cluster name being too +# long and truncated twice during the cluster creation. +JOBS_CLUSTER_NAME_PREFIX_LENGTH = 25 + +# The version of the lib files that jobs/utils use. Whenever there is an API +# change for the jobs/utils, we need to bump this version and update +# job.utils.ManagedJobCodeGen to handle the version update. 
+MANAGED_JOBS_VERSION = 1 diff --git a/sky/spot/controller.py b/sky/jobs/controller.py similarity index 65% rename from sky/spot/controller.py rename to sky/jobs/controller.py index 8a5a4cceb07..39c89d2784b 100644 --- a/sky/spot/controller.py +++ b/sky/jobs/controller.py @@ -1,4 +1,4 @@ -"""Controller: handles the life cycle of a managed spot cluster (job).""" +"""Controller: handles the life cycle of a managed job.""" import argparse import multiprocessing import os @@ -15,11 +15,11 @@ from sky import status_lib from sky.backends import backend_utils from sky.backends import cloud_vm_ray_backend +from sky.jobs import recovery_strategy +from sky.jobs import state as managed_job_state +from sky.jobs import utils as managed_job_utils from sky.skylet import constants from sky.skylet import job_lib -from sky.spot import recovery_strategy -from sky.spot import spot_state -from sky.spot import spot_utils from sky.usage import usage_lib from sky.utils import common_utils from sky.utils import controller_utils @@ -31,9 +31,9 @@ import sky # Use the explicit logger name so that the logger is under the -# `sky.spot.controller` namespace when executed directly, so as +# `sky.jobs.controller` namespace when executed directly, so as # to inherit the setup from the `sky` logger. -logger = sky_logging.init_logger('sky.spot.controller') +logger = sky_logging.init_logger('sky.jobs.controller') def _get_dag_and_name(dag_yaml: str) -> Tuple['sky.Dag', str]: @@ -43,8 +43,8 @@ def _get_dag_and_name(dag_yaml: str) -> Tuple['sky.Dag', str]: return dag, dag_name -class SpotController: - """Each spot controller manages the life cycle of one spot job.""" +class JobsController: + """Each jobs controller manages the life cycle of one managed job.""" def __init__(self, job_id: int, dag_yaml: str, retry_until_up: bool) -> None: @@ -60,9 +60,15 @@ def __init__(self, job_id: int, dag_yaml: str, # the user can have the same id for multiple recoveries. # Example value: sky-2022-10-04-22-46-52-467694_my-spot-name_spot_id-17-0 job_id_env_vars = [] - for i in range(len(self._dag.tasks)): - task_name = self._dag_name - if len(self._dag.tasks) > 1: + for i, task in enumerate(self._dag.tasks): + if len(self._dag.tasks) <= 1: + task_name = self._dag_name + else: + task_name = task.name + # This is guaranteed by the spot_launch API, where we fill in + # the task.name with + # dag_utils.maybe_infer_and_fill_dag_and_task_names. + assert task_name is not None, self._dag task_name = f'{self._dag_name}_{task_name}' job_id_env_var = common_utils.get_global_job_id( self._backend.run_timestamp, @@ -82,23 +88,23 @@ def __init__(self, job_id: int, dag_yaml: str, def _download_log_and_stream( self, handle: cloud_vm_ray_backend.CloudVmRayResourceHandle) -> None: - """Downloads and streams the logs of the latest job of a spot cluster. + """Downloads and streams the logs of the latest job. - We do not stream the logs from the spot cluster directly, as the + We do not stream the logs from the cluster directly, as the donwload and stream should be faster, and more robust against preemptions or ssh disconnection during the streaming. 
""" - spot_job_logs_dir = os.path.join(constants.SKY_LOGS_DIRECTORY, - 'spot_jobs') + managed_job_logs_dir = os.path.join(constants.SKY_LOGS_DIRECTORY, + 'managed_jobs') controller_utils.download_and_stream_latest_job_log( - self._backend, handle, spot_job_logs_dir) + self._backend, handle, managed_job_logs_dir) logger.info(f'\n== End of logs (ID: {self._job_id}) ==') def _run_one_task(self, task_id: int, task: 'sky.Task') -> bool: - """Busy loop monitoring spot cluster status and handling recovery. + """Busy loop monitoring cluster status and handling recovery. When the task is successfully completed, this function returns True, - and will terminate the spot cluster before returning. + and will terminate the cluster before returning. If the user program fails, i.e. the task is set to FAILED or FAILED_SETUP, this function will return False. @@ -124,28 +130,28 @@ def _run_one_task(self, task_id: int, task: 'sky.Task') -> bool: due to: 1. Any of the underlying failover exceptions is due to resources unavailability. - 2. The cluster is preempted before the job is submitted. + 2. The cluster is preempted or failed before the job is + submitted. 3. Any unexpected error happens during the `sky.launch`. Other exceptions may be raised depending on the backend. """ - callback_func = spot_utils.event_callback_func(job_id=self._job_id, - task_id=task_id, - task=task) + callback_func = managed_job_utils.event_callback_func( + job_id=self._job_id, task_id=task_id, task=task) if task.run is None: logger.info(f'Skip running task {task_id} ({task.name}) due to its ' 'run commands being empty.') # Call set_started first to initialize columns in the state table, # including start_at and last_recovery_at to avoid issues for # uninitialized columns. - spot_state.set_started(job_id=self._job_id, - task_id=task_id, - start_time=time.time(), - callback_func=callback_func) - spot_state.set_succeeded(job_id=self._job_id, - task_id=task_id, - end_time=time.time(), - callback_func=callback_func) + managed_job_state.set_started(job_id=self._job_id, + task_id=task_id, + start_time=time.time(), + callback_func=callback_func) + managed_job_state.set_succeeded(job_id=self._job_id, + task_id=task_id, + end_time=time.time(), + callback_func=callback_func) return True usage_lib.messages.usage.update_task_id(task_id) task_id_env_var = task.envs[constants.TASK_ID_ENV_VAR] @@ -153,64 +159,65 @@ def _run_one_task(self, task_id: int, task: 'sky.Task') -> bool: if task_id == 0: submitted_at = backend_utils.get_timestamp_from_run_timestamp( self._backend.run_timestamp) - spot_state.set_submitted( + managed_job_state.set_submitted( self._job_id, task_id, self._backend.run_timestamp, submitted_at, - resources_str=backend_utils.get_task_resources_str(task), + resources_str=backend_utils.get_task_resources_str( + task, is_managed_job=True), callback_func=callback_func) logger.info( - f'Submitted spot job {self._job_id} (task: {task_id}, name: ' + f'Submitted managed job {self._job_id} (task: {task_id}, name: ' f'{task.name!r}); {constants.TASK_ID_ENV_VAR}: {task_id_env_var}') assert task.name is not None, task - cluster_name = spot_utils.generate_spot_cluster_name( + cluster_name = managed_job_utils.generate_managed_job_cluster_name( task.name, self._job_id) self._strategy_executor = recovery_strategy.StrategyExecutor.make( cluster_name, self._backend, task, self._retry_until_up) logger.info('Started monitoring.') - spot_state.set_starting(job_id=self._job_id, - task_id=task_id, - callback_func=callback_func) + 
managed_job_state.set_starting(job_id=self._job_id, + task_id=task_id, + callback_func=callback_func) remote_job_submitted_at = self._strategy_executor.launch() assert remote_job_submitted_at is not None, remote_job_submitted_at - spot_state.set_started(job_id=self._job_id, - task_id=task_id, - start_time=remote_job_submitted_at, - callback_func=callback_func) + managed_job_state.set_started(job_id=self._job_id, + task_id=task_id, + start_time=remote_job_submitted_at, + callback_func=callback_func) while True: - time.sleep(spot_utils.JOB_STATUS_CHECK_GAP_SECONDS) + time.sleep(managed_job_utils.JOB_STATUS_CHECK_GAP_SECONDS) # Check the network connection to avoid false alarm for job failure. # Network glitch was observed even in the VM. try: backend_utils.check_network_connection() except exceptions.NetworkError: - logger.info( - 'Network is not available. Retrying again in ' - f'{spot_utils.JOB_STATUS_CHECK_GAP_SECONDS} seconds.') + logger.info('Network is not available. Retrying again in ' + f'{managed_job_utils.JOB_STATUS_CHECK_GAP_SECONDS} ' + 'seconds.') continue # NOTE: we do not check cluster status first because race condition # can occur, i.e. cluster can be down during the job status check. - job_status = spot_utils.get_job_status(self._backend, cluster_name) + job_status = managed_job_utils.get_job_status( + self._backend, cluster_name) if job_status == job_lib.JobStatus.SUCCEEDED: - end_time = spot_utils.get_job_timestamp(self._backend, - cluster_name, - get_end_time=True) + end_time = managed_job_utils.get_job_timestamp( + self._backend, cluster_name, get_end_time=True) # The job is done. - spot_state.set_succeeded(self._job_id, - task_id, - end_time=end_time, - callback_func=callback_func) + managed_job_state.set_succeeded(self._job_id, + task_id, + end_time=end_time, + callback_func=callback_func) logger.info( f'Spot job {self._job_id} (task: {task_id}) SUCCEEDED. ' - f'Cleaning up the spot cluster {cluster_name}.') - # Only clean up the spot cluster, not the storages, because - # tasks may share storages. + f'Cleaning up the cluster {cluster_name}.') + # Only clean up the cluster, not the storages, because tasks may + # share storages. recovery_strategy.terminate_cluster(cluster_name=cluster_name) return True @@ -218,7 +225,8 @@ def _run_one_task(self, task_id: int, task: 'sky.Task') -> bool: # healthy cluster. We can safely continue monitoring. # For multi-node jobs, since the job may not be set to FAILED # immediately (depending on user program) when only some of the - # nodes are preempted, need to check the actual cluster status. + # nodes are preempted or failed, need to check the actual cluster + # status. if (job_status is not None and not job_status.is_terminal() and task.num_nodes == 1): continue @@ -229,21 +237,28 @@ def _run_one_task(self, task_id: int, task: 'sky.Task') -> bool: # Add a grace period before the check of preemption to avoid # false alarm for job failure. time.sleep(5) + # Pull the actual cluster status from the cloud provider to - # determine whether the cluster is preempted. + # determine whether the cluster is preempted or failed. + # TODO(zhwu): For hardware failure, such as GPU failure, it may not + # be reflected in the cluster status, depending on the cloud, which + # can also cause failure of the job, and we need to recover it + # rather than fail immediately. 
(cluster_status, handle) = backend_utils.refresh_cluster_status_handle( cluster_name, force_refresh_statuses=set(status_lib.ClusterStatus)) if cluster_status != status_lib.ClusterStatus.UP: - # The cluster is (partially) preempted. It can be down, INIT - # or STOPPED, based on the interruption behavior of the cloud. - # Spot recovery is needed (will be done later in the code). + # The cluster is (partially) preempted or failed. It can be + # down, INIT or STOPPED, based on the interruption behavior of + # the cloud. Spot recovery is needed (will be done later in the + # code). cluster_status_str = ('' if cluster_status is None else f' (status: {cluster_status.value})') logger.info( - f'Cluster is preempted{cluster_status_str}. Recovering...') + f'Cluster is preempted or failed{cluster_status_str}. ' + 'Recovering...') else: if job_status is not None and not job_status.is_terminal(): # The multi-node job is still running, continue monitoring. @@ -252,27 +267,29 @@ def _run_one_task(self, task_id: int, task: 'sky.Task') -> bool: job_lib.JobStatus.FAILED, job_lib.JobStatus.FAILED_SETUP ]: # The user code has probably crashed, fail immediately. - end_time = spot_utils.get_job_timestamp(self._backend, - cluster_name, - get_end_time=True) + end_time = managed_job_utils.get_job_timestamp( + self._backend, cluster_name, get_end_time=True) logger.info( 'The user job failed. Please check the logs below.\n' f'== Logs of the user job (ID: {self._job_id}) ==\n') self._download_log_and_stream(handle) - spot_status_to_set = spot_state.SpotStatus.FAILED + managed_job_status = ( + managed_job_state.ManagedJobStatus.FAILED) if job_status == job_lib.JobStatus.FAILED_SETUP: - spot_status_to_set = spot_state.SpotStatus.FAILED_SETUP + managed_job_status = ( + managed_job_state.ManagedJobStatus.FAILED_SETUP) failure_reason = ( 'To see the details, run: ' - f'sky spot logs --controller {self._job_id}') - - spot_state.set_failed(self._job_id, - task_id, - failure_type=spot_status_to_set, - failure_reason=failure_reason, - end_time=end_time, - callback_func=callback_func) + f'sky jobs logs --controller {self._job_id}') + + managed_job_state.set_failed( + self._job_id, + task_id, + failure_type=managed_job_status, + failure_reason=failure_reason, + end_time=end_time, + callback_func=callback_func) return False # Although the cluster is healthy, we fail to access the # job status. Try to recover the job (will not restart the @@ -286,22 +303,24 @@ def _run_one_task(self, task_id: int, task: 'sky.Task') -> bool: if handle is not None: resources = handle.launched_resources assert resources is not None, handle - if resources.need_cleanup_after_preemption(): + if resources.need_cleanup_after_preemption_or_failure(): # Some spot resource (e.g., Spot TPU VM) may need to be - # cleaned up after preemption. - logger.info('Cleaning up the preempted spot cluster...') + # cleaned up after preemption, as running launch again on + # those clusters again may fail. + logger.info('Cleaning up the preempted or failed cluster' + '...') recovery_strategy.terminate_cluster(cluster_name) - # Try to recover the spot jobs, when the cluster is preempted - # or the job status is failed to be fetched. - spot_state.set_recovering(job_id=self._job_id, - task_id=task_id, - callback_func=callback_func) + # Try to recover the managed jobs, when the cluster is preempted or + # failed or the job status is failed to be fetched. 
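+            # The job is marked RECOVERING here; set_recovered() below moves
+            # it back to RUNNING once the relaunch succeeds.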
+ managed_job_state.set_recovering(job_id=self._job_id, + task_id=task_id, + callback_func=callback_func) recovered_time = self._strategy_executor.recover() - spot_state.set_recovered(self._job_id, - task_id, - recovered_time=recovered_time, - callback_func=callback_func) + managed_job_state.set_recovered(self._job_id, + task_id, + recovered_time=recovered_time, + callback_func=callback_func) def run(self): """Run controller logic and handle exceptions.""" @@ -320,27 +339,29 @@ def run(self): common_utils.format_exception(reason, use_bracket=True) for reason in e.reasons)) logger.error(failure_reason) - spot_state.set_failed( + managed_job_state.set_failed( self._job_id, task_id=task_id, - failure_type=spot_state.SpotStatus.FAILED_PRECHECKS, + failure_type=managed_job_state.ManagedJobStatus. + FAILED_PRECHECKS, failure_reason=failure_reason, - callback_func=spot_utils.event_callback_func( + callback_func=managed_job_utils.event_callback_func( job_id=self._job_id, task_id=task_id, task=self._dag.tasks[task_id])) - except exceptions.SpotJobReachedMaxRetriesError as e: + except exceptions.ManagedJobReachedMaxRetriesError as e: # Please refer to the docstring of self._run for the cases when # this exception can occur. logger.error(common_utils.format_exception(e)) - # The spot job should be marked as FAILED_NO_RESOURCE, as the - # spot job may be able to launch next time. - spot_state.set_failed( + # The managed job should be marked as FAILED_NO_RESOURCE, as the + # managed job may be able to launch next time. + managed_job_state.set_failed( self._job_id, task_id=task_id, - failure_type=spot_state.SpotStatus.FAILED_NO_RESOURCE, + failure_type=managed_job_state.ManagedJobStatus. + FAILED_NO_RESOURCE, failure_reason=common_utils.format_exception(e), - callback_func=spot_utils.event_callback_func( + callback_func=managed_job_utils.event_callback_func( job_id=self._job_id, task_id=task_id, task=self._dag.tasks[task_id])) @@ -350,12 +371,13 @@ def run(self): msg = ('Unexpected error occurred: ' f'{common_utils.format_exception(e, use_bracket=True)}') logger.error(msg) - spot_state.set_failed( + managed_job_state.set_failed( self._job_id, task_id=task_id, - failure_type=spot_state.SpotStatus.FAILED_CONTROLLER, + failure_type=managed_job_state.ManagedJobStatus. + FAILED_CONTROLLER, failure_reason=msg, - callback_func=spot_utils.event_callback_func( + callback_func=managed_job_utils.event_callback_func( job_id=self._job_id, task_id=task_id, task=self._dag.tasks[task_id])) @@ -364,27 +386,28 @@ def run(self): # affect the jobs in terminal states. # We need to call set_cancelling before set_cancelled to make sure # the table entries are correctly set. - callback_func = spot_utils.event_callback_func( + callback_func = managed_job_utils.event_callback_func( job_id=self._job_id, task_id=task_id, task=self._dag.tasks[task_id]) - spot_state.set_cancelling(job_id=self._job_id, - callback_func=callback_func) - spot_state.set_cancelled(job_id=self._job_id, - callback_func=callback_func) + managed_job_state.set_cancelling(job_id=self._job_id, + callback_func=callback_func) + managed_job_state.set_cancelled(job_id=self._job_id, + callback_func=callback_func) def _run_controller(job_id: int, dag_yaml: str, retry_until_up: bool): """Runs the controller in a remote process for interruption.""" # The controller needs to be instantiated in the remote process, since # the controller is not serializable. 
 def _run_controller(job_id: int, dag_yaml: str, retry_until_up: bool):
     """Runs the controller in a remote process for interruption."""
     # The controller needs to be instantiated in the remote process, since
     # the controller is not serializable.
-    spot_controller = SpotController(job_id, dag_yaml, retry_until_up)
-    spot_controller.run()
+    jobs_controller = JobsController(job_id, dag_yaml, retry_until_up)
+    jobs_controller.run()


 def _handle_signal(job_id):
     """Handle the signal if the user sent it."""
-    signal_file = pathlib.Path(spot_utils.SIGNAL_FILE_PREFIX.format(job_id))
+    signal_file = pathlib.Path(
+        managed_job_utils.SIGNAL_FILE_PREFIX.format(job_id))
     user_signal = None
     if signal_file.exists():
         # Filelock is needed to prevent race condition with concurrent
@@ -393,7 +416,7 @@ def _handle_signal(job_id):
         with signal_file.open(mode='r', encoding='utf-8') as f:
             user_signal = f.read().strip()
             try:
-                user_signal = spot_utils.UserSignal(user_signal)
+                user_signal = managed_job_utils.UserSignal(user_signal)
             except ValueError:
                 logger.warning(
                     f'Unknown signal received: {user_signal}. Ignoring.')
@@ -403,28 +426,29 @@ def _handle_signal(job_id):
     if user_signal is None:
         # None or empty string.
         return
-    assert user_signal == spot_utils.UserSignal.CANCEL, (
+    assert user_signal == managed_job_utils.UserSignal.CANCEL, (
         f'Only cancel signal is supported, but {user_signal} got.')
-    raise exceptions.SpotUserCancelledError(
+    raise exceptions.ManagedJobUserCancelledError(
         f'User sent {user_signal.value} signal.')


 def _cleanup(job_id: int, dag_yaml: str):
-    """Clean up the spot cluster(s) and storages.
+    """Clean up the cluster(s) and storages.

     (1) Clean up the succeeded task(s)' ephemeral storage. The storage has
         to be cleaned up after the whole job is finished, as the tasks may
         share the same storage.
-    (2) Clean up the spot cluster(s) that are not cleaned up yet, which
-        can happen when the spot task failed or cancelled. At most one
-        spot cluster should be left when reaching here, as we currently
-        only support chain DAGs, and only spot task is executed at a time.
+    (2) Clean up the cluster(s) that are not cleaned up yet, which can happen
+        when the task fails or is cancelled. At most one cluster should be
+        left when reaching here, as we currently only support chain DAGs, and
+        only one task is executed at a time.
     """
     # NOTE: The code to get cluster name is same as what we did in the spot
-    # controller, we should keep it in sync with SpotController.__init__()
+    # controller; keep it in sync with JobsController.__init__().
     dag, _ = _get_dag_and_name(dag_yaml)
     for task in dag.tasks:
-        cluster_name = spot_utils.generate_spot_cluster_name(task.name, job_id)
+        cluster_name = managed_job_utils.generate_managed_job_cluster_name(
+            task.name, job_id)
         recovery_strategy.terminate_cluster(cluster_name)
     # Clean up Storages with persistent=False.
     # TODO(zhwu): this assumes the specific backend.
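_handle_signal above is the reading side of a tiny file-based cancellation protocol. A sketch of the matching sender follows; the signal-file path format and the 'CANCEL' payload are assumptions for illustration (the real values come from managed_job_utils.SIGNAL_FILE_PREFIX and UserSignal), and it uses the same filelock package the comment above refers to:

import pathlib

import filelock

# Assumed path format; the real one is managed_job_utils.SIGNAL_FILE_PREFIX.
SIGNAL_FILE_PREFIX = '/tmp/sky_jobs_controller_signal_{}'


def request_cancel(job_id: int) -> None:
    """Ask the controller for job_id to cancel, as _handle_signal expects."""
    signal_file = pathlib.Path(SIGNAL_FILE_PREFIX.format(job_id))
    # Lock to avoid racing with the controller's concurrent read.
    with filelock.FileLock(str(signal_file) + '.lock'):
        signal_file.write_text('CANCEL')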
@@ -451,16 +475,15 @@ def start(job_id, dag_yaml, retry_until_up):
         while controller_process.is_alive():
             _handle_signal(job_id)
             time.sleep(1)
-    except exceptions.SpotUserCancelledError:
+    except exceptions.ManagedJobUserCancelledError:
         dag, _ = _get_dag_and_name(dag_yaml)
-        task_id, _ = (spot_state.get_latest_task_id_status(job_id))
+        task_id, _ = managed_job_state.get_latest_task_id_status(job_id)
         logger.info(
-            f'Cancelling spot job, job_id: {job_id}, task_id: {task_id}')
-        spot_state.set_cancelling(job_id=job_id,
-                                  callback_func=spot_utils.event_callback_func(
-                                      job_id=job_id,
-                                      task_id=task_id,
-                                      task=dag.tasks[task_id]))
+            f'Cancelling managed job, job_id: {job_id}, task_id: {task_id}')
+        managed_job_state.set_cancelling(
+            job_id=job_id,
+            callback_func=managed_job_utils.event_callback_func(
+                job_id=job_id, task_id=task_id, task=dag.tasks[task_id]))
         cancelling = True
     finally:
         if controller_process is not None:
@@ -474,37 +497,38 @@ def start(job_id, dag_yaml, retry_until_up):
             controller_process.join()
             logger.info(f'Controller process {controller_process.pid} killed.')

-        logger.info(f'Cleaning up any spot cluster for job {job_id}.')
+        logger.info(f'Cleaning up any cluster for job {job_id}.')
         # NOTE: Originally, we send an interruption signal to the controller
         # process and the controller process handles cleanup. However, we
         # figure out the behavior differs from cloud to cloud
         # (e.g., GCP ignores 'SIGINT'). A possible explanation is
         # https://unix.stackexchange.com/questions/356408/strange-problem-with-trap-and-sigint
         # But anyway, a clean solution is killing the controller process
-        # directly, and then cleanup the cluster state.
+        # directly, and then cleaning up the cluster state.
         _cleanup(job_id, dag_yaml=dag_yaml)
-        logger.info(f'Spot cluster of job {job_id} has been cleaned up.')
+        logger.info(f'Cluster of managed job {job_id} has been cleaned up.')

         if cancelling:
-            spot_state.set_cancelled(
+            managed_job_state.set_cancelled(
                 job_id=job_id,
-                callback_func=spot_utils.event_callback_func(
+                callback_func=managed_job_utils.event_callback_func(
                     job_id=job_id, task_id=task_id, task=dag.tasks[task_id]))

         # We should check job status after 'set_cancelled', otherwise
         # the job status is not terminal.
-        job_status = spot_state.get_status(job_id)
+        job_status = managed_job_state.get_status(job_id)
         assert job_status is not None
         # The job can be non-terminal if the controller exited abnormally,
         # e.g. failed to launch cluster after reaching the MAX_RETRY.
         if not job_status.is_terminal():
-            logger.info(f'Previous spot job status: {job_status.value}')
-            spot_state.set_failed(
+            logger.info(f'Previous job status: {job_status.value}')
+            managed_job_state.set_failed(
                 job_id,
                 task_id=None,
-                failure_type=spot_state.SpotStatus.FAILED_CONTROLLER,
+                failure_type=managed_job_state.ManagedJobStatus.
+                FAILED_CONTROLLER,
                 failure_reason=('Unexpected error occurred. For details, '
-                                f'run: sky spot logs --controller {job_id}'))
+                                f'run: sky jobs logs --controller {job_id}'))


 if __name__ == '__main__':
@@ -515,10 +539,10 @@ def start(job_id, dag_yaml, retry_until_up):
                         help='Job id for the controller job.')
     parser.add_argument('--retry-until-up',
                         action='store_true',
-                        help='Retry until the spot cluster is up.')
+                        help='Retry until the cluster is up.')
     parser.add_argument('dag_yaml',
                         type=str,
-                        help='The path to the user spot task yaml file.')
+                        help='The path to the user job yaml file.')
     args = parser.parse_args()
     # We start process with 'spawn', because 'fork' could result in weird
     # behaviors; 'spawn' is also cross-platform.
diff --git a/sky/spot/core.py b/sky/jobs/core.py
similarity index 67%
rename from sky/spot/core.py
rename to sky/jobs/core.py
index 459f351b7b4..561d47f4b25 100644
--- a/sky/spot/core.py
+++ b/sky/jobs/core.py
@@ -1,4 +1,4 @@
-"""SDK functions for managed spot job."""
+"""SDK functions for managed jobs."""
 import os
 import tempfile
 from typing import Any, Dict, List, Optional, Union
@@ -14,9 +14,9 @@
 from sky import task as task_lib
 from sky.backends import backend_utils
 from sky.clouds.service_catalog import common as service_catalog_common
+from sky.jobs import constants as managed_job_constants
+from sky.jobs import utils as managed_job_utils
 from sky.skylet import constants as skylet_constants
-from sky.spot import constants
-from sky.spot import spot_utils
 from sky.usage import usage_lib
 from sky.utils import common_utils
 from sky.utils import controller_utils
@@ -35,18 +35,19 @@ def launch(
     retry_until_up: bool = False,
 ) -> None:
     # NOTE(dev): Keep the docstring consistent between the Python API and CLI.
-    """Launch a managed spot job.
+    """Launch a managed job.

-    Please refer to the sky.cli.spot_launch for the document.
+    Please refer to sky.cli.job_launch for documentation.

     Args:
         task: sky.Task, or sky.Dag (experimental; 1-task only) to launch as a
-            managed spot job.
-        name: Name of the spot job.
+            managed job.
+        name: Name of the managed job.
         detach_run: Whether to detach the run.

     Raises:
-        ValueError: cluster does not exist.
+        ValueError: cluster does not exist, or the entrypoint is not a
+            valid chain DAG.
         sky.exceptions.NotSupportedError: the feature is not supported.
     """
     entrypoint = task
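Before the function body, a hedged usage sketch of this renamed entrypoint. The YAML path and job name are illustrative, not from this diff; sky.Task.from_yaml and the sky.jobs re-export are assumed from how they are used elsewhere in the change, and only a single task or chain DAG is accepted, per the validation just below:

import sky
from sky import jobs  # the renamed sky.spot module

# 'train.yaml' and the job name are placeholders for a real task definition.
task = sky.Task.from_yaml('train.yaml')
jobs.launch(task, name='my-training-run', retry_until_up=True)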
@@ -55,8 +56,8 @@ def launch(
     dag = dag_utils.convert_entrypoint_to_dag(entrypoint)
     if not dag.is_chain():
         with ux_utils.print_exception_no_traceback():
-            raise ValueError('Only single-task or chain DAG is allowed for '
-                             f'sky.spot.launch. Dag:\n{dag}')
+            raise ValueError('Only single-task or chain DAG is '
+                             f'allowed for job_launch. Dag: {dag}')

     dag_utils.maybe_infer_and_fill_dag_and_task_names(dag)
@@ -71,56 +72,58 @@ def launch(
                 'will be auto-generated) .')
             task_names.add(task_.name)

-    dag_utils.fill_default_spot_config_in_dag_for_spot_launch(dag)
+    dag_utils.fill_default_config_in_dag_for_job_launch(dag)

     for task_ in dag.tasks:
         controller_utils.maybe_translate_local_file_mounts_and_sync_up(
-            task_, path='spot')
+            task_, path='jobs')

-    with tempfile.NamedTemporaryFile(prefix=f'spot-dag-{dag.name}-',
+    with tempfile.NamedTemporaryFile(prefix=f'managed-dag-{dag.name}-',
                                      mode='w') as f:
         dag_utils.dump_chain_dag_to_yaml(dag, f.name)
-        controller_name = spot_utils.SPOT_CONTROLLER_NAME
-        prefix = constants.SPOT_TASK_YAML_PREFIX
+        controller = controller_utils.Controllers.JOBS_CONTROLLER
+        controller_name = controller.value.cluster_name
+        prefix = managed_job_constants.JOBS_TASK_YAML_PREFIX
         remote_user_yaml_path = f'{prefix}/{dag.name}-{dag_uuid}.yaml'
         remote_user_config_path = f'{prefix}/{dag.name}-{dag_uuid}.config_yaml'
         controller_resources = controller_utils.get_controller_resources(
-            controller_type='spot',
-            controller_resources_config=constants.CONTROLLER_RESOURCES)
+            controller=controller_utils.Controllers.JOBS_CONTROLLER,
+            task_resources=sum([list(t.resources) for t in dag.tasks], []))

         vars_to_fill = {
             'remote_user_yaml_path': remote_user_yaml_path,
             'user_yaml_path': f.name,
-            'spot_controller': controller_name,
-            # Note: actual spot cluster name will be <task.name>-<job ID>
+            'jobs_controller': controller_name,
+            # Note: actual cluster name will be <task.name>-<job ID>
             'dag_name': dag.name,
             'retry_until_up': retry_until_up,
             'remote_user_config_path': remote_user_config_path,
-            'sky_python_cmd': skylet_constants.SKY_PYTHON_CMD,
             'modified_catalogs':
                 service_catalog_common.get_modified_catalog_file_mounts(),
             **controller_utils.shared_controller_vars_to_fill(
-                'spot',
+                controller_utils.Controllers.JOBS_CONTROLLER,
                 remote_user_config_path=remote_user_config_path,
             ),
         }

-        yaml_path = os.path.join(constants.SPOT_CONTROLLER_YAML_PREFIX,
-                                 f'{name}-{dag_uuid}.yaml')
-        common_utils.fill_template(constants.SPOT_CONTROLLER_TEMPLATE,
-                                   vars_to_fill,
-                                   output_path=yaml_path)
+        yaml_path = os.path.join(
+            managed_job_constants.JOBS_CONTROLLER_YAML_PREFIX,
+            f'{name}-{dag_uuid}.yaml')
+        common_utils.fill_template(
+            managed_job_constants.JOBS_CONTROLLER_TEMPLATE,
+            vars_to_fill,
+            output_path=yaml_path)
         controller_task = task_lib.Task.from_yaml(yaml_path)
         controller_task.set_resources(controller_resources)

-        controller_task.spot_dag = dag
-        assert len(controller_task.resources) == 1
+        controller_task.managed_job_dag = dag
+        assert len(controller_task.resources) == 1, controller_task

         sky_logging.print(
             f'{colorama.Fore.YELLOW}'
-            f'Launching managed spot job {dag.name!r} from spot controller...'
+            f'Launching managed job {dag.name!r} from jobs controller...'
             f'{colorama.Style.RESET_ALL}')
-        sky_logging.print('Launching spot controller...')
+        sky_logging.print('Launching jobs controller...')
         sky.launch(task=controller_task,
                    stream_logs=stream_logs,
                    cluster_name=controller_name,
@@ -134,9 +137,9 @@ def launch(

 @usage_lib.entrypoint
 def queue(refresh: bool, skip_finished: bool = False) -> List[Dict[str, Any]]:
     # NOTE(dev): Keep the docstring consistent between the Python API and CLI.
-    """Get statuses of managed spot jobs.
+    """Get statuses of managed jobs.

-    Please refer to the sky.cli.spot_queue for the documentation.
+    Please refer to sky.cli.job_queue for documentation.
     Returns:
         [
@@ -148,23 +151,23 @@ def queue(refresh: bool, skip_finished: bool = False) -> List[Dict[str, Any]]:
                 'end_at': (float) timestamp of end,
                 'duration': (float) duration in seconds,
                 'recovery_count': (int) Number of retries,
-                'status': (sky.spot.SpotStatus) of the job,
+                'status': (sky.jobs.ManagedJobStatus) of the job,
                 'cluster_resources': (str) resources of the cluster,
                 'region': (str) region of the cluster,
             }
         ]

     Raises:
-        sky.exceptions.ClusterNotUpError: the spot controller is not up or
+        sky.exceptions.ClusterNotUpError: the jobs controller is not up or
             does not exist.
-        RuntimeError: if failed to get the spot jobs with ssh.
+        RuntimeError: if it fails to get the managed jobs over SSH.
     """
+    jobs_controller_type = controller_utils.Controllers.JOBS_CONTROLLER
     stopped_message = ''
     if not refresh:
-        stopped_message = 'No in-progress spot jobs.'
+        stopped_message = 'No in-progress managed jobs.'
     try:
         handle = backend_utils.is_controller_accessible(
-            controller_type=controller_utils.Controllers.SPOT_CONTROLLER,
-            stopped_message=stopped_message)
+            controller=jobs_controller_type, stopped_message=stopped_message)
     except exceptions.ClusterNotUpError as e:
         if not refresh:
             raise
@@ -176,18 +179,19 @@ def queue(refresh: bool, skip_finished: bool = False) -> List[Dict[str, Any]]:
             'Restarting controller for latest status...'
             f'{colorama.Style.RESET_ALL}')

-        rich_utils.force_update_status('[cyan] Checking spot jobs - restarting '
-                                       'controller[/]')
-        handle = sky.start(spot_utils.SPOT_CONTROLLER_NAME)
+        rich_utils.force_update_status(
+            '[cyan] Checking managed jobs - restarting '
+            'controller[/]')
+        handle = sky.start(jobs_controller_type.value.cluster_name)
         controller_status = status_lib.ClusterStatus.UP
-        rich_utils.force_update_status('[cyan] Checking spot jobs[/]')
+        rich_utils.force_update_status('[cyan] Checking managed jobs[/]')

     assert handle is not None, (controller_status, refresh)

     backend = backend_utils.get_backend_from_handle(handle)
     assert isinstance(backend, backends.CloudVmRayBackend)

-    code = spot_utils.SpotCodeGen.get_job_table()
+    code = managed_job_utils.ManagedJobCodeGen.get_job_table()
     returncode, job_table_payload, stderr = backend.run_on_head(
         handle,
         code,
@@ -198,13 +202,13 @@ def queue(refresh: bool, skip_finished: bool = False) -> List[Dict[str, Any]]:
     try:
         subprocess_utils.handle_returncode(returncode,
                                            code,
-                                           'Failed to fetch managed spot jobs',
+                                           'Failed to fetch managed jobs',
                                            job_table_payload + stderr,
                                            stream_logs=False)
     except exceptions.CommandError as e:
         raise RuntimeError(str(e)) from e

-    jobs = spot_utils.load_spot_job_queue(job_table_payload)
+    jobs = managed_job_utils.load_managed_job_queue(job_table_payload)
     if skip_finished:
         # Filter out the finished jobs. If a multi-task job is partially
         # finished, we will include all its tasks.
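Given the record format documented above, a caller can post-process the queue the same way the skip_finished filter does. A small hedged sketch of consuming the returned list; the field names come from the docstring, and the .is_terminal() check on the status enum mirrors how statuses are used elsewhere in this diff:

from sky import jobs

# Fetch all records, including finished ones (refresh=False assumes the
# jobs controller is already UP).
records = jobs.queue(refresh=False, skip_finished=False)

# Mirror the skip_finished filter: keep jobs that are still in progress.
in_progress = [r for r in records if not r['status'].is_terminal()]
for r in in_progress:
    print(f"{r['job_id']}: recoveries={r['recovery_count']}, "
          f"status={r['status']}")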
""" job_ids = [] if job_ids is None else job_ids handle = backend_utils.is_controller_accessible( - controller_type=controller_utils.Controllers.SPOT_CONTROLLER, - stopped_message='All managed spot jobs should have finished.') + controller=controller_utils.Controllers.JOBS_CONTROLLER, + stopped_message='All managed jobs should have finished.') job_id_str = ','.join(map(str, job_ids)) if sum([len(job_ids) > 0, name is not None, all]) != 1: @@ -247,12 +251,12 @@ def cancel(name: Optional[str] = None, backend = backend_utils.get_backend_from_handle(handle) assert isinstance(backend, backends.CloudVmRayBackend) if all: - code = spot_utils.SpotCodeGen.cancel_jobs_by_id(None) + code = managed_job_utils.ManagedJobCodeGen.cancel_jobs_by_id(None) elif job_ids: - code = spot_utils.SpotCodeGen.cancel_jobs_by_id(job_ids) + code = managed_job_utils.ManagedJobCodeGen.cancel_jobs_by_id(job_ids) else: assert name is not None, (job_ids, name, all) - code = spot_utils.SpotCodeGen.cancel_job_by_name(name) + code = managed_job_utils.ManagedJobCodeGen.cancel_job_by_name(name) # The stderr is redirected to stdout returncode, stdout, _ = backend.run_on_head(handle, code, @@ -260,7 +264,7 @@ def cancel(name: Optional[str] = None, stream_logs=False) try: subprocess_utils.handle_returncode(returncode, code, - 'Failed to cancel managed spot job', + 'Failed to cancel managed job', stdout) except exceptions.CommandError as e: with ux_utils.print_exception_no_traceback(): @@ -274,44 +278,53 @@ def cancel(name: Optional[str] = None, @usage_lib.entrypoint -def tail_logs(name: Optional[str], job_id: Optional[int], follow: bool) -> None: +def tail_logs(name: Optional[str], job_id: Optional[int], follow: bool, + controller: bool) -> None: # NOTE(dev): Keep the docstring consistent between the Python API and CLI. - """Tail logs of managed spot jobs. + """Tail logs of managed jobs. - Please refer to the sky.cli.spot_logs for the document. + Please refer to sky.cli.job_logs for documentation. Raises: ValueError: invalid arguments. - sky.exceptions.ClusterNotUpError: the spot controller is not up. + sky.exceptions.ClusterNotUpError: the jobs controller is not up. 
""" - # TODO(zhwu): Automatically restart the spot controller + # TODO(zhwu): Automatically restart the jobs controller + jobs_controller_type = controller_utils.Controllers.JOBS_CONTROLLER handle = backend_utils.is_controller_accessible( - controller_type=controller_utils.Controllers.SPOT_CONTROLLER, - stopped_message=('Please restart the spot controller with ' - f'`sky start {spot_utils.SPOT_CONTROLLER_NAME}`.')) + controller=jobs_controller_type, + stopped_message=( + 'Please restart the jobs controller with ' + f'`sky start {jobs_controller_type.value.cluster_name}`.')) if name is not None and job_id is not None: raise ValueError('Cannot specify both name and job_id.') backend = backend_utils.get_backend_from_handle(handle) assert isinstance(backend, backends.CloudVmRayBackend), backend - # Stream the realtime logs - backend.tail_spot_logs(handle, job_id=job_id, job_name=name, follow=follow) + backend.tail_managed_job_logs(handle, + job_id=job_id, + job_name=name, + follow=follow, + controller=controller) -spot_launch = common_utils.deprecated_function(launch, - name='sky.spot.launch', - deprecated_name='spot_launch', - removing_version='0.7.0') + +spot_launch = common_utils.deprecated_function( + launch, + name='sky.jobs.launch', + deprecated_name='spot_launch', + removing_version='0.8.0', + override_argument={'use_spot': True}) spot_queue = common_utils.deprecated_function(queue, - name='sky.spot.queue', + name='sky.jobs.queue', deprecated_name='spot_queue', - removing_version='0.7.0') + removing_version='0.8.0') spot_cancel = common_utils.deprecated_function(cancel, - name='sky.spot.cancel', + name='sky.jobs.cancel', deprecated_name='spot_cancel', - removing_version='0.7.0') + removing_version='0.8.0') spot_tail_logs = common_utils.deprecated_function( tail_logs, - name='sky.spot.tail_logs', + name='sky.jobs.tail_logs', deprecated_name='spot_tail_logs', - removing_version='0.7.0') + removing_version='0.8.0') diff --git a/sky/jobs/dashboard/dashboard.py b/sky/jobs/dashboard/dashboard.py new file mode 100644 index 00000000000..89c97274646 --- /dev/null +++ b/sky/jobs/dashboard/dashboard.py @@ -0,0 +1,87 @@ +"""Dashboard for managed jobs based on Flask. + +TODO(zongheng): This is a basic version. In the future we can beef up the web +frameworks used (e.g., +https://github.com/ray-project/ray/tree/master/dashboard/client/src) and/or get +rid of the SSH port-forwarding business (see cli.py's job_dashboard() +comment). +""" +import datetime +import pathlib + +import flask +import yaml + +from sky import jobs as managed_jobs +from sky.utils import common_utils +from sky.utils import controller_utils + +app = flask.Flask(__name__) + + +def _is_running_on_jobs_controller() -> bool: + """Am I running on jobs controller? + + Loads ~/.sky/sky_ray.yml and check cluster_name. + """ + if pathlib.Path('~/.sky/sky_ray.yml').expanduser().exists(): + config = yaml.safe_load( + pathlib.Path('~/.sky/sky_ray.yml').expanduser().read_text()) + cluster_name = config.get('cluster_name', '') + candidate_controller_names = ( + controller_utils.Controllers.JOBS_CONTROLLER.value. + candidate_cluster_names) + # We use startswith instead of exact match because the cluster name in + # the yaml file is cluster_name_on_cloud which may have additional + # suffices. + return any( + cluster_name.startswith(name) + for name in candidate_controller_names) + return False + + +@app.route('/') +def home(): + if not _is_running_on_jobs_controller(): + # Experimental: run on laptop (refresh is very slow). 
diff --git a/sky/jobs/dashboard/dashboard.py b/sky/jobs/dashboard/dashboard.py
new file mode 100644
index 00000000000..89c97274646
--- /dev/null
+++ b/sky/jobs/dashboard/dashboard.py
@@ -0,0 +1,87 @@
+"""Dashboard for managed jobs based on Flask.
+
+TODO(zongheng): This is a basic version. In the future we can beef up the web
+frameworks used (e.g.,
+https://github.com/ray-project/ray/tree/master/dashboard/client/src) and/or get
+rid of the SSH port-forwarding business (see cli.py's job_dashboard()
+comment).
+"""
+import datetime
+import pathlib
+
+import flask
+import yaml
+
+from sky import jobs as managed_jobs
+from sky.utils import common_utils
+from sky.utils import controller_utils
+
+app = flask.Flask(__name__)
+
+
+def _is_running_on_jobs_controller() -> bool:
+    """Am I running on the jobs controller?
+
+    Loads ~/.sky/sky_ray.yml and checks cluster_name.
+    """
+    if pathlib.Path('~/.sky/sky_ray.yml').expanduser().exists():
+        config = yaml.safe_load(
+            pathlib.Path('~/.sky/sky_ray.yml').expanduser().read_text())
+        cluster_name = config.get('cluster_name', '')
+        candidate_controller_names = (
+            controller_utils.Controllers.JOBS_CONTROLLER.value.
+            candidate_cluster_names)
+        # We use startswith instead of exact match because the cluster name in
+        # the yaml file is cluster_name_on_cloud, which may have additional
+        # suffixes.
+        return any(
+            cluster_name.startswith(name)
+            for name in candidate_controller_names)
+    return False
+
+
+@app.route('/')
+def home():
+    if not _is_running_on_jobs_controller():
+        # Experimental: run on laptop (refresh is very slow).
+        all_managed_jobs = managed_jobs.queue(refresh=True, skip_finished=False)
+    else:
+        job_table = managed_jobs.dump_managed_job_queue()
+        all_managed_jobs = managed_jobs.load_managed_job_queue(job_table)
+
+    timestamp = datetime.datetime.now(datetime.timezone.utc)
+    rows = managed_jobs.format_job_table(all_managed_jobs,
+                                         show_all=True,
+                                         return_rows=True)
+    # Add an empty column for the dropdown button. This will be added in the
+    # jobs/templates/index.html file.
+    rows = [[''] + row for row in rows]
+
+    # FIXME(zongheng): make the job table/queue funcs return structured info so
+    # that we don't have to do things like row[-5] below.
+    columns = [
+        '', 'ID', 'Task', 'Name', 'Resources', 'Submitted', 'Total Duration',
+        'Job Duration', 'Recoveries', 'Status', 'Started', 'Cluster', 'Region',
+        'Failure'
+    ]
+    if rows and len(rows[0]) != len(columns):
+        raise RuntimeError(
+            'Dashboard code and managed job queue code are out of sync.')
+
+    # Fix STATUS color codes: '\x1b[33mCANCELLED\x1b[0m' -> 'CANCELLED'.
+    for row in rows:
+        row[-5] = common_utils.remove_color(row[-5])
+    # Remove filler rows ([''], ..., ['-']).
+    rows = [row for row in rows if ''.join(map(str, row)) != '']
+
+    rendered_html = flask.render_template(
+        'index.html',
+        columns=columns,
+        rows=rows,
+        last_updated_timestamp=timestamp,
+    )
+    return rendered_html
+
+
+if __name__ == '__main__':
+    app.run()
diff --git a/sky/spot/dashboard/static/favicon.ico b/sky/jobs/dashboard/static/favicon.ico
similarity index 100%
rename from sky/spot/dashboard/static/favicon.ico
rename to sky/jobs/dashboard/static/favicon.ico
diff --git a/sky/spot/dashboard/templates/index.html b/sky/jobs/dashboard/templates/index.html
similarity index 50%
rename from sky/spot/dashboard/templates/index.html
rename to sky/jobs/dashboard/templates/index.html
index f8267cd3e5f..af4f5708bce 100644
--- a/sky/spot/dashboard/templates/index.html
+++ b/sky/jobs/dashboard/templates/index.html
@@ -6,7 +6,8 @@
   <title>SkyPilot Dashboard</title>
-
+
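The STATUS fix in home() relies on common_utils.remove_color. A self-contained sketch of the same idea, stripping ANSI SGR escape sequences with a regex; this is an illustration, not SkyPilot's actual implementation:

import re

# Matches ANSI escape sequences such as '\x1b[33m' and '\x1b[0m'.
_ANSI_ESCAPE = re.compile(r'\x1b\[[0-9;]*m')


def remove_color(text: str) -> str:
    """Strip ANSI color codes, e.g. '\x1b[33mCANCELLED\x1b[0m' -> 'CANCELLED'."""
    return _ANSI_ESCAPE.sub('', text)


assert remove_color('\x1b[33mCANCELLED\x1b[0m') == 'CANCELLED'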