From 364831959b40380330cd4787f1b9fc26f0e6f095 Mon Sep 17 00:00:00 2001
From: Zongheng Yang
Date: Fri, 26 Jul 2024 16:29:48 -0700
Subject: [PATCH] Docs revamp: polishing landing docs.

---
 README.md | 38 ++++++-----
 docs/source/docs/index.rst | 66 ++++++++++++-------
 .../examples/interactive-development.rst | 6 +-
 3 files changed, 68 insertions(+), 42 deletions(-)

diff --git a/README.md b/README.md
index a6f1df49c91..140b44ff7d2 100644
--- a/README.md
+++ b/README.md
@@ -20,9 +20,8 @@

-

- Run LLMs and AI on Any Cloud
+ Run AI on Any Infra — Unified, Faster, Cheaper

 ----
@@ -38,11 +37,11 @@
 - [Nov, 2023] Using [**Axolotl**](https://github.com/OpenAccess-AI-Collective/axolotl) to finetune Mistral 7B on the cloud (on-demand and spot): [**example**](./llm/axolotl/)
 - [Sep, 2023] Case study: [**Covariant**](https://covariant.ai/) transformed AI development on the cloud using SkyPilot, delivering models 4x faster cost-effectively: [**read the case study**](https://blog.skypilot.co/covariant/)
 - [Aug, 2023] **Finetuning Cookbook**: Finetuning Llama 2 in your own cloud environment, privately: [**example**](./llm/vicuna-llama-2/), [**blog post**](https://blog.skypilot.co/finetuning-llama2-operational-guide/)
-- [June, 2023] Serving LLM 24x Faster On the Cloud [**with vLLM**](https://vllm.ai/) and SkyPilot: [**example**](./llm/vllm/), [**blog post**](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/)
 Archived
+- [June, 2023] Serving LLM 24x Faster On the Cloud [**with vLLM**](https://vllm.ai/) and SkyPilot: [**example**](./llm/vllm/), [**blog post**](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/)
 - [Mar, 2024] Serve and deploy [**Databricks DBRX**](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) on your infra: [**example**](./llm/dbrx/)
 - [Feb, 2024] Speed up your LLM deployments with [**SGLang**](https://github.com/sgl-project/sglang) for 5x throughput on SkyServe: [**example**](./llm/sglang/)
 - [Dec, 2023] Using [**LoRAX**](https://github.com/predibase/lorax) to serve 1000s of finetuned LLMs on a single instance in the cloud: [**example**](./llm/lorax/)
@@ -54,33 +53,35 @@
 ----
-SkyPilot is a framework for running LLMs, AI, and batch jobs on any cloud, offering maximum cost savings, highest GPU availability, and managed execution.
+SkyPilot is a framework for running AI and batch workloads on any infra, offering unified execution, high cost savings, and high GPU availability.
-SkyPilot **abstracts away cloud infra burdens**:
-- Launch jobs & clusters on any cloud
-- Easy scale-out: queue and run many jobs, automatically managed
-- Easy access to object stores (S3, GCS, Azure, R2, IBM)
+SkyPilot **abstracts away infra burdens**:
+- Launch [dev clusters](https://skypilot.readthedocs.io/en/latest/examples/interactive-development.html), [jobs](https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html), and [serving](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html) on any infra
+- Easy job management: queue, run, and auto-recover many jobs
-SkyPilot **maximizes GPU availability for your jobs**:
-* Provision in all zones/regions/clouds you have access to ([the _Sky_](https://arxiv.org/abs/2205.07147)), with automatic failover
+SkyPilot **supports multiple clusters, clouds, and hardware** ([the Sky](https://arxiv.org/abs/2205.07147)):
+- Bring your reserved GPUs, Kubernetes clusters, or 12+ clouds
+- [Flexible provisioning](https://skypilot.readthedocs.io/en/latest/examples/auto-failover.html) of GPUs, TPUs, CPUs, with auto-retry
-SkyPilot **cuts your cloud costs**:
-* [Managed Spot](https://skypilot.readthedocs.io/en/latest/examples/spot-jobs.html): 3-6x cost savings using spot VMs, with auto-recovery from preemptions
-* [Optimizer](https://skypilot.readthedocs.io/en/latest/examples/auto-failover.html): 2x cost savings by auto-picking the cheapest VM/zone/region/cloud
-* [Autostop](https://skypilot.readthedocs.io/en/latest/reference/auto-stop.html): hands-free cleanup of idle clusters
+SkyPilot **cuts your cloud costs & maximizes GPU availability**:
+* [Autostop](https://skypilot.readthedocs.io/en/latest/reference/auto-stop.html): automatic cleanup of idle resources
+* [Managed Spot](https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html): 3-6x cost savings using spot instances, with preemption auto-recovery
+* [Optimizer](https://skypilot.readthedocs.io/en/latest/examples/auto-failover.html): 2x cost savings by auto-picking the cheapest & most available infra
 SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code changes.
 Install with pip:
 ```bash
-pip install -U "skypilot[aws,gcp,azure,oci,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp,kubernetes]" # choose your clouds
+# Choose your clouds:
+pip install -U "skypilot[kubernetes,aws,gcp,azure,oci,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp]"
 ```
 To get the latest features and fixes, use the nightly build or [install from source](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html):
 ```bash
-pip install "skypilot-nightly[aws,gcp,azure,oci,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp,kubernetes]" # choose your clouds
+# Choose your clouds:
+pip install "skypilot-nightly[kubernetes,aws,gcp,azure,oci,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp]"
 ```
-Current supported providers (AWS, Azure, GCP, OCI, Lambda Cloud, RunPod, Fluidstack, Paperspace, Cudo, IBM, Samsung, Cloudflare, any Kubernetes cluster):
+[Current supported infra](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html) (Kubernetes; AWS, GCP, Azure, OCI, Lambda Cloud, Fluidstack, RunPod, Cudo, Paperspace, Cloudflare, Samsung, IBM, VMware vSphere):

@@ -155,6 +156,7 @@ To learn more, see our [Documentation](https://skypilot.readthedocs.io/en/latest
 Runnable examples:
 - LLMs on SkyPilot
+  - [Llama 3.1 finetuning](./llm/llama-3_1-finetuning/) and [serving](./llm/llama-3_1/)
   - [GPT-2 via `llm.c`](./llm/gpt-2/)
   - [Llama 3](./llm/llama-3/)
   - [Qwen](./llm/qwen/)
@@ -177,6 +179,8 @@ Runnable examples:
   - Add yours here & see more in [`llm/`](./llm)!
 - Framework examples: [PyTorch DDP](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml), [DeepSpeed](./examples/deepspeed-multinode/sky.yaml), [JAX/Flax on TPU](https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml), [Stable Diffusion](https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion), [Detectron2](https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml), [Distributed](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py) [TensorFlow](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml), [Ray Train](examples/distributed_ray_train/ray_train.yaml), [NeMo](https://github.com/skypilot-org/skypilot/blob/master/examples/nemo/nemo.yaml), [programmatic grid search](https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py), [Docker](https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml), [Cog](https://github.com/skypilot-org/skypilot/blob/master/examples/cog/), [Unsloth](https://github.com/skypilot-org/skypilot/blob/master/examples/unsloth/unsloth.yaml), [Ollama](https://github.com/skypilot-org/skypilot/blob/master/llm/ollama), [llm.c](https://github.com/skypilot-org/skypilot/tree/master/llm/gpt-2) and [many more (`examples/`)](./examples).
+Case Studies and Integrations: [Community Spotlights](https://blog.skypilot.co/community/)
+
 Follow updates:
 - [Twitter](https://twitter.com/skypilot_org)
 - [Slack](http://slack.skypilot.co)
diff --git a/docs/source/docs/index.rst b/docs/source/docs/index.rst
index 5b8d144af70..c22bc982c1b 100644
--- a/docs/source/docs/index.rst
+++ b/docs/source/docs/index.rst
@@ -2,52 +2,53 @@ Welcome to SkyPilot!
 ====================
 .. image:: /_static/SkyPilot_wide_dark.svg
-   :width: 65%
+   :width: 50%
    :align: center
    :alt: SkyPilot
    :class: no-scaled-link, only-dark
 .. image:: /_static/SkyPilot_wide_light.svg
-   :width: 60%
+   :width: 50%
    :align: center
    :alt: SkyPilot
    :class: no-scaled-link, only-light
+
 .. raw:: html
+

+

+ Run AI on Any Infra — Unified, Faster, Cheaper
+

 Star Watch Fork
-

-

- Run LLMs and AI on Any Cloud
-

-SkyPilot is a framework for running LLMs, AI, and batch jobs on any cloud, offering maximum cost savings, highest GPU availability, and managed execution.
+SkyPilot is a framework for running AI and batch workloads on any infra, offering unified execution, high cost savings, and high GPU availability.
-SkyPilot **abstracts away cloud infra burdens**:
+SkyPilot **abstracts away infra burdens**:
-- Launch jobs & clusters on any cloud
-- Easy scale-out: queue and run many jobs, automatically managed
-- Easy access to object stores (S3, GCS, Azure, R2, IBM)
+- Launch :ref:`dev clusters `, :ref:`jobs `, and :ref:`serving ` on any infra
+- Easy job management: queue, run, and auto-recover many jobs
-SkyPilot **maximizes GPU availability for your jobs**:
+SkyPilot **supports multiple clusters, clouds, and hardware** (`the Sky `_):
-* Provision in all zones/regions/clouds you have access to (`the Sky `_), with automatic failover
+- Bring your reserved GPUs, Kubernetes clusters, or 12+ clouds
+- :ref:`Flexible provisioning ` of GPUs, TPUs, CPUs, with auto-retry
-SkyPilot **cuts your cloud costs**:
+SkyPilot **cuts your cloud costs & maximizes GPU availability**:
-* `Managed Spot `_: 3-6x cost savings using spot VMs, with auto-recovery from preemptions
-* Optimizer: 2x cost savings by auto-picking the cheapest VM/zone/region/cloud
-* `Autostop `_: hands-free cleanup of idle clusters
+* :ref:`Autostop `: automatic cleanup of idle resources
+* :ref:`Managed Spot `: 3-6x cost savings using spot instances, with preemption auto-recovery
+* :ref:`Optimizer `: 2x cost savings by auto-picking the cheapest & most available infra
 SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code changes.
-Current supported providers (AWS, GCP, Azure, OCI, Lambda Cloud, RunPod, Fluidstack, Cudo, IBM, Samsung, Cloudflare, VMware vSphere, any Kubernetes cluster):
+:ref:`Current supported infra ` (Kubernetes; AWS, GCP, Azure, OCI, Lambda Cloud, Fluidstack, RunPod, Cudo, Paperspace, Cloudflare, Samsung, IBM, VMware vSphere):
 .. raw:: html
@@ -58,10 +59,18 @@ Current supported providers (AWS, GCP, Azure, OCI, Lambda Cloud, RunPod, Fluidst

-More Information
---------------------------
+Ready to get started?
+----------------------
-Tutorials: `SkyPilot Tutorials `_
+:ref:`Install SkyPilot ` in ~1 minute. Then, launch your first cluster in ~5 minutes in :ref:`Quickstart `.
+
+Contact the SkyPilot team
+---------------------------------
+
+You can chat with the SkyPilot team and community on the `SkyPilot Slack `_.
+
+Learn more
+--------------------------
 Runnable examples:
@@ -93,6 +102,10 @@ Runnable examples:
 * Framework examples: `PyTorch DDP `_, `DeepSpeed `_, `JAX/Flax on TPU `_, `Stable Diffusion `_, `Detectron2 `_, `Distributed `_ `TensorFlow `_, `NeMo `_, `programmatic grid search `_, `Docker `_, `Cog `_, `Unsloth `_, `Ollama `_, `llm.c `__ and `many more `_.
+Case Studies and Integrations: `Community Spotlights `_
+
+Tutorials: `SkyPilot Tutorials `_
+
 Follow updates:
 * `Twitter `_
@@ -106,10 +119,9 @@ Read the research:
 * `Sky Computing vision paper `_ (HotOS 2021)
-Contents
---------
 .. toctree::
+   :hidden:
    :maxdepth: 1
    :caption: Getting Started
@@ -120,6 +132,7 @@ Contents
 .. toctree::
+   :hidden:
    :maxdepth: 1
    :caption: Running Jobs
@@ -130,6 +143,7 @@ Contents
    ../running-jobs/distributed-jobs
 .. toctree::
+   :hidden:
    :maxdepth: 1
    :caption: SkyServe: Model Serving
@@ -138,6 +152,7 @@ Contents
    ../serving/service-yaml-spec
 .. toctree::
+   :hidden:
    :maxdepth: 1
    :caption: Cutting Cloud Costs
@@ -146,6 +161,7 @@ Contents
    ../reference/benchmark/index
 .. toctree::
+   :hidden:
    :maxdepth: 1
    :caption: Using Data
@@ -153,6 +169,7 @@ Contents
    ../reference/storage
 .. toctree::
+   :hidden:
    :maxdepth: 1
    :caption: User Guides
@@ -165,6 +182,7 @@ Contents
 .. toctree::
+   :hidden:
    :maxdepth: 1
    :caption: Developer Guides
@@ -172,6 +190,7 @@ Contents
    Guide: Adding a New Cloud
 .. toctree::
+   :hidden:
    :maxdepth: 1
    :caption: Cloud Admin and Usage
@@ -180,6 +199,7 @@ Contents
    ../cloud-setup/quota
 .. toctree::
+   :hidden:
    :maxdepth: 1
    :caption: References
diff --git a/docs/source/examples/interactive-development.rst b/docs/source/examples/interactive-development.rst
index f9c4b5c4803..cc50f8e6ea8 100644
--- a/docs/source/examples/interactive-development.rst
+++ b/docs/source/examples/interactive-development.rst
@@ -1,3 +1,5 @@
+.. _dev-cluster:
+
 Start a Development Cluster
 ===========================
@@ -90,7 +92,7 @@ SSH
 SkyPilot will automatically configure the SSH setting for a cluster, so that users can connect to the cluster with the cluster name:
 .. code-block:: bash
-
+
    ssh dev
@@ -99,7 +101,7 @@ SkyPilot will automatically configure the SSH setting for a cluster, so that use
 VSCode
 ~~~~~~
-A common use case for interactive development is to connect a local IDE to a remote cluster and directly edit code that lives on the cluster.
+A common use case for interactive development is to connect a local IDE to a remote cluster and directly edit code that lives on the cluster.
 This is supported by simply connecting VSCode to the cluster with the cluster name:
 #. Click on the top bar, type: :code:`> remote-ssh`, and select :code:`Remote-SSH: Connect Current Window to Host...`