diff --git a/.buildkite/generate_pipeline.py b/.buildkite/generate_pipeline.py
index 2b0f1cec788..99f29ee258a 100644
--- a/.buildkite/generate_pipeline.py
+++ b/.buildkite/generate_pipeline.py
@@ -5,7 +5,7 @@
tests/smoke_tests
├── test_*.py -> release pipeline
-├── test_pre_merge.py -> pre-merge pipeline
+├── test_quick_tests_core.py -> run quick tests on PR before merging
run `PYTHONPATH=$(pwd)/tests:$PYTHONPATH python .buildkite/generate_pipeline.py`
to generate the pipeline for testing. The CI will run this script as a pre-step,
@@ -208,8 +208,8 @@ def _convert_release(test_files: List[str]):
extra_env={cloud: '1' for cloud in CLOUD_QUEUE_MAP})
-def _convert_pre_merge(test_files: List[str]):
- yaml_file_path = '.buildkite/pipeline_smoke_tests_pre_merge.yaml'
+def _convert_quick_tests_core(test_files: List[str]):
+ yaml_file_path = '.buildkite/pipeline_smoke_tests_quick_tests_core.yaml'
output_file_pipelines = []
for test_file in test_files:
print(f'Converting {test_file} to {yaml_file_path}')
@@ -234,18 +234,18 @@ def _convert_pre_merge(test_files: List[str]):
def main():
test_files = os.listdir('tests/smoke_tests')
release_files = []
- pre_merge_files = []
+ quick_tests_core_files = []
for test_file in test_files:
if not test_file.startswith('test_'):
continue
test_file_path = os.path.join('tests/smoke_tests', test_file)
- if "test_pre_merge" in test_file:
- pre_merge_files.append(test_file_path)
+ if "test_quick_tests_core" in test_file:
+ quick_tests_core_files.append(test_file_path)
else:
release_files.append(test_file_path)
_convert_release(release_files)
- _convert_pre_merge(pre_merge_files)
+ _convert_quick_tests_core(quick_tests_core_files)
if __name__ == '__main__':
diff --git a/README.md b/README.md
index f29b57be9ca..1ed99325df5 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@
-
+
@@ -43,7 +43,7 @@
Archived
- [Jul 2024] [**Finetune**](./llm/llama-3_1-finetuning/) and [**serve**](./llm/llama-3_1/) **Llama 3.1** on your infra
-- [Apr 2024] Serve and finetune [**Llama 3**](https://skypilot.readthedocs.io/en/latest/gallery/llms/llama-3.html) on any cloud or Kubernetes: [**example**](./llm/llama-3/)
+- [Apr 2024] Serve and finetune [**Llama 3**](https://docs.skypilot.co/en/latest/gallery/llms/llama-3.html) on any cloud or Kubernetes: [**example**](./llm/llama-3/)
- [Mar 2024] Serve and deploy [**Databricks DBRX**](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) on your infra: [**example**](./llm/dbrx/)
- [Feb 2024] Speed up your LLM deployments with [**SGLang**](https://github.com/sgl-project/sglang) for 5x throughput on SkyServe: [**example**](./llm/sglang/)
- [Dec 2023] Using [**LoRAX**](https://github.com/predibase/lorax) to serve 1000s of finetuned LLMs on a single instance in the cloud: [**example**](./llm/lorax/)
@@ -60,17 +60,17 @@
SkyPilot is a framework for running AI and batch workloads on any infra, offering unified execution, high cost savings, and high GPU availability.
SkyPilot **abstracts away infra burdens**:
-- Launch [dev clusters](https://skypilot.readthedocs.io/en/latest/examples/interactive-development.html), [jobs](https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html), and [serving](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html) on any infra
+- Launch [dev clusters](https://docs.skypilot.co/en/latest/examples/interactive-development.html), [jobs](https://docs.skypilot.co/en/latest/examples/managed-jobs.html), and [serving](https://docs.skypilot.co/en/latest/serving/sky-serve.html) on any infra
- Easy job management: queue, run, and auto-recover many jobs
SkyPilot **supports multiple clusters, clouds, and hardware** ([the Sky](https://arxiv.org/abs/2205.07147)):
- Bring your reserved GPUs, Kubernetes clusters, or 12+ clouds
-- [Flexible provisioning](https://skypilot.readthedocs.io/en/latest/examples/auto-failover.html) of GPUs, TPUs, CPUs, with auto-retry
+- [Flexible provisioning](https://docs.skypilot.co/en/latest/examples/auto-failover.html) of GPUs, TPUs, CPUs, with auto-retry
SkyPilot **cuts your cloud costs & maximizes GPU availability**:
-* [Autostop](https://skypilot.readthedocs.io/en/latest/reference/auto-stop.html): automatic cleanup of idle resources
-* [Managed Spot](https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html): 3-6x cost savings using spot instances, with preemption auto-recovery
-* [Optimizer](https://skypilot.readthedocs.io/en/latest/examples/auto-failover.html): 2x cost savings by auto-picking the cheapest & most available infra
+* [Autostop](https://docs.skypilot.co/en/latest/reference/auto-stop.html): automatic cleanup of idle resources
+* [Managed Spot](https://docs.skypilot.co/en/latest/examples/managed-jobs.html): 3-6x cost savings using spot instances, with preemption auto-recovery
+* [Optimizer](https://docs.skypilot.co/en/latest/examples/auto-failover.html): 2x cost savings by auto-picking the cheapest & most available infra
SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code changes.
@@ -79,13 +79,13 @@ Install with pip:
# Choose your clouds:
pip install -U "skypilot[kubernetes,aws,gcp,azure,oci,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp]"
```
-To get the latest features and fixes, use the nightly build or [install from source](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html):
+To get the latest features and fixes, use the nightly build or [install from source](https://docs.skypilot.co/en/latest/getting-started/installation.html):
```bash
# Choose your clouds:
pip install "skypilot-nightly[kubernetes,aws,gcp,azure,oci,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp]"
```
-[Current supported infra](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html) (Kubernetes; AWS, GCP, Azure, OCI, Lambda Cloud, Fluidstack, RunPod, Cudo, Paperspace, Cloudflare, Samsung, IBM, VMware vSphere):
+[Current supported infra](https://docs.skypilot.co/en/latest/getting-started/installation.html) (Kubernetes; AWS, GCP, Azure, OCI, Lambda Cloud, Fluidstack, RunPod, Cudo, Paperspace, Cloudflare, Samsung, IBM, VMware vSphere):
-Refer to [Quickstart](https://skypilot.readthedocs.io/en/latest/getting-started/quickstart.html) to get started with SkyPilot.
+Refer to [Quickstart](https://docs.skypilot.co/en/latest/getting-started/quickstart.html) to get started with SkyPilot.
## More Information
-To learn more, see [Concept: Sky Computing](https://docs.skypilot.co/en/latest/sky-computing.html), [SkyPilot docs](https://skypilot.readthedocs.io/en/latest/), and [SkyPilot blog](https://blog.skypilot.co/).
+To learn more, see [Concept: Sky Computing](https://docs.skypilot.co/en/latest/sky-computing.html), [SkyPilot docs](https://docs.skypilot.co/en/latest/), and [SkyPilot blog](https://blog.skypilot.co/).
Runnable examples:
diff --git a/docs/requirements-docs.txt b/docs/requirements-docs.txt
index 7627218e451..9bc7052771f 100644
--- a/docs/requirements-docs.txt
+++ b/docs/requirements-docs.txt
@@ -10,6 +10,7 @@ myst-parser==2.0.0
sphinx-autodoc-typehints==1.25.2
sphinx-book-theme==1.1.0
sphinx-togglebutton==0.3.2
+sphinx-notfound-page==1.0.4
sphinxcontrib-applehelp==1.0.7
sphinxcontrib-devhelp==1.0.5
sphinxcontrib-googleanalytics==0.4
diff --git a/docs/source/conf.py b/docs/source/conf.py
index a8ce3270e88..3c0b62c9947 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -41,6 +41,7 @@
'sphinxemoji.sphinxemoji',
'sphinx_design',
'myst_parser',
+ 'notfound.extension',
]
intersphinx_mapping = {
diff --git a/docs/source/examples/managed-jobs.rst b/docs/source/examples/managed-jobs.rst
index 61c33b5c43e..99fa461249d 100644
--- a/docs/source/examples/managed-jobs.rst
+++ b/docs/source/examples/managed-jobs.rst
@@ -499,7 +499,7 @@ To achieve the above, you can specify custom configs in :code:`~/.sky/config.yam
# Specify the disk_size in GB of the jobs controller.
disk_size: 100
-The :code:`resources` field has the same spec as a normal SkyPilot job; see `here `__.
+The :code:`resources` field has the same spec as a normal SkyPilot job; see `here `__.
.. note::
These settings will not take effect if you have an existing controller (either
diff --git a/docs/source/reference/config.rst b/docs/source/reference/config.rst
index 286788625bd..d5ee4d2134a 100644
--- a/docs/source/reference/config.rst
+++ b/docs/source/reference/config.rst
@@ -22,7 +22,7 @@ Available fields and semantics:
#
# These take effects only when a managed jobs controller does not already exist.
#
- # Ref: https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html#customizing-job-controller-resources
+ # Ref: https://docs.skypilot.co/en/latest/examples/managed-jobs.html#customizing-job-controller-resources
jobs:
controller:
resources: # same spec as 'resources' in a task YAML
@@ -478,13 +478,13 @@ Available fields and semantics:
# This must be either: 'loadbalancer', 'ingress' or 'podip'.
#
# loadbalancer: Creates services of type `LoadBalancer` to expose ports.
- # See https://skypilot.readthedocs.io/en/latest/reference/kubernetes/kubernetes-setup.html#loadbalancer-service.
+ # See https://docs.skypilot.co/en/latest/reference/kubernetes/kubernetes-setup.html#loadbalancer-service.
# This mode is supported out of the box on most cloud managed Kubernetes
# environments (e.g., GKE, EKS).
#
# ingress: Creates an ingress and a ClusterIP service for each port opened.
# Requires an Nginx ingress controller to be configured on the Kubernetes cluster.
- # Refer to https://skypilot.readthedocs.io/en/latest/reference/kubernetes/kubernetes-setup.html#nginx-ingress
+ # Refer to https://docs.skypilot.co/en/latest/reference/kubernetes/kubernetes-setup.html#nginx-ingress
# for details on deploying the NGINX ingress controller.
#
# podip: Directly returns the IP address of the pod. This mode does not
@@ -513,7 +513,7 @@ Available fields and semantics:
#
# : The name of a service account to use for all Kubernetes pods.
# This service account must exist in the user's namespace and have all
- # necessary permissions. Refer to https://skypilot.readthedocs.io/en/latest/cloud-setup/cloud-permissions/kubernetes.html
+ # necessary permissions. Refer to https://docs.skypilot.co/en/latest/cloud-setup/cloud-permissions/kubernetes.html
# for details on the roles required by the service account.
#
# Using SERVICE_ACCOUNT or a custom service account only affects Kubernetes
@@ -581,7 +581,7 @@ Available fields and semantics:
# gke: uses cloud.google.com/gke-accelerator label to identify GPUs on nodes
# karpenter: uses karpenter.k8s.aws/instance-gpu-name label to identify GPUs on nodes
# generic: uses skypilot.co/accelerator labels to identify GPUs on nodes
- # Refer to https://skypilot.readthedocs.io/en/latest/reference/kubernetes/kubernetes-setup.html#setting-up-gpu-support
+ # Refer to https://docs.skypilot.co/en/latest/reference/kubernetes/kubernetes-setup.html#setting-up-gpu-support
# for more details on setting up labels for GPU support.
#
# Default: null (no autoscaler, autodetect label format for GPU nodes)
diff --git a/docs/source/reference/kubernetes/index.rst b/docs/source/reference/kubernetes/index.rst
index 89a57862c88..639b5b633ed 100644
--- a/docs/source/reference/kubernetes/index.rst
+++ b/docs/source/reference/kubernetes/index.rst
@@ -39,7 +39,7 @@ Why use SkyPilot on Kubernetes?
.. grid-item-card:: 🖼 Run popular models on Kubernetes
:text-align: center
- Train and serve `Llama-3 `_, `Mixtral `_, and more on your Kubernetes with ready-to-use recipes from the :ref:`AI gallery `.
+ Train and serve `Llama-3 `_, `Mixtral `_, and more on your Kubernetes with ready-to-use recipes from the :ref:`AI gallery `.
.. tab-item:: For Infrastructure Admins
diff --git a/docs/source/reference/yaml-spec.rst b/docs/source/reference/yaml-spec.rst
index 455ee5909c9..0be708305c8 100644
--- a/docs/source/reference/yaml-spec.rst
+++ b/docs/source/reference/yaml-spec.rst
@@ -23,7 +23,7 @@ Available fields:
# which `sky` is called.
#
# To exclude files from syncing, see
- # https://skypilot.readthedocs.io/en/latest/examples/syncing-code-artifacts.html#exclude-uploading-files
+ # https://docs.skypilot.co/en/latest/examples/syncing-code-artifacts.html#exclude-uploading-files
workdir: ~/my-task-code
# Number of nodes (optional; defaults to 1) to launch including the head node.
@@ -357,7 +357,7 @@ In additional to the above fields, SkyPilot also supports the following experime
#
# The following fields can be overridden. Please refer to docs of Advanced
# Configuration for more details of those fields:
- # https://skypilot.readthedocs.io/en/latest/reference/config.html
+ # https://docs.skypilot.co/en/latest/reference/config.html
config_overrides:
docker:
run_options: ...
diff --git a/docs/source/reservations/existing-machines.rst b/docs/source/reservations/existing-machines.rst
index 10962ecd639..717043bfd25 100644
--- a/docs/source/reservations/existing-machines.rst
+++ b/docs/source/reservations/existing-machines.rst
@@ -42,7 +42,7 @@ Prerequisites
**Local machine (typically your laptop):**
* `kubectl `_
-* `SkyPilot `_
+* `SkyPilot `_
**Remote machines (your cluster, optionally with GPUs):**
diff --git a/docs/source/serving/sky-serve.rst b/docs/source/serving/sky-serve.rst
index c00fa427bd6..5a1a913b7ea 100644
--- a/docs/source/serving/sky-serve.rst
+++ b/docs/source/serving/sky-serve.rst
@@ -515,7 +515,7 @@ To achieve the above, you can specify custom configs in :code:`~/.sky/config.yam
# Specify the disk_size in GB of the SkyServe controller.
disk_size: 1024
-The :code:`resources` field has the same spec as a normal SkyPilot job; see `here `__.
+The :code:`resources` field has the same spec as a normal SkyPilot job; see `here `__.
.. note::
These settings will not take effect if you have an existing controller (either
diff --git a/examples/airflow/shared_state/README.md b/examples/airflow/shared_state/README.md
index 5f39471351a..917a45862a7 100644
--- a/examples/airflow/shared_state/README.md
+++ b/examples/airflow/shared_state/README.md
@@ -12,7 +12,7 @@ In this guide, we demonstrate how some simple SkyPilot operations, such as launc
* Airflow installed on a [Kubernetes cluster](https://airflow.apache.org/docs/helm-chart/stable/index.html) or [locally](https://airflow.apache.org/docs/apache-airflow/stable/start.html) (`SequentialExecutor`)
* A Kubernetes cluster to run tasks on. We'll use GKE in this example.
- * You can use our guide on [setting up a Kubernetes cluster](https://skypilot.readthedocs.io/en/latest/reference/kubernetes/kubernetes-setup.html).
+ * You can use our guide on [setting up a Kubernetes cluster](https://docs.skypilot.co/en/latest/reference/kubernetes/kubernetes-setup.html).
* A persistent volume storage class should be available that supports at least `ReadWriteOnce` access mode. GKE has this supported by default.
## Preparing the Kubernetes Cluster
@@ -39,7 +39,7 @@ In this guide, we demonstrate how some simple SkyPilot operations, such as launc
name: sky-airflow-sa
namespace: default
roleRef:
- # For minimal permissions, refer to https://skypilot.readthedocs.io/en/latest/cloud-setup/cloud-permissions/kubernetes.html
+ # For minimal permissions, refer to https://docs.skypilot.co/en/latest/cloud-setup/cloud-permissions/kubernetes.html
kind: ClusterRole
name: cluster-admin
apiGroup: rbac.authorization.k8s.io
@@ -163,7 +163,7 @@ with DAG(dag_id='sky_k8s_example',
## Tips
1. **Persistent Volume**: If you have many concurrent tasks, you may want to use a storage class that supports [`ReadWriteMany`](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes) access mode.
-2. **Cloud credentials**: If you wish to run tasks on different clouds, you can configure cloud credentials in Kubernetes secrets and mount them in the Sky pod defined in the DAG. See [SkyPilot docs on setting up cloud credentials](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#cloud-account-setup) for more on how to configure credentials in the pod.
+2. **Cloud credentials**: If you wish to run tasks on different clouds, you can configure cloud credentials in Kubernetes secrets and mount them in the Sky pod defined in the DAG. See [SkyPilot docs on setting up cloud credentials](https://docs.skypilot.co/en/latest/getting-started/installation.html#cloud-account-setup) for more on how to configure credentials in the pod.
3. **Logging**: All SkyPilot logs are written to container stdout, which is captured as task logs in Airflow and displayed in the UI. You can also write logs to a file and read them in subsequent tasks.
4. **XComs for shared state**: Airflow also provides [XComs](https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html) for cross-task communication. [`sky_k8s_example_xcoms.py`](sky_k8s_example_xcoms.py) demonstrates how to use XComs to share state between tasks.
diff --git a/examples/airflow/training_workflow/README.md b/examples/airflow/training_workflow/README.md
index dad08d8d3b0..71cb10bef50 100644
--- a/examples/airflow/training_workflow/README.md
+++ b/examples/airflow/training_workflow/README.md
@@ -7,7 +7,7 @@ In this guide, we show how a training workflow involving data preprocessing, tra
-**💡 Tip:** SkyPilot also supports defining and running pipelines without Airflow. Check out [Jobs Pipelines](https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html#job-pipelines) for more information.
+**💡 Tip:** SkyPilot also supports defining and running pipelines without Airflow. Check out [Jobs Pipelines](https://docs.skypilot.co/en/latest/examples/managed-jobs.html#job-pipelines) for more information.
## Why use SkyPilot with Airflow?
In AI workflows, **the transition from development to production is hard**.
@@ -24,7 +24,7 @@ production Airflow cluster. Behind the scenes, SkyPilot handles environment setu
Here's how you can use SkyPilot to take your dev workflows to production in Airflow:
1. **Define and test your workflow as SkyPilot tasks**.
- - Use `sky launch` and [Sky VSCode integration](https://skypilot.readthedocs.io/en/latest/examples/interactive-development.html#dev-vscode) to run, debug and iterate on your code.
+ - Use `sky launch` and [Sky VSCode integration](https://docs.skypilot.co/en/latest/examples/interactive-development.html#dev-vscode) to run, debug and iterate on your code.
2. **Orchestrate SkyPilot tasks in Airflow** by invoking `sky launch` on their YAMLs as a task in the Airflow DAG.
- Airflow does the scheduling, logging, and monitoring, while SkyPilot handles the infra setup and task execution.
@@ -34,7 +34,7 @@ Here's how you can use SkyPilot to take your dev workflows to production in Airf
* Airflow installed on a [Kubernetes cluster](https://airflow.apache.org/docs/helm-chart/stable/index.html) or [locally](https://airflow.apache.org/docs/apache-airflow/stable/start.html) (`SequentialExecutor`)
* A Kubernetes cluster to run tasks on. We'll use GKE in this example.
* A Google cloud account with GCS access to store the data for task.
- * Follow [SkyPilot instructions](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#google-cloud-platform-gcp) to set up Google Cloud credentials.
+ * Follow [SkyPilot instructions](https://docs.skypilot.co/en/latest/getting-started/installation.html#google-cloud-platform-gcp) to set up Google Cloud credentials.
## Preparing the Kubernetes Cluster
@@ -60,7 +60,7 @@ Here's how you can use SkyPilot to take your dev workflows to production in Airf
name: sky-airflow-sa
namespace: default
roleRef:
- # For minimal permissions, refer to https://skypilot.readthedocs.io/en/latest/cloud-setup/cloud-permissions/kubernetes.html
+ # For minimal permissions, refer to https://docs.skypilot.co/en/latest/cloud-setup/cloud-permissions/kubernetes.html
kind: ClusterRole
name: cluster-admin
apiGroup: rbac.authorization.k8s.io
@@ -103,7 +103,7 @@ The train and eval step can be run in a similar way:
sky launch -c train --env DATA_BUCKET_URL=gs:// train.yaml
```
-Hint: You can use `ssh` and VSCode to [interactively develop](https://skypilot.readthedocs.io/en/latest/examples/interactive-development.html) and debug the tasks.
+Hint: You can use `ssh` and VSCode to [interactively develop](https://docs.skypilot.co/en/latest/examples/interactive-development.html) and debug the tasks.
Note: `eval` can be optionally run on the same cluster as `train` with `sky exec`. Refer to the `shared_state` airflow example on how to do this.
diff --git a/examples/cog/README.md b/examples/cog/README.md
index b2193e2e18f..97d886e2d2c 100644
--- a/examples/cog/README.md
+++ b/examples/cog/README.md
@@ -17,7 +17,7 @@ curl http://$IP:5000/predictions -X POST \
```
## Scale up the deployment using SkyServe
-We can use SkyServe (`sky serve`) to scale up the deployment to multiple instances, while enjoying load balancing, autoscaling, and other [SkyServe features](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html).
+We can use SkyServe (`sky serve`) to scale up the deployment to multiple instances, while enjoying load balancing, autoscaling, and other [SkyServe features](https://docs.skypilot.co/en/latest/serving/sky-serve.html).
```console
sky serve up -n cog ./sky.yaml
```
diff --git a/examples/distributed-pytorch/README.md b/examples/distributed-pytorch/README.md
new file mode 100644
index 00000000000..6c2f7092269
--- /dev/null
+++ b/examples/distributed-pytorch/README.md
@@ -0,0 +1,81 @@
+# Distributed Training with PyTorch
+
+This example demonstrates how to run distributed training with PyTorch using SkyPilot.
+
+**The example is based on [PyTorch's official minGPT example](https://github.com/pytorch/examples/tree/main/distributed/minGPT-ddp)**
+
+
+## Overview
+
+There are two ways to run distributed training with PyTorch:
+
+1. Using normal `torchrun`
+2. Using the `rdzv` backend
+
+For fixed-size distributed training, the main difference between the two is that the `rdzv` backend automatically assigns the rank of each node, while plain `torchrun` requires the node rank to be set manually.
+
+SkyPilot offers convenient built-in environment variables to help you start distributed training easily.
+
+### Using normal `torchrun`
+
+
+The following command will spawn 2 nodes with 1 L4 GPU each:
+```
+sky launch -c train train.yaml
+```
+
+In [train.yaml](./train.yaml), we use `torchrun` to launch the training and set the arguments for distributed training using [environment variables](https://docs.skypilot.co/en/latest/running-jobs/environment-variables.html#skypilot-environment-variables) provided by SkyPilot.
+
+```yaml
+run: |
+ cd examples/mingpt
+ MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
+ torchrun \
+ --nnodes=$SKYPILOT_NUM_NODES \
+ --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
+ --master_addr=$MASTER_ADDR \
+ --master_port=8008 \
+ --node_rank=${SKYPILOT_NODE_RANK} \
+ main.py
+```
+
+
+
+### Using `rdzv` backend
+
+`rdzv` is an alternative backend for distributed training:
+
+```
+sky launch -c train-rdzv train-rdzv.yaml
+```
+
+In [train-rdzv.yaml](./train-rdzv.yaml), we use `torchrun` to launch the training and set the arguments for distributed training using [environment variables](https://docs.skypilot.co/en/latest/running-jobs/environment-variables.html#skypilot-environment-variables) provided by SkyPilot.
+
+```yaml
+run: |
+ cd examples/mingpt
+ MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
+ echo "Starting distributed training, head node: $MASTER_ADDR"
+
+ torchrun \
+ --nnodes=$SKYPILOT_NUM_NODES \
+ --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
+ --rdzv_backend=c10d \
+ --rdzv_endpoint=$MASTER_ADDR:29500 \
+ --rdzv_id $SKYPILOT_TASK_ID \
+ main.py
+```
+
+
+## Scale up
+
+If you would like to scale up the training, you can simply change the resource requirements, and SkyPilot's built-in environment variables will be set automatically.
+
+For example, the following command will spawn 4 nodes with 4 L4 GPUs each.
+
+```
+sky launch -c train train.yaml --num-nodes 4 --gpus L4:4 --cpus 8+
+```
+
+We also increase `--cpus` to 8+ so that the CPU does not become a performance bottleneck.
+
diff --git a/examples/distributed-pytorch/train-rdzv.yaml b/examples/distributed-pytorch/train-rdzv.yaml
new file mode 100644
index 00000000000..3bcd63dde4c
--- /dev/null
+++ b/examples/distributed-pytorch/train-rdzv.yaml
@@ -0,0 +1,29 @@
+name: minGPT-ddp-rdzv
+
+resources:
+ cpus: 4+
+ accelerators: L4
+
+num_nodes: 2
+
+setup: |
+ git clone --depth 1 https://github.com/pytorch/examples || true
+ cd examples
+ git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
+ # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
+ uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113
+
+run: |
+ cd examples/mingpt
+ export LOGLEVEL=INFO
+
+ MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
+ echo "Starting distributed training, head node: $MASTER_ADDR"
+
+ torchrun \
+ --nnodes=$SKYPILOT_NUM_NODES \
+ --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
+ --rdzv_backend=c10d \
+ --rdzv_endpoint=$MASTER_ADDR:29500 \
+ --rdzv_id $SKYPILOT_TASK_ID \
+ main.py
diff --git a/examples/distributed-pytorch/train.yaml b/examples/distributed-pytorch/train.yaml
new file mode 100644
index 00000000000..b45941e1485
--- /dev/null
+++ b/examples/distributed-pytorch/train.yaml
@@ -0,0 +1,29 @@
+name: minGPT-ddp
+
+resources:
+ cpus: 4+
+ accelerators: L4
+
+num_nodes: 2
+
+setup: |
+ git clone --depth 1 https://github.com/pytorch/examples || true
+ cd examples
+ git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
+ # SkyPilot's default image on AWS/GCP has CUDA 11.6 (Azure 11.5).
+ uv pip install -r requirements.txt "numpy<2" "torch==1.12.1+cu113" --extra-index-url https://download.pytorch.org/whl/cu113
+
+run: |
+ cd examples/mingpt
+ export LOGLEVEL=INFO
+
+ MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
+ echo "Starting distributed training, head node: $MASTER_ADDR"
+
+ torchrun \
+ --nnodes=$SKYPILOT_NUM_NODES \
+ --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
+ --master_addr=$MASTER_ADDR \
+ --master_port=8008 \
+ --node_rank=${SKYPILOT_NODE_RANK} \
+ main.py
diff --git a/examples/k8s_cloud_deploy/README.md b/examples/k8s_cloud_deploy/README.md
index 5ba42cbe836..9b0d46249d4 100644
--- a/examples/k8s_cloud_deploy/README.md
+++ b/examples/k8s_cloud_deploy/README.md
@@ -56,11 +56,11 @@ NODE_NAME GPU_NAME TOTAL_GPUS FREE_GPUS
## Run AI workloads on your Kubernetes cluster with SkyPilot
### Development clusters
-To launch a [GPU enabled development cluster](https://skypilot.readthedocs.io/en/latest/examples/interactive-development.html), run `sky launch -c mycluster --cloud kubernetes --gpus A10:1`.
+To launch a [GPU enabled development cluster](https://docs.skypilot.co/en/latest/examples/interactive-development.html), run `sky launch -c mycluster --cloud kubernetes --gpus A10:1`.
SkyPilot will setup SSH config for you.
-* [SSH access](https://skypilot.readthedocs.io/en/latest/examples/interactive-development.html#ssh): `ssh mycluster`
-* [VSCode remote development](https://skypilot.readthedocs.io/en/latest/examples/interactive-development.html#vscode): `code --remote ssh-remote+mycluster "/"`
+* [SSH access](https://docs.skypilot.co/en/latest/examples/interactive-development.html#ssh): `ssh mycluster`
+* [VSCode remote development](https://docs.skypilot.co/en/latest/examples/interactive-development.html#vscode): `code --remote ssh-remote+mycluster "/"`
### Jobs
@@ -87,7 +87,7 @@ sky-cmd-1-2ea4-head 1/1 Running 0 8m36s
sky-jobs-controller-2ea485ea-2ea4-head 1/1 Running 0 10m
```
-Refer to [SkyPilot docs](https://skypilot.readthedocs.io/) for more.
+Refer to [SkyPilot docs](https://docs.skypilot.co/) for more.
## Teardown
To teardown the Kubernetes cluster, run:
diff --git a/examples/stable_diffusion/README.md b/examples/stable_diffusion/README.md
index 2a4383f1347..56af44df91e 100644
--- a/examples/stable_diffusion/README.md
+++ b/examples/stable_diffusion/README.md
@@ -1,6 +1,6 @@
## Setup
-1. Install skypilot package by following these [instructions](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html).
+1. Install skypilot package by following these [instructions](https://docs.skypilot.co/en/latest/getting-started/installation.html).
2. Run `git clone https://github.com/skypilot-org/skypilot.git && cd examples/stable_diffusion`
diff --git a/examples/stable_diffusion/pushing_docker_image.md b/examples/stable_diffusion/pushing_docker_image.md
index 80b285fa832..0585d566543 100644
--- a/examples/stable_diffusion/pushing_docker_image.md
+++ b/examples/stable_diffusion/pushing_docker_image.md
@@ -1,6 +1,6 @@
## GCR
-1. Install skypilot package by following these [instructions](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html).
+1. Install skypilot package by following these [instructions](https://docs.skypilot.co/en/latest/getting-started/installation.html).
2. Run `git clone https://github.com/skypilot-org/skypilot.git `.
diff --git a/llm/codellama/README.md b/llm/codellama/README.md
index f145fd062ff..54019bd6d2a 100644
--- a/llm/codellama/README.md
+++ b/llm/codellama/README.md
@@ -38,7 +38,7 @@ The followings are the demos of Code Llama 70B hosted by SkyPilot Serve (aka Sky
## Running your own Code Llama with SkyPilot
-After [installing SkyPilot](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html), run your own Code Llama on vLLM with SkyPilot in 1-click:
+After [installing SkyPilot](https://docs.skypilot.co/en/latest/getting-started/installation.html), run your own Code Llama on vLLM with SkyPilot in 1-click:
1. Start serving Code Llama 70B on a single instance with any available GPU in the list specified in [endpoint.yaml](https://github.com/skypilot-org/skypilot/tree/master/llm/codellama/endpoint.yaml) with a vLLM powered OpenAI-compatible endpoint:
```console
@@ -100,7 +100,7 @@ This returns the following completion:
## Scale up the service with SkyServe
-1. With [SkyServe](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html), a serving library built on top of SkyPilot, scaling up the Code Llama service is as simple as running:
+1. With [SkyServe](https://docs.skypilot.co/en/latest/serving/sky-serve.html), a serving library built on top of SkyPilot, scaling up the Code Llama service is as simple as running:
```bash
sky serve up -n code-llama ./endpoint.yaml
```
diff --git a/llm/dbrx/README.md b/llm/dbrx/README.md
index 3011af9d4e6..2845634b287 100644
--- a/llm/dbrx/README.md
+++ b/llm/dbrx/README.md
@@ -11,7 +11,7 @@ In this recipe, you will serve `databricks/dbrx-instruct` on your own infra --
## Prerequisites
- Go to the [HuggingFace model page](https://huggingface.co/databricks/dbrx-instruct) and request access to the model `databricks/dbrx-instruct`.
-- Check that you have installed SkyPilot ([docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)).
+- Check that you have installed SkyPilot ([docs](https://docs.skypilot.co/en/latest/getting-started/installation.html)).
- Check that `sky check` shows clouds or Kubernetes are enabled.
## SkyPilot YAML
@@ -278,6 +278,6 @@ To shut down all resources:
sky serve down dbrx
```
-See more details in [SkyServe docs](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html).
+See more details in [SkyServe docs](https://docs.skypilot.co/en/latest/serving/sky-serve.html).
diff --git a/llm/gemma/README.md b/llm/gemma/README.md
index ef5027b2807..7296f7c7e31 100644
--- a/llm/gemma/README.md
+++ b/llm/gemma/README.md
@@ -24,7 +24,7 @@ Generate a read-only access token on huggingface [here](https://huggingface.co/s
```bash
pip install "skypilot-nightly[all]"
```
-For detailed installation instructions, please refer to the [installation guide](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html).
+For detailed installation instructions, please refer to the [installation guide](https://docs.skypilot.co/en/latest/getting-started/installation.html).
### Host on a Single Instance
diff --git a/llm/gpt-2/README.md b/llm/gpt-2/README.md
index 10fa2cf6998..b8e656e2353 100644
--- a/llm/gpt-2/README.md
+++ b/llm/gpt-2/README.md
@@ -13,7 +13,7 @@ pip install "skypilot-nightly[aws,gcp,azure,kubernetes,lambda,fluidstack]" # Cho
```bash
sky check
```
-Please check the instructions for enabling clouds at [SkyPilot doc](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html).
+Please check the instructions for enabling clouds at [SkyPilot doc](https://docs.skypilot.co/en/latest/getting-started/installation.html).
3. Download the YAML for starting the training:
```bash
diff --git a/llm/llama-3/README.md b/llm/llama-3/README.md
index 8ffcb3087a9..c4cf9066f63 100644
--- a/llm/llama-3/README.md
+++ b/llm/llama-3/README.md
@@ -29,7 +29,7 @@
## Prerequisites
- Go to the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) and request access to the model `meta-llama/Meta-Llama-3-70B-Instruct`.
-- Check that you have installed SkyPilot ([docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)).
+- Check that you have installed SkyPilot ([docs](https://docs.skypilot.co/en/latest/getting-started/installation.html)).
- Check that `sky check` shows clouds or Kubernetes are enabled.
## SkyPilot YAML
@@ -326,7 +326,7 @@ To shut down all resources:
sky serve down llama3
```
-See more details in [SkyServe docs](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html).
+See more details in [SkyServe docs](https://docs.skypilot.co/en/latest/serving/sky-serve.html).
### **Optional**: Connect a GUI to your Llama-3 endpoint
@@ -349,4 +349,4 @@ sky launch -c llama3-gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint
## Finetuning Llama-3
-You can finetune Llama-3 on your own data. We have an tutorial for finetunning Llama-2 for Vicuna on SkyPilot, which can be adapted for Llama-3. You can find the tutorial [here](https://skypilot.readthedocs.io/en/latest/gallery/tutorials/finetuning.html) and a detailed blog post [here](https://blog.skypilot.co/finetuning-llama2-operational-guide/).
+You can finetune Llama-3 on your own data. We have a tutorial for finetuning Llama-2 for Vicuna on SkyPilot, which can be adapted for Llama-3. You can find the tutorial [here](https://docs.skypilot.co/en/latest/gallery/tutorials/finetuning.html) and a detailed blog post [here](https://blog.skypilot.co/finetuning-llama2-operational-guide/).
diff --git a/llm/llama-3_1-finetuning/readme.md b/llm/llama-3_1-finetuning/readme.md
index 935dccde84e..ddc2b9e2463 100644
--- a/llm/llama-3_1-finetuning/readme.md
+++ b/llm/llama-3_1-finetuning/readme.md
@@ -7,10 +7,10 @@
On July 23, 2024, Meta released the [Llama 3.1 model family](https://ai.meta.com/blog/meta-llama-3-1/), including a 405B parameter model in both base model and instruction-tuned forms. Llama 3.1 405B became _the first open LLM that closely rivals top proprietary models_ like GPT-4o and Claude 3.5 Sonnet.
-This guide shows how to use [SkyPilot](https://github.com/skypilot-org/skypilot) and [torchtune](https://pytorch.org/torchtune/stable/index.html) to **finetune Llama 3.1 on your own data and infra**. Everything is packaged in a simple [SkyPilot YAML](https://skypilot.readthedocs.io/en/latest/getting-started/quickstart.html), that can be launched with one command on your infra:
+This guide shows how to use [SkyPilot](https://github.com/skypilot-org/skypilot) and [torchtune](https://pytorch.org/torchtune/stable/index.html) to **finetune Llama 3.1 on your own data and infra**. Everything is packaged in a simple [SkyPilot YAML](https://docs.skypilot.co/en/latest/getting-started/quickstart.html) that can be launched with one command on your infra:
- Local GPU workstation
- Kubernetes cluster
-- Cloud accounts ([12 clouds supported](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html))
+- Cloud accounts ([12 clouds supported](https://docs.skypilot.co/en/latest/getting-started/installation.html))