Skip to content

Commit

Permalink
Merge branch 'master' of github.com:skypilot-org/skypilot into manage…
Browse files Browse the repository at this point in the history
…d-jobs-skylet
  • Loading branch information
cg505 committed Dec 20, 2024
2 parents 3b4cf44 + a73a9cb commit ea84bb4
Show file tree
Hide file tree
Showing 70 changed files with 518 additions and 184 deletions.
14 changes: 7 additions & 7 deletions .buildkite/generate_pipeline.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
tests/smoke_tests
├── test_*.py -> release pipeline
├── test_pre_merge.py -> pre-merge pipeline
├── test_quick_tests_core.py -> run quick tests on PR before merging
run `PYTHONPATH=$(pwd)/tests:$PYTHONPATH python .buildkite/generate_pipeline.py`
to generate the pipeline for testing. The CI will run this script as a pre-step,
Expand Down Expand Up @@ -208,8 +208,8 @@ def _convert_release(test_files: List[str]):
extra_env={cloud: '1' for cloud in CLOUD_QUEUE_MAP})


def _convert_pre_merge(test_files: List[str]):
yaml_file_path = '.buildkite/pipeline_smoke_tests_pre_merge.yaml'
def _convert_quick_tests_core(test_files: List[str]):
yaml_file_path = '.buildkite/pipeline_smoke_tests_quick_tests_core.yaml'
output_file_pipelines = []
for test_file in test_files:
print(f'Converting {test_file} to {yaml_file_path}')
Expand All @@ -234,18 +234,18 @@ def _convert_pre_merge(test_files: List[str]):
def main():
test_files = os.listdir('tests/smoke_tests')
release_files = []
pre_merge_files = []
quick_tests_core_files = []
for test_file in test_files:
if not test_file.startswith('test_'):
continue
test_file_path = os.path.join('tests/smoke_tests', test_file)
if "test_pre_merge" in test_file:
pre_merge_files.append(test_file_path)
if "test_quick_tests_core" in test_file:
quick_tests_core_files.append(test_file_path)
else:
release_files.append(test_file_path)

_convert_release(release_files)
_convert_pre_merge(pre_merge_files)
_convert_quick_tests_core(quick_tests_core_files)


if __name__ == '__main__':
Expand Down
34 changes: 17 additions & 17 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@
</p>

<p align="center">
<a href="https://skypilot.readthedocs.io/en/latest/">
<a href="https://docs.skypilot.co/">
<img alt="Documentation" src="https://readthedocs.org/projects/skypilot/badge/?version=latest">
</a>

Expand Down Expand Up @@ -43,7 +43,7 @@
<summary>Archived</summary>

- [Jul 2024] [**Finetune**](./llm/llama-3_1-finetuning/) and [**serve**](./llm/llama-3_1/) **Llama 3.1** on your infra
- [Apr 2024] Serve and finetune [**Llama 3**](https://skypilot.readthedocs.io/en/latest/gallery/llms/llama-3.html) on any cloud or Kubernetes: [**example**](./llm/llama-3/)
- [Apr 2024] Serve and finetune [**Llama 3**](https://docs.skypilot.co/en/latest/gallery/llms/llama-3.html) on any cloud or Kubernetes: [**example**](./llm/llama-3/)
- [Mar 2024] Serve and deploy [**Databricks DBRX**](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) on your infra: [**example**](./llm/dbrx/)
- [Feb 2024] Speed up your LLM deployments with [**SGLang**](https://github.com/sgl-project/sglang) for 5x throughput on SkyServe: [**example**](./llm/sglang/)
- [Dec 2023] Using [**LoRAX**](https://github.com/predibase/lorax) to serve 1000s of finetuned LLMs on a single instance in the cloud: [**example**](./llm/lorax/)
Expand All @@ -60,17 +60,17 @@
SkyPilot is a framework for running AI and batch workloads on any infra, offering unified execution, high cost savings, and high GPU availability.

SkyPilot **abstracts away infra burdens**:
- Launch [dev clusters](https://skypilot.readthedocs.io/en/latest/examples/interactive-development.html), [jobs](https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html), and [serving](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html) on any infra
- Launch [dev clusters](https://docs.skypilot.co/en/latest/examples/interactive-development.html), [jobs](https://docs.skypilot.co/en/latest/examples/managed-jobs.html), and [serving](https://docs.skypilot.co/en/latest/serving/sky-serve.html) on any infra
- Easy job management: queue, run, and auto-recover many jobs

SkyPilot **supports multiple clusters, clouds, and hardware** ([the Sky](https://arxiv.org/abs/2205.07147)):
- Bring your reserved GPUs, Kubernetes clusters, or 12+ clouds
- [Flexible provisioning](https://skypilot.readthedocs.io/en/latest/examples/auto-failover.html) of GPUs, TPUs, CPUs, with auto-retry
- [Flexible provisioning](https://docs.skypilot.co/en/latest/examples/auto-failover.html) of GPUs, TPUs, CPUs, with auto-retry

SkyPilot **cuts your cloud costs & maximizes GPU availability**:
* [Autostop](https://skypilot.readthedocs.io/en/latest/reference/auto-stop.html): automatic cleanup of idle resources
* [Managed Spot](https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html): 3-6x cost savings using spot instances, with preemption auto-recovery
* [Optimizer](https://skypilot.readthedocs.io/en/latest/examples/auto-failover.html): 2x cost savings by auto-picking the cheapest & most available infra
* [Autostop](https://docs.skypilot.co/en/latest/reference/auto-stop.html): automatic cleanup of idle resources
* [Managed Spot](https://docs.skypilot.co/en/latest/examples/managed-jobs.html): 3-6x cost savings using spot instances, with preemption auto-recovery
* [Optimizer](https://docs.skypilot.co/en/latest/examples/auto-failover.html): 2x cost savings by auto-picking the cheapest & most available infra

SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code changes.

Expand All @@ -79,13 +79,13 @@ Install with pip:
# Choose your clouds:
pip install -U "skypilot[kubernetes,aws,gcp,azure,oci,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp]"
```
To get the latest features and fixes, use the nightly build or [install from source](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html):
To get the latest features and fixes, use the nightly build or [install from source](https://docs.skypilot.co/en/latest/getting-started/installation.html):
```bash
# Choose your clouds:
pip install "skypilot-nightly[kubernetes,aws,gcp,azure,oci,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp]"
```

[Current supported infra](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html) (Kubernetes; AWS, GCP, Azure, OCI, Lambda Cloud, Fluidstack, RunPod, Cudo, Paperspace, Cloudflare, Samsung, IBM, VMware vSphere):
[Current supported infra](https://docs.skypilot.co/en/latest/getting-started/installation.html) (Kubernetes; AWS, GCP, Azure, OCI, Lambda Cloud, Fluidstack, RunPod, Cudo, Paperspace, Cloudflare, Samsung, IBM, VMware vSphere):
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/skypilot-org/skypilot/master/docs/source/images/cloud-logos-dark.png">
Expand All @@ -95,16 +95,16 @@ pip install "skypilot-nightly[kubernetes,aws,gcp,azure,oci,lambda,runpod,fluidst


## Getting Started
You can find our documentation [here](https://skypilot.readthedocs.io/en/latest/).
- [Installation](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)
- [Quickstart](https://skypilot.readthedocs.io/en/latest/getting-started/quickstart.html)
- [CLI reference](https://skypilot.readthedocs.io/en/latest/reference/cli.html)
You can find our documentation [here](https://docs.skypilot.co/).
- [Installation](https://docs.skypilot.co/en/latest/getting-started/installation.html)
- [Quickstart](https://docs.skypilot.co/en/latest/getting-started/quickstart.html)
- [CLI reference](https://docs.skypilot.co/en/latest/reference/cli.html)

## SkyPilot in 1 Minute

A SkyPilot task specifies: resource requirements, data to be synced, setup commands, and the task commands.

Once written in this [**unified interface**](https://skypilot.readthedocs.io/en/latest/reference/yaml-spec.html) (YAML or Python API), the task can be launched on any available cloud. This avoids vendor lock-in, and allows easily moving jobs to a different provider.
Once written in this [**unified interface**](https://docs.skypilot.co/en/latest/reference/yaml-spec.html) (YAML or Python API), the task can be launched on any available cloud. This avoids vendor lock-in, and allows easily moving jobs to a different provider.

Paste the following into a file `my_task.yaml`:

Expand Down Expand Up @@ -135,7 +135,7 @@ Prepare the workdir by cloning:
git clone https://github.com/pytorch/examples.git ~/torch_examples
```

Launch with `sky launch` (note: [access to GPU instances](https://skypilot.readthedocs.io/en/latest/cloud-setup/quota.html) is needed for this example):
Launch with `sky launch` (note: [access to GPU instances](https://docs.skypilot.co/en/latest/cloud-setup/quota.html) is needed for this example):
```bash
sky launch my_task.yaml
```
Expand All @@ -152,10 +152,10 @@ SkyPilot then performs the heavy-lifting for you, including:
</p>


Refer to [Quickstart](https://skypilot.readthedocs.io/en/latest/getting-started/quickstart.html) to get started with SkyPilot.
Refer to [Quickstart](https://docs.skypilot.co/en/latest/getting-started/quickstart.html) to get started with SkyPilot.

## More Information
To learn more, see [Concept: Sky Computing](https://docs.skypilot.co/en/latest/sky-computing.html), [SkyPilot docs](https://skypilot.readthedocs.io/en/latest/), and [SkyPilot blog](https://blog.skypilot.co/).
To learn more, see [Concept: Sky Computing](https://docs.skypilot.co/en/latest/sky-computing.html), [SkyPilot docs](https://docs.skypilot.co/en/latest/), and [SkyPilot blog](https://blog.skypilot.co/).

<!-- Keep this section in sync with index.rst in SkyPilot Docs -->
Runnable examples:
Expand Down
1 change: 1 addition & 0 deletions docs/requirements-docs.txt
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@ myst-parser==2.0.0
sphinx-autodoc-typehints==1.25.2
sphinx-book-theme==1.1.0
sphinx-togglebutton==0.3.2
sphinx-notfound-page==1.0.4
sphinxcontrib-applehelp==1.0.7
sphinxcontrib-devhelp==1.0.5
sphinxcontrib-googleanalytics==0.4
Expand Down
1 change: 1 addition & 0 deletions docs/source/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,7 @@
'sphinxemoji.sphinxemoji',
'sphinx_design',
'myst_parser',
'notfound.extension',
]

intersphinx_mapping = {
Expand Down
2 changes: 1 addition & 1 deletion docs/source/examples/managed-jobs.rst
Original file line number Diff line number Diff line change
Expand Up @@ -499,7 +499,7 @@ To achieve the above, you can specify custom configs in :code:`~/.sky/config.yam
# Specify the disk_size in GB of the jobs controller.
disk_size: 100
The :code:`resources` field has the same spec as a normal SkyPilot job; see `here <https://skypilot.readthedocs.io/en/latest/reference/yaml-spec.html>`__.
The :code:`resources` field has the same spec as a normal SkyPilot job; see `here <https://docs.skypilot.co/en/latest/reference/yaml-spec.html>`__.

.. note::
These settings will not take effect if you have an existing controller (either
Expand Down
10 changes: 5 additions & 5 deletions docs/source/reference/config.rst
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@ Available fields and semantics:
#
# These take effects only when a managed jobs controller does not already exist.
#
# Ref: https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html#customizing-job-controller-resources
# Ref: https://docs.skypilot.co/en/latest/examples/managed-jobs.html#customizing-job-controller-resources
jobs:
controller:
resources: # same spec as 'resources' in a task YAML
Expand Down Expand Up @@ -478,13 +478,13 @@ Available fields and semantics:
# This must be either: 'loadbalancer', 'ingress' or 'podip'.
#
# loadbalancer: Creates services of type `LoadBalancer` to expose ports.
# See https://skypilot.readthedocs.io/en/latest/reference/kubernetes/kubernetes-setup.html#loadbalancer-service.
# See https://docs.skypilot.co/en/latest/reference/kubernetes/kubernetes-setup.html#loadbalancer-service.
# This mode is supported out of the box on most cloud managed Kubernetes
# environments (e.g., GKE, EKS).
#
# ingress: Creates an ingress and a ClusterIP service for each port opened.
# Requires an Nginx ingress controller to be configured on the Kubernetes cluster.
# Refer to https://skypilot.readthedocs.io/en/latest/reference/kubernetes/kubernetes-setup.html#nginx-ingress
# Refer to https://docs.skypilot.co/en/latest/reference/kubernetes/kubernetes-setup.html#nginx-ingress
# for details on deploying the NGINX ingress controller.
#
# podip: Directly returns the IP address of the pod. This mode does not
Expand Down Expand Up @@ -513,7 +513,7 @@ Available fields and semantics:
#
# <string>: The name of a service account to use for all Kubernetes pods.
# This service account must exist in the user's namespace and have all
# necessary permissions. Refer to https://skypilot.readthedocs.io/en/latest/cloud-setup/cloud-permissions/kubernetes.html
# necessary permissions. Refer to https://docs.skypilot.co/en/latest/cloud-setup/cloud-permissions/kubernetes.html
# for details on the roles required by the service account.
#
# Using SERVICE_ACCOUNT or a custom service account only affects Kubernetes
Expand Down Expand Up @@ -581,7 +581,7 @@ Available fields and semantics:
# gke: uses cloud.google.com/gke-accelerator label to identify GPUs on nodes
# karpenter: uses karpenter.k8s.aws/instance-gpu-name label to identify GPUs on nodes
# generic: uses skypilot.co/accelerator labels to identify GPUs on nodes
# Refer to https://skypilot.readthedocs.io/en/latest/reference/kubernetes/kubernetes-setup.html#setting-up-gpu-support
# Refer to https://docs.skypilot.co/en/latest/reference/kubernetes/kubernetes-setup.html#setting-up-gpu-support
# for more details on setting up labels for GPU support.
#
# Default: null (no autoscaler, autodetect label format for GPU nodes)
Expand Down
2 changes: 1 addition & 1 deletion docs/source/reference/kubernetes/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,7 @@ Why use SkyPilot on Kubernetes?
.. grid-item-card:: 🖼 Run popular models on Kubernetes
:text-align: center

Train and serve `Llama-3 <https://skypilot.readthedocs.io/en/latest/gallery/llms/llama-3.html>`_, `Mixtral <https://skypilot.readthedocs.io/en/latest/gallery/llms/mixtral.html>`_, and more on your Kubernetes with ready-to-use recipes from the :ref:`AI gallery <ai-gallery>`.
Train and serve `Llama-3 <https://docs.skypilot.co/en/latest/gallery/llms/llama-3.html>`_, `Mixtral <https://docs.skypilot.co/en/latest/gallery/llms/mixtral.html>`_, and more on your Kubernetes with ready-to-use recipes from the :ref:`AI gallery <ai-gallery>`.


.. tab-item:: For Infrastructure Admins
Expand Down
4 changes: 2 additions & 2 deletions docs/source/reference/yaml-spec.rst
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ Available fields:
# which `sky` is called.
#
# To exclude files from syncing, see
# https://skypilot.readthedocs.io/en/latest/examples/syncing-code-artifacts.html#exclude-uploading-files
# https://docs.skypilot.co/en/latest/examples/syncing-code-artifacts.html#exclude-uploading-files
workdir: ~/my-task-code
# Number of nodes (optional; defaults to 1) to launch including the head node.
Expand Down Expand Up @@ -357,7 +357,7 @@ In additional to the above fields, SkyPilot also supports the following experime
#
# The following fields can be overridden. Please refer to docs of Advanced
# Configuration for more details of those fields:
# https://skypilot.readthedocs.io/en/latest/reference/config.html
# https://docs.skypilot.co/en/latest/reference/config.html
config_overrides:
docker:
run_options: ...
Expand Down
2 changes: 1 addition & 1 deletion docs/source/reservations/existing-machines.rst
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,7 @@ Prerequisites
**Local machine (typically your laptop):**

* `kubectl <https://kubernetes.io/docs/tasks/tools/install-kubectl/>`_
* `SkyPilot <https://skypilot.readthedocs.io/en/latest/getting-started/installation.html>`_
* `SkyPilot <https://docs.skypilot.co/en/latest/getting-started/installation.html>`_

**Remote machines (your cluster, optionally with GPUs):**

Expand Down
2 changes: 1 addition & 1 deletion docs/source/serving/sky-serve.rst
Original file line number Diff line number Diff line change
Expand Up @@ -515,7 +515,7 @@ To achieve the above, you can specify custom configs in :code:`~/.sky/config.yam
# Specify the disk_size in GB of the SkyServe controller.
disk_size: 1024
The :code:`resources` field has the same spec as a normal SkyPilot job; see `here <https://skypilot.readthedocs.io/en/latest/reference/yaml-spec.html>`__.
The :code:`resources` field has the same spec as a normal SkyPilot job; see `here <https://docs.skypilot.co/en/latest/reference/yaml-spec.html>`__.

.. note::
These settings will not take effect if you have an existing controller (either
Expand Down
6 changes: 3 additions & 3 deletions examples/airflow/shared_state/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@ In this guide, we demonstrate how some simple SkyPilot operations, such as launc

* Airflow installed on a [Kubernetes cluster](https://airflow.apache.org/docs/helm-chart/stable/index.html) or [locally](https://airflow.apache.org/docs/apache-airflow/stable/start.html) (`SequentialExecutor`)
* A Kubernetes cluster to run tasks on. We'll use GKE in this example.
* You can use our guide on [setting up a Kubernetes cluster](https://skypilot.readthedocs.io/en/latest/reference/kubernetes/kubernetes-setup.html).
* You can use our guide on [setting up a Kubernetes cluster](https://docs.skypilot.co/en/latest/reference/kubernetes/kubernetes-setup.html).
* A persistent volume storage class should be available that supports at least `ReadWriteOnce` access mode. GKE has this supported by default.

## Preparing the Kubernetes Cluster
Expand All @@ -39,7 +39,7 @@ In this guide, we demonstrate how some simple SkyPilot operations, such as launc
name: sky-airflow-sa
namespace: default
roleRef:
# For minimal permissions, refer to https://skypilot.readthedocs.io/en/latest/cloud-setup/cloud-permissions/kubernetes.html
# For minimal permissions, refer to https://docs.skypilot.co/en/latest/cloud-setup/cloud-permissions/kubernetes.html
kind: ClusterRole
name: cluster-admin
apiGroup: rbac.authorization.k8s.io
Expand Down Expand Up @@ -163,7 +163,7 @@ with DAG(dag_id='sky_k8s_example',
## Tips

1. **Persistent Volume**: If you have many concurrent tasks, you may want to use a storage class that supports [`ReadWriteMany`](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes) access mode.
2. **Cloud credentials**: If you wish to run tasks on different clouds, you can configure cloud credentials in Kubernetes secrets and mount them in the Sky pod defined in the DAG. See [SkyPilot docs on setting up cloud credentials](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#cloud-account-setup) for more on how to configure credentials in the pod.
2. **Cloud credentials**: If you wish to run tasks on different clouds, you can configure cloud credentials in Kubernetes secrets and mount them in the Sky pod defined in the DAG. See [SkyPilot docs on setting up cloud credentials](https://docs.skypilot.co/en/latest/getting-started/installation.html#cloud-account-setup) for more on how to configure credentials in the pod.
3. **Logging**: All SkyPilot logs are written to container stdout, which is captured as task logs in Airflow and displayed in the UI. You can also write logs to a file and read them in subsequent tasks.
4. **XComs for shared state**: Airflow also provides [XComs](https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html) for cross-task communication. [`sky_k8s_example_xcoms.py`](sky_k8s_example_xcoms.py) demonstrates how to use XComs to share state between tasks.

Expand Down
Loading

0 comments on commit ea84bb4

Please sign in to comment.