Commit 93aa008

Merge branch 'vast.ai-support' of github.com:kristopolous/skypilot into vast.ai-support
2 parents 96cce92 + aebca9a commit 93aa008

116 files changed, +4235 -1927 lines changed

CONTRIBUTING.md

+7 -7
@@ -1,7 +1,7 @@
 # Contributing to SkyPilot

-Thank you for your interest in contributing to SkyPilot! We welcome and value
-all contributions to the project, including but not limited to:
+Thank you for your interest in contributing to SkyPilot! We welcome and value
+all contributions to the project, including but not limited to:

 * [Bug reports](https://github.com/skypilot-org/skypilot/issues) and [discussions](https://github.com/skypilot-org/skypilot/discussions)
 * [Pull requests](https://github.com/skypilot-org/skypilot/pulls) for bug fixes and new features
@@ -26,7 +26,7 @@ pip install -r requirements-dev.txt
 ### Testing
 To run smoke tests (NOTE: Running all smoke tests launches ~20 clusters):
 ```
-# Run all tests except for AWS and Lambda Cloud
+# Run all tests on AWS and Azure (default smoke test clouds)
 pytest tests/test_smoke.py

 # Terminate a test's cluster even if the test failed (default is to keep it around for debugging)
@@ -41,11 +41,11 @@ pytest tests/test_smoke.py::test_minimal
 # Only run managed spot tests
 pytest tests/test_smoke.py --managed-spot

-# Only run test for AWS + generic tests
-pytest tests/test_smoke.py --aws
+# Only run test for GCP + generic tests
+pytest tests/test_smoke.py --gcp

-# Change cloud for generic tests to aws
-pytest tests/test_smoke.py --generic-cloud aws
+# Change cloud for generic tests to Azure
+pytest tests/test_smoke.py --generic-cloud azure
 ```

 For profiling code, use:

Dockerfile_k8s

+9 -15
@@ -1,4 +1,4 @@
-FROM continuumio/miniconda3:23.3.1-0
+FROM --platform=linux/amd64 continuumio/miniconda3:23.3.1-0

 # TODO(romilb): Investigate if this image can be consolidated with the skypilot
 # client image (`Dockerfile`)
@@ -33,21 +33,15 @@ ENV HOME /home/sky
 # Set current working directory
 WORKDIR /home/sky

-# Install SkyPilot pip dependencies preemptively to speed up provisioning time
-RUN conda init && \
-    pip install wheel Click colorama cryptography jinja2 jsonschema networkx \
-    oauth2client pandas pendulum PrettyTable rich tabulate filelock packaging \
-    'protobuf<4.0.0' pulp pycryptodome==3.12.0 docker kubernetes==28.1.0 \
-    grpcio==1.51.3 python-dotenv==1.0.1 ray[default]==2.9.3 && \
+# Install skypilot dependencies
+RUN conda init && export PIP_DISABLE_PIP_VERSION_CHECK=1 && \
+    python3 -m venv ~/skypilot-runtime && \
+    PYTHON_EXEC=$(echo ~/skypilot-runtime)/bin/python && \
+    $PYTHON_EXEC -m pip install 'skypilot-nightly[remote,kubernetes]' 'ray[default]==2.9.3' 'pycryptodome==3.12.0' && \
+    $PYTHON_EXEC -m pip uninstall skypilot-nightly -y && \
     curl -LO "https://dl.k8s.io/release/v1.28.11/bin/linux/amd64/kubectl" && \
-    sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
-
-# Add /home/sky/.local/bin/ to PATH
-RUN echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bashrc
-
-# Copy SkyPilot code base. This is required for the ssh jump pod to find the
-# lifecycle management scripts
-COPY --chown=sky . /skypilot/sky/
+    sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl && \
+    echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bashrc

 # Set PYTHONUNBUFFERED=1 to have Python print to stdout/stderr immediately
 ENV PYTHONUNBUFFERED=1

Dockerfile_k8s_gpu

+7 -12
@@ -41,19 +41,14 @@ RUN curl https://repo.anaconda.com/miniconda/Miniconda3-py310_23.11.0-2-Linux-x8
     eval "$(~/miniconda3/bin/conda shell.bash hook)" && conda init && conda config --set auto_activate_base true && conda activate base && \
     grep "# >>> conda initialize >>>" ~/.bashrc || { conda init && source ~/.bashrc; } && \
     rm Miniconda3-Linux-x86_64.sh && \
-    pip install wheel Click colorama cryptography jinja2 jsonschema networkx \
-    oauth2client pandas pendulum PrettyTable rich tabulate filelock packaging \
-    'protobuf<4.0.0' pulp pycryptodome==3.12.0 docker kubernetes==28.1.0 \
-    grpcio==1.51.3 python-dotenv==1.0.1 ray[default]==2.9.3 && \
+    export PIP_DISABLE_PIP_VERSION_CHECK=1 && \
+    python3 -m venv ~/skypilot-runtime && \
+    PYTHON_EXEC=$(echo ~/skypilot-runtime)/bin/python && \
+    $PYTHON_EXEC -m pip install 'skypilot-nightly[remote,kubernetes]' 'ray[default]==2.9.3' 'pycryptodome==3.12.0' && \
+    $PYTHON_EXEC -m pip uninstall skypilot-nightly -y && \
     curl -LO "https://dl.k8s.io/release/v1.28.11/bin/linux/amd64/kubectl" && \
-    sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
-
-# Add /home/sky/.local/bin/ to PATH
-RUN echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bashrc
-
-# Copy SkyPilot code base. This is required for the ssh jump pod to find the
-# lifecycle management scripts
-COPY --chown=sky . /skypilot/sky/
+    sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl && \
+    echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bashrc

 # Set PYTHONUNBUFFERED=1 to have Python print to stdout/stderr immediately
 ENV PYTHONUNBUFFERED=1

README.md

+1 -1
@@ -183,7 +183,7 @@ Runnable examples:
   - [LocalGPT](./llm/localgpt)
   - [Falcon](./llm/falcon)
   - Add yours here & see more in [`llm/`](./llm)!
-- Framework examples: [PyTorch DDP](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml), [DeepSpeed](./examples/deepspeed-multinode/sky.yaml), [JAX/Flax on TPU](https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml), [Stable Diffusion](https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion), [Detectron2](https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml), [Distributed](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py) [TensorFlow](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml), [Ray Train](examples/distributed_ray_train/ray_train.yaml), [NeMo](https://github.com/skypilot-org/skypilot/blob/master/examples/nemo/nemo.yaml), [programmatic grid search](https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py), [Docker](https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml), [Cog](https://github.com/skypilot-org/skypilot/blob/master/examples/cog/), [Unsloth](https://github.com/skypilot-org/skypilot/blob/master/examples/unsloth/unsloth.yaml), [Ollama](https://github.com/skypilot-org/skypilot/blob/master/llm/ollama), [llm.c](https://github.com/skypilot-org/skypilot/tree/master/llm/gpt-2), [Airflow](./examples/airflow/training_workflow) and [many more (`examples/`)](./examples).
+- Framework examples: [PyTorch DDP](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml), [DeepSpeed](./examples/deepspeed-multinode/sky.yaml), [JAX/Flax on TPU](https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml), [Stable Diffusion](https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion), [Detectron2](https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml), [Distributed](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py) [TensorFlow](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml), [Ray Train](examples/distributed_ray_train/ray_train.yaml), [NeMo](https://github.com/skypilot-org/skypilot/blob/master/examples/nemo/), [programmatic grid search](https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py), [Docker](https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml), [Cog](https://github.com/skypilot-org/skypilot/blob/master/examples/cog/), [Unsloth](https://github.com/skypilot-org/skypilot/blob/master/examples/unsloth/unsloth.yaml), [Ollama](https://github.com/skypilot-org/skypilot/blob/master/llm/ollama), [llm.c](https://github.com/skypilot-org/skypilot/tree/master/llm/gpt-2), [Airflow](./examples/airflow/training_workflow) and [many more (`examples/`)](./examples).

 Case Studies and Integrations: [Community Spotlights](https://blog.skypilot.co/community/)

docs/build.sh

+8 -1
@@ -1,7 +1,14 @@
 #!/bin/bash

 rm -rf build docs
-script -q /tmp/build_docs.txt bash -c "make html"
+
+# MacOS and GNU `script` have different usages
+if [ "$(uname -s)" = "Linux" ]; then
+  script -q /tmp/build_docs.txt -c "make html"
+else
+  # Assume MacOS (uname -s = Darwin)
+  script -q /tmp/build_docs.txt bash -c "make html"
+fi

 # Check if the output contains "ERROR:" or "WARNING:"
 if grep -q -E "ERROR:|WARNING:" /tmp/build_docs.txt; then
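
Reviewer note on the `docs/build.sh` hunk: the new branch picks between two `script` invocations based on `uname -s`, then the existing check scans the log for `ERROR:` or `WARNING:`. The same decision can be sketched in Python for illustration. The helper names `script_argv` and `has_doc_problems` are hypothetical and not part of the repository, and the non-Linux branch assumes macOS exactly as the shell comment does.

```python
import platform
import re
from typing import List, Optional


def script_argv(log_path: str, command: str, system: Optional[str] = None) -> List[str]:
    """Build the argv for `script`, mirroring the branch in docs/build.sh."""
    system = system or platform.system()
    if system == "Linux":
        # Linux form used in the hunk: script -q FILE -c "CMD"
        return ["script", "-q", log_path, "-c", command]
    # Everything else is treated as macOS: script -q FILE bash -c "CMD"
    return ["script", "-q", log_path, "bash", "-c", command]


def has_doc_problems(log_text: str) -> bool:
    """Same pattern as the `grep -q -E "ERROR:|WARNING:"` check."""
    return re.search(r"ERROR:|WARNING:", log_text) is not None
```

Passing `system` explicitly keeps the helper testable on any host; leaving it unset falls back to `platform.system()`, which reports `Linux` or `Darwin` on the two platforms the script distinguishes.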

docs/source/_static/custom.js

+2 -2
@@ -7,7 +7,7 @@ document.addEventListener('DOMContentLoaded', function () {
     script.setAttribute('data-project-logo', 'https://avatars.githubusercontent.com/u/109387420?s=100&v=4');
     script.setAttribute('data-modal-disclaimer', 'Results are automatically generated and may be inaccurate or contain inappropriate information. Do not include any sensitive information in your query.\n**To get further assistance, you can chat directly with the development team** by joining the [SkyPilot Slack](https://slack.skypilot.co/).');
     script.setAttribute('data-modal-title', 'SkyPilot Docs AI - Ask a Question.');
-    script.setAttribute('data-button-position-bottom', '85px');
+    script.setAttribute('data-button-position-bottom', '100px');
     script.async = true;
     document.head.appendChild(script);
 });
@@ -25,14 +25,14 @@ document.addEventListener('DOMContentLoaded', function () {
 document.addEventListener('DOMContentLoaded', () => {
     // New items:
     const newItems = [
-        { selector: '.caption-text', text: 'SkyServe: Model Serving' },
         { selector: '.toctree-l1 > a', text: 'Managed Jobs' },
         { selector: '.toctree-l1 > a', text: 'Pixtral (Mistral AI)' },
         { selector: '.toctree-l1 > a', text: 'Many Parallel Jobs' },
         { selector: '.toctree-l1 > a', text: 'Reserved, Capacity Blocks, DWS' },
         { selector: '.toctree-l1 > a', text: 'Llama 3.2 (Meta)' },
         { selector: '.toctree-l1 > a', text: 'Admin Policy Enforcement' },
         { selector: '.toctree-l1 > a', text: 'Using Existing Machines' },
+        { selector: '.toctree-l1 > a', text: 'Concept: Sky Computing' },
     ];
     newItems.forEach(({ selector, text }) => {
         document.querySelectorAll(selector).forEach((el) => {

docs/source/cloud-setup/cloud-permissions/kubernetes.rst

+26 -1
@@ -96,7 +96,29 @@ SkyPilot requires permissions equivalent to the following roles to be able to ma

 These roles must apply to both the user account configured in the kubeconfig file and the service account used by SkyPilot (if configured).

-If your tasks use object store mounting or require access to ingress resources, you will need to grant additional permissions as described below.
+If you need to view real-time GPU availability with ``sky show-gpus``, your tasks use object store mounting or your tasks require access to ingress resources, you will need to grant additional permissions as described below.
+
+Permissions for ``sky show-gpus``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``sky show-gpus`` needs to list all pods across all namespaces to calculate GPU availability. To do this, SkyPilot needs the ``get`` and ``list`` permissions for pods in a ``ClusterRole``:
+
+.. code-block:: yaml
+
+    apiVersion: rbac.authorization.k8s.io/v1
+    kind: ClusterRole
+    metadata:
+      name: sky-sa-cluster-role-pod-reader
+    rules:
+      - apiGroups: [""]
+        resources: ["pods"]
+        verbs: ["get", "list"]
+
+
+.. tip::
+
+    If this role is not granted to the service account, ``sky show-gpus`` will still work but it will only show the total GPUs on the nodes, not the number of free GPUs.
+

 Permissions for Object Store Mounting
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -225,6 +247,9 @@ To create a service account that has all necessary permissions for SkyPilot (inc
       - apiGroups: ["networking.k8s.io"] # Required for exposing services through ingresses
         resources: ["ingressclasses"]
         verbs: ["get", "list", "watch"]
+      - apiGroups: [""] # Required for `sky show-gpus` command
+        resources: ["pods"]
+        verbs: ["get", "list"]
     ---
     # ClusterRoleBinding for the service account
     apiVersion: rbac.authorization.k8s.io/v1
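
Reviewer note on the `kubernetes.rst` hunks: the added tip separates total GPUs on the nodes from free GPUs, and the docs say pod listing is what makes the second number computable. A minimal sketch of that arithmetic follows, assuming free GPUs per node equal node capacity minus the GPUs requested by pods scheduled there. The function name and input shapes are hypothetical; this diff does not show SkyPilot's actual implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def free_gpus(
    node_capacity: Dict[str, int],
    pod_requests: Iterable[Tuple[str, int]],
) -> Dict[str, int]:
    """Return free GPUs per node.

    node_capacity: node name -> total GPUs on that node.
    pod_requests:  (node name, GPUs requested) for each pod, across all
                   namespaces. This is the data that needs pod get/list access.
    """
    requested: Dict[str, int] = defaultdict(int)
    for node, gpus in pod_requests:
        requested[node] += gpus
    # Clamp at zero so inconsistent inputs never yield a negative count.
    return {
        node: max(total - requested[node], 0)
        for node, total in node_capacity.items()
    }
```

With an empty `pod_requests` the result equals the node totals, which matches the fallback described in the tip when the pod-reader role is not granted.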

docs/source/cloud-setup/policy.rst

+3
@@ -113,6 +113,9 @@ The ``sky.Config`` and ``sky.RequestOptions`` classes are defined as follows:
     :pyobject: RequestOptions
     :caption: `RequestOptions Class <https://github.com/skypilot-org/skypilot/blob/master/sky/admin_policy.py>`_

+.. note::
+
+    The ``sky.AdminPolicy`` should be idempotent. In other words, it should be safe to apply the policy multiple times to the same user request.

 Example Policies
 ----------------
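
Reviewer note on the `policy.rst` hunk: the new note asks that a policy be idempotent, that is, applying it twice must give the same result as applying it once. The sketch below illustrates the property with plain dictionaries. It deliberately does not use the `sky.AdminPolicy` interface; the request shape, label name, and both function names are hypothetical.

```python
from typing import Any, Dict


def add_team_label(request: Dict[str, Any]) -> Dict[str, Any]:
    """Idempotent: sets a default label only if it is missing."""
    updated = dict(request)
    labels = dict(updated.get("labels", {}))
    labels.setdefault("team", "ml-infra")  # hypothetical default value
    updated["labels"] = labels
    return updated


def append_suffix(request: Dict[str, Any]) -> Dict[str, Any]:
    """Not idempotent: every application changes the name again."""
    updated = dict(request)
    updated["name"] = updated["name"] + "-managed"
    return updated
```

Applying `add_team_label` to its own output returns an equal request, while `append_suffix` keeps growing the name. Under the note's requirement, a mutation like the second one would need a guard such as checking whether the suffix is already present.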

docs/source/docs/index.rst

+3 -2
@@ -62,9 +62,9 @@ SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code change
 Ready to get started?
 ----------------------

-:ref:`Install SkyPilot <installation>` in ~1 minute. Then, launch your first dev cluster in ~5 minutes in :ref:`Quickstart <quickstart>`.
+:ref:`Install SkyPilot <installation>` in 1 minute. Then, launch your first dev cluster in 2 minutes in :ref:`Quickstart <quickstart>`.

-Everything is launched within your cloud accounts, VPCs, and cluster(s).
+SkyPilot is BYOC: Everything is launched within your cloud accounts, VPCs, and clusters.

 Contact the SkyPilot team
 ---------------------------------
@@ -130,6 +130,7 @@ Read the research:
    ../getting-started/quickstart
    ../examples/interactive-development
    ../getting-started/tutorial
+   ../sky-computing


 .. toctree::

docs/source/examples/managed-jobs.rst

+44 -49
@@ -78,49 +78,47 @@ We can launch it with the following:

 .. code-block:: console

+  $ git clone https://github.com/huggingface/transformers.git ~/transformers -b v4.30.1
   $ sky jobs launch -n bert-qa bert_qa.yaml

-
 .. code-block:: yaml

   # bert_qa.yaml
   name: bert-qa

   resources:
     accelerators: V100:1
-    # Use spot instances to save cost.
-    use_spot: true
-
-  # Assume your working directory is under `~/transformers`.
-  # To make this example work, please run the following command:
-  # git clone https://github.com/huggingface/transformers.git ~/transformers -b v4.30.1
-  workdir: ~/transformers
+    use_spot: true # Use spot instances to save cost.

-  setup: |
+  envs:
     # Fill in your wandb key: copy from https://wandb.ai/authorize
     # Alternatively, you can use `--env WANDB_API_KEY=$WANDB_API_KEY`
     # to pass the key in the command line, during `sky jobs launch`.
-    echo export WANDB_API_KEY=[YOUR-WANDB-API-KEY] >> ~/.bashrc
+    WANDB_API_KEY:
+
+  # Assume your working directory is under `~/transformers`.
+  workdir: ~/transformers

+  setup: |
     pip install -e .
     cd examples/pytorch/question-answering/
     pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
     pip install wandb

   run: |
-    cd ./examples/pytorch/question-answering/
+    cd examples/pytorch/question-answering/
     python run_qa.py \
-    --model_name_or_path bert-base-uncased \
-    --dataset_name squad \
-    --do_train \
-    --do_eval \
-    --per_device_train_batch_size 12 \
-    --learning_rate 3e-5 \
-    --num_train_epochs 50 \
-    --max_seq_length 384 \
-    --doc_stride 128 \
-    --report_to wandb
-
+      --model_name_or_path bert-base-uncased \
+      --dataset_name squad \
+      --do_train \
+      --do_eval \
+      --per_device_train_batch_size 12 \
+      --learning_rate 3e-5 \
+      --num_train_epochs 50 \
+      --max_seq_length 384 \
+      --doc_stride 128 \
+      --report_to wandb \
+      --output_dir /tmp/bert_qa/

 .. note::

@@ -162,55 +160,52 @@ An End-to-End Example
 Below we show an `example <https://github.com/skypilot-org/skypilot/blob/master/examples/spot/bert_qa.yaml>`_ for fine-tuning a BERT model on a question-answering task with HuggingFace.

 .. code-block:: yaml
-  :emphasize-lines: 13-16,42-45
+  :emphasize-lines: 8-11,41-44

   # bert_qa.yaml
   name: bert-qa

   resources:
     accelerators: V100:1
-    use_spot: true
-
-  # Assume your working directory is under `~/transformers`.
-  # To make this example work, please run the following command:
-  # git clone https://github.com/huggingface/transformers.git ~/transformers -b v4.30.1
-  workdir: ~/transformers
+    use_spot: true # Use spot instances to save cost.

   file_mounts:
     /checkpoint:
       name: # NOTE: Fill in your bucket name
       mode: MOUNT

-  setup: |
+  envs:
     # Fill in your wandb key: copy from https://wandb.ai/authorize
     # Alternatively, you can use `--env WANDB_API_KEY=$WANDB_API_KEY`
     # to pass the key in the command line, during `sky jobs launch`.
-    echo export WANDB_API_KEY=[YOUR-WANDB-API-KEY] >> ~/.bashrc
+    WANDB_API_KEY:
+
+  # Assume your working directory is under `~/transformers`.
+  workdir: ~/transformers

+  setup: |
     pip install -e .
     cd examples/pytorch/question-answering/
-    pip install -r requirements.txt
+    pip install -r requirements.txt torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
     pip install wandb

   run: |
-    cd ./examples/pytorch/question-answering/
+    cd examples/pytorch/question-answering/
     python run_qa.py \
-    --model_name_or_path bert-base-uncased \
-    --dataset_name squad \
-    --do_train \
-    --do_eval \
-    --per_device_train_batch_size 12 \
-    --learning_rate 3e-5 \
-    --num_train_epochs 50 \
-    --max_seq_length 384 \
-    --doc_stride 128 \
-    --report_to wandb \
-    --run_name $SKYPILOT_TASK_ID \
-    --output_dir /checkpoint/bert_qa/ \
-    --save_total_limit 10 \
-    --save_steps 1000
-
-
+      --model_name_or_path bert-base-uncased \
+      --dataset_name squad \
+      --do_train \
+      --do_eval \
+      --per_device_train_batch_size 12 \
+      --learning_rate 3e-5 \
+      --num_train_epochs 50 \
+      --max_seq_length 384 \
+      --doc_stride 128 \
+      --report_to wandb \
+      --output_dir /checkpoint/bert_qa/ \
+      --run_name $SKYPILOT_TASK_ID \
+      --save_total_limit 10 \
+      --save_steps 1000

 As HuggingFace has built-in support for periodically checkpointing, we only need to pass the highlighted arguments for setting up
 the output directory and frequency of checkpointing (see more
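
Reviewer note on the `managed-jobs.rst` hunks: the YAML now declares `WANDB_API_KEY` under `envs` with no value, and its comment points to `--env WANDB_API_KEY=$WANDB_API_KEY` on `sky jobs launch` as the way to supply it. A small sketch of assembling that command line from the caller's environment is shown below. The helper name `launch_argv` is hypothetical; the job name, YAML file name, and `--env` flag are the ones shown in the docs text, and the failure on a missing key is a choice made for this sketch, not documented behavior.

```python
from typing import List, Mapping


def launch_argv(env: Mapping[str, str]) -> List[str]:
    """Build the `sky jobs launch` argv, forwarding the W&B key via --env."""
    key = env.get("WANDB_API_KEY")
    if not key:
        # Fail early instead of launching a job that cannot report to wandb.
        raise ValueError("WANDB_API_KEY is not set")
    return [
        "sky", "jobs", "launch", "-n", "bert-qa", "bert_qa.yaml",
        "--env", f"WANDB_API_KEY={key}",
    ]
```

A caller would typically pass `os.environ` and hand the list to `subprocess.run`; building the argv as a list avoids shell quoting issues with the key value.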

docs/source/getting-started/installation.rst

+1
@@ -264,6 +264,7 @@ The :code:`~/.oci/config` file should contain the following fields:
   fingerprint=aa:bb:cc:dd:ee:ff:gg:hh:ii:jj:kk:ll:mm:nn:oo:pp
   tenancy=ocid1.tenancy.oc1..aaaaaaaa
   region=us-sanjose-1
+  # Note that we should avoid using full home path for the key_file configuration, e.g. use ~/.oci instead of /home/username/.oci
   key_file=~/.oci/oci_api_key.pem

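
Reviewer note on the `installation.rst` hunk: the added comment recommends `~/.oci` over `/home/username/.oci` for `key_file`. The diff does not state why; keeping the value independent of one machine's home directory is the assumption behind this sketch. The helper below rewrites an absolute path under a given home directory into the `~` form. The name `to_tilde_path` is hypothetical and not part of SkyPilot or the OCI SDK.

```python
def to_tilde_path(path: str, home: str) -> str:
    """Rewrite `path` to start with `~` when it lives under `home`."""
    home = home.rstrip("/")
    if path == home:
        return "~"
    if path.startswith(home + "/"):
        # Keep everything after the home prefix, including the slash.
        return "~" + path[len(home):]
    # Paths outside the home directory are returned unchanged.
    return path
```

In practice `home` would come from `os.path.expanduser("~")`; it is a parameter here so the behavior does not depend on the machine running the example.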