Merge branch 'master' into kabir/issue-2598-jump
kbrgl authored Jan 13, 2024
2 parents ae7683d + 9743aa0 commit 1be67fc
Showing 177 changed files with 9,402 additions and 3,304 deletions.
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -77,7 +77,7 @@ It has some convenience features which you might find helpful (see [Dockerfile](
- Fork the SkyPilot repository and create a new branch for your changes.
- If relevant, add tests for your changes. For changes that touch the core system, run the [smoke tests](#testing) and ensure they pass.
- Follow the [Google style guide](https://google.github.io/styleguide/pyguide.html).
- Ensure code is properly formatted by running [`format.sh`](./format.sh).
- Ensure code is properly formatted by running [`format.sh`](https://github.com/skypilot-org/skypilot/blob/master/format.sh).
- Push your changes to your fork and open a pull request in the SkyPilot repository.
- In the PR description, write a `Tested:` section to describe relevant tests performed.

2 changes: 1 addition & 1 deletion Dockerfile_k8s
@@ -28,7 +28,7 @@ USER sky
# Install SkyPilot pip dependencies
RUN pip install wheel Click colorama cryptography jinja2 jsonschema && \
pip install networkx oauth2client pandas pendulum PrettyTable && \
pip install ray==2.4.0 rich tabulate filelock && \
pip install ray[default]==2.4.0 rich tabulate filelock && \
pip install packaging 'protobuf<4.0.0' pulp && \
pip install awscli boto3 pycryptodome==3.12.0 && \
pip install docker kubernetes
12 changes: 7 additions & 5 deletions README.md
@@ -27,8 +27,10 @@

----
:fire: *News* :fire:
- [Dec, 2023] Example: Using [LoRAX](https://github.com/predibase/lorax) to serve 1000s of finetuned LLMs on a single instance in the cloud: [**example**](./llm/lorax/)
- [Dec, 2023] [**Mixtral 8x7B**](https://mistral.ai/news/mixtral-of-experts/), a high quality sparse mixture-of-experts model, was released by Mistral AI! Deploy via SkyPilot on any cloud: [**example**](./llm/mixtral/).
- [Nov, 2023] Example: Using [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) to finetune Mistral 7B on the cloud (on-demand and spot): [**example**](./llm/axolotl/)
- [Sep, 2023] [**Mistral 7B**](https://mistral.ai/news/announcing-mistral-7b/), a high-quality open LLM, was released! Deploy via SkyPilot on any cloud: [**Mistral docs**](https://docs.mistral.ai/cloud-deployment/skypilot/)
- [Sep, 2023] [**Mistral 7B**](https://mistral.ai/news/announcing-mistral-7b/), a high-quality open LLM, was released! Deploy via SkyPilot on any cloud: [**Mistral docs**](https://docs.mistral.ai/self-deployment/skypilot)
- [Sep, 2023] Case study: [**Covariant**](https://covariant.ai/) transformed AI development on the cloud using SkyPilot, delivering models 4x faster cost-effectively: [**read the case study**](https://blog.skypilot.co/covariant/)
- [Aug, 2023] Cookbook: Finetuning Llama 2 in your own cloud environment, privately: [**example**](./llm/vicuna-llama-2/), [**blog post**](https://blog.skypilot.co/finetuning-llama2-operational-guide/)
- [July, 2023] Self-Hosted **Llama-2 Chatbot** on Any Cloud: [**example**](./llm/llama-2/)
@@ -48,7 +50,7 @@ SkyPilot **maximizes GPU availability for your jobs**:

SkyPilot **cuts your cloud costs**:
* [Managed Spot](https://skypilot.readthedocs.io/en/latest/examples/spot-jobs.html): 3-6x cost savings using spot VMs, with auto-recovery from preemptions
* Optimizer: 2x cost savings by auto-picking the cheapest VM/zone/region/cloud
* [Optimizer](https://skypilot.readthedocs.io/en/latest/examples/auto-failover.html): 2x cost savings by auto-picking the cheapest VM/zone/region/cloud
* [Autostop](https://skypilot.readthedocs.io/en/latest/reference/auto-stop.html): hands-free cleanup of idle clusters

SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code changes.
@@ -112,7 +114,7 @@ Prepare the workdir by cloning:
git clone https://github.com/pytorch/examples.git ~/torch_examples
```

Launch with `sky launch` (note: [access to GPU instances](https://skypilot.readthedocs.io/en/latest/reference/quota.html) is needed for this example):
Launch with `sky launch` (note: [access to GPU instances](https://skypilot.readthedocs.io/en/latest/cloud-setup/quota.html) is needed for this example):
```bash
sky launch my_task.yaml
```
@@ -136,7 +138,7 @@ To learn more, see our [Documentation](https://skypilot.readthedocs.io/en/latest

Runnable examples:
- LLMs on SkyPilot
- [Mistral 7B](https://docs.mistral.ai/cloud-deployment/skypilot/) (from official Mistral team)
- [Mixtral 8x7B](./llm/mixtral/); [Mistral 7B](https://docs.mistral.ai/self-deployment/skypilot/) (from official Mistral team)
- [vLLM: Serving LLM 24x Faster On the Cloud](./llm/vllm/) (from official vLLM team)
- [Vicuna chatbots: Training & Serving](./llm/vicuna/) (from official Vicuna team)
- [Train your own Vicuna on Llama-2](./llm/vicuna-llama-2/)
@@ -147,7 +149,7 @@ Runnable examples:
- [LocalGPT](./llm/localgpt)
- [Falcon](./llm/falcon)
- Add yours here & see more in [`llm/`](./llm)!
- Framework examples: [PyTorch DDP](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml), [DeepSpeed](./examples/deepspeed-multinode/sky.yaml), [JAX/Flax on TPU](https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml), [Stable Diffusion](https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion), [Detectron2](https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml), [Distributed](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py) [TensorFlow](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml), [NeMo](https://github.com/skypilot-org/skypilot/blob/master/examples/nemo/nemo.yaml), [programmatic grid search](https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py), [Docker](https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml), and [many more (`examples/`)](./examples).
- Framework examples: [PyTorch DDP](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml), [DeepSpeed](./examples/deepspeed-multinode/sky.yaml), [JAX/Flax on TPU](https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml), [Stable Diffusion](https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion), [Detectron2](https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml), [Distributed](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py) [TensorFlow](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml), [Ray Train](examples/distributed_ray_train/ray_train.yaml), [NeMo](https://github.com/skypilot-org/skypilot/blob/master/examples/nemo/nemo.yaml), [programmatic grid search](https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py), [Docker](https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml), and [many more (`examples/`)](./examples).

Follow updates:
- [Twitter](https://twitter.com/skypilot_org)
3 changes: 3 additions & 0 deletions docs/requirements-docs.txt
@@ -1,6 +1,7 @@
sphinx==4.3.2
sphinx-click==5.0.1
sphinx-copybutton==0.5.0
sphinx-design==0.4.1
pydata-sphinx-theme==0.7.2
sphinx-autodoc-typehints==1.17.0
sphinx-book-theme==0.2.0
@@ -11,3 +12,5 @@ sphinxcontrib-htmlhelp==2.0.0
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.5
sphinxemoji==0.2.0
myst-parser==0.18.1
9 changes: 9 additions & 0 deletions docs/source/_static/custom.js
@@ -11,3 +11,12 @@ document.addEventListener("DOMContentLoaded", function () {
script.async = true;
document.head.appendChild(script);
});

(function(h,o,t,j,a,r){
h.hj=h.hj||function(){(h.hj.q=h.hj.q||[]).push(arguments)};
h._hjSettings={hjid:3768396,hjsv:6};
a=o.getElementsByTagName('head')[0];
r=o.createElement('script');r.async=1;
r.src=t+h._hjSettings.hjid+j+h._hjSettings.hjsv;
a.appendChild(r);
})(window,document,'https://static.hotjar.com/c/hotjar-','.js?sv=');
29 changes: 21 additions & 8 deletions docs/source/cloud-setup/cloud-auth.rst
@@ -55,15 +55,28 @@ GCP
GCP Service Account
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`GCP Service Account <https://cloud.google.com/iam/docs/service-account-overview>`__ is supported.
`GCP Service Accounts
<https://cloud.google.com/iam/docs/service-account-overview>`__ are supported.

To use it to access GCP with SkyPilot, you need to setup the credentials:
.. tip::
A service account on your local machine can avoid the periodic
``google.auth.exceptions.RefreshError: Reauthentication is needed. Please
run `gcloud auth application-default login` to reauthenticate.`` error. A
service account is long-lived as it does not have an expiry time.

1. Download the key for the service account from the `GCP console <https://console.cloud.google.com/iam-admin/serviceaccounts>`__.
2. Set the environment variable `GOOGLE_APPLICATION_CREDENTIALS` to the path of the key file, and configure the gcloud CLI tool:
Set up a service account as follows:

.. code-block:: console
1. Follow the :ref:`instructions <gcp-service-account-creation>` to create a service account with the appropriate roles/permissions.
2. In the "Service Accounts" tab in the `IAM & Admin console
<https://console.cloud.google.com/iam-admin/iam>`__, click on the service
account to go to its detailed page. Click on the **KEYS** tab, then click on
**ADD KEY** to add a JSON key. The key will be downloaded automatically.
3. Set the environment variable ``GOOGLE_APPLICATION_CREDENTIALS`` to the path of the key file, and configure the gcloud CLI tool:

.. code-block:: console
$ export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
$ gcloud auth activate-service-account --key-file=$GOOGLE_APPLICATION_CREDENTIALS
$ gcloud config set project your-project-id
$ export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
$ gcloud auth activate-service-account --key-file=$GOOGLE_APPLICATION_CREDENTIALS
$ gcloud config set project your-project-id
You may want to add the export statement in your profile (e.g. ``~/.bashrc``, ``~/.zshrc``) so that it is set automatically in all new terminal sessions.
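
Before running the ``gcloud`` commands above, it can help to confirm that the variable points at a usable key file. The following is a minimal sketch, not part of SkyPilot: the helper name ``check_service_account_key`` is hypothetical, and it assumes the downloaded JSON key carries the usual ``type``, ``client_email`` and ``private_key`` fields.

```python
import json
import os
from pathlib import Path

# Fields a downloaded service account JSON key is assumed to contain.
REQUIRED_FIELDS = ("type", "client_email", "private_key")


def check_service_account_key(path):
    """Return a list of problems found in a key file (empty if none)."""
    key_path = Path(path).expanduser()
    if not key_path.is_file():
        return [f"key file not found: {key_path}"]
    try:
        data = json.loads(key_path.read_text())
    except json.JSONDecodeError as exc:
        return [f"key file is not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["key file does not contain a JSON object"]
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in data
    ]
    # A user credential file would have a different "type" value.
    if data.get("type") not in (None, "service_account"):
        problems.append(f"unexpected credential type: {data['type']}")
    return problems


if __name__ == "__main__":
    env_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not env_path:
        print("GOOGLE_APPLICATION_CREDENTIALS is not set")
    else:
        for line in check_service_account_key(env_path) or ["key file looks OK"]:
            print(line)
```

The check is purely local and structural; it does not contact GCP, so a key that passes may still be revoked or lack the required roles.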
10 changes: 10 additions & 0 deletions docs/source/cloud-setup/cloud-permissions/aws.rst
@@ -99,6 +99,16 @@ AWS accounts can be attached with a policy that limits the permissions of the ac
"iam:GetInstanceProfile"
],
"Resource": "arn:aws:iam::<account-ID-without-hyphens>:instance-profile/skypilot-v1"
},
{
"Effect": "Allow",
"Action": "iam:CreateServiceLinkedRole",
"Resource": "*",
"Condition": {
"StringEquals": {
"iam:AWSServiceName": "spot.amazonaws.com"
}
}
}
]
}
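
The statement added in this hunk lets the account create the service-linked role for ``spot.amazonaws.com``. As a rough illustration only, the sketch below checks whether a locally stored policy document contains such a statement. The function name ``allows_spot_service_linked_role`` is hypothetical, and the check is a simple structural match, not a reimplementation of AWS policy evaluation (it ignores wildcards, ``Deny`` statements, and other condition operators).

```python
# The statement from the diff above, as a Python dict.
SPOT_SLR_STATEMENT = {
    "Effect": "Allow",
    "Action": "iam:CreateServiceLinkedRole",
    "Resource": "*",
    "Condition": {"StringEquals": {"iam:AWSServiceName": "spot.amazonaws.com"}},
}


def allows_spot_service_linked_role(policy):
    """Return True if some statement matches the spot service-linked-role grant."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for stmt in statements:
        actions = stmt.get("Action", [])
        if isinstance(actions, str):  # "Action" may be a string or a list
            actions = [actions]
        service = (
            stmt.get("Condition", {})
            .get("StringEquals", {})
            .get("iam:AWSServiceName")
        )
        if (
            stmt.get("Effect") == "Allow"
            and "iam:CreateServiceLinkedRole" in actions
            and service == "spot.amazonaws.com"
        ):
            return True
    return False
```

A policy saved to disk could be loaded with ``json.load`` and passed to the function before attaching it to an account.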
68 changes: 66 additions & 2 deletions docs/source/cloud-setup/cloud-permissions/gcp.rst
@@ -94,6 +94,10 @@ User
resourcemanager.projects.get
resourcemanager.projects.getIamPolicy
.. note::

For custom VPC users (with :code:`gcp.vpc_name` specified in :code:`~/.sky/config.yaml`, check `here <#_gcp-bring-your-vpc>`_), :code:`compute.firewalls.create` and :code:`compute.firewalls.delete` are not needed unless ports must be opened via :code:`resources.ports` in the task YAML.

4. **Optional**: If the user needs to access GCS buckets, you can additionally add the following permissions:

.. code-block:: text
@@ -102,6 +106,7 @@ User
storage.buckets.get
storage.buckets.delete
storage.objects.create
storage.objects.update
storage.objects.delete
storage.objects.get
storage.objects.list
@@ -170,7 +175,7 @@ Service Account

If you already have a service account under the "Service Accounts" tab with an email starting with ``skypilot-v1@``, it was likely created by SkyPilot automatically, and you can skip this section.

1. Click the "Service Accounts" tab in the "IAM & Admin" console, and click on the **CREATE SERVICE ACCOUNT**.
1. Click the "Service Accounts" tab in the `IAM & Admin console <https://console.cloud.google.com/iam-admin/iam>`__, and click on **CREATE SERVICE ACCOUNT**.

.. image:: ../../images/screenshots/gcp/create-service-account.png
:width: 80%
@@ -184,7 +189,9 @@ Service Account
:align: center
:alt: Set Service Account Name

3. Select the ``minimal-skypilot-role`` (or the name you set) created in the last section and click on **DONE**.
3. Select the ``minimal-skypilot-role`` (or the name you set) created in the
last section and click on **DONE**. You can also choose to use the Default or
Medium Permissions roles as described in the previous sections.

.. image:: ../../images/screenshots/gcp/service-account-grant-role.png
:width: 60%
@@ -265,3 +272,60 @@ See details in :ref:`config-yaml`. Example use cases include using a private VP
VPC with fine-grained constraints, typically created via Terraform or manually.

The custom VPC should contain the :ref:`required firewall rules <gcp-minimum-firewall-rules>`.


.. _gcp-use-internal-ips:


Using Internal IPs
-----------------------
For security reasons, users may want SkyPilot instances to use internal IPs only.
To do so, specify the ``gcp.use_internal_ips`` and ``gcp.ssh_proxy_command`` fields in SkyPilot's global config file ``~/.sky/config.yaml`` (see :ref:`config-yaml` for the detailed syntax):

.. code-block:: yaml

    gcp:
      use_internal_ips: true
      # VPC with NAT setup, see below
      vpc_name: my-vpc-name
      ssh_proxy_command: ssh -W %h:%p -o StrictHostKeyChecking=no [email protected]

The ``gcp.ssh_proxy_command`` field is optional. If SkyPilot is run on a machine that can directly access the internal IPs of the instances, it can be omitted. Otherwise, it should be set to a command that can be used to proxy SSH connections to the internal IPs of the instances.


Cloud NAT Setup
~~~~~~~~~~~~~~~~

Instances created with only internal IPs on GCP cannot access the public internet by default. To ensure SkyPilot can install dependencies correctly on the instances,
Cloud NAT must be set up for the VPC (see `GCP's documentation <https://cloud.google.com/nat/docs/overview>`__ for details).


Cloud NAT is a regional resource, so it will need to be created in each region that SkyPilot will be used in.


.. image:: ../../images/screenshots/gcp/cloud-nat.png
:width: 80%
:align: center
:alt: GCP Cloud NAT

To restrict SkyPilot to specific regions, set ``gcp.ssh_proxy_command`` to a dict mapping each region to the SSH proxy command for that region (see :ref:`config-yaml` for details):

.. code-block:: yaml

    gcp:
      use_internal_ips: true
      vpc_name: my-vpc-name
      ssh_proxy_command:
        us-west1: ssh -W %h:%p -o StrictHostKeyChecking=no [email protected]
        us-east1: ssh -W %h:%p -o StrictHostKeyChecking=no [email protected]

If a proxy is not needed but the regions still need to be limited, set ``gcp.ssh_proxy_command`` to a dict mapping each region to ``null``:

.. code-block:: yaml

    gcp:
      use_internal_ips: true
      vpc_name: my-vpc-name
      ssh_proxy_command:
        us-west1: null
        us-east1: null
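
The three forms of ``gcp.ssh_proxy_command`` described above (omitted, a single command, or a per-region dict whose values may be ``null``) can be summarized in a short sketch. This is an illustration of the documented semantics only, not SkyPilot's implementation; the function name ``proxy_command_for_region`` and the ``user@proxy-west`` address are hypothetical.

```python
from typing import Dict, Optional, Tuple, Union

# None: field omitted; str: one command for all regions;
# dict: per-region command, where None means "no proxy needed".
ProxyConfig = Union[None, str, Dict[str, Optional[str]]]


def proxy_command_for_region(
    proxy_config: ProxyConfig, region: str
) -> Tuple[bool, Optional[str]]:
    """Return (region_allowed, proxy_command) for the given region."""
    if proxy_config is None or isinstance(proxy_config, str):
        # A single command (or no command) applies to every region.
        return True, proxy_config
    if region not in proxy_config:
        # Dict form: regions that are not listed are excluded.
        return False, None
    return True, proxy_config[region]


if __name__ == "__main__":
    config = {"us-west1": "ssh -W %h:%p user@proxy-west", "us-east1": None}
    for name in ("us-west1", "us-east1", "europe-west4"):
        print(name, proxy_command_for_region(config, name))
```

Returning the "allowed" flag separately from the command keeps the ``null`` case (region allowed, no proxy) distinct from the unlisted case (region excluded).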
3 changes: 3 additions & 0 deletions docs/source/conf.py
@@ -24,6 +24,7 @@
# -- General configuration

extensions = [
'sphinxemoji.sphinxemoji',
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.duration',
@@ -34,6 +35,8 @@
'sphinx_autodoc_typehints',
'sphinx_click',
'sphinx_copybutton',
'sphinx_design',
'myst_parser',
]

intersphinx_mapping = {
1 change: 1 addition & 0 deletions docs/source/developers/CONTRIBUTING.md