Commit

Merge branch 'master' of https://github.com/skypilot-org/skypilot into k8s_gpulabel_validation

# Conflicts:
#	sky/utils/kubernetes_utils.py
#	tests/test_optimizer_dryruns.py
romilbhardwaj committed Nov 6, 2023
2 parents fd171f1 + 5a35ab6 commit bc51c5d
Showing 93 changed files with 2,384 additions and 862 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/pytest.yml
Original file line number Diff line number Diff line change
@@ -51,7 +51,7 @@ jobs:
run: |
python -m pip install --upgrade pip
pip install ".[all]"
pip install pytest pytest-xdist pytest-env>=0.6
pip install pytest pytest-xdist pytest-env>=0.6 memory-profiler==0.61.0
- name: Run tests with pytest
run: SKYPILOT_DISABLE_USAGE_COLLECTION=1 SKYPILOT_SKIP_CLOUD_IDENTITY_CHECK=1 pytest -n 1 --dist no ${{ matrix.test-path }}
2 changes: 1 addition & 1 deletion README.md
@@ -146,7 +146,7 @@ Runnable examples:
- [LocalGPT](./llm/localgpt)
- [Falcon](./llm/falcon)
- Add yours here & see more in [`llm/`](./llm)!
- Framework examples: [PyTorch DDP](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml), [DeepSpeed](./examples/deepspeed-multinode/sky.yaml), [JAX/Flax on TPU](https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml), [Stable Diffusion](https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion), [Detectron2](https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml), [Distributed](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py) [TensorFlow](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml), [programmatic grid search](https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py), [Docker](https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml), and [many more (`examples/`)](./examples).
- Framework examples: [PyTorch DDP](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_torch.yaml), [DeepSpeed](./examples/deepspeed-multinode/sky.yaml), [JAX/Flax on TPU](https://github.com/skypilot-org/skypilot/blob/master/examples/tpu/tpuvm_mnist.yaml), [Stable Diffusion](https://github.com/skypilot-org/skypilot/tree/master/examples/stable_diffusion), [Detectron2](https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml), [Distributed](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_distributed_tf_app.py) [TensorFlow](https://github.com/skypilot-org/skypilot/blob/master/examples/resnet_app_storage.yaml), [NeMo](https://github.com/skypilot-org/skypilot/blob/master/examples/nemo/nemo.yaml), [programmatic grid search](https://github.com/skypilot-org/skypilot/blob/master/examples/huggingface_glue_imdb_grid_search_app.py), [Docker](https://github.com/skypilot-org/skypilot/blob/master/examples/docker/echo_app.yaml), and [many more (`examples/`)](./examples).

Follow updates:
- [Twitter](https://twitter.com/skypilot_org)
22 changes: 22 additions & 0 deletions docs/source/_static/custom.css
@@ -0,0 +1,22 @@
#kapa-widget-container {
z-index: 10000 !important;
position: absolute !important;
}

.mantine-Modal-root {
z-index: 10000;
position: absolute;
}

#kapa-widget-container figure {
padding: 0 !important;
}

.mantine-Modal-root figure {
padding: 0 !important;
}

/* RTD Flyout */
.rst-versions {
z-index: 10001 !important; /* Set to 10001 to avoid overlap with kapa */
}
13 changes: 13 additions & 0 deletions docs/source/_static/custom.js
@@ -0,0 +1,13 @@
document.addEventListener("DOMContentLoaded", function () {
var script = document.createElement("script");
script.src = "https://widget.kapa.ai/kapa-widget.bundle.js";
script.setAttribute("data-website-id", "4223d017-a3d2-4b92-b191-ea4d425a23c3");
script.setAttribute("data-project-name", "SkyPilot");
script.setAttribute("data-project-color", "#4C4C4D");
script.setAttribute("data-project-logo", "https://avatars.githubusercontent.com/u/109387420?s=100&v=4");
script.setAttribute("data-modal-disclaimer", "Results are automatically generated and may be inaccurate or contain inappropriate information. Do not include any sensitive information in your query.");
script.setAttribute("data-modal-title", "SkyPilot Docs AI - Ask a Question.");
script.setAttribute("data-button-position-bottom", "85px");
script.async = true;
document.head.appendChild(script);
});
2 changes: 1 addition & 1 deletion docs/source/cloud-setup/cloud-permissions/gcp.rst
@@ -28,7 +28,7 @@ The easiest way to grant permissions to a user access your GCP project without t
roles/iam.securityAdmin
.. note::
If the ``roles/iam.securityAdmin`` role is undesirable, you can do the following. First, include the role and have any user (e.g., the admin) run ``sky launch --cloud gcp`` successfully once. This is to create the necessary service account. Then, remove the role from the list above.
If the ``roles/iam.securityAdmin`` role is undesirable, you can do the following. First, include the role and have any user (e.g., the admin) run ``sky launch --cloud gcp`` successfully once. This is to create the necessary service account. Then, replace the role ``roles/iam.securityAdmin`` with ``roles/iam.roleViewer`` in the list above.


Optionally, to use TPUs, add the following role:
2 changes: 2 additions & 0 deletions docs/source/conf.py
@@ -94,3 +94,5 @@
# relative to this directory. They are copied after the builtin static files,
# so a file named 'default.css' will overwrite the builtin 'default.css'.
html_static_path = ['_static']
html_js_files = ["custom.js"]
html_css_files = ["custom.css"]
156 changes: 108 additions & 48 deletions docs/source/examples/docker-containers.rst
@@ -3,73 +3,59 @@
Using Docker Containers
=======================

SkyPilot can run a Docker container either as the runtime environment for your task, or as the task itself.
SkyPilot can run a container either as a task, or as the runtime environment of a cluster.

Using Docker Containers as Runtime Environment
----------------------------------------------
* If the container image is invocable / has an entrypoint: run it :ref:`as a task <docker-containers-as-tasks>`.
* Otherwise, the container image is likely to be used as a runtime environment (e.g., ``ubuntu``) and you likely have extra commands to run inside the container: run it :ref:`as a runtime environment <docker-containers-as-runtime-environments>`.

When a container is used as the runtime environment, the SkyPilot task is executed inside the container.
.. _docker-containers-as-tasks:

This means all :code:`setup` and :code:`run` commands in the YAML file will be executed in the container, and any files created by the task will be stored inside the container.
Any GPUs assigned to the task will be automatically mapped to your Docker container and all future tasks on the cluster will also execute in the container.
Running Containers as Tasks
---------------------------

To use a Docker image as your runtime environment, set the :code:`image_id` field in the :code:`resources` section of your task YAML file to :code:`docker:<image_id>`.
For example, to use the :code:`ubuntu:20.04` image from Docker Hub:

.. code-block:: yaml
resources:
image_id: docker:ubuntu:20.04
setup: |
# Will run inside container
SkyPilot can run containerized applications directly as regular tasks. The default VM images provided by SkyPilot already have the Docker runtime pre-configured.

run: |
# Will run inside container
To launch a containerized application, you can directly invoke :code:`docker run` in the :code:`run` section of your task.

For Docker images hosted on private registries, you can provide the registry authentication details using :ref:`task environment variables <env-vars>`:
For example, to run a HuggingFace TGI serving container:

.. code-block:: yaml
# ecr_private_docker.yaml
resources:
image_id: docker:<your-user-id>.dkr.ecr.us-east-1.amazonaws.com/<your-private-image>:<tag>
# the following shorthand is also supported:
# image_id: docker:<your-private-image>:<tag>
envs:
SKYPILOT_DOCKER_USERNAME: AWS
# SKYPILOT_DOCKER_PASSWORD: <password>
SKYPILOT_DOCKER_SERVER: <your-user-id>.dkr.ecr.us-east-1.amazonaws.com
We suggest to setting the :code:`SKYPILOT_DOCKER_PASSWORD` environment variable through the CLI (see :ref:`passing secrets <passing-secrets>`):

.. code-block:: bash
accelerators: A100:1
$ export SKYPILOT_DOCKER_PASSWORD=$(aws ecr get-login-password --region us-east-1)
$ sky launch ecr_private_docker.yaml --env SKYPILOT_DOCKER_PASSWORD
Running Docker Containers as Tasks
----------------------------------
run: |
docker run --gpus all --shm-size 1g -v ~/data:/data \
ghcr.io/huggingface/text-generation-inference \
--model-id lmsys/vicuna-13b-v1.5
As an alternative, SkyPilot can run docker containers as tasks. Docker runtime is configured and ready for use on the default VM image used by SkyPilot.
# NOTE: Uncommon to have any commands after the above.
# `docker run` is blocking, so any commands after it
# will NOT be run inside the container.
To run a container as a task, you can directly invoke the :code:`docker run` command in the :code:`run` section of your task.
Private Registries
^^^^^^^^^^^^^^^^^^

For example, to run a GPU-accelerated container that prints the output of :code:`nvidia-smi`:
When using this mode, to access Docker images hosted on private registries,
simply add a :code:`setup` section to your task YAML file to authenticate with
the registry:

.. code-block:: yaml
resources:
accelerators: V100:1
accelerators: A100:1
setup: |
# Authenticate with private registry
docker login -u <username> -p <password> <registry>
run: |
docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
docker run <registry>/<image>:<tag>
Building containers remotely
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you are running the container as a task, the container image can also be built remotely on the cluster in the :code:`setup` phase of the task.
If you are running containerized applications, the container image can also be built remotely on the cluster in the :code:`setup` phase of the task.

The :code:`echo_app` `example <https://github.com/skypilot-org/skypilot/tree/master/examples/docker>`_ provides an example on how to do this:

@@ -89,14 +75,88 @@ The :code:`echo_app` `example <https://github.com/skypilot-org/skypilot/tree/mas
run: |
docker run --rm \
--volume="/inputs:/inputs:ro" \
--volume="/outputs:/outputs:rw" \
echo:v0 \
/inputs/README.md /outputs/output.txt
--volume="/inputs:/inputs:ro" \
--volume="/outputs:/outputs:rw" \
echo:v0 \
/inputs/README.md /outputs/output.txt
In this example, the Dockerfile and build context are contained in :code:`./echo_app`.
The :code:`setup` phase of the task builds the image, and the :code:`run` phase runs the container.
The inputs to the app are copied to SkyPilot using :code:`file_mounts` and mounted into the container using docker volume mounts (:code:`--volume` flag).
The output of the app produced at :code:`/outputs` path in the container is also volume mounted to :code:`/outputs` on the VM, which gets directly written to a S3 bucket through SkyPilot Storage mounting.

Our GitHub repository has more examples, including running `Detectron2 in a Docker container <https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml>`_ via SkyPilot.
Our GitHub repository has more examples, including running `Detectron2 in a Docker container <https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml>`_ via SkyPilot.

.. _docker-containers-as-runtime-environments:

Using Containers as Runtime Environments
----------------------------------------

When a container is used as the runtime environment, everything happens inside the container:

- The SkyPilot runtime is automatically installed and launched inside the container;
- :code:`setup` and :code:`run` commands are executed in the container;
- Any files created by the task will be stored inside the container.

To use a Docker image as your runtime environment, set the :code:`image_id` field in the :code:`resources` section of your task YAML file to :code:`docker:<image_id>`.
For example, to use the :code:`ubuntu:20.04` image from Docker Hub:

.. code-block:: yaml
resources:
image_id: docker:ubuntu:20.04
setup: |
# Commands to run inside the container
run: |
# Commands to run inside the container
Any GPUs assigned to the task will be automatically mapped to your Docker container, and all subsequent tasks within the cluster will also run inside the container. In a multi-node scenario, the container will be launched on all nodes, and the corresponding node's container will be assigned for task execution.

.. tip::

**When to use this?**

If you have a preconfigured development environment set up within a Docker
image, it can be convenient to use the runtime environment mode. This is
especially useful for launching development environments that are
challenging to configure on a new virtual machine, such as dependencies on
specific versions of CUDA or cuDNN.

.. note::

Since we ``pip install skypilot`` inside the user-specified container image
as part of a launch, users should ensure dependency conflicts do not occur.

Currently, the following requirements must be met:

1. The container image should be based on Debian;

2. The container image must grant sudo permissions without requiring password authentication for the user. Having a root user is also acceptable.

Private Registries
^^^^^^^^^^^^^^^^^^

When using this mode, to access Docker images hosted on private registries,
you can provide the registry authentication details using :ref:`task environment variables <env-vars>`:

.. code-block:: yaml
# ecr_private_docker.yaml
resources:
image_id: docker:<your-user-id>.dkr.ecr.us-east-1.amazonaws.com/<your-private-image>:<tag>
# the following shorthand is also supported:
# image_id: docker:<your-private-image>:<tag>
envs:
SKYPILOT_DOCKER_USERNAME: AWS
# SKYPILOT_DOCKER_PASSWORD: <password>
SKYPILOT_DOCKER_SERVER: <your-user-id>.dkr.ecr.us-east-1.amazonaws.com
We suggest setting the :code:`SKYPILOT_DOCKER_PASSWORD` environment variable through the CLI (see :ref:`passing secrets <passing-secrets>`):

.. code-block:: console
$ export SKYPILOT_DOCKER_PASSWORD=$(aws ecr get-login-password --region us-east-1)
$ sky launch ecr_private_docker.yaml --env SKYPILOT_DOCKER_PASSWORD
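The relationship between ``image_id`` and ``SKYPILOT_DOCKER_SERVER`` in the ECR example above can be sketched in a few lines of Python. This is only an illustration: ``registry_server`` is a hypothetical helper, not part of SkyPilot, and it assumes the common Docker convention that the first path component names a registry host when it contains a ``.`` or ``:`` or equals ``localhost``.

```python
from typing import Optional

DOCKER_PREFIX = "docker:"


def registry_server(image_id: str) -> Optional[str]:
    """Return the registry host embedded in a `docker:<image>` string.

    Hypothetical helper for illustration only. Returns None when the
    image uses the shorthand form (no registry host in the name).
    """
    if not image_id.startswith(DOCKER_PREFIX):
        raise ValueError(f"expected an image_id starting with {DOCKER_PREFIX!r}")
    image = image_id[len(DOCKER_PREFIX):]
    # Split off the first path component, if the name has one.
    first, sep, _rest = image.partition("/")
    # Assumed convention: a host contains '.' or ':' or is 'localhost'.
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return None


# The account id and image name below are made-up placeholders.
server = registry_server(
    "docker:123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest"
)
print(server)
```

With the shorthand form (``docker:<your-private-image>:<tag>``), the helper returns ``None``, which is why ``SKYPILOT_DOCKER_SERVER`` has to be supplied explicitly in ``envs``.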
85 changes: 0 additions & 85 deletions docs/source/examples/iterative-dev-project.rst

This file was deleted.

12 changes: 6 additions & 6 deletions docs/source/examples/ports.rst
@@ -9,7 +9,7 @@ At times, it might be crucial to expose specific ports on your cluster to the pu
- **Creating Web Services**: Whether you're setting up a web server, database, or another service, they all communicate via specific ports that need to be accessible.
- **Collaborative Tools**: Some tools and platforms may require port openings to enable collaboration with teammates or to integrate with other services.

Opening Ports for SkyPilot cluster
Opening Ports on a Cluster
----------------------------------

To open a port on a SkyPilot cluster, specify :code:`ports` in the :code:`resources` section of your task. For example, here is a YAML configuration to expose a Jupyter Lab server:
@@ -26,22 +26,22 @@ To open a port on a SkyPilot cluster, specify :code:`ports` in the :code:`resour
In this example, the :code:`run` command will start the Jupyter Lab server on port 8888. By specifying :code:`ports: 8888`, SkyPilot will expose port 8888 on the cluster, making the jupyter server publicly accessible. To launch and access the server, run:

.. code-block:: bash
.. code-block:: console
$ sky launch -c jupyter jupyter_lab.yaml
and look in the logs for some output like:

.. code-block:: bash
.. code-block:: console
Jupyter Server 2.7.0 is running at:
http://127.0.0.1:8888/lab?token=<token>
To get the public IP address of the head node of the cluster, run :code:`sky status --ip jupyter`:

.. code-block:: bash
.. code-block:: console
$ sky status --ip jupyter
35.223.97.21
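The URL printed in the Jupyter logs points at ``127.0.0.1``, so to reach the server from outside the loopback host has to be replaced with the public IP returned by ``sky status --ip``. A minimal sketch of that substitution, using only the Python standard library; ``public_url`` is a hypothetical helper name and the token value is a placeholder:

```python
from urllib.parse import urlsplit, urlunsplit


def public_url(logged_url: str, public_ip: str) -> str:
    """Swap the host of a logged URL for the cluster's public IP.

    The port, path, and query string (including the token) are kept.
    """
    parts = urlsplit(logged_url)
    # Preserve the port if the logged URL has one.
    netloc = f"{public_ip}:{parts.port}" if parts.port is not None else public_ip
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


# "abc123" stands in for the real token from the logs.
print(public_url("http://127.0.0.1:8888/lab?token=abc123", "35.223.97.21"))
```

This only rewrites the address; the port still has to be opened via ``ports`` for the resulting URL to be reachable.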
@@ -59,7 +59,7 @@ If you want to expose multiple ports, you can specify a list of ports or port ra
SkyPilot also supports opening ports through the CLI:

.. code-block:: bash
.. code-block:: console
$ sky launch -c jupyter --ports 8888 jupyter_lab.yaml