
Commit

Merge branch 'master' of github.com:skypilot-org/skypilot into restapi
Michaelvll committed Nov 13, 2023
1 parent 298fd81 commit 4d78b41
Showing 50 changed files with 1,854 additions and 521 deletions.
1 change: 1 addition & 0 deletions .github/workflows/pytest.yml
@@ -17,6 +17,7 @@ jobs:
python-version: [3.8]
test-path:
- tests/unit_tests
- tests/test_api.py
- tests/test_cli.py
- tests/test_config.py
- tests/test_global_user_state.py
54 changes: 51 additions & 3 deletions docs/source/cloud-setup/cloud-permissions/gcp.rst
@@ -66,7 +66,7 @@ User
compute.firewalls.create
compute.firewalls.delete
compute.firewalls.get
compute.instances.create
compute.instances.delete
compute.instances.get
compute.instances.list
@@ -148,8 +148,8 @@ User

.. note::

The user created with the above minimal permissions will not be able to create service accounts to be assigned to SkyPilot instances.

The admin needs to follow the :ref:`instruction below <gcp-service-account-creation>` to create a service account to be shared by all users in the project.


@@ -182,3 +182,51 @@ Service Account
:align: center
:alt: Set Service Account Role


.. _gcp-minimum-firewall-rules:

Firewall Rules
~~~~~~~~~~~~~~~~~~~

By default, users do not need to set up any special firewall rules to start
using SkyPilot. If the default VPC does not satisfy the minimal required rules,
a new VPC ``skypilot-vpc`` with sufficient rules will be automatically created
and used.

However, if you manually set up and instruct SkyPilot to use a VPC (see
:ref:`here <config-yaml>`), ensure it has the following required firewall rules:

.. code-block:: python

  # Allow internal connections between SkyPilot VMs:
  #
  #   controller -> head node of a cluster
  #   head node of a cluster <-> worker node(s) of a cluster
  #
  # NOTE: these ports are more relaxed than absolute minimum, but the
  # sourceRanges restrict the traffic to internal IPs.
  {
      "direction": "INGRESS",
      "allowed": [
          {"IPProtocol": "tcp", "ports": ["0-65535"]},
          {"IPProtocol": "udp", "ports": ["0-65535"]},
      ],
      "sourceRanges": ["10.128.0.0/9"],
  },

  # Allow SSH connections from user machine(s)
  #
  # NOTE: This can be satisfied using the following relaxed sourceRanges
  # (0.0.0.0/0), but you can customize it if you want to restrict to certain
  # known public IPs (useful when using internal VPN or proxy solutions).
  {
      "direction": "INGRESS",
      "allowed": [
          {"IPProtocol": "tcp", "ports": ["22"]},
      ],
      "sourceRanges": ["0.0.0.0/0"],
  },

You can inspect and manage firewall rules at
``https://console.cloud.google.com/net-security/firewall-manager/firewall-policies/list?project=<your-project-id>``
or using any of GCP's SDKs.
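
If you maintain such a VPC yourself, you can point SkyPilot to it in
:code:`~/.sky/config.yaml` (see :ref:`here <config-yaml>`). A minimal sketch,
assuming your VPC is named ``my-vpc`` (a hypothetical name):

.. code-block:: yaml

  # ~/.sky/config.yaml
  gcp:
    # Use the manually configured VPC (and its firewall rules) for all
    # newly launched GCP instances.
    vpc_name: my-vpc
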
146 changes: 103 additions & 43 deletions docs/source/examples/docker-containers.rst
@@ -3,73 +3,59 @@
Using Docker Containers
=======================

SkyPilot can run a container either as a task, or as the runtime environment of a cluster.

* If the container image is invocable / has an entrypoint: run it :ref:`as a task <docker-containers-as-tasks>`.
* Otherwise, the container image is likely to be used as a runtime environment (e.g., ``ubuntu``) and you likely have extra commands to run inside the container: run it :ref:`as a runtime environment <docker-containers-as-runtime-environments>`.

.. _docker-containers-as-tasks:

Running Containers as Tasks
---------------------------

SkyPilot can run containerized applications directly as regular tasks. The default VM images provided by SkyPilot already have the Docker runtime pre-configured.

To launch a containerized application, you can directly invoke :code:`docker run` in the :code:`run` section of your task.

For example, to run a HuggingFace TGI serving container:

.. code-block:: yaml

  resources:
    accelerators: A100:1

  run: |
    docker run --gpus all --shm-size 1g -v ~/data:/data \
      ghcr.io/huggingface/text-generation-inference \
      --model-id lmsys/vicuna-13b-v1.5

    # NOTE: Uncommon to have any commands after the above.
    # `docker run` is blocking, so any commands after it
    # will NOT be run inside the container.

Private Registries
^^^^^^^^^^^^^^^^^^

When using this mode, to access Docker images hosted on private registries,
simply add a :code:`setup` section to your task YAML file to authenticate with
the registry:

.. code-block:: yaml

  resources:
    accelerators: A100:1

  setup: |
    # Authenticate with private registry
    docker login -u <username> -p <password> <registry>

  run: |
    docker run <registry>/<image>:<tag>

Building containers remotely
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you are running containerized applications, the container image can also be built remotely on the cluster in the :code:`setup` phase of the task.

The :code:`echo_app` `example <https://github.com/skypilot-org/skypilot/tree/master/examples/docker>`_ provides an example on how to do this:

@@ -100,3 +86,77 @@ The inputs to the app are copied to SkyPilot using :code:`file_mounts` and mount
The output of the app, produced at the :code:`/outputs` path in the container, is also volume-mounted to :code:`/outputs` on the VM, which gets directly written to an S3 bucket through SkyPilot Storage mounting.
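
As a rough sketch of this pattern (the image name, paths, and bucket name below
are hypothetical), such a task could look like:

.. code-block:: yaml

  file_mounts:
    /inputs: ./echo_app        # copy the local app and inputs to the VM
    /outputs:                  # mount an object storage bucket for outputs
      name: my-output-bucket   # hypothetical bucket name
      mode: MOUNT

  setup: |
    # Build the container image remotely on the cluster.
    docker build -t echo:v0 /inputs

  run: |
    # Volume-mount the VM paths into the container and run it.
    docker run --rm -v /inputs:/inputs -v /outputs:/outputs echo:v0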

Our GitHub repository has more examples, including running `Detectron2 in a Docker container <https://github.com/skypilot-org/skypilot/blob/master/examples/detectron2_docker.yaml>`_ via SkyPilot.

.. _docker-containers-as-runtime-environments:

Using Containers as Runtime Environments
----------------------------------------

When a container is used as the runtime environment, everything happens inside the container:

- The SkyPilot runtime is automatically installed and launched inside the container;
- :code:`setup` and :code:`run` commands are executed in the container;
- Any files created by the task will be stored inside the container.

To use a Docker image as your runtime environment, set the :code:`image_id` field in the :code:`resources` section of your task YAML file to :code:`docker:<image_id>`.
For example, to use the :code:`ubuntu:20.04` image from Docker Hub:

.. code-block:: yaml

  resources:
    image_id: docker:ubuntu:20.04

  setup: |
    # Commands to run inside the container

  run: |
    # Commands to run inside the container

Any GPUs assigned to the task will be automatically mapped to your Docker container, and all subsequent tasks within the cluster will also run inside the container. In a multi-node scenario, the container will be launched on all nodes, and the corresponding node's container will be assigned for task execution.
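
For example, a multi-node task that uses a container as its runtime environment
might look like the following sketch (the image and command are placeholders):

.. code-block:: yaml

  resources:
    image_id: docker:ubuntu:20.04
    accelerators: V100:1   # GPUs are mapped into the container on each node

  num_nodes: 2             # the container is launched on every node

  run: |
    # Runs inside the container on each node of the cluster.
    echo "Hello from node $(hostname)"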

.. tip::

**When to use this?**

If you have a preconfigured development environment set up within a Docker
image, it can be convenient to use the runtime environment mode. This is
especially useful for launching development environments that are
challenging to configure on a new virtual machine, such as dependencies on
specific versions of CUDA or cuDNN.

.. note::

Since we ``pip install skypilot`` inside the user-specified container image
as part of a launch, users should ensure dependency conflicts do not occur.

Currently, the following requirements must be met:

1. The container image should be based on Debian;

2. The container image must grant sudo permissions without requiring password authentication for the user. Having a root user is also acceptable.

Private Registries
^^^^^^^^^^^^^^^^^^

When using this mode, to access Docker images hosted on private registries,
you can provide the registry authentication details using :ref:`task environment variables <env-vars>`:

.. code-block:: yaml

  # ecr_private_docker.yaml
  resources:
    image_id: docker:<your-user-id>.dkr.ecr.us-east-1.amazonaws.com/<your-private-image>:<tag>
    # the following shorthand is also supported:
    # image_id: docker:<your-private-image>:<tag>

  envs:
    SKYPILOT_DOCKER_USERNAME: AWS
    # SKYPILOT_DOCKER_PASSWORD: <password>
    SKYPILOT_DOCKER_SERVER: <your-user-id>.dkr.ecr.us-east-1.amazonaws.com

We suggest setting the :code:`SKYPILOT_DOCKER_PASSWORD` environment variable through the CLI (see :ref:`passing secrets <passing-secrets>`):

.. code-block:: console

  $ export SKYPILOT_DOCKER_PASSWORD=$(aws ecr get-login-password --region us-east-1)
  $ sky launch ecr_private_docker.yaml --env SKYPILOT_DOCKER_PASSWORD

24 changes: 16 additions & 8 deletions docs/source/examples/spot-jobs.rst
@@ -275,26 +275,34 @@ you can still tear it down manually with
Customizing spot controller resources
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You may want to customize the resources of the spot controller for several reasons:

1. Using a lower-cost controller (if you have a low number of concurrent spot jobs).
2. Forcing the spot controller to run in a specific location. (Default: the cheapest location)
3. Changing the maximum number of spot jobs that can run concurrently, which is 2x the vCPUs of the controller. (Default: 16)
4. Changing the disk_size of the spot controller to store more logs. (Default: 50GB)

To achieve the above, you can specify custom configs in :code:`~/.sky/config.yaml` with the following fields:

.. code-block:: yaml

  spot:
    # NOTE: these settings only take effect for a new spot controller, not if
    # you have an existing one.
    controller:
      resources:
        # All configs below are optional.
        # Specify the location of the spot controller.
        cloud: gcp
        region: us-central1
        # Specify the maximum number of spot jobs that can be run concurrently.
        cpus: 4+  # number of vCPUs, max concurrent spot jobs = 2 * cpus
        # Specify the disk_size in GB of the spot controller.
        disk_size: 100

The :code:`resources` field has the same spec as a normal SkyPilot job; see `here <https://skypilot.readthedocs.io/en/latest/reference/yaml-spec.html>`__.

.. note::

  These settings will not take effect if you have an existing controller (either
  stopped or live). For them to take effect, tear down the existing controller
  first, which requires all in-progress spot jobs to finish or be canceled.
39 changes: 33 additions & 6 deletions docs/source/reference/config.rst
@@ -57,7 +57,7 @@ Available fields and semantics:
# with this name (provisioner automatically looks for such regions).
# Regions without a VPC with this name will not be used to launch nodes.
#
# Default: null (use the default VPC in each region).
vpc_name: skypilot-vpc
# Should instances be assigned private IPs only? (optional)
@@ -88,7 +88,7 @@ Available fields and semantics:
# and any SkyPilot nodes. (This option is not used between SkyPilot nodes,
# since they are behind the proxy / may not have such a proxy set up.)
#
# Optional; default: null.
### Format 1 ###
# A string; the same proxy command is used for all regions.
ssh_proxy_command: ssh -W %h:%p -i ~/.ssh/sky-key -o StrictHostKeyChecking=no ec2-user@<jump server public ip>
@@ -103,6 +103,24 @@ Available fields and semantics:
# Advanced GCP configurations (optional).
# Apply to all new instances but not existing ones.
gcp:
# VPC to use (optional).
#
# Default: null, which implies the following behavior. First, the VPC named
# 'default' is checked against minimal recommended firewall rules for
# SkyPilot to function. If it satisfies these rules, this VPC is used.
# Otherwise, a new VPC named 'skypilot-vpc' is automatically created with
# the minimal recommended firewall rules and will be used.
#
# If this field is set, SkyPilot will use the VPC with this name. This is
# useful when users want to manually set up a VPC and precisely control its
# firewall rules. If no region restrictions are given, SkyPilot only
# provisions in regions for which a subnet of this VPC exists. An error is
# thrown if a VPC with this name is not found. The VPC does not get modified
# in any way, except when opening ports (e.g., via `resources.ports`), in
# which case new firewall rules permitting public traffic to those ports
# will be added.
vpc_name: skypilot-vpc
# Reserved capacity (optional).
#
# The specific reservation to be considered when provisioning clusters on GCP.
Expand All @@ -117,15 +135,24 @@ Available fields and semantics:
# Advanced Kubernetes configurations (optional).
kubernetes:
# The networking mode for accessing SSH jump pod (optional).
# This must be either: 'nodeport' or 'portforward'. If not specified,
# defaults to 'portforward'.
#
# nodeport: Exposes the jump pod SSH service on a static port number on each
# Node, allowing external access using <NodeIP>:<NodePort>. Using this
# mode requires opening multiple ports on nodes in the Kubernetes cluster.
#
# portforward: Uses `kubectl port-forward` to create a tunnel and directly
# access the jump pod SSH service in the Kubernetes cluster. Does not
# require opening ports on the cluster nodes and is more secure. 'portforward'
# is used as the default if 'networking' is not specified.
networking: portforward
# Advanced OCI configurations (optional).
oci:
# A dict mapping region names to region-specific configurations, or
# `default` for the default configuration.
default:
# The OCID of the profile to use for launching instances (optional).
oci_config_profile: DEFAULT
12 changes: 10 additions & 2 deletions docs/source/reference/yaml-spec.rst
@@ -43,8 +43,16 @@ Available fields:
# Accelerator name and count per node (optional).
#
# Use `sky show-gpus` to view available accelerator configurations.
#
# The following three ways are valid for specifying accelerators for a cluster:
#
# To specify a single accelerator:
#   Format: <name>:<count> (or simply <name>, short for a count of 1).
#   accelerators: V100:4
#
# To specify an ordered list of accelerators (the accelerators are tried in the specified order):
#   Format: [<name>:<count>, ...]
#   accelerators: ['K80:1', 'V100:1', 'T4:1']
#
# To specify an unordered set of accelerators (all specified accelerators are optimized together, and the lowest-cost one is tried first):
#   Format: {<name>:<count>, ...}
#   accelerators: {'K80:1', 'V100:1', 'T4:1'}
accelerators: V100:4
# Number of vCPUs per node (optional).
