[docs] Add comparison page #3756

Merged
11 commits merged on Aug 2, 2024
Changes from 7 commits
1 change: 1 addition & 0 deletions docs/source/docs/index.rst
@@ -161,6 +161,7 @@ Contents
   ../reference/tpu
   ../reference/logging
   ../reference/faq
   SkyPilot vs. Other Systems <../reference/comparison>


.. toctree::
178 changes: 178 additions & 0 deletions docs/source/reference/comparison.rst
romilbhardwaj marked this conversation as resolved.
@@ -0,0 +1,178 @@
.. _sky-compare:

Comparing SkyPilot with other systems
=====================================

We are often asked "How does SkyPilot compare with XYZ?". Providing an unbiased and up-to-date answer to such questions is not easy, especially when the differences may be qualitative.
Member:

Kind of feel this para is a bit odd. Seems the message isn't lost if it's left out. Wdyt? Cc @Michaelvll for a look too.

Collaborator Author:

I added it because comparisons can be sensitive for the authors of other systems, especially if we do not present their strengths. I think this should stay in.

Collaborator (@Michaelvll, Jul 31, 2024):

We can keep this, but it may be good to rephrase the sentence a bit. Some references for how other libraries start their comparisons:

Because of this flexibility, Terraform can be used to solve many different problems. This means there are a number of existing tools that overlap with the capabilities of Terraform. We compare Terraform to a number of these tools, but it should be noted that Terraform is not mutually exclusive with other systems. It can be used to manage a single application, or the entire datacenter.
Because of this broad array of supported scenarios, there are many tools that overlap with Pulumi’s capabilities. Many of these are complementary and can be used together, whereas some are “either or” decisions.

Collaborator Author:

Reworded the paragraph. wdyt?


This page tries to provide a comparison of SkyPilot with other systems, focusing on the unique features of SkyPilot. We welcome feedback and contributions to this page.


SkyPilot vs Vanilla Kubernetes
------------------------------

Kubernetes is a powerful system for managing containerized applications. :ref:`Using SkyPilot to access your Kubernetes cluster <kubernetes-overview>` boosts developer productivity and allows you to scale your infra beyond a single Kubernetes cluster.
Collaborator:

Can we show an architecture diagram showing how SkyPilot relates to Kubernetes clusters? For example, a user interacts with SkyPilot, and SkyPilot sends the requests to the underlying Kubernetes cluster.

Collaborator Author:

Good idea - added an architecture figure. wdyt?

Member:

LGTM.

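To make this concrete, here is a minimal, hypothetical SkyPilot task that targets an existing Kubernetes cluster (this sketch assumes the cluster is already enabled as an infra choice for SkyPilot via your kubeconfig and has L4 GPUs available; the workload is a placeholder):

.. code-block:: yaml

   resources:
     cloud: kubernetes    # run this task on the Kubernetes cluster
     accelerators: L4:1   # request one L4 GPU from the cluster

   # Placeholder workload: verify the GPU is visible inside the pod.
   run: |
     nvidia-smi

Launching it with ``sky launch`` provisions a pod on the cluster, runs the command, and streams the logs back, without writing any Kubernetes manifests.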

Faster developer velocity
^^^^^^^^^^^^^^^^^^^^^^^^^

.. figure:: https://blog.skypilot.co/ai-on-kubernetes/images/k8s_vs_skypilot_iterative_v2.png
   :align: center
romilbhardwaj marked this conversation as resolved.
   :width: 95%
   :alt: Iterative Development with Kubernetes vs SkyPilot

   Iterative development with Kubernetes requires tedious updates to Docker images and multiple steps to update the training run. With SkyPilot, all you need is one command (``sky launch``).

SkyPilot provides faster iteration for interactive development. For example, a common workflow for AI engineers is to iteratively develop and train models by tweaking code and hyperparameters while observing the training runs.

* **With Kubernetes, a single iteration is a multi-step process** involving building a Docker image, pushing it to a registry, updating the Kubernetes YAML and then deploying it.

* :strong:`With SkyPilot, a single command (`:literal:`sky launch`:strong:`) takes care of everything.` Behind the scenes, SkyPilot provisions pods, installs all required dependencies, executes the job, returns logs, and provides SSH and VSCode access to debug, as sketched below.

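As a sketch of a single iteration with SkyPilot (the fields follow SkyPilot's task YAML; ``train.py``, its flags, and the file and cluster names are illustrative):

.. code-block:: yaml

   # Hypothetical iterative training task.
   workdir: .               # local code, synced to the cluster on every launch

   resources:
     accelerators: L4:1

   setup: |
     pip install -r requirements.txt

   run: |
     python train.py --lr 1e-4

Each edit-and-relaunch cycle is then a single ``sky launch -c dev task.yaml``: SkyPilot reuses the existing cluster, re-syncs the working directory, and streams the logs back.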

Simpler YAMLs
Member:

Just thinking out loud, do we like this better or something like "Simpler interface and faster developer speed" (the latter is more directly a benefit)?

^^^^^^^^^^^^^

Consider serving `Gemma <https://ai.google.dev/gemma>`_ with `vLLM <https://github.com/vllm-project/vllm>`_ on Kubernetes:

* **With vanilla Kubernetes**, you need over `65 lines of Kubernetes YAML <https://cloud.google.com/kubernetes-engine/docs/tutorials/serve-gemma-gpu-vllm#deploy-vllm>`_ to launch a Gemma model served with vLLM.
* **With SkyPilot**, an easy-to-understand `19-line YAML <https://gist.github.com/romilbhardwaj/b5b6b893e7a3749a2815f055f3f5351c>`_ launches a pod serving Gemma with vLLM.

Here is a side-by-side comparison of the YAMLs for serving Gemma with vLLM on SkyPilot vs Kubernetes:

.. raw:: html
romilbhardwaj marked this conversation as resolved.

   <div class="row">
   <div class="col-md-6 mb-3">
   <h3> SkyPilot (19 lines) </h3>

.. code-block:: yaml
   :linenos:

   envs:
     MODEL_NAME: google/gemma-2b-it
     HF_TOKEN: myhftoken

   resources:
     image_id: docker:vllm/vllm-openai:latest
     accelerators: L4:1
     ports: 8000

   setup: |
     conda deactivate
     python3 -c "import huggingface_hub; huggingface_hub.login('${HF_TOKEN}')"

   run: |
     conda deactivate
     echo 'Starting vllm openai api server...'
     python -m vllm.entrypoints.openai.api_server \
       --model $MODEL_NAME --tokenizer hf-internal-testing/llama-tokenizer \
       --host 0.0.0.0

.. raw:: html

   </div>
   <div class="col-md-6 mb-3">
   <h3> Kubernetes (65 lines) </h3>

.. code-block:: yaml
   :linenos:

   apiVersion: apps/v1
   kind: Deployment
   metadata:
     name: vllm-gemma-deployment
   spec:
     replicas: 1
     selector:
       matchLabels:
         app: gemma-server
     template:
       metadata:
         labels:
           app: gemma-server
           ai.gke.io/model: gemma-1.1-2b-it
           ai.gke.io/inference-server: vllm
           examples.ai.gke.io/source: user-guide
       spec:
         containers:
         - name: inference-server
           image: us-docker.pkg.dev/vertex-ai/ vertex-vision-model-garden-dockers/pytorch-vllm-serve:20240527_0916_RC00
           resources:
             requests:
               cpu: "2"
               memory: "10Gi"
               ephemeral-storage: "10Gi"
               nvidia.com/gpu: 1
             limits:
               cpu: "2"
               memory: "10Gi"
               ephemeral-storage: "10Gi"
               nvidia.com/gpu: 1
           command: ["python3", "-m", "vllm.entrypoints.api_server"]
           args:
           - --model=$(MODEL_ID)
           - --tensor-parallel-size=1
           env:
           - name: MODEL_ID
             value: google/gemma-1.1-2b-it
           - name: HUGGING_FACE_HUB_TOKEN
             valueFrom:
               secretKeyRef:
                 name: hf-secret
                 key: hf_api_token
           volumeMounts:
           - mountPath: /dev/shm
             name: dshm
         volumes:
         - name: dshm
           emptyDir:
             medium: Memory
         nodeSelector:
           cloud.google.com/gke-accelerator: nvidia-l4
   ---
   apiVersion: v1
   kind: Service
   metadata:
     name: llm-service
   spec:
     selector:
       app: gemma-server
     type: ClusterIP
     ports:
     - protocol: TCP
       port: 8000
       targetPort: 8000

.. raw:: html

   </div>
   </div>

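Beyond line count, the number of deployment steps also differs: the SkyPilot YAML above is launched with a single ``sky launch`` command (optionally overriding the token with ``--env HF_TOKEN=...``), while the Kubernetes YAML additionally assumes a pre-created ``hf-secret`` Secret and needs a separate step, such as port-forwarding, to reach the ``ClusterIP`` service from outside the cluster.
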
Collaborator:

Just thinking out loud, not sure if we can think of other value-adds on this page, or if we can go a bit deeper on the points we mentioned below, like what quantitative benefits GPU availability and costs can provide. The current points feel a bit weak.

Collaborator Author:

Hmm, we had some numbers in the earlier commits but removed them in this comment.

Collaborator:

I see! Let's skip the numbers for now then.


Scale beyond single region/cluster
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. figure:: https://blog.skypilot.co/ai-on-kubernetes/images/failover.png
   :align: center
   :width: 95%
   :alt: Scaling beyond a single region Kubernetes cluster with SkyPilot

   If the Kubernetes cluster is full, SkyPilot can get GPUs from other regions and clouds to run your tasks at the lowest cost.

A Kubernetes cluster is typically constrained to a single region in a single cloud.
This is because etcd, the backing store for Kubernetes state, can time out and fail when it faces higher latencies across regions [1]_ [2]_ [3]_.

Being restricted to a single region/cloud with Vanilla Kubernetes has two drawbacks:

1. `GPU availability is reduced <https://blog.skypilot.co/introducing-sky-serve/#why-skyserve>`_ because you cannot utilize
available capacity elsewhere.

2. `Costs increase <https://blog.skypilot.co/introducing-sky-serve/#why-skyserve>`_ as you are unable to
take advantage of cheaper resources in other regions.

SkyPilot is designed to scale across clouds and regions: it allows you to run your tasks on your Kubernetes cluster, and burst to more regions and clouds if needed. In doing so, SkyPilot ensures that your tasks are always running in the most cost-effective region, while maintaining high availability.
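
As a minimal sketch of this bursting behavior (assuming both your Kubernetes cluster and one or more cloud accounts are enabled for SkyPilot, and with ``train.py`` as a placeholder workload), a task only needs to state its resource requirements and leave the placement to SkyPilot:

.. code-block:: yaml

   # Illustrative task: no cloud or region is pinned, so SkyPilot may place it
   # on the Kubernetes cluster or, if the cluster is full, on any enabled
   # cloud region that has the requested GPUs.
   resources:
     accelerators: A100:8

   run: |
     python train.py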

.. [1] `etcd FAQ <https://etcd.io/docs/v3.3/faq/#does-etcd-work-in-cross-region-or-cross-data-center-deployments>`_
.. [2] `"Multi-region etcd cluster performance issue" on GitHub <https://github.com/etcd-io/etcd/issues/12232>`_
.. [3] `DevOps StackExchange answer <https://devops.stackexchange.com/a/13194>`_