Skip to content

Commit

Permalink
Merge branch 'h2ovl-mississippi' of https://github.com/cooleel/vllm i…
Browse files Browse the repository at this point in the history
…nto h2ovl-mississippi
  • Loading branch information
ywang96 committed Nov 3, 2024
2 parents cb3bc8b + 31ece38 commit 356180e
Show file tree
Hide file tree
Showing 165 changed files with 3,954 additions and 2,230 deletions.
9 changes: 9 additions & 0 deletions .github/dependabot.yml
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,15 @@ updates:
reviewers: ["khluu", "simon-mo"]
allow:
- dependency-type: "all"
ignore:
- dependency-name: "torch"
- dependency-name: "torchvision"
- dependency-name: "xformers"
- dependency-name: "lm-format-enforcer"
- dependency-name: "gguf"
- dependency-name: "compressed-tensors"
- dependency-name: "ray[adag]"
- dependency-name: "lm-eval"
groups:
patch-update:
applies-to: version-updates
Expand Down
2 changes: 1 addition & 1 deletion Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -206,7 +206,7 @@ FROM vllm-base AS vllm-openai

# install additional dependencies for openai api server
RUN --mount=type=cache,target=/root/.cache/pip \
pip install accelerate hf_transfer 'modelscope!=1.15.0' bitsandbytes>=0.44.0 timm==0.9.10
pip install accelerate hf_transfer 'modelscope!=1.15.0' 'bitsandbytes>=0.44.0' timm==0.9.10

ENV VLLM_USAGE_SOURCE production-docker-image

Expand Down
2 changes: 1 addition & 1 deletion Dockerfile.neuron
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi

RUN python3 -m pip install -U \
cmake>=3.26 ninja packaging setuptools-scm>=8 wheel jinja2 \
'cmake>=3.26' ninja packaging 'setuptools-scm>=8' wheel jinja2 \
-r requirements-neuron.txt

ENV VLLM_TARGET_DEVICE neuron
Expand Down
2 changes: 1 addition & 1 deletion Dockerfile.ppc64le
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ RUN --mount=type=bind,source=.git,target=.git \
# These packages will be in rocketce eventually
RUN --mount=type=cache,target=/root/.cache/pip \
pip install -v --prefer-binary --extra-index-url https://repo.fury.io/mgiessing \
cmake>=3.26 ninja packaging setuptools-scm>=8 wheel jinja2 \
'cmake>=3.26' ninja packaging 'setuptools-scm>=8' wheel jinja2 \
torch==2.3.1 \
-r requirements-cpu.txt \
xformers uvloop==0.20.0
Expand Down
2 changes: 1 addition & 1 deletion Dockerfile.rocm
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,7 @@ RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip uninstall -y torch torchvision \
&& python3 -m pip install --pre \
torch==2.6.0.dev20240918 \
setuptools-scm>=8 \
'setuptools-scm>=8' \
torchvision==0.20.0.dev20240918 \
--extra-index-url https://download.pytorch.org/whl/nightly/rocm6.2;; \
*) ;; esac
Expand Down
2 changes: 1 addition & 1 deletion Dockerfile.tpu
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@ ENV VLLM_TARGET_DEVICE="tpu"
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,source=.git,target=.git \
python3 -m pip install \
cmake>=3.26 ninja packaging setuptools-scm>=8 wheel jinja2 \
'cmake>=3.26' ninja packaging 'setuptools-scm>=8' wheel jinja2 \
-r requirements-tpu.txt
RUN python3 setup.py develop

Expand Down
2 changes: 2 additions & 0 deletions docs/requirements-docs.txt
Original file line number Diff line number Diff line change
Expand Up @@ -13,5 +13,7 @@ torch
py-cpuinfo
transformers
mistral_common >= 1.3.4
aiohttp
starlette
openai # Required by docs/source/serving/openai_compatible_server.md's vllm.entrypoints.openai.cli_args
partial-json-parser # Required by docs/source/serving/openai_compatible_server.md's vllm.entrypoints.openai.cli_args
2 changes: 1 addition & 1 deletion docs/source/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -96,7 +96,6 @@ def setup(app):

# Mock out external dependencies here, otherwise the autodoc pages may be blank.
autodoc_mock_imports = [
"aiohttp",
"compressed_tensors",
"cpuinfo",
"cv2",
Expand Down Expand Up @@ -143,6 +142,7 @@ def add_line(self, line: str, source: str, *lineno: int) -> None:
"python": ("https://docs.python.org/3", None),
"typing_extensions":
("https://typing-extensions.readthedocs.io/en/latest", None),
"aiohttp": ("https://docs.aiohttp.org/en/stable", None),
"pillow": ("https://pillow.readthedocs.io/en/stable", None),
"numpy": ("https://numpy.org/doc/stable", None),
"torch": ("https://pytorch.org/docs/stable", None),
Expand Down
5 changes: 5 additions & 0 deletions docs/source/dev/pooling_params.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
Pooling Parameters
==================

.. autoclass:: vllm.PoolingParams
:members:
8 changes: 4 additions & 4 deletions docs/source/getting_started/quickstart.rst
Original file line number Diff line number Diff line change
Expand Up @@ -138,10 +138,10 @@ Since this server is compatible with OpenAI API, you can use it as a drop-in rep
A more detailed client example can be found `here <https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py>`__.

OpenAI Chat API with vLLM
~~~~~~~~~~~~~~~~~~~~~~~~~~
OpenAI Chat Completions API with vLLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

vLLM is designed to also support the OpenAI Chat API. The chat interface is a more dynamic, interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.
vLLM is designed to also support the OpenAI Chat Completions API. The chat interface is a more dynamic, interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.

You can use the `create chat completion <https://platform.openai.com/docs/api-reference/chat/completions/create>`_ endpoint to interact with the model:

Expand All @@ -157,7 +157,7 @@ You can use the `create chat completion <https://platform.openai.com/docs/api-re
$ ]
$ }'
Alternatively, you can use the `openai` python package:
Alternatively, you can use the ``openai`` python package:

.. code-block:: python
Expand Down
158 changes: 144 additions & 14 deletions docs/source/getting_started/tpu-installation.rst
Original file line number Diff line number Diff line change
@@ -1,35 +1,167 @@
.. _installation_tpu:

#####################
Installation with TPU
=====================
#####################

vLLM supports Google Cloud TPUs using PyTorch XLA.
Tensor Processing Units (TPUs) are Google's custom-developed application-specific
integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs
are available in different versions each with different hardware specifications.
For more information about TPUs, see `TPU System Architecture <https://cloud.google.com/tpu/docs/system-architecture-tpu-vm>`_.
For more information on the TPU versions supported with vLLM, see:

* `TPU v6e <https://cloud.google.com/tpu/docs/v6e>`_
* `TPU v5e <https://cloud.google.com/tpu/docs/v5e>`_
* `TPU v5p <https://cloud.google.com/tpu/docs/v5p>`_
* `TPU v4 <https://cloud.google.com/tpu/docs/v4>`_

These TPU versions allow you to configure the physical arrangements of the TPU
chips. This can improve throughput and networking performance. For more
information see:

* `TPU v6e topologies <https://cloud.google.com/tpu/docs/v6e#configurations>`_
* `TPU v5e topologies <https://cloud.google.com/tpu/docs/v5e#tpu-v5e-config>`_
* `TPU v5p topologies <https://cloud.google.com/tpu/docs/v5p#tpu-v5p-config>`_
* `TPU v4 topologies <https://cloud.google.com/tpu/docs/v4#tpu-v4-config>`_

In order for you to use Cloud TPUs you need to have TPU quota granted to your
Google Cloud Platform project. TPU quotas specify how many TPUs you can use in a
GPC project and are specified in terms of TPU version, the number of TPU you
want to use, and quota type. For more information, see `TPU quota <https://cloud.google.com/tpu/docs/quota#tpu_quota>`_.

For TPU pricing information, see `Cloud TPU pricing <https://cloud.google.com/tpu/pricing>`_.

You may need additional persistent storage for your TPU VMs. For more
information, see `Storage options for Cloud TPU data <https://cloud.devsite.corp.google.com/tpu/docs/storage-options>`_.

Requirements
------------

* Google Cloud TPU VM (single & multi host)
* TPU versions: v5e, v5p, v4
* Python: 3.10
* Google Cloud TPU VM
* TPU versions: v6e, v5e, v5p, v4
* Python: 3.10 or newer

Provision Cloud TPUs
====================

You can provision Cloud TPUs using the `Cloud TPU API <https://cloud.google.com/tpu/docs/reference/rest>`_`
or the `queued resources <https://cloud.google.com/tpu/docs/queued-resources>`_`
API. This section shows how to create TPUs using the queued resource API.
For more information about using the Cloud TPU API, see `Create a Cloud TPU using the Create Node API <https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm#create-node-api>`_.
`Queued resources <https://cloud.devsite.corp.google.com/tpu/docs/queued-resources>`_
enable you to request Cloud TPU resources in a queued manner. When you request
queued resources, the request is added to a queue maintained by the Cloud TPU
service. When the requested resource becomes available, it's assigned to your
Google Cloud project for your immediate exclusive use.

Provision a Cloud TPU with the queued resource API
--------------------------------------------------
Create a TPU v5e with 4 TPU chips:

.. code-block:: console
gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
--node-id TPU_NAME \
--project PROJECT_ID \
--zone ZONE \
--accelerator-type ACCELERATOR_TYPE \
--runtime-version RUNTIME_VERSION \
--service-account SERVICE_ACCOUNT
.. list-table:: Parameter descriptions
:header-rows: 1

* - Parameter name
- Description
* - QUEUED_RESOURCE_ID
- The user-assigned ID of the queued resource request.
* - TPU_NAME
- The user-assigned name of the TPU which is created when the queued
resource request is allocated.
* - PROJECT_ID
- Your Google Cloud project
* - ZONE
- The `zone <https://cloud.google.com/tpu/docs/regions-zones>`_ where you
want to create your Cloud TPU.
* - ACCELERATOR_TYPE
- The TPU version you want to use. Specify the TPU version, followed by a
'-' and the number of TPU cores. For example `v5e-4` specifies a v5e TPU
with 4 cores. For more information, see `TPU versions <https://cloud.devsite.corp.google.com/tpu/docs/system-architecture-tpu-vm#versions>`_.
* - RUNTIME_VERSION
- The TPU VM runtime version to use. For more information see `TPU VM images <https://cloud.google.com/tpu/docs/runtimes>`_.
* - SERVICE_ACCOUNT
- The email address for your service account. You can find it in the IAM
Cloud Console under *Service Accounts*. For example:
`tpu-service-account@<your_project_ID>.iam.gserviceaccount.com`

Connect to your TPU using SSH:

.. code-block:: bash
gcloud compute tpus tpu-vm ssh TPU_NAME
Create and activate a Conda environment for vLLM:

.. code-block:: bash
Installation options:
conda create -n vllm python=3.10 -y
conda activate vllm
1. :ref:`Build a docker image with Dockerfile <build_docker_tpu>`.
2. :ref:`Build from source <build_from_source_tpu>`.
Clone the vLLM repository and go to the vLLM directory:

.. code-block:: bash
git clone https://github.com/vllm-project/vllm.git && cd vllm
Uninstall the existing `torch` and `torch_xla` packages:

.. code-block:: bash
pip uninstall torch torch-xla -y
Install `torch` and `torch_xla`

.. code-block:: bash
pip install --pre torch==2.6.0.dev20241028+cpu torchvision==0.20.0.dev20241028+cpu --index-url https://download.pytorch.org/whl/nightly/cpu
pip install 'torch_xla[tpu] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.6.0.dev-cp310-cp310-linux_x86_64.whl' -f https://storage.googleapis.com/libtpu-releases/index.html
Install JAX and Pallas:

.. code-block:: bash
pip install torch_xla[pallas] -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
pip install jaxlib==0.4.32.dev20240829 jax==0.4.32.dev20240829 -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
Install other build dependencies:

.. code-block:: bash
pip install -r requirements-tpu.txt
VLLM_TARGET_DEVICE="tpu" python setup.py develop
sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
Provision Cloud TPUs with GKE
-----------------------------

For more information about using TPUs with GKE, see
https://cloud.google.com/kubernetes-engine/docs/how-to/tpus
https://cloud.google.com/kubernetes-engine/docs/concepts/tpus
https://cloud.google.com/kubernetes-engine/docs/concepts/plan-tpus

.. _build_docker_tpu:

Build a docker image with :code:`Dockerfile.tpu`
------------------------------------------------

`Dockerfile.tpu <https://github.com/vllm-project/vllm/blob/main/Dockerfile.tpu>`_ is provided to build a docker image with TPU support.
You can use `Dockerfile.tpu <https://github.com/vllm-project/vllm/blob/main/Dockerfile.tpu>`_
to build a Docker image with TPU support.

.. code-block:: console
$ docker build -f Dockerfile.tpu -t vllm-tpu .
You can run the docker image with the following command:
Run the Docker image with the following command:

.. code-block:: console
Expand Down Expand Up @@ -75,14 +207,12 @@ Next, build vLLM from source. This will only take a few seconds:
$ VLLM_TARGET_DEVICE="tpu" python setup.py develop
.. note::

Since TPU relies on XLA which requires static shapes, vLLM bucketizes the possible input shapes and compiles an XLA graph for each different shape.
The compilation time may take 20~30 minutes in the first run.
However, the compilation time reduces to ~5 minutes afterwards because the XLA graphs are cached in the disk (in :code:`VLLM_XLA_CACHE_PATH` or :code:`~/.cache/vllm/xla_cache` by default).


.. tip::

If you encounter the following error:
Expand All @@ -93,7 +223,7 @@ Next, build vLLM from source. This will only take a few seconds:
ImportError: libopenblas.so.0: cannot open shared object file: No such file or directory
Please install OpenBLAS with the following command:
Install OpenBLAS with the following command:

.. code-block:: console
Expand Down
1 change: 1 addition & 0 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -134,6 +134,7 @@ Documentation
:caption: Developer Documentation

dev/sampling_params
dev/pooling_params
dev/offline_inference/offline_index
dev/engine/engine_index
dev/kernel/paged_attention
Expand Down
14 changes: 7 additions & 7 deletions docs/source/models/supported_models.rst
Original file line number Diff line number Diff line change
Expand Up @@ -160,13 +160,13 @@ Text Generation
-
- ✅︎
* - :code:`GraniteForCausalLM`
- PowerLM
- :code:`ibm/PowerLM-3b` etc.
- Granite 3.0, PowerLM
- :code:`ibm-granite/granite-3.0-2b-base`, :code:`ibm-granite/granite-3.0-8b-instruct`, :code:`ibm/PowerLM-3b`, etc.
- ✅︎
- ✅︎
* - :code:`GraniteMoeForCausalLM`
- PowerMoE
- :code:`ibm/PowerMoE-3b` etc.
- Granite 3.0 MoE, PowerMoE
- :code:`ibm-granite/granite-3.0-1b-a400m-base`, :code:`ibm-granite/granite-3.0-3b-a800m-instruct`, :code:`ibm/PowerMoE-3b`, etc.
- ✅︎
- ✅︎
* - :code:`InternLMForCausalLM`
Expand Down Expand Up @@ -281,7 +281,7 @@ Text Generation
- ✅︎
* - :code:`Qwen2ForCausalLM`
- Qwen2
- :code:`Qwen/Qwen2-beta-7B`, :code:`Qwen/Qwen2-beta-7B-Chat`, etc.
- :code:`Qwen/Qwen2-7B-Instruct`, :code:`Qwen/Qwen2-7B`, etc.
- ✅︎
- ✅︎
* - :code:`Qwen2MoeForCausalLM`
Expand Down Expand Up @@ -472,7 +472,7 @@ Text Generation
- ✅︎
* - :code:`LlavaOnevisionForConditionalGeneration`
- LLaVA-Onevision
- T + I\ :sup:`+` + V
- T + I\ :sup:`+` + V\ :sup:`+`
- :code:`llava-hf/llava-onevision-qwen2-7b-ov-hf`, :code:`llava-hf/llava-onevision-qwen2-0.5b-ov-hf`, etc.
-
- ✅︎
Expand All @@ -484,7 +484,7 @@ Text Generation
- ✅︎
* - :code:`MllamaForConditionalGeneration`
- Llama 3.2
- T + I
- T + I\ :sup:`+`
- :code:`meta-llama/Llama-3.2-90B-Vision-Instruct`, :code:`meta-llama/Llama-3.2-11B-Vision`, etc.
-
-
Expand Down
Loading

0 comments on commit 356180e

Please sign in to comment.