Skip to content

Commit

Permalink
Merge branch 'main' into structured-outputs-docs
Browse files Browse the repository at this point in the history
  • Loading branch information
ismael-dm committed Nov 4, 2024
2 parents ffc8a71 + ccb5376 commit 773e672
Show file tree
Hide file tree
Showing 134 changed files with 3,280 additions and 1,778 deletions.
9 changes: 9 additions & 0 deletions .github/dependabot.yml
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,15 @@ updates:
reviewers: ["khluu", "simon-mo"]
allow:
- dependency-type: "all"
ignore:
- dependency-name: "torch"
- dependency-name: "torchvision"
- dependency-name: "xformers"
- dependency-name: "lm-format-enforcer"
- dependency-name: "gguf"
- dependency-name: "compressed-tensors"
- dependency-name: "ray[adag]"
- dependency-name: "lm-eval"
groups:
patch-update:
applies-to: version-updates
Expand Down
2 changes: 1 addition & 1 deletion Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -206,7 +206,7 @@ FROM vllm-base AS vllm-openai

# install additional dependencies for openai api server
RUN --mount=type=cache,target=/root/.cache/pip \
pip install accelerate hf_transfer 'modelscope!=1.15.0' bitsandbytes>=0.44.0 timm==0.9.10
pip install accelerate hf_transfer 'modelscope!=1.15.0' 'bitsandbytes>=0.44.0' timm==0.9.10

ENV VLLM_USAGE_SOURCE production-docker-image

Expand Down
2 changes: 1 addition & 1 deletion Dockerfile.neuron
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi

RUN python3 -m pip install -U \
cmake>=3.26 ninja packaging setuptools-scm>=8 wheel jinja2 \
'cmake>=3.26' ninja packaging 'setuptools-scm>=8' wheel jinja2 \
-r requirements-neuron.txt

ENV VLLM_TARGET_DEVICE neuron
Expand Down
2 changes: 1 addition & 1 deletion Dockerfile.ppc64le
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ RUN --mount=type=bind,source=.git,target=.git \
# These packages will be in rocketce eventually
RUN --mount=type=cache,target=/root/.cache/pip \
pip install -v --prefer-binary --extra-index-url https://repo.fury.io/mgiessing \
cmake>=3.26 ninja packaging setuptools-scm>=8 wheel jinja2 \
'cmake>=3.26' ninja packaging 'setuptools-scm>=8' wheel jinja2 \
torch==2.3.1 \
-r requirements-cpu.txt \
xformers uvloop==0.20.0
Expand Down
2 changes: 1 addition & 1 deletion Dockerfile.rocm
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,7 @@ RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip uninstall -y torch torchvision \
&& python3 -m pip install --pre \
torch==2.6.0.dev20240918 \
setuptools-scm>=8 \
'setuptools-scm>=8' \
torchvision==0.20.0.dev20240918 \
--extra-index-url https://download.pytorch.org/whl/nightly/rocm6.2;; \
*) ;; esac
Expand Down
2 changes: 1 addition & 1 deletion Dockerfile.tpu
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@ ENV VLLM_TARGET_DEVICE="tpu"
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,source=.git,target=.git \
python3 -m pip install \
cmake>=3.26 ninja packaging setuptools-scm>=8 wheel jinja2 \
'cmake>=3.26' ninja packaging 'setuptools-scm>=8' wheel jinja2 \
-r requirements-tpu.txt
RUN python3 setup.py develop

Expand Down
4 changes: 2 additions & 2 deletions benchmarks/benchmark_serving.py
Original file line number Diff line number Diff line change
Expand Up @@ -406,9 +406,9 @@ def calculate_metrics(
median_itl_ms=np.median(itls or 0) * 1000,
percentiles_itl_ms=[(p, np.percentile(itls or 0, p) * 1000)
for p in selected_percentiles],
mean_e2el_ms=np.median(e2els or 0) * 1000,
mean_e2el_ms=np.mean(e2els or 0) * 1000,
std_e2el_ms=np.std(e2els or 0) * 1000,
median_e2el_ms=np.mean(e2els or 0) * 1000,
median_e2el_ms=np.median(e2els or 0) * 1000,
percentiles_e2el_ms=[(p, np.percentile(e2els or 0, p) * 1000)
for p in selected_percentiles],
)
Expand Down
158 changes: 144 additions & 14 deletions docs/source/getting_started/tpu-installation.rst
Original file line number Diff line number Diff line change
@@ -1,35 +1,167 @@
.. _installation_tpu:

#####################
Installation with TPU
=====================
#####################

vLLM supports Google Cloud TPUs using PyTorch XLA.
Tensor Processing Units (TPUs) are Google's custom-developed application-specific
integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs
are available in different versions each with different hardware specifications.
For more information about TPUs, see `TPU System Architecture <https://cloud.google.com/tpu/docs/system-architecture-tpu-vm>`_.
For more information on the TPU versions supported with vLLM, see:

* `TPU v6e <https://cloud.google.com/tpu/docs/v6e>`_
* `TPU v5e <https://cloud.google.com/tpu/docs/v5e>`_
* `TPU v5p <https://cloud.google.com/tpu/docs/v5p>`_
* `TPU v4 <https://cloud.google.com/tpu/docs/v4>`_

These TPU versions allow you to configure the physical arrangements of the TPU
chips. This can improve throughput and networking performance. For more
information see:

* `TPU v6e topologies <https://cloud.google.com/tpu/docs/v6e#configurations>`_
* `TPU v5e topologies <https://cloud.google.com/tpu/docs/v5e#tpu-v5e-config>`_
* `TPU v5p topologies <https://cloud.google.com/tpu/docs/v5p#tpu-v5p-config>`_
* `TPU v4 topologies <https://cloud.google.com/tpu/docs/v4#tpu-v4-config>`_

In order for you to use Cloud TPUs you need to have TPU quota granted to your
Google Cloud Platform project. TPU quotas specify how many TPUs you can use in a
GPC project and are specified in terms of TPU version, the number of TPU you
want to use, and quota type. For more information, see `TPU quota <https://cloud.google.com/tpu/docs/quota#tpu_quota>`_.

For TPU pricing information, see `Cloud TPU pricing <https://cloud.google.com/tpu/pricing>`_.

You may need additional persistent storage for your TPU VMs. For more
information, see `Storage options for Cloud TPU data <https://cloud.devsite.corp.google.com/tpu/docs/storage-options>`_.

Requirements
------------

* Google Cloud TPU VM (single & multi host)
* TPU versions: v5e, v5p, v4
* Python: 3.10
* Google Cloud TPU VM
* TPU versions: v6e, v5e, v5p, v4
* Python: 3.10 or newer

Provision Cloud TPUs
====================

You can provision Cloud TPUs using the `Cloud TPU API <https://cloud.google.com/tpu/docs/reference/rest>`_`
or the `queued resources <https://cloud.google.com/tpu/docs/queued-resources>`_`
API. This section shows how to create TPUs using the queued resource API.
For more information about using the Cloud TPU API, see `Create a Cloud TPU using the Create Node API <https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm#create-node-api>`_.
`Queued resources <https://cloud.devsite.corp.google.com/tpu/docs/queued-resources>`_
enable you to request Cloud TPU resources in a queued manner. When you request
queued resources, the request is added to a queue maintained by the Cloud TPU
service. When the requested resource becomes available, it's assigned to your
Google Cloud project for your immediate exclusive use.

Provision a Cloud TPU with the queued resource API
--------------------------------------------------
Create a TPU v5e with 4 TPU chips:

.. code-block:: console
gcloud alpha compute tpus queued-resources create QUEUED_RESOURCE_ID \
--node-id TPU_NAME \
--project PROJECT_ID \
--zone ZONE \
--accelerator-type ACCELERATOR_TYPE \
--runtime-version RUNTIME_VERSION \
--service-account SERVICE_ACCOUNT
.. list-table:: Parameter descriptions
:header-rows: 1

* - Parameter name
- Description
* - QUEUED_RESOURCE_ID
- The user-assigned ID of the queued resource request.
* - TPU_NAME
- The user-assigned name of the TPU which is created when the queued
resource request is allocated.
* - PROJECT_ID
- Your Google Cloud project
* - ZONE
- The `zone <https://cloud.google.com/tpu/docs/regions-zones>`_ where you
want to create your Cloud TPU.
* - ACCELERATOR_TYPE
- The TPU version you want to use. Specify the TPU version, followed by a
'-' and the number of TPU cores. For example `v5e-4` specifies a v5e TPU
with 4 cores. For more information, see `TPU versions <https://cloud.devsite.corp.google.com/tpu/docs/system-architecture-tpu-vm#versions>`_.
* - RUNTIME_VERSION
- The TPU VM runtime version to use. For more information see `TPU VM images <https://cloud.google.com/tpu/docs/runtimes>`_.
* - SERVICE_ACCOUNT
- The email address for your service account. You can find it in the IAM
Cloud Console under *Service Accounts*. For example:
`tpu-service-account@<your_project_ID>.iam.gserviceaccount.com`

Connect to your TPU using SSH:

.. code-block:: bash
gcloud compute tpus tpu-vm ssh TPU_NAME
Create and activate a Conda environment for vLLM:

.. code-block:: bash
Installation options:
conda create -n vllm python=3.10 -y
conda activate vllm
1. :ref:`Build a docker image with Dockerfile <build_docker_tpu>`.
2. :ref:`Build from source <build_from_source_tpu>`.
Clone the vLLM repository and go to the vLLM directory:

.. code-block:: bash
git clone https://github.com/vllm-project/vllm.git && cd vllm
Uninstall the existing `torch` and `torch_xla` packages:

.. code-block:: bash
pip uninstall torch torch-xla -y
Install `torch` and `torch_xla`

.. code-block:: bash
pip install --pre torch==2.6.0.dev20241028+cpu torchvision==0.20.0.dev20241028+cpu --index-url https://download.pytorch.org/whl/nightly/cpu
pip install 'torch_xla[tpu] @ https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-2.6.0.dev-cp310-cp310-linux_x86_64.whl' -f https://storage.googleapis.com/libtpu-releases/index.html
Install JAX and Pallas:

.. code-block:: bash
pip install torch_xla[pallas] -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
pip install jaxlib==0.4.32.dev20240829 jax==0.4.32.dev20240829 -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
Install other build dependencies:

.. code-block:: bash
pip install -r requirements-tpu.txt
VLLM_TARGET_DEVICE="tpu" python setup.py develop
sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev
Provision Cloud TPUs with GKE
-----------------------------

For more information about using TPUs with GKE, see
https://cloud.google.com/kubernetes-engine/docs/how-to/tpus
https://cloud.google.com/kubernetes-engine/docs/concepts/tpus
https://cloud.google.com/kubernetes-engine/docs/concepts/plan-tpus

.. _build_docker_tpu:

Build a docker image with :code:`Dockerfile.tpu`
------------------------------------------------

`Dockerfile.tpu <https://github.com/vllm-project/vllm/blob/main/Dockerfile.tpu>`_ is provided to build a docker image with TPU support.
You can use `Dockerfile.tpu <https://github.com/vllm-project/vllm/blob/main/Dockerfile.tpu>`_
to build a Docker image with TPU support.

.. code-block:: console
$ docker build -f Dockerfile.tpu -t vllm-tpu .
You can run the docker image with the following command:
Run the Docker image with the following command:

.. code-block:: console
Expand Down Expand Up @@ -75,14 +207,12 @@ Next, build vLLM from source. This will only take a few seconds:
$ VLLM_TARGET_DEVICE="tpu" python setup.py develop
.. note::

Since TPU relies on XLA which requires static shapes, vLLM bucketizes the possible input shapes and compiles an XLA graph for each different shape.
The compilation time may take 20~30 minutes in the first run.
However, the compilation time reduces to ~5 minutes afterwards because the XLA graphs are cached in the disk (in :code:`VLLM_XLA_CACHE_PATH` or :code:`~/.cache/vllm/xla_cache` by default).


.. tip::

If you encounter the following error:
Expand All @@ -93,7 +223,7 @@ Next, build vLLM from source. This will only take a few seconds:
ImportError: libopenblas.so.0: cannot open shared object file: No such file or directory
Please install OpenBLAS with the following command:
Install OpenBLAS with the following command:

.. code-block:: console
Expand Down
14 changes: 10 additions & 4 deletions docs/source/models/supported_models.rst
Original file line number Diff line number Diff line change
Expand Up @@ -160,13 +160,13 @@ Text Generation
-
- ✅︎
* - :code:`GraniteForCausalLM`
- PowerLM
- :code:`ibm/PowerLM-3b` etc.
- Granite 3.0, PowerLM
- :code:`ibm-granite/granite-3.0-2b-base`, :code:`ibm-granite/granite-3.0-8b-instruct`, :code:`ibm/PowerLM-3b`, etc.
- ✅︎
- ✅︎
* - :code:`GraniteMoeForCausalLM`
- PowerMoE
- :code:`ibm/PowerMoE-3b` etc.
- Granite 3.0 MoE, PowerMoE
- :code:`ibm-granite/granite-3.0-1b-a400m-base`, :code:`ibm-granite/granite-3.0-3b-a800m-instruct`, :code:`ibm/PowerMoE-3b`, etc.
- ✅︎
- ✅︎
* - :code:`InternLMForCausalLM`
Expand Down Expand Up @@ -440,6 +440,12 @@ Text Generation
- :code:`THUDM/glm-4v-9b` etc.
-
- ✅︎
* - :code:`H2OVLChatModel`
- H2OVL
- T + I\ :sup:`E+`
- :code:`h2oai/h2ovl-mississippi-800m`, :code:`h2oai/h2ovl-mississippi-2b`, etc.
-
- ✅︎
* - :code:`InternVLChatModel`
- InternVL2
- T + I\ :sup:`E+`
Expand Down
6 changes: 1 addition & 5 deletions examples/offline_inference_audio_language.py
Original file line number Diff line number Diff line change
Expand Up @@ -34,11 +34,7 @@ def run_ultravox(question: str, audio_count: int):
tokenize=False,
add_generation_prompt=True)

llm = LLM(model=model_name,
enforce_eager=True,
enable_chunked_prefill=False,
max_model_len=8192,
limit_mm_per_prompt={"audio": audio_count})
llm = LLM(model=model_name, limit_mm_per_prompt={"audio": audio_count})
stop_token_ids = None
return llm, prompt, stop_token_ids

Expand Down
28 changes: 27 additions & 1 deletion examples/offline_inference_vision_language.py
Original file line number Diff line number Diff line change
Expand Up @@ -176,6 +176,31 @@ def run_minicpmv(question: str, modality: str):
return llm, prompt, stop_token_ids


# H2OVL-Mississippi
def run_h2ovl(question: str, modality: str):
assert modality == "image"

model_name = "h2oai/h2ovl-mississippi-2b"

llm = LLM(
model=model_name,
trust_remote_code=True,
max_model_len=8192,
)

tokenizer = AutoTokenizer.from_pretrained(model_name,
trust_remote_code=True)
messages = [{'role': 'user', 'content': f"<image>\n{question}"}]
prompt = tokenizer.apply_chat_template(messages,
tokenize=False,
add_generation_prompt=True)

# Stop tokens for H2OVL-Mississippi
# https://huggingface.co/h2oai/h2ovl-mississippi-2b
stop_token_ids = [tokenizer.eos_token_id]
return llm, prompt, stop_token_ids


# InternVL
def run_internvl(question: str, modality: str):
assert modality == "image"
Expand Down Expand Up @@ -363,6 +388,7 @@ def run_glm4v(question: str, modality: str):
"chameleon": run_chameleon,
"minicpmv": run_minicpmv,
"blip-2": run_blip2,
"h2ovl_chat": run_h2ovl,
"internvl_chat": run_internvl,
"NVLM_D": run_nvlm_d,
"qwen_vl": run_qwen_vl,
Expand Down Expand Up @@ -475,4 +501,4 @@ def main(args):
default=16,
help='Number of frames to extract from the video.')
args = parser.parse_args()
main(args)
main(args)
Loading

0 comments on commit 773e672

Please sign in to comment.