Commit cf838b4

update readme for vLLM 0.10.2 release on Intel GPU (#869)
Signed-off-by: Yan Ma <[email protected]>
Co-authored-by: Srikanth Ramakrishna <[email protected]>
Co-authored-by: Jitendra Patil <[email protected]>
1 parent fff3ed4 commit cf838b4

File changed: vllm/0.10.2-xpu.md (+248 lines, -0 lines)

# Optimize LLM Serving with vLLM on Intel® GPUs

vLLM is a fast and easy-to-use library for LLM inference and serving. It has evolved into a community-driven project with contributions from both academia and industry. Intel, as one of the community contributors, is actively working to deliver good performance with vLLM on Intel® platforms, including Intel® Xeon® Scalable Processors, Intel® discrete GPUs, and Intel® Gaudi® AI accelerators. This README currently focuses on Intel® discrete GPUs and provides the information you need to get these workloads running well on your Intel® graphics cards.

The vLLM used in this docker image is based on [v0.10.2](https://github.com/vllm-project/vllm/tree/v0.10.2) and validated on [Intel® Arc™ Pro B-Series Graphics](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/workstations/b-series/overview.html) cards. It uses the following best-known configuration (BKC):

| Ingredients | Version |
|-------------|---------|
| Host OS | Ubuntu 25.04 |
| Python | 3.12 |
| KMD Driver | 6.14.0 |
| oneAPI | 2025.1.3-0 |
| PyTorch | 2.8 |
| IPEX | 2.8.10 |
| oneCCL | 2021.15.6.2 |

## 1. What's New in This Release?

* GPT-OSS 20B and 120B are supported with MXFP4 weight-only quantization and optimized performance.
* Attention kernel optimizations in the decoding phase deliver >10% end-to-end throughput gains on 10+ models across all input/output sequence lengths.
* MoE models are optimized with a persistent MoE GEMM kernel and a fused activation kernel to reduce kernel bubbles. Qwen3-30B-A3B achieves a 2.6x end-to-end improvement and DeepSeek-V2-Lite a 1.5x improvement.
* More multi-modal models with image/video inputs are supported, such as the InternVL series and MiniCPM-V-4.
* vLLM 0.10.2 brings new features: prefill/decode disaggregation, data parallelism, tool calling, reasoning outputs, and structured outputs.
* FP16/BF16 GEMM optimizations for batch sizes 1-128, with clear improvements at small batch sizes.

## 2. What's Supported?

Intel GPUs benefit from enhancements brought by the [vLLM V1 engine](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html), including:

* Optimized Execution Loop & API Server
* Simple & Flexible Scheduler
* Zero-Overhead Prefix Caching
* Clean Architecture for Tensor-Parallel Inference
* Efficient Input Preparation

In addition, following the vLLM V1 design, corresponding optimized kernels and features are implemented for Intel GPUs.

* Chunked prefill:

Chunked prefill is an optimization feature in vLLM that allows large prefill requests to be divided into small chunks and batched together with decode requests. This approach prioritizes decode requests, improving inter-token latency (ITL) and GPU utilization by combining compute-bound (prefill) and memory-bound (decode) requests in the same batch. The vLLM V1 engine is built on this feature, and in this release it is also supported on Intel GPUs by leveraging the corresponding kernel from Intel® Extension for PyTorch\* for model execution (see the sketch below).

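As a minimal sketch, the per-step batched-token budget bounds the prefill chunk size; the flags are the ones used in the serving command in section 4.3.1, while the model here is just a placeholder from the supported-models table:

```bash
# Hedged sketch: cap the per-step token budget, which bounds the prefill chunk size.
VLLM_WORKER_MULTIPROC_METHOD=spawn python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct --dtype=float16 --device=xpu \
  --max_num_batched_tokens=8192 --max_model_len 4096 --port 8000
```
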
* FP8 W8A16 MatMul:

vLLM supports FP8 (8-bit floating point) weights using hardware acceleration on GPUs. We support weight-only online dynamic quantization with FP8, which allows for a 2x reduction in model memory requirements and up to a 1.6x improvement in throughput with minimal impact on accuracy.

Dynamic quantization of an original-precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data. You can enable the feature by specifying `--quantization="fp8"` on the command line or setting `quantization="fp8"` in the LLM constructor.

In addition, the FP8 types typically supported in hardware have two distinct representations, each useful in different scenarios:

* **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and `nan`.
* **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values.

Both representations are supported through the environment variable `VLLM_XPU_FP8_DTYPE`, with `E5M2` as the default.

:::{warning}
Currently, by default the model is loaded at its original precision before being quantized down to 8 bits, so you need enough memory to load the whole model. To avoid this, set `VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1` to offload weights to the CPU before quantization; the quantized weights are then kept on the device.
:::

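Putting these pieces together, a hedged launch sketch (the flags and environment variables are the ones described above and in section 4.3.1; the model and port are placeholders) might look like:

```bash
# Enable online FP8 (W8A16) quantization, pick the E4M3 representation,
# and offload weights to the CPU before quantization to reduce peak device memory.
VLLM_XPU_FP8_DTYPE=e4m3 \
VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
python3 -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --dtype=float16 --device=xpu \
  --quantization fp8 --port 8000
```
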
* Multi-Modality Support

This release introduces support for multi-modal processing of image and video inputs by leveraging models like the Qwen2.5-VL series, the InternVL family, and MiniCPM-V-4. For example, the Qwen2.5-VL-32B-Instruct model can be launched on 4 Intel® Arc™ Pro B60 Graphics cards for multi-modal serving, as sketched below.

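A hedged launch sketch for that example, modeled on the serving command in section 4.3.1 (the `--max_model_len` value and other tuning knobs are assumptions you may need to adjust for your cards):

```bash
# Serve Qwen2.5-VL-32B-Instruct across 4 GPUs with tensor parallelism.
VLLM_WORKER_MULTIPROC_METHOD=spawn python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-VL-32B-Instruct --dtype=float16 --device=xpu --enforce-eager \
  --trust-remote-code --max_model_len 8192 -tp=4 --port 8000
```
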
* Pooling Models Support

vLLM supports pooling models such as embedding, classification, and reward models. All of these are now supported on Intel® GPUs. For detailed usage, refer to the [pooling models guide](https://docs.vllm.ai/en/latest/models/pooling_models.html). A sample embedding request is sketched below.

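For instance, assuming an OpenAI-compatible server has already been launched with an embedding model from the table below (Qwen/Qwen3-Embedding-8B is used here as a placeholder), an embedding request might look like:

```bash
# Query the /v1/embeddings route exposed by the OpenAI-compatible server.
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-Embedding-8B",
        "input": "Intel Arc Pro B-Series graphics cards"
      }'
```
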
* Pipeline Parallelism

Pipeline parallelism distributes model layers across multiple GPUs, with each GPU processing a different part of the model in sequence. For Intel® GPUs, it is supported on a single node with `mp` as the backend, as sketched below.

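A hedged sketch of such a launch (the model and stage count are placeholders; the flags are vLLM's standard pipeline-parallel options rather than commands taken from this README):

```bash
# Split the model into 2 pipeline stages on a single node using the multiprocessing backend.
VLLM_WORKER_MULTIPROC_METHOD=spawn python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-14B --dtype=float16 --device=xpu \
  --pipeline-parallel-size 2 --distributed-executor-backend mp --port 8000
```
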
* Data Parallelism

vLLM supports [Data Parallel](https://docs.vllm.ai/en/latest/serving/data_parallel_deployment.html) deployment, where model weights are replicated across separate instances/GPUs to process independent batches of requests. This works with both dense and MoE models. Note that expert parallelism is still being enabled and will be supported soon. A data-parallel launch is sketched below.

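A hedged sketch using vLLM's standard data-parallel flag (not a command taken from this README; the model and replica count are placeholders):

```bash
# Run 2 data-parallel replicas of the model on a single node.
VLLM_WORKER_MULTIPROC_METHOD=spawn python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct --dtype=float16 --device=xpu \
  --data-parallel-size 2 --port 8000
```
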
* MoE models

Models with an MoE structure, such as GPT-OSS 20B/120B in MXFP4 format, DeepSeek-V2-Lite, Qwen/Qwen3-30B-A3B, and Qwen3-30B-A3B-GPTQ-Int4, are now supported.

Other features like [reasoning_outputs](https://docs.vllm.ai/en/latest/features/reasoning_outputs.html), [structured_outputs](https://docs.vllm.ai/en/latest/features/structured_outputs.html) and [tool calling](https://docs.vllm.ai/en/latest/features/tool_calling.html) are supported now. We also have some experimental features, including:

* **torch.compile**: Can be enabled for the fp16/bf16 path.
* **speculative decoding**: Supports the `n-gram`, `EAGLE` and `EAGLE3` methods (see the sketch after this list).
* **async scheduling**: Can be enabled with `--async-scheduling`. This may help reduce CPU overheads, leading to better latency and throughput. However, async scheduling is currently not supported together with some features such as structured outputs, speculative decoding, and pipeline parallelism.

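As a hedged example of the speculative decoding and async scheduling switches (the `--speculative-config` JSON keys follow vLLM's generic n-gram example and are not taken from this README, so treat them as an assumption to verify against the vLLM docs; the model is a placeholder):

```bash
# N-gram speculative decoding: propose up to 3 tokens per step via prompt lookup.
VLLM_WORKER_MULTIPROC_METHOD=spawn python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct --dtype=float16 --device=xpu --port 8000 \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 3, "prompt_lookup_max": 4}'

# Async scheduling (cannot be combined with speculative decoding or structured outputs).
VLLM_WORKER_MULTIPROC_METHOD=spawn python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct --dtype=float16 --device=xpu --port 8000 \
  --async-scheduling
```
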
### Supported Models

Please note that the following table contains only the models verified by Intel. Support on Intel® GPUs through vLLM extends to a wider array of models.

| Model Type | Model (company/model name) | FP16 | Dynamic Online FP8 | MXFP4 |
|-----------------|-------------------------------------------| --- | --- | --- |
| Text Generation | openai/gpt-oss-20b | | |✅︎|
| Text Generation | openai/gpt-oss-120b | | |✅︎|
| Text Generation | deepseek-ai/DeepSeek-R1-Distill-Llama-8B |✅︎|✅︎| |
| Text Generation | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B |✅︎|✅︎| |
| Text Generation | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B |✅︎|✅︎| |
| Text Generation | deepseek-ai/DeepSeek-R1-Distill-Llama-70B |✅︎|✅︎| |
| Text Generation | Qwen/Qwen2.5-72B-Instruct |✅︎|✅︎| |
| Text Generation | Qwen/Qwen3-14B |✅︎|✅︎| |
| Text Generation | Qwen/Qwen3-32B |✅︎|✅︎| |
| Text Generation | Qwen/Qwen3-30B-A3B |✅︎|✅︎| |
| Text Generation | Qwen/Qwen3-30B-A3B-GPTQ-Int4 |✅︎|✅︎| |
| Text Generation | Qwen/Qwen3-coder-30B-A3B-Instruct |✅︎|✅︎| |
| Text Generation | Qwen/QwQ-32B |✅︎|✅︎| |
| Multi Modality | OpenGVLab/InternVL3_5-8B |✅︎|✅︎| |
| Multi Modality | OpenGVLab/InternVL3_5-14B |✅︎|✅︎| |
| Multi Modality | OpenGVLab/InternVL3_5-38B |✅︎|✅︎| |
| Text Generation | deepseek-ai/DeepSeek-V2-Lite |✅︎|✅︎| |
| Text Generation | meta-llama/Llama-3.1-8B-Instruct |✅︎|✅︎| |
| Text Generation | baichuan-inc/Baichuan2-13B-Chat |✅︎|✅︎| |
| Text Generation | THUDM/GLM-4-9B-chat |✅︎|✅︎| |
| Text Generation | THUDM/GLM-4v-9B-chat |✅︎|✅︎| |
| Text Generation | THUDM/CodeGeex4-All-9B |✅︎|✅︎| |
| Text Generation | chuhac/TeleChat2-35B |✅︎|✅︎| |
| Text Generation | 01-ai/Yi1.5-34B-Chat |✅︎|✅︎| |
| Text Generation | deepseek-ai/DeepSeek-Coder-33B-base |✅︎|✅︎| |
| Text Generation | meta-llama/Llama-2-13b-chat-hf |✅︎|✅︎| |
| Text Generation | Qwen/Qwen1.5-14B-Chat |✅︎|✅︎| |
| Text Generation | Qwen/Qwen1.5-32B-Chat |✅︎|✅︎| |
| Multi Modality | Qwen/Qwen2-VL-7B-Instruct |✅︎|✅︎| |
| Multi Modality | Qwen/Qwen2.5-VL-72B-Instruct |✅︎|✅︎| |
| Multi Modality | Qwen/Qwen2.5-VL-32B-Instruct |✅︎|✅︎| |
| Multi Modality | THUDM/GLM-4v-9B |✅︎|✅︎| |
| Multi Modality | openbmb/MiniCPM-V-4 |✅︎|✅︎| |
| Embedding Model | Qwen/Qwen3-Embedding-8B |✅︎|✅︎| |
| Reranker Model | Qwen/Qwen3-Reranker-8B |✅︎|✅︎| |

## 3. Limitations

Some vLLM V1 features still need extra support, including LoRA (Low-Rank Adaptation), pipeline parallelism on Ray, EP (Expert Parallelism), and MLA (Multi-head Latent Attention).

The following are known issues:

* Qwen/Qwen3-30B-A3B in FP16/BF16 needs `--gpu-memory-utilization=0.8` due to its high memory consumption.
* W8A8 models quantized through llm_compressor, such as RedHatAI/DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic, are not supported yet.

## 4. How to Get Started

### 4.1. Prerequisite

| OS | Hardware |
| ---------- | ---------- |
| Ubuntu 25.04 | Intel® Arc™ B-Series |

### 4.2. Prepare a Serving Environment

1. Get the released docker image with the command `docker pull intel/vllm:0.10.2-xpu`.
2. Instantiate a docker container with the command `docker run -t -d --shm-size 10g --net=host --ipc=host --privileged -v /dev/dri/by-path:/dev/dri/by-path --name=vllm-test --device /dev/dri:/dev/dri --entrypoint= intel/vllm:0.10.2-xpu /bin/bash`.
3. Run the command `docker exec -it vllm-test bash` in 2 separate terminals to enter container environments for the server and the client respectively. (The three commands are consolidated in the block after this list.)

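For convenience, the same commands are collected below so they can be copied as one block (run these on the host, before entering the container):

```bash
# 1. Pull the released image.
docker pull intel/vllm:0.10.2-xpu

# 2. Start a detached container with the GPU devices mapped in.
docker run -t -d --shm-size 10g --net=host --ipc=host --privileged \
  -v /dev/dri/by-path:/dev/dri/by-path --name=vllm-test \
  --device /dev/dri:/dev/dri --entrypoint= intel/vllm:0.10.2-xpu /bin/bash

# 3. Enter the container; run this in two terminals (one for the server, one for the client).
docker exec -it vllm-test bash
```
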
\* Starting from here, all commands are expected to be run inside the docker container, unless explicitly noted otherwise.

In both environments, you may then wish to set a `HUGGING_FACE_HUB_TOKEN` environment variable to make sure necessary files can be downloaded from the HuggingFace website.

```bash
export HUGGING_FACE_HUB_TOKEN=xxxxxx
```

### 4.3. Launch Workloads

#### 4.3.1. Launch Server in the Server Environment

Command:

```bash
VLLM_WORKER_MULTIPROC_METHOD=spawn python3 -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --dtype=float16 --device=xpu --enforce-eager --port 8000 --block-size 64 --gpu-memory-util 0.9 --no-enable-prefix-caching --trust-remote-code --disable-sliding-window --disable-log-requests --max_num_batched_tokens=8192 --max_model_len 4096 -tp=4 --quantization fp8
```

Note that by default FP8 online quantization uses `e5m2`; you can switch to `e4m3` by explicitly setting the environment variable `VLLM_XPU_FP8_DTYPE=e4m3`. If there is not enough memory to hold the whole model before quantization to FP8, you can set `VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1` to offload weights to the CPU first.

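For example, the same server command can be prefixed with those environment variables (a hedged variant of the command above; the expected output that follows is unchanged):

```bash
# Same launch as above, but quantize with E4M3 and offload weights to the CPU first.
VLLM_XPU_FP8_DTYPE=e4m3 VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn python3 -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --dtype=float16 --device=xpu --enforce-eager \
  --port 8000 --block-size 64 --gpu-memory-util 0.9 --no-enable-prefix-caching --trust-remote-code \
  --disable-sliding-window --disable-log-requests --max_num_batched_tokens=8192 \
  --max_model_len 4096 -tp=4 --quantization fp8
```
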
Expected output:

```bash
INFO 02-20 03:20:29 api_server.py:937] Starting vLLM API server on http://0.0.0.0:8000
INFO 02-20 03:20:29 launcher.py:23] Available routes are:
INFO 02-20 03:20:29 launcher.py:31] Route: /openapi.json, Methods: HEAD, GET
INFO 02-20 03:20:29 launcher.py:31] Route: /docs, Methods: HEAD, GET
INFO 02-20 03:20:29 launcher.py:31] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 02-20 03:20:29 launcher.py:31] Route: /redoc, Methods: HEAD, GET
INFO 02-20 03:20:29 launcher.py:31] Route: /health, Methods: GET
INFO 02-20 03:20:29 launcher.py:31] Route: /ping, Methods: POST, GET
INFO 02-20 03:20:29 launcher.py:31] Route: /tokenize, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /detokenize, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /v1/models, Methods: GET
INFO 02-20 03:20:29 launcher.py:31] Route: /version, Methods: GET
INFO 02-20 03:20:29 launcher.py:31] Route: /v1/chat/completions, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /v1/completions, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /v1/embeddings, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /pooling, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /score, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /v1/score, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /v1/audio/transcriptions, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /rerank, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /v1/rerank, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /v2/rerank, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /invocations, Methods: POST
INFO: Started server process [1636943]
INFO: Waiting for application startup.
INFO: Application startup complete.
```

Startup may take some time. The message `INFO: Application startup complete.` indicates that the server is ready.

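Before benchmarking, you can optionally send a quick request from the client environment to verify the server responds. This is a hedged sketch against the `/v1/chat/completions` route listed in the output above:

```bash
# Minimal sanity check against the OpenAI-compatible chat endpoint.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'
```
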
#### 4.3.2. Raise Requests for Benchmarking in the Client Environment

We leverage a [benchmarking script](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py) provided in vLLM to perform performance benchmarking. You can use your own client scripts as well.

Use the command below to send serving requests:

```bash
python3 -m vllm.entrypoints.cli.main bench serve --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --dataset-name random --random-input-len=1024 --random-output-len=1024 --ignore-eos --num-prompt 1 --max-concurrency 16 --request-rate inf --backend vllm --port=8000 --host 0.0.0.0 --ready-check-timeout-sec 1
```

The command uses the model `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B`. Both input and output token lengths are set to `1024`. At most `16` requests are processed concurrently by the server.

Expected output:

```bash
Maximum request concurrency: 16
============ Serving Benchmark Result ============
Successful requests: 1
Benchmark duration (s): xxx
Total input tokens: 1024
Total generated tokens: 1024
Request throughput (req/s): xxx
Output token throughput (tok/s): xxx
Total Token throughput (tok/s): xxx
---------------Time to First Token----------------
Mean TTFT (ms): xxx
Median TTFT (ms): xxx
P99 TTFT (ms): xxx
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): xxx
Median TPOT (ms): xxx
P99 TPOT (ms): xxx
---------------Inter-token Latency----------------
Mean ITL (ms): xxx
Median ITL (ms): xxx
P99 ITL (ms): xxx
==================================================
```

## 5. Need Assistance?

Should you encounter any issues or have any questions, please submit an issue ticket at [vLLM GitHub Issues](https://github.com/vllm-project/vllm/issues). Include the text `[Intel GPU]` in the issue title to ensure it gets noticed.
