[MISC] Use non-blocking transfer in prepare_input #7172

comaniac · 2024-08-05T21:42:47Z

This PR uses non-blocking data transfers in prepare_input. This helps because prepare_input transfers several tensors to the GPU. Here are some benchmark results using Llama-3.1-8B-Instruct on 1xH100:

Batching

Command:

python3 benchmark_throughput.py \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --backend vllm \
    --input-len 292 \
    --output-len 579 \
    --num-prompts 1000

Result (I observed some variance: over multiple runs, the throughput actually ranges from 8.15 to 8.34 requests/s):

Main:    Throughput: 8.12 requests/s, 7075.74 tokens/s
This PR: Throughput: 8.34 requests/s, 7266.01 tokens/s

Serving

I used a different benchmark framework, so there are no commands here, but the settings are as follows:

Input / Output: 550 / 150
QPS: 8
Duration: 120 seconds

Main:    P90 Latency: TTFT 47.9ms, ITL 12.9ms, E2E 2.0s
This PR: P90 Latency: TTFT 44.8ms, ITL 11.5ms, E2E 1.8s
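For a quick read of the numbers above, here is a small sketch that computes the relative change between Main and this PR for each reported metric. The helper name `rel_change` and the dictionary layout are illustrative only; the figures are copied from the results above, and given the run-to-run variance noted for throughput, the percentages should be read as rough indications rather than precise gains.

```python
def rel_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (Main, This PR) pairs copied from the benchmark results above.
results = {
    "throughput (requests/s)": (8.12, 8.34),
    "throughput (tokens/s)": (7075.74, 7266.01),
    "P90 TTFT (ms)": (47.9, 44.8),
    "P90 ITL (ms)": (12.9, 11.5),
    "P90 E2E (s)": (2.0, 1.8),
}

changes = {name: rel_change(main, pr) for name, (main, pr) in results.items()}

for name, pct in changes.items():
    # Positive is better for throughput; negative is better for latency.
    print(f"{name}: {pct:+.1f}%")
```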

Reference: https://pytorch.org/tutorials/intermediate/pinmem_nonblock.html
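As a minimal sketch of the pattern the reference describes (not the actual diff in this PR), the snippet below builds tensors in pinned host memory and copies them with `non_blocking=True`, so that several host-to-device copies can be queued without the host waiting on each one. The helper `to_device_async` and the tensor names are hypothetical, chosen for illustration; the code falls back to plain CPU tensors when no GPU is present.

```python
import torch


def to_device_async(values, dtype, device):
    """Hypothetical helper: copy a Python list to `device` without blocking the host."""
    use_cuda = torch.device(device).type == "cuda"
    # Pinned (page-locked) host memory is what lets the copy run asynchronously;
    # from pageable memory the transfer may still block the host.
    host = torch.tensor(values, dtype=dtype, pin_memory=use_cuda)
    return host.to(device, non_blocking=True)


device = "cuda" if torch.cuda.is_available() else "cpu"

# Illustrative stand-ins for the several tensors transferred per step.
input_tokens = to_device_async([1, 2, 3, 4], torch.long, device)
input_positions = to_device_async([0, 1, 2, 3], torch.long, device)
slot_mapping = to_device_async([10, 11, 12, 13], torch.long, device)

# Kernels launched later on the same CUDA stream are ordered after these copies,
# so an explicit synchronize is only needed before timing or debugging on the host.
if device == "cuda":
    torch.cuda.synchronize()
```

Note that `non_blocking=True` is safe here because the data flows from host to device and is consumed on the device; for device-to-host copies, the host must synchronize before reading the result.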

cc @youkaichao

github-actions · 2024-08-05T21:43:22Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

Comment /ready on the PR
Add ready label to the PR
Enable auto-merge.

🚀

Yard1

lgtm

vllm/attention/backends/flashinfer.py

Signed-off-by: Alvant <[email protected]>

[MISC] Use non-blocking transfer in prepare_input

90ecb1d

comaniac marked this pull request as ready for review August 5, 2024 21:42

Yard1 approved these changes Aug 5, 2024

Review thread on vllm/attention/backends/flashinfer.py (outdated, resolved)

from_numpy

4cdfa2e

comaniac added the ready label Aug 5, 2024

youkaichao approved these changes Aug 5, 2024

comaniac enabled auto-merge (squash) August 5, 2024 22:26

comaniac merged commit ef527be into vllm-project:main Aug 5, 2024
65 checks passed

sfc-gh-mkeralapura pushed a commit to sfc-gh-mkeralapura/vllm that referenced this pull request Aug 12, 2024

[MISC] Use non-blocking transfer in prepare_input (vllm-project#7172)

b978e64

kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Aug 17, 2024

[MISC] Use non-blocking transfer in prepare_input (vllm-project#7172)

bc51c47

fialhocoelho pushed a commit to opendatahub-io/vllm that referenced this pull request Aug 22, 2024

[MISC] Use non-blocking transfer in prepare_input (vllm-project#7172)

26f91d2

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024

[MISC] Use non-blocking transfer in prepare_input (vllm-project#7172)

a625523

Signed-off-by: Alvant <[email protected]>

KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024

[MISC] Use non-blocking transfer in prepare_input (vllm-project#7172)

f8f4d5b

comaniac deleted the non-blocking branch January 3, 2025 21:51
