
Disable spec-decode + chunked-prefill for draft models with tensor parallelism > 1 #10136

Merged: 71 commits into vllm-project:main on Nov 8, 2024

Conversation

@sroy745 (Collaborator) commented Nov 8, 2024

This PR disables enabling spec-decode and chunked-prefill together when the draft model has a tensor parallel size (TP) > 1. This is needed because, as reported in #9291, errors occur when both are enabled for a draft model with TP > 1. After this change, the following command returns the error shown below:

python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8083 --model "meta-llama/Meta-Llama-3.1-8B-Instruct"   --speculative_model meta-llama/Llama-3.1-8B-Instruct   --tensor-parallel-size 4    --disable-log-requests --use-v2-block-manager --seed 42 --num_speculative_tokens 5  --spec-decoding-acceptance-method typical_acceptance_sampler  --speculative_draft_tensor_parallel_size 4 --max-num-seqs 64 --enable-chunked-prefill


INFO 11-08 03:44:35 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/052b29fd-c3d4-462e-a184-4bfefd9cc94a for IPC Path.
INFO 11-08 03:44:35 api_server.py:185] Started engine process with PID 1182727
INFO 11-08 03:44:40 config.py:347] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'.
INFO 11-08 03:44:40 config.py:1014] Defaulting to use mp for distributed inference
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/jovyan/cp-sd-check/vllm/entrypoints/openai/api_server.py", line 615, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/home/jovyan/cp-sd-check/vllm/entrypoints/openai/api_server.py", line 581, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/home/jovyan/cp-sd-check/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/home/jovyan/cp-sd-check/vllm/entrypoints/openai/api_server.py", line 188, in build_async_engine_client_from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/home/jovyan/cp-sd-check/vllm/engine/arg_utils.py", line 1071, in create_engine_config
    speculative_config = SpeculativeConfig.maybe_create_spec_config(
  File "/home/jovyan/cp-sd-check/vllm/config.py", line 1393, in maybe_create_spec_config
    raise ValueError(
ValueError: Chunked prefill and speculative decoding can be enabled simultaneously only for draft models with tensor parallel size 1.
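
For reference, the traceback shows the check living in SpeculativeConfig.maybe_create_spec_config (vllm/config.py). Below is a minimal standalone sketch of such a guard; the function and argument names (check_chunked_prefill_with_spec_decode, enable_chunked_prefill, speculative_model, draft_tp) are illustrative assumptions, not the exact vLLM signature.

# Minimal sketch (assumed names) of the validation: reject chunked prefill
# combined with speculative decoding whenever the draft model uses TP > 1.
def check_chunked_prefill_with_spec_decode(enable_chunked_prefill: bool,
                                           speculative_model: str | None,
                                           draft_tp: int) -> None:
    spec_decode_enabled = speculative_model is not None
    if enable_chunked_prefill and spec_decode_enabled and draft_tp > 1:
        raise ValueError(
            "Chunked prefill and speculative decoding can be enabled "
            "simultaneously only for draft models with tensor parallel size 1.")

# Mirrors the failing command above: draft TP 4 with --enable-chunked-prefill.
try:
    check_chunked_prefill_with_spec_decode(
        enable_chunked_prefill=True,
        speculative_model="meta-llama/Llama-3.1-8B-Instruct",
        draft_tp=4)
except ValueError as err:
    print(err)  # prints the same message as the log above

With draft_tp=1, or with chunked prefill disabled, the sketch passes silently, matching the intended behavior of the PR.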

sroy745 added 30 commits May 28, 2024 20:39

github-actions bot commented Nov 8, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@sroy745 sroy745 changed the title Disable spec-decode + chunked-prefill for draft models with tensor parallelism > 1 [WIP] Disable spec-decode + chunked-prefill for draft models with tensor parallelism > 1 Nov 8, 2024
Signed-off-by: Sourashis Roy <[email protected]>
@sroy745 sroy745 marked this pull request as draft November 8, 2024 03:21
@sroy745 sroy745 marked this pull request as ready for review November 8, 2024 03:47
@sroy745 sroy745 changed the title [WIP] Disable spec-decode + chunked-prefill for draft models with tensor parallelism > 1 Disable spec-decode + chunked-prefill for draft models with tensor parallelism > 1 Nov 8, 2024
@sroy745 (Collaborator, Author) commented Nov 8, 2024

@njhill / @LiuXiaoxuanPKU this PR is ready for review. Could you PTAL when you get a chance?

cc: @NickLucche

@comaniac comaniac added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 8, 2024
@comaniac comaniac enabled auto-merge (squash) November 8, 2024 05:19
@LiuXiaoxuanPKU (Collaborator) left a comment


LGTM. Thanks!

auto-merge was automatically disabled November 8, 2024 07:29

Head branch was pushed to by a user without write access

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) November 8, 2024 15:50
@DarkLight1337 DarkLight1337 merged commit f677862 into vllm-project:main Nov 8, 2024
48 checks passed
Isotr0py pushed a commit to Isotr0py/vllm that referenced this pull request Nov 8, 2024
JC1DA pushed a commit to JC1DA/vllm that referenced this pull request Nov 11, 2024
jeejeelee pushed a commit to jeejeelee/vllm that referenced this pull request Nov 11, 2024
rickyyx pushed a commit to rickyyx/vllm that referenced this pull request Nov 13, 2024
sumitd2 pushed a commit to sumitd2/vllm that referenced this pull request Nov 14, 2024
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
mfournioux pushed a commit to mfournioux/vllm that referenced this pull request Nov 20, 2024
…rallelism > 1 (vllm-project#10136)

Signed-off-by: Sourashis Roy <[email protected]>
Signed-off-by: Maxime Fournioux <[email protected]>
tlrmchlsmth pushed a commit to neuralmagic/vllm that referenced this pull request Nov 23, 2024
…rallelism > 1 (vllm-project#10136)

Signed-off-by: Sourashis Roy <[email protected]>
Signed-off-by: Tyler Michael Smith <[email protected]>
sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024
Labels
ready ONLY add when PR is ready to merge/full CI is needed
4 participants