[Kernel][Model] logits_soft_cap for Gemma2 with flashinfer #6051

Open · wants to merge 15 commits into base: main
Conversation

LiuXiaoxuanPKU
Collaborator

Add logits_soft_cap support to FlashInfer, which is needed by the Gemma2 model, and add a simple Gemma2 test.
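
For reference, a minimal way to exercise the new path (a sketch, not part of the PR; it assumes the FlashInfer backend is installed, and the prompt and sampling parameters are arbitrary):

import os

# Select the FlashInfer attention backend, as the warning discussed below suggests.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2-9b-it")
outputs = llm.generate(["The capital of France is"],
                       SamplingParams(temperature=0.0, max_tokens=8))
print(outputs[0].outputs[0].text)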

@LiuXiaoxuanPKU LiuXiaoxuanPKU requested a review from Yard1 July 2, 2024 00:30
@Yard1 Yard1 (Collaborator) left a comment

we should check the flashinfer version and raise if it's too old
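
One way the version check could look (a minimal sketch; the helper name, the "flashinfer" distribution name, and the 0.0.8 threshold are placeholders, not values confirmed by this PR):

from importlib.metadata import PackageNotFoundError, version

from packaging.version import Version

# Placeholder: the real minimum is whichever release added logits_soft_cap.
MIN_FLASHINFER_VERSION = Version("0.0.8")

def _verify_flashinfer_version() -> None:
    try:
        installed = Version(version("flashinfer"))
    except PackageNotFoundError:
        raise RuntimeError("FlashInfer is not installed.")
    if installed < MIN_FLASHINFER_VERSION:
        raise RuntimeError(
            f"FlashInfer >= {MIN_FLASHINFER_VERSION} is required for "
            f"logits_soft_cap, got {installed}.")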

Comment on lines 666 to 670
logger.warning("Please use Flashinfer backend for models with"
"logits_soft_cap (i.e., Gemma-2)."
" Otherwise, the output might be wrong."
" Set Flashinfer backend by "
"export VLLM_ATTENTION_BACKEND=FLASHINFER.")
Collaborator

we should just raise an exception IMO.
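
A sketch of the suggested change in the same context as the snippet above, replacing the warning with an exception (the exact exception type and message are the author's call):

if logits_soft_cap is not None and self.attn_backend.get_name() != "flashinfer":
    raise ValueError(
        "Please use the FlashInfer backend for models with logits_soft_cap"
        " (i.e., Gemma-2); otherwise, the output may be incorrect. Set the"
        " backend via export VLLM_ATTENTION_BACKEND=FLASHINFER.")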

@comaniac comaniac self-assigned this Jul 2, 2024
Comment on lines 662 to 665
logits_soft_cap = getattr(self.model_config.hf_config,
                          'final_logit_softcapping', None)
if logits_soft_cap is not None and self.attn_backend.get_name(
) != "flashinfer":
@yongxb

Could I check if logits_soft_cap is supposed to be the attn_logit_softcapping value instead? The two values are different in the Gemma2 config.

"attn_logit_softcapping": 50.0,
"final_logit_softcapping": 30.0,

Collaborator

@yongxb Nice catch! final_logit_softcapping is used to cap the final logits before sampling. @LiuXiaoxuanPKU Could you please fix this?
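
To make the distinction concrete (a sketch; soft_cap is an illustrative helper, and the two constants are the Gemma-2 config values quoted above):

import torch

def soft_cap(x: torch.Tensor, cap: float) -> torch.Tensor:
    """Gemma-2-style capping: squash values into (-cap, cap) via cap * tanh(x / cap)."""
    return cap * torch.tanh(x / cap)

# attn_logit_softcapping = 50.0: applied to pre-softmax attention scores
# inside the attention kernel, so this is the value to pass to FlashInfer.
attn_scores = soft_cap(torch.randn(4, 128) * 100, 50.0)

# final_logit_softcapping = 30.0: applied to the LM-head logits right before
# sampling, independent of the attention backend.
final_logits = soft_cap(torch.randn(4, 1024) * 100, 30.0)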

@zifeitong
Contributor

I think this warning can be removed to avoid confusion:

if self.config.attn_logit_softcapping is not None:

@zifeitong
Contributor

I am able to run and reproduce the reported MMLU scores for both 9b and 27b models 👍

However, if I don't disable CUDA graph, vLLM will crash with this error:

ERROR 07-04 03:34:04 async_llm_engine.py:53]   File "vllm/vllm/worker/model_runner.py", line 1202, in execute_model
ERROR 07-04 03:34:04 async_llm_engine.py:53]     model_input.attn_metadata.decode_wrapper = self.graph_runners[
ERROR 07-04 03:34:04 async_llm_engine.py:53] IndexError: list index out of range

@LiuXiaoxuanPKU
Collaborator Author

Thanks for reporting! Could you give me a minimal reproducible example? I can run Gemma-2 with FlashInfer and CUDA graph on my end. Thanks!

@zifeitong
Contributor

I am using the run_batch script:

python -m vllm.entrypoints.openai.run_batch -i requests.jsonl -o /dev/null --model google/gemma-2-9b-it --disable-log-request

requests.jsonl.zip

@LiuXiaoxuanPKU
Collaborator Author

I tried the script and data on an H100 and it seems to work. Could you report your environment? FlashInfer only supports GPUs with compute capability 8.0 or higher (https://developer.nvidia.com/cuda-gpus). Not sure if that might be the problem.
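
A quick way to check the GPU on the reporter's side (a sketch using PyTorch; the (8, 0) threshold mirrors the requirement mentioned above):

import torch

major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
if (major, minor) < (8, 0):
    print("FlashInfer needs compute capability 8.0 or higher; this GPU is too old.")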

@zifeitong
Contributor

I am using an H100 with CUDA 12.5. Can you try syncing your branch to the latest main? #4412 might be related (it refactors graph_runners).

@LiuXiaoxuanPKU
Collaborator Author

Yes, it was a merge conflict. Just fixed it, please try again. Thanks!

@zifeitong
Contributor

Thanks for the fix. It works now, w/ or w/o CUDA graph.
