
[Kernel] Fix Flashinfer Correctness #7284

Merged: 1 commit into vllm-project:main on Aug 7, 2024

Conversation

LiuXiaoxuanPKU (Collaborator) commented Aug 7, 2024

FIX #7176

github-actions bot commented Aug 7, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small but essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

LiuXiaoxuanPKU (Collaborator, Author) commented:
/ready

github-actions bot added the ready label (ONLY add when PR is ready to merge / full CI is needed) on Aug 7, 2024
simon-mo (Collaborator) commented Aug 7, 2024

wait how did this fix it? did the profile run corrupt something?

Yard1 (Collaborator) commented Aug 7, 2024

@simon-mo looks like the assumption that "prefill doesn't read KV cache" was incorrect.

Yard1 enabled auto-merge (squash) on August 7, 2024 at 21:55
LiuXiaoxuanPKU (Collaborator, Author) replied:

> wait how did this fix it? did the profile run corrupt something?

Originally, we set paged_kv_indptr to all 0s for both the profile run and the prefill run. We don't use flashinfer in the profile run, so that case is fine. But we do use flashinfer in the prefill run, and the prefill run reads the KV cache. Setting paged_kv_indptr to all 0s prevents flashinfer from reading the KV cache, which corrupts the output.
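For context, here is a minimal sketch of how a CSR-style indptr selects KV-cache pages per sequence, and why an all-zero indptr means nothing is read. The names mirror FlashInfer's metadata fields (paged_kv_indptr, paged_kv_indices), but the layout and helper below are simplified assumptions, not the library's exact API.

```python
import torch

# Flattened page ids owned by all sequences (illustrative values).
paged_kv_indices = torch.tensor([3, 7, 9, 2])
# CSR-style offsets: indices[indptr[i]:indptr[i+1]] are the pages of sequence i.
paged_kv_indptr = torch.tensor([0, 2, 4])  # seq 0 -> pages [3, 7], seq 1 -> pages [9, 2]

def pages_for_seq(indptr: torch.Tensor, indices: torch.Tensor, i: int) -> torch.Tensor:
    start, end = int(indptr[i]), int(indptr[i + 1])
    return indices[start:end]

print(pages_for_seq(paged_kv_indptr, paged_kv_indices, 0))  # tensor([3, 7])

# With an all-zero indptr every slice is empty, so attention reads no KV-cache
# pages at all: harmless for the dummy profile run, but wrong whenever the
# prefill kernel actually needs to read the KV cache, corrupting its output.
zero_indptr = torch.zeros(3, dtype=torch.int64)
print(pages_for_seq(zero_indptr, paged_kv_indices, 0))  # tensor([], dtype=torch.int64)
```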

youkaichao disabled auto-merge on August 7, 2024 at 23:26
youkaichao (Member) commented:

test failures are not related and are fixed in the main branch.

youkaichao merged commit e53dfd3 into vllm-project:main on Aug 7, 2024
58 of 65 checks passed
@@ -127,6 +127,7 @@ def __post_init__(self):
             raise ValueError(
                 f"Only {supported_head_sizes} are supported for head_dim,",
                 f"received {self.head_dim}.")
+        self.is_profile_run = is_block_tables_empty(self.block_tables)
Contributor commented on the added line:


@LiuXiaoxuanPKU I think this line is buggy: self.block_tables here is always a tensor, so is_block_tables_empty should always return False. From what I've checked, this happens to work for flashinfer v0.1.2, but not v0.1.3, which raises RuntimeError: CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1) failed. 1 vs 257 during the profile run. (This is because the if self.is_profile_run block is never run.)

I tried a quick fix of patching is_block_tables_empty to check whether block_tables has numel() == 0, which passes the profile run fine. But this introduces logic issues for prefill, which obviously has a zero-length block table as well. Not sure what the best fix for this is, just wanted to raise this concern here.
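A minimal sketch of the pitfall being raised, assuming a helper that treats None or all-None containers as "empty". The function is_block_tables_empty_sketch and the tensor shapes below are illustrative assumptions, not vLLM's actual implementation.

```python
import torch
from typing import Optional, Union

# Illustrative stand-in for is_block_tables_empty: treats None / all-None dicts
# as empty, but reports any torch.Tensor as non-empty (the behavior described above).
def is_block_tables_empty_sketch(
        block_tables: Optional[Union[dict, torch.Tensor]]) -> bool:
    if block_tables is None:
        return True
    if isinstance(block_tables, dict):
        return all(v is None for v in block_tables.values())
    return False  # a tensor, even one with numel() == 0, is never "empty"

# Once block_tables has been converted to a tensor, the profile run is never detected...
profile_tables = torch.empty((256, 0), dtype=torch.int32)  # dummy profile run
prefill_tables = torch.empty((8, 0), dtype=torch.int32)    # real prefill, no pages yet
print(is_block_tables_empty_sketch(profile_tables))        # False

# ...and the suggested numel() == 0 patch cannot tell the two cases apart either.
print(profile_tables.numel() == 0, prefill_tables.numel() == 0)  # True True
```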

Successfully merging this pull request may close these issues.

[Bug]: FlashInfer backend generate bad output