
[Kernel] Fix Flashinfer Correctness #7284

Merged: 1 commit into vllm-project:main on Aug 7, 2024

Conversation

LiuXiaoxuanPKU (Collaborator) commented Aug 7, 2024

FIX #7176

github-actions bot commented Aug 7, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small but essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

LiuXiaoxuanPKU (Collaborator, Author) commented:
/ready

github-actions bot added the ready label (ONLY add when PR is ready to merge / full CI is needed) on Aug 7, 2024
simon-mo (Collaborator) commented Aug 7, 2024

wait how did this fix it? did the profile run corrupt something?

Yard1 (Collaborator) commented Aug 7, 2024

@simon-mo looks like the assumption that "prefill doesn't read KV cache" was incorrect.

Yard1 enabled auto-merge (squash) on August 7, 2024 at 21:55
LiuXiaoxuanPKU (Collaborator, Author) replied:

> wait how did this fix it? did the profile run corrupt something?

Originally, we set paged_kv_indptr to all 0s for both the profile run and the prefill run. We don't use flashinfer in the profile run, so that case is fine. But we do use flashinfer in the prefill run, and the prefill run reads the KV cache. Setting paged_kv_indptr to all 0s prevents flashinfer from reading the KV cache, which corrupts the output.
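For context, here is a minimal sketch of how a CSR-style indptr selects KV-cache pages per sequence, and why an all-zero indptr means nothing is read. The names mirror FlashInfer's metadata fields (paged_kv_indptr, paged_kv_indices), but the layout and helper below are simplified assumptions, not the library's exact API.

```python
import torch

# Flattened page ids owned by all sequences (illustrative values).
paged_kv_indices = torch.tensor([3, 7, 9, 2])
# CSR-style offsets: indices[indptr[i]:indptr[i+1]] are the pages of sequence i.
paged_kv_indptr = torch.tensor([0, 2, 4])  # seq 0 -> pages [3, 7], seq 1 -> pages [9, 2]

def pages_for_seq(indptr: torch.Tensor, indices: torch.Tensor, i: int) -> torch.Tensor:
    start, end = int(indptr[i]), int(indptr[i + 1])
    return indices[start:end]

print(pages_for_seq(paged_kv_indptr, paged_kv_indices, 0))  # tensor([3, 7])

# With an all-zero indptr every slice is empty, so attention reads no KV-cache
# pages at all: harmless for the dummy profile run, but wrong whenever the
# prefill kernel actually needs to read the KV cache, corrupting its output.
zero_indptr = torch.zeros(3, dtype=torch.int64)
print(pages_for_seq(zero_indptr, paged_kv_indices, 0))  # tensor([], dtype=torch.int64)
```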

youkaichao disabled auto-merge on August 7, 2024 at 23:26
youkaichao (Member) commented:

test failures are not related and are fixed in the main branch.

youkaichao merged commit e53dfd3 into vllm-project:main on Aug 7, 2024
58 of 65 checks passed
@@ -127,6 +127,7 @@ def __post_init__(self):
             raise ValueError(
                 f"Only {supported_head_sizes} are supported for head_dim,",
                 f"received {self.head_dim}.")
+        self.is_profile_run = is_block_tables_empty(self.block_tables)
Contributor commented on the added line:


@LiuXiaoxuanPKU I think this line is buggy: self.block_tables here is always a tensor, so is_block_tables_empty should always return False. From what I've checked, this happens to work for flashinfer v0.1.2, but not v0.1.3, which raises RuntimeError: CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1) failed. 1 vs 257 during the profile run. (This is because the if self.is_profile_run block is never run.)

I tried a quick fix of patching is_block_tables_empty to check whether block_tables has numel() == 0, which passes the profile run fine. But this introduces logic issues for prefill, which obviously has a zero-length block table as well. Not sure what the best fix for this is, just wanted to raise this concern here.
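A minimal sketch of the pitfall being raised, assuming a helper that treats None or all-None containers as "empty". The function is_block_tables_empty_sketch and the tensor shapes below are illustrative assumptions, not vLLM's actual implementation.

```python
import torch
from typing import Optional, Union

# Illustrative stand-in for is_block_tables_empty: treats None / all-None dicts
# as empty, but reports any torch.Tensor as non-empty (the behavior described above).
def is_block_tables_empty_sketch(
        block_tables: Optional[Union[dict, torch.Tensor]]) -> bool:
    if block_tables is None:
        return True
    if isinstance(block_tables, dict):
        return all(v is None for v in block_tables.values())
    return False  # a tensor, even one with numel() == 0, is never "empty"

# Once block_tables has been converted to a tensor, the profile run is never detected...
profile_tables = torch.empty((256, 0), dtype=torch.int32)  # dummy profile run
prefill_tables = torch.empty((8, 0), dtype=torch.int32)    # real prefill, no pages yet
print(is_block_tables_empty_sketch(profile_tables))        # False

# ...and the suggested numel() == 0 patch cannot tell the two cases apart either.
print(profile_tables.numel() == 0, prefill_tables.numel() == 0)  # True True
```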

Successfully merging this pull request may close these issues.

[Bug]: FlashInfer backend generate bad output