[Bug]: illegal memory access when increasing max_model_length on FP8 models #6429
Comments
@comaniac FYI
Will take a look when I get time. Meanwhile, I have 2 questions:
For Q1: Sure, you can substitute any other prompt input for this one (the prompt length is relative to max_model_len). By the way, I am testing v0.5.2 on this issue. I will update this issue if I get new results!
I also tested Llama3-8B-FP8-KV (generated using AutoFP8: https://github.com/neuralmagic/AutoFP8) with context length = 512K and encountered the same error.
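For reference, producing such a checkpoint with AutoFP8 looks roughly like the sketch below (the model and output paths plus the calibration text are placeholders, and `kv_cache_quant_targets` is my assumption for the option that emits KV-cache scales):

```python
# Sketch: quantize a model to FP8 and emit KV-cache scales with AutoFP8.
# Paths and the calibration text are placeholders; kv_cache_quant_targets is
# assumed to be the option that adds k/v scaling factors to the checkpoint.
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained = "meta-llama/Meta-Llama-3-8B-Instruct"
output_dir = "Meta-Llama-3-8B-Instruct-FP8-KV"

tokenizer = AutoTokenizer.from_pretrained(pretrained, use_fast=True)
examples = tokenizer(
    ["Calibration text used to compute static activation scales."],
    return_tensors="pt",
).to("cuda")

quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",
    kv_cache_quant_targets=("k_proj", "v_proj"),  # assumed: request KV scales
)

model = AutoFP8ForCausalLM.from_pretrained(pretrained, quantize_config)
model.quantize(examples)
model.save_quantized(output_dir)
```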
Thanks for the investigation. If that's the case, then the problem is actually in the paged attention kernel rather than the MoE kernel.
For v0.5.2, FP8 models can't be loaded successfully. See the traceback here:
Does that work without fp8 kv-cache?
Yes. Only the fp8 kv_cache versions can't be loaded. I have tested the fp8 kv_cache versions of Llama3 and Mixtral; both failed!
Looks like a checkpoint format issue? cc @mgoin
We identified this issue today, unfortunately. I have resolved this on main in PR #6081.
Sorry guys, I broke this while trying to get DeepSeek working. We should get a model with kv scales into the CI.
I tested the main branch (built from source); fp8 kv_scale models can now be loaded successfully! However, the issue I mentioned here isn't solved yet: the same error traceback is encountered when I extend the context length to 512K or longer (384K is OK) on Mixtral-8x7B-FP8-KV.
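For context, a repro along these lines would look roughly like the following sketch (not the reporter's exact script; the local checkpoint path, filler prompt, and tensor-parallel size are assumptions):

```python
# Sketch of a long-context run against an FP8 checkpoint with KV-cache scales.
# The checkpoint path, filler prompt, and tensor_parallel_size are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Mixtral-8x7B-Instruct-FP8-KV",  # local FP8 checkpoint with kv scales
    kv_cache_dtype="fp8",                  # use the FP8 KV cache
    max_model_len=512 * 1024,              # 512K context; 384K reportedly works
    tensor_parallel_size=4,
)

# Build a prompt whose length scales with max_model_len.
prompt = "vLLM is a fast inference engine. " * 60_000  # roughly 480K tokens
outputs = llm.generate([prompt], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```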
vLLM version: 0.5.3.post1. Result: for Mixtral-8x7B-FP8 KV_cache, inference with a 4K context length is OK, but it fails when I set it to 256K or 512K. Fragments of the traceback:
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]:     qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)  # [M,N]
Hello, I had the same issue with the neuralmagic/Mistral-Nemo-Instruct-2407-FP8 model, and I found out that vLLM enables chunked prefill above 32K.
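For what it's worth, chunked prefill can be switched off explicitly when constructing the engine; a minimal sketch follows (whether this actually avoids the crash is exactly what's under discussion here):

```python
# Sketch: explicitly disable chunked prefill to test whether the crash is
# tied to the fp8-KV-cache + chunked-prefill combination.
from vllm import LLM

llm = LLM(
    model="neuralmagic/Mistral-Nemo-Instruct-2407-FP8",
    kv_cache_dtype="fp8",
    max_model_len=128 * 1024,
    enable_chunked_prefill=False,  # override the >32K auto-enable behaviour
)
```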
@florianbaud are you running on v0.5.3.post1? We resolved a difficult bug in cutlass with this commit: #6852, which will be in the next release.
@robertgshaw2-neuralmagic, I'm running a Docker image built from the latest commit (c8a7e93, Wed Jul 31 23:51:09 2024).
Can you share the error message?
Yes, the error message is:
FP8 KV cache is being enabled, which seems not to be compatible with chunked prefill. I think we should disable it in this case. EDIT: this doesn't seem to trigger for Llama FP8, so maybe it is an issue with MoE models.
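A sketch of the kind of guard being suggested (illustrative only; the `cache_config` / `scheduler_config` attribute names mirror vLLM's config objects but are used here as hypothetical placeholders, not actual vLLM code):

```python
# Illustrative only: the sort of compatibility check being proposed.
# Attribute names are hypothetical stand-ins for vLLM's config objects.
def check_fp8_kv_cache_vs_chunked_prefill(cache_config, scheduler_config) -> None:
    if cache_config.cache_dtype.startswith("fp8") and scheduler_config.chunked_prefill_enabled:
        raise ValueError(
            "FP8 KV cache does not currently work with chunked prefill; "
            "pass enable_chunked_prefill=False or use kv_cache_dtype='auto'."
        )
```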
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Your current environment
🐛 Describe the bug
When I set max_position_embeddings to 512K or higher, an illegal memory access error is encountered.
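For clarity, extending the context this way amounts to raising max_position_embeddings in the local checkpoint's config.json before loading it; a minimal sketch (the checkpoint directory is a placeholder):

```python
# Sketch: bump max_position_embeddings on a local checkpoint to 512K.
# The checkpoint directory name is a placeholder.
import json
from pathlib import Path

config_path = Path("Mixtral-8x7B-Instruct-FP8-KV/config.json")
config = json.loads(config_path.read_text())
config["max_position_embeddings"] = 512 * 1024
config_path.write_text(json.dumps(config, indent=2))
```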