[Bugfix] Fix illegal memory access error with chunked prefill, prefix caching, block manager v2 and xformers enabled together #9532
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these: … 🚀
Changed the title: illegal memory access error with chunked prefill, prefix caching and block manager v2 enabled together
(Sorry for the inconvenience; I unmarked the PR as a draft to be able to run fastcheck.)
It seems that either the L4 (or perhaps non-Pascal GPUs in general) is unaffected, or the triggering request sequence differs per GPU architecture (this is reproduced on three P40s tested independently). Anyway, I'll try to fix that now.
Note that, according to the official documentation (https://docs.vllm.ai/en/latest/models/engine_args.html), block manager v1 has been removed and SelfAttnBlockSpaceManager (i.e. block manager v2) is now the default. Setting the --use-v2-block-manager flag to True or False has no effect on vLLM behavior.
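For example, a minimal sketch of what launching with these features looks like (the model name is only a placeholder reused from the repro below; the deprecated flag is accepted but ignored on recent vLLM):

```sh
# Sketch: block manager v2 is already the default on recent vLLM;
# --use-v2-block-manager is accepted for backward compatibility only.
vllm serve mistralai/Ministral-8B-Instruct-2410 \
    --enable-chunked-prefill \
    --enable-prefix-caching \
    --use-v2-block-manager
```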
@simon-mo @DarkLight1337 @robertgshaw2-neuralmagic
Please don't ping the reviewers; this PR is still a draft. I haven't found the cause yet. I think block manager v2 just doesn't allocate (enough, or at all) memory in some cases.
Compute sanitizer output
At the moment, if you need chunked prefill + prefix caching, you can copy block manager v1 from an old version of vLLM. It works even without modifications. You just need to replace … with … in …
Thanks for your help! Hope this issue can be solved soon.
Hi @sasha0552, I have tried to disable …
What error do you see? Can you reproduce it consistently by sending the same sequence of requests? What GPU(s) do you have and what model are you using? If it reproduces consistently, you can anonymize the prompts (if they are confidential) like I did and send them there; it may help identify the underlying problem. If they are not confidential, you can just send them as is. You can anonymize the prompts by converting them to tokens using the …

repro.py

```python
from vllm import LLM, SamplingParams, TokensPrompt

llm = LLM(
    config_format="mistral",
    dtype="float16",
    enable_chunked_prefill=True,
    enable_prefix_caching=True,
    enforce_eager=True,
    load_format="mistral",
    max_model_len=4096,
    model="mistralai/Ministral-8B-Instruct-2410",
    swap_space=0,
    tensor_parallel_size=1,
    tokenizer_mode="mistral",
    use_v2_block_manager=True,
)

llm.generate(TokensPrompt(prompt_token_ids=([0] * 588) + ([1] * 1332) + ([2] * 30) + ([3] * 1)), SamplingParams(max_tokens=1, seed=42))
llm.generate(TokensPrompt(prompt_token_ids=([0] * 588) + ([1] * 1332) + ([4] * 3) + ([5] * 50)), SamplingParams(max_tokens=1, seed=42))
llm.generate(TokensPrompt(prompt_token_ids=([0] * 588) + ([1] * 1332) + ([2] * 30) + ([6] * 95)), SamplingParams(max_tokens=1, seed=42))
llm.generate(TokensPrompt(prompt_token_ids=([0] * 588) + ([1] * 1332) + ([4] * 3) + ([7] * 174)), SamplingParams(max_tokens=1, seed=42))
llm.generate(TokensPrompt(prompt_token_ids=([0] * 588) + ([8] * 1539)), SamplingParams(max_tokens=1, seed=42))
```

All five requests start with the same 588-token prefix, the first four requests additionally share the same 1332-token segment, and the last request has a different prefix. The first and third have the same prefix, and the second and fourth have the same (but different from the first and third) prefix. Additionally, you could try running vLLM with …
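As an aside, a minimal sketch of one way to do this kind of anonymization (an illustration, not the exact script used here; it assumes a model with a standard Hugging Face tokenizer, and the model name is a placeholder):

```python
# Sketch: preserve only the token counts and shared-prefix structure of
# confidential prompts, so a repro like the one above can be shared
# without revealing the text itself.
from transformers import AutoTokenizer

# Placeholder model name; use the tokenizer of the model you are reproducing with.
tokenizer = AutoTokenizer.from_pretrained("your-org/your-model")


def token_count(text: str) -> int:
    """Number of tokens the text occupies, without revealing its content."""
    return len(tokenizer.encode(text, add_special_tokens=False))


# Each distinct prompt segment becomes a run of a constant dummy token id of
# the same length, e.g. a confidential 588-token shared prefix becomes [0] * 588.
shared_prefix = "...confidential text shared by all requests..."
unique_suffix = "...confidential text unique to this request..."
prompt_token_ids = [0] * token_count(shared_prefix) + [1] * token_count(unique_suffix)
```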
Hi @sasha0552, for the prompt and my code, I have sent them to your email. I can reproduce the error consistently. Whether I disable chunked_prefill or use block manager v1, I get the following error: …
It looks like there is something wrong with your email; it was rejected by Cloudflare due to missing SPF, DMARC, and DKIM. Could you try resending it from a different address? You could also try sending the email to …
Hi @sasha0552, I have resent them. If you have any questions, feel free to contact me.
This pull request has merge conflicts that must be resolved before it can be merged.
Changed the title from illegal memory access error with chunked prefill, prefix caching and block manager v2 enabled together to illegal memory access error with chunked prefill, prefix caching, block manager v2 and xformers enabled together
I have reproduced the error: https://buildkite.com/vllm/fastcheck/builds/6857#0192df28-9c8c-4244-b850-f3ccdbf7ca2e Also, I pushed a fix; it seems that copying the MetadataBuilder code from flash-attn helps (related: #7018).
repro.py for manual testing
@StevenTang1998 I tried to reproduce the …
Force-pushed from fef6ade to a2d2a8b
Is this ready for review?
Signed-off-by: sasha0552 <[email protected]>
Force-pushed from 10a8040 to 9233866
Signed-off-by: sasha0552 <[email protected]>
Force-pushed from 9233866 to 63c87eb
vllm/attention/backends/xformers.py (Outdated)

```diff
@@ -384,9 +392,166 @@ def _get_seq_len_block_table_args(
     raise AttributeError(f"Invalid attention type {str(attn_type)}")

-class XFormersMetadataBuilder(CommonMetadataBuilder[XFormersMetadata]):
+class XFormersMetadataBuilder(AttentionMetadataBuilder[XFormersMetadata]):
```
Why do we need to re-implement the metadata builder? What's the difference between this implementation and the common one?
It has #7018 and

```python
if prefix_cache_hit:
    # NOTE(woosuk): For xformers, the block table should
    # include the entries for the incoming prefill tokens.
    block_table = block_tables[seq_id]
elif ((chunked_prefill_enabled or not is_prompt)
      and block_tables is not None):
    if curr_sliding_window_block == 0:
        block_table = block_tables[seq_id]
    else:
        block_table = block_tables[seq_id][
            -curr_sliding_window_block:]
```

from flash-attn. However, it is based on the common one, not just copied from flash-attn. Theoretically, I could modify the common one, but that might affect other attention backends.
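For reference, a hypothetical standalone rendering of that branch (the function name and signature are invented for illustration and are not the actual vLLM API), i.e. the selection logic that would otherwise have to move into the common builder:

```python
# Hypothetical sketch (names made up, not the real vLLM API): the branch above
# written as a standalone helper, i.e. the logic that would need to live in the
# common metadata builder rather than in each attention backend.
from typing import Dict, List, Optional


def select_block_table(
    block_tables: Optional[Dict[int, List[int]]],
    seq_id: int,
    is_prompt: bool,
    prefix_cache_hit: bool,
    chunked_prefill_enabled: bool,
    curr_sliding_window_block: int,
) -> List[int]:
    """Pick the per-sequence block table following the snippet above."""
    if prefix_cache_hit:
        # On a prefix-cache hit the block table must also cover the
        # incoming prefill tokens.
        return block_tables[seq_id]
    if (chunked_prefill_enabled or not is_prompt) and block_tables is not None:
        if curr_sliding_window_block == 0:
            return block_tables[seq_id]
        # Keep only the blocks inside the current sliding window.
        return block_tables[seq_id][-curr_sliding_window_block:]
    # Otherwise no block table is needed (plain prefill without chunking).
    return []
```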
Can you try to modify the common one? We should avoid introducing this divergence as much as possible.
Done
Signed-off-by: sasha0552 <[email protected]>
…XFormers Signed-off-by: sasha0552 <[email protected]>
OK, thanks for your help! BTW, could you tell me how to use …
You can run vLLM as follows: …

However, it needs to be installed first. On the Ubuntu-based vLLM docker container, this can be done as follows: …

In general, in docker I run vLLM with

```sh
docker run \
  --cap-add=SYS_ADMIN \
  --cap-add=SYS_PTRACE \
  --entrypoint sh \
  --env VLLM_NO_USAGE_STATS=1 \
  --gpus all \
  --ipc host \
  --privileged \
  --restart no \
  --rm \
  --runtime nvidia \
  --security-opt seccomp=unconfined \
  --volume ./hf_cache:/root/.cache/huggingface \
  --volume ./repro.py:/repro.py \
  --volume ./logs:/logs \
  vllm-temp2:latest \
  -c " \
    compute-sanitizer \
      --launch-timeout=60 \
      --log-file=/logs/l1.log \
      --padding=32 \
      --print-limit=0 \
      --save=/logs/l1.xml \
      --save-session-details \
      --target-processes=application-only \
      --tool=memcheck \
      --xml \
      python3 \
      /repro.py \
  "
```

I created …
Hi @sasha0552, thanks for your help! Sorry, I am still a little confused. As far as I know, …
You can simply replace …
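As an illustration, a wrapped invocation could look like the sketch below, reusing the memcheck flags from the docker example above (the model name, port, and log paths are placeholders):

```sh
# Sketch: wrap the usual vLLM OpenAI server entrypoint with compute-sanitizer,
# using the same memcheck flags as in the docker example above.
compute-sanitizer \
    --launch-timeout=60 \
    --log-file=./l1.log \
    --padding=32 \
    --print-limit=0 \
    --save=./l1.xml \
    --save-session-details \
    --target-processes=application-only \
    --tool=memcheck \
    --xml \
    python3 -m vllm.entrypoints.openai.api_server \
        --model mistralai/Ministral-8B-Instruct-2410 \
        --enable-chunked-prefill \
        --enable-prefix-caching \
        --port 8000
```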
Signed-off-by: sasha0552 <[email protected]>
@comaniac All tests passed, could you please merge this PR?
…ix caching, block manager v2 and xformers enabled together (vllm-project#9532) Signed-off-by: sasha0552 <[email protected]>
This PR (in its current form) is intended simply to reproduce, on CI (which has supported GPUs), a crash I've observed locally on a P40 with a particular request sequence, so don't merge it. Eventually (regardless of reproducibility on other hardware) I hope to find the cause and fix the crash, although this may take a long time as I'm not familiar with these parts of vLLM, so any help is welcome.

For certain sequences of requests (you can find one in temp.py; to reproduce manually, it works using an OpenAI-compatible API), vLLM crashes with

CUDA error: an illegal memory access was encountered

somewhere (sometimes, e.g., in the prefix caching kernel). I believe this is not a problem with Triton for Pascal (or my hardware in general), as I have found similar issues about crashes with prefix caching on other hardware.

The crash looks like this...
... or this ...
... or as in the issues.

Based on this (and the test I wrote), I can assume that something in block manager v2 is corrupting the CUDA memory.

Test output on P40