[Bugfix] Remove NVFP4 scales assertions to fix load_format=dummy #18861
Conversation
Signed-off-by: mgoin <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of CI tests runs automatically to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Signed-off-by: mgoin <[email protected]>
In Marlin NVFP4, I changed the scales from FP8-S1E4M3 to a special FP8-S0E5M3 format to speed up dequantization. This assumed the original scales to be >= 0; if we can always ensure that for real weights, this assertion can be removed.
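A minimal sketch of the issue being described, assuming a PyTorch-level view of the repack (the function name repack_scales_for_marlin and the bit manipulation are illustrative, not vLLM's actual kernel code): an unsigned S0E5M3-style layout has no sign bit, so the conversion only round-trips when every original E4M3 scale is non-negative, which holds for real checkpoints but not for randomly initialized dummy scales.

```python
# Illustrative sketch only -- not the actual Marlin NVFP4 repack code.
import torch

def repack_scales_for_marlin(scales_e4m3: torch.Tensor) -> torch.Tensor:
    """Hypothetical repack of FP8-E4M3 scales into an unsigned S0E5M3-style layout."""
    scales_f32 = scales_e4m3.to(torch.float32)
    # This is the kind of assertion the PR removes: real checkpoints satisfy it,
    # but --load-format dummy fills scales with random (possibly negative)
    # values and trips it.
    assert (scales_f32 >= 0).all(), "NVFP4 scales must be non-negative"
    # The real kernel does a bit-level E4M3 -> S0E5M3 conversion here;
    # stripping the sign bit stands in for it in this sketch.
    return scales_e4m3.view(torch.uint8) & 0x7F

# Dummy-loaded scales are random, so roughly half are negative:
dummy_scales = torch.randn(16).to(torch.float8_e4m3fn)
try:
    repack_scales_for_marlin(dummy_scales)
except AssertionError as exc:
    print("assert hit, as with --load-format dummy:", exc)
```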
Signed-off-by: mgoin <[email protected]>
Currently on 0.9.0 when running
vllm serve nvidia/DeepSeek-R1-FP4 --quantization modelopt_fp4 -tp 8 --load-format dummy
on A100, we hit an assertion error. Then on B200, when running
vllm serve nvidia/DeepSeek-R1-FP4 --quantization modelopt_fp4 -tp 4 --load-format dummy
we hit another assertion error. These asserts are hit because dummy weights are likely to have random values. It doesn't seem this check is truly necessary.
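As a hedged illustration of why dummy loading trips these checks (this is not vLLM's actual dummy-loader code, just a sketch of the idea that dummy weights are randomly initialized in place rather than read from the checkpoint):

```python
# Sketch only: dummy loading fills parameters with random values instead of
# reading the checkpoint, so sign invariants that hold for trained NVFP4
# scales (>= 0) do not hold for dummy ones.
import torch

def load_dummy(param: torch.Tensor, low: float = -1e-3, high: float = 1e-3) -> None:
    # Roughly what a dummy loader does: in-place random init, no checkpoint read.
    param.uniform_(low, high)

weight_scale = torch.empty(1024)
load_dummy(weight_scale)
print("negative dummy scales:", int((weight_scale < 0).sum()))  # about half
```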