Enable fp16/bf16 absmax #1672

Draft
wants to merge 11 commits into
base: main
from

jiqing-feng
Contributor

@jiqing-feng jiqing-feng commented Jun 6, 2025

Hi @matthewdouglas, enabling fp16/bf16 absmax on XPU can give a 20% speed-up in our QLoRA case. Please review it. I am checking whether any tests fail on CUDA and will let you know once that is done. BTW, there are a lot of tests...

@jiqing-feng jiqing-feng marked this pull request as ready for review June 6, 2025 08:50
Signed-off-by: jiqing-feng <[email protected]>
@jiqing-feng jiqing-feng force-pushed the absmax branch 2 times, most recently from 69b2146 to 50ee994 Compare June 9, 2025 02:45
@jiqing-feng
Contributor Author

Hi @matthewdouglas. I kept the CUDA op the same as before and only enabled half-precision absmax on CPU/XPU. This PR passes all CUDA tests on an A100 and all CPU tests on an Intel Xeon node. On XPU, around 20 tests fail because of a compile error, but that error was not introduced by this PR. Please review this PR. Thanks!
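
To illustrate what keeping absmax in half precision means, here is a minimal sketch in plain Python. It is not the bitsandbytes kernel: the linear 8-bit mapping, the helper names (`to_fp16`, `quantize_blockwise`, `dequantize_blockwise`, `mean_abs_error`), and the use of the `struct` module's `"e"` format to emulate fp16 rounding are all assumptions made for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def quantize_blockwise(values, blocksize, half_absmax):
    """Return (codes, absmax) with one absmax scale per block.

    Assumption: a simple linear signed 8-bit code, not the library's code map.
    """
    codes, absmax = [], []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        scale = max(abs(v) for v in block)
        if half_absmax:
            scale = to_fp16(scale)  # keep the per-block scale in fp16
        absmax.append(scale)
        for v in block:
            code = round(v / scale * 127) if scale else 0
            codes.append(max(-127, min(127, code)))  # clamp after scale rounding
    return codes, absmax


def dequantize_blockwise(codes, absmax, blocksize):
    """Map each code back to a float using its block's absmax scale."""
    return [c / 127 * absmax[i // blocksize] for i, c in enumerate(codes)]


def mean_abs_error(values, blocksize, half_absmax):
    """Mean absolute round-trip error for the given absmax precision."""
    codes, absmax = quantize_blockwise(values, blocksize, half_absmax)
    restored = dequantize_blockwise(codes, absmax, blocksize)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)
```

Calling `mean_abs_error(values, 64, True)` and `mean_abs_error(values, 64, False)` on the same data gives a rough feel for how much extra round-trip error the half-precision scale adds, which is what the test thresholds later in this PR are checking.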

Signed-off-by: jiqing-feng <[email protected]>
@matthewdouglas matthewdouglas added this to the v0.47.0 milestone Jun 9, 2025

github-actions bot commented Jun 9, 2025

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@@ -137,11 +135,10 @@ def test_dynamic_blockwise_quantization(self, device, dtype, nested, blocksize,
         abserr = sum(diffs) / len(diffs)
         relerr = sum(reldiffs) / len(reldiffs)
         if signed:
-            threshold_abserr = 0.0036 if device in ("cpu", "xpu") and (F.ipex_cpu or F.ipex_xpu) else 0.0035
             assert abserr < 0.0036
Contributor Author

This is because threshold_abserr is not used.

             assert abserr < 0.0036
             assert relerr < 0.015
         else:
-            assert abserr < 0.00175 if device in ("cpu", "xpu") and (F.ipex_cpu or F.ipex_xpu) else 0.0023
+            assert abserr < 0.0023
Contributor Author

We have no reason to have a tighter threshold for ipex, otherwise the half-precision check cannot pass.
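
A side observation on the replaced assertion, offered as a sketch rather than a statement of the author's intent: Python parses `assert a < x if cond else y` as `assert ((a < x) if cond else y)`. When the condition is false, the statement asserts the constant `0.0023`, which is truthy, so the check always passes. The helper names and the `on_ipex` flag below are hypothetical stand-ins for the device/IPEX condition in the test.

```python
def old_style_check(abserr: float, on_ipex: bool) -> bool:
    """Mirror the shape of the replaced assertion; report whether it passes."""
    try:
        # Parsed as: assert ((abserr < 0.00175) if on_ipex else 0.0023)
        assert abserr < 0.00175 if on_ipex else 0.0023
    except AssertionError:
        return False
    return True


def new_style_check(abserr: float) -> bool:
    """Mirror the simplified assertion with a single threshold."""
    try:
        assert abserr < 0.0023
    except AssertionError:
        return False
    return True


if __name__ == "__main__":
    # A large error slips through the old form when the IPEX branch is not taken.
    print(old_style_check(1.0, on_ipex=False))
    print(new_style_check(1.0))
```

Under this reading, the single-threshold form both drops the tighter IPEX bound and makes the `0.0023` bound effective on every device.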

@jiqing-feng
Copy link
Contributor Author

I detected a conflict with the XPU SYCL path, so this PR is on hold until the XPU SYCL path is merged.

@jiqing-feng jiqing-feng marked this pull request as draft June 17, 2025 05:58
2 participants