[Kernel] Enable Dynamic Per Token fp8 #6547
Shouldn't use_per_token_if_dynamic be set by a scheme or something? I don't see why it should always be true for this case.
In what case would we not want to use dynamic per token?
@mgoin the hypothesis here is that dynamic per-token is an overall win over dynamic per-tensor when supported: higher accuracy, but also easier to fuse RMSNorm + Quant, and fewer dependencies, so better parallelization for the quantize kernels overall. The downside is more scales.
We'll need to benchmark, but IMO this is OK for this PR.
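To make the comparison concrete, here is a rough sketch of the two dynamic granularities in plain PyTorch (a reference illustration, not the kernel in this PR):

```python
import torch

x = torch.randn(8, 4096, dtype=torch.float16)
finfo = torch.finfo(torch.float8_e4m3fn)

# Dynamic per-tensor: a single scale derived from the global max of the activation.
per_tensor_scale = x.abs().max().float().clamp(min=1e-12) / finfo.max
x_fp8_tensor = (x.float() / per_tensor_scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)

# Dynamic per-token: one scale per row (token). Higher fidelity, more scales,
# and no global reduction -- each row can be quantized independently.
per_token_scale = x.abs().amax(dim=-1, keepdim=True).float().clamp(min=1e-12) / finfo.max
x_fp8_token = (x.float() / per_token_scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
```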
I assume there is a decent performance hit compared to producing/using a per-tensor scale -- is this not the case?
I wouldn't assume a decent performance hit -- the overheads from the CUTLASS epilogues are quite small (like 3%), and there are advantages to doing per-token when quantizing as well. We'll measure.
I changed my mind -- as I am implementing the per-token/per-channel wrapper for torch._scaled_mm, I think it would be nicer to grab this from a config somewhere.
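For reference, the per-token/per-channel semantics such a wrapper has to provide look roughly like this (a dequantize-then-matmul sketch in plain PyTorch, not the actual torch._scaled_mm call, whose signature has changed across PyTorch versions):

```python
import torch

def scaled_mm_reference(a_fp8: torch.Tensor, b_fp8: torch.Tensor,
                        scale_a: torch.Tensor, scale_b: torch.Tensor,
                        out_dtype=torch.float16) -> torch.Tensor:
    """Row-wise (per-token) x column-wise (per-channel) scaled matmul.

    a_fp8: [M, K] fp8 activations, scale_a: [M, 1] per-token scales
    b_fp8: [K, N] fp8 weights,     scale_b: [1, N] per-channel scales
    """
    a = a_fp8.to(torch.float32) * scale_a  # dequantize each row with its own scale
    b = b_fp8.to(torch.float32) * scale_b  # dequantize each output channel
    return (a @ b).to(out_dtype)
```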
I tried to pass this from the scheme. However, this became very hard because I needed to check whether CUTLASS is supported in multiple places.
I think that deciding this here is the right place.
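For illustration, the kind of capability check that ends up being needed looks roughly like this (a hypothetical sketch, not the helper used in this PR; native fp8 GEMM generally requires Ada- or Hopper-class GPUs):

```python
import torch

def cutlass_fp8_supported_sketch() -> bool:
    # Hypothetical check: per-token scaled fp8 GEMMs need hardware fp8 support,
    # roughly compute capability >= 8.9 (Ada) or 9.0 (Hopper).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Deciding at the call site keeps this check in one place instead of in every scheme.
use_per_token_if_dynamic = cutlass_fp8_supported_sketch()
```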