[Model] Add has_weight to RMSNorm and re-enable weights loading tracker for Mamba #10739
Conversation
Signed-off-by: Isotr0py <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
self.weight = torch.ones(hidden_size)
if self.elementwise_affine:
    self.weight = nn.Parameter(self.weight)
Just curious: can we also gain a performance improvement from an rms_norm kernel without the weight multiply for elementwise_affine=False?
My take on this is that the opportunity is pretty small, with RMSNorm typically taking maybe 2% of the time in the forward pass -- you would save on not having to load self.weight from GPU memory, but since it's just a vector this is probably not a significant cost. And since we have custom layernorm kernels, modifying them to skip the weight application is probably not worth the effort.
However, I do think it's worth checking whether we can get a speedup by skipping the weight application in the native forward method (using torch.compile for fusion).
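For concreteness, here is a minimal sketch of a native-path RMSNorm that skips the weight multiply when there is no learnable weight; the class name, signature, and dtype handling are illustrative, not the PR's actual code:

```python
import torch
import torch.nn as nn


class RMSNormSketch(nn.Module):
    """Illustrative RMSNorm whose native forward skips the weight multiply
    entirely when there is no learnable weight."""

    def __init__(self, hidden_size: int, eps: float = 1e-6,
                 elementwise_affine: bool = True) -> None:
        super().__init__()
        self.variance_epsilon = eps
        # Only allocate a learnable weight when it is actually used.
        self.weight = (nn.Parameter(torch.ones(hidden_size))
                       if elementwise_affine else None)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        orig_dtype = x.dtype
        x = x.float()
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        x = (x * torch.rsqrt(variance + self.variance_epsilon)).to(orig_dtype)
        # Skip the extra elementwise multiply when there is no weight;
        # torch.compile can then fuse the remaining ops into one kernel.
        if self.weight is not None:
            x = x * self.weight
        return x
```

Wrapping an instance with torch.compile, e.g. `norm = torch.compile(RMSNormSketch(4096, elementwise_affine=False))`, lets the compiler fuse the normalization into a single kernel without the dead multiply.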
I tested the performance of skipping the weight application in the native forward with torch.compile, and there is a minor speedup (about 4%) on long sequences compared to not skipping. So I think we can skip the weight application in the native forward, although the speedup is small. :)
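A rough sketch of the kind of micro-benchmark this describes, reusing the RMSNormSketch class above; the shapes, iteration counts, and CUDA device are assumptions rather than the actual setup used:

```python
import time

import torch

# Hypothetical micro-benchmark on a long sequence; batch/sequence/hidden sizes
# and iteration counts are arbitrary, and a CUDA device is assumed.
x = torch.randn(1, 16384, 4096, device="cuda", dtype=torch.float16)

for name, affine in (("with weight", True), ("without weight", False)):
    norm = torch.compile(
        RMSNormSketch(4096, elementwise_affine=affine).to("cuda", torch.float16))
    for _ in range(10):  # warmup, also triggers compilation
        norm(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        norm(x)
    torch.cuda.synchronize()
    print(f"{name}: {(time.perf_counter() - start) / 100 * 1e3:.3f} ms/iter")
```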
@tlrmchlsmth Could you please take a look at this PR? Thanks!
Looks good to me, thanks for the ping and sorry I missed it earlier!
Signed-off-by: Isotr0py <[email protected]>
…er for Mamba (vllm-project#10739) Signed-off-by: Isotr0py <[email protected]>
This PR is a follow-up to #10456. It adds elementwise_affine to RMSNorm, similar to the PyTorch implementation, which can be used to handle non-learnable RMSNorm layers without Parameter initialization.
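For illustration, a minimal usage sketch follows (the flag appears as has_weight in the PR title and elementwise_affine in the diff above; the exact keyword name, defaults, and import path here are assumptions):

```python
# Hypothetical usage; the exact keyword name and import path may differ.
from vllm.model_executor.layers.layernorm import RMSNorm

# Learnable norm: the weight is an nn.Parameter, so the weight-loading
# tracker expects a matching tensor in the checkpoint.
learned_norm = RMSNorm(4096, eps=1e-5)

# Non-learnable norm: no Parameter is registered, so nothing has to be
# loaded from the checkpoint for this layer.
fixed_norm = RMSNorm(4096, eps=1e-5, has_weight=False)
```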