Feat (nn/sdpa): quantization of scaled dot-product attention #1090

Merged: 25 commits into Xilinx:dev from feat/quant_sdpa on Dec 6, 2024

Conversation

@nickfraser (Collaborator) commented Nov 8, 2024

Reason for this PR

Make it easier for users to quantize attention layers.

Changes Made in this PR

Achieved by providing:

  • A modular equivalent to the torch.nn.functional.scaled_dot_product_attention functional
  • A quantized version of this module
  • Code to convert between the three options (see the sketch after this list)
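For context, a minimal sketch of what a modular, drop-in equivalent to the functional could look like (illustrative only; the actual class name and signature in Brevitas may differ):

```python
import torch
import torch.nn.functional as F


class ScaledDotProductAttention(torch.nn.Module):
    """Module wrapper around the SDPA functional, so it can be targeted by
    module-level replacement and quantization passes (illustrative sketch)."""

    def forward(self, query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False):
        return F.scaled_dot_product_attention(
            query, key, value, attn_mask=attn_mask, dropout_p=dropout_p, is_causal=is_causal)


# Drop-in usage: tensors shaped (batch, heads, seq_len, head_dim).
q, k, v = (torch.randn(1, 4, 16, 32) for _ in range(3))
out = ScaledDotProductAttention()(q, k, v)  # same result as F.scaled_dot_product_attention(q, k, v)
```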

Testing Summary

Tests:

  • Layer replacement test in LLM entry-point
  • Basic accuracy test for OPT
  • Basic graph replacement test (covered by LLM entry-point test)
  • SDPA & Quant SDPA forward tests

Risk Highlight

Adapted from pseudocode in PyTorch's documentation. Otherwise, this change barely touches existing code, so it shouldn't break any existing Brevitas features.
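The reference computation in PyTorch's documentation reduces to two matmuls around a softmax; a simplified sketch (no masking or dropout), just to show the operations involved:

```python
import math
import torch


def sdpa_reference(query, key, value):
    """Simplified SDPA reference (no mask, no dropout), following the
    computation described in PyTorch's documentation."""
    scale = 1.0 / math.sqrt(query.size(-1))
    # Q @ K^T matmul, scaled, then the attention softmax.
    attn_weight = torch.softmax(query @ key.transpose(-2, -1) * scale, dim=-1)
    # Softmax output @ V matmul.
    return attn_weight @ value


# Sanity check against the fused functional.
q, k, v = (torch.randn(1, 4, 16, 32) for _ in range(3))
assert torch.allclose(
    sdpa_reference(q, k, v),
    torch.nn.functional.scaled_dot_product_attention(q, k, v),
    atol=1e-5)
```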

  • This PR includes code from another work (please detail).
  • This PR contains API-breaking changes.
  • This PR depends on work in another PR (please provide links/details).
  • This PR introduces new dependencies (please detail).
  • There are coverage gaps not covered by tests.
  • Documentation updates required in subsequent PR.

Checklist

  • Code comments added to any hard-to-understand areas, if applicable.
  • Changes generate no new warnings.
  • Updated any relevant tests, if applicable.
  • No conflicts with destination dev branch.
  • I reviewed my own code changes.
  • Initial CI/CD passing.
  • 1+ reviews given, and any review issues addressed and approved.
  • Post-review full CI/CD passing.

@nickfraser nickfraser self-assigned this Nov 8, 2024
@nickfraser nickfraser added the next release PRs which should be merged for the next release label Nov 8, 2024
@nickfraser nickfraser marked this pull request as ready for review November 20, 2024 17:56
@nickfraser nickfraser requested a review from Giuseppe5 November 20, 2024 17:56
@nickfraser (Collaborator, Author) commented:

We should merge #1088 before this.

@nickfraser nickfraser requested review from Giuseppe5 and removed request for Giuseppe5 November 28, 2024 16:55
@nickfraser (Collaborator, Author) commented:

Note: when --quant-sdpa is applied, the inputs to the attention matrix multiplies (i.e., Q, K, V and the attention-softmax output) are quantized in the same way as inputs to linear layers. This differs slightly from the current version of the stable diffusion example, which can use different quantization formats for attention than for linear/conv layers.
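As an illustration of that placement (a sketch of the idea, not the code in this PR; the quantizer choices here are assumptions), each matmul operand, including the softmax output, passes through an activation quantizer of the kind used for linear-layer inputs:

```python
import math
import torch
from brevitas.nn import QuantIdentity
from brevitas.quant.scaled_int import Int8ActPerTensorFloat


class QuantSDPASketch(torch.nn.Module):
    """Sketch of quantizing the SDPA matmul inputs (Q, K, V and the softmax
    output) with the same kind of activation quantizer used for linear-layer
    inputs. Illustration of the idea only, not this PR's implementation."""

    def __init__(self, act_quant=Int8ActPerTensorFloat):
        super().__init__()
        self.q_quant = QuantIdentity(act_quant=act_quant)
        self.k_quant = QuantIdentity(act_quant=act_quant)
        self.v_quant = QuantIdentity(act_quant=act_quant)
        self.attn_quant = QuantIdentity(act_quant=act_quant)  # softmax output

    def forward(self, query, key, value):
        q, k, v = self.q_quant(query), self.k_quant(key), self.v_quant(value)
        scale = 1.0 / math.sqrt(q.size(-1))
        attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
        return self.attn_quant(attn) @ v
```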

@Giuseppe5 (Collaborator) left a comment:

I love it

@nickfraser nickfraser merged commit 85c1626 into Xilinx:dev Dec 6, 2024
393 of 396 checks passed
@nickfraser nickfraser deleted the feat/quant_sdpa branch December 6, 2024 09:45