Feat (nn/sdpa): quantization of scaled dot-product attention #1090

Merged: 25 commits into Xilinx:dev from feat/quant_sdpa on Dec 6, 2024

Conversation

@nickfraser (Collaborator) commented Nov 8, 2024

Reason for this PR

Make it easier for users to quantize attention layers.

Changes Made in this PR

Achieved by providing:

  • A modular equivalent to the torch.nn.functional.scaled_dot_product_attention functional
  • A quantized version of this module
  • Code to convert between the three options (see the sketch after this list)
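For context, a minimal sketch of what a modular, drop-in equivalent to the functional could look like (illustrative only; the actual class name and signature in Brevitas may differ):

```python
import torch
import torch.nn.functional as F


class ScaledDotProductAttention(torch.nn.Module):
    """Module wrapper around the SDPA functional, so it can be targeted by
    module-level replacement and quantization passes (illustrative sketch)."""

    def forward(self, query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False):
        return F.scaled_dot_product_attention(
            query, key, value, attn_mask=attn_mask, dropout_p=dropout_p, is_causal=is_causal)


# Drop-in usage: tensors shaped (batch, heads, seq_len, head_dim).
q, k, v = (torch.randn(1, 4, 16, 32) for _ in range(3))
out = ScaledDotProductAttention()(q, k, v)  # same result as F.scaled_dot_product_attention(q, k, v)
```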

Testing Summary

Tests:

  • Layer replacement test in LLM entry-point
  • Basic accuracy test for OPT
  • Basic graph replacement test (covered by LLM entry-point test)
  • SDPA & Quant SDPA forward tests

Risk Highlight

Adapted from pseudocode in PyTorch's documentation. Otherwise, this change barely touches existing code, so it shouldn't break any existing Brevitas features.
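The reference computation in PyTorch's documentation reduces to two matmuls around a softmax; a simplified sketch (no masking or dropout), just to show the operations involved:

```python
import math
import torch


def sdpa_reference(query, key, value):
    """Simplified SDPA reference (no mask, no dropout), following the
    computation described in PyTorch's documentation."""
    scale = 1.0 / math.sqrt(query.size(-1))
    # Q @ K^T matmul, scaled, then the attention softmax.
    attn_weight = torch.softmax(query @ key.transpose(-2, -1) * scale, dim=-1)
    # Softmax output @ V matmul.
    return attn_weight @ value


# Sanity check against the fused functional.
q, k, v = (torch.randn(1, 4, 16, 32) for _ in range(3))
assert torch.allclose(
    sdpa_reference(q, k, v),
    torch.nn.functional.scaled_dot_product_attention(q, k, v),
    atol=1e-5)
```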

  • This PR includes code from another work (please detail).
  • This PR contains API-breaking changes.
  • This PR depends on work in another PR (please provide links/details).
  • This PR introduces new dependencies (please detail).
  • There are coverage gaps not covered by tests.
  • Documentation updates required in subsequent PR.

Checklist

  • Code comments added to any hard-to-understand areas, if applicable.
  • Changes generate no new warnings.
  • Updated any relevant tests, if applicable.
  • No conflicts with destination dev branch.
  • I reviewed my own code changes.
  • Initial CI/CD passing.
  • 1+ reviews given, and any review issues addressed and approved.
  • Post-review full CI/CD passing.

@nickfraser nickfraser self-assigned this Nov 8, 2024
@nickfraser nickfraser added the next release PRs which should be merged for the next release label Nov 8, 2024
@nickfraser nickfraser marked this pull request as ready for review November 20, 2024 17:56
@nickfraser nickfraser requested a review from Giuseppe5 November 20, 2024 17:56
@nickfraser (Collaborator, Author) commented:

We should merge #1088 before this.

@nickfraser nickfraser requested review from Giuseppe5 and removed request for Giuseppe5 November 28, 2024 16:55
@nickfraser (Collaborator, Author) commented:

Note: when --quant-sdpa is applied, the inputs to the attention matrix multiplies (i.e., Q, K, V and the attention-softmax output) are quantized in the same way as inputs to linear layers. This differs slightly from the current version of the stable diffusion example, which can use different quantization formats for attention than for linear/conv layers.
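As an illustration of that placement (a sketch of the idea, not the code in this PR; the quantizer choices here are assumptions), each matmul operand, including the softmax output, passes through an activation quantizer of the kind used for linear-layer inputs:

```python
import math
import torch
from brevitas.nn import QuantIdentity
from brevitas.quant.scaled_int import Int8ActPerTensorFloat


class QuantSDPASketch(torch.nn.Module):
    """Sketch of quantizing the SDPA matmul inputs (Q, K, V and the softmax
    output) with the same kind of activation quantizer used for linear-layer
    inputs. Illustration of the idea only, not this PR's implementation."""

    def __init__(self, act_quant=Int8ActPerTensorFloat):
        super().__init__()
        self.q_quant = QuantIdentity(act_quant=act_quant)
        self.k_quant = QuantIdentity(act_quant=act_quant)
        self.v_quant = QuantIdentity(act_quant=act_quant)
        self.attn_quant = QuantIdentity(act_quant=act_quant)  # softmax output

    def forward(self, query, key, value):
        q, k, v = self.q_quant(query), self.k_quant(key), self.v_quant(value)
        scale = 1.0 / math.sqrt(q.size(-1))
        attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
        return self.attn_quant(attn) @ v
```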

@Giuseppe5 (Collaborator) left a comment:

I love it

@nickfraser nickfraser merged commit 85c1626 into Xilinx:dev Dec 6, 2024
393 of 396 checks passed
@nickfraser nickfraser deleted the feat/quant_sdpa branch December 6, 2024 09:45