Commit

improvements
jonah-ramponi committed Mar 30, 2024
1 parent 1d3ec9d commit c8bf25c
Showing 6 changed files with 123 additions and 192 deletions.
2 changes: 1 addition & 1 deletion content/posts/flash_attention.md
@@ -6,7 +6,7 @@ tldr: Reduce the memory usage used to compute exact attention.
draft: false
tags: [attention, inference]
---
- The goal of [*Flash Attention*](https://arxiv.org/pdf/2205.14135.pdf) is to compute the attention value with fewer high bandwidth memory read / writes. The approach has since been refined in [*Flash Attention 2*](https://arxiv.org/pdf/2307.08691.pdf).
+ The goal of ![*Flash Attention*](https://arxiv.org/pdf/2205.14135.pdf) is to compute the attention value with fewer high bandwidth memory read / writes. The approach has since been refined in ![*Flash Attention 2*](https://arxiv.org/pdf/2307.08691.pdf).

We will split the attention inputs $Q,K,V$ into blocks. Each block will be handled separately, and attention will therefore be computed with respect to each block. With the correct scaling, adding the outputs from each block gives us the same attention value we would get by computing everything all at once.
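
For intuition, here is a minimal NumPy sketch of that block-by-block computation with rescaling. The `blockwise_attention` function and its `block_size` argument are illustrative assumptions, not the paper's fused kernel: it keeps a running row-wise maximum and softmax denominator, rescales the partial output whenever the maximum changes, and so recovers exactly $\text{softmax}(QK^T/\sqrt{d})\,V$.

```python
import numpy as np

def blockwise_attention(Q, K, V, block_size=64):
    """Exact attention computed one key/value block at a time (illustrative sketch)."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, V.shape[1]))        # running (unnormalised) output
    row_max = np.full(n, -np.inf)          # running max of scores per query row
    row_sum = np.zeros(n)                  # running softmax denominator per row

    for start in range(0, K.shape[0], block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        S = (Q @ Kb.T) * scale               # scores against this block only
        new_max = np.maximum(row_max, S.max(axis=1))
        P = np.exp(S - new_max[:, None])     # numerically stable block weights
        rescale = np.exp(row_max - new_max)  # correct everything accumulated so far
        row_sum = row_sum * rescale + P.sum(axis=1)
        out = out * rescale[:, None] + P @ Vb
        row_max = new_max

    return out / row_sum[:, None]          # matches softmax(Q K^T / sqrt(d)) V
```

On random inputs this should agree with a one-shot `softmax(Q @ K.T / np.sqrt(d)) @ V` up to floating-point precision, which is the sense in which the block-by-block result is exact.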

