Update sparse_attention.md
jonah-ramponi committed Mar 30, 2024
1 parent 5a43286 commit cf3605b
Showing 1 changed file with 4 additions and 4 deletions.
content/posts/sparse_attention.md
@@ -16,9 +16,9 @@ tags: [attention, inference]
Here, we have defined

\begin{align*}
-Q_{S_i} &= (W_q x_j)_{j \in S_i}, \\\\
-K_{S_i} &= (W_k x_j)_{j \in S_i}, \\\\
-V_{S_i} &= (W_v x_j)_{j \in S_i}.
+Q_{S_i} &= (W_q x_j)_{j \text{ in } S_i}, \\\\
+K_{S_i} &= (W_k x_j)_{j \text{ in } S_i}, \\\\
+V_{S_i} &= (W_v x_j)_{j \text{ in } S_i}.
\end{align*}

So how do we define the set of connectivity patterns $S$? Formally, we let $S_i = A_i^{h}$ for head $h$, where $A_i^{h} \subset \{j : j \leq i\}$. It is still not obvious which indices we should take for a given $S_i$. The original authors initially consider two key criteria.
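To make the restriction concrete, here is a minimal NumPy sketch of attention computed only over a connectivity set $S_i$. It is illustrative only: the function name, the `S` list-of-indices representation, and the shapes are assumptions, not something taken from the post or the commit.

```python
import numpy as np

def sparse_attention(x, W_q, W_k, W_v, S):
    """x: (n, d_model). S[i]: indices j <= i that position i may attend to."""
    n = x.shape[0]
    d_k = W_k.shape[1]
    out = np.zeros((n, W_v.shape[1]))
    for i in range(n):
        idx = np.asarray(S[i])
        q_i = x[i] @ W_q                     # query for position i
        K_S = x[idx] @ W_k                   # K_{S_i} = (W_k x_j) for j in S_i
        V_S = x[idx] @ W_v                   # V_{S_i} = (W_v x_j) for j in S_i
        scores = (K_S @ q_i) / np.sqrt(d_k)  # one scaled dot product per j in S_i
        w = np.exp(scores - scores.max())    # softmax over S_i alone
        out[i] = (w / w.sum()) @ V_S
    return out
```

Each position then attends over $|S_i|$ keys rather than all $i$ of them; for example, a sliding window over the previous $l$ positions would be `S = [list(range(max(0, i - l + 1), i + 1)) for i in range(n)]`.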
@@ -44,7 +44,7 @@ Here, $A_i^{(1)}$ simply takes the previous $l$ locations. $A_i^{(2)}$ then take
**Fixed Attention.** Our goal with this approach is to allow specific cells to summarize the previous locations and to propagate this information on to future cells.

\begin{align*}
-A^{(1)}_i &= \Big\{ j : \left\lfloor \frac{j}{l} \right\rfloor = \left\lfloor \frac{i}{l} \right\rfloor \Big\}, \\\\
+A^{(1)}_i &= \Big\{ j : \text{floor}\big(\frac{j}{l}\big) = \text{floor}\big(\frac{i}{l}\big) \Big\}, \\\\
A^{(2)}_i &= \Big\{ j : j \bmod l \in \{ t, t + 1, \ldots, l \} \Big\}, \text{ where } t = l - c \text{ and } c \text{ is a hyperparameter.}
\end{align*}
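As a sketch of what these two patterns pick out (again illustrative; the function name and the boolean-mask representation are assumptions), both sets can be built with simple index arithmetic. Note that since $j \bmod l$ is always less than $l$, membership in $\{t, t+1, \ldots, l\}$ with $t = l - c$ comes down to $j \bmod l \geq l - c$.

```python
import numpy as np

def fixed_attention_masks(n, l, c):
    i = np.arange(n)[:, None]           # query positions, as a column
    j = np.arange(n)[None, :]           # key positions, as a row
    causal = j <= i                     # A_i^h is a subset of {j : j <= i}
    A1 = (j // l == i // l) & causal    # floor(j/l) == floor(i/l): same block as i
    A2 = (j % l >= l - c) & causal      # last c positions of each block
    return A1, A2
```

For example, with `n = 16`, `l = 4`, `c = 1`, the second mask exposes columns 3, 7, 11 and 15 to every later position: those cells summarize their blocks and carry that information forward.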

