Updated site
jonah-ramponi committed Mar 30, 2024
1 parent acd52f5 commit e17a899
Showing 1 changed file with 2 additions and 5 deletions.
7 changes: 2 additions & 5 deletions posts/sparse_attention/index.html
@@ -92,11 +92,8 @@ <h1 class="title">Sparse Attention</h1>
\text{attention}(Q,K,V, S_i) = \text{softmax}\Big( \frac{(Q_{S_i}) K^T_{S_i}}{\sqrt{d_k}} \Big) V_{S_i}.
\end{equation*}</p>
<p>Here, we have defined</p>
-<p>\begin{align*}
-Q_{S_i} &amp;= (W_q x_j)<em>{j \text{ in } S_i}, \\
-K</em>{S_i} &amp;= (W_k x_j)<em>{j \text{ in } S_i}, \\
-V</em>{S_i} &amp;= (W_v x_j)_{j \text{ in } S_i}.
-\end{align*}</p>
+<p>$ Q_{S_i} = (W_q x_j)_{j \text{ in } S_i}$</p>
+<p>$$ Q_{S_i} = (W_q x_j), K_{S_i} = (W_k x_j), V_{S_i} = (W_v x_j) \text{ for } j \in S_i $$</p>
<p>So how do we define the set of connectivity patterns $S$? Formally, we let $S_i = A_i^{h}$ for head $h$, where $A_i^{h} \subset \{j : j \leq i\}$, so each position attends only to itself and earlier positions. This still does not say which indices to include in a given $S_i$. The original authors begin with two key criteria:</p>
<h4 id="criteria-1">Criteria 1</h4>
<p>We should pick $|A_i^h| \propto n^{1/H}$, where $H$ is our total number of heads. This choice is efficient because each of the $n$ positions then attends to only $O(n^{1/H})$ others, so the cost per head is $O(n^{1+1/H})$ rather than the $O(n^2)$ of full attention.</p>
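To make the restricted attention and the size criterion concrete, here is a minimal pure-Python sketch. It assumes a single query row for position $i$ attending over the keys and values indexed by $S_i$, which is one common reading of the formula above. The helper names `sparse_attention_row` and `local_window` are hypothetical, and the local-window pattern is only one illustrative way to obtain a causal set of size about $n^{1/H}$; it is not claimed to be the pattern used by the original authors.

```python
import math


def sparse_attention_row(q_i, K, V, S_i):
    """Attention output for one query q_i, restricted to the rows of K and V in S_i."""
    d_k = len(q_i)
    # Scaled dot-product scores, computed only for positions j in S_i.
    scores = [sum(a * b for a, b in zip(q_i, K[j])) / math.sqrt(d_k) for j in S_i]
    # Numerically stable softmax over the selected positions.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the selected value rows.
    d_v = len(V[0])
    return [sum(w * V[j][c] for w, j in zip(weights, S_i)) for c in range(d_v)]


def local_window(i, n, H):
    """Hypothetical pattern: the last ceil(n**(1/H)) positions up to and including i."""
    size = math.ceil(n ** (1.0 / H))
    return list(range(max(0, i - size + 1), i + 1))


# Usage: n = 16 positions, H = 2 heads, so each set holds at most 4 indices.
S_10 = local_window(10, 16, 2)
K = [[1.0, 0.0] for _ in range(16)]
V = [[float(j)] for j in range(16)]
out = sparse_attention_row([1.0, 0.0], K, V, S_10)
```

Because every index returned by `local_window` satisfies $j \leq i$, the set respects the causal constraint $A_i^h \subset \{j : j \leq i\}$. The work per query grows with the size of $S_i$ rather than with $n$.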