diff --git a/posts/sparse_attention/index.html b/posts/sparse_attention/index.html
index 4d700b8..c765350 100644
--- a/posts/sparse_attention/index.html
+++ b/posts/sparse_attention/index.html
@@ -92,11 +92,8 @@
Here, we have defined
-\begin{align*}
-Q_{S_i} &= (W_q x_j)_{j \text{ in } S_i}, \\
-K_{S_i} &= (W_k x_j)_{j \text{ in } S_i}, \\
-V_{S_i} &= (W_v x_j)_{j \text{ in } S_i}.
-\end{align*}
+$ Q_{S_i} = (W_q x_j)_{j \text{ in } S_i}$
+$$ Q_{S_i} = (W_q x_j), K_{S_i} = (W_k x_j), V_{S_i} = (W_v x_j) \text{ for } j \in S_i $$
So how do we define the set of connectivity patterns $S$? Formally, we let $S_i = A_i^{h}$ for head $h$ where $A_i^{h} \subset \{j : j \leq i\}$. It is still no clearer how we pick which indices we should take for a given $S_i$. The original authors consider two key criteria initially:
We should pick $|A_i^h| \propto n^{1/H}$ where $H$ is our total number of heads. This choice is efficient as it ensures the size of the connectivity set scales well with $H$.