Update sparse_attention.md
jonah-ramponi committed Mar 30, 2024
1 parent 5a43286 commit cf3605b
Showing 1 changed file with 4 additions and 4 deletions.
content/posts/sparse_attention.md
@@ -16,9 +16,9 @@ tags: [attention, inference]
Here, we have defined

\begin{align*}
-Q_{S_i} &= (W_q x_j)_{j \in S_i}, \\\\
-K_{S_i} &= (W_k x_j)_{j \in S_i}, \\\\
-V_{S_i} &= (W_v x_j)_{j \in S_i}.
+Q_{S_i} &= (W_q x_j)_{j \text{ in } S_i}, \\\\
+K_{S_i} &= (W_k x_j)_{j \text{ in } S_i}, \\\\
+V_{S_i} &= (W_v x_j)_{j \text{ in } S_i}.
\end{align*}

So how do we define the set of connectivity patterns $S$? Formally, we let $S_i = A_i^{h}$ for head $h$, where $A_i^{h} \subset \{j : j \leq i\}$. It is still not obvious which indices we should take for a given $S_i$. The original authors initially consider two key criteria.
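To make the restriction concrete, here is a minimal NumPy sketch of attention computed only over a connectivity set $S_i$. It is illustrative only: the function name, the `S` list-of-indices representation, and the shapes are assumptions, not something taken from the post or the commit.

```python
import numpy as np

def sparse_attention(x, W_q, W_k, W_v, S):
    """x: (n, d_model). S[i]: indices j <= i that position i may attend to."""
    n = x.shape[0]
    d_k = W_k.shape[1]
    out = np.zeros((n, W_v.shape[1]))
    for i in range(n):
        idx = np.asarray(S[i])
        q_i = x[i] @ W_q                     # query for position i
        K_S = x[idx] @ W_k                   # K_{S_i} = (W_k x_j) for j in S_i
        V_S = x[idx] @ W_v                   # V_{S_i} = (W_v x_j) for j in S_i
        scores = (K_S @ q_i) / np.sqrt(d_k)  # one scaled dot product per j in S_i
        w = np.exp(scores - scores.max())    # softmax over S_i alone
        out[i] = (w / w.sum()) @ V_S
    return out
```

Each position then attends over $|S_i|$ keys rather than all $i$ of them; for example, a sliding window over the previous $l$ positions would be `S = [list(range(max(0, i - l + 1), i + 1)) for i in range(n)]`.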
@@ -44,7 +44,7 @@ Here, $A_i^{(1)}$ simply takes the previous $l$ locations. $A_i^{(2)}$ then take
**Fixed Attention.** Our goal with this approach is to allow specific cells to summarize the previous locations and to propagate this information on to future cells.

\begin{align*}
-A^{(1)}_i &= \Big\{ j : \left\lfloor \frac{j}{l} \right\rfloor = \left\lfloor \frac{i}{l} \right\rfloor \Big\}, \\\\
+A^{(1)}_i &= \Big\{ j : \text{floor}\big(\frac{j}{l}\big) = \text{floor}\big(\frac{i}{l}\big) \Big\}, \\\\
A^{(2)}_i &= \Big\{ j : j \bmod l \in \{ t, t + 1, \ldots, l \} \Big\}, \text{ where } t = l - c \text{ and } c \text{ is a hyperparameter.}
\end{align*}
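As a sketch of what these two patterns pick out (again illustrative; the function name and the boolean-mask representation are assumptions), both sets can be built with simple index arithmetic. Note that since $j \bmod l$ is always less than $l$, membership in $\{t, t+1, \ldots, l\}$ with $t = l - c$ comes down to $j \bmod l \geq l - c$.

```python
import numpy as np

def fixed_attention_masks(n, l, c):
    i = np.arange(n)[:, None]           # query positions, as a column
    j = np.arange(n)[None, :]           # key positions, as a row
    causal = j <= i                     # A_i^h is a subset of {j : j <= i}
    A1 = (j // l == i // l) & causal    # floor(j/l) == floor(i/l): same block as i
    A2 = (j % l >= l - c) & causal      # last c positions of each block
    return A1, A2
```

For example, with `n = 16`, `l = 4`, `c = 1`, the second mask exposes columns 3, 7, 11 and 15 to every later position: those cells summarize their blocks and carry that information forward.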

