diff --git a/content/posts/intro_to_attention.md b/content/posts/intro_to_attention.md index a15c4ee..35eeb93 100644 --- a/content/posts/intro_to_attention.md +++ b/content/posts/intro_to_attention.md @@ -20,16 +20,16 @@ The first thing the LLM will do is split this input into tokens. A token is just In this example we have $(n = 7)$ tokens. Importantly, from our model's point of view, our input size is defined by the number of tokens instead of words. A numerical representation (vector representation) of each token is now found. Finding this vector representation is called producing an embedding of the token. The token *$\colorbox{red}{ What}$* might get tokenized as follows \begin{equation} - \text{tokenizer}(\textit{\colorbox{red}{What}}) \rightarrow \begin{pmatrix} -0.4159 \\\\ -0.5147 \\\\ 0.5690 \\\\ \vdots \\\\ -0.2577 \\\\ 0.5710 \\\\ \end{pmatrix} + \text{tokenizer}(\textit{\colorbox{red}{What}}) \rightarrow \begin{pmatrix} -0.4159 \\\\ \vdots \\\\ 0.5710 \\\\ \end{pmatrix} \end{equation} The length of each of our embeddings, these vector outputs of our tokenizer, are the same regardless of the number of characters in our token. Let us denote this length $d_{\text{model}}$. So after we embed each token in our input sequence with our tokenizer we are left with $$ -\begin{pmatrix} -0.415 \\\\ -0.514 \\\\ 0.569 \\\\ \vdots \\\\ -0.257 \\\\ 0.571 \\\\ \end{pmatrix} -\begin{pmatrix} -0.130 \\\\ -0.464 \\\\ 0.23 \\\\ \vdots \\\\ -0.154 \\\\ 0.192 \\\\ \end{pmatrix} +\begin{pmatrix} -0.415 \\\\ \vdots \\\\ 0.571 \\\\ \end{pmatrix} +\begin{pmatrix} -0.130 \\\\ \vdots \\\\ 0.192 \\\\ \end{pmatrix} , \dots , -\begin{pmatrix} 0.127 \\\\ 0.453 \\\\ 0.110 \\\\ \vdots \\\\ -0.155 \\\\ 0.484 \\\\ \end{pmatrix} +\begin{pmatrix} 0.127 \\\\ \vdots \\\\ 0.484 \\\\ \end{pmatrix} $$ @@ -38,8 +38,8 @@ This output is now passed through a *positional encoder*. 
Broadly, this is usefu The only thing that matters for now, is that each of our numerical representations (vectors) are slightly altered. For the numerical representation of the token ``$\colorbox{red}{ What}$" that we get from our embedding model, it might look something like: \begin{equation} - \text{positional encoder}\Bigg(\begin{pmatrix} -0.415 \\\\ -0.514 \\\\ \vdots \\\\ -0.257 \\\\ 0.571 \\\\ \end{pmatrix}\Bigg) = - \begin{pmatrix} -0.424 \\\\ -0.574 \\\\ \vdots \\\\ -0.235 \\\\ 0.534 \\\\ \end{pmatrix} + \text{positional encoder}\Bigg(\begin{pmatrix} -0.415 \\\\ \vdots \\\\ 0.571 \\\\ \end{pmatrix}\Bigg) = + \begin{pmatrix} -0.424 \\\\ \vdots \\\\ 0.534 \\\\ \end{pmatrix} \end{equation} Importantly, the positional encoder does not alter the length of our vector, $d_{\text{model}}$. It simply tweaks the values slightly. So far, we entered our prompt: @@ -89,7 +89,8 @@ The top row is the first vector output of our positional encoding. The second ro M = \Big( \text{number of tokens in input} \times \text{length of embedding} \Big) = \Big( n \times d_{\text{model}} \Big). \end{equation} -**Introduction To Self Attention.** At a high level, self-attention aims to evaluate the importance of each element in a sequence with respect to all other elements and use this to compute a representation of the sequence. All it really does is compute a weighted average of input vectors to produce output vectors. Mathematically, for an input sequence of vectors $x = (x_1, \dots ,x_{n})$ it will return some sequence of vectors, $y = (y_1,\dots,y_m)$ such that +### Introduction To Self Attention. +At a high level, self-attention aims to evaluate the importance of each element in a sequence with respect to all other elements and use this to compute a representation of the sequence. All it really does is compute a weighted average of input vectors to produce output vectors. 
Mathematically, for an input sequence of vectors $x = (x_1, \dots ,x_{n})$ it will return some sequence of vectors, $y = (y_1,\dots,y_m)$ such that \begin{equation} y_i = \sum_{j = 1}^{{n}} w_{ij} \cdot x_j, \text{ } \forall 1 \leq i \leq m. @@ -97,7 +98,7 @@ The top row is the first vector output of our positional encoding. The second ro for some mapping $w_{ij}$. The challenge is in figuring out how we should define our mapping $w_{ij}$. Let's look at the first way $w_{ij}$ was defined, introduced in [Attention is All You Need](https://arxiv.org/pdf/1706.03762.pdf). -**Scaled Dot Product Self Attention.** To compute scaled dot product self attention, we will use the matrix $M$ with rows corresponding to the positionally encoded vectors. $M$ has dimensions $(n \times d_{\text{model}})$. +### Scaled Dot Product Self Attention. To compute scaled dot product self attention, we will use the matrix $M$ with rows corresponding to the positionally encoded vectors. $M$ has dimensions $(n \times d_{\text{model}})$. We begin by producing query, key and value matrices, analogous to how a search engine maps a user query to relevant items in its database. We will make 3 copies of our matrix $M$. These become the matrices $Q, K$ and $V$. Each of these has dimension $(n \times d_{\text{model}})$. We let $d_k$ denote the dimensions of the keys, which in this case is $d_{\text{model}}$. We are ready to define attention as @@ -159,18 +160,16 @@ The attention matrix is a nice thing to visualize. For our toy example, it might What can we notice about our attention matrix? -#### It is symmetric. +**It is symmetric.** That is, $w = w^T$. This is to be expected, as remember it was produced by computing $QK^T$ where $Q$ and $K$ are identical. -#### The largest values are often times found on the leading diagonal. +**The largest values are often times found on the leading diagonal.** You can think of the values in the matrix as some measure of how important one token is to another. 
Typically, we try to ensure that each token pays attention to itself to some extent. -#### Every cell is filled. -This is because in this attention approach, every token attends to every other token. This is often referred to as \textit{full $n^2$ attention}. In Section (\ref{attentionmatrixopt}) you will see other ways of defining this attention matrix. - +**Every cell is filled.** +This is because in this attention approach, every token attends to every other token. This is often referred to as *full $n^2$ attention*. - -**Multi Head Self Attention.** +#### Multi Head Self Attention. It's important to acknowledge that there may not exist a single perfect representation of the attention matrix. Multi Head Self Attention allows us to produce many different representations of the attention matrix. Each individual attention mechanism is referred to as a "head". Each head learns a slightly different representation of the input sequence, which the original researchers found produced the best output. Firstly, we're going to introduce some new matrices. These will be defined as @@ -201,14 +200,14 @@ The overall output of the process is then simply Concat() simply concatenates our output matrices. The output matrix of size $(n \times d_v)$ for each head is simply our matrices placed side by side like so - +\end{equation*} -This output has dimension $(n \times H d_v)$. We still have $n$ rows, however now we have $h$ different representations of $d_v$. Our output, $W^O$, is another trainable weight matrix which has dimensions $W^O = (Hd_v \times d_{\text{model}})$. Therefore, the multiplication of $\text{Concat}(\text{head}_1, \cdots, \text{head}_H)$ and $W^O$ results in a matrix with dimension $(n \times d_{\text{model}})$. \ No newline at end of file +This output has dimension $(n \times H d_v)$. We still have $n$ rows, however now we have $H$ different representations, each of width $d_v$.
Our output, $W^O$, is another trainable weight matrix which has dimensions $W^O = (Hd_v \times d_{\text{model}})$. Therefore, the multiplication of Concat $(\text{head}_1, \dots, \text{head}_H)$ and $W^O$ results in a matrix with dimension $(n \times d_{\text{model}})$. \ No newline at end of file diff --git a/content/posts/mqa_gqa.md b/content/posts/mqa_gqa.md index a65c83d..26a58ec 100644 --- a/content/posts/mqa_gqa.md +++ b/content/posts/mqa_gqa.md @@ -7,6 +7,7 @@ draft: false tags: [attention, inference] --- +#### Multi Query Attention [*Multi Query Attention*](https://arxiv.org/pdf/1911.02150v1.pdf) (MQA) uses the same $K$ and $V$ matrices for each head in our multi head self attention mechanism. For a given head, $h$, $1 \leq h \leq H$, the attention mechanism is calculated as \begin{equation} @@ -21,7 +22,7 @@ For each of our $H$ heads, the only difference in the weight matrices is in $W_h As before, we simply concatenate our attention outputs and multiply by $W^O$, which is defined as before. - +#### Grouped Query Attention [*Grouped Query Attention*](https://arxiv.org/pdf/2305.13245v3.pdf) (GQA) is very similar to MQA. The difference is that instead of using just one set of $K$, $V$ values for attention calculations it uses $G$ different sets of $K,V$ values. If we have $H$ heads, GQA is equivalent to MHA if $G=H$ and equivalent to MQA if $G=1$. Suppose we want to use $G$ groups. We would firstly allocate each of our $H$ heads into one of the $G$ groups. It would likely make sense to pick $G$ such that $H \mod G \equiv 0$, so that the heads split evenly across the groups, though this is not a requirement. For each head in a given group, we calculate attention outputs as diff --git a/content/posts/sliding_window_attention.md b/content/posts/sliding_window_attention.md index db8e9fb..5811011 100644 --- a/content/posts/sliding_window_attention.md +++ b/content/posts/sliding_window_attention.md @@ -39,7 +39,8 @@ We will require two sets of our projection matrices.
Firstly, projections to com We first calculate local attention weights using $\{Q_s,K_s,V_s\}$. This gives us an attention output, which is then combined with the output using the global attention weights. The global weights are written on top of the output attention weight matrix calculated by the local attention calculation. -**Dilated Sliding Window Attention.** is another approach to achieve a similar result. This time, instead of simply taking the $\frac{1}{2}w$ tokens either side of a given $w$ we will introduce some gaps of size $d$. This is referred to as the dilation. Using $w=2, d=1$ in our example we would have an attention matrix which looks like +#### Dilated Sliding Window Attention. +This is another approach to achieve a similar result. This time, instead of simply taking the $\frac{1}{2}w$ tokens either side of a given $w$ we will introduce some gaps of size $d$. This is referred to as the dilation. Using $w=2, d=1$ in our example we would have an attention matrix which looks like ![Dilated Sliding Window Attention Matrix](/img/dilated_sliding_window.png) diff --git a/content/posts/sparse_attention.md b/content/posts/sparse_attention.md index 66a44a7..750f2d7 100644 --- a/content/posts/sparse_attention.md +++ b/content/posts/sparse_attention.md @@ -20,15 +20,16 @@ $$ Q_{S_i} = (W_q x_j), K_{S_i} = (W_k x_j), V_{S_i} = (W_v x_j) \text{ for } j So how do we define the set of connectivity patterns $S$? Formally, we let $S_i = A_i^{h}$ for head $h$ where $A_i^{h} \subset \{j : j \leq i\}$. It is still no clearer how we pick which indices we should take for a given $S_i$. The original authors consider two key criteria initially: -#### Criteria 1 +**Criteria 1** We should pick $|A_i^h| \propto n^{1/H}$ where $H$ is our total number of heads. This choice is efficient as it ensures the size of the connectivity set scales well with $H$. -#### Criteria 2 +**Criteria 2** All input positions are connected to output positions across $p$ steps of attention. 
For instance, for a pair $j \leq i$ we would like $i$ to be able to attend to $j$ through a path of locations with maximum length $p+1$. This helps us propagate signals from input to output in a constant number of steps. We now investigate two different approaches that satisfy these criteria and allow us to implement sparse attention. -**Strided Attention.** We will define a factorized attention pattern in two heads. One head will attend to the previous $l$ locations, while the other head will attend to every $l$th location. We call $l$ the stride and it is chosen to be close to $\sqrt{n}$. +#### Strided Attention. +We will define a factorized attention pattern in two heads. One head will attend to the previous $l$ locations, while the other head will attend to every $l$th location. We call $l$ the stride and it is chosen to be close to $\sqrt{n}$. \begin{align} A_i^{(1)} &= \{t,t+1,\dots,i\} \text{ for } t = \max(0,i-l), \\\\ @@ -37,7 +38,8 @@ We now investigate two different approaches that satisfy this criteria, and allo Here, $A_i^{(1)}$ simply takes the previous $l$ locations. $A_i^{(2)}$ then takes every $l$th location, that is, each $j$ for which $i-j$ is divisible by $l$ without remainder. This is particularly useful where you can align the structure of your input with the stride, for instance with a piece of music. Where our input does not have a well-defined structure, we use something different. In the image below, you can see $A_i^{(1)}$ responsible for the dark blue shading and $A_i^{(2)}$ responsible for the light blue. -**Fixed Attention**. Our goal with this approach is to allow specific cells to summarize the previous locations, and to propagate this information on to future cells. +#### Fixed Attention. +Our goal with this approach is to allow specific cells to summarize the previous locations, and to propagate this information on to future cells.
$$ A^{(1)}_i = \{ j : \text{floor}(\frac{j}{l}) = \text{floor}( \frac{i}{l}) \}, $$ $$ A^{(2)}_i = \{ j : j \mod l \in \{ t, t + 1, \dots, l \} \}, \text{ where } t = l - c \text{ and } c \text{ is a hyperparameter.} $$
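To make the two fixed-attention index sets above concrete, here is a minimal sketch in plain Python. The function name `fixed_attention_sets` and the zero-based position indexing are our own assumptions for illustration, not something taken from the original paper's code. We also restrict both sets to $j \leq i$, following the earlier requirement that $A_i^{h} \subset \{j : j \leq i\}$.

```python
def fixed_attention_sets(i, l, c):
    """Return (A1, A2) for query position i under the fixed pattern.

    Hypothetical helper for illustration only. Positions are zero-based
    (an assumption), l is the block length and c is the hyperparameter
    from the text, with t = l - c.
    """
    t = l - c
    # A^(1): positions in the same block of length l as i, up to i itself.
    a1 = [j for j in range(i + 1) if j // l == i // l]
    # A^(2): positions whose offset within their block is at least t.
    # j % l never exceeds l - 1, so the upper bound l in the formula is never hit.
    a2 = [j for j in range(i + 1) if j % l >= t]
    return a1, a2


# Position 9 with block length 4 and c = 1 (so t = 3).
a1, a2 = fixed_attention_sets(9, 4, 1)
print(a1)  # positions sharing a block with 9
print(a2)  # the "summary" positions at the end of each earlier block
```

With $l = 4$ and $c = 1$, the last cell of every block acts as a summary cell: position 9 attends to its own block through $A^{(1)}_i$ and to the summary cells of earlier blocks through $A^{(2)}_i$.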