From c8bf25c8aff5d25805258a79b5d65131e8ceac64 Mon Sep 17 00:00:00 2001
From: Jonah Ramponi
Date: Sat, 30 Mar 2024 15:44:04 +0000
Subject: [PATCH] improvements

---
 content/posts/flash_attention.md    |   2 +-
 content/posts/intro_to_attention.md | 267 +++++++++++-----------------
 content/posts/mqa_gqa.md            |   6 +-
 content/posts/resources.md          |  11 ++
 content/posts/sparse_attention.md   |   5 +-
 content/posts/test copy.md          |  24 ---
 6 files changed, 123 insertions(+), 192 deletions(-)
 create mode 100644 content/posts/resources.md
 delete mode 100644 content/posts/test copy.md

diff --git a/content/posts/flash_attention.md b/content/posts/flash_attention.md
index cf80758..c1cf1ce 100644
--- a/content/posts/flash_attention.md
+++ b/content/posts/flash_attention.md
@@ -6,7 +6,7 @@
 tldr: Reduce the memory usage used to compute exact attention.
 draft: false
 tags: [attention, inference]
 ---
-The goal of [*Flash Attention*](https://arxiv.org/pdf/2205.14135.pdf) is to compute the attention value with fewer high bandwidth memory read / writes. The approach has since been refined in [*Flash Attention 2*](https://arxiv.org/pdf/2307.08691.pdf).
+The goal of [*Flash Attention*](https://arxiv.org/pdf/2205.14135.pdf) is to compute the attention value with fewer high-bandwidth memory reads / writes. The approach has since been refined in [*Flash Attention 2*](https://arxiv.org/pdf/2307.08691.pdf).

 We will split the attention inputs $Q,K,V$ into blocks. Each block will be handled separately, and attention will therefore be computed with respect to each block. With the correct scaling, adding the outputs from each block will give us the same attention value as we would get by computing everything all together.

diff --git a/content/posts/intro_to_attention.md b/content/posts/intro_to_attention.md
index 951888d..bf73713 100644
--- a/content/posts/intro_to_attention.md
+++ b/content/posts/intro_to_attention.md
@@ -8,256 +8,200 @@
 tags: [attention, inference]
 ---
 Suppose you give an LLM the input
-\begin{center}
-    \textbf{input:} \textit{``What is the capital of France?"}
-\end{center}
+
+> *"What is the capital of France?"*
+
 The first thing the LLM will do is split this input into tokens. A token is just some combination of characters. You can see an example of the tokenization output for the question below.
-\begin{center}
-    ``\textit{\colorbox{red}{What}\colorbox{magenta}{ is}\colorbox{green}{ the}\colorbox{orange}{ capital}\colorbox{purple}{ of}\colorbox{brown}{ France}\colorbox{cyan}?"}\footnote{This tokenization was produced using cl100k\_base, the tokenizer used in GPT-3.5-turbo and GPT-4.}
-\end{center}
-In this example we have $(n = 7)$ tokens. Importantly, from our model's point of view, our input size is defined by the number of tokens instead of words. A numerical representation (vector representation) of each token is now found. Finding this vector representation is called producing an embedding of the token. The token ``\colorbox{red}{ What}" might get tokenized as follows
+
+> "*$\colorbox{red}{What}\colorbox{magenta}{ is}\colorbox{green}{ the}\colorbox{orange}{ capital}\colorbox{purple}{ of}\colorbox{brown}{ France}\colorbox{cyan}?$*"
+
+(This tokenization was produced using cl100k_base, the tokenizer used in GPT-3.5-turbo and GPT-4.)
+
+In this example we have $n = 7$ tokens. Importantly, from our model's point of view, the size of our input is defined by the number of tokens, not the number of words.
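If you want to check the count yourself, here is a minimal sketch using the `tiktoken` package (an assumed dependency, not something this post requires):

```python
import tiktoken  # assumed dependency: pip install tiktoken

# cl100k_base is the tokenizer used by GPT-3.5-turbo and GPT-4
enc = tiktoken.get_encoding("cl100k_base")

token_ids = enc.encode("What is the capital of France?")
tokens = [enc.decode([tid]) for tid in token_ids]

print(len(token_ids))  # 7, our n
print(tokens)          # ['What', ' is', ' the', ' capital', ' of', ' France', '?']
```

The exact split depends entirely on the tokenizer; a different model may give a different $n$ for the same sentence.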
A numerical representation (vector representation) of each token is now found. Finding this vector representation is called producing an embedding of the token. The token *$\colorbox{red}{ What}$* might get tokenized as follows

\begin{equation}
    \text{tokenizer}(\textit{\colorbox{red}{What}}) \rightarrow \begin{bmatrix} -0.4159 \\\\ -0.5147 \\\\ 0.5690 \\\\ \vdots \\\\ -0.2577 \\\\ 0.5710 \\\\ \end{bmatrix}
\end{equation}

The length of each of our embeddings, these vector outputs of our tokenizer, is the same regardless of the number of characters in our token. Let us denote this length $d_{\text{model}}$. So, after we embed each token in our input sequence with our tokenizer, we are left with

$$
\begin{bmatrix} -0.415 \\\\ -0.514 \\\\ 0.569 \\\\ \vdots \\\\ -0.257 \\\\ 0.571 \\\\ \end{bmatrix}
\begin{bmatrix} -0.130 \\\\ -0.464 \\\\ 0.23 \\\\ \vdots \\\\ -0.154 \\\\ 0.192 \\\\ \end{bmatrix}
, \dots ,
\begin{bmatrix} 0.127 \\\\ 0.453 \\\\ 0.110 \\\\ \vdots \\\\ -0.155 \\\\ 0.484 \\\\ \end{bmatrix}
$$

This output is now passed through a *positional encoder*. Broadly, this is useful to provide the model with information about the position of words or tokens within a sequence. You might wonder why we need to positionally encode each token. What does it even mean to positionally encode something? Why can't we just use the index of the item? These questions are for another post.

The only thing that matters for now is that each of our numerical representations (vectors) is slightly altered. For the numerical representation of the token "$\colorbox{red}{ What}$" that we get from our embedding model, it might look something like:

\begin{equation}
    \text{positional encoder}\Bigg(\begin{bmatrix} -0.415 \\\\ -0.514 \\\\ \vdots \\\\ -0.257 \\\\ 0.571 \\\\ \end{bmatrix}\Bigg) =
    \begin{bmatrix} -0.424 \\\\ -0.574 \\\\ \vdots \\\\ -0.235 \\\\ 0.534 \\\\ \end{bmatrix}
\end{equation}
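The details of the encoder really are for that other post, but as a rough sketch, one common choice is the sinusoidal encoding from the original Transformer paper, which simply adds a position-dependent vector to each embedding. A minimal sketch (the function name is mine, and it assumes $d_{\text{model}}$ is even):

```python
import math
import torch

def sinusoidal_positional_encoding(n: int, d_model: int) -> torch.Tensor:
    """Return an (n, d_model) matrix of sinusoidal position encodings."""
    position = torch.arange(n, dtype=torch.float32).unsqueeze(1)   # (n, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )                                                              # (d_model / 2,)
    pe = torch.zeros(n, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

# embeddings has shape (n, d_model); adding the encoding keeps that shape
# encoded = embeddings + sinusoidal_positional_encoding(n, d_model)
```

This is only one possible scheme; the key property is that it perturbs each embedding according to its position without changing its length.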
-Importantly, the positional encoder does not alter the length of our vector, $d_{\text{model}}$. It simply tweaks the values slightly. So far, our transformations look like:
-
-\begin{center}
-    \textbf{input:} \hspace{14mm} \textit{``What is the capital of Paris?"}
-\end{center}
-\begin{center}
-    \begin{tikzpicture}[baseline={(current bounding box.center)}]
-    \draw[->] (0,0.2) -- (0,-0.2);
-    \end{tikzpicture}
-\end{center}
-\begin{center}
-    \textbf{(1) tokenize:} ``\textit{\colorbox{red}{What}\colorbox{magenta}{ is}\colorbox{green}{ the}\colorbox{orange}{ capital}\colorbox{purple}{ of}\colorbox{brown}{ France}\colorbox{cyan}?"}
-\end{center}
-
-\begin{center}
-    \begin{tikzpicture}[baseline={(current bounding box.center)}]
-    \draw[->] (0,0.2) -- (0,-0.2);
-    \end{tikzpicture}
-\end{center}
-
-\vspace{-6mm}
-\begin{center}
-    \textbf{(2) embed:} \hspace{12mm} $
-    \begin{bmatrix} -0.415 \\ -0.514 \\ 0.569 \\ \vdots \\ -0.257 \\ 0.571 \\ \end{bmatrix}
-    \begin{bmatrix} -0.130 \\ -0.464 \\ 0.23 \\ \vdots \\ -0.154 \\ 0.192 \\ \end{bmatrix}
-    , \dots ,
-    \begin{bmatrix} 0.127 \\ 0.453 \\ 0.110 \\ \vdots \\ -0.155 \\ 0.484 \\ \end{bmatrix}
-    $
-\end{center}
-
-\begin{center}
-    \begin{tikzpicture}[baseline={(current bounding box.center)}]
-    \draw[->] (0,0.2) -- (0,-0.2);
-    \end{tikzpicture}
-\end{center}
-
-\vspace{-6mm}
-\begin{center}
-    \textbf{(3) encode:} \hspace{12mm} $
-    \begin{bmatrix} -0.424 \\ -0.574 \\ 0.513 \\ \vdots \\ -0.235 \\ 0.534 \\ \end{bmatrix}
-    \begin{bmatrix} -0.133 \\ -0.461 \\ 0.228 \\ \vdots \\ -0.151 \\ 0.193 \\ \end{bmatrix}
-    , \dots ,
-    \begin{bmatrix} -0.123 \\ 0.455 \\ 0.110 \\ \vdots \\ -0.121 \\ 0.489 \\ \end{bmatrix}
-    $
-\end{center}

Importantly, the positional encoder does not alter the length of our vector, $d_{\text{model}}$. It simply tweaks the values slightly. So far, we entered our prompt:

> *"What is the capital of France?"*

This was tokenized

> "*$\colorbox{red}{What}\colorbox{magenta}{ is}\colorbox{green}{ the}\colorbox{orange}{ capital}\colorbox{purple}{ of}\colorbox{brown}{ France}\colorbox{cyan}?$*"

Then embedded

$$
\begin{bmatrix} -0.415 \\\\ -0.514 \\\\ 0.569 \\\\ \vdots \\\\ -0.257 \\\\ 0.571 \\\\ \end{bmatrix}
\begin{bmatrix} -0.130 \\\\ -0.464 \\\\ 0.23 \\\\ \vdots \\\\ -0.154 \\\\ 0.192 \\\\ \end{bmatrix}
, \dots ,
\begin{bmatrix} 0.127 \\\\ 0.453 \\\\ 0.110 \\\\ \vdots \\\\ -0.155 \\\\ 0.484 \\\\ \end{bmatrix}
$$

and finally positionally encoded

\begin{equation}
    \text{positional encoder}\Bigg(\begin{bmatrix} -0.415 \\\\ -0.514 \\\\ \vdots \\\\ -0.257 \\\\ 0.571 \\\\ \end{bmatrix}\Bigg) =
    \begin{bmatrix} -0.424 \\\\ -0.574 \\\\ \vdots \\\\ -0.235 \\\\ 0.534 \\\\ \end{bmatrix}
\end{equation}

We're now very close to being able to introduce attention.
One last thing remains: at this point we will transform the output of our positional encoding into a matrix $M$ as follows

\begin{equation}
    M = \begin{pmatrix}
    -0.424 & -0.574 & 0.513 & \dots & -0.235 & 0.534 \\\\
    -0.133 & 0.461 & 0.228 & \dots & -0.151 & 0.193 \\\\
    \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\\\
    0.123 & 0.455 & 0.110 & \dots & -0.121 & 0.489
    \end{pmatrix}
    = \text{positional encoding}\begin{pmatrix}
    \text{\colorbox{red}{ What}} \\\\
    \text{\colorbox{magenta}{ is}} \\\\
    \vdots \\\\
    \text{\colorbox{cyan}{?}}
    \end{pmatrix}
\end{equation}

The top row is the first vector output of our positional encoding, the second row is the second, and so on. If we had $n$ tokens in our input sequence, then matrix $M$ would have $n$ rows. The dimensions of $M$ are as follows

\begin{equation}
    M = \Big( \text{number of tokens in input} \times \text{length of embedding} \Big) = \Big( n \times d_{\text{model}} \Big).
\end{equation}

**Introduction To Self Attention.** At a high level, self-attention aims to evaluate the importance of each element in a sequence with respect to all other elements, and to use this to compute a representation of the sequence. All it really does is compute a weighted average of the input vectors to produce the output vectors. Mathematically, for an input sequence of vectors $x = (x_1, \dots ,x_{n})$ it will return some sequence of vectors $y = (y_1,\dots,y_m)$ such that

\begin{equation}
    y_i = \sum_{j = 1}^{n} w_{ij} \cdot x_j, \quad \forall \, 1 \leq i \leq m,
\end{equation}

for some mapping $w_{ij}$. The challenge is in figuring out how we should define our mapping $w_{ij}$.
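Before we do, here is a tiny illustration of this weighted-average view (the weights below are made up purely for illustration, they are not from any paper):

```python
import torch

x = torch.randn(7, 512)                       # n = 7 input vectors of length d_model
w = torch.softmax(torch.randn(7, 7), dim=-1)  # any weight matrix whose rows sum to one
y = w @ x                                     # y_i = sum_j w_ij * x_j
print(y.shape)                                # torch.Size([7, 512])
```

Any row-stochastic $w$ gives a valid weighted average; the whole point of attention is to compute a *useful* $w$ from the data itself.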
Let's look at the first way $w_{ij}$ was defined, introduced in [Attention is All You Need](https://arxiv.org/pdf/1706.03762.pdf).

**Scaled Dot Product Self Attention.** To compute scaled dot product self attention, we will use the matrix $M$ with rows corresponding to the positionally encoded vectors. $M$ has dimensions $(n \times d_{\text{model}})$.

-We begin by producing query, key and value matrices, analogous to how a search engine maps a user query to relevant items in its database\footnote{This \href{https://stats.stackexchange.com/questions/421935/what-exactly-are-keys-queries-and-values-in-attention-mechanisms}{stack exchange} post contains some great insight into the idea behind the $Q$, $K$ and $V$ matrices.}. We will make 3 copies of our matrix $M$. These become the matrices $Q, K$ and $V$. Each of these has dimension $(n \times d_{\text{model}})$. We let $d_k$ denote the dimensions of the keys, which in this case is $d_{\text{model}}$. We are ready to define attention as

We begin by producing query, key and value matrices, analogous to how a search engine maps a user query to relevant items in its database. (This [Stack Exchange post](https://stats.stackexchange.com/questions/421935/what-exactly-are-keys-queries-and-values-in-attention-mechanisms) contains some great insight into the idea behind the $Q$, $K$ and $V$ matrices.) We will make 3 copies of our matrix $M$. These become the matrices $Q, K$ and $V$. Each of these has dimension $(n \times d_{\text{model}})$. We let $d_k$ denote the dimension of the keys, which in this case is $d_{\text{model}}$. We are ready to define attention as

\begin{equation}
    \text{attention}(Q,K,V) = \text{softmax}\Big(\frac{Q K^T}{\sqrt{d_k}}\Big) \cdot V.
\end{equation}

```python
import math
import torch

def attention(Q, K, V):
    dk = K.size(-1)  # d_k, the dimension of the keys
    # (n x n) matrix of scores, re-scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(dk)
    attn_weights = torch.nn.functional.softmax(scores, dim=-1)
    return torch.matmul(attn_weights, V)
```

Our matrix $QK^T$ has dimension $(n \times d_{\text{model}}) \times (d_{\text{model}} \times n) = (n \times n)$. After we re-scale by $\sqrt{d_k}$, this matrix is referred to as the *attention matrix*.

**Why do we divide by $\sqrt{d_k}$?** This was introduced to counteract the effect of the dot products growing large in magnitude for large dimensional inputs, $d_k \gg 1$. In cases where the dot products grew large, it was suspected that the softmax function was returning extremely small gradients, which in turn led to the vanishing gradients problem.

Applying softmax re-scales each row of the attention matrix to sum to one; we then multiply this matrix with $V$. The equation for softmax applied to a matrix $X$ is as follows

\begin{equation}
    \text{softmax}(X)_{ij} = \frac{e^{X_{ij}}}{\sum_{k=1}^{n} e^{X_{ik}}}.
\end{equation}

```python
def softmax(X):
    exp_X = torch.exp(X)                     # element-wise exponential
    denom = exp_X.sum(dim=-1, keepdim=True)  # row-wise normalising constants
    return exp_X / denom                     # each row now sums to one
```

**Why use softmax?** The dot product of $Q$ and $K^T$ gives us values anywhere between negative and positive infinity. Applying softmax squashes them into $(0,1)$ and makes each row sum to one, keeping the outputs stable. Without it, large elements of $QK^T$ would dominate the attention mechanism, which may cause convergence issues.
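One practical note that isn't needed for the maths above: exponentiating large scores directly can overflow in floating point. A common remedy, and the flavour of softmax that the Flash Attention post builds on, is to subtract the row-wise maximum before exponentiating, which leaves the result unchanged. A minimal sketch:

```python
def stable_softmax(X):
    # softmax is unchanged by adding a constant to every score in a row,
    # so subtracting the row maximum keeps torch.exp from overflowing
    X_max = X.max(dim=-1, keepdim=True).values
    exp_X = torch.exp(X - X_max)
    return exp_X / exp_X.sum(dim=-1, keepdim=True)
```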
-\textbf{Why use softmax?} The dot product of $Q$ and $K^T$ gives us a value anywhere between negative and positive infinity. Application of softmax ensures our outputs are more stable. Otherwise, large elements in $Q$ or $K^T$ would grow even larger, dominating the attention mechanism which may cause convergence issues. \\ +Earlier on in we described attention as -Earlier on in Equation (\ref{attention}) we described attention as \begin{equation} y_i = \sum_{j = 1}^{{n}} w_{ij} \cdot x_j, \qquad \forall 1 \leq i \leq m. \end{equation} -Well, our \textit{attention matrix} after softmax has been applied is simply $w$ with $(i,j)th$ element $w_{ij}$. The output $y_i$ is just the weighted sum using $w$ on the value vectors, $v = (\vec{v}_1,\dots,\vec{v}_n)$. It may be clearer to visualize the output as + +Well, our *attention matrix* after softmax has been applied is simply $w$ with $(i,j)th$ element $w_{ij}$. The output $y_i$ is just the weighted sum using $w$ on the value vectors, $v = (\vec{v}_1,\dots,\vec{v}_n)$. It may be clearer to visualize the output as \[ \vec{y} = \begin{pmatrix} - w_{11} & w_{12} & \dots & w_{1n} \\ - w_{21} & w_{22} & \dots & w_{2n} \\ - \vdots & \vdots & \ddots & \vdots \\ + w_{11} & w_{12} & \dots & w_{1n} \\\\ + w_{21} & w_{22} & \dots & w_{2n} \\\\ + \vdots & \vdots & \ddots & \vdots \\\\ w_{n1} & w_{n2} & \dots & w_{nn} \end{pmatrix} \times \begin{pmatrix} - v_1 \\ v_2 \\ \vdots \\ v_n + v_1 \\\\ v_2 \\\\ \vdots \\\\ v_n \end{pmatrix} \] + The attention matrix is a nice thing to visualize. For our toy example, it might look like -\begin{equation} -w= \begin{pmatrix} - & \textit{\colorbox{red}{What}} & \textit{\colorbox{magenta}{ is}} & \textit{\colorbox{green}{ the}} & \textit{\colorbox{orange}{ capital}} & \textit{\colorbox{purple}{ of}} & \textit{\colorbox{brown}{ France}} & \textit{\colorbox{cyan}{?}} \\ - \textit{\colorbox{red}{What}} & 0.71 & 0.12 & 0.32 & 0.29 & 0.23 & 0.03 & 0.49\\ \textit{\colorbox{magenta}{ is}} & 0.12 & 0.65 & 0.04 & 0.37 & 0.27 & 0.15 & 0.13 \\ \textit{\colorbox{green}{ the}} & 0.32 & 0.04 & 0.68 & 0.21 & 0.11 & 0.36 & 0.22 \\ \textit{\colorbox{orange}{ capital}} & 0.29 & 0.37 & 0.21 & 0.59 & 0.12 & 0.39 & 0.41 \\ \textit{\colorbox{purple}{ of}} & 0.23 & 0.27 & 0.11 & 0.12 & 0.67& 0.20 & 0.15\\ \textit{\colorbox{brown}{ France}} & 0.03 & 0.15 & 0.36 & 0.39 & 0.20 & 0.81 & 0.12\\ \textit{\colorbox{cyan}{?}} & 0.49 & 0.13 & 0.22 & 0.41 & 0.15 & 0.12 & 0.70 -\end{pmatrix} -\end{equation} + + + + What can we notice about our attention matrix? -\begin{itemize} - \item It is symmetric. That is, $w = w^T$. This is to be expected, as remember it was produced by computing $QK^T$ where $Q$ and $K$ are identical. - \item The largest values are often times found on the leading diagonal. You can think of the values in the matrix as some measure of how important one token is to another. Typically, we try to ensure that each token pays attention to itself to some extent. - \item Every cell is filled. This is because in this attention approach, every token attends to every other token. This is often referred to as \textit{full $n^2$ attention}. In Section (\ref{attentionmatrixopt}) you will see other ways of defining this attention matrix. -\end{itemize} +#### It is symmetric. +That is, $w = w^T$. This is to be expected, as remember it was produced by computing $QK^T$ where $Q$ and $K$ are identical. +#### The largest values are often times found on the leading diagonal. 
You can think of the values in the matrix as some measure of how important one token is to another. Typically, we try to ensure that each token pays attention to itself to some extent.
#### Every cell is filled.
This is because in this attention approach, every token attends to every other token. This is often referred to as *full $n^2$ attention*. In later posts you will see other ways of defining this attention matrix.

**Multi Head Self Attention.** It's important to acknowledge that there may not exist a single perfect representation of the attention matrix. Multi Head Self Attention allows us to produce many different representations of the attention matrix. Each individual attention mechanism is referred to as a "head". Each head learns a slightly different representation of the input sequence, which the original researchers found produced the best output. Firstly, we're going to introduce some new matrices. These will be defined as

\begin{align*}
    Q = (n \times d_q), \quad K = (n \times d_k), \quad V = (n \times d_v)
\end{align*}

These matrices will be obtained by linearly transforming the original matrix $M$, using weight matrices $W^Q$, $W^K$ and $W^V$ respectively:

\begin{align*}
    Q &= M\times W^Q, \\\\
    K &= M \times W^K, \\\\
    V &= M \times W^V.
\end{align*}

Each of these weight matrices has $d_{\text{model}}$ rows, and remember that $M$ has $d_{\text{model}}$ columns, so the products are well defined. We have control over the parameters $d_q, d_k, d_v$. In the original research they took $d_q = d_k = d_v = d_{\text{model}}/8 = 64$. We're going to use a different set of weight matrices $W^Q$, $W^K$ and $W^V$ for each head. If we have $H$ heads, we will refer to the set of weight matrices of the $h$-th head as $\{ W_h^Q, W_h^K, W_h^V \}$. For a given head $h$, the output of the attention mechanism is

\begin{equation}
    \text{head}_h = \text{attention}(M \cdot W_h^Q, M\cdot W_h^K, M\cdot W_h^V)
\end{equation}

The overall output of the process is then simply

\begin{equation}
    \text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \cdots, \text{head}_H)W^O.
\end{equation}

Concat() simply concatenates our output matrices. Each head produces an output matrix of size $(n \times d_v)$, and these are placed side by side (concatenated along the column dimension), giving a combined matrix of dimension $(n \times H d_v)$. We still have $n$ rows, however now we have $H$ different representations, each of width $d_v$. The output projection $W^O$ is another trainable weight matrix, with dimensions $(Hd_v \times d_{\text{model}})$. Therefore, multiplying $\text{Concat}(\text{head}_1, \cdots, \text{head}_H)$ by $W^O$ results in a matrix with dimension $(n \times d_{\text{model}})$, the same shape as our input $M$.
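To make the shapes concrete, here is a minimal sketch of multi head self attention. This is my own illustration rather than any library's implementation: it reuses the `attention` function defined above, and random matrices stand in for the learned weights.

```python
import torch

n, d_model, H = 7, 512, 8
d_q = d_k = d_v = d_model // H          # 64, as in the original research

M = torch.randn(n, d_model)             # positionally encoded input matrix

# one set of weight matrices per head, plus the output projection W_O
W_Q = [torch.randn(d_model, d_q) for _ in range(H)]
W_K = [torch.randn(d_model, d_k) for _ in range(H)]
W_V = [torch.randn(d_model, d_v) for _ in range(H)]
W_O = torch.randn(H * d_v, d_model)

heads = [attention(M @ W_Q[h], M @ W_K[h], M @ W_V[h]) for h in range(H)]  # each (n, d_v)
multi_head = torch.cat(heads, dim=-1) @ W_O                                # (n, d_model)
print(multi_head.shape)  # torch.Size([7, 512])
```

The printed shape matches the argument above: concatenating $H$ heads of width $d_v$ gives $(n \times Hd_v)$, and $W^O$ maps this back to $(n \times d_{\text{model}})$.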
\ No newline at end of file
diff --git a/content/posts/mqa_gqa.md b/content/posts/mqa_gqa.md
index 2548805..9eab911 100644
--- a/content/posts/mqa_gqa.md
+++ b/content/posts/mqa_gqa.md
@@ -1,13 +1,13 @@
 ---
 title: Multi & Grouped Query Attention
-description: Using less K & V Matrices.
+description: Use fewer K and V matrices to use less memory.
 date: 2024-03-22
-tldr: Using less K & V Matrices.
+tldr: Use fewer K and V matrices to use less memory.
 draft: false
 tags: [attention, inference]
 ---
-[*Multi Query Attention*](https://arxiv.org/pdf/1911.02150v1.pdf) (MQA) using the same $K$ and $V$ matrices for each head in our multi head self attention mechanism. For a given head, $h$, $1 \leq h \leq H$, the attention mechanism is calculated as
+[*Multi Query Attention*](https://arxiv.org/pdf/1911.02150v1.pdf) (MQA) uses the same $K$ and $V$ matrices for every head in our multi head self attention mechanism. For a given head $h$, $1 \leq h \leq H$, the attention mechanism is calculated as
 
 \begin{equation}
 h_i = \text{attention}(M\cdot W_h^Q, M \cdot W^K,M \cdot W^V).
diff --git a/content/posts/resources.md b/content/posts/resources.md
new file mode 100644
index 0000000..f57106b
--- /dev/null
+++ b/content/posts/resources.md
@@ -0,0 +1,11 @@
+---
+title: "PDFs and Resources"
+date: 2024-02-28T11:49:13Z
+draft: false
+---
+
+The contents of this website can be found as a [pdf here](/posts/file/Attention_Mechanisms.pdf).
+
+
+
+
diff --git a/content/posts/sparse_attention.md b/content/posts/sparse_attention.md
index f419ea9..66a44a7 100644
--- a/content/posts/sparse_attention.md
+++ b/content/posts/sparse_attention.md
@@ -39,9 +39,8 @@ Here, $A_i^{(1)}$ simply takes the previous $l$ locations. $A_i^{(2)}$ then take
 
 **Fixed Attention**. Our goal with this approach is to allow specific cells to summarize the previous locations, and to propagate this information on to future cells.
 
-$$ A^{(1)}_i = \Big\{ j : \text{floor}(\frac{j}{l}) = \text{floor}( \frac{i}{l}) \Big\}, $$
-
-$$ A^{(2)}_i = \Big\{ j : j \mod l \in \{ t, t + 1, \ldots, l \} \Big\}, \text{ where } t = l - c \text{ and } c \text{ is a hyperparameter.} $$
+$$ A^{(1)}_i = \{ j : \text{floor}(\frac{j}{l}) = \text{floor}( \frac{i}{l}) \}, $$
+$$ A^{(2)}_i = \{ j : j \bmod l \in \{ t, t + 1, \dots, l \} \}, \text{ where } t = l - c \text{ and } c \text{ is a hyperparameter.} $$
 
 These are best understood visually in my opinion. In the image below, $A_i^{(1)}$ is responsible for the dark blue shading and $A_i^{(2)}$ for the light blue shading. If we take stride $l = 128$ and $c=8$, then all positions greater than 128 can attend to positions $120-128$. The authors find choosing $c \in \{8,16,32\}$ worked well.
 
diff --git a/content/posts/test copy.md b/content/posts/test copy.md
deleted file mode 100644
index 5f939b7..0000000
--- a/content/posts/test copy.md
+++ /dev/null
@@ -1,24 +0,0 @@
----
-title: "Post 2"
-date: 2024-03-30T11:49:13Z
-draft: false
----
-
-Here's my second content [View as PDF](/posts/file/Attention_Mechanisms.pdf).
- -> Here is a quote by me - -![my alt text](/img/longformer.png) - - -```python -def attention(Q, K, V): - dk = K.size(-1) - scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(dk) - attn_weights = torch.nn.functional.softmax(scores, dim=-1) - return torch.matmul(attn_weights, V) -``` - -{{< youtube id="q389nbmv4MU" >}} - -test 2