
Commit

only one page to go!!
jonah-ramponi committed Mar 30, 2024
1 parent c8bf25c commit 4165a2a
Showing 2 changed files with 14 additions and 14 deletions.
26 changes: 13 additions & 13 deletions content/posts/intro_to_attention.md
@@ -9,27 +9,27 @@ tags: [attention, inference]

Suppose you give an LLM the input

- > *``What is the capital of France?"*
+ *``What is the capital of France?"*

The first thing the LLM will do is split this input into tokens. A token is just some combination of characters. You can see an example of the tokenization output for the question below.

- > ``*$\colorbox{red}{What}\colorbox{magenta}{ is}\colorbox{green}{ the}\colorbox{orange}{ capital}\colorbox{purple}{ of}\colorbox{brown}{ France}\colorbox{cyan}?$*"
+ ``*$\colorbox{red}{What}\colorbox{magenta}{ is}\colorbox{green}{ the}\colorbox{orange}{ capital}\colorbox{purple}{ of}\colorbox{brown}{ France}\colorbox{cyan}?$*"

(This tokenization was produced using cl100k_base, the tokenizer used in GPT-3.5-turbo and GPT-4.)
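
(If it helps to make this concrete, here is a minimal sketch of the tokenization step. It assumes the `tiktoken` package, which ships the cl100k_base encoding named above; the printed ids and strings are whatever the library returns, not values checked by hand.)

```python
# A rough sketch of tokenizing the prompt with cl100k_base via tiktoken (assumed dependency).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("What is the capital of France?")

print(len(token_ids))                            # the number of tokens, n (7 for this prompt per the text above)
print([enc.decode([tid]) for tid in token_ids])  # the token strings, e.g. "What", " is", " the", ...
```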

In this example we have $n = 7$ tokens. Importantly, from our model's point of view, our input size is defined by the number of tokens, not the number of words. A numerical representation (a vector) of each token is now found. Finding this vector representation is called producing an embedding of the token. The token *$\colorbox{red}{ What}$* might get tokenized as follows

\begin{equation}
- \text{tokenizer}(\textit{\colorbox{red}{What}}) \rightarrow \begin{bmatrix} -0.4159 \\\\ -0.5147 \\\\ 0.5690 \\\\ \vdots \\\\ -0.2577 \\\\ 0.5710 \\\\ \end{bmatrix}
+ \text{tokenizer}(\textit{\colorbox{red}{What}}) \rightarrow \begin{pmatrix} -0.4159 \\\\ -0.5147 \\\\ 0.5690 \\\\ \vdots \\\\ -0.2577 \\\\ 0.5710 \\\\ \end{pmatrix}
\end{equation}

The length of each of our embeddings, the vector outputs of our tokenizer, is the same regardless of the number of characters in our token. Let us denote this length $d_{\text{model}}$. So after we embed each token in our input sequence with our tokenizer, we are left with

$$
- \begin{bmatrix} -0.415 \\\\ -0.514 \\\\ 0.569 \\\\ \vdots \\\\ -0.257 \\\\ 0.571 \\\\ \end{bmatrix}
- \begin{bmatrix} -0.130 \\\\ -0.464 \\\\ 0.23 \\\\ \vdots \\\\ -0.154 \\\\ 0.192 \\\\ \end{bmatrix}
+ \begin{pmatrix} -0.415 \\\\ -0.514 \\\\ 0.569 \\\\ \vdots \\\\ -0.257 \\\\ 0.571 \\\\ \end{pmatrix}
+ \begin{pmatrix} -0.130 \\\\ -0.464 \\\\ 0.23 \\\\ \vdots \\\\ -0.154 \\\\ 0.192 \\\\ \end{pmatrix}
, \dots ,
- \begin{bmatrix} 0.127 \\\\ 0.453 \\\\ 0.110 \\\\ \vdots \\\\ -0.155 \\\\ 0.484 \\\\ \end{bmatrix}
+ \begin{pmatrix} 0.127 \\\\ 0.453 \\\\ 0.110 \\\\ \vdots \\\\ -0.155 \\\\ 0.484 \\\\ \end{pmatrix}
$$
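
(To make the shapes explicit, here is a toy sketch of the embedding lookup. The table values, the token ids, and the choice $d_{\text{model}} = 512$ are all stand-ins for illustration, not anything from a real model.)

```python
# A toy embedding lookup: each token id indexes a row of length d_model.
# The vocabulary size, ids, and random weights are purely illustrative.
import numpy as np

vocab_size, d_model = 1_000, 512
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))  # stand-in for learned weights

token_ids = [3, 14, 1, 59, 26, 53, 5]     # hypothetical ids for our n = 7 tokens
embeddings = embedding_table[token_ids]   # one row (vector) per token

print(embeddings.shape)                   # (7, 512): n vectors, each of length d_model
```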


@@ -38,8 +38,8 @@ This output is now passed through a *positional encoder*. Broadly, this is usefu
The only thing that matters for now is that each of our numerical representations (vectors) is slightly altered. For the numerical representation of the token ``$\colorbox{red}{ What}$" that we get from our embedding model, it might look something like:

\begin{equation}
- \text{positional encoder}\Bigg(\begin{bmatrix} -0.415 \\\\ -0.514 \\\\ \vdots \\\\ -0.257 \\\\ 0.571 \\\\ \end{bmatrix}\Bigg) =
- \begin{bmatrix} -0.424 \\\\ -0.574 \\\\ \vdots \\\\ -0.235 \\\\ 0.534 \\\\ \end{bmatrix}
+ \text{positional encoder}\Bigg(\begin{pmatrix} -0.415 \\\\ -0.514 \\\\ \vdots \\\\ -0.257 \\\\ 0.571 \\\\ \end{pmatrix}\Bigg) =
+ \begin{pmatrix} -0.424 \\\\ -0.574 \\\\ \vdots \\\\ -0.235 \\\\ 0.534 \\\\ \end{pmatrix}
\end{equation}
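
(The post leaves the choice of positional encoder open. As one concrete possibility, here is a sketch of the sinusoidal scheme from the original Transformer paper; the only point it is meant to illustrate is that the output still has length $d_{\text{model}}$.)

```python
# Sinusoidal positional encoding (one common choice; the post does not commit to a scheme).
import numpy as np

def positional_encoding(pos: int, d_model: int) -> np.ndarray:
    """Return the encoding vector for position `pos`, length d_model (assumed even)."""
    i = np.arange(d_model // 2)
    angles = pos / np.power(10_000, 2 * i / d_model)
    pe = np.zeros(d_model)
    pe[0::2] = np.sin(angles)   # even indices
    pe[1::2] = np.cos(angles)   # odd indices
    return pe

x = np.random.default_rng(0).normal(size=512)           # embedding of one token
encoded = x + positional_encoding(pos=0, d_model=512)    # add the encoding for position 0

print(encoded.shape)   # (512,): still d_model, only the values have been nudged
```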

Importantly, the positional encoder does not alter the length of our vector, $d_{\text{model}}$. It simply tweaks the values slightly. So far, we entered our prompt:
@@ -53,17 +53,17 @@ This was tokenized
Then embedded

$$
- \begin{bmatrix} -0.415 \\\\ -0.514 \\\\ 0.569 \\\\ \vdots \\\\ -0.257 \\\\ 0.571 \\\\ \end{bmatrix}
- \begin{bmatrix} -0.130 \\\\ -0.464 \\\\ 0.23 \\\\ \vdots \\\\ -0.154 \\\\ 0.192 \\\\ \end{bmatrix}
+ \begin{pmatrix} -0.415 \\\\ -0.514 \\\\ 0.569 \\\\ \vdots \\\\ -0.257 \\\\ 0.571 \\\\ \end{pmatrix}
+ \begin{pmatrix} -0.130 \\\\ -0.464 \\\\ 0.23 \\\\ \vdots \\\\ -0.154 \\\\ 0.192 \\\\ \end{pmatrix}
, \dots ,
- \begin{bmatrix} 0.127 \\\\ 0.453 \\\\ 0.110 \\\\ \vdots \\\\ -0.155 \\\\ 0.484 \\\\ \end{bmatrix}
+ \begin{pmatrix} 0.127 \\\\ 0.453 \\\\ 0.110 \\\\ \vdots \\\\ -0.155 \\\\ 0.484 \\\\ \end{pmatrix}
$$

and finally positionally encoded

\begin{equation}
- \text{positional encoder}\Bigg(\begin{bmatrix} -0.415 \\\\ -0.514 \\\\ \vdots \\\\ -0.257 \\\\ 0.571 \\\\ \end{bmatrix}\Bigg) =
- \begin{bmatrix} -0.424 \\\\ -0.574 \\\\ \vdots \\\\ -0.235 \\\\ 0.534 \\\\ \end{bmatrix}
+ \text{positional encoder}\Bigg(\begin{pmatrix} -0.415 \\\\ -0.514 \\\\ \vdots \\\\ -0.257 \\\\ 0.571 \\\\ \end{pmatrix}\Bigg) =
+ \begin{pmatrix} -0.424 \\\\ -0.574 \\\\ \vdots \\\\ -0.235 \\\\ 0.534 \\\\ \end{pmatrix}
\end{equation}

We're now very close to being able to introduce attention. One last thing remains: at this point we will transform the output of our positional encoding into a matrix $M$ as follows
2 changes: 1 addition & 1 deletion content/posts/sliding_window_attention.md
@@ -27,7 +27,7 @@ K_{1d} & K_{2d} & \cdots & K_{nd}
\end{pmatrix}
\end{equation}

- Our goal is to simplify this computation. Instead of letting each token attend to all of the other tokens, we will define a window size $w$. The token we are calculating attention values for will then only get to look at the tokens $\frac{1}{2}w$ either side of it. For our example, we could consider a sliding window of size $2$ which will look $1$ token to either side of the current token. Only the values shaded in \colorbox{olive}{olive} will be calculated.
+ Our goal is to simplify this computation. Instead of letting each token attend to all of the other tokens, we will define a window size $w$. The token we are calculating attention values for will then only get to look at the tokens $\frac{1}{2}w$ either side of it. For our example, we could consider a sliding window of size $2$ which will look $1$ token to either side of the current token. Only the values shaded in $\colorbox{olive}{olive}$ will be calculated.

![Sliding Window Attention Matrix](/img/sliding_window.png)
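
(A small sketch of that masking pattern is below. It builds the mask densely purely for illustration; a real implementation would skip the masked entries entirely, which is where the saving comes from. The names and sizes here are assumptions, not from the post.)

```python
# Build a sliding-window attention mask: token i may only attend to tokens
# within w/2 positions of itself. Masked entries get -inf so softmax zeroes them.
import numpy as np

def sliding_window_mask(n: int, w: int) -> np.ndarray:
    half = w // 2
    idx = np.arange(n)
    allowed = np.abs(idx[:, None] - idx[None, :]) <= half   # banded region
    return np.where(allowed, 0.0, -np.inf)

n, w = 7, 2                                             # 7 tokens, window of 1 token either side
scores = np.random.default_rng(0).normal(size=(n, n))   # stand-in for QK^T / sqrt(d_k)
masked = scores + sliding_window_mask(n, w)

# Row-wise softmax: positions outside the window receive zero weight.
weights = np.exp(masked - masked.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
print(np.round(weights, 2))
```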

