
Commit

add attention matrix, clean up lora
jonah-ramponi committed Mar 30, 2024
1 parent e484b42 commit 3bac1cd
Showing 3 changed files with 5 additions and 5 deletions.
Binary file added content/img/attnm.png
4 changes: 2 additions & 2 deletions content/posts/intro_to_attention.md
@@ -119,7 +119,7 @@ Our matrix $QK^T$ of dimension $(n \times d_{\text{model}}) \times (n \times d_{
We multiply the softmax of the attention matrix by $V$. The softmax re-scales each row of the attention matrix so that it sums to one. The equation for softmax applied to a matrix $X$ is as follows

\begin{equation}
-\text{softmax}(X)_{ij} = \frac{e^{X_{ij}}}{\sum_{k=1}^{n} e^{X_{ik}}}.
+softmax(X)\_{ i j } = \frac{e^{X_{ij}}}{\sum_{k=1}^{n} e^{X_{ik}}}.
\end{equation}
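
As a quick check of the row-wise softmax above, here is a minimal NumPy sketch (an editorial illustration, not part of this commit; the 3x3 score matrix is made up):

```python
import numpy as np

# Toy pre-softmax attention scores: one row per query token (made-up values).
scores = np.array([[2.0, 0.5, 0.1],
                   [0.3, 1.5, 0.2],
                   [0.1, 0.4, 2.2]])

# Row-wise softmax, matching the equation above:
# softmax(X)_ij = exp(X_ij) / sum_k exp(X_ik).
# Subtracting the row max is only for numerical stability; it cancels out.
shifted = scores - scores.max(axis=1, keepdims=True)
attention = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(attention)
print(attention.sum(axis=1))  # each row sums to 1
```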

```python
@@ -153,7 +153,7 @@ Well, our *attention matrix* after softmax has been applied is simply $w$ with $
The attention matrix is a nice thing to visualize. For our toy example, it might look like


-<att-m>
+![Attention Matrix Visualisation](/img/attnm.png)
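
For illustration only (not part of this commit), a rough matplotlib sketch of how such a heat map could be drawn; the tokens and attention weights below are invented, so the post's actual figure will differ:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical tokens and an already-softmaxed attention matrix (rows sum to 1).
tokens = ["the", "cat", "sat"]
attention = np.array([[0.70, 0.20, 0.10],
                      [0.25, 0.60, 0.15],
                      [0.10, 0.30, 0.60]])

fig, ax = plt.subplots()
im = ax.imshow(attention, cmap="viridis")
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens)
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
ax.set_xlabel("Key token (attended to)")
ax.set_ylabel("Query token")
fig.colorbar(im, ax=ax, label="attention weight")
plt.show()
```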

What can we notice about our attention matrix?

6 changes: 3 additions & 3 deletions content/posts/lora.md
@@ -7,7 +7,7 @@ draft: false
tags: [fine-tuning, training]
---

-Let's consider a weight matrix $W$. Typically, the weight matrices in a dense neural networks layers have full-rank. Full-rank means many different things mathematically. I think the easiest explanation of a $d$-dimensional matrix $(let's consider a square matrix, M \in \mathbb{R}^{d,d})$ being full-rank is one in which the columns could be used to span (hit every point) in $d$-dimensional space. If you consider $d=3$, a matrix like
+Let's consider a weight matrix $W$. Typically, the weight matrices in a dense neural network's layers have full rank. Full rank means many different things mathematically. I think the easiest explanation of a $d$-dimensional matrix (let's consider a square matrix $M \in \mathbb{R}^{d,d}$) being full rank is one in which the columns could be used to span (hit every point in) $d$-dimensional space. If you consider $d=3$, a matrix like

\begin{equation}
M = \begin{pmatrix}
@@ -40,7 +40,7 @@ In our square matrix case, we will have a matrix with 1000 rows and 1000 columns

When we're updating the weight matrix, at each step we're figuring out how to slightly alter the values in our matrix. To visualize in a low dimensional case, we're doing something like

-\begin{equation}
+\begin{equation*}
W + \Delta W = \begin{pmatrix}
1 & 0 & 1 \\\\
1 & 0 & 1 \\\\
@@ -54,7 +54,7 @@ When we're updating the weight matrix, at each step we're figuring out how to sl
1.00003 & 0.0002 & 1.00001 \\\\
1.01 & 0.9999 & 0.003
\end{pmatrix}
-\end{equation}
+\end{equation*}

But if $d$ is large, the matrix $\Delta W$ will contain lots of values. We're doing lots of calculations, and this is costly. And importantly, if $W$ has a low intrinsic dimension, we can assume that we may not even need to perform this update to each and every row of $W$. Remember, a matrix having a rank $r < d$ implies that the *information* stored in the matrix could be stored in something with $r$ dimensions instead of $d$.
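
To make that parameter counting concrete, here is a small NumPy sketch (not part of this commit) contrasting a dense $\Delta W$ with a rank-$r$ factorization $BA$ of the kind LoRA uses; $d = 1000$ matches the example above, while $r = 8$ is an arbitrary illustrative choice:

```python
import numpy as np

d, r = 1000, 8  # full dimension vs. an assumed low intrinsic rank
rng = np.random.default_rng(0)

# Dense update: every entry of the d x d matrix is touched -- d * d values.
delta_W_dense = 1e-4 * rng.standard_normal((d, d))

# Low-rank update: parameterize the change as B @ A, storing only 2 * d * r values.
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, d))
delta_W_low_rank = B @ A  # still a d x d matrix, but with rank at most r

print(delta_W_dense.size)                       # 1000000 values
print(B.size + A.size)                          # 16000 values
print(np.linalg.matrix_rank(delta_W_low_rank))  # 8
```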

