diff --git a/content/img/attnm.png b/content/img/attnm.png
new file mode 100644
index 0000000..f409873
Binary files /dev/null and b/content/img/attnm.png differ
diff --git a/content/posts/intro_to_attention.md b/content/posts/intro_to_attention.md
index a0f9822..f435d50 100644
--- a/content/posts/intro_to_attention.md
+++ b/content/posts/intro_to_attention.md
@@ -119,7 +119,7 @@ Our matrix $QK^T$ of dimension $(n \times d_{\text{model}}) \times (n \times d_{
 We multiply the softmax of the attention matrix with each row of $V$. This re-scales each row of the output matrix to sum to one. The equation for softmax applied to a matrix $X$ is as follows

 \begin{equation}
-    \text{softmax}(X)_{ij} = \frac{e^{X_{ij}}}{\sum_{k=1}^{n} e^{X_{ik}}}.
+    \text{softmax}(X)\_{ij} = \frac{e^{X\_{ij}}}{\sum\_{k=1}^{n} e^{X\_{ik}}}.
 \end{equation}

 ```python
@@ -153,7 +153,7 @@ Well, our *attention matrix* after softmax has been applied is simply $w$ with $

 The attention matrix is a nice thing to visualize. For our toy example, it might look like

-
+![Attention Matrix Visualisation](/img/attnm.png)

 What can we notice about our attention matrix?

diff --git a/content/posts/lora.md b/content/posts/lora.md
index 872cb34..50ecfba 100644
--- a/content/posts/lora.md
+++ b/content/posts/lora.md
@@ -7,7 +7,7 @@ draft: false
 tags: [fine-tuning, training]
 ---

-Let's consider a weight matrix $W$. Typically, the weight matrices in a dense neural networks layers have full-rank. Full-rank means many different things mathematically. I think the easiest explanation of a $d$-dimensional matrix $(let's consider a square matrix, M \in \mathbb{R}^{d,d})$ being full-rank is one in which the columns could be used to span (hit every point) in $d$-dimensional space. If you consider $d=3$, a matrix like
+Let's consider a weight matrix $W$. Typically, the weight matrices in a dense neural network's layers have full rank. Full rank means many different things mathematically. I think the easiest explanation of a $d$-dimensional matrix (let's consider a square matrix $M \in \mathbb{R}^{d,d}$) being full-rank is that its columns can be used to span (hit every point in) $d$-dimensional space. If you consider $d=3$, a matrix like

 \begin{equation}
 M = \begin{pmatrix}
@@ -40,7 +40,7 @@ In our square matrix case, we will have a matrix with 1000 rows and 1000 columns

 When we're updating the weight matrix, at each step we're figuring out how to slightly alter the values in our matrix. To visualize in a low dimensional case, we're doing something like

-\begin{equation}
+\begin{equation*}
 W + \Delta W = \begin{pmatrix}
 1 & 0 & 1 \\\\
 1 & 0 & 1 \\\\
@@ -54,7 +54,7 @@ When we're updating the weight matrix, at each step we're figuring out how to sl
 1.00003 & 0.0002 & 1.00001 \\\\
 1.01 & 0.9999 & 0.003
 \end{pmatrix}
-\end{equation}
+\end{equation*}

 But if $d$ is large, the matrix $\Delta W$ will contain lots of values. We're doing lots of calculations, and this is costly. And importantly, if $W$ has a low intrinsic dimension, we can assume that we may not even need to perform this update to each and every row of $W$. Remember, a matrix having a rank $r < d$ implies that the *information* stored in the matrix could be stored in something with $r$ dimensions instead of $d$.
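The softmax hunk in intro_to_attention.md normalises each row of the score matrix so it sums to one. As a minimal standalone sketch of that equation (the post's own ```python block is not shown in this diff), assuming numpy; the function name `softmax_rows` and the example `scores` matrix are invented here for illustration:

```python
import numpy as np

def softmax_rows(X):
    # Subtract the row-wise max for numerical stability; the shift cancels
    # in the ratio, so each row still sums to one.
    shifted = X - X.max(axis=-1, keepdims=True)
    exp_X = np.exp(shifted)
    return exp_X / exp_X.sum(axis=-1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 0.5]])
attn = softmax_rows(scores)
print(attn)
print(attn.sum(axis=-1))  # [1. 1.] -- each row of the attention matrix sums to one
```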
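The lora.md hunks argue that when $W$ has low intrinsic rank, the update $\Delta W$ need not carry all $d \times d$ values. A minimal numpy sketch of that claim, using the post's $d = 1000$ example and an illustrative rank $r = 8$; the names `B`, `A`, and `delta_W` are assumptions for this sketch, not from the post:

```python
import numpy as np

d, r = 1000, 8            # d from the post's 1000 x 1000 example; r is an illustrative low rank

# A "full" update touches every entry of the d x d matrix.
delta_W_full = np.random.randn(d, d) * 1e-4
print(delta_W_full.size)  # 1,000,000 values to compute and store

# A low-rank update factors Delta W as B @ A, with far fewer values.
B = np.random.randn(d, r) * 1e-2
A = np.random.randn(r, d) * 1e-2
delta_W = B @ A           # still d x d, but rank at most r
print(B.size + A.size)    # 16,000 values
print(np.linalg.matrix_rank(delta_W))  # <= 8
```

Storing the pair $(B, A)$ takes $2dr$ values instead of $d^2$, which is the saving the post is pointing at.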