
Commit

add attention matrix, clean up lora
jonah-ramponi committed Mar 30, 2024
1 parent e484b42 commit 3bac1cd
Showing 3 changed files with 5 additions and 5 deletions.
Binary file added content/img/attnm.png
4 changes: 2 additions & 2 deletions content/posts/intro_to_attention.md
@@ -119,7 +119,7 @@ Our matrix $QK^T$ of dimension $(n \times d_{\text{model}}) \times (n \times d_{
We multiply the softmax of the attention matrix by $V$. The softmax re-scales each row of the attention matrix so that it sums to one. The equation for softmax applied to a matrix $X$ is as follows

\begin{equation}
-\text{softmax}(X)_{ij} = \frac{e^{X_{ij}}}{\sum_{k=1}^{n} e^{X_{ik}}}.
+softmax(X)\_{ i j } = \frac{e^{X_{ij}}}{\sum_{k=1}^{n} e^{X_{ik}}}.
\end{equation}
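
As a quick check of the row-wise softmax above, here is a minimal NumPy sketch (an editorial illustration, not part of this commit; the 3x3 score matrix is made up):

```python
import numpy as np

# Toy pre-softmax attention scores: one row per query token (made-up values).
scores = np.array([[2.0, 0.5, 0.1],
                   [0.3, 1.5, 0.2],
                   [0.1, 0.4, 2.2]])

# Row-wise softmax, matching the equation above:
# softmax(X)_ij = exp(X_ij) / sum_k exp(X_ik).
# Subtracting the row max is only for numerical stability; it cancels out.
shifted = scores - scores.max(axis=1, keepdims=True)
attention = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(attention)
print(attention.sum(axis=1))  # each row sums to 1
```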

```python
@@ -153,7 +153,7 @@ Well, our *attention matrix* after softmax has been applied is simply $w$ with $
The attention matrix is a nice thing to visualize. For our toy example, it might look like


-<att-m>
+![Attention Matrix Visualisation](/img/attnm.png)
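
For illustration only (not part of this commit), a rough matplotlib sketch of how such a heat map could be drawn; the tokens and attention weights below are invented, so the post's actual figure will differ:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical tokens and an already-softmaxed attention matrix (rows sum to 1).
tokens = ["the", "cat", "sat"]
attention = np.array([[0.70, 0.20, 0.10],
                      [0.25, 0.60, 0.15],
                      [0.10, 0.30, 0.60]])

fig, ax = plt.subplots()
im = ax.imshow(attention, cmap="viridis")
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens)
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
ax.set_xlabel("Key token (attended to)")
ax.set_ylabel("Query token")
fig.colorbar(im, ax=ax, label="attention weight")
plt.show()
```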

What can we notice about our attention matrix?

6 changes: 3 additions & 3 deletions content/posts/lora.md
@@ -7,7 +7,7 @@ draft: false
tags: [fine-tuning, training]
---

-Let's consider a weight matrix $W$. Typically, the weight matrices in a dense neural networks layers have full-rank. Full-rank means many different things mathematically. I think the easiest explanation of a $d$-dimensional matrix $(let's consider a square matrix, M \in \mathbb{R}^{d,d})$ being full-rank is one in which the columns could be used to span (hit every point) in $d$-dimensional space. If you consider $d=3$, a matrix like
+Let's consider a weight matrix $W$. Typically, the weight matrices in a dense neural network's layers have full rank. Full rank means many different things mathematically. I think the easiest explanation of a $d$-dimensional matrix (let's consider a square matrix $M \in \mathbb{R}^{d,d}$) being full rank is one in which the columns could be used to span (hit every point in) $d$-dimensional space. If you consider $d=3$, a matrix like

\begin{equation}
M = \begin{pmatrix}
@@ -40,7 +40,7 @@ In our square matrix case, we will have a matrix with 1000 rows and 1000 columns

When we're updating the weight matrix, at each step we're figuring out how to slightly alter the values in our matrix. To visualize in a low dimensional case, we're doing something like

-\begin{equation}
+\begin{equation*}
W + \Delta W = \begin{pmatrix}
1 & 0 & 1 \\\\
1 & 0 & 1 \\\\
@@ -54,7 +54,7 @@ When we're updating the weight matrix, at each step we're figuring out how to sl
1.00003 & 0.0002 & 1.00001 \\\\
1.01 & 0.9999 & 0.003
\end{pmatrix}
-\end{equation}
+\end{equation*}

But if $d$ is large, the matrix $\Delta W$ will contain lots of values. We're doing lots of calculations, and this is costly. And importantly, if $W$ has a low intrinsic dimension, we can assume that we may not even need to perform this update to each and every row of $W$. Remember, a matrix having a rank $r < d$ implies that the *information* stored in the matrix could be stored in something with $r$ dimensions instead of $d$.
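
To make that parameter counting concrete, here is a small NumPy sketch (not part of this commit) contrasting a dense $\Delta W$ with a rank-$r$ factorization $BA$ of the kind LoRA uses; $d = 1000$ matches the example above, while $r = 8$ is an arbitrary illustrative choice:

```python
import numpy as np

d, r = 1000, 8  # full dimension vs. an assumed low intrinsic rank
rng = np.random.default_rng(0)

# Dense update: every entry of the d x d matrix is touched -- d * d values.
delta_W_dense = 1e-4 * rng.standard_normal((d, d))

# Low-rank update: parameterize the change as B @ A, storing only 2 * d * r values.
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, d))
delta_W_low_rank = B @ A  # still a d x d matrix, but with rank at most r

print(delta_W_dense.size)                       # 1000000 values
print(B.size + A.size)                          # 16000 values
print(np.linalg.matrix_rank(delta_W_low_rank))  # 8
```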

