Update intro_to_attention.md
jonah-ramponi committed Apr 3, 2024
1 parent 04ad338 commit d2549d9
7 changes: 6 additions & 1 deletion content/posts/intro_to_attention.md
@@ -198,7 +198,12 @@ The overall output of the process is then simply
Concat() concatenates the output matrices of the heads. Each head produces an $(n \times d_v)$ matrix, and these are placed side by side (concatenated along the column dimension) like so

\begin{equation*}
\text{Concat}(\text{head}_1, \dots, \text{head}_H) = \begin{pmatrix}
\text{head}_{1_{11}} & \dots & \text{head}_{1_{1d_v}} & \dots & \text{head}_{H_{11}} & \dots & \text{head}_{H_{1d_v}} \\
\text{head}_{1_{21}} & \dots & \text{head}_{1_{2d_v}} & \dots & \text{head}_{H_{21}} & \dots & \text{head}_{H_{2d_v}} \\
\vdots & \ddots & \vdots & & \vdots & \ddots & \vdots \\
\text{head}_{1_{n1}} & \dots & \text{head}_{1_{nd_v}} & \dots & \text{head}_{H_{n1}} & \dots & \text{head}_{H_{nd_v}} \\
\end{pmatrix}
\end{equation*}

This output has dimension $(n \times H d_v)$. We still have $n$ rows, but each row now holds $H$ different $d_v$-dimensional representations. The output projection $W^O$ is another trainable weight matrix, with dimensions $(H d_v \times d_{\text{model}})$. Therefore, multiplying $\text{Concat}(\text{head}_1, \dots, \text{head}_H)$ by $W^O$ yields a matrix of dimension $(n \times d_{\text{model}})$.
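As a quick sanity check of these dimensions, here is a small NumPy sketch. The concrete sizes ($n$, $H$, $d_v$, $d_{\text{model}}$) are illustrative choices, not values from the post:

```python
import numpy as np

# Illustrative sizes (assumptions, not from the post):
# n tokens, H heads, d_v values per head, d_model model width.
n, H, d_v, d_model = 4, 8, 16, 128

# One (n x d_v) output matrix per head.
heads = [np.random.rand(n, d_v) for _ in range(H)]

# Concat() places the head outputs side by side along the columns,
# giving an (n x H*d_v) matrix.
concat = np.concatenate(heads, axis=1)
assert concat.shape == (n, H * d_v)

# Trainable output projection W^O of shape (H*d_v x d_model).
W_O = np.random.rand(H * d_v, d_model)

# Final multi-head attention output has shape (n x d_model).
output = concat @ W_O
assert output.shape == (n, d_model)
```

Note that because we concatenate along the columns, the number of rows $n$ never changes; only the per-row feature width grows from $d_v$ to $H d_v$ before $W^O$ maps it back to $d_{\text{model}}$.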
