Created on : 2025-01-09
References: Andrej Karpathy's video
- Used BigramLanguageModel. Encode inputs (char level), add positional encoding
- Used multinomial so as to not always select the index with highest probability.
- Implemented single Attention head.
- Multi Headed Attention
- Add feedforward layers
- Add Layer Norm
- Use multiple Transformer Blocks to scale up the architecture (see the Block sketch below).
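A minimal sketch of one such Block (pre-norm residual wiring), with illustrative hyperparameters (n_embd=64, n_head=4, 4 blocks); PyTorch's built-in nn.MultiheadAttention is used here only as a stand-in for the hand-rolled multi-head attention:

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Per-token MLP: expand, apply a non-linearity, project back down."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Transformer block: self-attention (communication) then feedforward (computation)."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        # stand-in for the hand-rolled multi-head attention module
        self.sa = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        T = x.size(1)
        # causal mask: True marks future positions a token is not allowed to attend to
        causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.sa(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out                       # residual connection around attention
        x = x + self.ffwd(self.ln2(x))         # residual connection around feedforward
        return x

# stack several blocks to scale up the architecture
blocks = nn.Sequential(*[Block(n_embd=64, n_head=4) for _ in range(4)])
x = torch.randn(2, 8, 64)                      # (batch, time, channels)
print(blocks(x).shape)                         # torch.Size([2, 8, 64])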
We want tokens before t to communicate with token t. The simplest approach is to average all the previous tokens together with the current one. This won't be a great implementation of attention, but it is simple.
- Doing this with loops is slow (looping over every batch and timestep to compute the averages); the math trick is a matrix multiply with a lower triangular matrix (see the sketch below).
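A small sketch of the trick, with illustrative sizes (B, T, C) = (4, 8, 2): the loop version and the matrix-multiply version produce the same running averages.

import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2                                # batch, time, channels
x = torch.randn(B, T, C)

# version 1: explicit loops -- average x[b, :t+1] for every (b, t)
xbow = torch.zeros(B, T, C)
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, :t + 1].mean(dim=0)

# version 2: matrix multiply with a lower triangular matrix
tril = torch.tril(torch.ones(T, T))
wei = tril / tril.sum(dim=1, keepdim=True)       # each row sums to 1 -> rows average over the past
xbow2 = wei @ x                                  # (T, T) @ (B, T, C) -> (B, T, C)

print(torch.allclose(xbow, xbow2))               # True (up to floating point tolerance)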
- We can build the same kind of lower triangular weight matrix, which averages over the previous timesteps, with the help of the softmax function:
# use softmax for self attention
import torch

T = 8                                            # context length (illustrative)
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros(T, T)                          # all zeros: no token preference yet; these weights get learned during training
wei = wei.masked_fill(tril == 0, float('-inf')) # the future does not communicate: -inf entries become 0 after softmax
wei = torch.softmax(wei, dim=-1)                 # each row is now a uniform average over the past
- Once training occurs, the weights adjust depending on which tokens each token finds interesting.
- Initially, since all the weights are zero, softmax distributes them as a uniform average; once the weights are learned, the attention differs based on weight magnitude.
- Every token emits two vectors, a query and a key. The query is roughly "what am I looking for" and the key is roughly "what do I contain".
- To get the attention weights, take the dot product of my query with all the other keys. If they align, we get a higher value (see the single-head sketch below).
Query ---> Here is what I am interested in.
Key ---> Here is what I have.
Value ---> Here is what I will communicate to you if you find me interesting.
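A sketch of a single self-attention head along these lines (module and variable names, plus the sizes in the usage example, are illustrative):

import torch
import torch.nn as nn

class Head(nn.Module):
    """One head of causal self-attention."""
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)                                       # (B, T, head_size)  "here is what I have"
        q = self.query(x)                                     # (B, T, head_size)  "here is what I am interested in"
        v = self.value(x)                                     # (B, T, head_size)  "here is what I will communicate"
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, T, T) affinities, scaled by 1/sqrt(head_size)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # mask out the future
        wei = torch.softmax(wei, dim=-1)
        return wei @ v                                        # (B, T, head_size)

head = Head(n_embd=32, head_size=16, block_size=8)
x = torch.randn(4, 8, 32)                                     # (batch, time, channels)
print(head(x).shape)                                          # torch.Size([4, 8, 16])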
- Attention is a communication mechanism.
- In an encoder block, all tokens communicate with each other, so attention is calculated with every token in the input (e.g. we would want this in sentiment analysis).
- Decoder block: mask with the lower triangular matrix so future information is not allowed in.
- Self-attention: all three of key, query, and value are calculated from the same vector x; that's why it's called self-attention (the nodes attend to themselves). The same source is used to produce keys and values as well as queries.
- Cross-attention: the query vector is derived from x (the input vector), but the keys and values come from a different input source (see the cross-attention sketch below).
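A minimal cross-attention sketch (the class name, sizes, and the context tensor are illustrative; in an encoder-decoder transformer the keys and values would come from the encoder output):

import torch
import torch.nn as nn

class CrossAttentionHead(nn.Module):
    """One attention head where queries come from x, but keys and values come from a separate context."""
    def __init__(self, n_embd, head_size):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

    def forward(self, x, context):
        q = self.query(x)                                     # (B, Tx, head_size)  from the input sequence
        k = self.key(context)                                 # (B, Tc, head_size)  from the other source
        v = self.value(context)                               # (B, Tc, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, Tx, Tc), scaled
        wei = torch.softmax(wei, dim=-1)
        return wei @ v                                        # (B, Tx, head_size)

head = CrossAttentionHead(n_embd=32, head_size=16)
x = torch.randn(4, 8, 32)                                     # queries come from here
context = torch.randn(4, 10, 32)                              # keys and values come from here
print(head(x, context).shape)                                 # torch.Size([4, 8, 16])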
Scaled attention: we initialize query, key, and value.
- At initialization, the query, key, and value vectors have mu = 0, var = 1. After doing query @ key we get outputs with variance equal to the head size; dividing by sqrt(head_size) normalizes the outputs back to var = 1.
- If these values have a large variance, one value in a vector might become large and cause the output of softmax to be very peaky rather than softly distributed, which would let a token interact with only a single other token.
- Without the division, softmax ---> one-hot encoding (too peaky), like aggregating information from a single node (see the numerical check below).
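A quick numerical check of both points (sizes and the example logits are illustrative):

import torch

torch.manual_seed(0)
B, T, head_size = 4, 8, 16
q = torch.randn(B, T, head_size)                 # unit-variance queries
k = torch.randn(B, T, head_size)                 # unit-variance keys

wei = q @ k.transpose(-2, -1)                    # raw affinities
print(wei.var())                                 # roughly head_size (~16)

wei_scaled = wei * head_size ** -0.5             # divide by sqrt(head_size)
print(wei_scaled.var())                          # roughly 1

# peakiness: scaling logits up drives softmax toward a one-hot distribution
logits = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
print(torch.softmax(logits, dim=-1))             # softly distributed
print(torch.softmax(logits * 8, dim=-1))         # sharpens toward the largest entry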