Created on : 2025-01-09
References: Andrej Karpathy's video
- Used BigramLanguageModel. Encode inputs (char level), add positional encoding
- Used multinomial so as to not always select the index with highest probability.
- Implemented single Attention head.
- Multi Headed Attention
- Add feedforward layers
- Add Layer Norm
- Use multiple Transformer Blocks to scale up the architecture (see the Block sketch below).
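A minimal sketch of one such Block (pre-norm residual wiring), with illustrative hyperparameters (n_embd=64, n_head=4, 4 blocks); PyTorch's built-in nn.MultiheadAttention is used here only as a stand-in for the hand-rolled multi-head attention:

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Per-token MLP: expand, apply a non-linearity, project back down."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Transformer block: self-attention (communication) then feedforward (computation)."""
    def __init__(self, n_embd, n_head):
        super().__init__()
        # stand-in for the hand-rolled multi-head attention module
        self.sa = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        T = x.size(1)
        # causal mask: True marks future positions a token is not allowed to attend to
        causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.sa(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out                       # residual connection around attention
        x = x + self.ffwd(self.ln2(x))         # residual connection around feedforward
        return x

# stack several blocks to scale up the architecture
blocks = nn.Sequential(*[Block(n_embd=64, n_head=4) for _ in range(4)])
x = torch.randn(2, 8, 64)                      # (batch, time, channels)
print(blocks(x).shape)                         # torch.Size([2, 8, 64])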
We want tokens before t to communicate with token t. The simplest approach is to average all the previous tokens together with the current one. This won't be a great implementation of attention, but it is simple.
- Doing this with loops is slow (looping over every batch and timestep to compute the averages); the math trick is a matrix multiply with a lower triangular matrix (see the sketch below).
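A small sketch of the trick, with illustrative sizes (B, T, C) = (4, 8, 2): the loop version and the matrix-multiply version produce the same running averages.

import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2                                # batch, time, channels
x = torch.randn(B, T, C)

# version 1: explicit loops -- average x[b, :t+1] for every (b, t)
xbow = torch.zeros(B, T, C)
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, :t + 1].mean(dim=0)

# version 2: matrix multiply with a lower triangular matrix
tril = torch.tril(torch.ones(T, T))
wei = tril / tril.sum(dim=1, keepdim=True)       # each row sums to 1 -> rows average over the past
xbow2 = wei @ x                                  # (T, T) @ (B, T, C) -> (B, T, C)

print(torch.allclose(xbow, xbow2))               # True (up to floating point tolerance)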
- We can build the same kind of lower triangular weight matrix, which averages over the previous timesteps, with the help of the softmax function:
# use softmax for self attention
import torch

T = 8                                            # context length (illustrative)
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros(T, T)                          # all zeros: no token preference yet; these weights get learned during training
wei = wei.masked_fill(tril == 0, float('-inf')) # the future does not communicate: -inf entries become 0 after softmax
wei = torch.softmax(wei, dim=-1)                 # each row is now a uniform average over the past
- Once training occurs, the weights adjust depending on which tokens each token finds interesting.
- Initially, since all the weights are zero, softmax distributes them as a uniform average; once the weights are learned, the attention differs based on weight magnitude.
- Every token emits two vectors, a query and a key. The query is roughly "what am I looking for" and the key is roughly "what do I contain".
- To get the attention weights, take the dot product of my query with all the other keys. If they align, we get a higher value (see the single-head sketch below).
Query ---> Here is what I am interested in.
Key ---> Here is what I have.
Value ---> Here is what I will communicate to you if you find me interesting.
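A sketch of a single self-attention head along these lines (module and variable names, plus the sizes in the usage example, are illustrative):

import torch
import torch.nn as nn

class Head(nn.Module):
    """One head of causal self-attention."""
    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)                                       # (B, T, head_size)  "here is what I have"
        q = self.query(x)                                     # (B, T, head_size)  "here is what I am interested in"
        v = self.value(x)                                     # (B, T, head_size)  "here is what I will communicate"
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, T, T) affinities, scaled by 1/sqrt(head_size)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # mask out the future
        wei = torch.softmax(wei, dim=-1)
        return wei @ v                                        # (B, T, head_size)

head = Head(n_embd=32, head_size=16, block_size=8)
x = torch.randn(4, 8, 32)                                     # (batch, time, channels)
print(head(x).shape)                                          # torch.Size([4, 8, 16])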
- Attention is a communication mechanism.
- In an encoder block, all tokens communicate with each other, so attention is calculated with every token in the input (e.g. we would want this in sentiment analysis).
- Decoder block: mask with the lower triangular matrix so future information is not allowed in.
- Self-attention: all three of key, query, and value are calculated from the same vector x; that's why it's called self-attention (the nodes attend to themselves). The same source is used to produce keys and values as well as queries.
- Cross-attention: the query vector is derived from x (the input vector), but the keys and values come from a different input source (see the cross-attention sketch below).
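A minimal cross-attention sketch (the class name, sizes, and the context tensor are illustrative; in an encoder-decoder transformer the keys and values would come from the encoder output):

import torch
import torch.nn as nn

class CrossAttentionHead(nn.Module):
    """One attention head where queries come from x, but keys and values come from a separate context."""
    def __init__(self, n_embd, head_size):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

    def forward(self, x, context):
        q = self.query(x)                                     # (B, Tx, head_size)  from the input sequence
        k = self.key(context)                                 # (B, Tc, head_size)  from the other source
        v = self.value(context)                               # (B, Tc, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, Tx, Tc), scaled
        wei = torch.softmax(wei, dim=-1)
        return wei @ v                                        # (B, Tx, head_size)

head = CrossAttentionHead(n_embd=32, head_size=16)
x = torch.randn(4, 8, 32)                                     # queries come from here
context = torch.randn(4, 10, 32)                              # keys and values come from here
print(head(x, context).shape)                                 # torch.Size([4, 8, 16])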
Scaled attention: we initialize query, key, and value.
- At initialization, the query, key, and value vectors have mu = 0, var = 1. After doing query @ key we get outputs with variance equal to the head size; dividing by sqrt(head_size) normalizes the outputs back to var = 1.
- If these values have a large variance, one value in a vector might become large and cause the output of softmax to be very peaky rather than softly distributed, which would let a token interact with only a single other token.
- Without the division, softmax ---> one-hot encoding (too peaky), like aggregating information from a single node (see the numerical check below).
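A quick numerical check of both points (sizes and the example logits are illustrative):

import torch

torch.manual_seed(0)
B, T, head_size = 4, 8, 16
q = torch.randn(B, T, head_size)                 # unit-variance queries
k = torch.randn(B, T, head_size)                 # unit-variance keys

wei = q @ k.transpose(-2, -1)                    # raw affinities
print(wei.var())                                 # roughly head_size (~16)

wei_scaled = wei * head_size ** -0.5             # divide by sqrt(head_size)
print(wei_scaled.var())                          # roughly 1

# peakiness: scaling logits up drives softmax toward a one-hot distribution
logits = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
print(torch.softmax(logits, dim=-1))             # softly distributed
print(torch.softmax(logits * 8, dim=-1))         # sharpens toward the largest entry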