- This project is just a self-test: an attempt to recreate the GPT-2 model architecture from memory
- Will also try to add a PyTorch `DataLoader` (later) in place of the custom data loading followed in the initial tutorial; a rough sketch follows this list
- Not currently intending to set up a training loop in this project; if I do, I will add a validation split too
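A minimal sketch of what the `DataLoader` swap might look like, assuming the corpus is already tokenized into a single 1-D tensor; the class name `TokenDataset` and the hyperparameter values are placeholders, not part of the tutorial code:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TokenDataset(Dataset):
    """Serves (input, target) windows of length block_size from one long token tensor."""

    def __init__(self, tokens: torch.Tensor, block_size: int):
        self.tokens = tokens
        self.block_size = block_size

    def __len__(self):
        # last valid start index still leaves room for the shifted target
        return len(self.tokens) - self.block_size

    def __getitem__(self, idx):
        x = self.tokens[idx : idx + self.block_size]
        y = self.tokens[idx + 1 : idx + self.block_size + 1]  # next-token targets
        return x, y

# usage:
# loader = DataLoader(TokenDataset(tokens, block_size=1024), batch_size=8, shuffle=True)
```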
Partial Success
- Could remember the overall architecture (the initial attempt missed the final layer norm and the `lm_head`)
- Divided attention into separate `MultiHeadAttention` and `Head` modules in the initial try
- This is okay, but having all heads operate as a single batched matrix operation instead of looping over a list of head modules is more efficient
- Keeping a list of heads also deviates from the structure of the original model
- Could not remember the code for the registered buffer that masks out future positions in the attention scores (both points are covered in the attention sketch after this list)
- Missed adding the residual connections in `Block` on the first try
- Not a very bad miss; I would have added them if I had kept a diagram near me
- Also need the normalization because of the effect the multiple residual connections have on the variance of the activations (see the `Block` sketch below)
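A sketch of the batched attention, covering both the single-matrix-operation point and the mask buffer. It assumes `n_embd` is divisible by `n_head`; the `c_attn`/`c_proj` names follow the usual GPT-2 conventions, but the rest is illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # q, k, v for all heads in one matmul
        self.c_proj = nn.Linear(n_embd, n_embd)      # output projection
        # the buffer I forgot: a lower-triangular causal mask, registered so it moves
        # with the module (device/dtype) but is not a trainable parameter
        self.register_buffer(
            "bias",
            torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size),
        )

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # fold the heads into the batch dimension: (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
        # zeros in the mask are future positions; -inf makes softmax give them zero weight
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
        y = F.softmax(att, dim=-1) @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # re-merge the heads
        return self.c_proj(y)
```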
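And a sketch of `Block` with the residual connections restored, assuming GPT-2's pre-norm layout (layer norm before each sub-layer); it reuses `CausalSelfAttention` from the sketch above, and `MLP` stands in for the feed-forward sub-layer:

```python
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, n_embd: int):
        super().__init__()
        self.c_fc = nn.Linear(n_embd, 4 * n_embd)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * n_embd, n_embd)

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))

class Block(nn.Module):
    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, block_size)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = MLP(n_embd)

    def forward(self, x):
        # residuals: each sub-layer adds into the stream instead of replacing it
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
```

On the variance point: because every block adds into the same residual stream, activation variance grows with depth. The pre-norm layer norms renormalize the input to each sub-layer, and the GPT-2 paper additionally scales the initial weights of the residual projections by 1/sqrt(N), where N is the number of residual layers.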
Fixed all diffs for the basic model