This is just a list of things we might want to experiment with.
- Explore how the loss over sequence position changes during training
  - Early in training, the loss should be roughly constant across sequence positions
  - We are looking for ‘in-context learning’, where the loss decreases as the sequence position grows (see the sketch below)
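One way to measure this is to log the mean next-token loss at each sequence position and watch how the curve changes over training. A minimal PyTorch sketch, assuming a decoder-style `model` that maps `[batch, seq]` token ids to `[batch, seq, vocab]` logits (the function name is only illustrative):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def loss_per_position(model, tokens):
    """Mean next-token cross-entropy at each position of `tokens` [batch, seq].

    A flat curve means extra context is not helping; a curve that falls as
    the position grows is the in-context-learning signal described above.
    """
    logits = model(tokens)[:, :-1]                      # predicts tokens[:, 1:]
    logprobs = F.log_softmax(logits, dim=-1)
    targets = tokens[:, 1:].unsqueeze(-1)
    nll = -logprobs.gather(-1, targets).squeeze(-1)     # [batch, seq-1]
    return nll.mean(dim=0)                              # loss at each position
```

Logging this curve every few hundred steps should make the ‘starts constant, then bends down with position’ behaviour directly visible.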
- Root of the residual stream
  - Could the residual stream start with a learned value instead of the embedding of the last token? Would this allow the residual stream size to change throughout the model? (see the sketch below)
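A sketch of one possible reading of this idea: the residual stream begins from a learned vector and the token embedding is projected into it, which also decouples the embedding width from the residual width. All names here are illustrative, not an existing API:

```python
import torch
import torch.nn as nn

class LearnedResidualInit(nn.Module):
    """Start the residual stream from a learned vector rather than the token
    embedding directly; the embedding is projected in, so d_embed need not
    equal the residual width d_resid."""

    def __init__(self, vocab_size: int, d_embed: int, d_resid: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_embed)
        self.init_value = nn.Parameter(torch.zeros(d_resid))  # learned start of the stream
        self.proj_in = nn.Linear(d_embed, d_resid)

    def forward(self, tokens):                                 # tokens: [batch, seq]
        resid = self.init_value.expand(*tokens.shape, -1)      # [batch, seq, d_resid]
        return resid + self.proj_in(self.embed(tokens))
```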
- Experiment with different layer sizes
  - Does every layer need to be the same size? If the first layers are parsing the sentence and the last layers are generating it, could they be smaller?
- Model dimension
  - Smaller dimensions in the early layers as the sentence is parsed, larger dimensions in the middle for broader context, and smaller dimensions at the end for sentence generation (see the sketch below)
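A sketch of how a per-layer width schedule could be wired up, with a linear projection wherever the residual width changes between consecutive blocks. `make_block` is an assumed factory that returns a transformer block of the given width:

```python
import torch.nn as nn

class VariableWidthStack(nn.Module):
    """Decoder blocks whose residual width follows a schedule, e.g.
    narrow -> wide -> narrow, with a projection wherever the width changes."""

    def __init__(self, widths, make_block):
        super().__init__()
        self.blocks = nn.ModuleList(make_block(d) for d in widths)
        self.bridges = nn.ModuleList(
            nn.Identity() if d_in == d_out else nn.Linear(d_in, d_out)
            for d_in, d_out in zip(widths[:-1], widths[1:])
        )

    def forward(self, x):                        # x: [batch, seq, widths[0]]
        for i, block in enumerate(self.blocks):
            x = block(x)
            if i < len(self.bridges):
                x = self.bridges[i](x)           # resize the residual stream
        return x

# e.g. widths = [256, 256, 512, 512, 512, 256, 256]   # small-large-small
```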
- Context window
  - Smaller context in the early layers for sentence parsing, larger context in the middle, and smaller context at the end (see the sketch below)
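One way to prototype this is a per-layer sliding-window causal mask, so early and late layers attend only to nearby tokens while the middle layers see a much longer window. A small sketch (boolean `True` = may attend), assuming the attention implementation accepts such a mask:

```python
import torch

def local_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean [seq, seq] mask: position i may attend to j iff i - window < j <= i."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

# Example schedule: small-large-small context per layer.
windows = [64, 64, 1024, 1024, 64, 64]
masks = [local_causal_mask(1024, w) for w in windows]  # pass masks[k] to layer k's attention
```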
- Weight sharing
  - Is it possible to share weights between layers? Could some form of recursion occur? Could the same weights be applied to the residual stream, after state has been added, to perform the next step of the prediction? (see the sketch below)
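Weight sharing across depth has precedent (ALBERT shares one block across all layers; the Universal Transformer applies one block recurrently). A minimal sketch of the recursive variant, where a single set of weights is applied repeatedly to the residual stream; `block` is an assumed transformer block mapping `[batch, seq, d]` to itself:

```python
import torch.nn as nn

class RecurrentDepth(nn.Module):
    """One set of block weights applied n_steps times to the residual stream."""

    def __init__(self, block: nn.Module, n_steps: int):
        super().__init__()
        self.block = block
        self.n_steps = n_steps

    def forward(self, x):
        for _ in range(self.n_steps):
            x = self.block(x)        # same weights reused at every step
        return x
```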
- Quality of dataset
  - Observe the effect of training on different datasets (see the sketch below)
    - TinyStories
    - WebText
    - Books
    - WikiText
    - Dolma
    - SQuAD
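A small sketch of swapping corpora with the Hugging Face `datasets` library. The Hub identifiers below are assumptions that should be verified; WebText itself is not public, so OpenWebText is the usual stand-in, Dolma may be gated, and a books corpus is omitted here:

```python
from datasets import load_dataset

# (path, config) pairs; identifiers are assumptions to check on the Hub.
CORPORA = {
    "tinystories": ("roneneldan/TinyStories", None),
    "openwebtext": ("openwebtext", None),              # stand-in for WebText
    "wikitext":    ("wikitext", "wikitext-103-raw-v1"),
    "dolma":       ("allenai/dolma", None),            # may require license acceptance
    "squad":       ("squad", None),
}

def get_corpus(name: str, split: str = "train"):
    path, config = CORPORA[name]
    return load_dataset(path, config, split=split)
```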
- If we define knowledge as the facts, and reasoning as the processing of those facts to drive inference, can we make a smaller model that is able to reason using external facts, supplied via RAG, a LoRA, or the context window?
  - Could the relevant facts be predicted and loaded automatically based on the context?
- Keep the knowledge out of the model
  - Train the model using a preloaded context: the first part of the input is the preloaded context, and the second part can be derived using only general reasoning and that context. This is both a training methodology and a new dataset (see the sketch below)
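A sketch of the training step this implies: concatenate the preloaded context with the continuation and compute the language-model loss only on the continuation, so the gradient rewards reasoning over the supplied facts rather than memorising them. Function and argument names are illustrative:

```python
import torch
import torch.nn.functional as F

def preloaded_context_loss(model, context_ids, target_ids):
    """LM loss on the continuation only.

    context_ids: [batch, c] preloaded facts; target_ids: [batch, t] continuation.
    Context positions are masked out with ignore_index so they carry no loss.
    """
    tokens = torch.cat([context_ids, target_ids], dim=1)
    logits = model(tokens)[:, :-1]                         # predict the next token
    labels = tokens[:, 1:].clone()
    labels[:, : context_ids.shape[1] - 1] = -100           # ignore the context portion
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1), ignore_index=-100
    )
```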
- Distill the knowledge out of the model
  - Starting from a pretrained model, create a loss function that ‘forgets’ knowledge when it is available externally (RAG, a LoRA, or the context window). This migrates knowledge out of the model, and what remains can presumably be distilled into a smaller model (see the sketch below)
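This objective is underspecified, so the following is only one possible reading: keep the normal LM loss when the fact is present in the context, and gently push the loss up on the same target when the context is absent, nudging the weights to rely on the external source rather than stored knowledge. Purely a sketch; the weighting and batching are placeholders:

```python
import torch.nn.functional as F

def lm_loss(model, tokens):
    """Plain next-token cross-entropy over a [batch, seq] tensor of ids."""
    logits = model(tokens)[:, :-1]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))

def forget_if_external_loss(model, with_context, without_context, lam: float = 0.1):
    """Minimise the loss when the fact is supplied in the context, while raising
    it (negative weight) when the fact is absent, so knowledge migrates out of
    the weights and onto the external source (RAG / LoRA / context window)."""
    return lm_loss(model, with_context) - lam * lm_loss(model, without_context)
```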
- Test out ‘Grant Descent’ aka ‘Tangent Descent’ on a language model