A smaller implementation of karpathy's nanoGPT with a few changes; minimal sketches of most of them follow the list:
- NEFTune
- Grouped Query Attention
- Beam search
- RMSNorm
- Knowledge distillation (student-teacher learning)
- Cosine embedding loss
- Attention mask & padding
- DDP
- Mixed-precision training
- DPO
- Quantization: NF4, 1-bit
- SLERP (spherical linear interpolation)
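
NEFTune adds uniform noise to the token embeddings during fine-tuning. A minimal sketch, assuming embeddings of shape `(batch, seq_len, dim)`; the function name and default `alpha` are illustrative, not this repo's exact API:

```python
import math
import torch

def neftune(embeddings: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add Uniform(-1, 1) noise scaled by alpha / sqrt(seq_len * dim) to the
    embedding output (NEFTune). Apply during fine-tuning only, not at eval."""
    _, seq_len, dim = embeddings.shape
    scale = alpha / math.sqrt(seq_len * dim)
    noise = torch.empty_like(embeddings).uniform_(-1.0, 1.0) * scale
    return embeddings + noise
```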
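
Grouped Query Attention shares each key/value head across several query heads, shrinking the KV projections (and the KV cache). A sketch assuming a causal decoder block; module and projection names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """n_head query heads share n_kv_head key/value heads."""
    def __init__(self, dim: int, n_head: int, n_kv_head: int):
        super().__init__()
        assert n_head % n_kv_head == 0
        self.n_head, self.n_kv_head = n_head, n_kv_head
        self.head_dim = dim // n_head
        self.wq = nn.Linear(dim, n_head * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_head * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_head * self.head_dim, bias=False)
        self.wo = nn.Linear(n_head * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(B, T, self.n_kv_head, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(B, T, self.n_kv_head, self.head_dim).transpose(1, 2)
        # repeat each KV head so every query head has a matching KV head
        rep = self.n_head // self.n_kv_head
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(y.transpose(1, 2).reshape(B, T, -1))
```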
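
Beam search keeps the `beam_width` highest-scoring partial sequences instead of sampling a single one. A sketch that assumes `model(tokens)` returns logits of shape `(B, T, V)` (nanoGPT-style models may return a `(logits, loss)` tuple instead):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def beam_search(model, idx: torch.Tensor, max_new_tokens: int, beam_width: int = 4):
    """Expand each beam by its top-k next tokens, keep the best beam_width
    candidates by cumulative log-prob, and return the best full sequence."""
    beams = [(idx, 0.0)]  # (tokens of shape (1, T), cumulative log-prob)
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score in beams:
            logp = F.log_softmax(model(tokens)[:, -1, :], dim=-1)  # (1, V)
            topv, topi = logp.topk(beam_width)
            for v, i in zip(topv[0], topi[0]):
                candidates.append(
                    (torch.cat([tokens, i.view(1, 1)], dim=1), score + v.item()))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]
```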
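
RMSNorm rescales activations by their root-mean-square with a learned gain, dropping LayerNorm's mean-centering and bias. A minimal sketch:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # normalize by the RMS over the feature dimension, then apply the gain
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight
```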
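
Knowledge distillation trains the student against temperature-softened teacher logits; the cosine embedding loss additionally pulls student hidden states toward the teacher's. A sketch assuming matching vocab and hidden sizes (or a projection applied upstream); the weighting `alpha` and temperature `T` are illustrative defaults:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      T: float = 2.0, alpha: float = 0.5):
    """KL(teacher || student) on softened logits plus a cosine embedding loss
    on hidden states. Logits: (N, vocab); hidden states: (N, dim), where N is
    batch * seq_len after flattening."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale to keep gradient magnitudes comparable across T
    # target = 1 means "make these pairs similar" (minimize 1 - cos_sim)
    target = torch.ones(student_hidden.size(0), device=student_hidden.device)
    cos = F.cosine_embedding_loss(student_hidden, teacher_hidden, target)
    return alpha * kd + (1 - alpha) * cos
```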
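
Padding lets ragged sequences share a batch; the attention mask keeps pad tokens from being attended to, combined here with the usual causal mask. A sketch producing the boolean mask format `F.scaled_dot_product_attention` accepts (`True` = may attend):

```python
import torch

def build_attn_mask(pad_mask: torch.Tensor) -> torch.Tensor:
    """pad_mask: (B, T) bool, True where the token is real (not padding).
    Returns (B, 1, T, T) bool combining causal and key-padding masks.
    Loss on pad positions should still be masked out separately."""
    _, T = pad_mask.shape
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=pad_mask.device))
    return causal[None, None, :, :] & pad_mask[:, None, None, :]
```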
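
Mixed precision runs the forward/backward pass in a low-precision dtype inside an autocast region, with loss scaling to keep fp16 gradients from underflowing. A self-contained sketch with a stand-in model (fp16 shown; bf16 doesn't need the scaler):

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 64).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()  # loss scaling for fp16

for step in range(10):
    x = torch.randn(8, 64, device="cuda")
    opt.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()  # stand-in for the LM loss
    scaler.scale(loss).backward()  # scale up loss so fp16 grads don't underflow
    scaler.step(opt)               # unscales grads, skips step on inf/nan
    scaler.update()
```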
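
DPO optimizes preferences directly from (chosen, rejected) completion pairs, using a frozen reference model instead of a learned reward model. A sketch of the loss; inputs are per-sequence summed log-probs of shape `(batch,)`:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """-log sigmoid(beta * (policy log-ratio margin over the reference))."""
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(logits).mean()
```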
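
NF4 quantization is usually handled by the bitsandbytes library, so no sketch for it here; 1-bit quantization can be as simple as keeping the sign of each weight plus one per-tensor absmean scale. A hedged sketch, not necessarily this repo's scheme:

```python
import torch

def quantize_1bit(w: torch.Tensor):
    """Store sign(w) plus one fp scale; sign() maps exact zeros to 0."""
    scale = w.abs().mean()
    return torch.sign(w), scale

def dequantize_1bit(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q * scale
```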
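
SLERP interpolates along the great circle between two weight vectors instead of the straight line, which preserves their norms better when merging checkpoints. A per-tensor sketch; apply it parameter-by-parameter across two state dicts:

```python
import torch

def slerp(w1: torch.Tensor, w2: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two same-shaped tensors;
    falls back to plain lerp when they are nearly parallel."""
    v1, v2 = w1.flatten().float(), w2.flatten().float()
    cos = torch.dot(v1, v2) / (v1.norm() * v2.norm() + eps)
    omega = torch.acos(cos.clamp(-1 + eps, 1 - eps))
    if omega.abs() < 1e-4:  # nearly parallel: lerp is numerically safer
        out = (1 - t) * v1 + t * v2
    else:
        so = torch.sin(omega)
        out = (torch.sin((1 - t) * omega) / so) * v1 + (torch.sin(t * omega) / so) * v2
    return out.view_as(w1).to(w1.dtype)
```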
Sample output after fine-tuning a 9.3M-parameter Llama-like model on 9.5M tokens of karpathy/tiny_shakespeare with knowledge distillation:
```
OPHELIA:
Good my lord, thou hateful dost companion ere Richmond
thoucester, who's here, and all the town with it,
That you should think me? then you shall be so holy
To the morning; as I see the duke:
If I be not consul, which is my distress
Of my tongue was made a guest,
And therefore, my noble brother.

YORK:
Madam.' God forbid your grace
Is very tongue of honour in this world;
How far best hope of it, weal is told.
```