"...Our goal is to open up the space by combining every form of efficient training we have. If we throw enough tradeoffs against it, a model of this size (GPT-3) should be trainable on commodity hardware (<1k if purchased as upgrades) ... Compute-memory tradeoffs (like MOE) aren't enough ... we want more efficient training using extragradient methods and better optimizers (Shampoo)" - Lucas Nestler
Training can then be launched with the small example configuration:

```
python3 main.py configs/small.yaml
```