
Commit

Adds an entry to the training log
dirkgr committed Apr 6, 2023
1 parent 597d906 commit a3b969d
Showing 1 changed file with 15 additions and 0 deletions.
15 changes: 15 additions & 0 deletions LOG.md
@@ -9,6 +9,21 @@ That is, within each transformer block we compute `MLP(LN(x)) + Attention(LN(x))`.
This increases throughput because we can fuse the separate feed-forward and attention input projections into a single linear layer.
We also experimented with [fusing the output projections](https://github.com/allenai/LLM/pull/79) into a single linear layer but that didn't help, possibly due to the overhead of concatenating the feed-forward and attention activations together.
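For context, here is a minimal PyTorch sketch of such a parallel block with a fused input projection. This is an illustration only, not the repository's actual implementation; `ParallelBlock`, `d_model`, `n_heads`, and `d_ff` are made-up names, and it assumes PyTorch 2.x for `scaled_dot_product_attention`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParallelBlock(nn.Module):
    """Sketch of a parallel transformer block: out = x + MLP(LN(x)) + Attention(LN(x)).

    The feed-forward and attention *input* projections are fused into a single
    linear layer; the two *output* projections stay separate.
    """

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_model, self.n_heads, self.d_ff = d_model, n_heads, d_ff
        self.ln = nn.LayerNorm(d_model)
        # One matmul produces Q, K, V and the feed-forward hidden activations.
        self.fused_in = nn.Linear(d_model, 3 * d_model + d_ff)
        self.attn_out = nn.Linear(d_model, d_model)
        self.ff_out = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        h = self.ln(x)  # a single LayerNorm feeds both branches
        q, k, v, ff = self.fused_in(h).split(
            [self.d_model, self.d_model, self.d_model, self.d_ff], dim=-1
        )
        # Split heads: (b, t, d_model) -> (b, n_heads, t, head_dim).
        q, k, v = (
            z.view(b, t, self.n_heads, -1).transpose(1, 2) for z in (q, k, v)
        )
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, t, self.d_model)
        # Both residual branches are summed, so attention and MLP read the same input.
        return x + self.attn_out(attn) + self.ff_out(F.gelu(ff))


# Example usage (hypothetical sizes):
# block = ParallelBlock(d_model=1024, n_heads=16, d_ff=4096)
# y = block(torch.randn(2, 128, 1024))  # (batch, sequence, d_model)
```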


2023-04-02
----------

First training run! We trained a 300M model on about 70B tokens from C4.
The purpose of this model is to give the other LLM teams something in our format that's not completely random,
so they can test their evaluation and inference code.

This ran on just a single node of AMD's cluster.
On AMD hardware we're still missing Flash Attention, and we could not get `torch.compile()` to work in time for the run.
Both are expected to provide significant speedups.
Even though we could not compile, this training run used the model settings that are optimal for compiled models,
because we want it to be a representative model for the downstream evaluations.


2023-03-28
----------

