
Commit

Adds an entry to the training log
dirkgr committed Apr 6, 2023
1 parent 597d906 commit a3b969d
Showing 1 changed file with 15 additions and 0 deletions.
15 changes: 15 additions & 0 deletions LOG.md
@@ -9,6 +9,21 @@ That is, within each transformer block we compute `MLP(LN(x)) + Attention(LN(x))`.
This increases throughput because we can fuse the separate feed-forward and attention input projections into a single linear layer.
We also experimented with [fusing the output projections](https://github.com/allenai/LLM/pull/79) into a single linear layer but that didn't help, possibly due to the overhead of concatenating the feed-forward and attention activations together.
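For context, here is a minimal PyTorch sketch of such a parallel block with a fused input projection. This is an illustration only, not the repository's actual implementation; `ParallelBlock`, `d_model`, `n_heads`, and `d_ff` are made-up names, and it assumes PyTorch 2.x for `scaled_dot_product_attention`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ParallelBlock(nn.Module):
    """Sketch of a parallel transformer block: out = x + MLP(LN(x)) + Attention(LN(x)).

    The feed-forward and attention *input* projections are fused into a single
    linear layer; the two *output* projections stay separate.
    """

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_model, self.n_heads, self.d_ff = d_model, n_heads, d_ff
        self.ln = nn.LayerNorm(d_model)
        # One matmul produces Q, K, V and the feed-forward hidden activations.
        self.fused_in = nn.Linear(d_model, 3 * d_model + d_ff)
        self.attn_out = nn.Linear(d_model, d_model)
        self.ff_out = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        h = self.ln(x)  # a single LayerNorm feeds both branches
        q, k, v, ff = self.fused_in(h).split(
            [self.d_model, self.d_model, self.d_model, self.d_ff], dim=-1
        )
        # Split heads: (b, t, d_model) -> (b, n_heads, t, head_dim).
        q, k, v = (
            z.view(b, t, self.n_heads, -1).transpose(1, 2) for z in (q, k, v)
        )
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, t, self.d_model)
        # Both residual branches are summed, so attention and MLP read the same input.
        return x + self.attn_out(attn) + self.ff_out(F.gelu(ff))


# Example usage (hypothetical sizes):
# block = ParallelBlock(d_model=1024, n_heads=16, d_ff=4096)
# y = block(torch.randn(2, 128, 1024))  # (batch, sequence, d_model)
```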


2023-04-02
----------

First training run! We trained a 300M model on about 70B tokens from C4.
The purpose of this model is to give the other LLM teams something in our format that's not completely random,
so they can test their evaluation and inference code.

This ran on just a single node of AMD's cluster.
On AMD hardware we're still missing Flash Attention, and we could not get `torch.compile()` to work in time for the run.
Both are expected to provide significant speedups.
Even though we could not compile, this training run used the model settings that are optimal for compiled models,
because we want it to be a representative model for the downstream evaluations.


2023-03-28
----------

