diff --git a/_posts/papers-of-the-month/2024-09/2024-09-27-title-tbd.md b/_posts/papers-of-the-month/2024-09/2024-09-27-title-tbd.md deleted file mode 100644 index 3e14f1f..0000000 --- a/_posts/papers-of-the-month/2024-09/2024-09-27-title-tbd.md +++ /dev/null @@ -1,25 +0,0 @@ ---- -title: "September Papers: Title TBD" -header: - teaser: /assets/images/posts/2024-08/potm/twitter_card.png - image: /assets/images/posts/2024-08/potm/twitter_card.png - og_image: /assets/images/posts/2024-08/potm/twitter_card.png - -date: 2024-09-27T01:00:00-00:00 -potm_year: 2024 -potm_month: 9 - -layout: paper-summaries-layout -category: "papers-of-the-month" -toc: true -toc_sticky: true -toc_label: "Papers" -toc_icon: "book" -author.twitter: "GCResearchTeam" ---- - -TODO: blurb - ---- - -{% include paper-summaries.md %} diff --git a/_posts/papers-of-the-month/2024-09/2024-09-30-proper-conditioning.md b/_posts/papers-of-the-month/2024-09/2024-09-30-proper-conditioning.md new file mode 100644 index 0000000..7f8bb89 --- /dev/null +++ b/_posts/papers-of-the-month/2024-09/2024-09-30-proper-conditioning.md @@ -0,0 +1,34 @@ +--- +title: "September Papers: Proper Conditioning" +header: + teaser: /assets/images/posts/2024-09/potm/twitter_card.png + image: /assets/images/posts/2024-09/potm/twitter_card.png + og_image: /assets/images/posts/2024-09/potm/twitter_card.png + +date: 2024-09-30T01:00:00-00:00 +potm_year: 2024 +potm_month: 9 + +layout: paper-summaries-layout +category: "papers-of-the-month" +toc: true +toc_sticky: true +toc_label: "Papers" +toc_icon: "book" +author.twitter: "GCResearchTeam" +--- + +We're pleased to share four papers from different domains: LLM self-correction, FP8 training, generative crystals and optimisation. They are united, somewhat tenuously, by the importance of _proper conditioning_: + +1. DeepMind researchers explain how _conditioning on the wrong distribution_ during supervised fine-tuning for self-correction is harmful but can be overcome using RL. +2. A novel Smooth-SwiGLU activation _"conditions" the numerics_ by inserting a scaling factor in just the right place, preventing late-training instability in FP8. +3. The GenMS architecture, which generates crystal structures for materials, _conditions on high-level textual and low-level structural information_ for high-quality generation. +4. SOAP is an evolution of Shampoo, with conditioners in the name and _preconditioners forming the eigenbasis_ for optimisation. + +You can be the judge of how tenuous the connection is, but I'd encourage you to check out the summaries either way. + +_I hope you enjoy these as much as we did.
Tell us we're wrong; tell us we're right [@GCResearchTeam](https://x.com/GCResearchTeam)._ + +--- + +{% include paper-summaries.md %} diff --git a/_posts/papers-of-the-month/2024-09/papers/2024-09-27-GenMS.md b/_posts/papers-of-the-month/2024-09/papers/2024-09-27-GenMS.md index 7d5bcf3..370d025 100644 --- a/_posts/papers-of-the-month/2024-09/papers/2024-09-27-GenMS.md +++ b/_posts/papers-of-the-month/2024-09/papers/2024-09-27-GenMS.md @@ -10,7 +10,7 @@ tags: - materials potm_year: 2024 potm_month: 9 -paper_order: 1 # Editor will decide +paper_order: 3 image_dir: "/assets/images/posts/2024-09/potm/GenMS/" review_author: name: "Daniel Justus" diff --git a/_posts/papers-of-the-month/2024-09/papers/2024-09-27-fp8_smooth_swiglu.md b/_posts/papers-of-the-month/2024-09/papers/2024-09-27-fp8_smooth_swiglu.md index 3c8a6e1..e528044 100644 --- a/_posts/papers-of-the-month/2024-09/papers/2024-09-27-fp8_smooth_swiglu.md +++ b/_posts/papers-of-the-month/2024-09/papers/2024-09-27-fp8_smooth_swiglu.md @@ -8,7 +8,7 @@ tags: - quantisation potm_year: 2024 potm_month: 9 -paper_order: 1 +paper_order: 2 image_dir: "/assets/images/posts/2024-09/potm/fp8_smooth_swiglu/" review_author: name: "Paul Balanca" diff --git a/_posts/papers-of-the-month/2024-09/papers/2024-09-27-llm-correction-via-rl.md b/_posts/papers-of-the-month/2024-09/papers/2024-09-27-llm-correction-via-rl.md index 7ac1302..a362ab1 100644 --- a/_posts/papers-of-the-month/2024-09/papers/2024-09-27-llm-correction-via-rl.md +++ b/_posts/papers-of-the-month/2024-09/papers/2024-09-27-llm-correction-via-rl.md @@ -10,8 +10,8 @@ tags: - LLMs potm_year: 2024 potm_month: 9 -paper_order: 1 # Editor will decide -image_dir: "/assets/images/posts/2024-10/potm/llm-correction-via-rl/" +paper_order: 1 +image_dir: "/assets/images/posts/2024-09/potm/llm-correction-via-rl/" review_author: name: "Charlie Blake" link: "https://x.com/thecharlieblake" diff --git a/_posts/papers-of-the-month/2024-09/papers/2024-09-27-soap.md b/_posts/papers-of-the-month/2024-09/papers/2024-09-27-soap.md index 075318e..b569375 100644 --- a/_posts/papers-of-the-month/2024-09/papers/2024-09-27-soap.md +++ b/_posts/papers-of-the-month/2024-09/papers/2024-09-27-soap.md @@ -4,10 +4,10 @@ paper_authors: "Nikhil Vyas, Depen Morwani, et al." orgs: "Harvard University" paper_link: "https://arxiv.org/abs/2409.11321" tags: - - optimization + - optimisation potm_year: 2024 potm_month: 9 -paper_order: 1 # Editor will decide +paper_order: 4 image_dir: "/assets/images/posts/2024-09/potm/soap/" review_author: name: "Douglas Orr" @@ -17,7 +17,7 @@ hidden: true ### The key idea -It turns out that the Shampoo optimizer (explained below), with some minor tweaks, is equivalent to running Adafactor in Shampoo's eigenspace. Since Adafactor is a rank=1 variant of Adam, the proposed method "SOAP" runs Adam in Shampoo's eigenspace instead. +It turns out that the Shampoo optimiser (explained below), with some minor tweaks, is equivalent to running Adafactor in Shampoo's eigenspace. Since Adafactor is a rank=1 variant of Adam, the proposed method "SOAP" runs Adam in Shampoo's eigenspace instead. SOAP performance versus Adam and Shampoo, showing good step-efficiency (due to Adam) and time-efficiency (due to periodic preconditioning). Less frequent preconditioning hurts Shampoo more than SOAP.
Figure 1. SOAP performance versus Adam and Shampoo, showing good step-efficiency (due to Adam) and time-efficiency (due to periodic preconditioning). Less frequent preconditioning hurts Shampoo more than SOAP.
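To make "Adam in Shampoo's eigenspace" concrete, here is a minimal NumPy sketch of a single update step for a matrix-shaped parameter. It is our illustration of the rotate / apply-Adam / rotate-back pattern, not the authors' reference implementation: the function and state names are made up, bias correction is omitted, and in SOAP proper the eigendecompositions are refreshed only every few steps (which is where the time-efficiency in the figure comes from).

```python
import numpy as np

def soap_like_step(W, G, state, lr=1e-3, beta1=0.9, beta2=0.95, eps=1e-8):
    """One illustrative SOAP-style step for a matrix parameter W with gradient G."""
    m, n = W.shape
    if not state:  # lazily initialise optimiser state on the first call
        state["L"] = np.zeros((m, m))  # left Shampoo statistic,  ~ E[G G^T]
        state["R"] = np.zeros((n, n))  # right Shampoo statistic, ~ E[G^T G]
        state["M"] = np.zeros((m, n))  # Adam first moment, kept in the eigenbasis
        state["V"] = np.zeros((m, n))  # Adam second moment, kept in the eigenbasis

    # Accumulate Shampoo's preconditioner statistics (running averages here).
    state["L"] = beta2 * state["L"] + (1 - beta2) * (G @ G.T)
    state["R"] = beta2 * state["R"] + (1 - beta2) * (G.T @ G)

    # Eigenbases of the statistics; SOAP amortises this cost by refreshing
    # Q_L, Q_R only periodically rather than every step.
    _, Q_L = np.linalg.eigh(state["L"])
    _, Q_R = np.linalg.eigh(state["R"])

    # Rotate the gradient into the eigenbasis and run a plain Adam update there.
    G_rot = Q_L.T @ G @ Q_R
    state["M"] = beta1 * state["M"] + (1 - beta1) * G_rot
    state["V"] = beta2 * state["V"] + (1 - beta2) * G_rot ** 2
    update_rot = state["M"] / (np.sqrt(state["V"]) + eps)

    # Rotate the update back to the original basis and apply it.
    return W - lr * (Q_L @ update_rot @ Q_R.T)
```

Amortising the two eigendecompositions over many steps is what keeps the wall-clock overhead close to Adam's, at the cost of a slightly stale basis.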
@@ -32,9 +32,9 @@ R_t &= R_{t-1} + G_t^{\top} G_t \\ W_t &= W_{t-1} - \eta \cdot L_t^{-1/4} G_t R_t^{-1/4} \end{aligned}$$ -Where $W \in \Re^{m \times n}$ is a weight matrix, $L\in \Re^{m \times m}$, $R\in \Re^{n \times n}$ are "preconditioners", behaving a bit like optimizer state and $G$ is the minibatch gradient of a loss with respect to $W$. +Where $W \in \Re^{m \times n}$ is a weight matrix, $L\in \Re^{m \times m}$, $R\in \Re^{n \times n}$ are "preconditioners", behaving a bit like optimiser state and $G$ is the minibatch gradient of a loss with respect to $W$. -A slightly different variant is considered here: idealized Shampoo with power $1/2$, +A slightly different variant is considered here: idealised Shampoo with power $1/2$,
$$\begin{aligned} L &= \mathbb{E}(G G^{\top}) \\ @@ -42,7 +42,7 @@ R &= \mathbb{E}(G^{\top} G) \\ W_t &= W_{t-1} - \eta \cdot L^{-1/2} G_t R^{-1/2} \,/\, \mathrm{tr}(L) \end{aligned}$$
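As a rough illustration of this idealised power-$1/2$ update, here is a NumPy sketch in which the expectations are estimated by averaging over a few sampled gradients; the helper names and the small damping term are ours, purely for illustration.

```python
import numpy as np

def inverse_sqrt(A, eps=1e-12):
    """A^{-1/2} for a symmetric PSD matrix, via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(A)
    return vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T

def idealised_shampoo_step(W, sampled_grads, G_t, lr=1e-3):
    """sampled_grads approximate the expectations; G_t is the current gradient."""
    L = np.mean([G @ G.T for G in sampled_grads], axis=0)  # ~ E[G G^T]
    R = np.mean([G.T @ G for G in sampled_grads], axis=0)  # ~ E[G^T G]
    update = inverse_sqrt(L) @ G_t @ inverse_sqrt(R) / np.trace(L)
    return W - lr * update
```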
-Note that this _idealized_ variant takes an expectation over gradients from the dataset, rather than a running average as per practical implementations. The authors show that the last line is equivalent to idealized Adafactor in the _Shampoo eigenspace_: +Note that this _idealised_ variant takes an expectation over gradients from the dataset, rather than a running average as per practical implementations. The authors show that the last line is equivalent to idealised Adafactor in the _Shampoo eigenspace_:
$$\begin{aligned} Q_L &= \mathrm{Eigenvectors}(L) \\ @@ -61,6 +61,6 @@ The running state of this technique includes $L$, $R$, $Q_L$, $Q_R$, $M$ (in the ### Results -Results on language modelling (see figure above) show good step-efficiency of SOAP since it is based on Adam rather than Adafactor, and time-efficiency since the eigenvectors can be periodically updated without substantially harming performance. Like Shampoo, the extra optimization cost can be reduced by using a large batch size. +Results on language modelling (see figure above) show good step-efficiency of SOAP since it is based on Adam rather than Adafactor, and time-efficiency since the eigenvectors can be periodically updated without substantially harming performance. Like Shampoo, the extra optimisation cost can be reduced by using a large batch size. Stepping back for a moment, I'm excited about this progress using Shampoo variants and am eager to see experiments over long training runs of LLMs. So I hope we'll see plenty more shower-related puns on arXiv over the next year! diff --git a/assets/images/posts/2024-10/potm/llm-correction-via-rl/figure-6.png b/assets/images/posts/2024-09/potm/llm-correction-via-rl/figure-6.png similarity index 100% rename from assets/images/posts/2024-10/potm/llm-correction-via-rl/figure-6.png rename to assets/images/posts/2024-09/potm/llm-correction-via-rl/figure-6.png diff --git a/assets/images/posts/2024-10/potm/llm-correction-via-rl/table-1.png b/assets/images/posts/2024-09/potm/llm-correction-via-rl/table-1.png similarity index 100% rename from assets/images/posts/2024-10/potm/llm-correction-via-rl/table-1.png rename to assets/images/posts/2024-09/potm/llm-correction-via-rl/table-1.png diff --git a/assets/images/posts/2024-10/potm/llm-correction-via-rl/table-4.png b/assets/images/posts/2024-09/potm/llm-correction-via-rl/table-4.png similarity index 100% rename from assets/images/posts/2024-10/potm/llm-correction-via-rl/table-4.png rename to assets/images/posts/2024-09/potm/llm-correction-via-rl/table-4.png diff --git a/assets/images/posts/2024-09/potm/twitter_card.png b/assets/images/posts/2024-09/potm/twitter_card.png new file mode 100644 index 0000000..cde313b Binary files /dev/null and b/assets/images/posts/2024-09/potm/twitter_card.png differ