Commit

Add summary & tweak
DouglasOrr committed Sep 30, 2024
1 parent bd3b4d2 commit 9202d36
Showing 10 changed files with 45 additions and 36 deletions.
25 changes: 0 additions & 25 deletions _posts/papers-of-the-month/2024-09/2024-09-27-title-tbd.md

This file was deleted.

@@ -0,0 +1,34 @@
---
title: "September Papers: Proper Conditioning"
header:
  teaser: /assets/images/posts/2024-09/potm/twitter_card.png
  image: /assets/images/posts/2024-09/potm/twitter_card.png
  og_image: /assets/images/posts/2024-09/potm/twitter_card.png

date: 2024-09-30T01:00:00-00:00
potm_year: 2024
potm_month: 9

layout: paper-summaries-layout
category: "papers-of-the-month"
toc: true
toc_sticky: true
toc_label: "Papers"
toc_icon: "book"
author.twitter: "GCResearchTeam"
---

We're pleased to share four papers from different domains: LLM self-correction, FP8 training, generative crystals and optimisation. They are united, somewhat tenuously, by the importance of _proper conditioning_:

1. DeepMind researchers explain how _conditioning on the wrong distribution_ during supervised fine-tuning for self-correction is harmful but can be overcome using RL.
2. A novel Smooth-SwiGLU activation _"conditions" the numerics_ by inserting a scaling factor in just the right place, preventing late-training instability in FP8.
3. The GenMS architecture generates crystal structures for materials, _conditioning on high-level textual and low-level structural information_ to achieve high-quality generation.
4. SOAP is an evolution of Shampoo, with conditioners in the name and _preconditioners forming the eigenbasis_ for optimisation.

You can be the judge of how tenuous the connection is, but I'd encourage you to check out the summaries either way.

_I hope you enjoy these as much as we did. Tell us we're wrong; tell us we're right [@GCResearchTeam](https://x.com/GCResearchTeam)._

---

{% include paper-summaries.md %}
@@ -10,7 +10,7 @@ tags:
- materials
potm_year: 2024
potm_month: 9
-paper_order: 1 # Editor will decide
+paper_order: 3
image_dir: "/assets/images/posts/2024-09/potm/GenMS/"
review_author:
name: "Daniel Justus"
@@ -8,7 +8,7 @@ tags:
- quantisation
potm_year: 2024
potm_month: 9
-paper_order: 1
+paper_order: 2
image_dir: "/assets/images/posts/2024-09/potm/fp8_smooth_swiglu/"
review_author:
name: "Paul Balanca"
@@ -10,8 +10,8 @@ tags:
- LLMs
potm_year: 2024
potm_month: 9
-paper_order: 1 # Editor will decide
-image_dir: "/assets/images/posts/2024-10/potm/llm-correction-via-rl/"
+paper_order: 1
+image_dir: "/assets/images/posts/2024-09/potm/llm-correction-via-rl/"
review_author:
name: "Charlie Blake"
link: "https://x.com/thecharlieblake"
14 changes: 7 additions & 7 deletions _posts/papers-of-the-month/2024-09/papers/2024-09-27-soap.md
@@ -4,10 +4,10 @@ paper_authors: "Nikhil Vyas, Depen Morwani, et al."
orgs: "Harvard University"
paper_link: "https://arxiv.org/abs/2409.11321"
tags:
-- optimization
+- optimisation
potm_year: 2024
potm_month: 9
-paper_order: 1 # Editor will decide
+paper_order: 4
image_dir: "/assets/images/posts/2024-09/potm/soap/"
review_author:
name: "Douglas Orr"
@@ -17,7 +17,7 @@ hidden: true

### The key idea

-It turns out that the Shampoo optimizer (explained below), with some minor tweaks, is equivalent to running Adafactor in Shampoo's eigenspace. Since Adafactor is a rank=1 variant of Adam, the proposed method "SOAP" runs Adam in Shampoo's eigenspace instead.
+It turns out that the Shampoo optimiser (explained below), with some minor tweaks, is equivalent to running Adafactor in Shampoo's eigenspace. Since Adafactor is a rank=1 variant of Adam, the proposed method "SOAP" runs Adam in Shampoo's eigenspace instead.
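
To make the idea concrete, here is a minimal NumPy sketch of a SOAP-style update for a single weight matrix. This is illustrative rather than the authors' implementation: the function and state names are ours, bias correction is omitted, and in practice the eigenbases would only be refreshed periodically.

```python
import numpy as np

def soap_like_step(W, G, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """Sketch of a SOAP-style step: a plain Adam update run in Shampoo's eigenbasis."""
    # Accumulate Shampoo's left/right statistics (running averages of G G^T and G^T G).
    state["L"] = betas[1] * state["L"] + (1 - betas[1]) * G @ G.T
    state["R"] = betas[1] * state["R"] + (1 - betas[1]) * G.T @ G
    # Refresh the eigenbases (only every N steps in practice, which keeps the cost low).
    _, state["Q_L"] = np.linalg.eigh(state["L"])
    _, state["Q_R"] = np.linalg.eigh(state["R"])

    # Rotate the gradient into the eigenbasis and take an ordinary Adam step there.
    G_rot = state["Q_L"].T @ G @ state["Q_R"]
    state["m"] = betas[0] * state["m"] + (1 - betas[0]) * G_rot
    state["v"] = betas[1] * state["v"] + (1 - betas[1]) * G_rot ** 2
    update_rot = state["m"] / (np.sqrt(state["v"]) + eps)

    # Rotate the update back to the original basis and apply it to the weights.
    return W - lr * state["Q_L"] @ update_rot @ state["Q_R"].T
```

Here `state` would be initialised with zero matrices of the appropriate shapes; everything else is standard Adam.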

<img class="constrained_img_large" src="{{ page.image_dir | append: 'headline.png' | relative_url }}" alt="SOAP performance versus Adam and Shampoo, showing good step-efficiency (due to Adam) and time-efficiency (due to periodic preconditioning). Less frequent preconditioning hurts Shampoo more than SOAP.">
<figcaption>Figure 1. SOAP performance versus Adam and Shampoo, showing good step-efficiency (due to Adam) and time-efficiency (due to periodic preconditioning). Less frequent preconditioning hurts Shampoo more than SOAP.</figcaption>
Expand All @@ -32,17 +32,17 @@ R_t &= R_{t-1} + G_t^{\top} G_t \\
W_t &= W_{t-1} - \eta \cdot L_t^{-1/4} G_t R_t^{-1/4}
\end{aligned}$$</div>

-Where $W \in \Re^{m \times n}$ is a weight matrix, $L\in \Re^{m \times m}$, $R\in \Re^{n \times n}$ are "preconditioners", behaving a bit like optimizer state and $G$ is the minibatch gradient of a loss with respect to $W$.
+Where $W \in \Re^{m \times n}$ is a weight matrix, $L\in \Re^{m \times m}$, $R\in \Re^{n \times n}$ are "preconditioners", behaving a bit like optimiser state and $G$ is the minibatch gradient of a loss with respect to $W$.
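
For concreteness, a rough sketch of one Shampoo step for a single weight matrix (our own illustrative code, not from the paper; the small `eps` keeps the fractional matrix powers well-defined):

```python
import numpy as np

def matrix_power(M, p, eps=1e-12):
    """M^p for a symmetric positive semi-definite matrix, via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.maximum(vals, eps) ** p) @ vecs.T

def shampoo_step(W, G, L, R, lr=1e-3):
    """One Shampoo update: accumulate L and R, then precondition G on both sides."""
    L = L + G @ G.T   # L_t = L_{t-1} + G_t G_t^T
    R = R + G.T @ G   # R_t = R_{t-1} + G_t^T G_t
    W = W - lr * matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)
    return W, L, R
```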

-A slightly different variant is considered here: idealized Shampoo with power $1/2$,
+A slightly different variant is considered here: idealised Shampoo with power $1/2$,

<div>$$\begin{aligned}
L &= \mathbb{E}(G G^{\top}) \\
R &= \mathbb{E}(G^{\top} G) \\
W_t &= W_{t-1} - \eta \cdot L^{-1/2} G_t R^{-1/2} \,/\, \mathrm{tr}(L)
\end{aligned}$$</div>

-Note that this _idealized_ variant takes an expectation over gradients from the dataset, rather than a running average as per practical implementations. The authors show that the last line is equivalent to idealized Adafactor in the _Shampoo eigenspace_:
+Note that this _idealised_ variant takes an expectation over gradients from the dataset, rather than a running average as per practical implementations. The authors show that the last line is equivalent to idealised Adafactor in the _Shampoo eigenspace_:

<div>$$\begin{aligned}
Q_L &= \mathrm{Eigenvectors}(L) \\
@@ -61,6 +61,6 @@ The running state of this technique includes $L$, $R$, $Q_L$, $Q_R$, $M$ (in the

### Results

-Results on language modelling (see figure above) show good step-efficiency of SOAP since it is based on Adam rather than Adafactor, and time-efficiency since the eigenvectors can be periodically updated without substantially harming performance. Like Shampoo, the extra optimization cost can be reduced by using a large batch size.
+Results on language modelling (see figure above) show good step-efficiency of SOAP since it is based on Adam rather than Adafactor, and time-efficiency since the eigenvectors can be periodically updated without substantially harming performance. Like Shampoo, the extra optimisation cost can be reduced by using a large batch size.

Stepping back for a moment, I'm excited about this progress using Shampoo variants and am eager to see experiments over long training runs of LLMs. So I hope we'll see plenty more shower-related puns on arXiv over the next year!