Commit

Add summary & tweak
DouglasOrr committed Sep 30, 2024
1 parent bd3b4d2 commit 9202d36
Showing 10 changed files with 45 additions and 36 deletions.
25 changes: 0 additions & 25 deletions _posts/papers-of-the-month/2024-09/2024-09-27-title-tbd.md

This file was deleted.

@@ -0,0 +1,34 @@
---
title: "September Papers: Proper Conditioning"
header:
  teaser: /assets/images/posts/2024-09/potm/twitter_card.png
  image: /assets/images/posts/2024-09/potm/twitter_card.png
  og_image: /assets/images/posts/2024-09/potm/twitter_card.png

date: 2024-09-30T01:00:00-00:00
potm_year: 2024
potm_month: 9

layout: paper-summaries-layout
category: "papers-of-the-month"
toc: true
toc_sticky: true
toc_label: "Papers"
toc_icon: "book"
author.twitter: "GCResearchTeam"
---

We're pleased to share four papers from different domains: LLM self-correction, FP8 training, generative crystals and optimisation. They are united, somewhat tenuously, by the importance of _proper conditioning_:

1. DeepMind researchers explain how _conditioning on the wrong distribution_ during supervised fine-tuning for self-correction is harmful but can be overcome using RL.
2. A novel Smooth-SwiGLU activation _"conditions" the numerics_ by inserting a scaling factor in just the right place, preventing late-training instability in FP8.
3. The GenMS architecture generates crystal structures for materials, _conditioning on high-level textual and low-level structural information_ to achieve high-quality generation.
4. SOAP is an evolution of Shampoo, with conditioners in the name and _preconditioners forming the eigenbasis_ for optimisation.

You can be the judge of how tenuous the connection is, but I'd encourage you to check out the summaries either way.

_I hope you enjoy these as much as we did. Tell us we're wrong; tell us we're right [@GCResearchTeam](https://x.com/GCResearchTeam)._

---

{% include paper-summaries.md %}
@@ -10,7 +10,7 @@ tags:
- materials
potm_year: 2024
potm_month: 9
-paper_order: 1 # Editor will decide
+paper_order: 3
image_dir: "/assets/images/posts/2024-09/potm/GenMS/"
review_author:
name: "Daniel Justus"
@@ -8,7 +8,7 @@ tags:
- quantisation
potm_year: 2024
potm_month: 9
-paper_order: 1
+paper_order: 2
image_dir: "/assets/images/posts/2024-09/potm/fp8_smooth_swiglu/"
review_author:
name: "Paul Balanca"
@@ -10,8 +10,8 @@ tags:
- LLMs
potm_year: 2024
potm_month: 9
-paper_order: 1 # Editor will decide
-image_dir: "/assets/images/posts/2024-10/potm/llm-correction-via-rl/"
+paper_order: 1
+image_dir: "/assets/images/posts/2024-09/potm/llm-correction-via-rl/"
review_author:
name: "Charlie Blake"
link: "https://x.com/thecharlieblake"
14 changes: 7 additions & 7 deletions _posts/papers-of-the-month/2024-09/papers/2024-09-27-soap.md
@@ -4,10 +4,10 @@ paper_authors: "Nikhil Vyas, Depen Morwani, et al."
orgs: "Harvard University"
paper_link: "https://arxiv.org/abs/2409.11321"
tags:
-- optimization
+- optimisation
potm_year: 2024
potm_month: 9
-paper_order: 1 # Editor will decide
+paper_order: 4
image_dir: "/assets/images/posts/2024-09/potm/soap/"
review_author:
name: "Douglas Orr"
@@ -17,7 +17,7 @@ hidden: true

### The key idea

-It turns out that the Shampoo optimizer (explained below), with some minor tweaks, is equivalent to running Adafactor in Shampoo's eigenspace. Since Adafactor is a rank=1 variant of Adam, the proposed method "SOAP" runs Adam in Shampoo's eigenspace instead.
+It turns out that the Shampoo optimiser (explained below), with some minor tweaks, is equivalent to running Adafactor in Shampoo's eigenspace. Since Adafactor is a rank=1 variant of Adam, the proposed method "SOAP" runs Adam in Shampoo's eigenspace instead.
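
To make the idea concrete, here is a minimal NumPy sketch of a SOAP-style update for a single weight matrix. This is illustrative rather than the authors' implementation: the function and state names are ours, bias correction is omitted, and in practice the eigenbases would only be refreshed periodically.

```python
import numpy as np

def soap_like_step(W, G, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """Sketch of a SOAP-style step: a plain Adam update run in Shampoo's eigenbasis."""
    # Accumulate Shampoo's left/right statistics (running averages of G G^T and G^T G).
    state["L"] = betas[1] * state["L"] + (1 - betas[1]) * G @ G.T
    state["R"] = betas[1] * state["R"] + (1 - betas[1]) * G.T @ G
    # Refresh the eigenbases (only every N steps in practice, which keeps the cost low).
    _, state["Q_L"] = np.linalg.eigh(state["L"])
    _, state["Q_R"] = np.linalg.eigh(state["R"])

    # Rotate the gradient into the eigenbasis and take an ordinary Adam step there.
    G_rot = state["Q_L"].T @ G @ state["Q_R"]
    state["m"] = betas[0] * state["m"] + (1 - betas[0]) * G_rot
    state["v"] = betas[1] * state["v"] + (1 - betas[1]) * G_rot ** 2
    update_rot = state["m"] / (np.sqrt(state["v"]) + eps)

    # Rotate the update back to the original basis and apply it to the weights.
    return W - lr * state["Q_L"] @ update_rot @ state["Q_R"].T
```

Here `state` would be initialised with zero matrices of the appropriate shapes; everything else is standard Adam.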

<img class="constrained_img_large" src="{{ page.image_dir | append: 'headline.png' | relative_url }}" alt="SOAP performance versus Adam and Shampoo, showing good step-efficiency (due to Adam) and time-efficiency (due to periodic preconditioning). Less frequent preconditioning hurts Shampoo more than SOAP.">
<figcaption>Figure 1. SOAP performance versus Adam and Shampoo, showing good step-efficiency (due to Adam) and time-efficiency (due to periodic preconditioning). Less frequent preconditioning hurts Shampoo more than SOAP.</figcaption>
Expand All @@ -32,17 +32,17 @@ R_t &= R_{t-1} + G_t^{\top} G_t \\
W_t &= W_{t-1} - \eta \cdot L_t^{-1/4} G_t R_t^{-1/4}
\end{aligned}$$</div>

-Where $W \in \Re^{m \times n}$ is a weight matrix, $L\in \Re^{m \times m}$, $R\in \Re^{n \times n}$ are "preconditioners", behaving a bit like optimizer state and $G$ is the minibatch gradient of a loss with respect to $W$.
+Where $W \in \Re^{m \times n}$ is a weight matrix, $L\in \Re^{m \times m}$, $R\in \Re^{n \times n}$ are "preconditioners", behaving a bit like optimiser state and $G$ is the minibatch gradient of a loss with respect to $W$.
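
For concreteness, a rough sketch of one Shampoo step for a single weight matrix (our own illustrative code, not from the paper; the small `eps` keeps the fractional matrix powers well-defined):

```python
import numpy as np

def matrix_power(M, p, eps=1e-12):
    """M^p for a symmetric positive semi-definite matrix, via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.maximum(vals, eps) ** p) @ vecs.T

def shampoo_step(W, G, L, R, lr=1e-3):
    """One Shampoo update: accumulate L and R, then precondition G on both sides."""
    L = L + G @ G.T   # L_t = L_{t-1} + G_t G_t^T
    R = R + G.T @ G   # R_t = R_{t-1} + G_t^T G_t
    W = W - lr * matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)
    return W, L, R
```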

-A slightly different variant is considered here: idealized Shampoo with power $1/2$,
+A slightly different variant is considered here: idealised Shampoo with power $1/2$,

<div>$$\begin{aligned}
L &= \mathbb{E}(G G^{\top}) \\
R &= \mathbb{E}(G^{\top} G) \\
W_t &= W_{t-1} - \eta \cdot L^{-1/2} G_t R^{-1/2} \,/\, \mathrm{tr}(L)
\end{aligned}$$</div>

-Note that this _idealized_ variant takes an expectation over gradients from the dataset, rather than a running average as per practical implementations. The authors show that the last line is equivalent to idealized Adafactor in the _Shampoo eigenspace_:
+Note that this _idealised_ variant takes an expectation over gradients from the dataset, rather than a running average as per practical implementations. The authors show that the last line is equivalent to idealised Adafactor in the _Shampoo eigenspace_:

<div>$$\begin{aligned}
Q_L &= \mathrm{Eigenvectors}(L) \\
@@ -61,6 +61,6 @@ The running state of this technique includes $L$, $R$, $Q_L$, $Q_R$, $M$ (in the

### Results

-Results on language modelling (see figure above) show good step-efficiency of SOAP since it is based on Adam rather than Adafactor, and time-efficiency since the eigenvectors can be periodically updated without substantially harming performance. Like Shampoo, the extra optimization cost can be reduced by using a large batch size.
+Results on language modelling (see figure above) show good step-efficiency of SOAP since it is based on Adam rather than Adafactor, and time-efficiency since the eigenvectors can be periodically updated without substantially harming performance. Like Shampoo, the extra optimisation cost can be reduced by using a large batch size.

Stepping back for a moment, I'm excited about this progress using Shampoo variants and am eager to see experiments over long training runs of LLMs. So I hope we'll see plenty more shower-related puns on arXiv over the next year!