| theme | highlighter | class | title | background | layout |
|---|---|---|---|---|---|
| seriph | shiki | text-center | Diffusion Model for Control and Planning Tutorial | figs/diffuse_teaser.gif | cover |
- 🔄 Recap: What is a Diffusion Model?
- 🚀 Motivation: Why a Generative Model in Control and Planning?
- 🛠️ Practice: How to Use the Diffuser?
- 📚 Literatures: Recent Research Progress in Diffusion for RL/Control
- 📝 Summary & Challenges in Diffusion Models
- Keynote: Generative model for distribution matching.
- Applications: Image and text generation, creative tasks.
- Keynote: Generative model for distribution matching.
- Applications: Image and text generation, creative tasks.
- Core: Score function for sample generation and distribution description. $$ \boldsymbol{x}_{i+1} \leftarrow \boldsymbol{x}_i+c \nabla \log p\left(\boldsymbol{x}_i\right)+\sqrt{2 c} \boldsymbol{\epsilon}, \quad i=0,1, \ldots, K $$
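Below is a minimal sketch of this Langevin-style sampling rule, assuming a `score_fn` that returns $\nabla \log p(x)$; the step size `c`, step count `K`, and the Gaussian example are illustrative choices, not values from the tutorial.

```python
import numpy as np

def langevin_sample(score_fn, x0, c=1e-2, K=100, rng=None):
    """Follow the score with added Gaussian noise: x_{i+1} = x_i + c*score(x_i) + sqrt(2c)*eps."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(K):
        eps = rng.standard_normal(x.shape)
        x = x + c * score_fn(x) + np.sqrt(2 * c) * eps
    return x

# Example: the score of a standard Gaussian is -x, so this draws an approximate N(0, I) sample.
sample = langevin_sample(lambda x: -x, x0=np.zeros(2))
```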
- Keynote: Generative model for distribution matching.
- Applications: Image and text generation, creative tasks.
- Core: Score function for sample generation and distribution description.
- Advantages:
- 🌟 Multimodal: Effective with multimodal distributions.
- 📈 Scalable: Suits high-dimensional problems.
- 🔒 Stable: Grounded in solid mathematics and training.
- 🔄 Non-autoregressive: Predicts entire trajectories efficiently.
- Generative Models: applied in imitation learning to match expert data.
- Examples: GANs, VAEs in imitation learning.
- Generative Models: applied in imitation learning to match expert data.
- Examples: GANs, VAEs in imitation learning.
- GAN in GAIL: Discriminator learning and policy training.
- Idea: Train a discriminator to distinguish between expert and agent data.
- Limitation: Struggles with multimodal distributions, unstable training.
- Generative Models: Crucial in control and planning.
- Examples: GANs, VAEs in imitation learning.
- VAE in ACT (ALOHA): Latent space learning for planning.
- Idea: learn a latent space for planning and control (generate actions in chunks).
- Limitation: Hard to train.
Scenario: Imitation Learning
- Challenge: Match high-dimensional, multimodal trajectory distributions.
- Solution: Diffusion models for expressive distribution matching.
- Common Method: GAIL with adversarial training.
- Limitation: Struggles with multimodal distributions, unstable training.
Scenario: Offline Reinforcement Learning
- Challenge: Outperform the demonstrations while keeping actions close to the data distribution.
- Solution: Diffusion models to match the action distribution effectively.
- Common Method: CQL, which penalizes out-of-distribution actions.
- Limitation: Over-conservative.
Scenario: Model-based Reinforcement Learning
- Challenge: Match the dynamics model and the policy's action distribution.
- Solution: Diffusion models for non-autoregressive, multimodal matching.
- Common Method: Planning with learned dynamics.
- Limitation: compounding error in long-horizon planning.
Key: use a powerful model to match a high-dimensional, multimodal distribution.
- Action/Value distribution matching: grounded in demonstrations -> offline RL.
- Trajectory distribution matching: dynamic feasibility and optimal trajectory distribution -> model-based RL.
- Transition distribution matching: dynamics matching in a non-autoregressive manner -> model-based RL.
- Most common: diffuse the trajectory (Diffuser).
- Diffused variable $x$: the state-action sequence $\tau = \{s_0, a_0, s_1, a_1, \ldots, s_T, a_T\}$.
| Task | Things to Diffuse | How to Diffuse |
|---|---|---|
| Image Generation | image pixels | denoise from random noise to a clean image |
| Planning | trajectory $\tau = \{s_0, a_0, \ldots, s_T, a_T\}$ | denoise from random noise to a feasible trajectory |
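To make the "Planning" row concrete, here is a hedged sketch of what gets diffused: the whole plan as one array of stacked states and actions, denoised jointly rather than step by step. The dimensions are illustrative, not from the tutorial.

```python
import numpy as np

# Illustrative sizes: horizon T, state and action dimensions.
T, state_dim, action_dim = 32, 4, 2

# One (noisy) trajectory sample: row t holds (s_t, a_t), so the entire plan
# is treated as a single variable x for the diffusion model.
tau = np.random.randn(T, state_dim + action_dim)

states, actions = tau[:, :state_dim], tau[:, state_dim:]
```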
- Objective: make the trained model generalize to new constraints and tasks.
- Common cases: goal-conditioning, safety, new tasks, etc.
- Possible Methods:
  - Guidance function: shift the distribution with an extra gradient.
  - Classifier-free method: learn a model that can represent both the conditional and unconditional distribution.
  - Inpainting: fill in the missing part of the trajectory by fixing certain start and end states.
- Guidance function: shift the distribution with an extra gradient.
- Predefined guidance function:
  - Method: shift the distribution with a manually defined objective function.
  - Limitation: Might lead to OOD samples, which break the learned diffusion process.
- Predefined guidance function:
$$ \tilde{p}_\theta(\boldsymbol{\tau}) \propto p_\theta(\boldsymbol{\tau})\, h(\boldsymbol{\tau}), \qquad \boldsymbol{\tau}^{i-1} \sim \mathcal{N}\left(\mu+\alpha \Sigma \nabla \mathcal{J}(\mu), \Sigma^i\right) $$
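A minimal sketch of one guided reverse step under these assumptions: `denoise_mean` stands in for the diffusion model and returns the predicted mean $\mu$ and a diagonal covariance $\Sigma$, `objective_grad` is the hand-defined $\nabla \mathcal{J}$, and `alpha` is the guidance scale.

```python
import numpy as np

def guided_reverse_step(tau_i, denoise_mean, objective_grad, alpha=0.1, rng=None):
    """One reverse step with the mean shifted by the objective gradient:
    tau_{i-1} ~ N(mu + alpha * Sigma * grad_J(mu), Sigma)."""
    rng = np.random.default_rng() if rng is None else rng
    mu, sigma = denoise_mean(tau_i)                   # predicted mean and diagonal covariance
    shifted_mu = mu + alpha * sigma * objective_grad(mu)
    return shifted_mu + np.sqrt(sigma) * rng.standard_normal(mu.shape)
```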
- Guidance function: shift the distribution with an extra gradient ▶️ leads to OOD samples.
- Predefined guidance function:
  - Method: shift the distribution with a manually defined function.
  - Limitation: Might lead to OOD samples, which break the learned diffusion process.
- Learned classifier:
  - Method: learn a classifier to distinguish between different constraints (similar to GAN).
  - Limitation: Hard to tune parameters.
- Learned classifier:
$$ \begin{aligned} \nabla \log p\left(\boldsymbol{x}_t \mid y\right) & =\nabla \log \left(\frac{p\left(\boldsymbol{x}_t\right) p\left(y \mid \boldsymbol{x}_t\right)}{p(y)}\right) \\ & =\nabla \log p\left(\boldsymbol{x}_t\right)+\nabla \log p\left(y \mid \boldsymbol{x}_t\right)-\nabla \log p(y) \\ & =\underbrace{\nabla \log p\left(\boldsymbol{x}_t\right)}_{\text{unconditional score}}+\underbrace{\nabla \log p\left(y \mid \boldsymbol{x}_t\right)}_{\text{adversarial gradient}} \end{aligned} $$
- Guidance function: shift the distribution with an extra gradient ▶️ leads to OOD samples.
- Classifier-Free Method: learn a model that can represent both the conditional and unconditional distribution.
  - Method: randomly drop the condition term during training so one model represents both the conditional and unconditional distribution.
$$ \begin{aligned} \nabla \log p\left(\boldsymbol{x}_t \mid y\right) & =\nabla \log p\left(\boldsymbol{x}_t\right)+\gamma\left(\nabla \log p\left(\boldsymbol{x}_t \mid y\right)-\nabla \log p\left(\boldsymbol{x}_t\right)\right) \\ & =\nabla \log p\left(\boldsymbol{x}_t\right)+\gamma \nabla \log p\left(\boldsymbol{x}_t \mid y\right)-\gamma \nabla \log p\left(\boldsymbol{x}_t\right) \\ & =\underbrace{\gamma \nabla \log p\left(\boldsymbol{x}_t \mid y\right)}_{\text{conditional score}}+\underbrace{(1-\gamma) \nabla \log p\left(\boldsymbol{x}_t\right)}_{\text{unconditional score}} \end{aligned} $$
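A minimal sketch of this classifier-free combination, assuming a single score network `score_fn(x, t, cond)` trained with condition dropout so that `cond=None` yields the unconditional score; `gamma` is the guidance weight.

```python
def cfg_score(score_fn, x, t, cond, gamma=1.5):
    """gamma * conditional score + (1 - gamma) * unconditional score."""
    s_cond = score_fn(x, t, cond)     # conditional score, grad log p(x_t | y)
    s_uncond = score_fn(x, t, None)   # unconditional score, grad log p(x_t)
    return gamma * s_cond + (1.0 - gamma) * s_uncond
```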
- Guidance function: shift the distribution with an extra gradient ▶️ leads to OOD samples.
- Classifier-Free Method: learn a model that can represent both the conditional and unconditional distribution.
Guidance Function Method | Classifier-Free Method |
---|---|
- Guidance function: shift the distribution with an extra gradient ▶️ leads to OOD samples.
- Classifier-Free Method: learn a model that can represent both the conditional and unconditional distribution.
- Inpainting: fill in the missing part of the trajectory by fixing certain start and end states.
- Method: fix the start and end states and let the model fill in the rest of the trajectory.
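A minimal sketch of the inpainting constraint, assuming the diffused variable is a `(T, state_dim + action_dim)` trajectory array with states in the leading columns; after each denoising step, the known start and goal states are simply written back in.

```python
import numpy as np

def inpaint_constraints(tau, start_state, goal_state, state_dim):
    """Clamp the first and last states of the plan; the model fills in the rest."""
    tau = tau.copy()
    tau[0, :state_dim] = start_state    # fix the start state
    tau[-1, :state_dim] = goal_state    # fix the goal (end) state
    return tau

# Inside a sampling loop (schematically):
# for i in reversed(range(K)):
#     tau = reverse_step(tau, i)
#     tau = inpaint_constraints(tau, s0, sg, state_dim)
```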
- Common thing to diffuse: trajectory.
- Common way to impose constraints/add objectives: guidance function, classifier-free method, inpainting.
A detailed summary of each method can be found here.
The key to diffusion: how to get the score function.
$$ \color{red}\underbrace{\nabla_x \log P}_{\text{how to get score function}} \color{black}( \color{blue}\underbrace{x}_{\text{what to diffuse}} \color{black}| \color{green}\underbrace{y}_{\text{how to impose constraints/objectives}} \color{black}) $$
- How to get score function: data-driven v.s. analytical.
- What to diffuse: sequential v.s. non-sequential.
- How to impose constraints/objectives: hard v.s. soft.
$$ \color{red}\underbrace{\nabla_x \log P}_{\text{how to get score function}} \color{black}( \color{blue}\underbrace{x}_{\text{what to diffuse}} \color{black}| \color{green}\underbrace{y}_{\text{how to impose constraints/objectives}} \color{black}) $$
- How to get score function: data-driven v.s. analytical.
- Data-driven: learn the score function from data.
- Hybrid: learn the score from intermediate results of an optimization solver.
- Analytical: use the analytical score function.
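To make the data-driven vs. analytical split concrete: for a known distribution the score is available in closed form, while in the data-driven case a trained network plays the same role. The snippet below is a sketch; `learned_score` is a hypothetical placeholder, not a real API.

```python
import numpy as np

def analytical_gaussian_score(x, mu, sigma2):
    """Closed-form score of N(mu, sigma2 * I): grad log p(x) = -(x - mu) / sigma2."""
    return -(x - mu) / sigma2

# Data-driven alternative (hypothetical placeholder): a network trained with
# denoising score matching would supply the score instead.
# score = learned_score(x, t)
```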
$$ \color{red}\underbrace{\nabla_x \log P}_{\text{how to get score function}} \color{black}( \color{blue}\underbrace{x}_{\text{what to diffuse}} \color{black}| \color{green}\underbrace{y}_{\text{how to impose constraints/objectives}} \color{black}) $$
- How to get score function: data-driven v.s. analytical.
- What to diffuse: sequential v.s. non-sequential.
- Action/Value: learn a model to match the action/value distribution, serving as a regularizer and a policy.
- Transition: learn a model to match the transition distribution, serving as a world model ▶️ MPC.
- Trajectory: learn a model to match the trajectory distribution, serving as a trajectory optimization (TO) solver (planning over states vs. state-actions vs. actions).
$$ \color{red}\underbrace{\nabla_x \log P}_{\text{how to get score function}} \color{black}( \color{blue}\underbrace{x}_{\text{what to diffuse}} \color{black}| \color{green}\underbrace{y}_{\text{how to impose constraints/objectives}} \color{black}) $$
- How to get score function: data-driven v.s. analytical.
- What to diffuse: sequential v.s. non-sequential.
- How to impose constraints/objectives: hard v.s. soft.
- Guidance function: Predefined or learned
- Classifier-free: Use the unconditional score and conditional score (most common)
- Inpainting: Fix the known states and fill in the missing parts of the trajectory (complementary to the other two)
$$ \color{red}\underbrace{\nabla_x \log P}_{\text{how to get score function}} \color{black}( \color{blue}\underbrace{x}_{\text{what to diffuse}} \color{black}| \color{green}\underbrace{y}_{\text{how to impose constraints/objectives}} \color{black}) $$
- How to get score function: data-driven v.s. analytical.
- What to diffuse: sequential v.s. non-sequential.
- How to impose constraints/objectives: hard v.s. soft.
- Diffusion in robotics: matches the demonstration distribution from data.
- Use cases: imitation learning, offline RL, model-based RL.
- Role: Learns policy, trajectory, or model as a regularizer/world model/planner.
- Diffusion in robotics: matches dataset distribution in control and planning.
- Use cases: imitation learning, offline RL, model-based RL.
- Role: Learns policy, planner, or model as a distribution matching problem.
- Advantages: high-dimensional matching, stability, scalability.
- Diffusion in robotics: matches dataset distribution in control and planning.
- Use cases: imitation learning, offline RL, model-based RL.
- Role: Learns policy, planner, or model as a distribution matching problem.
- Challenges:
- 🕒 Computational cost: longer training and inference time.
- 🔀 Shifting distribution: difficulties in adapting to dynamic datasets.
- 📊 High variance: inconsistent performance in precision tasks.
- ⛔ Constraint satisfaction: limited adaptability to new constraints.