| theme | highlighter | class | title | background | layout |
|---|---|---|---|---|---|
| seriph | shiki | text-center | Diffusion Model for Control and Planning Tutorial | figs/diffuse_teaser.gif | cover |
- 🔄 Recap: What is a Diffusion Model?
- 🚀 Motivation: Why a Generative Model in Control and Planning?
- 🛠️ Practice: How to Use the Diffuser?
- 📚 Literatures: Recent Research Progress in Diffusion for RL/Control
- 📝 Summary & Challenges in Diffusion Models
- Keynote: Generative model for distribution matching.
- Applications: Image and text generation, creative tasks.
- Keynote: Generative model for distribution matching.
- Applications: Image and text generation, creative tasks.
- Core: Score function for sample generation and distribution description. $$ \boldsymbol{x}_{i+1} \leftarrow \boldsymbol{x}_i+c \nabla \log p\left(\boldsymbol{x}_i\right)+\sqrt{2 c} \boldsymbol{\epsilon}, \quad i=0,1, \ldots, K $$
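Below is a minimal sketch of this Langevin-style sampling rule, assuming a `score_fn` that returns $\nabla \log p(x)$; the step size `c`, step count `K`, and the Gaussian example are illustrative choices, not values from the tutorial.

```python
import numpy as np

def langevin_sample(score_fn, x0, c=1e-2, K=100, rng=None):
    """Follow the score with added Gaussian noise: x_{i+1} = x_i + c*score(x_i) + sqrt(2c)*eps."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(K):
        eps = rng.standard_normal(x.shape)
        x = x + c * score_fn(x) + np.sqrt(2 * c) * eps
    return x

# Example: the score of a standard Gaussian is -x, so this draws an approximate N(0, I) sample.
sample = langevin_sample(lambda x: -x, x0=np.zeros(2))
```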
- Keynote: Generative model for distribution matching.
- Applications: Image and text generation, creative tasks.
- Core: Score function for sample generation and distribution description.
- Advantages:
- 🌟 Multimodal: Effective with multimodal distributions.
- 📈 Scalable: Suits high-dimensional problems.
- 🔒 Stable: Grounded in solid mathematics and training.
- 🔄 Non-autoregressive: Predicts entire trajectories efficiently.
- Generative Models: applied in imitation learning to match expert data.
- Examples: GANs, VAEs in imitation learning.
- Generative Models: applied in imitation learning to match expert data.
- Examples: GANs, VAEs in imitation learning.
- GAN in GAIL: Discriminator learning and policy training.
- Idea: Train a discriminator to distinguish between expert and agent data.
- Limitation: Struggles with multimodal distributions, unstable training.
- Generative Models: Crucial in control and planning.
- Examples: GANs, VAEs in imitation learning.
- VAE in ACT (ALOHA): Latent space learning for planning.
- Idea: learn a latent space for planning and control (generate actions in chunks).
- Limitation: Hard to train.
Scenario: Imitation Learning
- Challenge: Match high-dimensional, multimodal trajectory distributions.
- Solution: Diffusion models for expressive distribution matching.
- Common Method: GAIL with adversarial training.
- Limitation: Struggles with multimodal distributions, unstable training.
Scenario: Offline Reinforcement Learning
- Challenge: Outperform the demonstrations while keeping actions close to the data distribution.
- Solution: Diffusion models to match the action distribution effectively.
- Common Method: CQL, which penalizes out-of-distribution actions.
- Limitation: Over-conservative.
Scenario: Model-based Reinforcement Learning
- Challenge: Match the dynamics model and the policy's action distribution.
- Solution: Diffusion models for non-autoregressive, multimodal matching.
- Common Method: Planning with learned dynamics.
- Limitation: compounding error in long-horizon planning.
Key: use a powerful model to match a high-dimensional, multimodal distribution.
- Action/Value distribution matching: grounded in demonstrations -> offline RL.
- Trajectory distribution matching: dynamic feasibility and optimal trajectory distribution -> model-based RL.
- Transition distribution matching: dynamics matching in a non-autoregressive manner -> model-based RL.
- Most common: diffuse the trajectory (Diffuser).
- Diffused variable $x$: the state-action sequence $\tau = \{s_0, a_0, s_1, a_1, \ldots, s_T, a_T\}$.
| Task | Things to Diffuse | How to Diffuse |
|---|---|---|
| Image Generation | image pixels | denoise from random noise to a clean image |
| Planning | trajectory $\tau = \{s_0, a_0, \ldots, s_T, a_T\}$ | denoise from random noise to a feasible trajectory |
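To make the "Planning" row concrete, here is a hedged sketch of what gets diffused: the whole plan as one array of stacked states and actions, denoised jointly rather than step by step. The dimensions are illustrative, not from the tutorial.

```python
import numpy as np

# Illustrative sizes: horizon T, state and action dimensions.
T, state_dim, action_dim = 32, 4, 2

# One (noisy) trajectory sample: row t holds (s_t, a_t), so the entire plan
# is treated as a single variable x for the diffusion model.
tau = np.random.randn(T, state_dim + action_dim)

states, actions = tau[:, :state_dim], tau[:, state_dim:]
```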
- Objective: make the trained model generalize to new constraints and tasks.
- Common cases: goal-conditioning, safety, new tasks, etc.
- Possible Methods:
  - Guidance function: shift the distribution with an extra gradient.
  - Classifier-free method: learn a model that can represent both the conditional and unconditional distribution.
  - Inpainting: fill in the missing part of the trajectory by fixing certain start and end states.
- Guidance function: shift the distribution with an extra gradient.
- Predefined guidance function:
  - Method: shift the distribution with a manually defined objective function.
  - Limitation: Might lead to OOD samples, which break the learned diffusion process.
- Predefined guidance function:
$$ \tilde{p}_\theta(\boldsymbol{\tau}) \propto p_\theta(\boldsymbol{\tau})\, h(\boldsymbol{\tau}), \qquad \boldsymbol{\tau}^{i-1} \sim \mathcal{N}\left(\mu+\alpha \Sigma \nabla \mathcal{J}(\mu), \Sigma^i\right) $$
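A minimal sketch of one guided reverse step under these assumptions: `denoise_mean` stands in for the diffusion model and returns the predicted mean $\mu$ and a diagonal covariance $\Sigma$, `objective_grad` is the hand-defined $\nabla \mathcal{J}$, and `alpha` is the guidance scale.

```python
import numpy as np

def guided_reverse_step(tau_i, denoise_mean, objective_grad, alpha=0.1, rng=None):
    """One reverse step with the mean shifted by the objective gradient:
    tau_{i-1} ~ N(mu + alpha * Sigma * grad_J(mu), Sigma)."""
    rng = np.random.default_rng() if rng is None else rng
    mu, sigma = denoise_mean(tau_i)                   # predicted mean and diagonal covariance
    shifted_mu = mu + alpha * sigma * objective_grad(mu)
    return shifted_mu + np.sqrt(sigma) * rng.standard_normal(mu.shape)
```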
- Guidance function: shift the distribution with an extra gradient ▶️ leads to OOD samples.
- Predefined guidance function:
  - Method: shift the distribution with a manually defined function.
  - Limitation: Might lead to OOD samples, which break the learned diffusion process.
- Learned classifier:
  - Method: learn a classifier to distinguish between different constraints (similar to GAN).
  - Limitation: Hard to tune parameters.
- Learned classifier:
$$ \begin{aligned} \nabla \log p\left(\boldsymbol{x}_t \mid y\right) & =\nabla \log \left(\frac{p\left(\boldsymbol{x}_t\right) p\left(y \mid \boldsymbol{x}_t\right)}{p(y)}\right) \\ & =\nabla \log p\left(\boldsymbol{x}_t\right)+\nabla \log p\left(y \mid \boldsymbol{x}_t\right)-\nabla \log p(y) \\ & =\underbrace{\nabla \log p\left(\boldsymbol{x}_t\right)}_{\text{unconditional score}}+\underbrace{\nabla \log p\left(y \mid \boldsymbol{x}_t\right)}_{\text{adversarial gradient}} \end{aligned} $$
- Guidance function: shift the distribution with an extra gradient ▶️ leads to OOD samples.
- Classifier-Free Method: learn a model that can represent both the conditional and unconditional distribution.
  - Method: randomly drop the condition term during training so one model represents both the conditional and unconditional distribution.
$$ \begin{aligned} \nabla \log p\left(\boldsymbol{x}_t \mid y\right) & =\nabla \log p\left(\boldsymbol{x}_t\right)+\gamma\left(\nabla \log p\left(\boldsymbol{x}_t \mid y\right)-\nabla \log p\left(\boldsymbol{x}_t\right)\right) \\ & =\nabla \log p\left(\boldsymbol{x}_t\right)+\gamma \nabla \log p\left(\boldsymbol{x}_t \mid y\right)-\gamma \nabla \log p\left(\boldsymbol{x}_t\right) \\ & =\underbrace{\gamma \nabla \log p\left(\boldsymbol{x}_t \mid y\right)}_{\text{conditional score}}+\underbrace{(1-\gamma) \nabla \log p\left(\boldsymbol{x}_t\right)}_{\text{unconditional score}} \end{aligned} $$
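A minimal sketch of this classifier-free combination, assuming a single score network `score_fn(x, t, cond)` trained with condition dropout so that `cond=None` yields the unconditional score; `gamma` is the guidance weight.

```python
def cfg_score(score_fn, x, t, cond, gamma=1.5):
    """gamma * conditional score + (1 - gamma) * unconditional score."""
    s_cond = score_fn(x, t, cond)     # conditional score, grad log p(x_t | y)
    s_uncond = score_fn(x, t, None)   # unconditional score, grad log p(x_t)
    return gamma * s_cond + (1.0 - gamma) * s_uncond
```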
- Guidance function: shift the distribution with an extra gradient ▶️ leads to OOD samples.
- Classifier-Free Method: learn a model that can represent both the conditional and unconditional distribution.
Guidance Function Method | Classifier-Free Method |
---|---|
- Guidance function: shift the distribution with an extra gradient ▶️ leads to OOD samples.
- Classifier-Free Method: learn a model that can represent both the conditional and unconditional distribution.
- Inpainting: fill in the missing part of the trajectory by fixing certain start and end states.
- Method: fix the start and end states and let the model fill in the rest of the trajectory.
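A minimal sketch of the inpainting constraint, assuming the diffused variable is a `(T, state_dim + action_dim)` trajectory array with states in the leading columns; after each denoising step, the known start and goal states are simply written back in.

```python
import numpy as np

def inpaint_constraints(tau, start_state, goal_state, state_dim):
    """Clamp the first and last states of the plan; the model fills in the rest."""
    tau = tau.copy()
    tau[0, :state_dim] = start_state    # fix the start state
    tau[-1, :state_dim] = goal_state    # fix the goal (end) state
    return tau

# Inside a sampling loop (schematically):
# for i in reversed(range(K)):
#     tau = reverse_step(tau, i)
#     tau = inpaint_constraints(tau, s0, sg, state_dim)
```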
- Common thing to diffuse: trajectory.
- Common way to impose constraints/add objectives: guidance function, classifier-free method, inpainting.
A detailed summary of each method can be found here.
The key to diffusion: how to get the score function.
$$ \color{red}\underbrace{\nabla_x \log P}_{\text{how to get score function}} \color{black}( \color{blue}\underbrace{x}_{\text{what to diffuse}} \color{black}| \color{green}\underbrace{y}_{\text{how to impose constraints/objectives}} \color{black}) $$
- How to get score function: data-driven v.s. analytical.
- What to diffuse: sequential v.s. non-sequential.
- How to impose constraints/objectives: hard v.s. soft.
$$ \color{red}\underbrace{\nabla_x \log P}_{\text{how to get score function}} \color{black}( \color{blue}\underbrace{x}_{\text{what to diffuse}} \color{black}| \color{green}\underbrace{y}_{\text{how to impose constraints/objectives}} \color{black}) $$
- How to get score function: data-driven v.s. analytical.
- Data-driven: learn the score function from data.
- Hybrid: learn the score from intermediate results of an optimization solver.
- Analytical: use the analytical score function.
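To make the data-driven vs. analytical split concrete: for a known distribution the score is available in closed form, while in the data-driven case a trained network plays the same role. The snippet below is a sketch; `learned_score` is a hypothetical placeholder, not a real API.

```python
import numpy as np

def analytical_gaussian_score(x, mu, sigma2):
    """Closed-form score of N(mu, sigma2 * I): grad log p(x) = -(x - mu) / sigma2."""
    return -(x - mu) / sigma2

# Data-driven alternative (hypothetical placeholder): a network trained with
# denoising score matching would supply the score instead.
# score = learned_score(x, t)
```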
$$ \color{red}\underbrace{\nabla_x \log P}_{\text{how to get score function}} \color{black}( \color{blue}\underbrace{x}_{\text{what to diffuse}} \color{black}| \color{green}\underbrace{y}_{\text{how to impose constraints/objectives}} \color{black}) $$
- How to get score function: data-driven v.s. analytical.
- What to diffuse: sequential v.s. non-sequential.
- Action/Value: learn a model to match the action/value distribution, serving as a regularizer and a policy.
- Transition: learn a model to match the transition distribution, serving as a world model ▶️ MPC.
- Trajectory: learn a model to match the trajectory distribution, serving as a trajectory optimization (TO) solver (planning over states vs. state-actions vs. actions).
$$ \color{red}\underbrace{\nabla_x \log P}_{\text{how to get score function}} \color{black}( \color{blue}\underbrace{x}_{\text{what to diffuse}} \color{black}| \color{green}\underbrace{y}_{\text{how to impose constraints/objectives}} \color{black}) $$
- How to get score function: data-driven v.s. analytical.
- What to diffuse: sequential v.s. non-sequential.
- How to impose constraints/objectives: hard v.s. soft.
- Guidance function: Predefined or learned
- Classifier-free: Use the unconditional score and conditional score (most common)
- Inpainting: Fix the known states and fill in the missing parts of the trajectory (complementary to the other two)
$$ \color{red}\underbrace{\nabla_x \log P}_{\text{how to get score function}} \color{black}( \color{blue}\underbrace{x}_{\text{what to diffuse}} \color{black}| \color{green}\underbrace{y}_{\text{how to impose constraints/objectives}} \color{black}) $$
- How to get score function: data-driven v.s. analytical.
- What to diffuse: sequential v.s. non-sequential.
- How to impose constraints/objectives: hard v.s. soft.
- Diffusion in robotics: matches the demonstration distribution from data.
- Use cases: imitation learning, offline RL, model-based RL.
- Role: Learns policy, trajectory, or model as a regularizer/world model/planner.
- Diffusion in robotics: matches dataset distribution in control and planning.
- Use cases: imitation learning, offline RL, model-based RL.
- Role: Learns policy, planner, or model as a distribution matching problem.
- Advantages: high-dimensional matching, stability, scalability.
- Diffusion in robotics: matches dataset distribution in control and planning.
- Use cases: imitation learning, offline RL, model-based RL.
- Role: Learns policy, planner, or model as a distribution matching problem.
- Challenges:
- 🕒 Computational cost: longer training and inference time.
- 🔀 Shifting distribution: difficulties in adapting to dynamic datasets.
- 📊 High variance: inconsistent performance in precision tasks.
- ⛔ Constraint satisfaction: limited adaptability to new constraints.