This project explores a novel approach to Multi-Task Reinforcement Learning (MT-RL) by incorporating Dynamic Sparse Training techniques. We propose a framework that generates dynamic masks conditioned on both state and task information to enable more effective parameter sharing across tasks.
Key features:
- Dynamic mask generation based on both state and task information
- Structured pruning approach targeting neurons instead of individual weights
- Adaptive parameter sharing mechanism across similar tasks and states
- Integration with SAC (Soft Actor-Critic) for continuous control tasks
The framework consists of two main components:
- Mask Generator: A neural network that generates binary masks for each weight matrix in the base network, taking state, task, and pruning ratio as inputs.
- MT-RL Training Loop: A modified SAC implementation that incorporates dynamic sparse training.
Training combines:
- The standard MT-RL loss for the base network
- Mask similarity optimization driven by policy similarities
- Pairwise similarity computation across tasks to supervise mask generation
The Mask Generator is specified as follows (a sketch follows the list):
- Input:
- State information
- Task encoding (one-hot)
- Pruning ratio parameter
- Output: Binary mask matrices for each layer of the base network
- Architecture:
- Compact neural network designed for structured pruning
- Generates masks at neuron-level instead of weight-level
- Uses state-task concatenated input to inform mask generation
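A minimal sketch of what such a generator might look like, assuming a PyTorch implementation; the trunk sizes, per-layer scoring heads, and top-k binarization are illustrative assumptions rather than the project's exact design:

```python
import torch
import torch.nn as nn

class MaskGenerator(nn.Module):
    """Illustrative mask generator: maps (state, task one-hot, pruning ratio)
    to one binary neuron-level mask per hidden layer of the base network."""

    def __init__(self, state_dim, num_tasks, base_hidden=(400, 400)):
        super().__init__()
        in_dim = state_dim + num_tasks + 1  # +1 for the pruning-ratio scalar
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        # One scoring head per base-network hidden layer (one score per neuron).
        self.heads = nn.ModuleList(nn.Linear(256, h) for h in base_hidden)

    def forward(self, state, task_onehot, prune_ratio):
        # prune_ratio is a Python float in [0, 1)
        ratio = torch.full((*state.shape[:-1], 1), prune_ratio)
        h = self.trunk(torch.cat([state, task_onehot, ratio], dim=-1))
        masks = []
        for head in self.heads:
            scores = head(h)
            k = max(1, int(scores.shape[-1] * (1.0 - prune_ratio)))
            keep = scores.topk(k, dim=-1).indices
            masks.append(torch.zeros_like(scores).scatter_(-1, keep, 1.0))
        return masks  # list of {0,1} tensors, one per hidden layer
```

Note that a hard top-k mask is not differentiable on its own; training the generator end-to-end would require a straight-through estimator or a soft relaxation.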
The MT-RL training loop (a masked-forward sketch follows the component list):
- Architecture: modified SAC implementation
- Components:
- Multi-layer perceptron (MLP) with dynamic masking
- Layer configurations: [400, 400] for policy networks
- State-action value network with similar architecture
- Trajectory encoder: 256-dimensional embedding
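A minimal sketch of how neuron-level masks could plug into such a base network (the [400, 400] sizes come from the configuration above; everything else is illustrative):

```python
import torch
import torch.nn as nn

class MaskedMLP(nn.Module):
    """Base network with [400, 400] hidden layers. Zeroing a hidden unit's
    activation is equivalent to pruning the corresponding rows and columns
    of the adjacent weight matrices, which makes the pruning structured."""

    def __init__(self, in_dim, out_dim, hidden=(400, 400)):
        super().__init__()
        dims = (in_dim, *hidden)
        self.hidden = nn.ModuleList(
            nn.Linear(dims[i], dims[i + 1]) for i in range(len(hidden))
        )
        self.out = nn.Linear(hidden[-1], out_dim)

    def forward(self, x, masks):
        # masks[i]: one {0,1} entry per neuron of hidden layer i
        for layer, mask in zip(self.hidden, masks):
            x = torch.relu(layer(x)) * mask
        return self.out(x)
```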
Each training iteration (see the skeleton after this list):
- Sample tasks and collect trajectories
- Apply dynamic masks generated from the current state-task pairs
- Update the base network using the masked parameters
- Store a subset of trajectories for mask similarity training
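A high-level skeleton of this loop, where `sample_task`, `collect_episode`, `sac_update`, and `update_mask_generator` are hypothetical stand-ins for the usual Meta-World sampling and SAC machinery, not functions from this codebase:

```python
K = 1000            # mask-similarity update interval in episodes (k)
PRUNE_RATIO = 0.90  # ratios up to 0.90-0.95 were tested

for episode in range(num_episodes):
    task_id, task_onehot = sample_task()
    # Masks are regenerated per state inside the rollout, so parameter
    # sharing adapts to both the task and the visited states.
    trajectory = collect_episode(env, policy, mask_gen,
                                 task_onehot, PRUNE_RATIO)
    sac_update(policy, critic, replay_buffer, trajectory)  # masked params only
    trajectory_buffer.store(task_id, trajectory)  # kept for similarity training
    if episode % K == 0:
        update_mask_generator(mask_gen, trajectory_buffer)
```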
Mask similarity training (see the sketch after the process list):
- Schedule: updates every k episodes
- Process:
- Generate masks for all tasks given common states
- Compute pairwise mask similarities
- Normalize similarities to [0,1] range
- Generate 50x50 similarity matrix for 50 tasks
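A sketch of this computation, assuming cosine similarity as the pairwise measure (the choice of similarity function is an assumption; the text only specifies pairwise similarities normalized to [0, 1]):

```python
import torch
import torch.nn.functional as F

def mask_similarity_matrix(mask_gen, common_states, task_onehots, prune_ratio):
    """Average each task's masks over a shared batch of states, then take
    pairwise cosine similarities and min-max normalize to [0, 1].
    With 50 tasks this returns a 50x50 matrix."""
    per_task = []
    for onehot in task_onehots:  # one encoding per task
        batch_onehot = onehot.unsqueeze(0).expand(common_states.shape[0], -1)
        masks = mask_gen(common_states, batch_onehot, prune_ratio)
        per_task.append(torch.cat(masks, dim=-1).mean(dim=0))
    m = F.normalize(torch.stack(per_task), dim=-1)  # (tasks, total_neurons)
    sim = m @ m.T                                   # pairwise cosine similarity
    return (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)
```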
Target policy similarity (sketched below after the list):
- Sample trajectories for each task
- Encode trajectories using VAE into latent space
- Estimate policy distributions in latent space
- Compute pairwise KL divergence between policy distributions
- Generate target similarity matrix
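One way to realize this target, sketched under the assumption that each task's policy is summarized by a diagonal Gaussian over the 256-dimensional latents and that KL divergences are mapped to similarities via exp(-KL); both choices are assumptions, not the project's confirmed design:

```python
import torch

def policy_similarity_matrix(latents_per_task):
    """latents_per_task[i]: (N_i, 256) VAE encodings of task i's trajectories.
    Fits a diagonal Gaussian per task, computes symmetrized pairwise KL
    divergences, and converts them to similarities in [0, 1]."""
    mus = torch.stack([z.mean(dim=0) for z in latents_per_task])
    var = torch.stack([z.var(dim=0) + 1e-6 for z in latents_per_task])

    def kl(i, j):  # KL(N_i || N_j) for diagonal Gaussians
        return 0.5 * ((var[i] / var[j]).sum()
                      + ((mus[j] - mus[i]) ** 2 / var[j]).sum()
                      - mus.shape[-1]
                      + torch.log(var[j] / var[i]).sum())

    n = len(latents_per_task)
    div = torch.zeros(n, n)
    for i in range(n):
        for j in range(n):
            div[i, j] = 0.5 * (kl(i, j) + kl(j, i))  # symmetrize
    sim = torch.exp(-div)  # zero divergence -> similarity 1
    return (sim - sim.min()) / (sim.max() - sim.min() + 1e-8)
```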
Loss functions:
- RL Loss: the standard SAC objective
  L_RL = E[r_t + γ(min Q - α log π)]
- Mask Similarity Loss: MSE between the mask and policy similarities
  L_mask = MSE(S_mask, S_policy)
  where S_mask and S_policy are the 50x50 similarity matrices defined above
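In code the mask term is a single MSE between the two matrices; `lambda_mask` below is an assumed weighting coefficient, since how the two losses are combined is not specified here:

```python
import torch.nn.functional as F

def total_loss(sac_loss, s_mask, s_policy, lambda_mask=1.0):
    # L = L_RL + lambda_mask * MSE(S_mask, S_policy)
    return sac_loss + lambda_mask * F.mse_loss(s_mask, s_policy)
```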
Structured pruning details:
- Operates at the neuron level rather than the weight level
- Pruning ratios tested: up to 90-95% parameter reduction
- Dynamic updates based on network performance
- Mask updates: Every k=1000 episodes
Key hyperparameters:
- Total interactions:
  - MT10: 2M interactions per task
  - MT50: 1M interactions per task (fixed version), 2M (random version)
- VAE encoder dimension: 256
- Trajectory buffer size: 15k steps
- Similarity computation batch size: 50 tasks
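For convenience, the values above collected into one configuration sketch (the key names are illustrative, not the project's actual config schema):

```python
CONFIG = {
    "policy_hidden": [400, 400],
    "vae_latent_dim": 256,
    "trajectory_buffer_steps": 15_000,
    "mask_update_interval_episodes": 1_000,  # k
    "pruning_ratio_range": (0.90, 0.95),
    "similarity_batch_tasks": 50,
    "interactions_per_task": {
        "MT10": 2_000_000,
        "MT50_fixed": 1_000_000,
        "MT50_random": 2_000_000,
    },
}
```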
Important: the current implementation performs roughly 12% below Soft Modularization on average, though it outperforms the SAC baseline. Further optimizations are being explored to close this gap.
Tested on Meta-World environments:
- MT10 fixed
- MT10 random
- MT50 fixed
- MT50 random
Baseline methods compared against:
- Multi-head SAC
- Soft Modularization
- CAGrad
- Pure SAC
Compute resources per run:
- CPU: 4-5 cores
- Memory: 32-50 GB
- GPU: NVIDIA V100

Approximate runtimes:
- MT10 experiments: ~3 days
- MT50 experiments: ~5.5-7 days
Advantages:
- Maintains high sample efficiency
- Reduced parameter count per task's masked sub-network
- Dynamic mask generation enables state-dependent parameter sharing
- Effective feature sharing through sparse network structures
Current limitations:
- Performance gap relative to Soft Modularization
- Realizing wall-clock gains from high pruning ratios requires hardware support for sparse computation
- Computational overhead from mask generation
Future work:
- Optimization of the mask generation architecture
- Exploration of different pruning ratios
- Investigation of alternative similarity metrics
- Hardware-specific optimizations
This project is licensed under the MIT License - see the LICENSE file for details.
This research was conducted using compute resources provided by Compute Canada, at the University of Alberta's Intelligent Robot Learning Lab (IRLL), in collaboration with the Alberta Machine Intelligence Institute (Amii).