AWAC (Advantage Weighted Actor Critic) [1] combines sample-efficient dynamic programming with maximum-likelihood policy updates. The result is a simple and effective framework that can leverage large amounts of offline data and then quickly fine-tune the policy online, reaching expert-level performance after collecting only a limited amount of additional interaction data. The full AWAC procedure for offline RL with online fine-tuning is summarized in Algorithm 1 of [1].
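At a high level, the algorithm seeds a replay buffer with the offline dataset, repeatedly applies the critic and actor updates given below, and after an initial offline phase starts appending online rollouts to the same buffer. The following is a minimal structural sketch of that loop; the helper callables, function names, and hyperparameter values are illustrative placeholders, not the authors' implementation.

```python
import random

def awac_train(offline_data, collect_rollout, update_critic, update_actor,
               offline_steps=25_000, total_steps=500_000, batch_size=1024):
    """Structural sketch of the AWAC loop: offline pre-training followed by
    online fine-tuning with the exact same update rules. The three callables
    are placeholders for the critic/actor updates and environment rollouts."""
    buffer = list(offline_data)                   # replay buffer seeded with offline data
    for step in range(total_steps):
        batch = random.sample(buffer, min(batch_size, len(buffer)))
        update_critic(batch)                      # TD backup on the Q-function
        update_actor(batch)                       # advantage-weighted policy regression
        if step >= offline_steps:                 # online phase: keep growing the buffer
            buffer.extend(collect_rollout())
```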
In a practical implementation, we can parameterize the actor $\pi_\phi$ and the critic $Q_\theta$ by neural networks and perform SGD updates from

$$\theta \leftarrow \theta - \eta_Q \,\nabla_\theta\, \mathbb{E}_{(s,a,s') \sim \beta}\!\left[ \left( Q_\theta(s,a) - \left( r(s,a) + \gamma\, \mathbb{E}_{a' \sim \pi_\phi}\!\left[ Q_{\bar{\theta}}(s',a') \right] \right) \right)^{2} \right]$$

and

$$\phi \leftarrow \phi + \eta_\pi \,\nabla_\phi\, \mathbb{E}_{(s,a) \sim \beta}\!\left[ \log \pi_\phi(a \mid s)\, \exp\!\left( \tfrac{1}{\lambda} A^{\pi_k}(s,a) \right) \right],$$

where $\beta$ is the replay buffer (initialized with the offline dataset), $Q_{\bar{\theta}}$ is a target network, $A^{\pi_k}(s,a) = Q_{\bar{\theta}}(s,a) - \mathbb{E}_{a'' \sim \pi_\phi}\!\left[ Q_{\bar{\theta}}(s,a'') \right]$ is the advantage estimate, and $\lambda$ is the Lagrange multiplier of the implicit policy constraint.
AWAC achieves data efficiency by estimating the critic off-policy with bootstrapping, while the constrained actor update keeps the policy close to the observed data and thereby mitigates the bootstrap error that arises in offline RL. Because this constraint is enforced implicitly, without explicitly modeling the behavior policy, AWAC also avoids overly conservative updates.
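A minimal PyTorch sketch of one AWAC gradient step under these two updates is shown below. It uses a fixed-variance Gaussian policy and omits details such as target networks and clipped double-Q estimation that a faithful implementation would include; the network sizes, hyperparameters, and dummy batch are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

# Illustrative dimensions and hyperparameters (obs/action sizes match HalfCheetah-v2).
obs_dim, act_dim = 17, 6
gamma, lam = 0.99, 1.0        # discount factor and Lagrange multiplier (temperature)

actor_mean = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))
log_std = nn.Parameter(torch.zeros(act_dim))        # state-independent Gaussian log-std
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1))
actor_opt = torch.optim.Adam(list(actor_mean.parameters()) + [log_std], lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)

def policy(s):
    return Normal(actor_mean(s), log_std.exp())

def awac_update(s, a, r, s_next, done):
    # Critic: one SGD step on the squared TD error toward the bootstrapped target.
    with torch.no_grad():
        a_next = policy(s_next).sample()
        target = r + gamma * (1.0 - done) * critic(torch.cat([s_next, a_next], -1))
    critic_loss = F.mse_loss(critic(torch.cat([s, a], -1)), target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: weighted maximum likelihood on buffer actions, weight = exp(A / lambda).
    with torch.no_grad():
        v = critic(torch.cat([s, policy(s).sample()], -1))    # V(s) ≈ Q(s, a~pi)
        adv = critic(torch.cat([s, a], -1)) - v               # advantage of buffer action
        weight = torch.exp(adv / lam)
    log_prob = policy(s).log_prob(a).sum(-1, keepdim=True)
    actor_loss = -(weight * log_prob).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return critic_loss.item(), actor_loss.item()

# Dummy batch of 32 transitions, only to demonstrate the expected tensor shapes.
B = 32
awac_update(torch.randn(B, obs_dim), torch.randn(B, act_dim), torch.randn(B, 1),
            torch.randn(B, obs_dim), torch.zeros(B, 1))
```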
For example, training on the HalfCheetah-v2 dataset can be launched with:

```bash
python awac-train.py --dataset=HalfCheetah-v2 --seed=0 --gpu=0
```
[1] Nair, A., Dalal, M., Gupta, A., et al. Accelerating Online Reinforcement Learning with Offline Datasets. arXiv preprint arXiv:2006.09359, 2020.