Off-Policy RL
To increase sample efficiency, we will use an off-policy RL algorithm to train the agent's policy, as described in [1]. To avoid unstable training, we need to relabel the past trajectories every time we update the reward model:
“[...] we use an off-policy RL algorithm, which provides for sample-efficient learning by reusing past experiences that are stored in the replay buffer. However, the learning process of off-policy RL algorithms can be unstable because previous experiences in the replay buffer are labeled with previous learned rewards. To handle this issue, we relabel all of the agent’s past experience every time we update the reward model.”
We can use an off-policy RL algorithm from Stable Baselines [10], which is already used in Pref-RL [9], or from RLlib [11], and adapt the experience replay buffer to allow relabeling. The easiest option might be to adapt Pref-RL for off-policy RL.
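A minimal sketch of what the relabeling step could look like, assuming Stable Baselines3's `ReplayBuffer` layout and a hypothetical `reward_model.predict(obs, actions)` interface (Pref-RL's actual reward-model API may differ):

```python
import numpy as np
from stable_baselines3.common.buffers import ReplayBuffer


def relabel_replay_buffer(buffer: ReplayBuffer, reward_model) -> None:
    """Overwrite all stored rewards with the current reward model's
    predictions, as described in the quote from [1] above.

    ``reward_model.predict(obs, actions) -> np.ndarray`` is a
    hypothetical interface; substitute whatever the Pref-RL reward
    model actually exposes.
    """
    # Number of valid transitions currently stored in the buffer.
    n = buffer.buffer_size if buffer.full else buffer.pos
    if n == 0:
        return
    # SB3 buffers store arrays of shape (buffer_size, n_envs, *shape);
    # flatten the env dimension for batched reward prediction.
    obs = buffer.observations[:n].reshape(n * buffer.n_envs, -1)
    actions = buffer.actions[:n].reshape(n * buffer.n_envs, -1)
    new_rewards = reward_model.predict(obs, actions)
    # Write the relabeled rewards back in the buffer's native layout.
    buffer.rewards[:n] = np.asarray(new_rewards).reshape(n, buffer.n_envs)
```

Calling this after every reward-model update, and before the agent's next batch of gradient steps, would keep the stored rewards consistent with the latest reward model.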
--> Adapt Pref-RL to use off-policy RL