Off-Policy RL
To increase sample efficiency, we will use an off-policy RL algorithm to train the agent's policy, as described in [1]. To avoid unstable training, we need to relabel the past trajectories every time we update the reward model:
“[...] we use an off-policy RL algorithm, which provides for sample-efficient learning by reusing past experiences that are stored in the replay buffer. However, the learning process of off-policy RL algorithms can be unstable because previous experiences in the replay buffer are labeled with previous learned rewards. To handle this issue, we relabel all of the agent’s past experience every time we update the reward model.”
We can use an off-policy RL algorithm from Stable Baselines [10], which is already used in Pref-RL [9], or from RLlib [11], and adapt the experience replay buffer to allow relabeling. The easiest option might be to adapt Pref-RL for off-policy RL.
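A minimal sketch of what the relabeling step could look like, assuming Stable Baselines3's `ReplayBuffer` layout and a hypothetical `reward_model.predict(obs, actions)` interface (Pref-RL's actual reward-model API may differ):

```python
import numpy as np
from stable_baselines3.common.buffers import ReplayBuffer


def relabel_replay_buffer(buffer: ReplayBuffer, reward_model) -> None:
    """Overwrite all stored rewards with the current reward model's
    predictions, as described in the quote from [1] above.

    ``reward_model.predict(obs, actions) -> np.ndarray`` is a
    hypothetical interface; substitute whatever the Pref-RL reward
    model actually exposes.
    """
    # Number of valid transitions currently stored in the buffer.
    n = buffer.buffer_size if buffer.full else buffer.pos
    if n == 0:
        return
    # SB3 buffers store arrays of shape (buffer_size, n_envs, *shape);
    # flatten the env dimension for batched reward prediction.
    obs = buffer.observations[:n].reshape(n * buffer.n_envs, -1)
    actions = buffer.actions[:n].reshape(n * buffer.n_envs, -1)
    new_rewards = reward_model.predict(obs, actions)
    # Write the relabeled rewards back in the buffer's native layout.
    buffer.rewards[:n] = np.asarray(new_rewards).reshape(n, buffer.n_envs)
```

Calling this after every reward-model update, and before the agent's next batch of gradient steps, would keep the stored rewards consistent with the latest reward model.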
--> Adapt Pref-RL to use off-policy RL