Offpolicy RL [Deadline: 06.10] #39

Open
lauritowal opened this issue Sep 1, 2022 · 0 comments

Off-Policy RL
To increase sample efficiency, we will use an off-policy RL algorithm to train the agent’s model, as described in [1]. We need to relabel the past trajectories every time we update the reward model to avoid unstable training:

“[...] we use an off-policy RL algorithm, which provides for sample-efficient learning by reusing past experiences that are stored in the replay buffer. However, the learning process of off-policy RL algorithms can be unstable because previous experiences in the replay buffer are labeled with previous learned rewards. To handle this issue, we relabel all of the agent’s past experience every time we update the reward model.”

We can use an off-policy RL algorithm from Stable Baselines [10], which is used in Pref-RL [9], or from RLlib [11], and adapt the experience replay buffer to allow relabeling. The easiest option might be to adapt Pref-RL for off-policy RL.

--> Adapt Pref-RL to use off-policy RL
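
For reference, here is a minimal sketch of the relabeling step on a Stable Baselines3 `ReplayBuffer` (not the Pref-RL implementation). `reward_model` is a hypothetical callable mapping observation/action batches to predicted rewards; the sketch assumes a single environment (`n_envs == 1`):

```python
# Sketch only: overwrite all stored rewards with the current reward model's predictions,
# so that off-policy updates never train on rewards from an outdated reward model.
import torch
from stable_baselines3.common.buffers import ReplayBuffer


def relabel_replay_buffer(buffer: ReplayBuffer, reward_model, batch_size: int = 256) -> None:
    """Relabel every transition in the buffer with the current reward model."""
    # Number of transitions actually stored (the buffer is a ring buffer).
    n_stored = buffer.buffer_size if buffer.full else buffer.pos
    for start in range(0, n_stored, batch_size):
        end = min(start + batch_size, n_stored)
        # SB3 stores arrays with shape (buffer_size, n_envs, ...); index env 0 here.
        obs = torch.as_tensor(buffer.observations[start:end, 0], dtype=torch.float32)
        acts = torch.as_tensor(buffer.actions[start:end, 0], dtype=torch.float32)
        with torch.no_grad():
            new_rewards = reward_model(obs, acts).cpu().numpy().reshape(-1, 1)
        buffer.rewards[start:end] = new_rewards


# Usage (assumed workflow): after each reward-model update and before resuming
# RL training, call e.g. relabel_replay_buffer(model.replay_buffer, reward_model)
# where `model` is an SB3 off-policy agent such as SAC.
```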

@lauritowal lauritowal self-assigned this Sep 1, 2022
@lauritowal lauritowal removed their assignment Sep 15, 2022