by Sam Johnson
The blogpost for this project is here.
When training an RL agent, it is quite common for the agent to learn to score well against the reward function you have specified, but at the cost of something else you value that was not accounted for in the goal definition. This problem of negative side effects from a misspecified reward is a consequence of the more general agent behavior of reward hacking, in which an agent exploits a mistake or vulnerability in its environment to garner high reward while failing to achieve the true objective intended by the system designers. Reward hacking is a widespread problem in the field of reinforcement learning (RL), and it persists with RLHF in the language modeling domain. This is an important challenge to address: if we want a future in which RL agents can execute tasks that we humans are unable or unwilling to do, we would also like those goals to be realized without unintended consequences.
Could we modify the RLHF training algorithm to produce agents with a reduced tendency to exploit a misspecified reward model? I developed Greedy-Advantage-Aware RLHF (GAA) to this end.
The design of GAA emerges from the intuition that an agent that has found a reward-hacking policy for a real-world text generation goal has entered a sharp region of the policy space: the agent's policy achieves a high reward relative to similar policies. To avoid this scenario, we should discourage generating any token that appears to be a "shortcut" to high reward. GAA is a modification of the RLHF PPO loop that uses information about the policy distribution to deter agents from generating disproportionately high-reward tokens during training. GAA appears to have a mitigating effect on the exploitation of misspecified reward functions in the simple experiments featured in the blogpost.
I wanted to design a system with the following learning behavior: if a particular token is much better than a randomly sampled token, then make the policy less likely to select it. If a particular token is only slightly better than a randomly sampled token, then make the policy more likely to select it. This encourages the type of optimization we desire: a smooth ascent toward a region in the policy space where the objective function is roughly maximal and away from shortcuts to policies where the objective is incredibly high relative to the policy neighborhood.
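As a toy illustration of this learning rule (not the actual GAA objective), one could imagine shaping the per-token update weight as a function of how much a token's advantage exceeds that of a randomly sampled token. The function name and threshold below are hypothetical and chosen only to make the intended behavior concrete:

import torch

def shaped_update_weight(token_advantage: torch.Tensor,
                         sampled_advantage: torch.Tensor,
                         threshold: float = 1.0) -> torch.Tensor:
    """Toy sketch of the intended learning behavior, not the GAA objective itself.

    If a token looks only slightly better than a randomly sampled alternative,
    return a positive weight (make it more likely); if it looks disproportionately
    better, return a negative weight (make it less likely).
    """
    gap = token_advantage - sampled_advantage
    # Flip the sign of the update for tokens whose advantage gap exceeds the threshold.
    return torch.where(gap > threshold, -gap, gap)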
At the beginning of each GAA rollout, I observe the highest-probability token from the model, i.e., I sample "greedily" from the probability distribution. I will refer to this token as the greedy token.
The objective function and resulting gradients are computed for both the greedy token advantage and the advantage of the token actually sampled for the rollout.
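The snippet below is a minimal sketch of what this "greedy" observation could look like with a Hugging Face-style causal LM; the model, prompt, and variable names are assumptions made for illustration and are not taken from this repository:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative setup only; the actual trainer lives in src.actions.train.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt_ids = tokenizer("This is", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(prompt_ids).logits[:, -1, :]           # next-token logits from the policy

probs = torch.softmax(logits, dim=-1)
sampled_token = torch.multinomial(probs, num_samples=1)   # token used for the rollout
greedy_token = probs.argmax(dim=-1, keepdim=True)         # highest-probability ("greedy") token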
To run the experiments shown in the GAA blogpost, you can run expr1_philo.sh and sharpness_expr.sh from the command line. To run your own experiments using GAA, you can take advantage of the train utility. Simply enter the command python -m src.actions.train with any of the following options (an example invocation is shown after the option list below):
options:
  -h, --help            show this help message and exit
  --gaa, --no-gaa       Use the Greedy-Advantage-Aware trainer
  --wandb, --no-wandb   Use wandb for logging
  --eval, --no-eval     Evaluate the model after training
  --eval_sharpness, --no-eval_sharpness
                        Evaluate the sharpness of the model
  --name NAME           Name of the experiment
  --wandb_project_name WANDB_PROJECT_NAME
                        Name of the wandb project
  --total_phases TOTAL_PHASES
                        Number of training phases
  --batch_size BATCH_SIZE
                        Batch size for training
  --num_minibatches NUM_MINIBATCHES
                        Number of minibatches for PPO
  --gen_len GEN_LEN     Length of the generated text
  --temperature TEMPERATURE
                        Temperature for sampling
  --prefix PREFIX       Prompt for the generated text
  --kl_coef KL_COEF     KL coefficient for PPO objective
  --vf_coef VF_COEF     Value function coefficient for PPO objective
  --reward_fn REWARD_FN
                        Reward function to use
  --bonus_word BONUS_WORD
                        Word that gives bonus reward
  --x_eta X_ETA         Eta multiplier for GAA Training
  --x_sig X_SIG         Sigma multiplier for GAA Training
  --head_learning_rate HEAD_LEARNING_RATE
                        Learning rate for the value head
  --n_eval_samples N_EVAL_SAMPLES
                        Number of samples to generate for evaluation
  --eval_reward_fn EVAL_REWARD_FN
                        Reward function to use for evaluation
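For example, a GAA training run could be launched with a command along these lines (the flag values here are purely illustrative, not recommended settings):

python -m src.actions.train --gaa --name example_gaa_run --total_phases 50 --batch_size 32 --gen_len 32 --eval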
My RLHF implementation is based on the materials in Callum McDougall's ARENA course. The sharpness routines I utilize are adapted from material in the PyHessian library.