feat: adds REINFORCE algorithm #357
base: main
Conversation
Force-pushed from dc4ea59 to d11a4d8.
The above script runs the reward model server on 1 node and the actor on 1 node.

It is important to launch all jobs with ``&`` after the srun command, to ensure they do not block each other.

Suggested change:
- It is important to launch all jobs with ``&`` after the srun command, to ensure they do not block each other.
+ It is important to launch all jobs with ``&`` after the srun command to ensure they do not block each other.
trainer.reinforce.batch_iterator.use_flask=${USE_FLASK} \
trainer.reinforce.rollout_batch_seq_length=4096

The above command launches the initial and actor server on 1 node with 8 GPUs.

Suggested change:
- The above command launches the initial and actor server on 1 node with 8 GPUs.
+ The above command launches the initial and actor server on one node with eight GPUs.
rm_model_file=${RM_NEMO_FILE}

The above example launches the reward model server on 8 GPUs and 1 node. Please make sure to change trainer.devices, trainer.num_nodes depending on your model size and scale. Aligner will work on any scale. Also, make sure to tune the trainer.reinforce.inference_micro_batch_size argument. This argument sets the size of the batch the REINFORCE actor is allowed to send to the reward per DP rank.

Suggested change:
- The above example launches the reward model server on 8 GPUs and 1 node. Please make sure to change trainer.devices, trainer.num_nodes depending on your model size and scale. Aligner will work on any scale. Also, make sure to tune the trainer.reinforce.inference_micro_batch_size argument. This argument sets the size of the batch the REINFORCE actor is allowed to send to the reward per DP rank.
+ The above example launches the reward model server on eight GPUs and one node. Make sure to change trainer.devices, trainer.num_nodes depending on your model size and scale. Aligner will work on any scale. Also, make sure to tune the trainer.reinforce.inference_micro_batch_size argument. This argument sets the size of the batch the REINFORCE actor is allowed to send to the reward per DP rank.
Launching Both Servers for REINFORCE training
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

You can use slurm to launch the 2 jobs and get them to coordinate together in a full REINFORCE job via the following:

Suggested change:
- You can use slurm to launch the 2 jobs and get them to coordinate together in a full REINFORCE job via the following:
+ You can use slurm to launch the two jobs and get them to coordinate together in a full REINFORCE job through the following:
wait

The above script runs the reward model server on 1 node and the actor on 1 node.

Suggested change:
- The above script runs the reward model server on 1 node and the actor on 1 node.
+ The above script runs the reward model server on one node and the actor on one node.
It is important to launch all jobs with ``&`` after the srun command, to ensure they do not block each other.

.. note::

Suggested change:
- .. note::
+ .. Note::
REINFORCE Results
%%%%%%%%%%%%%%%%%%%%%%%%%%

Once you've completed reinforce training, you can serve your model using the `megatron_gpt_eval.py <https://github.com/NVIDIA/NeMo/blob/8cd5f1c8e7d4fed9f4f946028cd02047c5d2296f/examples/nlp/language_modeling/megatron_gpt_eval.py#L4>`__ script from the NeMo codebase to run more rigorous evaluation of your trained model.

Suggested change:
- Once you've completed reinforce training, you can serve your model using the `megatron_gpt_eval.py <https://github.com/NVIDIA/NeMo/blob/8cd5f1c8e7d4fed9f4f946028cd02047c5d2296f/examples/nlp/language_modeling/megatron_gpt_eval.py#L4>`__ script from the NeMo codebase to run more rigorous evaluation of your trained model.
+ After you've completed reinforce training, you can serve your model using the `megatron_gpt_eval.py <https://github.com/NVIDIA/NeMo/blob/8cd5f1c8e7d4fed9f4f946028cd02047c5d2296f/examples/nlp/language_modeling/megatron_gpt_eval.py#L4>`__ script from the NeMo codebase to run more rigorous evaluation of your trained model.
@@ -0,0 +1,198 @@
# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.

Do we want to update the copyright date throughout?

@ko3n1g can this be automated, at least the correctness and presence of this header?

LGTM! Just found a few minor nits.
Signed-off-by: Alexander Bukharin <[email protected]>
for more information, see https://pre-commit.ci
Force-pushed from 01b7637 to e851b6d.
Awesome work!
Please add a functional test for reinforce. You may use the ppo one as a reference.
You also need to rebase onto main to make sure REINFORCE works with the latest state of TRTLLM (v13).
My biggest concern with this PR is the huge overlap between the models, since almost all the code is shared. I think we should try to create a common class for all models needing generation.
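For illustration, a minimal sketch of what such a shared class could look like. All names below (GenerationBase, ReinforceActor, generate_rollouts) are placeholders for this discussion, not NeMo-Aligner APIs:

# Placeholder sketch only: shared generation plumbing lives in one base class,
# and the PPO/REINFORCE actors subclass it instead of duplicating the code.
from abc import ABC, abstractmethod


class GenerationBase(ABC):
    """Common interface for actor models that need rollout generation (e.g. via TRT-LLM)."""

    def generate_rollouts(self, prompt_tokens):
        # Shared logic (engine setup, batching, resharding) would live here.
        return [self._generate_one(p) for p in prompt_tokens]

    @abstractmethod
    def _generate_one(self, prompt):
        """Algorithm-specific generation for a single prompt."""


class ReinforceActor(GenerationBase):
    def _generate_one(self, prompt):
        return f"<response to {prompt!r}>"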
NUM_ROLLOUTS=32
ACTOR_GBS=32
REWARD_PORT=5555
host_reward="$(scontrol show hostnames=$SLURM_JOB_NODELIST_HET_GROUP_0 | head -n1)"

This won't work outside of a slurm context; it's probably best to hard-code it with a comment, something like:

Suggested change:
- host_reward="$(scontrol show hostnames=$SLURM_JOB_NODELIST_HET_GROUP_0 | head -n1)"
+ # Change this to the hostname of the server hosting the reward model
+ reward_host="localhost"
#SBATCH hetjob
#SBATCH -N 1 --ntasks-per-node 8 -A <<ACCOUNT>> -p <<PARTITION>> --job-name <<JOBNAME>> -t 4:00:00 --exclusive

NAME="2p_reinforce"

What's 2p?
logger.log_hyperparams(OmegaConf.to_container(cfg))

rm = RemoteGPTRMClient(cfg.remote_rm)
timer = Timer(cfg.exp_manager.get("max_time_per_run"))

For consistency, this should be:
timer = Timer(cfg.exp_manager.get("max_time_per_run") if cfg.exp_manager else None)
self.train_df = pd.DataFrame(columns=["step", "prompt", "response", "reward"])
self.val_df = pd.DataFrame(columns=["step", "prompt", "response", "reward"])

self.timer = SyncTimer(

For newer algos, we should use ScopedTimer instead (NeMo-Aligner/nemo_aligner/algorithms/ppo.py, line 197 in e2c9695):
self.timer = ScopedTimer(reduction="mean", sync_cuda=True, buffer_size=1)
@@ -112,3 +112,28 @@ def select_topk(batch, num_select=1):
    selected_batch = {k: batch[k][selected_idx] for k in batch.keys()}
    return selected_batch


def calculate_rloo_baseline(prompts, reward, mask):

Could you please write a unit test for this?
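A hedged sketch of what such a test could look like, assuming calculate_rloo_baseline takes a 2D tensor of prompt token ids, a 1D reward tensor, and a 1D validity mask, and returns the leave-one-out mean reward over the other valid rollouts that share the same prompt; the import path is an assumption as well:

# Hedged sketch of a unit test for the RLOO baseline (assumptions noted above).
import torch

from nemo_aligner.utils.ppo_utils import calculate_rloo_baseline  # assumed path


def test_calculate_rloo_baseline_is_leave_one_out_mean():
    # Two prompts, two rollouts each: rows 0 and 2 share a prompt, rows 1 and 3 share a prompt.
    prompts = torch.tensor([[1, 2], [3, 4], [1, 2], [3, 4]])
    reward = torch.tensor([1.0, 2.0, 3.0, 4.0])
    mask = torch.ones_like(reward)

    baseline = calculate_rloo_baseline(prompts, reward, mask)

    # With two valid rollouts per prompt, each sample's baseline is the other sample's reward.
    expected = torch.tensor([3.0, 4.0, 1.0, 2.0])
    torch.testing.assert_close(baseline, expected)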
The overlap with megatron_gpt_ppo_actor.py is very high. Given that other models are incoming that need TRTLLM support, I think it would be best to create some base class for all things requiring generation.
@gshennvm Thoughts on this?

I don't want to block this PR over this since it works, but it definitely adds tech debt.
# Speed-up training by accelerating inference stage using TRTLLM
trt_llm:
  enable: False

Should we default to True? Fewer models would be supported, but at least things would be very fast OOTB.
# TRTLLM preallocates activation memory according to the number of input tokens
# By default, assume the max input length is the difference between the model sequence length and the max number of tokens to generate
max_input_len: ${subtract:${model.encoder_seq_length}, ${model.reinforce.length_params.max_length}}
max_input_tokens: ${multiply:${.max_input_len}, ${model.reinforce.rollout_micro_batch_size}}

Should be okay to drop, but please double-check.
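For reference, a small sketch of how the subtract/multiply interpolations above resolve. Registering the resolvers locally is only for illustration (the framework provides its own), and the absolute ${trt_llm.max_input_len} reference stands in for the relative ${.max_input_len} used in the config:

# Illustration of the arithmetic behind max_input_len / max_input_tokens.
from omegaconf import OmegaConf

OmegaConf.register_new_resolver("subtract", lambda a, b: a - b, replace=True)
OmegaConf.register_new_resolver("multiply", lambda a, b: a * b, replace=True)

cfg = OmegaConf.create(
    {
        "model": {
            "encoder_seq_length": 4096,
            "reinforce": {
                "length_params": {"max_length": 2048},
                "rollout_micro_batch_size": 16,
            },
        },
        "trt_llm": {
            "max_input_len": "${subtract:${model.encoder_seq_length},${model.reinforce.length_params.max_length}}",
            "max_input_tokens": "${multiply:${trt_llm.max_input_len},${model.reinforce.rollout_micro_batch_size}}",
        },
    }
)

assert cfg.trt_llm.max_input_len == 2048      # 4096 - 2048
assert cfg.trt_llm.max_input_tokens == 32768  # 2048 * 16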
trt_llm: ${trainer.reinforce.trt_llm}

#peft

nit: comment not adding much
def __init__(
    self,
    cfg: DictConfig,
    model,

I think you should statically type at least the model. It helps folks who are using editors to jump to definitions.
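A minimal sketch of that suggestion; the class name and import path below are guesses based on this PR's file naming, so the TYPE_CHECKING guard keeps the annotation purely static:

# Sketch only: annotate `model` so editors and type checkers can resolve it.
from __future__ import annotations

from typing import TYPE_CHECKING

from omegaconf import DictConfig

if TYPE_CHECKING:
    # Assumed module/class name, not verified against this PR.
    from nemo_aligner.models.nlp.gpt.megatron_gpt_reinforce_actor import MegatronGPTReinforceActorModel


class ReinforceTrainer:
    def __init__(
        self,
        cfg: DictConfig,
        model: MegatronGPTReinforceActorModel,  # jump-to-definition now works in editors
    ) -> None:
        self.cfg = cfg
        self.model = model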
What does this PR do?
Adds the REINFORCE algorithm.
Changelog
Usage
NAME="2p_reinforce"
# PARAMETERS
RM_NEMO_FILE="/path/to/trained_rm.nemo"
ACTOR_NEMO_FILE="/path/to/sft_model.nemo"
TRAIN_DATA_PATH="/path/to/train_prompts.jsonl"
VALID_DATA_PATH="/path/to/test_prompts.jsonl"
RESULTS_DIR="/path/to/results_dir"
mkdir -p $RESULTS_DIR
GPFS="/path/to/nemo-aligner-repo"
MOUNTS="--container-mounts=MOUNTS" # mounts
CONTAINER=<<>> # use the latest NeMo Training container, Aligner will work there
PROJECT=reinforce_run
CRITIC_LOG_DIR="${RESULTS_DIR}/critic_results"
CRITIC_OUTFILE="${CRITIC_LOG_DIR}/critic_output_%j_%t.log"
CRITIC_ERRFILE="${CRITIC_LOG_DIR}/critic_error_%j_%t.err"
REWARD_PORT=5567
CRITIC_CONFIG_PATH="${GPFS}/examples/nlp/gpt/conf"
CRITIC_CONFIG_NAME="inference_rm"
CONF_DIR="${GPFS}/examples/nlp/gpt/conf"
CONFIG_NAME="gpt_reinforce_actor"
mkdir -p $CRITIC_LOG_DIR
CRITIC_NAME="${NAME}_critic"
read -r -d '' cmd_critic_inference <<EOF
cd ${GPFS} \
&& export PYTHONPATH="${GPFS}:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python -u examples/nlp/gpt/serve_reward_model.py \
    --config-path=${CRITIC_CONFIG_PATH} \
    --config-name=${CRITIC_CONFIG_NAME} \
    trainer.num_nodes=1 \
    trainer.devices=8 \
    ++model.tensor_model_parallel_size=4 \
    rm_model_file=${RM_NEMO_FILE} \
    inference.port=${REWARD_PORT}
EOF
srun --het-group=0 -o $CRITIC_OUTFILE -e $CRITIC_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_critic_inference}" &
sleep 30
ACTOR_LOG_DIR="${RESULTS_DIR}/actor_results"
CHECKPOINT_DIR="${ACTOR_LOG_DIR}/checkpoints"
TENSORBOARD_DIR="${ACTOR_LOG_DIR}/tensorboard"
NUM_ROLLOUTS=16
NORMALIZE="True"
ACTOR_LR="1e-6"
ACTOR_GBS=16
KL=0.01
USE_FLASK=False
mkdir -p $ACTOR_LOG_DIR
mkdir -p $TENSORBOARD_DIR
mkdir -p $CHECKPOINT_DIR
ACTOR_NAME="${NAME}_actor"
host_reward="$(scontrol show hostnames=$SLURM_JOB_NODELIST_HET_GROUP_0 | head -n1)"
read -r -d '' cmd_reinforce <<EOF
cd ${GPFS} \
&& export PYTHONPATH="${GPFS}:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python -u examples/nlp/gpt/train_gpt_reinforce_actor.py \
    "model.data.data_prefix={train: [${TRAIN_DATA_PATH}], validation: [${VALID_DATA_PATH}], test: [${VALID_DATA_PATH}]}" \
    pretrained_checkpoint.restore_from_path="${ACTOR_NEMO_FILE}" \
    exp_manager.checkpoint_callback_params.save_top_k=1 \
    exp_manager.explicit_log_dir="${RESULTS_DIR}" \
    trainer.reinforce.max_epochs=1 \
    trainer.reinforce.max_steps=313 \
    trainer.reinforce.val_check_interval=4 \
    trainer.num_nodes=1 \
    trainer.devices=8 \
    trainer.reinforce.trt_llm.enable=True \
    trainer.reinforce.trt_llm.reshard=True \
    trainer.reinforce.trt_llm.unload_engine_train=False \
    ++model.tensor_model_parallel_size=4 \
    ++model.reinforce.num_rollout_samples=${NUM_ROLLOUTS} \
    model.global_batch_size=${ACTOR_GBS} \
    model.micro_batch_size=1 \
    model.optim.lr="\${multiply:${ACTOR_LR},1.001}" \
    model.optim.sched.warmup_steps=0 \
    model.optim.sched.constant_steps=312 \
    model.optim.sched.min_lr=${ACTOR_LR} \
    model.optim.weight_decay=0.01 \
    model.reinforce.rollout_micro_batch_size=16 \
    model.reinforce.forward_micro_batch_size=16 \
    model.reinforce.val_rollout_micro_batch_size=8 \
    model.data.data_impl=jsonl \
    remote_rm.reward_model.ip=${host_reward} \
    remote_rm.reward_model.port=${REWARD_PORT} \
    ++model.reinforce.length_params.max_length=2048 \
    trainer.reinforce.initial_policy_kl_penalty="${KL}" \
    ++model.optim.bucket_cap_mb=200 \
    ++model.dist_ckpt_format=zarr \
    ++model.optim.overlap_grad_sync=False \
    ++model.optim.contiguous_grad_buffer=True \
    ++model.enable_nge=True \
    trainer.reinforce.batch_iterator.use_flask=${USE_FLASK} \
    trainer.reinforce.rollout_batch_seq_length=4096
EOF
srun --het-group=1 -o $PPO_OUTFILE -e $PPO_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_reinforce}" &
wait
Before your PR is "Ready for review"
Pre checks:
Checklist when contributing a new algorithm
Does the trainer support max_steps=-1 and validation?