feat: adds REINFORCE algorithm #357
base: main
Conversation
Force-pushed from dc4ea59 to d11a4d8.
The above script runs the reward model server on 1 node and the actor on 1 node.

It is important to launch all jobs with ``&`` after the srun command, to ensure they do not block each other.

Suggested change:
- It is important to launch all jobs with ``&`` after the srun command, to ensure they do not block each other.
+ It is important to launch all jobs with ``&`` after the srun command to ensure they do not block each other.
trainer.reinforce.batch_iterator.use_flask=${USE_FLASK} \
trainer.reinforce.rollout_batch_seq_length=4096

The above command launches the initial and actor server on 1 node with 8 GPUs.

Suggested change:
- The above command launches the initial and actor server on 1 node with 8 GPUs.
+ The above command launches the initial and actor server on one node with eight GPUs.
rm_model_file=${RM_NEMO_FILE}

The above example launches the reward model server on 8 GPUs and 1 node. Please make sure to change trainer.devices, trainer.num_nodes depending on your model size and scale. Aligner will work on any scale. Also, make sure to tune the trainer.reinforce.inference_micro_batch_size argument. This argument sets the size of the batch the REINFORCE actor is allowed to send to the reward per DP rank.

Suggested change:
- The above example launches the reward model server on 8 GPUs and 1 node. Please make sure to change trainer.devices, trainer.num_nodes depending on your model size and scale. Aligner will work on any scale. Also, make sure to tune the trainer.reinforce.inference_micro_batch_size argument. This argument sets the size of the batch the REINFORCE actor is allowed to send to the reward per DP rank.
+ The above example launches the reward model server on eight GPUs and one node. Make sure to change trainer.devices, trainer.num_nodes depending on your model size and scale. Aligner will work on any scale. Also, make sure to tune the trainer.reinforce.inference_micro_batch_size argument. This argument sets the size of the batch the REINFORCE actor is allowed to send to the reward per DP rank.
Launching Both Servers for REINFORCE training
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

You can use slurm to launch the 2 jobs and get them to coordinate together in a full REINFORCE job via the following:

Suggested change:
- You can use slurm to launch the 2 jobs and get them to coordinate together in a full REINFORCE job via the following:
+ You can use slurm to launch the two jobs and get them to coordinate together in a full REINFORCE job through the following:
wait

The above script runs the reward model server on 1 node and the actor on 1 node.

Suggested change:
- The above script runs the reward model server on 1 node and the actor on 1 node.
+ The above script runs the reward model server on one node and the actor on one node.
It is important to launch all jobs with ``&`` after the srun command, to ensure they do not block each other.

.. note::

Suggested change:
- .. note::
+ .. Note::
REINFORCE Results
%%%%%%%%%%%%%%%%%%%%%%%%%%

Once you've completed reinforce training, you can serve your model using the `megatron_gpt_eval.py <https://github.com/NVIDIA/NeMo/blob/8cd5f1c8e7d4fed9f4f946028cd02047c5d2296f/examples/nlp/language_modeling/megatron_gpt_eval.py#L4>`__ script from the NeMo codebase to run more rigorous evaluation of your trained model.

Suggested change:
- Once you've completed reinforce training, you can serve your model using the `megatron_gpt_eval.py <https://github.com/NVIDIA/NeMo/blob/8cd5f1c8e7d4fed9f4f946028cd02047c5d2296f/examples/nlp/language_modeling/megatron_gpt_eval.py#L4>`__ script from the NeMo codebase to run more rigorous evaluation of your trained model.
+ After you've completed reinforce training, you can serve your model using the `megatron_gpt_eval.py <https://github.com/NVIDIA/NeMo/blob/8cd5f1c8e7d4fed9f4f946028cd02047c5d2296f/examples/nlp/language_modeling/megatron_gpt_eval.py#L4>`__ script from the NeMo codebase to run more rigorous evaluation of your trained model.
@@ -0,0 +1,198 @@
# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.

Do we want to update the copyright date throughout?

@ko3n1g can this be automated, at least the correctness and presence of this header?

LGTM! Just found a few minor nits.
Signed-off-by: Alexander Bukharin <[email protected]>
for more information, see https://pre-commit.ci
Force-pushed from 01b7637 to e851b6d.
Awesome work!
Please add a functional test for reinforce. You may use the ppo one as a reference.
You also need to rebase onto main to make sure REINFORCE works with the latest state of TRTLLM (v13).
My biggest concern with this PR is the huge overlap between the models, since almost all the code is shared. I think we should try to create a common class for all models needing generation.
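For illustration, a minimal sketch of what such a shared class could look like. All names below (GenerationBase, ReinforceActor, generate_rollouts) are placeholders for this discussion, not NeMo-Aligner APIs:

# Placeholder sketch only: shared generation plumbing lives in one base class,
# and the PPO/REINFORCE actors subclass it instead of duplicating the code.
from abc import ABC, abstractmethod


class GenerationBase(ABC):
    """Common interface for actor models that need rollout generation (e.g. via TRT-LLM)."""

    def generate_rollouts(self, prompt_tokens):
        # Shared logic (engine setup, batching, resharding) would live here.
        return [self._generate_one(p) for p in prompt_tokens]

    @abstractmethod
    def _generate_one(self, prompt):
        """Algorithm-specific generation for a single prompt."""


class ReinforceActor(GenerationBase):
    def _generate_one(self, prompt):
        return f"<response to {prompt!r}>"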
NUM_ROLLOUTS=32
ACTOR_GBS=32
REWARD_PORT=5555
host_reward="$(scontrol show hostnames=$SLURM_JOB_NODELIST_HET_GROUP_0 | head -n1)"

This won't work outside of a slurm context; it's probably best to hard-code it with a comment, something like:

Suggested change:
- host_reward="$(scontrol show hostnames=$SLURM_JOB_NODELIST_HET_GROUP_0 | head -n1)"
+ # Change this to the hostname of the server hosting the reward model
+ reward_host="localhost"
#SBATCH hetjob
#SBATCH -N 1 --ntasks-per-node 8 -A <<ACCOUNT>> -p <<PARTITION>> --job-name <<JOBNAME>> -t 4:00:00 --exclusive

NAME="2p_reinforce"

What's 2p?
logger.log_hyperparams(OmegaConf.to_container(cfg))

rm = RemoteGPTRMClient(cfg.remote_rm)
timer = Timer(cfg.exp_manager.get("max_time_per_run"))

For consistency, this should be:
timer = Timer(cfg.exp_manager.get("max_time_per_run") if cfg.exp_manager else None)
self.train_df = pd.DataFrame(columns=["step", "prompt", "response", "reward"])
self.val_df = pd.DataFrame(columns=["step", "prompt", "response", "reward"])

self.timer = SyncTimer(

For newer algos, we should use ScopedTimer instead (NeMo-Aligner/nemo_aligner/algorithms/ppo.py, line 197 in e2c9695):
self.timer = ScopedTimer(reduction="mean", sync_cuda=True, buffer_size=1)
@@ -112,3 +112,28 @@ def select_topk(batch, num_select=1):
    selected_batch = {k: batch[k][selected_idx] for k in batch.keys()}
    return selected_batch


def calculate_rloo_baseline(prompts, reward, mask):

Could you please write a unit test for this?
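A hedged sketch of what such a test could look like, assuming calculate_rloo_baseline takes a 2D tensor of prompt token ids, a 1D reward tensor, and a 1D validity mask, and returns the leave-one-out mean reward over the other valid rollouts that share the same prompt; the import path is an assumption as well:

# Hedged sketch of a unit test for the RLOO baseline (assumptions noted above).
import torch

from nemo_aligner.utils.ppo_utils import calculate_rloo_baseline  # assumed path


def test_calculate_rloo_baseline_is_leave_one_out_mean():
    # Two prompts, two rollouts each: rows 0 and 2 share a prompt, rows 1 and 3 share a prompt.
    prompts = torch.tensor([[1, 2], [3, 4], [1, 2], [3, 4]])
    reward = torch.tensor([1.0, 2.0, 3.0, 4.0])
    mask = torch.ones_like(reward)

    baseline = calculate_rloo_baseline(prompts, reward, mask)

    # With two valid rollouts per prompt, each sample's baseline is the other sample's reward.
    expected = torch.tensor([3.0, 4.0, 1.0, 2.0])
    torch.testing.assert_close(baseline, expected)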
The overlap with megatron_gpt_ppo_actor.py is very high. Given that other models are incoming that need TRTLLM support, I think it would be best to create some base class for all things requiring generation.
@gshennvm Thoughts on this?

I don't want to block this PR over this since it works, but it definitely adds tech debt.
# Speed-up training by accelerating inference stage using TRTLLM
trt_llm:
  enable: False

Should we default to True? Fewer models would be supported, but at least things would be very fast OOTB.
# TRTLLM preallocates activation memory according to the number of input tokens
# By default, assume the max input length is the difference between the model sequence length and the max number of tokens to generate
max_input_len: ${subtract:${model.encoder_seq_length}, ${model.reinforce.length_params.max_length}}
max_input_tokens: ${multiply:${.max_input_len}, ${model.reinforce.rollout_micro_batch_size}}

Should be okay to drop, but please double-check.
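For reference, a small sketch of how the subtract/multiply interpolations above resolve. Registering the resolvers locally is only for illustration (the framework provides its own), and the absolute ${trt_llm.max_input_len} reference stands in for the relative ${.max_input_len} used in the config:

# Illustration of the arithmetic behind max_input_len / max_input_tokens.
from omegaconf import OmegaConf

OmegaConf.register_new_resolver("subtract", lambda a, b: a - b, replace=True)
OmegaConf.register_new_resolver("multiply", lambda a, b: a * b, replace=True)

cfg = OmegaConf.create(
    {
        "model": {
            "encoder_seq_length": 4096,
            "reinforce": {
                "length_params": {"max_length": 2048},
                "rollout_micro_batch_size": 16,
            },
        },
        "trt_llm": {
            "max_input_len": "${subtract:${model.encoder_seq_length},${model.reinforce.length_params.max_length}}",
            "max_input_tokens": "${multiply:${trt_llm.max_input_len},${model.reinforce.rollout_micro_batch_size}}",
        },
    }
)

assert cfg.trt_llm.max_input_len == 2048      # 4096 - 2048
assert cfg.trt_llm.max_input_tokens == 32768  # 2048 * 16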
trt_llm: ${trainer.reinforce.trt_llm}

#peft

nit: comment not adding much
def __init__(
    self,
    cfg: DictConfig,
    model,

I think you should statically type at least the model. It helps folks who are using editors to jump to definitions.
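A minimal sketch of that suggestion; the class name and import path below are guesses based on this PR's file naming, so the TYPE_CHECKING guard keeps the annotation purely static:

# Sketch only: annotate `model` so editors and type checkers can resolve it.
from __future__ import annotations

from typing import TYPE_CHECKING

from omegaconf import DictConfig

if TYPE_CHECKING:
    # Assumed module/class name, not verified against this PR.
    from nemo_aligner.models.nlp.gpt.megatron_gpt_reinforce_actor import MegatronGPTReinforceActorModel


class ReinforceTrainer:
    def __init__(
        self,
        cfg: DictConfig,
        model: MegatronGPTReinforceActorModel,  # jump-to-definition now works in editors
    ) -> None:
        self.cfg = cfg
        self.model = model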
What does this PR do?
Adds the REINFORCE algorithm.
Changelog
Usage
NAME="2p_reinforce"
# PARAMETERS
RM_NEMO_FILE="/path/to/trained_rm.nemo"
ACTOR_NEMO_FILE="/path/to/sft_model.nemo"
TRAIN_DATA_PATH="/path/to/train_prompts.jsonl"
VALID_DATA_PATH="/path/to/test_prompts.jsonl"
RESULTS_DIR="/path/to/results_dir"
mkdir -p $RESULTS_DIR
GPFS="/path/to/nemo-aligner-repo"
MOUNTS="--container-mounts=MOUNTS" # mounts
CONTAINER=<<>> # use the latest NeMo Training container, Aligner will work there
PROJECT=reinforce_run
CRITIC_LOG_DIR="${RESULTS_DIR}/critic_results"
CRITIC_OUTFILE="${CRITIC_LOG_DIR}/critic_output_%j_%t.log"
CRITIC_ERRFILE="${CRITIC_LOG_DIR}/critic_error_%j_%t.err"
REWARD_PORT=5567
CRITIC_CONFIG_PATH="${GPFS}/examples/nlp/gpt/conf"
CRITIC_CONFIG_NAME="inference_rm"
CONF_DIR="${GPFS}/examples/nlp/gpt/conf"
CONFIG_NAME="gpt_reinforce_actor"
mkdir -p $CRITIC_LOG_DIR
CRITIC_NAME="${NAME}_critic"
read -r -d '' cmd_critic_inference <<EOF
cd ${GPFS} \
&& export PYTHONPATH="${GPFS}:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python -u examples/nlp/gpt/serve_reward_model.py \
    --config-path=${CRITIC_CONFIG_PATH} \
    --config-name=${CRITIC_CONFIG_NAME} \
    trainer.num_nodes=1 \
    trainer.devices=8 \
    ++model.tensor_model_parallel_size=4 \
    rm_model_file=${RM_NEMO_FILE} \
    inference.port=${REWARD_PORT}
EOF
srun --het-group=0 -o $CRITIC_OUTFILE -e $CRITIC_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_critic_inference}" &
sleep 30
ACTOR_LOG_DIR="${RESULTS_DIR}/actor_results"
CHECKPOINT_DIR="${ACTOR_LOG_DIR}/checkpoints"
TENSORBOARD_DIR="${ACTOR_LOG_DIR}/tensorboard"
NUM_ROLLOUTS=16
NORMALIZE="True"
ACTOR_LR="1e-6"
ACTOR_GBS=16
KL=0.01
USE_FLASK=False
mkdir -p $ACTOR_LOG_DIR
mkdir -p $TENSORBOARD_DIR
mkdir -p $CHECKPOINT_DIR
ACTOR_NAME="${NAME}_actor"
host_reward="$(scontrol show hostnames=$SLURM_JOB_NODELIST_HET_GROUP_0 | head -n1)"
read -r -d '' cmd_reinforce <<EOF
cd ${GPFS} \
&& export PYTHONPATH="${GPFS}:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python -u examples/nlp/gpt/train_gpt_reinforce_actor.py \
    "model.data.data_prefix={train: [${TRAIN_DATA_PATH}], validation: [${VALID_DATA_PATH}], test: [${VALID_DATA_PATH}]}" \
    pretrained_checkpoint.restore_from_path="${ACTOR_NEMO_FILE}" \
    exp_manager.checkpoint_callback_params.save_top_k=1 \
    exp_manager.explicit_log_dir="${RESULTS_DIR}" \
    trainer.reinforce.max_epochs=1 \
    trainer.reinforce.max_steps=313 \
    trainer.reinforce.val_check_interval=4 \
    trainer.num_nodes=1 \
    trainer.devices=8 \
    trainer.reinforce.trt_llm.enable=True \
    trainer.reinforce.trt_llm.reshard=True \
    trainer.reinforce.trt_llm.unload_engine_train=False \
    ++model.tensor_model_parallel_size=4 \
    ++model.reinforce.num_rollout_samples=${NUM_ROLLOUTS} \
    model.global_batch_size=${ACTOR_GBS} \
    model.micro_batch_size=1 \
    model.optim.lr="\${multiply:${ACTOR_LR},1.001}" \
    model.optim.sched.warmup_steps=0 \
    model.optim.sched.constant_steps=312 \
    model.optim.sched.min_lr=${ACTOR_LR} \
    model.optim.weight_decay=0.01 \
    model.reinforce.rollout_micro_batch_size=16 \
    model.reinforce.forward_micro_batch_size=16 \
    model.reinforce.val_rollout_micro_batch_size=8 \
    model.data.data_impl=jsonl \
    remote_rm.reward_model.ip=${host_reward} \
    remote_rm.reward_model.port=${REWARD_PORT} \
    ++model.reinforce.length_params.max_length=2048 \
    trainer.reinforce.initial_policy_kl_penalty="${KL}" \
    ++model.optim.bucket_cap_mb=200 \
    ++model.dist_ckpt_format=zarr \
    ++model.optim.overlap_grad_sync=False \
    ++model.optim.contiguous_grad_buffer=True \
    ++model.enable_nge=True \
    trainer.reinforce.batch_iterator.use_flask=${USE_FLASK} \
    trainer.reinforce.rollout_batch_seq_length=4096
EOF
srun --het-group=1 -o $PPO_OUTFILE -e $PPO_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_reinforce}" &
wait
Before your PR is "Ready for review"
Pre checks:
Checklist when contributing a new algorithm
Does the trainer support max_steps=-1 and validation?