
feat: adds REINFORCE algorithm #357

Open
wants to merge 11 commits into base: main

Conversation

abukharin3
Contributor

What does this PR do?

Adds the REINFORCE algorithm.
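
For context, REINFORCE here appears to use a leave-one-out (RLOO) baseline computed from the rollouts sampled per prompt (see the calculate_rloo_baseline helper discussed further down), rather than a learned value model. A minimal sketch of that idea, not the PR's exact implementation (the helper name rloo_advantages and the [num_prompts, k] reward layout are assumptions):

import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: [num_prompts, k], with k rollouts sampled per prompt.
    # Each sample's baseline is the mean reward of the other k - 1 rollouts
    # for the same prompt; the advantage is the reward minus that baseline.
    k = rewards.shape[1]
    baseline = (rewards.sum(dim=1, keepdim=True) - rewards) / (k - 1)
    return rewards - baseline

# Example: 2 prompts with k=4 rollouts each.
advantages = rloo_advantages(torch.tensor([[1.0, 0.0, 0.5, 0.5],
                                           [2.0, 1.0, 1.0, 0.0]]))

The resulting advantage weights the policy-gradient loss, with trainer.reinforce.initial_policy_kl_penalty controlling the KL penalty against the initial policy.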

Changelog

  • Please update the CHANGELOG.md under next version with high level changes in this PR.

Usage

NAME="2p_reinforce"

# PARAMETERS

RM_NEMO_FILE="/path/to/trained_rm.nemo"

ACTOR_NEMO_FILE="/path/to/sft_model.nemo"

TRAIN_DATA_PATH="/path/to/train_prompts.jsonl"
VALID_DATA_PATH="/path/to/test_prompts.jsonl"

RESULTS_DIR="/path/to/results_dir"
mkdir -p $RESULTS_DIR

GPFS="/path/to/nemo-aligner-repo"
MOUNTS="--container-mounts=MOUNTS" # replace MOUNTS with the container mounts you need

CONTAINER=<<>> # use the latest NeMo Training container; Aligner will work there

PROJECT=reinforce_run

CRITIC_LOG_DIR="${RESULTS_DIR}/critic_results"
CRITIC_OUTFILE="${CRITIC_LOG_DIR}/critic_output_%j_%t.log"
CRITIC_ERRFILE="${CRITIC_LOG_DIR}/critic_error_%j_%t.err"
REWARD_PORT=5567
CRITIC_CONFIG_PATH="${GPFS}/examples/nlp/gpt/conf"
CRITIC_CONFIG_NAME="inference_rm"

CONF_DIR="${GPFS}/examples/nlp/gpt/conf"
CONFIG_NAME="gpt_reinforce_actor"

mkdir -p $CRITIC_LOG_DIR

CRITIC_NAME="${NAME}_critic"

read -r -d '' cmd_critic_inference <<EOF
cd ${GPFS} \
&& export PYTHONPATH="${GPFS}:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python -u examples/nlp/gpt/serve_reward_model.py \
    --config-path=${CRITIC_CONFIG_PATH} \
    --config-name=${CRITIC_CONFIG_NAME} \
    trainer.num_nodes=1 \
    trainer.devices=8 \
    ++model.tensor_model_parallel_size=4 \
    rm_model_file=${RM_NEMO_FILE} \
    inference.port=${REWARD_PORT}
EOF

srun --het-group=0 -o $CRITIC_OUTFILE -e $CRITIC_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_critic_inference}" &

sleep 30

ACTOR_LOG_DIR="${RESULTS_DIR}/actor_results"
CHECKPOINT_DIR="${ACTOR_LOG_DIR}/checkpoints"
TENSORBOARD_DIR="${ACTOR_LOG_DIR}/tensorboard"
ACTOR_OUTFILE="${ACTOR_LOG_DIR}/actor_output_%j_%t.log" # log files for the actor job
ACTOR_ERRFILE="${ACTOR_LOG_DIR}/actor_error_%j_%t.err"

NUM_ROLLOUTS=16
NORMALIZE="True"
ACTOR_LR="1e-6"
ACTOR_GBS=16
KL=0.01
USE_FLASK=False

mkdir -p $ACTOR_LOG_DIR
mkdir -p $TENSORBOARD_DIR
mkdir -p $CHECKPOINT_DIR

ACTOR_NAME="${NAME}_actor"

host_reward="$(scontrol show hostnames=$SLURM_JOB_NODELIST_HET_GROUP_0 | head -n1)"

read -r -d '' cmd_reinforce <<EOF
cd ${GPFS} \
&& export PYTHONPATH="${GPFS}:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python -u examples/nlp/gpt/train_gpt_reinforce_actor.py \
    "model.data.data_prefix={train: [${TRAIN_DATA_PATH}], validation: [${VALID_DATA_PATH}], test: [${VALID_DATA_PATH}]}" \
    pretrained_checkpoint.restore_from_path="${ACTOR_NEMO_FILE}" \
    exp_manager.checkpoint_callback_params.save_top_k=1 \
    exp_manager.explicit_log_dir="${RESULTS_DIR}" \
    trainer.reinforce.max_epochs=1 \
    trainer.reinforce.max_steps=313 \
    trainer.reinforce.val_check_interval=4 \
    trainer.num_nodes=1 \
    trainer.devices=8 \
    trainer.reinforce.trt_llm.enable=True \
    trainer.reinforce.trt_llm.reshard=True \
    trainer.reinforce.trt_llm.unload_engine_train=False \
    ++model.tensor_model_parallel_size=4 \
    ++model.reinforce.num_rollout_samples=${NUM_ROLLOUTS} \
    model.global_batch_size=${ACTOR_GBS} \
    model.micro_batch_size=1 \
    model.optim.lr="\${multiply:${ACTOR_LR},1.001}" \
    model.optim.sched.warmup_steps=0 \
    model.optim.sched.constant_steps=312 \
    model.optim.sched.min_lr=${ACTOR_LR} \
    model.optim.weight_decay=0.01 \
    model.reinforce.rollout_micro_batch_size=16 \
    model.reinforce.forward_micro_batch_size=16 \
    model.reinforce.val_rollout_micro_batch_size=8 \
    model.data.data_impl=jsonl \
    remote_rm.reward_model.ip=${host_reward} \
    remote_rm.reward_model.port=${REWARD_PORT} \
    ++model.reinforce.length_params.max_length=2048 \
    trainer.reinforce.initial_policy_kl_penalty="${KL}" \
    ++model.optim.bucket_cap_mb=200 \
    ++model.dist_ckpt_format=zarr \
    ++model.optim.overlap_grad_sync=False \
    ++model.optim.contiguous_grad_buffer=True \
    ++model.enable_nge=True \
    trainer.reinforce.batch_iterator.use_flask=${USE_FLASK} \
    trainer.reinforce.rollout_batch_seq_length=4096
EOF

srun --het-group=1 -o $ACTOR_OUTFILE -e $ACTOR_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_reinforce}" &

wait

Before your PR is "Ready for review"

Pre checks:

  • [Y] Make sure you read and followed Contributor guidelines
  • [N] Did you write any new necessary tests?
  • [Y] Did you add or update any necessary documentation? Make sure to also update the NeMo Framework User Guide which contains the tutorials

Checklist when contributing a new algorithm

  • [Y] Does the trainer resume and restore all model and trainer state?
  • [Y] Does the trainer support all parallelism techniques (PP, TP, DP)?
  • [Y] Does the trainer support max_steps=-1 and validation?
  • [Y] Does the trainer only call APIs defined in alignable_interface.py?
  • [&] Does the trainer have proper logging?

@github-actions github-actions bot added the documentation, Utils, and Algorithms labels on Oct 23, 2024
@abukharin3 abukharin3 changed the title Reinforce pr REINFORCE Oct 23, 2024
@terrykong terrykong changed the title REINFORCE feat: adds REINFORCE algorithm Oct 24, 2024
@zredeaux07 zredeaux07 marked this pull request as ready for review October 24, 2024 18:47

The above script runs the reward model server on 1 node and the actor on 1 node.

It is important to launch all jobs with ``&`` after the srun command, to ensure they do not block each other.


Suggested change
It is important to launch all jobs with ``&`` after the srun command, to ensure they do not block each other.
It is important to launch all jobs with ``&`` after the srun command to ensure they do not block each other.

trainer.reinforce.batch_iterator.use_flask=${USE_FLASK} \
trainer.reinforce.rollout_batch_seq_length=4096

The above command launches the initial and actor server on 1 node with 8 GPUs.


Suggested change
The above command launches the initial and actor server on 1 node with 8 GPUs.
The above command launches the initial and actor server on one node with eight GPUs.

rm_model_file=${RM_NEMO_FILE}


The above example launches the reward model server on 8 GPUs and 1 node. Please make sure to change trainer.devices, trainer.num_nodes depending on your model size and scale. Aligner will work on any scale. Also, make sure to tune the trainer.reinforce.inference_micro_batch_size argument. This argument sets the size of the batch the REINFORCE actor is allowed to send to the reward per DP rank.


Suggested change
The above example launches the reward model server on 8 GPUs and 1 node. Please make sure to change trainer.devices, trainer.num_nodes depending on your model size and scale. Aligner will work on any scale. Also, make sure to tune the trainer.reinforce.inference_micro_batch_size argument. This argument sets the size of the batch the REINFORCE actor is allowed to send to the reward per DP rank.
The above example launches the reward model server on eight GPUs and one node. Make sure to change trainer.devices, trainer.num_nodes depending on your model size and scale. Aligner will work on any scale. Also, make sure to tune the trainer.reinforce.inference_micro_batch_size argument. This argument sets the size of the batch the REINFORCE actor is allowed to send to the reward per DP rank.

Launching Both Servers for REINFORCE training
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

You can use slurm to launch the 2 jobs and get them to coordinate together in a full REINFORCE job via the following:


Suggested change
You can use slurm to launch the 2 jobs and get them to coordinate together in a full REINFORCE job via the following:
You can use slurm to launch the two jobs and get them to coordinate together in a full REINFORCE job through the following:


wait

The above script runs the reward model server on 1 node and the actor on 1 node.


Suggested change
The above script runs the reward model server on 1 node and the actor on 1 node.
The above script runs the reward model server on one node and the actor on one node.


It is important to launch all jobs with ``&`` after the srun command, to ensure they do not block each other.

.. note::


Suggested change
.. note::
.. Note::

REINFORCE Results
%%%%%%%%%%%%%%%%%%%%%%%%%%

Once you've completed reinforce training, you can serve your model using the `megatron_gpt_eval.py <https://github.com/NVIDIA/NeMo/blob/8cd5f1c8e7d4fed9f4f946028cd02047c5d2296f/examples/nlp/language_modeling/megatron_gpt_eval.py#L4>`__ script from the NeMo codebase to run more rigorous evaluation of your trained model.


Suggested change
Once you've completed reinforce training, you can serve your model using the `megatron_gpt_eval.py <https://github.com/NVIDIA/NeMo/blob/8cd5f1c8e7d4fed9f4f946028cd02047c5d2296f/examples/nlp/language_modeling/megatron_gpt_eval.py#L4>`__ script from the NeMo codebase to run more rigorous evaluation of your trained model.
After you've completed reinforce training, you can serve your model using the `megatron_gpt_eval.py <https://github.com/NVIDIA/NeMo/blob/8cd5f1c8e7d4fed9f4f946028cd02047c5d2296f/examples/nlp/language_modeling/megatron_gpt_eval.py#L4>`__ script from the NeMo codebase to run more rigorous evaluation of your trained model.

@@ -0,0 +1,198 @@
# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
@zredeaux07 zredeaux07 Oct 24, 2024

Do we want to update the copyright date throughout?

Collaborator

@ko3n1g can this be automated? at least the correctness && presence of this header?

zredeaux07 previously approved these changes Oct 24, 2024
@zredeaux07 zredeaux07 left a comment

LGTM! Just found a few minor nits.

Alexander Bukharin and others added 11 commits October 24, 2024 20:46
Signed-off-by: Alexander Bukharin <[email protected]>
Signed-off-by: Alexander Bukharin <[email protected]>
Signed-off-by: Alexander Bukharin <[email protected]>
Signed-off-by: Alexander Bukharin <[email protected]>
Signed-off-by: Alexander Bukharin <[email protected]>
Signed-off-by: Alexander Bukharin <[email protected]>
Signed-off-by: Alexander Bukharin <[email protected]>
Signed-off-by: Alexander Bukharin <[email protected]>
Signed-off-by: Alexander Bukharin <[email protected]>
for more information, see https://pre-commit.ci

Signed-off-by: Alexander Bukharin <[email protected]>
Signed-off-by: Alexander Bukharin <[email protected]>
@terrykong terrykong (Collaborator) left a comment

Awesome work!

Please add a functional test for reinforce. You may use the ppo one as reference.

You also need to rebase onto main to make sure REINFORCE works with the latest state of TRTLLM (v13).

My biggest concern with this PR is the huge overlap between the models, since almost all the code is shared. I think we should try to create a common class for all models needing generation.

NUM_ROLLOUTS=32
ACTOR_GBS=32
REWARD_PORT=5555
host_reward="$(scontrol show hostnames=$SLURM_JOB_NODELIST_HET_GROUP_0 | head -n1)"
Collaborator

This won't work outside of a slurm context; it's probably best to hard-code it with a comment, so something like:

Suggested change
host_reward="$(scontrol show hostnames=$SLURM_JOB_NODELIST_HET_GROUP_0 | head -n1)"
# Change this to the hostname of server hosting the reward model
reward_host="localhost"

#SBATCH hetjob
#SBATCH -N 1 --ntasks-per-node 8 -A <<ACCOUNT>> -p <<PARTITION>> --job-name <<JOBNAME>> -t 4:00:00 --exclusive

NAME="2p_reinforce"
Collaborator

what's 2p?

logger.log_hyperparams(OmegaConf.to_container(cfg))

rm = RemoteGPTRMClient(cfg.remote_rm)
timer = Timer(cfg.exp_manager.get("max_time_per_run"))
Collaborator

for consistency, this should be:

timer = Timer(cfg.exp_manager.get("max_time_per_run") if cfg.exp_manager else None)

self.train_df = pd.DataFrame(columns=["step", "prompt", "response", "reward"])
self.val_df = pd.DataFrame(columns=["step", "prompt", "response", "reward"])

self.timer = SyncTimer(
Collaborator

For newer algos, we should use ScopedTimer instead:

self.timer = ScopedTimer(reduction="mean", sync_cuda=True, buffer_size=1)

@@ -112,3 +112,28 @@ def select_topk(batch, num_select=1):

selected_batch = {k: batch[k][selected_idx] for k in batch.keys()}
return selected_batch


def calculate_rloo_baseline(prompts, reward, mask):
Collaborator

Could you please write a unit test for this?
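
A minimal sketch of what such a test could look like, assuming calculate_rloo_baseline returns, for each sample, the leave-one-out mean reward over rollouts that share the same prompt, and assuming the import path from ppo_utils (both the semantics and the path should be checked against the actual implementation):

import torch
# Import path assumed from the select_topk context above; adjust to the real module.
from nemo_aligner.utils.ppo_utils import calculate_rloo_baseline

def test_calculate_rloo_baseline_leave_one_out_mean():
    # Two distinct prompts, each sampled twice (2 rollouts per prompt).
    prompts = torch.tensor([[0, 1], [0, 1], [2, 3], [2, 3]])
    reward = torch.tensor([1.0, 3.0, 0.0, 4.0])
    mask = torch.ones_like(reward)  # assumed to mark all samples as valid

    baseline = calculate_rloo_baseline(prompts=prompts, reward=reward, mask=mask)

    # With 2 rollouts per prompt, each sample's baseline should be the other
    # sample's reward for the same prompt.
    expected = torch.tensor([3.0, 1.0, 4.0, 0.0])
    torch.testing.assert_close(baseline, expected)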

Collaborator

The overlap with megatron_gpt_ppo_actor.py is very high. Given that other models are incoming that need TRTLLM support, I think it would be best to create some base class for all things requiring generation.

@gshennvm Thoughts on this?

Collaborator

I don't want to block this PR because of this since it works, but this definitely adds tech debt


# Speed-up training by accelerating inference stage using TRTLLM
trt_llm:
enable: False
Collaborator

Should we default to True? Fewer models would be supported, but at least things would be very fast OOTB

# TRTLLM preallocates activation memory according to the number of input tokens
# By default, assume the max input length is the difference between the model sequence length and the max number of tokens to generate
max_input_len: ${subtract:${model.encoder_seq_length}, ${model.reinforce.length_params.max_length}}
max_input_tokens: ${multiply:${.max_input_len}, ${model.reinforce.rollout_micro_batch_size}}
Collaborator

should be okay to drop, but please double check


trt_llm: ${trainer.reinforce.trt_llm}

#peft
Collaborator

nit: comment not adding much

def __init__(
self,
cfg: DictConfig,
model,
Collaborator

I think you should statically type at least the model. It helps folks who are using editors to jump to definitions.
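
For example, something along these lines (a sketch only; MegatronGPTReinforceActorModel is an assumed name for the actual actor model class, and the real trainer takes more arguments):

from omegaconf import DictConfig

class ReinforceTrainer:
    def __init__(
        self,
        cfg: DictConfig,
        # "MegatronGPTReinforceActorModel" is an assumed class name; a string
        # forward reference keeps the annotation without forcing an import here.
        model: "MegatronGPTReinforceActorModel",
    ) -> None:
        self.cfg = cfg
        self.model = model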

Labels: Algorithms, documentation, Utils

3 participants