From 07a8f9890e8479e9065e929ccb922d02521fbbfc Mon Sep 17 00:00:00 2001 From: Haibin Lin Date: Tue, 17 Dec 2024 21:21:52 -0800 Subject: [PATCH 1/7] fix docs --- docs/advance/megatron_extension.rst | 4 ++-- docs/preparation/reward_function.rst | 2 +- docs/start/quickstart.rst | 8 ++++---- docs/workers/fsdp_workers.rst | 6 +++--- docs/workers/megatron_workers.rst | 2 +- 5 files changed, 11 insertions(+), 11 deletions(-) diff --git a/docs/advance/megatron_extension.rst b/docs/advance/megatron_extension.rst index de2d2564..502c1b0f 100644 --- a/docs/advance/megatron_extension.rst +++ b/docs/advance/megatron_extension.rst @@ -1,5 +1,5 @@ Add models to Megatron-LM backend -=========== +=================================== Model ----------- @@ -22,4 +22,4 @@ To support other model, users are required to implement: (vLLM) model. Note that both the actor model and rollout model are partitioned during runtime. So, it’s advisable to map the model name in actor model implementation. Otherwise, you may need an additional - name mapping and even weight transformation. \ No newline at end of file + name mapping and even weight transformation. diff --git a/docs/preparation/reward_function.rst b/docs/preparation/reward_function.rst index b58917bc..8ba7cd29 100644 --- a/docs/preparation/reward_function.rst +++ b/docs/preparation/reward_function.rst @@ -1,5 +1,5 @@ Implment Reward Function for Dataset -======================= +====================================== For each dataset, we need to implement a reward function or utilize a reward model to compute the rewards for the generated responses. We already pre-implemented some reward functions in `reward_score directory `_. diff --git a/docs/start/quickstart.rst b/docs/start/quickstart.rst index 69888f3b..07e60fbe 100644 --- a/docs/start/quickstart.rst +++ b/docs/start/quickstart.rst @@ -1,11 +1,11 @@ .. _quickstart: -========== +========================================================= Quickstart: Fintune a LLM using PPO with GSM8K dataset -========== +========================================================= Post-train a LLM using GSM8K dataset -==================== +=================================================================== Introduction ------------ @@ -107,7 +107,7 @@ Step 4: Perform PPO training with your model on GSM8K Dataset - Users could replace the ``data.train_files`` ,\ ``data.val_files``, ``actor_rollout_ref.model.path`` and ``critic.model.path`` based on their environment. -- See :doc:`config` for detailed explaination of each config field. +- See :doc:`examples/config` for detailed explaination of each config field. **Reward Model/Function** diff --git a/docs/workers/fsdp_workers.rst b/docs/workers/fsdp_workers.rst index e54860f0..fddbcef7 100644 --- a/docs/workers/fsdp_workers.rst +++ b/docs/workers/fsdp_workers.rst @@ -1,5 +1,5 @@ PyTorch FSDP Backend -============ +====================== We support PyTorch FSDP Backend by implementing various workers for actor, critic, reference, rollout and reward models. We also implement @@ -28,7 +28,7 @@ Due to the simplicity, we recommend using FSDP backend for algorithm research and prototyping. FSDP Workers ------------- +-------------- ActorRolloutRefWorker ^^^^^^^^^^^^^^^^^^^^^ @@ -139,4 +139,4 @@ HybridShard We didn’t support FSDP `HybridShard`. To support this, we may need to construct a 2D device mesh and test the corresponding -``dtensor_weight_loader`` and ``hf_weight_loader`` for each model. 
\ No newline at end of file +``dtensor_weight_loader`` and ``hf_weight_loader`` for each model. diff --git a/docs/workers/megatron_workers.rst b/docs/workers/megatron_workers.rst index d6f88c3e..8cdccd99 100644 --- a/docs/workers/megatron_workers.rst +++ b/docs/workers/megatron_workers.rst @@ -1,5 +1,5 @@ Megatron-LM Backend -================ +===================== We support Megatron Backend by implementing various workers for actor, critic, reference, rollout and reward models. We also implement the From b69af033f13237e1877afd9d5e5ff71d056727ea Mon Sep 17 00:00:00 2001 From: "haibin.lin" Date: Thu, 19 Dec 2024 04:31:31 +0800 Subject: [PATCH 2/7] update docs --- README.md | 26 +--- docs/advance/dpo_extension.rst | 6 +- docs/advance/fsdp_extension.rst | 2 +- docs/advance/megatron_extension.rst | 2 +- docs/examples/config.rst | 34 ++--- docs/examples/gsm8k_example.rst | 6 +- docs/examples/ppo_code_architecture.rst | 10 +- docs/experiment/ppo.rst | 32 ++-- docs/index.rst | 26 ++-- docs/preparation/prepare_data.rst | 2 +- docs/start/install.rst | 2 +- docs/start/quickstart.rst | 185 ++++++++++-------------- docs/workers/fsdp_workers.rst | 2 +- docs/workers/ray_trainer.rst | 4 +- 14 files changed, 149 insertions(+), 190 deletions(-) diff --git a/README.md b/README.md index ab07a2b3..5df1912a 100644 --- a/README.md +++ b/README.md @@ -1,16 +1,16 @@

veRL: Volcano Engine Reinforcement Learning for LLM

-veRL (HybridFlow) is a flexible, efficient and industrial-level RL(HF) training framework designed for large language models (LLMs). veRL is the open-source version of [HybridFlow](https://arxiv.org/abs/2409.19256v2) paper. +veRL is a flexible, efficient and production-ready RL training framework designed for large language models (LLMs). veRL is the open-source version of [HybridFlow](https://arxiv.org/abs/2409.19256v2) paper. veRL is flexible and easy to use with: -- **Easy to support diverse RL(HF) algorithms**: The Hybrid programming model combines the strengths of single-controller and multi-controller paradigms to enable flexible representation and efficient execution of complex Post-Training dataflows. Allowing users to build RL dataflows in a few lines of code. +- **Easy extension of diverse RL algorithms**: The Hybrid programming model combines the strengths of single-controller and multi-controller paradigms to enable flexible representation and efficient execution of complex Post-Training dataflows. Allowing users to build RL dataflows in a few lines of code. -- **Seamless integration of existing LLM infra with modular API design**: Decouples computation and data dependencies, enabling seamless integration with existing LLM frameworks, such as PyTorch FSDP, Megatron-LM and vLLM. Moreover, users can easily extend to other LLM training and inference frameworks. +- **Seamless integration of existing LLM infra with modular APIs**: Decouples computation and data dependencies, enabling seamless integration with existing LLM frameworks, such as PyTorch FSDP, Megatron-LM and vLLM. Moreover, users can easily extend to other LLM training and inference frameworks. - **Flexible device mapping**: Supports various placement of models onto different sets of GPUs for efficient resource utilization and scalability across different cluster sizes. -- Readily integration with popular Hugging Face models +- Readily integration with popular HuggingFace models veRL is fast with: @@ -172,24 +172,6 @@ Visit our [documentation](https://verl.readthedocs.io/en/latest/index.html) to l - [Add models to Megatron-LM backend](https://verl.readthedocs.io/en/latest/advance/megatron_extension.html) -## Community and Contribution - -### Communication channel - -[Join us](https://join.slack.com/t/verlgroup/shared_invite/zt-2w5p9o4c3-yy0x2Q56s_VlGLsJ93A6vA) for discussions on slack! - -### Code formatting -We use yapf (Google style) to enforce strict code formatting when reviewing MRs. To reformat you code locally, make sure you installed `yapf` -```bash -pip3 install yapf -``` -Then, make sure you are at top level of verl repo and run -```bash -yapf -ir -vv --style ./.style.yapf verl examples -``` - - - ## Citation ```tex diff --git a/docs/advance/dpo_extension.rst b/docs/advance/dpo_extension.rst index 592d9710..bc8f08ad 100644 --- a/docs/advance/dpo_extension.rst +++ b/docs/advance/dpo_extension.rst @@ -66,7 +66,7 @@ Here, ``SampleGenerator`` can be viewed as a multi-process pulled up by the control flow to call. The implementation details inside can use any inference engine including vllm, sglang and huggingface. Users can largely reuse the code in -verl/verl/trainer/ppo/rollout/vllm_rollout/vllm_rollout.py and we won’t +verl/verl/trainer/ppo/rollout/vllm_rollout/vllm_rollout.py and we won't go into details here. **ReferencePolicy inference** @@ -179,7 +179,7 @@ steps: Frequently calling these 3 steps on the controller process greatly hurts code readability. 
**In veRL, we have abstracted and encapsulated these 3 -steps, so that the worker’s method + dispatch + collect can be +steps, so that the worker's method + dispatch + collect can be registered into the worker_group** .. code:: python @@ -230,7 +230,7 @@ Here it requires the data interface to be ``DataProto``. Definition of Step 3: Main training loop ~~~~~~~~~~~~~~~~~~~~~~~~~~ -With the above training flows, we can implement the algorithm’s control +With the above training flows, we can implement the algorithm's control flow. It is recommended that ``main_task`` is also a ray remote process. .. code:: python diff --git a/docs/advance/fsdp_extension.rst b/docs/advance/fsdp_extension.rst index 11e9d8a1..bb3da95a 100644 --- a/docs/advance/fsdp_extension.rst +++ b/docs/advance/fsdp_extension.rst @@ -28,7 +28,7 @@ loader for the models below in `dtensor_weight_loader.py `_. 3. Weight loader that synchronize the weight from Megatron to rollout (vLLM) model. Note that both the actor model and rollout model are - partitioned during runtime. So, it’s advisable to map the model name + partitioned during runtime. So, it's advisable to map the model name in actor model implementation. Otherwise, you may need an additional name mapping and even weight transformation. diff --git a/docs/examples/config.rst b/docs/examples/config.rst index d7d8fa43..24afd880 100644 --- a/docs/examples/config.rst +++ b/docs/examples/config.rst @@ -22,14 +22,14 @@ Data return_raw_chat: False - ``data.train_files``: Training set parquet. Can be a list or a single - file. The program will read all files into memory, so it can’t be too + file. The program will read all files into memory, so it can't be too large (< 100GB). The path can be either local path or HDFS path. For HDFS path, we provide utils to download it to DRAM and convert the HDFS path to local path. - ``data.val_files``: Validation parquet. Can be a list or a single file. - ``data.prompt_key``: The field in the dataset where the prompt is - located. Default is ‘prompt’. + located. Default is 'prompt'. - ``data.max_prompt_length``: Maximum prompt length. All prompts will be left-padded to this length. An error will be reported if the length is too long @@ -41,13 +41,13 @@ Data iteration. - ``data.return_raw_input_ids``: Whether to return the original input_ids without adding chat template. This is mainly used to - accommodate situations where the reward model’s chat template differs - from the policy. It needs to be decoded first, then apply the RM’s + accommodate situations where the reward model's chat template differs + from the policy. It needs to be decoded first, then apply the RM's chat template. If using a model-based RM, and the policy and RM chat_templates are different, this flag needs to be set - ``data.return_raw_chat``: - ``data.truncation``: Truncate the input_ids or prompt length if they - exceed max_prompt_length. Default is ‘error’, not allow exceed the + exceed max_prompt_length. Default is 'error', not allow exceed the max_prompt_length. The users should increase the max_prompt_length if throwing the error. @@ -114,7 +114,7 @@ Actor/Rollout/Reference Policy **Common config for actor, rollout and reference model** -- ``actor_rollout_ref.hybrid_engine``: Whether it’s a hybrid engine, +- ``actor_rollout_ref.hybrid_engine``: Whether it's a hybrid engine, currently only supports hybrid engine - ``actor_rollout_ref.model.path``: Huggingface model path. This can be either local path or HDFS path. 
For HDFS path, we provide utils to @@ -123,7 +123,7 @@ Actor/Rollout/Reference Policy that need to be imported. Used to register models or tokenizers into the Huggingface system. - ``actor_rollout_ref.model.override_config``: Used to override some of - the model’s original configurations, mainly dropout + the model's original configurations, mainly dropout - ``actor_rollout_ref.model.enable_gradient_checkpointing``: Whether to enable gradient checkpointing for the actor @@ -154,12 +154,12 @@ Actor/Rollout/Reference Policy - ``actor_rollout_ref.actor.shuffle``: Whether to shuffle data when there are multiple epochs -- ``actor_rollout_ref.actor.optim``: Actor’s optimizer parameters +- ``actor_rollout_ref.actor.optim``: Actor's optimizer parameters - ``actor_rollout_ref.actor.fsdp_config``: FSDP config for actor training - - ``wrap_policy``: FSDP wrap policy. By default, it uses Huggingface’s + - ``wrap_policy``: FSDP wrap policy. By default, it uses Huggingface's wrap policy, i.e., wrapping by DecoderLayer - No need to set transformer_layer_cls_to_wrap, so we comment it. @@ -172,7 +172,7 @@ Actor/Rollout/Reference Policy **Reference Model** - ``actor_rollout_ref.ref``: FSDP config same as actor. **For models - larger than 7B, it’s recommended to turn on offload for ref by + larger than 7B, it's recommended to turn on offload for ref by default** - ``actor_rollout_ref.ref.log_prob_micro_batch_size``: The batch size for one forward pass in the computation of ``ref_log_prob``. @@ -180,11 +180,11 @@ Actor/Rollout/Reference Policy **Rollout Model** - ``actor_rollout_ref.rollout.name``: hf/vllm. We use vLLM by default - because it’s much efficient and our hybrid engine is implemented with + because it's much efficient and our hybrid engine is implemented with vLLM. - Rollout (Auto-regressive) parameters. The key should be equal to the - property name in vLLM’s ``SamplingParams``. + property name in vLLM's ``SamplingParams``. - ``temperature``, ``top_k``, ``top_p`` and others: Sampling parameters in ``SamplingParams``. @@ -224,7 +224,7 @@ Actor/Rollout/Reference Policy - ``megatron``: Use Megatron weight loader. Deployed with Megatron backend. The input model ``state_dict()`` is already partitioned along TP dimension and already gathered along PP dimension. This - weight loader requires that the Rollout model and Actor model’s + weight loader requires that the Rollout model and Actor model's parameters shape and name should be identical. - ``dtensor``: Default solution when using Huggingface weight loader. Deployed with FSDP backend and the state_dict_type is @@ -232,7 +232,7 @@ Actor/Rollout/Reference Policy loader - ``hf``: Use Huggingface weight loader. Deployed with FSDP backend and the state_dict_type is ``StateDictType.FULL_STATE_DICT``. This - solution doesn’t need to rewrite the weight loader for each model + solution doesn't need to rewrite the weight loader for each model implemented in vLLM but it results in larger peak memory usage. - ``dummy_hf``, ``dummy_megatron``, ``dummy_dtensor``: Random initialization. @@ -268,11 +268,11 @@ Reward Model responses. If False, the following parameters are not effective. - ``reward_model.model`` - - ``input_tokenizer``: Input tokenizer. If the reward model’s chat + - ``input_tokenizer``: Input tokenizer. If the reward model's chat template is inconsistent with the policy, we need to first decode to - plaintext, then apply the rm’s chat_template. Then score with RM. If + plaintext, then apply the rm's chat_template. Then score with RM. 
If chat_templates are consistent, it can be set to null. - - ``path``: RM’s HDFS path or local path. Note that RM only supports + - ``path``: RM's HDFS path or local path. Note that RM only supports AutoModelForSequenceClassification. Other model types need to define their own RewardModelWorker and pass it from the code. diff --git a/docs/examples/gsm8k_example.rst b/docs/examples/gsm8k_example.rst index 0d3c1f8f..de694cfd 100644 --- a/docs/examples/gsm8k_example.rst +++ b/docs/examples/gsm8k_example.rst @@ -49,7 +49,7 @@ Step 1: Prepare dataset Step 2: Download Model ---------------------- -There’re three ways to prepare the model checkpoints for post-training: +There're three ways to prepare the model checkpoints for post-training: - Download the required models from hugging face @@ -96,7 +96,7 @@ We also provide various training scripts for SFT on GSM8K dataset in `gsm8k sft Step 4: Perform PPO training with your model on GSM8K Dataset ------------------------------------------------------------- -- Prepare your own run.sh script. Here’s an example for GSM8k dataset +- Prepare your own run.sh script. Here's an example for GSM8k dataset and deepseek-llm-7b-chat model. - Users could replace the ``data.train_files`` ,\ ``data.val_files``, ``actor_rollout_ref.model.path`` and ``critic.model.path`` based on @@ -107,7 +107,7 @@ Step 4: Perform PPO training with your model on GSM8K Dataset We use a rule-based reward model. We force the model to produce a final answer following 4 “#” as shown in the solution. We extract the final -answer from both the solution and model’s output using regular +answer from both the solution and model's output using regular expression matching. We compare them and assign a reward of 1 to correct answer, 0.1 to incorrect answer and 0 to no answer. diff --git a/docs/examples/ppo_code_architecture.rst b/docs/examples/ppo_code_architecture.rst index ab1f66a6..1cca8301 100644 --- a/docs/examples/ppo_code_architecture.rst +++ b/docs/examples/ppo_code_architecture.rst @@ -1,7 +1,7 @@ PPO Example Architecture ======================== -Let’s start with the Proximal Policy Optimization algorithm, which is +Let's start with the Proximal Policy Optimization algorithm, which is most widely used algorithm in LLM post-training. The main entry point of the PPO algorithm example is: @@ -151,18 +151,18 @@ Defining reward model/function resource_pool_manager = ResourcePoolManager(resource_pool_spec=resource_pool_spec, mapping=mapping) Since not all tasks use model-based RM, users need to define here -whether it’s a model-based RM or a function-based RM +whether it's a model-based RM or a function-based RM -- If it’s a model-based RM, directly add the ``RewardModel`` role in the +- If it's a model-based RM, directly add the ``RewardModel`` role in the resource mapping and add it to the resource pool mapping. - Note that the pre-defined ``RewardModelWorker`` only supports models with the structure of huggingface - ``AutoModelForSequenceClassification``. If it’s not this model, you + ``AutoModelForSequenceClassification``. If it's not this model, you need to define your own RewardModelWorker in `FSDP Workers `_ and `Megatron-LM Workers `_. -- If it’s a function-based RM, the users are required to classified the +- If it's a function-based RM, the users are required to classified the reward function for each datasets. .. 
code:: python diff --git a/docs/experiment/ppo.rst b/docs/experiment/ppo.rst index 8dcce1c9..332a3ee2 100644 --- a/docs/experiment/ppo.rst +++ b/docs/experiment/ppo.rst @@ -8,17 +8,23 @@ Assuming GSM8k dataset is preprocess via ``python3 examples/data_preprocess/gsm8 Refer to the table below to reproduce PPO training from different pre-trained models. -+----------------------------+------------------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -| Model | Method | Test score | Details | -+============================+========================+============+=====================+=========================================================================================================================================================================================================+ -| google/gemma-2-2b-it | pretrained checkpoint | 23.9 | `Huggingface `_ | -+----------------------------+------------------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -| google/gemma-2-2b-it | SFT | 52.06 | `Command and logs `_ | -+----------------------------+------------------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -| google/gemma-2-2b-it | SFT + PPO | 64.02 | `Command and logs `_, `wandb `_ | -+----------------------------+------------------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -| Qwen/Qwen2.5-0.5B-Instruct | pretrained checkpoint | 36.4 | `Qwen blog `_ | -+----------------------------+------------------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -| Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | `Command and logs `_ | -+----------------------------+------------------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +.. _hf_gemma: https://huggingface.co/google/gemma-2-2b-it#benchmark-results +.. _verl_data_gemma_sft: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-sft-0.411.log +.. _verl_data_gemma_sft_ppo: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-ppo-bsz512_4-prompt1024-resp-512-0.640.log +.. _verl_data_gemma_sft_ppo_wandb: https://api.wandb.ai/links/verl-team/h7ux8602 +.. _qwen_blog: https://qwenlm.github.io/blog/qwen2.5-llm/ +.. 
_verl_data_qwen25_05_ppo: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/Qwen2.5-0.5B-bsz256_2-prompt1024-resp512-0.567.log ++----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+ +| Model | Method | Test score | Details | ++============================+========================+============+=====================+=========================================================================+ +| google/gemma-2-2b-it | pretrained checkpoint | 23.9 | `Huggingface `_ | ++----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+ +| google/gemma-2-2b-it | SFT | 52.06 | `Command and logs `_ | ++----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+ +| google/gemma-2-2b-it | SFT + PPO | 64.02 | `Command and logs `_, `wandb `_ | ++----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+ +| Qwen/Qwen2.5-0.5B-Instruct | pretrained checkpoint | 36.4 | `Qwen blog `_ | ++----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+ +| Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | `Command and logs `_ | ++----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+ \ No newline at end of file diff --git a/docs/index.rst b/docs/index.rst index 80e59783..ce72cd69 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -1,17 +1,19 @@ -Welcome to veRL/HybridFlow's documentation! +Welcome to veRL's documentation! ================================================ -veRL (HybridFlow) is a flexible, efficient and industrial-level RL(HF) training framework designed for large language models (LLMs) Post-Training. +.. _hf_arxiv: https://arxiv.org/pdf/2409.19256 + +veRL is a flexible, efficient and production-ready RL training framework designed for large language models (LLMs) post-training. It is an open source implementation of the `HybridFlow `_ paper. veRL is flexible and easy to use with: -- **Easy to support diverse RL(HF) algorithms**: The Hybrid programming model combines the strengths of single-controller and multi-controller paradigms to enable flexible representation and efficient execution of complex Post-Training dataflows. Allowing users to build RL dataflows in a few lines of code. +- **Easy extension of diverse RL algorithms**: The Hybrid programming model combines the strengths of single-controller and multi-controller paradigms to enable flexible representation and efficient execution of complex Post-Training dataflows. Allowing users to build RL dataflows in a few lines of code. -- **Seamless integration of existing LLM infra with modular API design**: Decouples computation and data dependencies, enabling seamless integration with existing LLM frameworks, such as PyTorch FSDP, Megatron-LM and vLLM. Moreover, users can easily extend to other LLM training and inference frameworks. 
+- **Seamless integration of existing LLM infra with modular APIs**: Decouples computation and data dependencies, enabling seamless integration with existing LLM frameworks, such as PyTorch FSDP, Megatron-LM and vLLM. Moreover, users can easily extend to other LLM training and inference frameworks. -- **Flexible device mapping**: Supports various placement of models onto different sets of GPUs for efficient resource utilization and scalability across different cluster sizes. +- **Flexible device mapping and parallelism**: Supports various placement of models onto different sets of GPUs for efficient resource utilization and scalability across different cluster sizes. -- Readily integration with popular Hugging Face models +- Readily integration with popular HuggingFace models veRL is fast with: @@ -81,7 +83,13 @@ Contribution veRL is free software; you can redistribute it and/or modify it under the terms of the Apache License 2.0. We welcome contributions. -Join us on `GitHub `_ . +Join us on `GitHub `_ and `Slack `_ for discussions. + +Code formatting +^^^^^^^^^^^^^^^^^^^^^^^^ +We use yapf (Google style) to enforce strict code formatting when reviewing MRs. Run yapf at the top level of verl repo: + +.. bash:: -.. and check out our -.. :doc:`contribution guidelines `. + pip3 install yapf + yapf -ir -vv --style ./.style.yapf verl examples tests diff --git a/docs/preparation/prepare_data.rst b/docs/preparation/prepare_data.rst index 58de561c..40c4f644 100644 --- a/docs/preparation/prepare_data.rst +++ b/docs/preparation/prepare_data.rst @@ -10,7 +10,7 @@ to follow the following steps: The data preprocess script can be divided into two parts: 1. The first part is the common part, which loads the dataset from - huggingface’s ``datasets`` package. Then preprocess the datasets with + huggingface's ``datasets`` package. Then preprocess the datasets with the ``make_map_fn`` and then store in the parquet format. .. code:: python diff --git a/docs/start/install.rst b/docs/start/install.rst index 9a932e95..e18c396b 100644 --- a/docs/start/install.rst +++ b/docs/start/install.rst @@ -55,7 +55,7 @@ For users who pursue better scalability, we recommend using Megatron-LM backend. Please install the above dependencies first. Currently, we support Megatron-LM\@core_v0.4.0 and we fix some internal -issues of Megatron-LM. Here’s the additional installation guide. +issues of Megatron-LM. Here's the additional installation guide. The pros, cons and extension guide for using Megatron-LM backend can be found in :doc:`Megatron-LM Workers<../workers/megatron_workers>`. diff --git a/docs/start/quickstart.rst b/docs/start/quickstart.rst index 07e60fbe..5c67cffc 100644 --- a/docs/start/quickstart.rst +++ b/docs/start/quickstart.rst @@ -1,7 +1,7 @@ .. _quickstart: ========================================================= -Quickstart: Fintune a LLM using PPO with GSM8K dataset +Quickstart: Post-train a LLM using PPO with GSM8K dataset ========================================================= Post-train a LLM using GSM8K dataset @@ -10,26 +10,22 @@ Post-train a LLM using GSM8K dataset Introduction ------------ -In this example, we train an LLM to tackle the GSM8k task. +.. _hf_dataset_gsm8k: https://huggingface.co/datasets/gsm8k -Paper: https://arxiv.org/pdf/2110.14168 +In this example, we train an LLM to tackle the `GSM8k `_ task with function-based rewards[1]_. 
-Dataset: https://huggingface.co/datasets/gsm8k +Prerequisite: + +- the latest version of ``verl`` and its dependencies installed following the installation guide. Using the docker image is recommended. + +- an GPU with at least 32 GB memory -Note that the original paper mainly focuses on training a verifier (a -reward model) to solve math problems via Best-of-N sampling. In this -example, we train an RLHF agent using a rule-based reward model. Dataset Introduction -------------------- GSM8k is a math problem dataset. The prompt is an elementary school -problem. The LLM model is required to answer the math problem. - -The training set contains 7473 samples and the test set contains 1319 -samples. - -**An example** +problem. The LLM model is asked to solve the math problem. Below is an example: Prompt @@ -44,129 +40,96 @@ Solution number of teaspoons she used is 7/20, she used 7/20\ *120 = <<7/20*\ 120=42>>42 #### 42 -Step 1: Prepare dataset ------------------------ +Step 1: Prepare the dataset +---------------------------- + +We preprocess the dataset in parquet format so that (1) it contains necessary fields for computing RL rewards and (2) is faster to read. .. code:: bash - cd examples/data_preprocess - python3 gsm8k.py --local_dir ~/data/gsm8k + python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k -Step 2: Download Model ----------------------- +Step 2: Download a model for post-training +------------------------------------------- -There’re three ways to prepare the model checkpoints for post-training: +Usually we recommend starting with an "instruct" model variant so that the model follows instructions. In this example, we start with the ``Qwen2.5-0.5B-Instruct`` model. -- Download the required models from huggingface +If you start from a "base" model variant, doing SFT before RL is recommended. Refer to the `sft directory `_ and `SFT Trainer `_ for further details. .. code:: bash - huggingface-cli download deepseek-ai/deepseek-math-7b-instruct --local-dir ~/models/deepseek-math-7b-instruct --local-dir-use-symlinks False + python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-0.5B-Instruct')" -- Already store your store model in the local directory or HDFS path. -- Also, you can directly use the model name in huggingface (e.g., - deepseek-ai/deepseek-math-7b-instruct) in - ``actor_rollout_ref.model.path`` and ``critic.model.path`` field in - the run script. +Step 3: Perform PPO training with the instruct model +---------------------------------------------------------------------- -Noted that users should prepare checkpoints for actor, critic and reward -model. +**Reward Model/Function** -[Optional] Step 3: SFT your Model ---------------------------------- +We use a pre-defined rule-based reward model. We force the model to produce a final +answer following 4 “#” as shown in the solution. We extract the final +answer from both the solution and model's output using regular +expression matching. We assign a reward of 1 to correct +answer, 0.1 to incorrect answer and 0 to no answer. -We provide a SFT Trainer using PyTorch FSDP in -`fsdp_sft_trainer.py `_. -Users can customize their own SFT -script using our FSDP SFT Trainer. +For mode details, please refer to `verl/utils/reward_score/gsm8k.py `_. -We also provide various training scripts for SFT on GSM8K dataset in `gsm8k sft directory `_. +**Training Script** -.. code:: shell +Now let's run PPO training with the dataset and model above[2]_. - set -x +.. 
code:: bash - torchrun -m verl.trainer.fsdp_sft_trainer \ - data.train_files=$HOME/data/gsm8k/train.parquet \ - data.val_files=$HOME/data/gsm8k/test.parquet \ - data.prompt_key=question \ - data.response_key=answer \ - data.micro_batch_size=8 \ - model.partial_pretrain=deepseek-ai/deepseek-coder-6.7b-instruct \ - trainer.default_hdfs_dir=hdfs://user/verl/experiments/gsm8k/deepseek-coder-6.7b-instruct/ \ - trainer.project_name=gsm8k-sft \ - trainer.experiment_name=gsm8k-sft-deepseek-coder-6.7b-instruct \ - trainer.total_epochs=4 \ - trainer.logger=['console','wandb'] + bash examples/ppo_trainer/run_deepseek7b_llm.sh -Step 4: Perform PPO training with your model on GSM8K Dataset -------------------------------------------------------------- +The script of `run_deepseek7b_llm.sh` -- Prepare your own run.sh script. Here’s an example for GSM8k dataset +- Prepare your own run.sh script. Here's an example for GSM8k dataset and deepseek-llm-7b-chat model. - Users could replace the ``data.train_files`` ,\ ``data.val_files``, ``actor_rollout_ref.model.path`` and ``critic.model.path`` based on their environment. - See :doc:`examples/config` for detailed explaination of each config field. -**Reward Model/Function** - -We use a rule-based reward model. We force the model to produce a final -answer following 4 “#” as shown in the solution. We extract the final -answer from both the solution and model’s output using regular -expression matching. We compare them and assign a reward of 1 to correct -answer, 0.1 to incorrect answer and 0 to no answer. - -**Training Script** - -The training script example for FSDP and Megatron-LM backend are stored in -`examples/ppo_trainer `_ directory. - .. code:: bash - cd ../ppo_trainer - bash run_deepseek7b_llm.sh - -The script of `run_deepseek7b_llm.sh` - -.. 
code:: bash - - set -x - python3 -m verl.trainer.main_ppo \ - data.train_files=~/data/rlhf/gsm8k/train.parquet \ - data.val_files=~/data/rlhf/gsm8k/test.parquet \ - data.train_batch_size=1024 \ - data.val_batch_size=1312 \ - data.max_prompt_length=512 \ - data.max_response_length=512 \ - actor_rollout_ref.model.path=~/models/deepseek-llm-7b-chat \ - actor_rollout_ref.actor.optim.lr=1e-6 \ - actor_rollout_ref.actor.ppo_mini_batch_size=256 \ - actor_rollout_ref.actor.ppo_micro_batch_size=64 \ - actor_rollout_ref.actor.fsdp_config.param_offload=False \ - actor_rollout_ref.actor.fsdp_config.grad_offload=False \ - actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \ - actor_rollout_ref.rollout.micro_batch_size=256 \ - actor_rollout_ref.rollout.log_prob_micro_batch_size=128 \ - actor_rollout_ref.rollout.tensor_model_parallel_size=2 \ - actor_rollout_ref.rollout.name=vllm \ - actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \ - actor_rollout_ref.ref.log_prob_micro_batch_size=128 \ - actor_rollout_ref.ref.fsdp_config.param_offload=True \ - critic.optim.lr=1e-5 \ - critic.model.path=~/models/deepseek-llm-7b-chat \ - critic.model.enable_gradient_checkpointing=False \ - critic.ppo_micro_batch_size=64 \ - critic.model.fsdp_config.param_offload=False \ - critic.model.fsdp_config.grad_offload=False \ - critic.model.fsdp_config.optimizer_offload=False \ - algorithm.kl_ctrl.kl_coef=0.001 \ - trainer.critic_warmup=0 \ - trainer.logger=['console','wandb'] \ - trainer.project_name='verl_example_gsm8k' \ - trainer.experiment_name='deepseek_llm_7b_function_rm' \ - trainer.n_gpus_per_node=8 \ - trainer.nnodes=1 \ - trainer.save_freq=-1 \ - trainer.total_epochs=15 + data.train_files=$HOME/data/gsm8k/train.parquet \ + data.val_files=$HOME/data/gsm8k/test.parquet \ + data.train_batch_size=512 \ + data.val_batch_size=1312 \ + data.max_prompt_length=512 \ + data.max_response_length=256 \ + actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \ + actor_rollout_ref.actor.optim.lr=1e-6 \ + actor_rollout_ref.actor.ppo_mini_batch_size=128 \ + actor_rollout_ref.actor.ppo_micro_batch_size=1 \ + actor_rollout_ref.rollout.log_prob_micro_batch_size=1 \ + actor_rollout_ref.rollout.tensor_model_parallel_size=1 \ + actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \ + actor_rollout_ref.ref.log_prob_micro_batch_size=4 \ + actor_rollout_ref.ref.fsdp_config.param_offload=False \ + critic.optim.lr=1e-5 \ + critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \ + critic.ppo_micro_batch_size=1 \ + algorithm.kl_ctrl.kl_coef=0.001 \ + trainer.logger=['console'] \ + trainer.n_gpus_per_node=1 \ + trainer.nnodes=1 \ + trainer.save_freq=10 \ + trainer.test_freq=10 \ + trainer.total_epochs=15 $@ 2>&1 | tee verl_demo.log + + +# checkpoints/${trainer.project_name}/${trainer.experiment_name} + +# trainer.logger=['console','wandb'] +# trainer.project_name='verl_post_training' \ +# trainer.experiment_name='gsm8k_function_rm' \ + +# actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \ +# critic.model.fsdp_config.optimizer_offload=False \ + + + +.. [1] The original paper (https://arxiv.org/pdf/2110.14168) mainly focuses on training a verifier (a reward model) to solve math problems via Best-of-N sampling. In this example, we train an RL agent using a rule-based reward model. +.. [2] More training script examples for FSDP and Megatron-LM backend are stored in `examples/ppo_trainer `_ directory. 
\ No newline at end of file diff --git a/docs/workers/fsdp_workers.rst b/docs/workers/fsdp_workers.rst index fddbcef7..3ea0e748 100644 --- a/docs/workers/fsdp_workers.rst +++ b/docs/workers/fsdp_workers.rst @@ -137,6 +137,6 @@ additional initialization for the Optimizer. HybridShard ------------ -We didn’t support FSDP `HybridShard`. To support this, we may need to +We didn't support FSDP `HybridShard`. To support this, we may need to construct a 2D device mesh and test the corresponding ``dtensor_weight_loader`` and ``hf_weight_loader`` for each model. diff --git a/docs/workers/ray_trainer.rst b/docs/workers/ray_trainer.rst index 9470bb8f..3b1a9e6f 100644 --- a/docs/workers/ray_trainer.rst +++ b/docs/workers/ray_trainer.rst @@ -94,7 +94,7 @@ CUDA/distributed context in different processes. self.actor_rollout_wg = all_wg['actor_rollout'] self.actor_rollout_wg.init_model() -.. note:: For megatron backend, if we merge the ``worker_groups`` into the same processes, all the roles will utilize the same 3D parallel size. To optimize this, we may need to maintain several 3D process groups for each role in the same distributed context. If you want to use different 3D parallel size for different roles, please follow the similar architecture of the first code block to initialize each role’s ``worker_group`` +.. note:: For megatron backend, if we merge the ``worker_groups`` into the same processes, all the roles will utilize the same 3D parallel size. To optimize this, we may need to maintain several 3D process groups for each role in the same distributed context. If you want to use different 3D parallel size for different roles, please follow the similar architecture of the first code block to initialize each role's ``worker_group`` PPO Training Loop @@ -104,7 +104,7 @@ We implement the PPO training loop by calling the functions in worker_group of each role. The input and output data of each function is a ``DataProto`` object implemented in `protocol.py `_. In the training loop, trainer will dispatch/collect the data to/from different GPUs -following the transfer protocols wrapped in the workers’ functions. The +following the transfer protocols wrapped in the workers' functions. The computation of PPO micro batches is processed in ``update_actor`` and ``update_critic`` functions. From 21e7354d9b84830fd0af947f8d917c94fa66d785 Mon Sep 17 00:00:00 2001 From: "haibin.lin" Date: Thu, 19 Dec 2024 05:49:00 +0800 Subject: [PATCH 3/7] update quickstart --- docs/examples/config.rst | 2 ++ docs/experiment/ppo.rst | 2 ++ docs/start/quickstart.rst | 59 +++++++++++++++++++-------------- verl/trainer/ppo/ray_trainer.py | 2 +- 4 files changed, 40 insertions(+), 25 deletions(-) diff --git a/docs/examples/config.rst b/docs/examples/config.rst index 24afd880..3fc1906b 100644 --- a/docs/examples/config.rst +++ b/docs/examples/config.rst @@ -1,3 +1,5 @@ +.. _config-explain-page: + Config Explaination =================== diff --git a/docs/experiment/ppo.rst b/docs/experiment/ppo.rst index 332a3ee2..8141db1b 100644 --- a/docs/experiment/ppo.rst +++ b/docs/experiment/ppo.rst @@ -1,3 +1,5 @@ +.. _algo-baseline-page: + Algorithm Baselines =================== diff --git a/docs/start/quickstart.rst b/docs/start/quickstart.rst index 5c67cffc..eb7cb935 100644 --- a/docs/start/quickstart.rst +++ b/docs/start/quickstart.rst @@ -77,58 +77,69 @@ For mode details, please refer to `verl/utils/reward_score/gsm8k.py &1 | tee verl_demo.log +You are expected to see the following logs, indicating training in progress: + +.. 
code:: + step:0 - timing/gen:21.470 - timing/ref:4.360 - timing/values:5.800 - critic/kl:0.000 - critic/kl_coeff:0.001 - timing/adv:0.109 - timing/update_critic:15.664 - critic/vf_loss:14.947 - critic/vf_clipfrac:0.000 - critic/vpred_mean:-2.056 - critic/grad_norm:1023.278 - critic/lr(1e-4):0.100 - timing/update_actor:20.314 - actor/entropy_loss:0.433 - actor/pg_loss:-0.005 - actor/pg_clipfrac:0.000 - actor/ppo_kl:0.000 - actor/grad_norm:1.992 - actor/lr(1e-4):0.010 - critic/score/mean:0.004 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.004 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.000 - critic/advantages/max:2.360 - critic/advantages/min:-2.280 - critic/returns/mean:0.003 - critic/returns/max:0.000 - critic/returns/min:0.000 - critic/values/mean:-2.045 - critic/values/max:9.500 - critic/values/min:-14.000 - response_length/mean:239.133 - response_length/max:256.000 - response_length/min:77.000 - prompt_length/mean:104.883 - prompt_length/max:175.000 - prompt_length/min:68.000 + step:1 - timing/gen:23.020 - timing/ref:4.322 - timing/values:5.953 - critic/kl:0.000 - critic/kl_coeff:0.001 - timing/adv:0.118 - timing/update_critic:15.646 - critic/vf_loss:18.472 - critic/vf_clipfrac:0.384 - critic/vpred_mean:1.038 - critic/grad_norm:942.924 - critic/lr(1e-4):0.100 - timing/update_actor:20.526 - actor/entropy_loss:0.440 - actor/pg_loss:0.000 - actor/pg_clipfrac:0.002 - actor/ppo_kl:0.000 - actor/grad_norm:2.060 - actor/lr(1e-4):0.010 - critic/score/mean:0.000 - critic/score/max:0.000 - critic/score/min:0.000 - critic/rewards/mean:0.000 - critic/rewards/max:0.000 - critic/rewards/min:0.000 - critic/advantages/mean:0.000 - critic/advantages/max:2.702 - critic/advantages/min:-2.616 - critic/returns/mean:0.000 - critic/returns/max:0.000 - critic/returns/min:0.000 - critic/values/mean:-2.280 - critic/values/max:11.000 - critic/values/min:-16.000 - response_length/mean:232.242 - response_length/max:256.000 - response_length/min:91.000 - prompt_length/mean:102.398 - prompt_length/max:185.000 - prompt_length/min:70.000 + +Checkout :ref:`algo-baseline-page` for full training and validation logs for reference. + +The checkpoint is saved at the following dir by default: + +- checkpoints/${trainer.project_name}/${trainer.experiment_name} + +To enable ``wandb`` for experiment tracking, set the following configs: + +.. code:: bash + + trainer.logger=['console','wandb'] \ + trainer.project_name=$YOUR_PROJECT_NAME \ + trainer.experiment_name=$YOUR_RUN_NAME \ + +If you encounter out of memory issues, enable the following configs would help: + +- actor_rollout_ref.actor.ppo_micro_batch_size=1 \ -# checkpoints/${trainer.project_name}/${trainer.experiment_name} +- critic.ppo_micro_batch_size=1 \ -# trainer.logger=['console','wandb'] -# trainer.project_name='verl_post_training' \ -# trainer.experiment_name='gsm8k_function_rm' \ +- actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \ -# actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \ -# critic.model.fsdp_config.optimizer_offload=False \ +- critic.model.fsdp_config.optimizer_offload=False \ +For the full set of configs, please refer to :ref:`config-explain-page` for detailed explaination and performance tuning. .. [1] The original paper (https://arxiv.org/pdf/2110.14168) mainly focuses on training a verifier (a reward model) to solve math problems via Best-of-N sampling. In this example, we train an RL agent using a rule-based reward model. 
diff --git a/verl/trainer/ppo/ray_trainer.py b/verl/trainer/ppo/ray_trainer.py index a5f8879c..c24b45ee 100644 --- a/verl/trainer/ppo/ray_trainer.py +++ b/verl/trainer/ppo/ray_trainer.py @@ -417,7 +417,7 @@ def fit(self): # perform validation before training # currently, we only support validation using the reward_function. - if self.val_reward_fn is not None: + if self.val_reward_fn is not None and self.config.trainer.get('val_before_train', True): val_metrics = self._validate() pprint(f'Initial validation metrics: {val_metrics}') logger.log(data=val_metrics, step=global_steps) From c70cb2451d6696cbaf836c253a4139f019bc3aeb Mon Sep 17 00:00:00 2001 From: Haibin Lin Date: Wed, 18 Dec 2024 13:57:53 -0800 Subject: [PATCH 4/7] fix quickstart syntax --- docs/index.rst | 2 +- docs/start/quickstart.rst | 21 ++++++++++----------- 2 files changed, 11 insertions(+), 12 deletions(-) diff --git a/docs/index.rst b/docs/index.rst index ce72cd69..756e4aee 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -89,7 +89,7 @@ Code formatting ^^^^^^^^^^^^^^^^^^^^^^^^ We use yapf (Google style) to enforce strict code formatting when reviewing MRs. Run yapf at the top level of verl repo: -.. bash:: +.. code-block:: bash pip3 install yapf yapf -ir -vv --style ./.style.yapf verl examples tests diff --git a/docs/start/quickstart.rst b/docs/start/quickstart.rst index eb7cb935..8422c470 100644 --- a/docs/start/quickstart.rst +++ b/docs/start/quickstart.rst @@ -12,7 +12,7 @@ Introduction .. _hf_dataset_gsm8k: https://huggingface.co/datasets/gsm8k -In this example, we train an LLM to tackle the `GSM8k `_ task with function-based rewards[1]_. +In this example, we train an LLM to tackle the `GSM8k `_ task with function-based rewards. [1]_ Prerequisite: @@ -45,7 +45,7 @@ Step 1: Prepare the dataset We preprocess the dataset in parquet format so that (1) it contains necessary fields for computing RL rewards and (2) is faster to read. -.. code:: bash +.. code-block:: bash python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k @@ -56,7 +56,7 @@ Usually we recommend starting with an "instruct" model variant so that the model If you start from a "base" model variant, doing SFT before RL is recommended. Refer to the `sft directory `_ and `SFT Trainer `_ for further details. -.. code:: bash +.. code-block:: bash python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-0.5B-Instruct')" @@ -75,12 +75,12 @@ For mode details, please refer to `verl/utils/reward_score/gsm8k.py `_ directory. \ No newline at end of file +.. [2] More training script examples for FSDP and Megatron-LM backend are stored in `examples/ppo_trainer `_ directory. 
From 518f63224b73dc861c6468ac1e1df13572f18638 Mon Sep 17 00:00:00 2001 From: Haibin Lin Date: Wed, 18 Dec 2024 14:06:30 -0800 Subject: [PATCH 5/7] set hdfs to null --- docs/start/quickstart.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/start/quickstart.rst b/docs/start/quickstart.rst index 8422c470..393f0ca2 100644 --- a/docs/start/quickstart.rst +++ b/docs/start/quickstart.rst @@ -93,7 +93,7 @@ Set the ``data.train_files`` ,\ ``data.val_files``, ``actor_rollout_ref.model.pa actor_rollout_ref.actor.optim.lr=1e-6 \ actor_rollout_ref.actor.ppo_mini_batch_size=64 \ actor_rollout_ref.actor.ppo_micro_batch_size=4 \ - actor_rollout_ref.rollout.log_prob_micro_batch_size=4 \ + actor_rollout_ref.rollout.log_prob_micro_batch_size=8 \ actor_rollout_ref.rollout.tensor_model_parallel_size=1 \ actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \ actor_rollout_ref.ref.log_prob_micro_batch_size=4 \ @@ -103,6 +103,7 @@ Set the ``data.train_files`` ,\ ``data.val_files``, ``actor_rollout_ref.model.pa algorithm.kl_ctrl.kl_coef=0.001 \ trainer.logger=['console'] \ +trainer.val_before_train=False \ + trainer.default_hdfs_dir=null \ trainer.n_gpus_per_node=1 \ trainer.nnodes=1 \ trainer.save_freq=10 \ From d1690d89df382a1d0deaefbe2efc6aeb87ded4bb Mon Sep 17 00:00:00 2001 From: Haibin Lin Date: Wed, 18 Dec 2024 14:31:28 -0800 Subject: [PATCH 6/7] fix hdfs path null --- verl/trainer/ppo/ray_trainer.py | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/verl/trainer/ppo/ray_trainer.py b/verl/trainer/ppo/ray_trainer.py index c24b45ee..d68b88a3 100644 --- a/verl/trainer/ppo/ray_trainer.py +++ b/verl/trainer/ppo/ray_trainer.py @@ -515,13 +515,15 @@ def fit(self): if self.config.trainer.save_freq > 0 and (global_steps + 1) % self.config.trainer.save_freq == 0: actor_local_path = os.path.join(self.config.trainer.default_local_dir, 'actor', f'global_step_{global_steps}') - actor_remote_path = os.path.join(self.config.trainer.default_hdfs_dir, 'actor') + actor_remote_path = None if self.config.trainer.default_hdfs_dir is None else os.path.join( + self.config.trainer.default_hdfs_dir, 'actor') self.actor_rollout_wg.save_checkpoint(actor_local_path, actor_remote_path) if self.use_critic: critic_local_path = os.path.join(self.config.trainer.default_local_dir, 'critic', f'global_step_{global_steps}') - critic_remote_path = os.path.join(self.config.trainer.default_hdfs_dir, 'critic') + critic_remote_path = None if self.config.trainer.default_hdfs_dir is None else os.path.join( + self.config.trainer.default_hdfs_dir, 'critic') self.critic_wg.save_checkpoint(critic_local_path, critic_remote_path) global_steps += 1 From 3008c44c9bc86751dd30bae34ff3a678ecfcdc96 Mon Sep 17 00:00:00 2001 From: Haibin Lin Date: Wed, 18 Dec 2024 14:51:57 -0800 Subject: [PATCH 7/7] add key metric --- docs/start/quickstart.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/start/quickstart.rst b/docs/start/quickstart.rst index 393f0ca2..87f25686 100644 --- a/docs/start/quickstart.rst +++ b/docs/start/quickstart.rst @@ -110,7 +110,7 @@ Set the ``data.train_files`` ,\ ``data.val_files``, ``actor_rollout_ref.model.pa trainer.test_freq=10 \ trainer.total_epochs=15 $@ 2>&1 | tee verl_demo.log -You are expected to see the following logs, indicating training in progress: +You are expected to see the following logs, indicating training in progress. The key metric ``val/test_score/openai/gsm8k`` is computed every ``trainer.test_freq`` steps: .. code-block:: bash
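
Note: the quickstart changes above describe the rule-based GSM8K reward only in prose. The sketch below is a minimal illustration of that rule — extract the text after the final ``####`` marker, compare it with the ground-truth answer, and return 1, 0.1, or 0 — using a hypothetical ``gsm8k_score(solution_str, ground_truth)`` helper; the actual implementation shipped in ``verl/utils/reward_score/gsm8k.py`` may differ in name, signature, and edge-case handling.

.. code-block:: python

    import re


    def gsm8k_score(solution_str: str, ground_truth: str) -> float:
        """Illustrative rule-based scorer for GSM8K-style outputs.

        The model is expected to end its response with '#### <answer>'.
        Returns 1.0 for a correct final answer, 0.1 for an incorrect
        final answer, and 0.0 when no '#### <answer>' pattern is found.
        """
        # Grab every '#### <number>' occurrence and keep the last one,
        # mirroring the regular-expression matching described in the docs.
        matches = re.findall(r"####\s*(-?[\d,\.]+)", solution_str)
        if not matches:
            return 0.0  # no parsable final answer
        predicted = matches[-1].replace(",", "").rstrip(".")
        return 1.0 if predicted == ground_truth.strip() else 0.1

A per-dataset function of this shape is what the PPO trainer's reward evaluation (for example, the ``val_reward_fn`` referenced in the ``ray_trainer.py`` hunks above) ultimately relies on when scoring generated responses.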