From 07a8f9890e8479e9065e929ccb922d02521fbbfc Mon Sep 17 00:00:00 2001 From: Haibin Lin Date: Tue, 17 Dec 2024 21:21:52 -0800 Subject: [PATCH 1/7] fix docs --- docs/advance/megatron_extension.rst | 4 ++-- docs/preparation/reward_function.rst | 2 +- docs/start/quickstart.rst | 8 ++++---- docs/workers/fsdp_workers.rst | 6 +++--- docs/workers/megatron_workers.rst | 2 +- 5 files changed, 11 insertions(+), 11 deletions(-) diff --git a/docs/advance/megatron_extension.rst b/docs/advance/megatron_extension.rst index de2d2564..502c1b0f 100644 --- a/docs/advance/megatron_extension.rst +++ b/docs/advance/megatron_extension.rst @@ -1,5 +1,5 @@ Add models to Megatron-LM backend -=========== +=================================== Model ----------- @@ -22,4 +22,4 @@ To support other model, users are required to implement: (vLLM) model. Note that both the actor model and rollout model are partitioned during runtime. So, it’s advisable to map the model name in actor model implementation. Otherwise, you may need an additional - name mapping and even weight transformation. \ No newline at end of file + name mapping and even weight transformation. diff --git a/docs/preparation/reward_function.rst b/docs/preparation/reward_function.rst index b58917bc..8ba7cd29 100644 --- a/docs/preparation/reward_function.rst +++ b/docs/preparation/reward_function.rst @@ -1,5 +1,5 @@ Implment Reward Function for Dataset -======================= +====================================== For each dataset, we need to implement a reward function or utilize a reward model to compute the rewards for the generated responses. We already pre-implemented some reward functions in `reward_score directory `_. diff --git a/docs/start/quickstart.rst b/docs/start/quickstart.rst index 69888f3b..07e60fbe 100644 --- a/docs/start/quickstart.rst +++ b/docs/start/quickstart.rst @@ -1,11 +1,11 @@ .. _quickstart: -========== +========================================================= Quickstart: Fintune a LLM using PPO with GSM8K dataset -========== +========================================================= Post-train a LLM using GSM8K dataset -==================== +=================================================================== Introduction ------------ @@ -107,7 +107,7 @@ Step 4: Perform PPO training with your model on GSM8K Dataset - Users could replace the ``data.train_files`` ,\ ``data.val_files``, ``actor_rollout_ref.model.path`` and ``critic.model.path`` based on their environment. -- See :doc:`config` for detailed explaination of each config field. +- See :doc:`examples/config` for detailed explaination of each config field. **Reward Model/Function** diff --git a/docs/workers/fsdp_workers.rst b/docs/workers/fsdp_workers.rst index e54860f0..fddbcef7 100644 --- a/docs/workers/fsdp_workers.rst +++ b/docs/workers/fsdp_workers.rst @@ -1,5 +1,5 @@ PyTorch FSDP Backend -============ +====================== We support PyTorch FSDP Backend by implementing various workers for actor, critic, reference, rollout and reward models. We also implement @@ -28,7 +28,7 @@ Due to the simplicity, we recommend using FSDP backend for algorithm research and prototyping. FSDP Workers ------------- +-------------- ActorRolloutRefWorker ^^^^^^^^^^^^^^^^^^^^^ @@ -139,4 +139,4 @@ HybridShard We didn’t support FSDP `HybridShard`. To support this, we may need to construct a 2D device mesh and test the corresponding -``dtensor_weight_loader`` and ``hf_weight_loader`` for each model. 
\ No newline at end of file +``dtensor_weight_loader`` and ``hf_weight_loader`` for each model. diff --git a/docs/workers/megatron_workers.rst b/docs/workers/megatron_workers.rst index d6f88c3e..8cdccd99 100644 --- a/docs/workers/megatron_workers.rst +++ b/docs/workers/megatron_workers.rst @@ -1,5 +1,5 @@ Megatron-LM Backend -================ +===================== We support Megatron Backend by implementing various workers for actor, critic, reference, rollout and reward models. We also implement the From b69af033f13237e1877afd9d5e5ff71d056727ea Mon Sep 17 00:00:00 2001 From: "haibin.lin" Date: Thu, 19 Dec 2024 04:31:31 +0800 Subject: [PATCH 2/7] update docs --- README.md | 26 +--- docs/advance/dpo_extension.rst | 6 +- docs/advance/fsdp_extension.rst | 2 +- docs/advance/megatron_extension.rst | 2 +- docs/examples/config.rst | 34 ++--- docs/examples/gsm8k_example.rst | 6 +- docs/examples/ppo_code_architecture.rst | 10 +- docs/experiment/ppo.rst | 32 ++-- docs/index.rst | 26 ++-- docs/preparation/prepare_data.rst | 2 +- docs/start/install.rst | 2 +- docs/start/quickstart.rst | 185 ++++++++++-------------- docs/workers/fsdp_workers.rst | 2 +- docs/workers/ray_trainer.rst | 4 +- 14 files changed, 149 insertions(+), 190 deletions(-) diff --git a/README.md b/README.md index ab07a2b3..5df1912a 100644 --- a/README.md +++ b/README.md @@ -1,16 +1,16 @@

veRL: Volcano Engine Reinforcement Learning for LLM

-veRL (HybridFlow) is a flexible, efficient and industrial-level RL(HF) training framework designed for large language models (LLMs). veRL is the open-source version of [HybridFlow](https://arxiv.org/abs/2409.19256v2) paper. +veRL is a flexible, efficient and production-ready RL training framework designed for large language models (LLMs). veRL is the open-source version of [HybridFlow](https://arxiv.org/abs/2409.19256v2) paper. veRL is flexible and easy to use with: -- **Easy to support diverse RL(HF) algorithms**: The Hybrid programming model combines the strengths of single-controller and multi-controller paradigms to enable flexible representation and efficient execution of complex Post-Training dataflows. Allowing users to build RL dataflows in a few lines of code. +- **Easy extension of diverse RL algorithms**: The Hybrid programming model combines the strengths of single-controller and multi-controller paradigms to enable flexible representation and efficient execution of complex Post-Training dataflows. Allowing users to build RL dataflows in a few lines of code. -- **Seamless integration of existing LLM infra with modular API design**: Decouples computation and data dependencies, enabling seamless integration with existing LLM frameworks, such as PyTorch FSDP, Megatron-LM and vLLM. Moreover, users can easily extend to other LLM training and inference frameworks. +- **Seamless integration of existing LLM infra with modular APIs**: Decouples computation and data dependencies, enabling seamless integration with existing LLM frameworks, such as PyTorch FSDP, Megatron-LM and vLLM. Moreover, users can easily extend to other LLM training and inference frameworks. - **Flexible device mapping**: Supports various placement of models onto different sets of GPUs for efficient resource utilization and scalability across different cluster sizes. -- Readily integration with popular Hugging Face models +- Readily integration with popular HuggingFace models veRL is fast with: @@ -172,24 +172,6 @@ Visit our [documentation](https://verl.readthedocs.io/en/latest/index.html) to l - [Add models to Megatron-LM backend](https://verl.readthedocs.io/en/latest/advance/megatron_extension.html) -## Community and Contribution - -### Communication channel - -[Join us](https://join.slack.com/t/verlgroup/shared_invite/zt-2w5p9o4c3-yy0x2Q56s_VlGLsJ93A6vA) for discussions on slack! - -### Code formatting -We use yapf (Google style) to enforce strict code formatting when reviewing MRs. To reformat you code locally, make sure you installed `yapf` -```bash -pip3 install yapf -``` -Then, make sure you are at top level of verl repo and run -```bash -yapf -ir -vv --style ./.style.yapf verl examples -``` - - - ## Citation ```tex diff --git a/docs/advance/dpo_extension.rst b/docs/advance/dpo_extension.rst index 592d9710..bc8f08ad 100644 --- a/docs/advance/dpo_extension.rst +++ b/docs/advance/dpo_extension.rst @@ -66,7 +66,7 @@ Here, ``SampleGenerator`` can be viewed as a multi-process pulled up by the control flow to call. The implementation details inside can use any inference engine including vllm, sglang and huggingface. Users can largely reuse the code in -verl/verl/trainer/ppo/rollout/vllm_rollout/vllm_rollout.py and we won’t +verl/verl/trainer/ppo/rollout/vllm_rollout/vllm_rollout.py and we won't go into details here. **ReferencePolicy inference** @@ -179,7 +179,7 @@ steps: Frequently calling these 3 steps on the controller process greatly hurts code readability. 
**In veRL, we have abstracted and encapsulated these 3 -steps, so that the worker’s method + dispatch + collect can be +steps, so that the worker's method + dispatch + collect can be registered into the worker_group** .. code:: python @@ -230,7 +230,7 @@ Here it requires the data interface to be ``DataProto``. Definition of Step 3: Main training loop ~~~~~~~~~~~~~~~~~~~~~~~~~~ -With the above training flows, we can implement the algorithm’s control +With the above training flows, we can implement the algorithm's control flow. It is recommended that ``main_task`` is also a ray remote process. .. code:: python diff --git a/docs/advance/fsdp_extension.rst b/docs/advance/fsdp_extension.rst index 11e9d8a1..bb3da95a 100644 --- a/docs/advance/fsdp_extension.rst +++ b/docs/advance/fsdp_extension.rst @@ -28,7 +28,7 @@ loader for the models below in `dtensor_weight_loader.py `_. 3. Weight loader that synchronize the weight from Megatron to rollout (vLLM) model. Note that both the actor model and rollout model are - partitioned during runtime. So, it’s advisable to map the model name + partitioned during runtime. So, it's advisable to map the model name in actor model implementation. Otherwise, you may need an additional name mapping and even weight transformation. diff --git a/docs/examples/config.rst b/docs/examples/config.rst index d7d8fa43..24afd880 100644 --- a/docs/examples/config.rst +++ b/docs/examples/config.rst @@ -22,14 +22,14 @@ Data return_raw_chat: False - ``data.train_files``: Training set parquet. Can be a list or a single - file. The program will read all files into memory, so it can’t be too + file. The program will read all files into memory, so it can't be too large (< 100GB). The path can be either local path or HDFS path. For HDFS path, we provide utils to download it to DRAM and convert the HDFS path to local path. - ``data.val_files``: Validation parquet. Can be a list or a single file. - ``data.prompt_key``: The field in the dataset where the prompt is - located. Default is ‘prompt’. + located. Default is 'prompt'. - ``data.max_prompt_length``: Maximum prompt length. All prompts will be left-padded to this length. An error will be reported if the length is too long @@ -41,13 +41,13 @@ Data iteration. - ``data.return_raw_input_ids``: Whether to return the original input_ids without adding chat template. This is mainly used to - accommodate situations where the reward model’s chat template differs - from the policy. It needs to be decoded first, then apply the RM’s + accommodate situations where the reward model's chat template differs + from the policy. It needs to be decoded first, then apply the RM's chat template. If using a model-based RM, and the policy and RM chat_templates are different, this flag needs to be set - ``data.return_raw_chat``: - ``data.truncation``: Truncate the input_ids or prompt length if they - exceed max_prompt_length. Default is ‘error’, not allow exceed the + exceed max_prompt_length. Default is 'error', not allow exceed the max_prompt_length. The users should increase the max_prompt_length if throwing the error. @@ -114,7 +114,7 @@ Actor/Rollout/Reference Policy **Common config for actor, rollout and reference model** -- ``actor_rollout_ref.hybrid_engine``: Whether it’s a hybrid engine, +- ``actor_rollout_ref.hybrid_engine``: Whether it's a hybrid engine, currently only supports hybrid engine - ``actor_rollout_ref.model.path``: Huggingface model path. This can be either local path or HDFS path. 
For HDFS path, we provide utils to @@ -123,7 +123,7 @@ Actor/Rollout/Reference Policy that need to be imported. Used to register models or tokenizers into the Huggingface system. - ``actor_rollout_ref.model.override_config``: Used to override some of - the model’s original configurations, mainly dropout + the model's original configurations, mainly dropout - ``actor_rollout_ref.model.enable_gradient_checkpointing``: Whether to enable gradient checkpointing for the actor @@ -154,12 +154,12 @@ Actor/Rollout/Reference Policy - ``actor_rollout_ref.actor.shuffle``: Whether to shuffle data when there are multiple epochs -- ``actor_rollout_ref.actor.optim``: Actor’s optimizer parameters +- ``actor_rollout_ref.actor.optim``: Actor's optimizer parameters - ``actor_rollout_ref.actor.fsdp_config``: FSDP config for actor training - - ``wrap_policy``: FSDP wrap policy. By default, it uses Huggingface’s + - ``wrap_policy``: FSDP wrap policy. By default, it uses Huggingface's wrap policy, i.e., wrapping by DecoderLayer - No need to set transformer_layer_cls_to_wrap, so we comment it. @@ -172,7 +172,7 @@ Actor/Rollout/Reference Policy **Reference Model** - ``actor_rollout_ref.ref``: FSDP config same as actor. **For models - larger than 7B, it’s recommended to turn on offload for ref by + larger than 7B, it's recommended to turn on offload for ref by default** - ``actor_rollout_ref.ref.log_prob_micro_batch_size``: The batch size for one forward pass in the computation of ``ref_log_prob``. @@ -180,11 +180,11 @@ Actor/Rollout/Reference Policy **Rollout Model** - ``actor_rollout_ref.rollout.name``: hf/vllm. We use vLLM by default - because it’s much efficient and our hybrid engine is implemented with + because it's much efficient and our hybrid engine is implemented with vLLM. - Rollout (Auto-regressive) parameters. The key should be equal to the - property name in vLLM’s ``SamplingParams``. + property name in vLLM's ``SamplingParams``. - ``temperature``, ``top_k``, ``top_p`` and others: Sampling parameters in ``SamplingParams``. @@ -224,7 +224,7 @@ Actor/Rollout/Reference Policy - ``megatron``: Use Megatron weight loader. Deployed with Megatron backend. The input model ``state_dict()`` is already partitioned along TP dimension and already gathered along PP dimension. This - weight loader requires that the Rollout model and Actor model’s + weight loader requires that the Rollout model and Actor model's parameters shape and name should be identical. - ``dtensor``: Default solution when using Huggingface weight loader. Deployed with FSDP backend and the state_dict_type is @@ -232,7 +232,7 @@ Actor/Rollout/Reference Policy loader - ``hf``: Use Huggingface weight loader. Deployed with FSDP backend and the state_dict_type is ``StateDictType.FULL_STATE_DICT``. This - solution doesn’t need to rewrite the weight loader for each model + solution doesn't need to rewrite the weight loader for each model implemented in vLLM but it results in larger peak memory usage. - ``dummy_hf``, ``dummy_megatron``, ``dummy_dtensor``: Random initialization. @@ -268,11 +268,11 @@ Reward Model responses. If False, the following parameters are not effective. - ``reward_model.model`` - - ``input_tokenizer``: Input tokenizer. If the reward model’s chat + - ``input_tokenizer``: Input tokenizer. If the reward model's chat template is inconsistent with the policy, we need to first decode to - plaintext, then apply the rm’s chat_template. Then score with RM. If + plaintext, then apply the rm's chat_template. Then score with RM. 
If chat_templates are consistent, it can be set to null. - - ``path``: RM’s HDFS path or local path. Note that RM only supports + - ``path``: RM's HDFS path or local path. Note that RM only supports AutoModelForSequenceClassification. Other model types need to define their own RewardModelWorker and pass it from the code. diff --git a/docs/examples/gsm8k_example.rst b/docs/examples/gsm8k_example.rst index 0d3c1f8f..de694cfd 100644 --- a/docs/examples/gsm8k_example.rst +++ b/docs/examples/gsm8k_example.rst @@ -49,7 +49,7 @@ Step 1: Prepare dataset Step 2: Download Model ---------------------- -There’re three ways to prepare the model checkpoints for post-training: +There're three ways to prepare the model checkpoints for post-training: - Download the required models from hugging face @@ -96,7 +96,7 @@ We also provide various training scripts for SFT on GSM8K dataset in `gsm8k sft Step 4: Perform PPO training with your model on GSM8K Dataset ------------------------------------------------------------- -- Prepare your own run.sh script. Here’s an example for GSM8k dataset +- Prepare your own run.sh script. Here's an example for GSM8k dataset and deepseek-llm-7b-chat model. - Users could replace the ``data.train_files`` ,\ ``data.val_files``, ``actor_rollout_ref.model.path`` and ``critic.model.path`` based on @@ -107,7 +107,7 @@ Step 4: Perform PPO training with your model on GSM8K Dataset We use a rule-based reward model. We force the model to produce a final answer following 4 “#” as shown in the solution. We extract the final -answer from both the solution and model’s output using regular +answer from both the solution and model's output using regular expression matching. We compare them and assign a reward of 1 to correct answer, 0.1 to incorrect answer and 0 to no answer. diff --git a/docs/examples/ppo_code_architecture.rst b/docs/examples/ppo_code_architecture.rst index ab1f66a6..1cca8301 100644 --- a/docs/examples/ppo_code_architecture.rst +++ b/docs/examples/ppo_code_architecture.rst @@ -1,7 +1,7 @@ PPO Example Architecture ======================== -Let’s start with the Proximal Policy Optimization algorithm, which is +Let's start with the Proximal Policy Optimization algorithm, which is most widely used algorithm in LLM post-training. The main entry point of the PPO algorithm example is: @@ -151,18 +151,18 @@ Defining reward model/function resource_pool_manager = ResourcePoolManager(resource_pool_spec=resource_pool_spec, mapping=mapping) Since not all tasks use model-based RM, users need to define here -whether it’s a model-based RM or a function-based RM +whether it's a model-based RM or a function-based RM -- If it’s a model-based RM, directly add the ``RewardModel`` role in the +- If it's a model-based RM, directly add the ``RewardModel`` role in the resource mapping and add it to the resource pool mapping. - Note that the pre-defined ``RewardModelWorker`` only supports models with the structure of huggingface - ``AutoModelForSequenceClassification``. If it’s not this model, you + ``AutoModelForSequenceClassification``. If it's not this model, you need to define your own RewardModelWorker in `FSDP Workers `_ and `Megatron-LM Workers `_. -- If it’s a function-based RM, the users are required to classified the +- If it's a function-based RM, the users are required to classified the reward function for each datasets. .. 
code:: python diff --git a/docs/experiment/ppo.rst b/docs/experiment/ppo.rst index 8dcce1c9..332a3ee2 100644 --- a/docs/experiment/ppo.rst +++ b/docs/experiment/ppo.rst @@ -8,17 +8,23 @@ Assuming GSM8k dataset is preprocess via ``python3 examples/data_preprocess/gsm8 Refer to the table below to reproduce PPO training from different pre-trained models. -+----------------------------+------------------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -| Model | Method | Test score | Details | -+============================+========================+============+=====================+=========================================================================================================================================================================================================+ -| google/gemma-2-2b-it | pretrained checkpoint | 23.9 | `Huggingface `_ | -+----------------------------+------------------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -| google/gemma-2-2b-it | SFT | 52.06 | `Command and logs `_ | -+----------------------------+------------------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -| google/gemma-2-2b-it | SFT + PPO | 64.02 | `Command and logs `_, `wandb `_ | -+----------------------------+------------------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -| Qwen/Qwen2.5-0.5B-Instruct | pretrained checkpoint | 36.4 | `Qwen blog `_ | -+----------------------------+------------------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ -| Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | `Command and logs `_ | -+----------------------------+------------------------+------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ +.. _hf_gemma: https://huggingface.co/google/gemma-2-2b-it#benchmark-results +.. _verl_data_gemma_sft: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-sft-0.411.log +.. _verl_data_gemma_sft_ppo: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/gemma-2-2b-it-ppo-bsz512_4-prompt1024-resp-512-0.640.log +.. _verl_data_gemma_sft_ppo_wandb: https://api.wandb.ai/links/verl-team/h7ux8602 +.. _qwen_blog: https://qwenlm.github.io/blog/qwen2.5-llm/ +.. 
_verl_data_qwen25_05_ppo: https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/Qwen2.5-0.5B-bsz256_2-prompt1024-resp512-0.567.log ++----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+ +| Model | Method | Test score | Details | ++============================+========================+============+=====================+=========================================================================+ +| google/gemma-2-2b-it | pretrained checkpoint | 23.9 | `Huggingface `_ | ++----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+ +| google/gemma-2-2b-it | SFT | 52.06 | `Command and logs `_ | ++----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+ +| google/gemma-2-2b-it | SFT + PPO | 64.02 | `Command and logs `_, `wandb `_ | ++----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+ +| Qwen/Qwen2.5-0.5B-Instruct | pretrained checkpoint | 36.4 | `Qwen blog `_ | ++----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+ +| Qwen/Qwen2.5-0.5B-Instruct | PPO | 56.7 | `Command and logs `_ | ++----------------------------+------------------------+------------+-----------------------------------------------------------------------------------------------+ \ No newline at end of file diff --git a/docs/index.rst b/docs/index.rst index 80e59783..ce72cd69 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -1,17 +1,19 @@ -Welcome to veRL/HybridFlow's documentation! +Welcome to veRL's documentation! ================================================ -veRL (HybridFlow) is a flexible, efficient and industrial-level RL(HF) training framework designed for large language models (LLMs) Post-Training. +.. _hf_arxiv: https://arxiv.org/pdf/2409.19256 + +veRL is a flexible, efficient and production-ready RL training framework designed for large language models (LLMs) post-training. It is an open source implementation of the `HybridFlow `_ paper. veRL is flexible and easy to use with: -- **Easy to support diverse RL(HF) algorithms**: The Hybrid programming model combines the strengths of single-controller and multi-controller paradigms to enable flexible representation and efficient execution of complex Post-Training dataflows. Allowing users to build RL dataflows in a few lines of code. +- **Easy extension of diverse RL algorithms**: The Hybrid programming model combines the strengths of single-controller and multi-controller paradigms to enable flexible representation and efficient execution of complex Post-Training dataflows. Allowing users to build RL dataflows in a few lines of code. -- **Seamless integration of existing LLM infra with modular API design**: Decouples computation and data dependencies, enabling seamless integration with existing LLM frameworks, such as PyTorch FSDP, Megatron-LM and vLLM. Moreover, users can easily extend to other LLM training and inference frameworks. 
+- **Seamless integration of existing LLM infra with modular APIs**: Decouples computation and data dependencies, enabling seamless integration with existing LLM frameworks, such as PyTorch FSDP, Megatron-LM and vLLM. Moreover, users can easily extend to other LLM training and inference frameworks. -- **Flexible device mapping**: Supports various placement of models onto different sets of GPUs for efficient resource utilization and scalability across different cluster sizes. +- **Flexible device mapping and parallelism**: Supports various placement of models onto different sets of GPUs for efficient resource utilization and scalability across different cluster sizes. -- Readily integration with popular Hugging Face models +- Readily integration with popular HuggingFace models veRL is fast with: @@ -81,7 +83,13 @@ Contribution veRL is free software; you can redistribute it and/or modify it under the terms of the Apache License 2.0. We welcome contributions. -Join us on `GitHub `_ . +Join us on `GitHub `_ and `Slack `_ for discussions. + +Code formatting +^^^^^^^^^^^^^^^^^^^^^^^^ +We use yapf (Google style) to enforce strict code formatting when reviewing MRs. Run yapf at the top level of verl repo: + +.. bash:: -.. and check out our -.. :doc:`contribution guidelines `. + pip3 install yapf + yapf -ir -vv --style ./.style.yapf verl examples tests diff --git a/docs/preparation/prepare_data.rst b/docs/preparation/prepare_data.rst index 58de561c..40c4f644 100644 --- a/docs/preparation/prepare_data.rst +++ b/docs/preparation/prepare_data.rst @@ -10,7 +10,7 @@ to follow the following steps: The data preprocess script can be divided into two parts: 1. The first part is the common part, which loads the dataset from - huggingface’s ``datasets`` package. Then preprocess the datasets with + huggingface's ``datasets`` package. Then preprocess the datasets with the ``make_map_fn`` and then store in the parquet format. .. code:: python diff --git a/docs/start/install.rst b/docs/start/install.rst index 9a932e95..e18c396b 100644 --- a/docs/start/install.rst +++ b/docs/start/install.rst @@ -55,7 +55,7 @@ For users who pursue better scalability, we recommend using Megatron-LM backend. Please install the above dependencies first. Currently, we support Megatron-LM\@core_v0.4.0 and we fix some internal -issues of Megatron-LM. Here’s the additional installation guide. +issues of Megatron-LM. Here's the additional installation guide. The pros, cons and extension guide for using Megatron-LM backend can be found in :doc:`Megatron-LM Workers<../workers/megatron_workers>`. diff --git a/docs/start/quickstart.rst b/docs/start/quickstart.rst index 07e60fbe..5c67cffc 100644 --- a/docs/start/quickstart.rst +++ b/docs/start/quickstart.rst @@ -1,7 +1,7 @@ .. _quickstart: ========================================================= -Quickstart: Fintune a LLM using PPO with GSM8K dataset +Quickstart: Post-train a LLM using PPO with GSM8K dataset ========================================================= Post-train a LLM using GSM8K dataset @@ -10,26 +10,22 @@ Post-train a LLM using GSM8K dataset Introduction ------------ -In this example, we train an LLM to tackle the GSM8k task. +.. _hf_dataset_gsm8k: https://huggingface.co/datasets/gsm8k -Paper: https://arxiv.org/pdf/2110.14168 +In this example, we train an LLM to tackle the `GSM8k `_ task with function-based rewards[1]_. 
-Dataset: https://huggingface.co/datasets/gsm8k +Prerequisite: + +- the latest version of ``verl`` and its dependencies installed following the installation guide. Using the docker image is recommended. + +- an GPU with at least 32 GB memory -Note that the original paper mainly focuses on training a verifier (a -reward model) to solve math problems via Best-of-N sampling. In this -example, we train an RLHF agent using a rule-based reward model. Dataset Introduction -------------------- GSM8k is a math problem dataset. The prompt is an elementary school -problem. The LLM model is required to answer the math problem. - -The training set contains 7473 samples and the test set contains 1319 -samples. - -**An example** +problem. The LLM model is asked to solve the math problem. Below is an example: Prompt @@ -44,129 +40,96 @@ Solution number of teaspoons she used is 7/20, she used 7/20\ *120 = <<7/20*\ 120=42>>42 #### 42 -Step 1: Prepare dataset ------------------------ +Step 1: Prepare the dataset +---------------------------- + +We preprocess the dataset in parquet format so that (1) it contains necessary fields for computing RL rewards and (2) is faster to read. .. code:: bash - cd examples/data_preprocess - python3 gsm8k.py --local_dir ~/data/gsm8k + python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k -Step 2: Download Model ----------------------- +Step 2: Download a model for post-training +------------------------------------------- -There’re three ways to prepare the model checkpoints for post-training: +Usually we recommend starting with an "instruct" model variant so that the model follows instructions. In this example, we start with the ``Qwen2.5-0.5B-Instruct`` model. -- Download the required models from huggingface +If you start from a "base" model variant, doing SFT before RL is recommended. Refer to the `sft directory `_ and `SFT Trainer `_ for further details. .. code:: bash - huggingface-cli download deepseek-ai/deepseek-math-7b-instruct --local-dir ~/models/deepseek-math-7b-instruct --local-dir-use-symlinks False + python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-0.5B-Instruct')" -- Already store your store model in the local directory or HDFS path. -- Also, you can directly use the model name in huggingface (e.g., - deepseek-ai/deepseek-math-7b-instruct) in - ``actor_rollout_ref.model.path`` and ``critic.model.path`` field in - the run script. +Step 3: Perform PPO training with the instruct model +---------------------------------------------------------------------- -Noted that users should prepare checkpoints for actor, critic and reward -model. +**Reward Model/Function** -[Optional] Step 3: SFT your Model ---------------------------------- +We use a pre-defined rule-based reward model. We force the model to produce a final +answer following 4 “#” as shown in the solution. We extract the final +answer from both the solution and model's output using regular +expression matching. We assign a reward of 1 to correct +answer, 0.1 to incorrect answer and 0 to no answer. -We provide a SFT Trainer using PyTorch FSDP in -`fsdp_sft_trainer.py `_. -Users can customize their own SFT -script using our FSDP SFT Trainer. +For mode details, please refer to `verl/utils/reward_score/gsm8k.py `_. -We also provide various training scripts for SFT on GSM8K dataset in `gsm8k sft directory `_. +**Training Script** -.. code:: shell +Now let's run PPO training with the dataset and model above[2]_. - set -x +.. 
code:: bash - torchrun -m verl.trainer.fsdp_sft_trainer \ - data.train_files=$HOME/data/gsm8k/train.parquet \ - data.val_files=$HOME/data/gsm8k/test.parquet \ - data.prompt_key=question \ - data.response_key=answer \ - data.micro_batch_size=8 \ - model.partial_pretrain=deepseek-ai/deepseek-coder-6.7b-instruct \ - trainer.default_hdfs_dir=hdfs://user/verl/experiments/gsm8k/deepseek-coder-6.7b-instruct/ \ - trainer.project_name=gsm8k-sft \ - trainer.experiment_name=gsm8k-sft-deepseek-coder-6.7b-instruct \ - trainer.total_epochs=4 \ - trainer.logger=['console','wandb'] + bash examples/ppo_trainer/run_deepseek7b_llm.sh -Step 4: Perform PPO training with your model on GSM8K Dataset -------------------------------------------------------------- +The script of `run_deepseek7b_llm.sh` -- Prepare your own run.sh script. Here’s an example for GSM8k dataset +- Prepare your own run.sh script. Here's an example for GSM8k dataset and deepseek-llm-7b-chat model. - Users could replace the ``data.train_files`` ,\ ``data.val_files``, ``actor_rollout_ref.model.path`` and ``critic.model.path`` based on their environment. - See :doc:`examples/config` for detailed explaination of each config field. -**Reward Model/Function** - -We use a rule-based reward model. We force the model to produce a final -answer following 4 “#” as shown in the solution. We extract the final -answer from both the solution and model’s output using regular -expression matching. We compare them and assign a reward of 1 to correct -answer, 0.1 to incorrect answer and 0 to no answer. - -**Training Script** - -The training script example for FSDP and Megatron-LM backend are stored in -`examples/ppo_trainer `_ directory. - .. code:: bash - cd ../ppo_trainer - bash run_deepseek7b_llm.sh - -The script of `run_deepseek7b_llm.sh` - -.. 
code:: bash - - set -x - python3 -m verl.trainer.main_ppo \ - data.train_files=~/data/rlhf/gsm8k/train.parquet \ - data.val_files=~/data/rlhf/gsm8k/test.parquet \ - data.train_batch_size=1024 \ - data.val_batch_size=1312 \ - data.max_prompt_length=512 \ - data.max_response_length=512 \ - actor_rollout_ref.model.path=~/models/deepseek-llm-7b-chat \ - actor_rollout_ref.actor.optim.lr=1e-6 \ - actor_rollout_ref.actor.ppo_mini_batch_size=256 \ - actor_rollout_ref.actor.ppo_micro_batch_size=64 \ - actor_rollout_ref.actor.fsdp_config.param_offload=False \ - actor_rollout_ref.actor.fsdp_config.grad_offload=False \ - actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \ - actor_rollout_ref.rollout.micro_batch_size=256 \ - actor_rollout_ref.rollout.log_prob_micro_batch_size=128 \ - actor_rollout_ref.rollout.tensor_model_parallel_size=2 \ - actor_rollout_ref.rollout.name=vllm \ - actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \ - actor_rollout_ref.ref.log_prob_micro_batch_size=128 \ - actor_rollout_ref.ref.fsdp_config.param_offload=True \ - critic.optim.lr=1e-5 \ - critic.model.path=~/models/deepseek-llm-7b-chat \ - critic.model.enable_gradient_checkpointing=False \ - critic.ppo_micro_batch_size=64 \ - critic.model.fsdp_config.param_offload=False \ - critic.model.fsdp_config.grad_offload=False \ - critic.model.fsdp_config.optimizer_offload=False \ - algorithm.kl_ctrl.kl_coef=0.001 \ - trainer.critic_warmup=0 \ - trainer.logger=['console','wandb'] \ - trainer.project_name='verl_example_gsm8k' \ - trainer.experiment_name='deepseek_llm_7b_function_rm' \ - trainer.n_gpus_per_node=8 \ - trainer.nnodes=1 \ - trainer.save_freq=-1 \ - trainer.total_epochs=15 + data.train_files=$HOME/data/gsm8k/train.parquet \ + data.val_files=$HOME/data/gsm8k/test.parquet \ + data.train_batch_size=512 \ + data.val_batch_size=1312 \ + data.max_prompt_length=512 \ + data.max_response_length=256 \ + actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \ + actor_rollout_ref.actor.optim.lr=1e-6 \ + actor_rollout_ref.actor.ppo_mini_batch_size=128 \ + actor_rollout_ref.actor.ppo_micro_batch_size=1 \ + actor_rollout_ref.rollout.log_prob_micro_batch_size=1 \ + actor_rollout_ref.rollout.tensor_model_parallel_size=1 \ + actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \ + actor_rollout_ref.ref.log_prob_micro_batch_size=4 \ + actor_rollout_ref.ref.fsdp_config.param_offload=False \ + critic.optim.lr=1e-5 \ + critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \ + critic.ppo_micro_batch_size=1 \ + algorithm.kl_ctrl.kl_coef=0.001 \ + trainer.logger=['console'] \ + trainer.n_gpus_per_node=1 \ + trainer.nnodes=1 \ + trainer.save_freq=10 \ + trainer.test_freq=10 \ + trainer.total_epochs=15 $@ 2>&1 | tee verl_demo.log + + +# checkpoints/${trainer.project_name}/${trainer.experiment_name} + +# trainer.logger=['console','wandb'] +# trainer.project_name='verl_post_training' \ +# trainer.experiment_name='gsm8k_function_rm' \ + +# actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \ +# critic.model.fsdp_config.optimizer_offload=False \ + + + +.. [1] The original paper (https://arxiv.org/pdf/2110.14168) mainly focuses on training a verifier (a reward model) to solve math problems via Best-of-N sampling. In this example, we train an RL agent using a rule-based reward model. +.. [2] More training script examples for FSDP and Megatron-LM backend are stored in `examples/ppo_trainer `_ directory. 
\ No newline at end of file diff --git a/docs/workers/fsdp_workers.rst b/docs/workers/fsdp_workers.rst index fddbcef7..3ea0e748 100644 --- a/docs/workers/fsdp_workers.rst +++ b/docs/workers/fsdp_workers.rst @@ -137,6 +137,6 @@ additional initialization for the Optimizer. HybridShard ------------ -We didn’t support FSDP `HybridShard`. To support this, we may need to +We didn't support FSDP `HybridShard`. To support this, we may need to construct a 2D device mesh and test the corresponding ``dtensor_weight_loader`` and ``hf_weight_loader`` for each model. diff --git a/docs/workers/ray_trainer.rst b/docs/workers/ray_trainer.rst index 9470bb8f..3b1a9e6f 100644 --- a/docs/workers/ray_trainer.rst +++ b/docs/workers/ray_trainer.rst @@ -94,7 +94,7 @@ CUDA/distributed context in different processes. self.actor_rollout_wg = all_wg['actor_rollout'] self.actor_rollout_wg.init_model() -.. note:: For megatron backend, if we merge the ``worker_groups`` into the same processes, all the roles will utilize the same 3D parallel size. To optimize this, we may need to maintain several 3D process groups for each role in the same distributed context. If you want to use different 3D parallel size for different roles, please follow the similar architecture of the first code block to initialize each role’s ``worker_group`` +.. note:: For megatron backend, if we merge the ``worker_groups`` into the same processes, all the roles will utilize the same 3D parallel size. To optimize this, we may need to maintain several 3D process groups for each role in the same distributed context. If you want to use different 3D parallel size for different roles, please follow the similar architecture of the first code block to initialize each role's ``worker_group`` PPO Training Loop @@ -104,7 +104,7 @@ We implement the PPO training loop by calling the functions in worker_group of each role. The input and output data of each function is a ``DataProto`` object implemented in `protocol.py `_. In the training loop, trainer will dispatch/collect the data to/from different GPUs -following the transfer protocols wrapped in the workers’ functions. The +following the transfer protocols wrapped in the workers' functions. The computation of PPO micro batches is processed in ``update_actor`` and ``update_critic`` functions. From 21e7354d9b84830fd0af947f8d917c94fa66d785 Mon Sep 17 00:00:00 2001 From: "haibin.lin" Date: Thu, 19 Dec 2024 05:49:00 +0800 Subject: [PATCH 3/7] update quickstart --- docs/examples/config.rst | 2 ++ docs/experiment/ppo.rst | 2 ++ docs/start/quickstart.rst | 59 +++++++++++++++++++-------------- verl/trainer/ppo/ray_trainer.py | 2 +- 4 files changed, 40 insertions(+), 25 deletions(-) diff --git a/docs/examples/config.rst b/docs/examples/config.rst index 24afd880..3fc1906b 100644 --- a/docs/examples/config.rst +++ b/docs/examples/config.rst @@ -1,3 +1,5 @@ +.. _config-explain-page: + Config Explaination =================== diff --git a/docs/experiment/ppo.rst b/docs/experiment/ppo.rst index 332a3ee2..8141db1b 100644 --- a/docs/experiment/ppo.rst +++ b/docs/experiment/ppo.rst @@ -1,3 +1,5 @@ +.. _algo-baseline-page: + Algorithm Baselines =================== diff --git a/docs/start/quickstart.rst b/docs/start/quickstart.rst index 5c67cffc..eb7cb935 100644 --- a/docs/start/quickstart.rst +++ b/docs/start/quickstart.rst @@ -77,58 +77,69 @@ For mode details, please refer to `verl/utils/reward_score/gsm8k.py &1 | tee verl_demo.log +You are expected to see the following logs, indicating training in progress: + +.. 
code:: + step:0 - timing/gen:21.470 - timing/ref:4.360 - timing/values:5.800 - critic/kl:0.000 - critic/kl_coeff:0.001 - timing/adv:0.109 - timing/update_critic:15.664 - critic/vf_loss:14.947 - critic/vf_clipfrac:0.000 - critic/vpred_mean:-2.056 - critic/grad_norm:1023.278 - critic/lr(1e-4):0.100 - timing/update_actor:20.314 - actor/entropy_loss:0.433 - actor/pg_loss:-0.005 - actor/pg_clipfrac:0.000 - actor/ppo_kl:0.000 - actor/grad_norm:1.992 - actor/lr(1e-4):0.010 - critic/score/mean:0.004 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.004 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.000 - critic/advantages/max:2.360 - critic/advantages/min:-2.280 - critic/returns/mean:0.003 - critic/returns/max:0.000 - critic/returns/min:0.000 - critic/values/mean:-2.045 - critic/values/max:9.500 - critic/values/min:-14.000 - response_length/mean:239.133 - response_length/max:256.000 - response_length/min:77.000 - prompt_length/mean:104.883 - prompt_length/max:175.000 - prompt_length/min:68.000 + step:1 - timing/gen:23.020 - timing/ref:4.322 - timing/values:5.953 - critic/kl:0.000 - critic/kl_coeff:0.001 - timing/adv:0.118 - timing/update_critic:15.646 - critic/vf_loss:18.472 - critic/vf_clipfrac:0.384 - critic/vpred_mean:1.038 - critic/grad_norm:942.924 - critic/lr(1e-4):0.100 - timing/update_actor:20.526 - actor/entropy_loss:0.440 - actor/pg_loss:0.000 - actor/pg_clipfrac:0.002 - actor/ppo_kl:0.000 - actor/grad_norm:2.060 - actor/lr(1e-4):0.010 - critic/score/mean:0.000 - critic/score/max:0.000 - critic/score/min:0.000 - critic/rewards/mean:0.000 - critic/rewards/max:0.000 - critic/rewards/min:0.000 - critic/advantages/mean:0.000 - critic/advantages/max:2.702 - critic/advantages/min:-2.616 - critic/returns/mean:0.000 - critic/returns/max:0.000 - critic/returns/min:0.000 - critic/values/mean:-2.280 - critic/values/max:11.000 - critic/values/min:-16.000 - response_length/mean:232.242 - response_length/max:256.000 - response_length/min:91.000 - prompt_length/mean:102.398 - prompt_length/max:185.000 - prompt_length/min:70.000 + +Checkout :ref:`algo-baseline-page` for full training and validation logs for reference. + +The checkpoint is saved at the following dir by default: + +- checkpoints/${trainer.project_name}/${trainer.experiment_name} + +To enable ``wandb`` for experiment tracking, set the following configs: + +.. code:: bash + + trainer.logger=['console','wandb'] \ + trainer.project_name=$YOUR_PROJECT_NAME \ + trainer.experiment_name=$YOUR_RUN_NAME \ + +If you encounter out of memory issues, enable the following configs would help: + +- actor_rollout_ref.actor.ppo_micro_batch_size=1 \ -# checkpoints/${trainer.project_name}/${trainer.experiment_name} +- critic.ppo_micro_batch_size=1 \ -# trainer.logger=['console','wandb'] -# trainer.project_name='verl_post_training' \ -# trainer.experiment_name='gsm8k_function_rm' \ +- actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \ -# actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \ -# critic.model.fsdp_config.optimizer_offload=False \ +- critic.model.fsdp_config.optimizer_offload=False \ +For the full set of configs, please refer to :ref:`config-explain-page` for detailed explaination and performance tuning. .. [1] The original paper (https://arxiv.org/pdf/2110.14168) mainly focuses on training a verifier (a reward model) to solve math problems via Best-of-N sampling. In this example, we train an RL agent using a rule-based reward model. 
diff --git a/verl/trainer/ppo/ray_trainer.py b/verl/trainer/ppo/ray_trainer.py index a5f8879c..c24b45ee 100644 --- a/verl/trainer/ppo/ray_trainer.py +++ b/verl/trainer/ppo/ray_trainer.py @@ -417,7 +417,7 @@ def fit(self): # perform validation before training # currently, we only support validation using the reward_function. - if self.val_reward_fn is not None: + if self.val_reward_fn is not None and self.config.trainer.get('val_before_train', True): val_metrics = self._validate() pprint(f'Initial validation metrics: {val_metrics}') logger.log(data=val_metrics, step=global_steps) From c70cb2451d6696cbaf836c253a4139f019bc3aeb Mon Sep 17 00:00:00 2001 From: Haibin Lin Date: Wed, 18 Dec 2024 13:57:53 -0800 Subject: [PATCH 4/7] fix quickstart syntax --- docs/index.rst | 2 +- docs/start/quickstart.rst | 21 ++++++++++----------- 2 files changed, 11 insertions(+), 12 deletions(-) diff --git a/docs/index.rst b/docs/index.rst index ce72cd69..756e4aee 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -89,7 +89,7 @@ Code formatting ^^^^^^^^^^^^^^^^^^^^^^^^ We use yapf (Google style) to enforce strict code formatting when reviewing MRs. Run yapf at the top level of verl repo: -.. bash:: +.. code-block:: bash pip3 install yapf yapf -ir -vv --style ./.style.yapf verl examples tests diff --git a/docs/start/quickstart.rst b/docs/start/quickstart.rst index eb7cb935..8422c470 100644 --- a/docs/start/quickstart.rst +++ b/docs/start/quickstart.rst @@ -12,7 +12,7 @@ Introduction .. _hf_dataset_gsm8k: https://huggingface.co/datasets/gsm8k -In this example, we train an LLM to tackle the `GSM8k `_ task with function-based rewards[1]_. +In this example, we train an LLM to tackle the `GSM8k `_ task with function-based rewards. [1]_ Prerequisite: @@ -45,7 +45,7 @@ Step 1: Prepare the dataset We preprocess the dataset in parquet format so that (1) it contains necessary fields for computing RL rewards and (2) is faster to read. -.. code:: bash +.. code-block:: bash python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k @@ -56,7 +56,7 @@ Usually we recommend starting with an "instruct" model variant so that the model If you start from a "base" model variant, doing SFT before RL is recommended. Refer to the `sft directory `_ and `SFT Trainer `_ for further details. -.. code:: bash +.. code-block:: bash python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-0.5B-Instruct')" @@ -75,12 +75,12 @@ For mode details, please refer to `verl/utils/reward_score/gsm8k.py `_ directory. \ No newline at end of file +.. [2] More training script examples for FSDP and Megatron-LM backend are stored in `examples/ppo_trainer `_ directory. 
From 518f63224b73dc861c6468ac1e1df13572f18638 Mon Sep 17 00:00:00 2001 From: Haibin Lin Date: Wed, 18 Dec 2024 14:06:30 -0800 Subject: [PATCH 5/7] set hdfs to null --- docs/start/quickstart.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/start/quickstart.rst b/docs/start/quickstart.rst index 8422c470..393f0ca2 100644 --- a/docs/start/quickstart.rst +++ b/docs/start/quickstart.rst @@ -93,7 +93,7 @@ Set the ``data.train_files`` ,\ ``data.val_files``, ``actor_rollout_ref.model.pa actor_rollout_ref.actor.optim.lr=1e-6 \ actor_rollout_ref.actor.ppo_mini_batch_size=64 \ actor_rollout_ref.actor.ppo_micro_batch_size=4 \ - actor_rollout_ref.rollout.log_prob_micro_batch_size=4 \ + actor_rollout_ref.rollout.log_prob_micro_batch_size=8 \ actor_rollout_ref.rollout.tensor_model_parallel_size=1 \ actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \ actor_rollout_ref.ref.log_prob_micro_batch_size=4 \ @@ -103,6 +103,7 @@ Set the ``data.train_files`` ,\ ``data.val_files``, ``actor_rollout_ref.model.pa algorithm.kl_ctrl.kl_coef=0.001 \ trainer.logger=['console'] \ +trainer.val_before_train=False \ + trainer.default_hdfs_dir=null \ trainer.n_gpus_per_node=1 \ trainer.nnodes=1 \ trainer.save_freq=10 \ From d1690d89df382a1d0deaefbe2efc6aeb87ded4bb Mon Sep 17 00:00:00 2001 From: Haibin Lin Date: Wed, 18 Dec 2024 14:31:28 -0800 Subject: [PATCH 6/7] fix hdfs path null --- verl/trainer/ppo/ray_trainer.py | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/verl/trainer/ppo/ray_trainer.py b/verl/trainer/ppo/ray_trainer.py index c24b45ee..d68b88a3 100644 --- a/verl/trainer/ppo/ray_trainer.py +++ b/verl/trainer/ppo/ray_trainer.py @@ -515,13 +515,15 @@ def fit(self): if self.config.trainer.save_freq > 0 and (global_steps + 1) % self.config.trainer.save_freq == 0: actor_local_path = os.path.join(self.config.trainer.default_local_dir, 'actor', f'global_step_{global_steps}') - actor_remote_path = os.path.join(self.config.trainer.default_hdfs_dir, 'actor') + actor_remote_path = None if self.config.trainer.default_hdfs_dir is None else os.path.join( + self.config.trainer.default_hdfs_dir, 'actor') self.actor_rollout_wg.save_checkpoint(actor_local_path, actor_remote_path) if self.use_critic: critic_local_path = os.path.join(self.config.trainer.default_local_dir, 'critic', f'global_step_{global_steps}') - critic_remote_path = os.path.join(self.config.trainer.default_hdfs_dir, 'critic') + critic_remote_path = None if self.config.trainer.default_hdfs_dir is None else os.path.join( + self.config.trainer.default_hdfs_dir, 'critic') self.critic_wg.save_checkpoint(critic_local_path, critic_remote_path) global_steps += 1 From 3008c44c9bc86751dd30bae34ff3a678ecfcdc96 Mon Sep 17 00:00:00 2001 From: Haibin Lin Date: Wed, 18 Dec 2024 14:51:57 -0800 Subject: [PATCH 7/7] add key metric --- docs/start/quickstart.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/start/quickstart.rst b/docs/start/quickstart.rst index 393f0ca2..87f25686 100644 --- a/docs/start/quickstart.rst +++ b/docs/start/quickstart.rst @@ -110,7 +110,7 @@ Set the ``data.train_files`` ,\ ``data.val_files``, ``actor_rollout_ref.model.pa trainer.test_freq=10 \ trainer.total_epochs=15 $@ 2>&1 | tee verl_demo.log -You are expected to see the following logs, indicating training in progress: +You are expected to see the following logs, indicating training in progress. The key metric ``val/test_score/openai/gsm8k`` is computed every ``trainer.test_freq`` steps: .. code-block:: bash
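
Note: the quickstart changes above describe the rule-based GSM8K reward only in prose. The sketch below is a minimal illustration of that rule — extract the text after the final ``####`` marker, compare it with the ground-truth answer, and return 1, 0.1, or 0 — using a hypothetical ``gsm8k_score(solution_str, ground_truth)`` helper; the actual implementation shipped in ``verl/utils/reward_score/gsm8k.py`` may differ in name, signature, and edge-case handling.

.. code-block:: python

    import re


    def gsm8k_score(solution_str: str, ground_truth: str) -> float:
        """Illustrative rule-based scorer for GSM8K-style outputs.

        The model is expected to end its response with '#### <answer>'.
        Returns 1.0 for a correct final answer, 0.1 for an incorrect
        final answer, and 0.0 when no '#### <answer>' pattern is found.
        """
        # Grab every '#### <number>' occurrence and keep the last one,
        # mirroring the regular-expression matching described in the docs.
        matches = re.findall(r"####\s*(-?[\d,\.]+)", solution_str)
        if not matches:
            return 0.0  # no parsable final answer
        predicted = matches[-1].replace(",", "").rstrip(".")
        return 1.0 if predicted == ground_truth.strip() else 0.1

A per-dataset function of this shape is what the PPO trainer's reward evaluation (for example, the ``val_reward_fn`` referenced in the ``ray_trainer.py`` hunks above) ultimately relies on when scoring generated responses.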