update docs
eric-haibin-lin committed Dec 18, 2024
1 parent 07a8f98 commit b69af03
Showing 14 changed files with 149 additions and 190 deletions.
26 changes: 4 additions & 22 deletions README.md
@@ -1,16 +1,16 @@
<h1 style="text-align: center;">veRL: Volcano Engine Reinforcement Learning for LLM</h1>

veRL (HybridFlow) is a flexible, efficient and industrial-level RL(HF) training framework designed for large language models (LLMs). veRL is the open-source version of [HybridFlow](https://arxiv.org/abs/2409.19256v2) paper.
veRL is a flexible, efficient and production-ready RL training framework designed for large language models (LLMs). veRL is the open-source version of the [HybridFlow](https://arxiv.org/abs/2409.19256v2) paper.

veRL is flexible and easy to use with:

- **Easy to support diverse RL(HF) algorithms**: The Hybrid programming model combines the strengths of single-controller and multi-controller paradigms to enable flexible representation and efficient execution of complex Post-Training dataflows. Allowing users to build RL dataflows in a few lines of code.
- **Easy extension of diverse RL algorithms**: The Hybrid programming model combines the strengths of single-controller and multi-controller paradigms to enable flexible representation and efficient execution of complex Post-Training dataflows, allowing users to build RL dataflows in a few lines of code.

- **Seamless integration of existing LLM infra with modular API design**: Decouples computation and data dependencies, enabling seamless integration with existing LLM frameworks, such as PyTorch FSDP, Megatron-LM and vLLM. Moreover, users can easily extend to other LLM training and inference frameworks.
- **Seamless integration of existing LLM infra with modular APIs**: Decouples computation and data dependencies, enabling seamless integration with existing LLM frameworks, such as PyTorch FSDP, Megatron-LM and vLLM. Moreover, users can easily extend to other LLM training and inference frameworks.

- **Flexible device mapping**: Supports various placement of models onto different sets of GPUs for efficient resource utilization and scalability across different cluster sizes.

- Readily integration with popular Hugging Face models
- Ready integration with popular HuggingFace models


veRL is fast with:
@@ -172,24 +172,6 @@ Visit our [documentation](https://verl.readthedocs.io/en/latest/index.html) to l
- [Add models to Megatron-LM backend](https://verl.readthedocs.io/en/latest/advance/megatron_extension.html)


## Community and Contribution

### Communication channel

[Join us](https://join.slack.com/t/verlgroup/shared_invite/zt-2w5p9o4c3-yy0x2Q56s_VlGLsJ93A6vA) for discussions on slack!

### Code formatting
We use yapf (Google style) to enforce strict code formatting when reviewing MRs. To reformat you code locally, make sure you installed `yapf`
```bash
pip3 install yapf
```
Then, make sure you are at top level of verl repo and run
```bash
yapf -ir -vv --style ./.style.yapf verl examples
```



## Citation

```tex
6 changes: 3 additions & 3 deletions docs/advance/dpo_extension.rst
@@ -66,7 +66,7 @@ Here, ``SampleGenerator`` can be viewed as a multi-process pulled up by
the control flow to call. The implementation details inside can use any
inference engine including vllm, sglang and huggingface. Users can
largely reuse the code in
verl/verl/trainer/ppo/rollout/vllm_rollout/vllm_rollout.py and we wont
verl/verl/trainer/ppo/rollout/vllm_rollout/vllm_rollout.py and we won't
go into details here.
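
To make the idea concrete, a ``SampleGenerator`` can be sketched as a Ray
actor wrapping a vLLM engine. This is a minimal illustration only; the class
name, constructor arguments and parallel layout are assumptions for this
guide, not verl's actual ``vllm_rollout`` implementation.

.. code:: python

   # Minimal sketch: a Ray actor that wraps vLLM for rollout generation.
   # Class and argument names are illustrative, not verl's API.
   import ray
   from vllm import LLM, SamplingParams

   @ray.remote(num_gpus=1)
   class SampleGenerator:
       def __init__(self, model_path: str):
           # Load the rollout model into a vLLM engine on this worker's GPU.
           self.llm = LLM(model=model_path)

       def generate_sequences(self, prompts, temperature=1.0, max_tokens=512):
           # Auto-regressively generate one response per prompt.
           params = SamplingParams(temperature=temperature, max_tokens=max_tokens)
           outputs = self.llm.generate(prompts, params)
           return [out.outputs[0].text for out in outputs]

   # Called from the control flow, e.g.:
   #   workers = [SampleGenerator.remote("Qwen/Qwen2-7B-Instruct") for _ in range(4)]
   #   results = ray.get([w.generate_sequences.remote(shard) for w, shard in zip(workers, shards)])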

**ReferencePolicy inference**
@@ -179,7 +179,7 @@ steps:

Frequently calling these 3 steps on the controller process greatly hurts
code readability. **In veRL, we have abstracted and encapsulated these 3
steps, so that the workers method + dispatch + collect can be
steps, so that the worker's method + dispatch + collect can be
registered into the worker_group**
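
Conceptually, registration attaches a dispatch (split) function and a collect
(merge) function to the worker method, and the worker group uses them whenever
the method is called from the controller. The sketch below shows this pattern
with generic names; it is an assumption-level illustration, not verl's actual
``register`` decorator or worker-group class.

.. code:: python

   # Generic illustration of the dispatch + worker method + collect pattern.
   def register(dispatch_fn, collect_fn):
       """Attach dispatch/collect functions to a worker method."""
       def decorator(method):
           method.dispatch_fn = dispatch_fn
           method.collect_fn = collect_fn
           return method
       return decorator

   def split_batch(batch, n):
       # Dispatch: cut the controller-side batch into n contiguous shards.
       size = (len(batch) + n - 1) // n
       return [batch[i * size:(i + 1) * size] for i in range(n)]

   def concat_results(chunks):
       # Collect: merge per-worker outputs back into a single batch.
       return [item for chunk in chunks for item in chunk]

   class SimpleWorkerGroup:
       def __init__(self, workers):
           self.workers = workers

       def call(self, method_name, batch):
           method = getattr(type(self.workers[0]), method_name)
           shards = method.dispatch_fn(batch, len(self.workers))
           outputs = [getattr(w, method_name)(shard)
                      for w, shard in zip(self.workers, shards)]
           return method.collect_fn(outputs)

A worker method would then be decorated as ``@register(split_batch, concat_results)``
and invoked from the controller through ``SimpleWorkerGroup.call``.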

.. code:: python
@@ -230,7 +230,7 @@ Here it requires the data interface to be ``DataProto``. Definition of
Step 3: Main training loop
~~~~~~~~~~~~~~~~~~~~~~~~~~

With the above training flows, we can implement the algorithms control
With the above training flows, we can implement the algorithm's control
flow. It is recommended that ``main_task`` is also a ray remote process.
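
A bare-bones shape of such a control flow is sketched below. Only
``ray.remote``/``ray.get`` are real APIs here; the worker-group handles and
method names are placeholders standing in for the components built in the
previous steps.

.. code:: python

   # Skeleton only: placeholder worker-group handles, not verl's API.
   import ray

   @ray.remote
   def main_task(config, dataloader, actor_wg, ref_wg, reward_fn):
       for _ in range(config["epochs"]):
           for prompts in dataloader:
               # 1. rollout: generate responses for the prompt batch
               responses = actor_wg.generate_sequences(prompts)
               # 2. prepare experience: reference log-probs and rewards
               ref_log_probs = ref_wg.compute_log_prob(prompts, responses)
               rewards = reward_fn(prompts, responses)
               # 3. update the policy with the collected experience
               actor_wg.update_policy(prompts, responses, ref_log_probs, rewards)

   # Launched from the driver process:
   #   ray.get(main_task.remote(config, dataloader, actor_wg, ref_wg, reward_fn))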

.. code:: python
2 changes: 1 addition & 1 deletion docs/advance/fsdp_extension.rst
@@ -28,7 +28,7 @@ loader for the models below in `dtensor_weight_loader.py <https://github.com/vol
- ``Qwen2ForCausalLM``
- ``DeepseekV2ForCausalLM``

To implement ``dtensor_weight_loader`` of a model thats supported in
To implement ``dtensor_weight_loader`` of a model that's supported in
vLLM, follow the guide of gemma model below:

1. Copy the
2 changes: 1 addition & 1 deletion docs/advance/megatron_extension.rst
@@ -20,6 +20,6 @@ To support other model, users are required to implement:
your loader to ``weight_loader_registry`` in `weight_loader_registry.py <https://github.com/volcengine/verl/blob/main/verl/models/weight_loader_registry.py>`_.
3. Weight loader that synchronize the weight from Megatron to rollout
(vLLM) model. Note that both the actor model and rollout model are
partitioned during runtime. So, its advisable to map the model name
partitioned during runtime. So, it's advisable to map the model name
in actor model implementation. Otherwise, you may need an additional
name mapping and even weight transformation.
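
The registration itself is usually just a dictionary entry keyed by the model
architecture name. The sketch below is illustrative: the loader function, the
name-mapping helper and the registry variable name are assumptions, and only
the file ``weight_loader_registry.py`` referenced above is taken from verl.

.. code:: python

   # Illustrative sketch; function and registry names are assumptions.
   def map_megatron_name_to_vllm(name):
       # Your own name mapping. Identity only works if the actor model already
       # uses vLLM-style parameter names, as recommended above.
       return name

   def megatron_llama_weight_loader(megatron_state_dict, vllm_model):
       # Copy each (TP-partitioned, PP-gathered) Megatron weight into the
       # matching vLLM parameter, reshaping/transposing where required.
       params = dict(vllm_model.named_parameters())
       for name, weight in megatron_state_dict.items():
           params[map_megatron_name_to_vllm(name)].data.copy_(weight)

   # Registered by architecture name, e.g. in weight_loader_registry.py:
   WEIGHT_LOADER_REGISTRY = {
       "LlamaForCausalLM": megatron_llama_weight_loader,
   }
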
34 changes: 17 additions & 17 deletions docs/examples/config.rst
@@ -22,14 +22,14 @@ Data
return_raw_chat: False
- ``data.train_files``: Training set parquet. Can be a list or a single
file. The program will read all files into memory, so it cant be too
file. The program will read all files into memory, so it can't be too
large (< 100GB). The path can be either local path or HDFS path. For
HDFS path, we provide utils to download it to DRAM and convert the
HDFS path to local path.
- ``data.val_files``: Validation parquet. Can be a list or a single
file.
- ``data.prompt_key``: The field in the dataset where the prompt is
located. Default is prompt.
located. Default is 'prompt'.
- ``data.max_prompt_length``: Maximum prompt length. All prompts will be
left-padded to this length. An error will be reported if the length is
too long
@@ -41,13 +41,13 @@ Data
iteration.
- ``data.return_raw_input_ids``: Whether to return the original
input_ids without adding chat template. This is mainly used to
accommodate situations where the reward models chat template differs
from the policy. It needs to be decoded first, then apply the RMs
accommodate situations where the reward model's chat template differs
from the policy. It needs to be decoded first, then apply the RM's
chat template. If using a model-based RM, and the policy and RM
chat_templates are different, this flag needs to be set
- ``data.return_raw_chat``:
- ``data.truncation``: Truncate the input_ids or prompt length if they
exceed max_prompt_length. Default is error, not allow exceed the
exceed max_prompt_length. Default is 'error', not allow exceed the
max_prompt_length. The users should increase the max_prompt_length if
throwing the error.
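
As a quick reference, the data options above can be collected into an
OmegaConf config like the one below; every value shown is a placeholder for
illustration, not a recommended default.

.. code:: python

   # Placeholder values for illustration only.
   from omegaconf import OmegaConf

   data_cfg = OmegaConf.create({
       "train_files": "~/data/gsm8k/train.parquet",  # a single file or a list
       "val_files": "~/data/gsm8k/test.parquet",
       "prompt_key": "prompt",            # dataset field holding the prompt
       "max_prompt_length": 512,          # prompts are left-padded to this length
       "return_raw_input_ids": False,     # set True if the RM chat template differs
       "return_raw_chat": False,
       "truncation": "error",             # fail loudly instead of silently truncating
   })
   print(OmegaConf.to_yaml(data_cfg))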

@@ -114,7 +114,7 @@ Actor/Rollout/Reference Policy
**Common config for actor, rollout and reference model**

- ``actor_rollout_ref.hybrid_engine``: Whether its a hybrid engine,
- ``actor_rollout_ref.hybrid_engine``: Whether it's a hybrid engine,
currently only supports hybrid engine
- ``actor_rollout_ref.model.path``: Huggingface model path. This can be
either local path or HDFS path. For HDFS path, we provide utils to
@@ -123,7 +123,7 @@ Actor/Rollout/Reference Policy
that need to be imported. Used to register models or tokenizers into
the Huggingface system.
- ``actor_rollout_ref.model.override_config``: Used to override some of
the models original configurations, mainly dropout
the model's original configurations, mainly dropout
- ``actor_rollout_ref.model.enable_gradient_checkpointing``: Whether to
enable gradient checkpointing for the actor

@@ -154,12 +154,12 @@ Actor/Rollout/Reference Policy
- ``actor_rollout_ref.actor.shuffle``: Whether to shuffle data when
there are multiple epochs

- ``actor_rollout_ref.actor.optim``: Actors optimizer parameters
- ``actor_rollout_ref.actor.optim``: Actor's optimizer parameters

- ``actor_rollout_ref.actor.fsdp_config``: FSDP config for actor
training

- ``wrap_policy``: FSDP wrap policy. By default, it uses Huggingfaces
- ``wrap_policy``: FSDP wrap policy. By default, it uses Huggingface's
wrap policy, i.e., wrapping by DecoderLayer

- No need to set transformer_layer_cls_to_wrap, so we comment it.
@@ -172,19 +172,19 @@ Actor/Rollout/Reference Policy
**Reference Model**

- ``actor_rollout_ref.ref``: FSDP config same as actor. **For models
larger than 7B, its recommended to turn on offload for ref by
larger than 7B, it's recommended to turn on offload for ref by
default**
- ``actor_rollout_ref.ref.log_prob_micro_batch_size``: The batch size
for one forward pass in the computation of ``ref_log_prob``.

**Rollout Model**

- ``actor_rollout_ref.rollout.name``: hf/vllm. We use vLLM by default
because its much efficient and our hybrid engine is implemented with
because it's much more efficient and our hybrid engine is implemented with
vLLM.

- Rollout (Auto-regressive) parameters. The key should be equal to the
property name in vLLMs ``SamplingParams``.
property name in vLLM's ``SamplingParams``.

- ``temperature``, ``top_k``, ``top_p`` and others: Sampling
parameters in ``SamplingParams``.
@@ -224,15 +224,15 @@ Actor/Rollout/Reference Policy
- ``megatron``: Use Megatron weight loader. Deployed with Megatron
backend. The input model ``state_dict()`` is already partitioned
along TP dimension and already gathered along PP dimension. This
weight loader requires that the Rollout model and Actor models
weight loader requires that the Rollout model and Actor model's
parameters shape and name should be identical.
- ``dtensor``: Default solution when using Huggingface weight loader.
Deployed with FSDP backend and the state_dict_type is
``StateDictType.SHARDED_STATE_DICT``. Recommend to use this weight
loader
- ``hf``: Use Huggingface weight loader. Deployed with FSDP backend
and the state_dict_type is ``StateDictType.FULL_STATE_DICT``. This
solution doesnt need to rewrite the weight loader for each model
solution doesn't need to rewrite the weight loader for each model
implemented in vLLM but it results in larger peak memory usage.
- ``dummy_hf``, ``dummy_megatron``, ``dummy_dtensor``: Random
initialization.
@@ -268,11 +268,11 @@ Reward Model
responses. If False, the following parameters are not effective.
- ``reward_model.model``

- ``input_tokenizer``: Input tokenizer. If the reward models chat
- ``input_tokenizer``: Input tokenizer. If the reward model's chat
template is inconsistent with the policy, we need to first decode to
plaintext, then apply the rms chat_template. Then score with RM. If
plaintext, then apply the rm's chat_template. Then score with RM. If
chat_templates are consistent, it can be set to null.
- ``path``: RMs HDFS path or local path. Note that RM only supports
- ``path``: RM's HDFS path or local path. Note that RM only supports
AutoModelForSequenceClassification. Other model types need to define
their own RewardModelWorker and pass it from the code.
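
To make the tokenizer note above concrete, the sketch below scores one decoded
(prompt, response) pair with a sequence-classification reward model whose
input formatting differs from the policy's. The checkpoint name is a
placeholder, and this is not the ``RewardModelWorker`` implementation itself.

.. code:: python

   # Sketch: scoring a decoded sample with a sequence-classification RM.
   import torch
   from transformers import AutoModelForSequenceClassification, AutoTokenizer

   rm_path = "OpenAssistant/reward-model-deberta-v3-large-v2"  # placeholder RM
   rm_tokenizer = AutoTokenizer.from_pretrained(rm_path)
   rm = AutoModelForSequenceClassification.from_pretrained(rm_path).eval()

   def score(prompt_text, response_text):
       # The rollout is first decoded to plaintext, then re-encoded with the
       # RM's own tokenizer before scoring.
       inputs = rm_tokenizer(prompt_text, response_text,
                             return_tensors="pt", truncation=True)
       with torch.no_grad():
           return rm(**inputs).logits[0, 0].item()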

6 changes: 3 additions & 3 deletions docs/examples/gsm8k_example.rst
@@ -49,7 +49,7 @@ Step 1: Prepare dataset
Step 2: Download Model
----------------------

Therere three ways to prepare the model checkpoints for post-training:
There're three ways to prepare the model checkpoints for post-training:

- Download the required models from hugging face
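
For the download route, one option (assuming direct access to the Hugging Face
Hub) is ``huggingface_hub.snapshot_download``; the repository id below matches
the model used in the later steps, and the target directory is a placeholder.

.. code:: python

   # Fetch the checkpoint locally; requires `pip install huggingface_hub`.
   import os
   from huggingface_hub import snapshot_download

   local_dir = snapshot_download(
       repo_id="deepseek-ai/deepseek-llm-7b-chat",
       local_dir=os.path.expanduser("~/models/deepseek-llm-7b-chat"),
   )
   print(local_dir)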

@@ -96,7 +96,7 @@ We also provide various training scripts for SFT on GSM8K dataset in `gsm8k sft
Step 4: Perform PPO training with your model on GSM8K Dataset
-------------------------------------------------------------

- Prepare your own run.sh script. Heres an example for GSM8k dataset
- Prepare your own run.sh script. Here's an example for GSM8k dataset
and deepseek-llm-7b-chat model.
- Users could replace the ``data.train_files`` ,\ ``data.val_files``,
``actor_rollout_ref.model.path`` and ``critic.model.path`` based on
@@ -107,7 +107,7 @@ Step 4: Perform PPO training with your model on GSM8K Dataset

We use a rule-based reward model. We force the model to produce a final
answer following 4 “#” as shown in the solution. We extract the final
answer from both the solution and models output using regular
answer from both the solution and model's output using regular
expression matching. We compare them and assign a reward of 1 to correct
answer, 0.1 to incorrect answer and 0 to no answer.
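
A minimal version of such a scorer is sketched below; the exact regular
expression and function names are illustrative rather than the implementation
shipped with verl.

.. code:: python

   # Illustrative rule-based scorer for GSM8K-style "#### <answer>" outputs.
   import re

   _ANSWER_RE = re.compile(r"####\s*(-?[0-9][0-9,\.]*)")

   def extract_answer(text):
       match = _ANSWER_RE.search(text)
       return match.group(1).replace(",", "") if match else None

   def compute_score(solution, model_output):
       ground_truth = extract_answer(solution)
       prediction = extract_answer(model_output)
       if prediction is None:
           return 0.0                      # no answer produced
       return 1.0 if prediction == ground_truth else 0.1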

10 changes: 5 additions & 5 deletions docs/examples/ppo_code_architecture.rst
@@ -1,7 +1,7 @@
PPO Example Architecture
========================

Lets start with the Proximal Policy Optimization algorithm, which is
Let's start with the Proximal Policy Optimization algorithm, which is
the most widely used algorithm in LLM post-training.

The main entry point of the PPO algorithm example is:
@@ -151,18 +151,18 @@ Defining reward model/function
resource_pool_manager = ResourcePoolManager(resource_pool_spec=resource_pool_spec, mapping=mapping)
Since not all tasks use model-based RM, users need to define here
whether its a model-based RM or a function-based RM
whether it's a model-based RM or a function-based RM

- If its a model-based RM, directly add the ``RewardModel`` role in the
- If it's a model-based RM, directly add the ``RewardModel`` role in the
resource mapping and add it to the resource pool mapping.

- Note that the pre-defined ``RewardModelWorker`` only supports models
with the structure of huggingface
``AutoModelForSequenceClassification``. If its not this model, you
``AutoModelForSequenceClassification``. If it's not this model, you
need to define your own RewardModelWorker in `FSDP Workers <https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/workers/fsdp_workers.py>`_
and `Megatron-LM Workers <https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/workers/megatron_workers.py>`_.

- If its a function-based RM, the users are required to classified the
- If it's a function-based RM, the users are required to specify the
reward function for each dataset.
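
For the model-based case, adding the role can look roughly like the sketch
below. The pool layout, the ``Role`` member names and the import path are
assumptions based on this example, so check the actual entry point before
reusing them.

.. code:: python

   # Rough sketch; pool sizes, Role members and import path are assumptions.
   from verl.trainer.ppo.ray_trainer import ResourcePoolManager, Role

   global_pool_id = "global_pool"
   resource_pool_spec = {
       global_pool_id: [8] * 1,            # 1 node x 8 GPUs, adjust to your cluster
   }
   mapping = {
       Role.ActorRollout: global_pool_id,
       Role.Critic: global_pool_id,
       Role.RefPolicy: global_pool_id,
       Role.RewardModel: global_pool_id,   # add this entry only for a model-based RM
   }
   resource_pool_manager = ResourcePoolManager(resource_pool_spec=resource_pool_spec,
                                               mapping=mapping)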

.. code:: python