[example] docs: improve the quickstart documentation #56

Merged: 8 commits, Dec 19, 2024
29 changes: 11 additions & 18 deletions README.md
@@ -1,16 +1,16 @@
<h1 style="text-align: center;">veRL: Volcano Engine Reinforcement Learning for LLM</h1>

veRL (HybridFlow) is a flexible, efficient and industrial-level RL(HF) training framework designed for large language models (LLMs). veRL is the open-source version of [HybridFlow](https://arxiv.org/abs/2409.19256v2) paper.
veRL is a flexible, efficient and production-ready RL training framework designed for large language models (LLMs). veRL is the open-source version of [HybridFlow](https://arxiv.org/abs/2409.19256v2) paper.

veRL is flexible and easy to use with:

- **Easy to support diverse RL(HF) algorithms**: The Hybrid programming model combines the strengths of single-controller and multi-controller paradigms to enable flexible representation and efficient execution of complex Post-Training dataflows. Allowing users to build RL dataflows in a few lines of code.
- **Easy extension of diverse RL algorithms**: The Hybrid programming model combines the strengths of single-controller and multi-controller paradigms to enable flexible representation and efficient execution of complex Post-Training dataflows. Allowing users to build RL dataflows in a few lines of code.

- **Seamless integration of existing LLM infra with modular API design**: Decouples computation and data dependencies, enabling seamless integration with existing LLM frameworks, such as PyTorch FSDP, Megatron-LM and vLLM. Moreover, users can easily extend to other LLM training and inference frameworks.
- **Seamless integration of existing LLM infra with modular APIs**: Decouples computation and data dependencies, enabling seamless integration with existing LLM frameworks, such as PyTorch FSDP, Megatron-LM and vLLM. Moreover, users can easily extend to other LLM training and inference frameworks.

- **Flexible device mapping**: Supports various placement of models onto different sets of GPUs for efficient resource utilization and scalability across different cluster sizes.

- Readily integration with popular Hugging Face models
- Readily integration with popular HuggingFace models


veRL is fast with:
@@ -150,12 +150,15 @@ export PYTHONPATH=$PYTHONPATH:$(pwd)
## Getting Started
Visit our [documentation](https://verl.readthedocs.io/en/latest/index.html) to learn more.

**Running an PPO example should follow:**
- Preparation
- [Installation](https://verl.readthedocs.io/en/latest/preparation/install.html)
**Quickstart:**
- [Installation](https://verl.readthedocs.io/en/latest/preparation/install.html)
- [Quickstart](https://verl.readthedocs.io/en/latest/start/quickstart.html)

**Running an PPO example step-by-step:**
- Data and Reward Preparation
- [Prepare Data (Parquet) for Post-Training](https://verl.readthedocs.io/en/latest/preparation/prepare_data.html)
- [Implement Reward Function for Dataset](https://verl.readthedocs.io/en/latest/preparation/reward_function.html)
- PPO Example (Run an example)
- Understanding the PPO Example
- [PPO Example Architecture](https://verl.readthedocs.io/en/latest/examples/ppo_code_architecture.html)
- [Config Explanation](https://verl.readthedocs.io/en/latest/examples/config.html)
- [Run GSM8K Example](https://verl.readthedocs.io/en/latest/examples/gsm8k_example.html)
@@ -175,16 +178,6 @@ Visit our [documentation](https://verl.readthedocs.io/en/latest/index.html) to learn more.
- [Add models to Megatron-LM backend](https://verl.readthedocs.io/en/latest/advance/megatron_extension.html)


## Contribution

### Code formatting
We use yapf (Google style) to enforce strict code formatting when reviewing MRs. Run `yapf` at the top level of verl repo:
```bash
# pip3 install yapf
yapf -ir -vv --style ./.style.yapf verl examples tests
```


## Citation

```tex
6 changes: 3 additions & 3 deletions docs/advance/dpo_extension.rst
@@ -66,7 +66,7 @@ Here, ``SampleGenerator`` can be viewed as a multi-process pulled up by
the control flow to call. The implementation details inside can use any
inference engine including vllm, sglang and huggingface. Users can
largely reuse the code in
verl/verl/trainer/ppo/rollout/vllm_rollout/vllm_rollout.py and we wont
verl/verl/trainer/ppo/rollout/vllm_rollout/vllm_rollout.py and we won't
go into details here.

**ReferencePolicy inference**
@@ -179,7 +179,7 @@ steps:

Frequently calling these 3 steps on the controller process greatly hurts
code readability. **In veRL, we have abstracted and encapsulated these 3
steps, so that the workers method + dispatch + collect can be
steps, so that the worker's method + dispatch + collect can be
registered into the worker_group**

.. code:: python
@@ -230,7 +230,7 @@ Here it requires the data interface to be ``DataProto``. Definition of
Step 3: Main training loop
~~~~~~~~~~~~~~~~~~~~~~~~~~

With the above training flows, we can implement the algorithms control
With the above training flows, we can implement the algorithm's control
flow. It is recommended that ``main_task`` is also a ray remote process.

.. code:: python
2 changes: 1 addition & 1 deletion docs/advance/fsdp_extension.rst
@@ -28,7 +28,7 @@ loader for the models below in `dtensor_weight_loader.py <https://github.com/vol
- ``Qwen2ForCausalLM``
- ``DeepseekV2ForCausalLM``

To implement ``dtensor_weight_loader`` of a model thats supported in
To implement ``dtensor_weight_loader`` of a model that's supported in
vLLM, follow the guide of gemma model below:

1. Copy the
6 changes: 3 additions & 3 deletions docs/advance/megatron_extension.rst
@@ -1,5 +1,5 @@
Add models to Megatron-LM backend
===========
===================================

Model
-----------
@@ -20,6 +20,6 @@ To support other model, users are required to implement:
your loader to ``weight_loader_registry`` in `weight_loader_registry.py <https://github.com/volcengine/verl/blob/main/verl/models/weight_loader_registry.py>`_.
3. Weight loader that synchronize the weight from Megatron to rollout
(vLLM) model. Note that both the actor model and rollout model are
partitioned during runtime. So, its advisable to map the model name
partitioned during runtime. So, it's advisable to map the model name
in actor model implementation. Otherwise, you may need an additional
name mapping and even weight transformation.
name mapping and even weight transformation.
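
As a rough sketch of step 3 (not verl's actual code: the registry name, decorator and
signatures below are illustrative assumptions; see ``weight_loader_registry.py`` for the
real ones), a dict-based registry plus a name-matched parameter copy might look like:

.. code:: python

   # Illustrative registry pattern only; verl's real registry and loader
   # signatures live in verl/models/weight_loader_registry.py.
   WEIGHT_LOADER_REGISTRY = {}

   def register_weight_loader(arch_name):
       def decorator(loader_fn):
           WEIGHT_LOADER_REGISTRY[arch_name] = loader_fn
           return loader_fn
       return decorator

   @register_weight_loader("MyModelForCausalLM")  # hypothetical architecture name
   def load_weights_to_rollout(actor_state_dict, rollout_model):
       # Copy each Megatron-side parameter into the matching vLLM-side
       # parameter; keeping names identical avoids an extra mapping table.
       rollout_params = dict(rollout_model.named_parameters())
       for name, tensor in actor_state_dict.items():
           rollout_params[name].data.copy_(tensor)
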
36 changes: 19 additions & 17 deletions docs/examples/config.rst
@@ -1,3 +1,5 @@
.. _config-explain-page:

Config Explanation
===================

@@ -22,14 +24,14 @@ Data
return_raw_chat: False

- ``data.train_files``: Training set parquet. Can be a list or a single
file. The program will read all files into memory, so it cant be too
file. The program will read all files into memory, so it can't be too
large (< 100GB). The path can be either local path or HDFS path. For
HDFS path, we provide utils to download it to DRAM and convert the
HDFS path to local path.
- ``data.val_files``: Validation parquet. Can be a list or a single
file.
- ``data.prompt_key``: The field in the dataset where the prompt is
located. Default is prompt.
located. Default is 'prompt'.
- ``data.max_prompt_length``: Maximum prompt length. All prompts will be
left-padded to this length. An error will be reported if the length is
too long
@@ -41,13 +43,13 @@ Data
iteration.
- ``data.return_raw_input_ids``: Whether to return the original
input_ids without adding chat template. This is mainly used to
accommodate situations where the reward models chat template differs
from the policy. It needs to be decoded first, then apply the RMs
accommodate situations where the reward model's chat template differs
from the policy. It needs to be decoded first, then apply the RM's
chat template. If using a model-based RM, and the policy and RM
chat_templates are different, this flag needs to be set
- ``data.return_raw_chat``:
- ``data.truncation``: Truncate the input_ids or prompt length if they
exceed max_prompt_length. Default is error, not allow exceed the
exceed max_prompt_length. Default is 'error', not allow exceed the
max_prompt_length. The users should increase the max_prompt_length if
throwing the error.
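
For illustration only, the ``data`` fields above could be assembled into an
OmegaConf/Hydra-style config as sketched below; every path and size is a placeholder,
not a recommended default.

.. code:: python

   from omegaconf import OmegaConf

   data_cfg = OmegaConf.create({
       "train_files": "~/data/gsm8k/train.parquet",  # placeholder; a list of files also works
       "val_files": "~/data/gsm8k/test.parquet",     # placeholder
       "prompt_key": "prompt",
       "max_prompt_length": 512,                     # placeholder length
       "return_raw_input_ids": False,
       "return_raw_chat": False,
       "truncation": "error",                        # fail loudly instead of silently truncating
   })
   print(OmegaConf.to_yaml(data_cfg))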

@@ -114,7 +116,7 @@ Actor/Rollout/Reference Policy

**Common config for actor, rollout and reference model**

- ``actor_rollout_ref.hybrid_engine``: Whether its a hybrid engine,
- ``actor_rollout_ref.hybrid_engine``: Whether it's a hybrid engine,
currently only supports hybrid engine
- ``actor_rollout_ref.model.path``: Huggingface model path. This can be
either local path or HDFS path. For HDFS path, we provide utils to
@@ -123,7 +125,7 @@ Actor/Rollout/Reference Policy
that need to be imported. Used to register models or tokenizers into
the Huggingface system.
- ``actor_rollout_ref.model.override_config``: Used to override some of
the models original configurations, mainly dropout
the model's original configurations, mainly dropout
- ``actor_rollout_ref.model.enable_gradient_checkpointing``: Whether to
enable gradient checkpointing for the actor

@@ -154,12 +156,12 @@ Actor/Rollout/Reference Policy
- ``actor_rollout_ref.actor.shuffle``: Whether to shuffle data when
there are multiple epochs

- ``actor_rollout_ref.actor.optim``: Actors optimizer parameters
- ``actor_rollout_ref.actor.optim``: Actor's optimizer parameters

- ``actor_rollout_ref.actor.fsdp_config``: FSDP config for actor
training

- ``wrap_policy``: FSDP wrap policy. By default, it uses Huggingfaces
- ``wrap_policy``: FSDP wrap policy. By default, it uses Huggingface's
wrap policy, i.e., wrapping by DecoderLayer

- No need to set transformer_layer_cls_to_wrap, so we comment it.
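
For illustration, "wrapping by DecoderLayer" corresponds roughly to the PyTorch FSDP
auto-wrap policy sketched below; ``LlamaDecoderLayer`` is only an example layer class,
not something this config selects for you.

.. code:: python

   import functools

   from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
   from transformers.models.llama.modeling_llama import LlamaDecoderLayer

   # Wrap each decoder layer in its own FSDP unit.
   auto_wrap_policy = functools.partial(
       transformer_auto_wrap_policy,
       transformer_layer_cls={LlamaDecoderLayer},
   )
   # This callable is then passed as FSDP(..., auto_wrap_policy=auto_wrap_policy).
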
@@ -172,19 +174,19 @@ Actor/Rollout/Reference Policy
**Reference Model**

- ``actor_rollout_ref.ref``: FSDP config same as actor. **For models
larger than 7B, its recommended to turn on offload for ref by
larger than 7B, it's recommended to turn on offload for ref by
default**
- ``actor_rollout_ref.ref.log_prob_micro_batch_size``: The batch size
for one forward pass in the computation of ``ref_log_prob``.

**Rollout Model**

- ``actor_rollout_ref.rollout.name``: hf/vllm. We use vLLM by default
because its much efficient and our hybrid engine is implemented with
because it's much efficient and our hybrid engine is implemented with
vLLM.

- Rollout (Auto-regressive) parameters. The key should be equal to the
property name in vLLMs ``SamplingParams``.
property name in vLLM's ``SamplingParams``.

- ``temperature``, ``top_k``, ``top_p`` and others: Sampling
parameters in ``SamplingParams``.
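
For illustration, these keys map directly onto vLLM's ``SamplingParams`` fields, as in
the sketch below; the values shown are placeholders, not recommended settings.

.. code:: python

   from vllm import SamplingParams

   sampling_params = SamplingParams(
       temperature=1.0,  # rollout.temperature
       top_k=-1,         # -1 disables top-k filtering in vLLM
       top_p=1.0,        # rollout.top_p
   )
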
@@ -224,15 +226,15 @@ Actor/Rollout/Reference Policy
- ``megatron``: Use Megatron weight loader. Deployed with Megatron
backend. The input model ``state_dict()`` is already partitioned
along TP dimension and already gathered along PP dimension. This
weight loader requires that the Rollout model and Actor models
weight loader requires that the Rollout model and Actor model's
parameters shape and name should be identical.
- ``dtensor``: Default solution when using Huggingface weight loader.
Deployed with FSDP backend and the state_dict_type is
``StateDictType.SHARDED_STATE_DICT``. Recommend to use this weight
loader
- ``hf``: Use Huggingface weight loader. Deployed with FSDP backend
and the state_dict_type is ``StateDictType.FULL_STATE_DICT``. This
solution doesnt need to rewrite the weight loader for each model
solution doesn't need to rewrite the weight loader for each model
implemented in vLLM but it results in larger peak memory usage.
- ``dummy_hf``, ``dummy_megatron``, ``dummy_dtensor``: Random
initialization.
@@ -268,11 +270,11 @@ Reward Model
responses. If False, the following parameters are not effective.
- ``reward_model.model``

- ``input_tokenizer``: Input tokenizer. If the reward models chat
- ``input_tokenizer``: Input tokenizer. If the reward model's chat
template is inconsistent with the policy, we need to first decode to
plaintext, then apply the rms chat_template. Then score with RM. If
plaintext, then apply the rm's chat_template. Then score with RM. If
chat_templates are consistent, it can be set to null.
- ``path``: RMs HDFS path or local path. Note that RM only supports
- ``path``: RM's HDFS path or local path. Note that RM only supports
AutoModelForSequenceClassification. Other model types need to define
their own RewardModelWorker and pass it from the code.
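
A rough sketch of this decode-then-retemplate step (not verl's actual worker code; the
model names are placeholders) could look like:

.. code:: python

   from transformers import AutoTokenizer

   policy_tok = AutoTokenizer.from_pretrained("policy-model")  # placeholder name
   rm_tok = AutoTokenizer.from_pretrained("reward-model")      # placeholder name

   def retemplate_for_rm(input_ids):
       # 1) Decode the policy-templated token ids back to plain text.
       text = policy_tok.decode(input_ids, skip_special_tokens=True)
       # 2) Re-apply the reward model's own chat template before scoring.
       messages = [{"role": "user", "content": text}]
       return rm_tok.apply_chat_template(messages, tokenize=True)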

6 changes: 3 additions & 3 deletions docs/examples/gsm8k_example.rst
@@ -49,7 +49,7 @@ Step 1: Prepare dataset
Step 2: Download Model
----------------------

Therere three ways to prepare the model checkpoints for post-training:
There're three ways to prepare the model checkpoints for post-training:

- Download the required models from hugging face

@@ -96,7 +96,7 @@ We also provide various training scripts for SFT on GSM8K dataset in `gsm8k sft
Step 4: Perform PPO training with your model on GSM8K Dataset
-------------------------------------------------------------

- Prepare your own run.sh script. Heres an example for GSM8k dataset
- Prepare your own run.sh script. Here's an example for GSM8k dataset
and deepseek-llm-7b-chat model.
- Users could replace the ``data.train_files`` ,\ ``data.val_files``,
``actor_rollout_ref.model.path`` and ``critic.model.path`` based on
@@ -107,7 +107,7 @@ Step 4: Perform PPO training with your model on GSM8K Dataset

We use a rule-based reward model. We force the model to produce a final
answer following 4 “#” as shown in the solution. We extract the final
answer from both the solution and models output using regular
answer from both the solution and model's output using regular
expression matching. We compare them and assign a reward of 1 to correct
answer, 0.1 to incorrect answer and 0 to no answer.
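
As a rough illustration of this rule (not necessarily the exact code used in verl; the
``####`` delimiter and the regex are inferred from the description above):

.. code:: python

   import re

   ANSWER_PATTERN = r"####\s*(-?[\d,\.]+)"

   def extract_final_answer(text):
       match = re.search(ANSWER_PATTERN, text)
       return match.group(1).replace(",", "") if match else None

   def rule_based_reward(solution, model_output):
       ground_truth = extract_final_answer(solution)
       prediction = extract_final_answer(model_output)
       if prediction is None:
           return 0.0   # no parsable final answer
       if prediction == ground_truth:
           return 1.0   # correct final answer
       return 0.1       # answered, but incorrect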

10 changes: 5 additions & 5 deletions docs/examples/ppo_code_architecture.rst
@@ -1,7 +1,7 @@
PPO Example Architecture
========================

Lets start with the Proximal Policy Optimization algorithm, which is
Let's start with the Proximal Policy Optimization algorithm, which is
the most widely used algorithm in LLM post-training.

The main entry point of the PPO algorithm example is:
@@ -151,18 +151,18 @@ Defining reward model/function
resource_pool_manager = ResourcePoolManager(resource_pool_spec=resource_pool_spec, mapping=mapping)

Since not all tasks use model-based RM, users need to define here
whether its a model-based RM or a function-based RM
whether it's a model-based RM or a function-based RM

- If its a model-based RM, directly add the ``RewardModel`` role in the
- If it's a model-based RM, directly add the ``RewardModel`` role in the
resource mapping and add it to the resource pool mapping.

- Note that the pre-defined ``RewardModelWorker`` only supports models
with the structure of huggingface
``AutoModelForSequenceClassification``. If its not this model, you
``AutoModelForSequenceClassification``. If it's not this model, you
need to define your own RewardModelWorker in `FSDP Workers <https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/workers/fsdp_workers.py>`_
and `Megatron-LM Workers <https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/workers/megatron_workers.py>`_.

- If its a function-based RM, the users are required to classified the
- If it's a function-based RM, the users are required to classified the
reward function for each datasets.

.. code:: python