English | 中文
Introduction | Installation | Supported Models | Get Started | Contributions | License
## Introduction

OpenAI's ChatGPT has demonstrated astonishing natural language processing capabilities, opening the door to general-purpose artificial intelligence. Its exceptional performance is closely tied to the Reinforcement Learning from Human Feedback (RLHF) algorithm. In its predecessor, InstructGPT, RLHF was used to collect human feedback and generate content that better aligns with human cognition and values, thus compensating for potential cognitive biases in large models.
MindSpore RLHF (MindRLHF) is built on MindSpore and leverages the framework's capabilities for large-model parallel training, inference, and deployment, helping customers quickly train and deploy RLHF pipelines with models that have billions or trillions of parameters.
The MindRLHF learning process consists of three stages:
- Stage 1: Supervised fine-tuning.
- Stage 2: Reward model training.
- Stage 3: Reinforcement learning training.
MindRLHF integrates the rich model library of MindFormers, providing fine-tuning processes for base models such as Pangu-Alpha (2.6B, 13B) and GPT-2.
Fully inheriting MindSpore's parallel interfaces, MindRLHF can deploy models to a training cluster with one click, enabling training and inference of large models.
To improve inference performance, MindRLHF integrates incremental inference, also known as K-V cache or state reuse, which achieves more than a 30% improvement in inference performance compared to full inference.
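For orientation only, here is a minimal sketch of how a MindSpore distributed job is typically configured; it is not MindRLHF's actual entry point, and the `use_past` switch mentioned in the comments is assumed to be the MindFormers option that enables incremental (K-V cache) inference.

```python
# Minimal sketch of a MindSpore parallel setup; illustrative only, not MindRLHF's entry point.
from mindspore import context
from mindspore.communication import init

# Graph mode on Ascend NPUs is the usual configuration for large-model training.
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
init()  # initialize the distributed communication backend (HCCL on Ascend)

# Semi-automatic parallelism lets MindSpore shard operators across the cluster
# according to the parallel strategies declared in the model configuration.
context.set_auto_parallel_context(
    parallel_mode=context.ParallelMode.SEMI_AUTO_PARALLEL,
    full_batch=True,
)

# In MindFormers model configs, `use_past=True` (an assumed config name) enables
# incremental inference, i.e. the K-V cache / state reuse mentioned above,
# instead of recomputing the full sequence at every decoding step.
```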
The MindRLHF architecture diagram is as follows:
## Installation

The current version, 0.3.0, can be used directly.
MindRLHF has the following requirements; a quick way to check the installed versions is shown after the table:
Requirements | Version |
---|---|
MindSpore | r2.3.1 |
MindFormers | r1.2.0 |
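A minimal way to confirm that the installed packages match the versions above is to query them from Python (assuming both packages expose the usual `__version__` attribute):

```python
# Print the installed framework versions to compare against the requirements table.
import mindspore
import mindformers  # assumes MindFormers is installed as the `mindformers` package

print("MindSpore:", mindspore.__version__)
print("MindFormers:", mindformers.__version__)
```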
## Supported Models

Current version of MindRLHF: 0.3.0
The current version integrates the Pangu-alpha (13B), GPT2, and Baichuan2 (7B/13B) models, which users can explore. In the future, we will provide more models such as LLAMA, BLOOM, and GLM to help users quickly implement their own applications. The specific supported list is shown below:
Table 1: The models and scales supported in MindRLHF
Models | Pangu-alpha | GPT2 | Baichuan2 |
---|---|---|---|
Scales | 2.6B/13B | 124M | 7B/13B |
Parallel | Y | Y | Y |
Device | NPU | NPU | NPU |
The models supported at each training stage are shown in the following table:
Table 2: The models and stages supported in MindRLHF
Stages | Pangu-alpha | GPT2 | Baichuan2 |
---|---|---|---|
SFT | Y | Y | Y |
RM | Y | Y | Y |
RLHF | Y | Y | Y |
In the future, we will integrate more models such as LLAMA, GLM, BLOOM, etc.
We now support DPO (Direct Preference Optimization); the supported models are shown in the following table, and a generic sketch of the offline DPO objective is given at the end of this section:
Table 3: The models for DPO
Type | Baichuan2 | Qwen2 | Qwen2_5 |
---|---|---|---|
offline | Y | Y | Y |
online | | | |
In the future, we will integrate more models such as LLAMA, GLM, Qwen, etc.
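For readers unfamiliar with DPO, the sketch below shows the standard offline DPO objective for a single (chosen, rejected) preference pair. It is a generic NumPy illustration, not MindRLHF's implementation, and all variable names are ours.

```python
import numpy as np

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the summed log-probability of a response under the trained
    policy or the frozen reference model; beta scales the implicit KL-style
    regularization toward the reference model.
    """
    # Implicit reward of each response, measured relative to the reference model.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # Numerically stable -log(sigmoid(logits)).
    return np.logaddexp(0.0, -logits)

# Example: the policy prefers the chosen response more than the reference does,
# so the loss drops below the log(2) ~= 0.693 starting point.
print(dpo_loss(-12.0, -20.0, -14.0, -18.0))  # ~0.513
```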
## Get Started

- Reward model training: a GPT2-based reward model training tutorial is provided in `examples` (a generic sketch of the typical pairwise ranking loss appears after the RLHF example below).
- RLHF fine-tuning: here is an example of RLHF fine-tuning in MindRLHF:
```python
# Import paths below assume the MindRLHF repository layout; adjust them to your installation.
import mindspore.communication.management as D
from mindrlhf.trainer.ppo_trainer import PPOTrainer
from mindrlhf.utils.configs import init_configs, init_network_and_optimizer, init_ppo_dataset

# `args` holds the options of the launch script (elided here).
ppo_config, sft_model_config, ref_model_config, critic_model_config, rm_model_config = init_configs(args)
trainer = PPOTrainer(ppo_config=ppo_config, sft_model_config=sft_model_config,
                     ref_model_config=ref_model_config,
                     critic_model_config=critic_model_config, rm_model_config=rm_model_config)
ppo_with_grad = init_network_and_optimizer(trainer)
rank_id = D.get_rank()
for epoch in range(ppo_config.epochs):
    # sampling: roll out the policy and score responses to build an experience buffer
    trainer.make_experience(num_rollouts=ppo_config.num_rollouts)
    dataset = init_ppo_dataset(trainer)
    # use data sink to accelerate PPO updates on the collected experience
    trainer.train(ppo_with_grad, dataset, epoch)
    trainer.save_checkpoint(rank_id, epoch)
```
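As background for the reward model tutorial mentioned above, reward models for RLHF are commonly trained with a pairwise ranking loss over (chosen, rejected) responses. The snippet below is a generic sketch of that objective, not code taken from the MindRLHF `examples`.

```python
import numpy as np

def pairwise_ranking_loss(reward_chosen, reward_rejected):
    """Generic reward-model objective: -log(sigmoid(r_chosen - r_rejected)).

    The scalar rewards come from the reward model's value head for the preferred
    and the rejected response to the same prompt.
    """
    margin = reward_chosen - reward_rejected
    return np.logaddexp(0.0, -margin)  # numerically stable -log(sigmoid(margin))

# A correctly ordered pair drives the loss below log(2) ~= 0.693.
print(pairwise_ranking_loss(1.5, -0.5))  # ~0.127
```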
## Contributions

Contributions are welcome. Please refer to the MindSpore contribution requirements on the Contributor Wiki.
## License

Apache 2.0 License.