Policy Gradient (PG)
Proximal Policy Optimization (PPO)
RLHF for Large Models: PPO Principles and Source Code Walkthrough
DPO
1. RLHF-related
2. Reinforcement Learning