"Enhancing Question Generation with Novel Reward Functions: Evaluation and Comparison" focuses on improving question generation by fine-tuning a base LLM through reinforcement learning with novel reward functions. The study evaluates and compares these reward functions to determine how well they work and what improvements they bring.
- Pretrained SQUADv2 model
- PPO (Proximal Policy Optimization) instead of SCST (Self-Critical Sequence Training)
- Batch Size: 512
- Total Batches: 50 out of 170
- Learning Rate: 5e-5
- Generation Kwargs: "min_new_tokens": 1, "max_new_tokens": 32
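
As a rough illustration, the settings above could be collected in a single configuration object. This is a minimal sketch, not the project's actual training code: the class name `PPORunConfig`, its field names, and the two helper methods are hypothetical, and reading "50 out of 170" as "50 batches run out of 170 available" is an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PPORunConfig:
    """Hypothetical container for the hyperparameters listed above."""

    batch_size: int = 512
    # Assumption: "50 out of 170" means 50 batches were run out of 170 available.
    batches_run: int = 50
    batches_available: int = 170
    learning_rate: float = 5e-5
    # Passed to the generation step so each sampled question has 1 to 32 new tokens.
    generation_kwargs: dict = field(
        default_factory=lambda: {"min_new_tokens": 1, "max_new_tokens": 32}
    )

    def samples_seen(self) -> int:
        """Number of examples processed, under the assumption stated above."""
        return self.batch_size * self.batches_run

    def fraction_completed(self) -> float:
        """Share of the available batches that were actually used."""
        return self.batches_run / self.batches_available


config = PPORunConfig()
print(config.samples_seen())
print(round(config.fraction_completed(), 3))
```

Keeping the values in one immutable object makes it easier to log the exact setup alongside results when comparing reward functions.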