- Uses Hugging Face TRL for PPO.
- Uses Hugging Face PEFT for LoRA.
- Uses bitsandbytes internally for the 4-bit and 8-bit reference model modes.
- Uses our standalone QLoRA library for QLoRA.
This is where the reinforcement learning code is located.
There, one finds the supervised baselines:
- Generate, then learn, masked.
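The README does not spell out what "masked" means for this baseline. One common reading, assumed here, is that the loss is computed only on the generated tokens while the prompt tokens are masked out. The sketch below illustrates that idea in plain Python; the function name `masked_nll` and the example values are hypothetical and are not taken from this repository.

```python
import math
from typing import Sequence


def masked_nll(token_log_probs: Sequence[float], loss_mask: Sequence[int]) -> float:
    """Mean negative log-likelihood over positions where loss_mask is nonzero.

    Assumption: masked positions (mask == 0) are prompt tokens that should
    not contribute to the supervised loss.
    """
    if len(token_log_probs) != len(loss_mask):
        raise ValueError("token_log_probs and loss_mask must have the same length")
    # Keep only the log-probabilities of unmasked (learned) tokens.
    kept = [lp for lp, m in zip(token_log_probs, loss_mask) if m]
    if not kept:
        # Nothing to learn from: define the loss as zero.
        return 0.0
    return -sum(kept) / len(kept)


# Hypothetical example: 3 prompt tokens followed by 2 generated tokens.
log_probs = [math.log(0.1), math.log(0.2), math.log(0.3), math.log(0.5), math.log(0.25)]
mask = [0, 0, 0, 1, 1]
loss = masked_nll(log_probs, mask)
```

In a real training loop the same effect is usually obtained by setting the label of every masked position to the ignore index of the loss function, so that only the generated span produces gradients.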
Launch all SFT jobs with:
cd approach_sft
./queue_all_jobs.sh