Project Website · Paper · Platform · Datasets · Clean Offline RLHF
This is the official PyTorch implementation of the paper "Uni-RLHF: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback". Clean-Offline-RLHF is an Offline Reinforcement Learning with Human Feedback (RLHF) codebase that provides implementations of offline RL algorithms trained from high-quality, realistic human feedback labels.
- [03-26-2024] 🔥 Added Mini-Uni-RLHF, a minimal out-of-the-box annotation tool for researchers, powered by Streamlit.
- [03-24-2024] Released the SMARTS environment training dataset, scripts, and labels. You can find them in the smarts branch.
- [03-20-2024] Updated the detailed setup bash files.
- [02-22-2024] Initial code release.
Clone this repository.
git clone https://github.com/pickxiguapi/Clean-Offline-RLHF.git
cd Clean-Offline-RLHF
Install PyTorch & torchvision.
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu118
Install extra dependencies.
pip install -r requirements/requirements.txt
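As an optional sanity check (not part of the official setup), you can verify that PyTorch was installed with CUDA support before continuing:

```python
# Optional sanity check: confirm PyTorch is importable and CUDA is visible.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```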
Before running an offline RLHF algorithm, you need a dataset annotated with human feedback. If you wish to collect labels for new tasks, refer to the platform part for crowdsourced annotation. Here, we provide a ~15M-step crowdsourced annotation dataset for the sample tasks: raw dataset.
The processed crowdsourced (CS) and scripted teacher (ST) labels are located in the crowdsource_human_labels and generated_fake_labels folders.
Note: for comparison and validation purposes, we provide a fast track for scripted teacher (ST) label generation in fast_track/generate_d4rl_fake_labels.py.
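Conceptually, a scripted teacher labels a pair of trajectory segments by comparing their ground-truth returns. The sketch below only illustrates that idea; it is not the repo's script, and the function name is ours:

```python
# Conceptual sketch of scripted-teacher (ST) preference labeling (not the
# repo's exact implementation): a pair of segments is labeled by comparing
# their summed ground-truth rewards.
import numpy as np

def scripted_teacher_label(rewards_a: np.ndarray, rewards_b: np.ndarray) -> int:
    """Return 0 if segment A is preferred, 1 if segment B is preferred."""
    return 0 if rewards_a.sum() >= rewards_b.sum() else 1

# Example with two random 200-step segments (len_query=200 in the configs).
rng = np.random.default_rng(0)
seg_a, seg_b = rng.random(200), rng.random(200)
print(scripted_teacher_label(seg_a, seg_b))
```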
Here we provide an example of the CS-MLP method for the walker2d-medium-expert-v2 task; you can customize it in the configuration file rlhf/cfgs/default.yaml.
cd rlhf
python train_reward_model.py domain=mujoco env=walker2d-medium-expert-v2 \
modality=state structure=mlp fake_label=false ensemble_size=3 n_epochs=50 \
num_query=2000 len_query=200 data_dir="../crowdsource_human_labels" \
seed=0 exp_name="CS-MLP"
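For intuition, the reward model roughly corresponds to an ensemble of per-step reward networks trained with a Bradley-Terry preference loss over segment pairs. The sketch below is a minimal illustration under that assumption, not the repo's exact classes; layer sizes and names are ours:

```python
# Minimal sketch (not the repo's exact model) of an ensemble MLP reward model
# trained with a Bradley-Terry preference loss over labeled segment pairs.
import torch
import torch.nn as nn

class MLPReward(nn.Module):
    def __init__(self, obs_act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        # x: (batch, len_query, obs_act_dim) -> per-step reward (batch, len_query)
        return self.net(x).squeeze(-1)

def preference_loss(model: nn.Module, seg_a, seg_b, labels):
    """Bradley-Terry loss; labels[i] = 0 if segment A preferred, 1 if B."""
    ret_a = model(seg_a).sum(dim=-1)   # predicted segment returns
    ret_b = model(seg_b).sum(dim=-1)
    logits = torch.stack([ret_a, ret_b], dim=-1)
    return nn.functional.cross_entropy(logits, labels)

# An ensemble (e.g. ensemble_size=3) averages member predictions at relabel time.
# For walker2d, obs_act_dim = 17 (observation) + 6 (action) = 23.
ensemble = [MLPReward(obs_act_dim=23) for _ in range(3)]
```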
For reward model training on more environments, we provide bash scripts:
cd rlhf
bash scripts/train_mujoco.sh
bash scripts/train_antmze.sh
bash scripts/train_adroit.sh
Following the Uni-RLHF codebase implementation, we modified the IQL, CQL, and TD3BC algorithms.
Example: train IQL with the CS-MLP reward model. Logs will be uploaded to wandb.
python algorithms/offline/iql_p.py --device "cuda:0" --seed 0 \
--reward_model_path "path/to/reward_model" --config_path ./configs/offline/iql/walker/medium_expert_v2.yaml \
--reward_model_type mlp --name CS-MLP-IQL-Walker-medium-expert-v2
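Under the hood, --reward_model_path points to a learned reward model that relabels the dataset's rewards before policy training. Below is a minimal sketch of that relabeling step; the function and dictionary keys are illustrative assumptions, not the repo's API:

```python
# Illustrative relabeling step (names are assumptions, not the repo's API):
# predicted rewards replace the dataset's original rewards before offline RL.
import numpy as np
import torch

@torch.no_grad()
def relabel_rewards(dataset: dict, reward_model: torch.nn.Module,
                    device: str = "cuda:0") -> dict:
    reward_model = reward_model.to(device)
    obs_act = np.concatenate([dataset["observations"], dataset["actions"]], axis=-1)
    x = torch.as_tensor(obs_act, dtype=torch.float32, device=device)
    dataset["rewards"] = reward_model(x).cpu().numpy()
    return dataset
```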
You can use any combination of algorithm, label type, and reward model type:
| Algorithm | Label Type | Reward Model Type |
|---|---|---|
| IQL | CS | MLP |
| CQL | ST | TFM |
| TD3BC | | CNN |
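For reference, the full combination space implied by the table can be enumerated; the naming below follows the `<LabelType>-<RewardModel>-<Algorithm>` pattern used in the example commands (the enumeration itself is just illustrative):

```python
# Enumerate the combinations from the table above and print run names in the
# "<LabelType>-<RewardModel>-<Algorithm>" style used by the example commands.
from itertools import product

algorithms = ["IQL", "CQL", "TD3BC"]
label_types = ["CS", "ST"]
reward_models = ["MLP", "TFM", "CNN"]

for label, rm, algo in product(label_types, reward_models, algorithms):
    print(f"{label}-{rm}-{algo}")
```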
For policy training on more environments, we provide bash scripts:
bash scripts/run_mujoco.sh
bash scripts/run_antmze.sh
bash scripts/run_adroit.sh
Distributed under the MIT License. See LICENSE.txt for more information.
For any questions, please feel free to email [email protected].
If you find our work useful, please consider citing:
@inproceedings{anonymous2023unirlhf,
title={Uni-{RLHF}: Universal Platform and Benchmark Suite for Reinforcement Learning with Diverse Human Feedback},
author={Yuan, Yifu and Hao, Jianye and Ma, Yi and Dong, Zibin and Liang, Hebin and Liu, Jinyi and Feng, Zhixin and Zhao, Kai and Zheng, Yan},
booktitle={The Twelfth International Conference on Learning Representations, ICLR},
year={2024},
url={https://openreview.net/forum?id=WesY0H9ghM},
}