CORL (Clean Offline Reinforcement Learning)

This is a fork, with some modifications to edac, of:

CORL (Clean Offline Reinforcement Learning)

🧵 CORL is an Offline Reinforcement Learning library that provides high-quality and easy-to-follow single-file implementations of SOTA ORL algorithms. Each implementation is backed by a research-friendly codebase, allowing you to run or tune thousands of experiments. Heavily inspired by cleanrl for online RL, check them out too!

📜 Single-file implementation
📈 Benchmarked Implementation for N algorithms
🖼 Weights and Biases integration

Getting started

This assumes you are running Ubuntu. First install mujoco-py, i.e. download the Mujoco binaries and extract them to ~/.mujoco/mujoco210. Then:

sudo apt install libosmesa6-dev libgl1-mesa-glx libglfw3  # requirements for mujoco
sudo apt install build-essential  # for gcc for pybullet
git clone https://github.com/tinkoff-ai/CORL.git && cd CORL
conda create -n corl python=3.10  # if you want to use a conda env
pip install -r requirements/requirements_updated.txt

# alternatively, you could use docker
docker build -t <image_name> .
docker run --gpus=all -it --rm --name <container_name> <image_name>

Then run an algorithm with python algorithms/ALGO.py (replace ALGO with the algorithm name to run). Add --help to show the available options.

conda activate corl
# example: run edac with some options
python algorithms/edac.py --device=cuda --batch_size=2048 --critic_qvalue_reduction=softmax --group=EDAC-SoftmaxCritic --num_epochs=60 --eval_every=1

By default, the results are synced to Weights & Biases. This can be disabled with wandb offline and enabled again with wandb online.

Algorithms Implemented

Algorithm	Variants Implemented	Wandb Report
✅ Behavioral Cloning (BC)	`any_percent_bc.py`	`Gym-MuJoCo, Maze2D`
✅ Behavioral Cloning-10% (BC-10%)	`any_percent_bc.py`	`Gym-MuJoCo, Maze2D`
✅ Conservative Q-Learning for Offline Reinforcement Learning (CQL)	`cql.py`	`Gym-MuJoCo, Maze2D`
✅ Accelerating Online Reinforcement Learning with Offline Datasets (AWAC)	`awac.py`	`Gym-MuJoCo, Maze2D`
✅ Offline Reinforcement Learning with Implicit Q-Learning (IQL)	`iql.py`	`Gym-MuJoCo, Maze2D`
✅ A Minimalist Approach to Offline Reinforcement Learning (TD3+BC)	`td3_bc.py`	`Gym-MuJoCo, Maze2D`
✅ Decision Transformer: Reinforcement Learning via Sequence Modeling (DT)	`dt.py`	`Gym-MuJoCo, Maze2D`
✅ Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble (SAC-N)	`sac_n.py`	`Gym-MuJoCo, Maze2D`
✅ Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble (EDAC)	`edac.py`	`Gym-MuJoCo, Maze2D`
✅ Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size (LB-SAC)	`lb_sac.py`	`Gym-MuJoCo`

D4RL Benchmarks

For learning curves and all the details, you can check the links above. Here, we report reproduced final and best scores. Note that thay differ by a big margin, and some papers may use different approaches not making it always explicit which one reporting methodology they chose.

Last Scores

Gym-MuJoCo

Task-Name	BC	BC-10%	TD3 + BC	CQL	IQL	AWAC	SAC-N	EDAC	DT	LB-SAC
halfcheetah-medium-v2	42.40±0.21	42.46±0.81	48.10±0.21	47.08±0.19	48.31±0.11	50.01±0.30	68.20±1.48	67.70±1.20	42.20±0.30	71.21±1.35
halfcheetah-medium-expert-v2	55.95±8.49	90.10±2.83	90.78±6.98	95.98±0.83	94.55±0.21	95.29±0.91	98.96±10.74	104.76±0.74	91.55±1.10	106.57±3.90
halfcheetah-medium-replay-v2	35.66±2.68	23.59±8.02	44.84±0.68	45.19±0.58	43.53±0.43	44.91±1.30	60.70±1.17	62.06±1.27	38.91±0.57	64.10±0.82
hopper-medium-v2	53.51±2.03	55.48±8.43	60.37±4.03	64.98±6.12	62.75±6.02	63.69±4.29	40.82±11.44	101.70±0.32	65.10±1.86	103.75±0.07
hopper-medium-expert-v2	52.30±4.63	111.16±1.19	101.17±10.48	93.89±14.34	106.24±6.09	105.29±7.19	101.31±13.43	105.19±11.64	110.44±0.39	110.93±0.51
hopper-medium-replay-v2	29.81±2.39	70.42±9.99	64.42±24.84	87.67±14.42	84.57±13.49	98.15±2.85	100.33±0.90	99.66±0.94	81.77±7.93	102.53±0.92
walker2d-medium-v2	63.23±18.76	67.34±5.97	82.71±5.51	80.38±3.45	84.03±5.42	69.39±31.97	87.47±0.76	93.36±1.60	67.63±2.93	90.95±0.65
walker2d-medium-expert-v2	98.96±18.45	108.70±0.29	110.03±0.41	109.68±0.52	111.68±0.56	111.16±2.41	114.93±0.48	114.75±0.86	107.11±1.11	113.46±2.31
walker2d-medium-replay-v2	21.80±11.72	54.35±7.32	85.62±4.63	79.24±4.97	82.55±8.00	71.73±13.98	78.99±0.58	87.10±3.21	59.86±3.15	87.95±1.43

locomotion average	50.40	69.29	76.45	78.23	79.80	78.85	83.52	92.92	73.84	94.60

Maze2d

Task-Name	BC	BC-10%	TD3 + BC	CQL	IQL	AWAC	SAC-N	EDAC	DT
maze2d-umaze-v1	0.36±10.03	12.18±4.95	29.41±14.22	-14.83±0.47	37.69±1.99	68.30±25.72	130.59±19.08	95.26±7.37	18.08±29.35
maze2d-medium-v1	0.79±3.76	14.25±2.69	59.45±41.86	86.62±11.11	35.45±0.98	82.66±46.71	88.61±21.62	57.04±3.98	31.71±30.40
maze2d-large-v1	2.26±5.07	11.32±5.88	97.10±29.34	33.22±43.66	49.64±22.02	218.87±3.96	204.76±1.37	95.60±26.46	35.66±32.56

maze2d average	1.13	12.58	61.99	35.00	40.92	123.28	141.32	82.64	28.48

Antmaze

Task-Name	BC	BC-10%	TD3 + BC	CQL	IQL	AWAC	SAC-N	EDAC	DT
antmaze-umaze-v0	51.50±8.81	67.75±6.40	93.25±1.50	72.75±5.32	74.50±11.03	63.50±9.33	0.00±0.00	29.25±33.35	51.75±11.76
antmaze-medium-play-v0	0.00±0.00	2.50±1.91	0.00±0.00	0.00±0.00	71.50±12.56	0.00±0.00	0.00±0.00	0.00±0.00	0.00±0.00
antmaze-large-play-v0	0.00±0.00	0.00±0.00	0.00±0.00	0.00±0.00	40.75±12.69	0.00±0.00	0.00±0.00	0.00±0.00	0.00±0.00

antmaze average	17.17	23.42	31.08	24.25	62.25	21.17	0.00	9.75	17.25

Best Scores

Gym-MuJoCo

Task-Name	BC	BC-10%	TD3 + BC	CQL	IQL	AWAC	SAC-N	EDAC	DT	LB-SAC
halfcheetah-medium-v2	43.60±0.16	43.90±0.15	48.93±0.13	47.45±0.10	48.77±0.06	50.87±0.21	72.21±0.35	69.72±1.06	42.73±0.11	71.82±0.68
halfcheetah-medium-expert-v2	79.69±3.58	94.11±0.25	96.59±1.01	96.74±0.14	95.83±0.38	96.87±0.31	111.73±0.55	110.62±1.20	93.40±0.25	110.37±0.47
halfcheetah-medium-replay-v2	40.52±0.22	42.27±0.53	45.84±0.30	46.38±0.14	45.06±0.16	46.57±0.27	67.29±0.39	66.55±1.21	40.31±0.32	66.14±1.06
hopper-medium-v2	69.04±3.35	73.84±0.43	70.44±1.37	77.47±6.00	80.74±1.27	99.40±1.12	101.79±0.23	103.26±0.16	69.42±4.21	103.88±0.17
hopper-medium-expert-v2	90.63±12.68	113.13±0.19	113.22±0.50	112.74±0.07	111.79±0.47	113.37±0.63	111.24±0.17	111.80±0.13	111.18±0.24	110.93±0.51
hopper-medium-replay-v2	68.88±11.93	90.57±2.38	98.12±1.34	102.20±0.38	102.33±0.44	101.76±0.43	103.83±0.61	103.28±0.57	88.74±3.49	104.00±0.94
walker2d-medium-v2	80.64±1.06	82.05±1.08	86.91±0.32	84.57±0.15	87.99±0.83	86.22±4.58	90.17±0.63	95.78±1.23	74.70±0.64	90.95±0.65
walker2d-medium-expert-v2	109.95±0.72	109.90±0.10	112.21±0.07	111.63±0.20	113.19±0.33	113.40±2.57	116.93±0.49	116.52±0.86	108.71±0.39	113.46±2.31
walker2d-medium-replay-v2	48.41±8.78	76.09±0.47	91.17±0.83	89.34±0.59	91.85±2.26	87.06±0.93	85.18±1.89	89.69±1.60	68.22±1.39	92.25±2.20

locomotion average	70.15	80.65	84.83	85.39	86.40	88.39	95.60	96.36	77.49	95.97

Maze2d

Task-Name	BC	BC-10%	TD3 + BC	CQL	IQL	AWAC	SAC-N	EDAC	DT
maze2d-umaze-v1	16.09±1.00	22.49±1.75	99.33±18.66	84.92±34.40	44.04±3.02	141.92±12.88	153.12±7.50	149.88±2.27	63.83±20.04
maze2d-medium-v1	19.16±1.44	27.64±2.16	150.93±4.50	137.52±9.83	92.25±40.74	160.95±11.64	93.80±16.93	154.41±1.82	68.14±14.15
maze2d-large-v1	20.75±7.69	41.83±4.20	197.64±6.07	153.29±12.86	138.70±44.70	228.00±2.06	207.51±1.11	182.52±3.10	50.25±22.33

maze2d average	18.67	30.65	149.30	125.25	91.66	176.96	151.48	162.27	60.74

Antmaze

Task-Name	BC	BC-10%	TD3 + BC	CQL	IQL	AWAC	SAC-N	EDAC	DT
antmaze-umaze-v0	71.25±9.07	79.50±2.38	97.75±1.50	85.00±3.56	87.00±2.94	74.75±8.77	0.00±0.00	75.00±27.51	60.50±3.11
antmaze-medium-play-v0	4.75±2.22	8.50±3.51	6.00±2.00	3.00±0.82	86.00±2.16	14.00±11.80	0.00±0.00	0.00±0.00	0.25±0.50
antmaze-large-play-v0	0.75±0.50	11.75±2.22	0.50±0.58	0.50±0.58	53.00±6.83	0.00±0.00	0.00±0.00	0.00±0.00	0.00±0.00

antmaze average	25.58	33.25	34.75	29.50	75.33	29.58	0.00	25.00	20.25

Citing CORL

If you use CORL in your work, please use the following bibtex

@inproceedings{
tarasov2022corl,
  title={{CORL}: Research-oriented Deep Offline Reinforcement Learning Library},
  author={Denis Tarasov and Alexander Nikulin and Dmitry Akimov and Vladislav Kurenkov and Sergey Kolesnikov},
  booktitle={3rd Offline RL Workshop: Offline RL as a ''Launchpad''},
  year={2022},
  url={https://openreview.net/forum?id=SyAS49bBcv}
}

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
.github		.github
algorithms		algorithms
configs		configs
requirements		requirements
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
setup.cfg		setup.cfg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CORL (Clean Offline Reinforcement Learning)

Getting started

Algorithms Implemented

D4RL Benchmarks

Last Scores

Gym-MuJoCo

Maze2d

Antmaze

Best Scores

Gym-MuJoCo

Maze2d

Antmaze

Citing CORL

About

Releases

Packages

Languages

License

JonasLoos/CORL

Folders and files

Latest commit

History

Repository files navigation

CORL (Clean Offline Reinforcement Learning)

Getting started

Algorithms Implemented

D4RL Benchmarks

Last Scores

Gym-MuJoCo

Maze2d

Antmaze

Best Scores

Gym-MuJoCo

Maze2d

Antmaze

Citing CORL

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages