This project combines the exploration strategy offered by Random Network Distillation (RND) [1] with the robustness of Proximal Policy Optimization (PPO) [5]. Since previous attempts have shown improvements in environments with smaller state spaces, this project aims to solve the achievements (milestones) of the Crafter environment [2], a 2D abstraction of Minecraft. The main challenge is combining intrinsic rewards with the sparse extrinsic rewards that Crafter ties to its achievements, which are essential for the reward function and for solving the environment.
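As a rough illustration of the RND idea (a minimal sketch, not the exact code in this repository; network sizes and names are assumptions), the intrinsic reward is the prediction error of a trainable predictor network against a fixed, randomly initialized target network:

```python
import torch
import torch.nn as nn

class RNDModule(nn.Module):
    """Minimal RND sketch: intrinsic reward = predictor's error on a frozen random target."""

    def __init__(self, obs_dim: int, feat_dim: int = 64):
        super().__init__()
        # Fixed, randomly initialized target network (never trained).
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        for p in self.target.parameters():
            p.requires_grad = False
        # Predictor network, trained to match the target's output.
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))

    def intrinsic_reward(self, obs: torch.Tensor) -> torch.Tensor:
        # Novel observations are predicted poorly, yielding a larger exploration bonus.
        with torch.no_grad():
            target_feat = self.target(obs)
        pred_feat = self.predictor(obs)
        return ((pred_feat - target_feat) ** 2).mean(dim=-1)

    def loss(self, obs: torch.Tensor) -> torch.Tensor:
        # The same prediction error is minimized to train the predictor.
        return self.intrinsic_reward(obs).mean()
```

The (typically normalized) intrinsic reward is added to the extrinsic Crafter reward, e.g. r = r_ext + β·r_int, before the PPO update.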
The first plot shows vanilla PPO over 500 episodes; the second plot shows PPO with RND.
Use the conda package manager. Then run the following commands, following any prompted instructions:

```bash
conda create -n crafter python=3.12.15
conda activate crafter
pip install -r requirements.txt
export PYTHONPATH=$(pwd)/src:$PYTHONPATH
python src/agents/baselines.py --steps 500
python src/agents/ppo_rnd/with_policies.py --steps 500
python src/agents/ppo_rnd/new_ppornd.py --steps 500
```
The plot scores_baselines.png shows each seed's trajectory, allowing robustness and correct episode lengths to be inspected. The plot mean_scores_with_std_dev.png shows the mean achievement score over 500 episodes with the corresponding confidence interval across all seeds.

To regenerate the plots, change the path to your scores.json file from the logdir and run the command below; the resulting plots are written to the src/plots directory:

```bash
python src/utils/plotting.py
```
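For reference, a minimal sketch of such a plotting step (not necessarily the exact logic of src/utils/plotting.py; the scores.json layout with one list of per-episode scores per seed is an assumption):

```python
import json
import numpy as np
import matplotlib.pyplot as plt

# Assumed layout: scores.json maps each seed to a list of per-episode scores.
with open("logdir/scores.json") as f:
    scores_per_seed = json.load(f)

scores = np.array(list(scores_per_seed.values()))  # shape: (num_seeds, num_episodes)
mean = scores.mean(axis=0)
std = scores.std(axis=0)

episodes = np.arange(1, scores.shape[1] + 1)
plt.plot(episodes, mean, label="mean across seeds")
plt.fill_between(episodes, mean - std, mean + std, alpha=0.3, label="±1 std dev")
plt.xlabel("Episode")
plt.ylabel("Score")
plt.legend()
plt.savefig("src/plots/mean_scores_with_std_dev.png")
```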
Reward: The sparse reward is +1 for unlocking an achievement for the first time during an episode, and -0.1 or +0.1 for each lost or regenerated health point. Results should be reported not as reward but as success rates and score. Episodes end automatically after 10,000 steps if the agent survives that long.

Success rates: The success rate of each of the 22 achievements is computed as the percentage of training episodes in which the achievement was unlocked, giving insight into the ability spectrum of an agent.
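The score defined in [2] aggregates the success rates (in percent) with a geometric-mean-style formula, S = exp(mean_i ln(1 + s_i)) - 1, so that progress on rarely unlocked achievements counts more than further gains on common ones. A minimal sketch (function and variable names are illustrative, not part of this repository):

```python
import numpy as np

def crafter_score(success_rates_percent: list[float]) -> float:
    """Crafter score as defined in [2]: geometric mean of (1 + success rate), minus one.

    success_rates_percent: one value in [0, 100] per achievement (22 values for Crafter).
    """
    s = np.asarray(success_rates_percent, dtype=np.float64)
    return float(np.exp(np.mean(np.log(1.0 + s))) - 1.0)

# Example: an agent that unlocks half of the achievements in every episode
# and never unlocks the rest.
print(crafter_score([100.0] * 11 + [0.0] * 11))  # ≈ 9.05
```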
| Compute/Evaluation Infrastructure | |
|---|---|
| Device | MacBook Pro M3 Pro 14-inch |
| CPU | M3 Pro |
| GPU | - |
| TPU | - |
| RAM | 18 GB |
| OS | macOS Sonoma 14.5 |
| Python Version | 3.12.15 |
To reduce the computational resources needed to reach 1 million episodes, it is recommended to transfer this approach to Craftax [3], a reimplementation of Crafter in JAX [4] that offers faster performance. Furthermore, each achievement can be inspected separately.
[1] Nugroho, W. (2021). Reinforcement Learning PPO RND [Source code]. GitHub. https://github.com/wisnunugroho21/reinforcement_learning_ppo_rnd
[2] Hafner, D. (2021). Benchmarking the Spectrum of Agent Capabilities. arXiv preprint arXiv:2109.06780.
[3] Matthews, M. et al. (2024). Craftax: A Lightning-Fast Benchmark for Open-Ended Reinforcement Learning. International Conference on Machine Learning (ICML).
[4] Bradbury, J. et al. (2018). JAX: Composable transformations of Python+NumPy programs. http://github.com/google/jax
[5] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.