Library for reinforcement learning in Java 17.
The repository includes algorithms, examples, and exercises from the 2nd edition of Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto.
Our implementation is inspired by the Python code by Shangtong Zhang, but differs from the reference in two aspects:
- the algorithms are implemented separately from the problem scenarios
- the math is carried out in exact precision, which reproduces symmetries in the results in case the problem features symmetries (see the sketch below)
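As a small illustration of the second point: exact fraction arithmetic avoids the rounding drift of double, so quantities that are equal by symmetry stay exactly equal. The sketch below is standalone and illustrative only; the types are not part of the subare API, which relies on its own exact arithmetic.

import java.math.BigInteger;

// minimal exact fraction, for illustration only; not a subare type
record Fraction(BigInteger num, BigInteger den) {
  static Fraction of(long num, long den) {
    return new Fraction(BigInteger.valueOf(num), BigInteger.valueOf(den)).reduce();
  }
  Fraction reduce() {
    BigInteger gcd = num.gcd(den);
    return new Fraction(num.divide(gcd), den.divide(gcd));
  }
  Fraction add(Fraction that) {
    return new Fraction(num.multiply(that.den).add(that.num.multiply(den)), den.multiply(that.den)).reduce();
  }
}

class ExactDemo {
  public static void main(String[] args) {
    System.out.println(0.1 + 0.2 == 0.3); // false: floating point rounding drift
    Fraction sum = Fraction.of(1, 10).add(Fraction.of(2, 10));
    System.out.println(sum.equals(Fraction.of(3, 10))); // true: exact arithmetic
  }
}

The following algorithms from the book are implemented (a generic sketch of the tabular Q-learning update follows the list):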
- Iterative Policy Evaluation (parallel, in 4.1, p.59)
- Value Iteration to determine V*(s) (parallel, in 4.4, p.65)
- Action-Value Iteration to determine Q*(s,a) (parallel)
- First Visit Policy Evaluation (in 5.1, p.74)
- Monte Carlo Exploring Starts (in 5.3, p.79)
- Constant-alpha Monte Carlo
- Tabular Temporal Difference (in 6.1, p.96)
- Sarsa: An on-policy TD control algorithm (in 6.4, p.104)
- Q-learning: An off-policy TD control algorithm (in 6.5, p.105)
- Expected Sarsa (in 6.6, p.107)
- Double Sarsa, Double Expected Sarsa, Double Q-Learning (in 6.7, p.109)
- n-step Temporal Difference for estimating V(s) (in 7.1, p.115)
- n-step Sarsa, n-step Expected Sarsa, n-step Q-Learning (in 7.2, p.118)
- Random-sample one-step tabular Q-planning (parallel, in 8.1, p.131)
- Tabular Dyna-Q (in 8.2, p.133)
- Prioritized Sweeping (in 8.4, p.137)
- Semi-gradient Tabular Temporal Difference (in 9.3, p.164)
- True Online Sarsa (in 12.8, p.309)
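As a generic illustration of the tabular control methods listed above, the sketch below implements the Q-learning update of Section 6.5, Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)), with epsilon-greedy action selection. It is standalone and does not use the subare API; states and actions are plain indices.

import java.util.Random;

// standalone sketch of tabular Q-learning (Section 6.5); not the subare implementation,
// which keeps the algorithm separate from the problem scenario
class QLearningSketch {
  final double[][] q; // q[state][action]
  final double alpha; // step size
  final double gamma; // discount factor
  final double epsilon; // exploration rate
  final Random random = new Random();

  QLearningSketch(int states, int actions, double alpha, double gamma, double epsilon) {
    q = new double[states][actions];
    this.alpha = alpha;
    this.gamma = gamma;
    this.epsilon = epsilon;
  }

  // epsilon-greedy action selection
  int selectAction(int state) {
    if (random.nextDouble() < epsilon)
      return random.nextInt(q[state].length);
    return argmax(q[state]);
  }

  // Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
  void update(int state, int action, double reward, int nextState) {
    double target = reward + gamma * q[nextState][argmax(q[nextState])];
    q[state][action] += alpha * (target - q[state][action]);
  }

  static int argmax(double[] values) {
    int best = 0;
    for (int i = 1; i < values.length; ++i)
      if (values[i] > values[best])
        best = i;
    return best;
  }
}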
Figures: Prisoner's Dilemma | Exact Gambler | AV-Iteration q(s,a) | TabularQPlan | Monte Carlo | Q-Learning | Expected-Sarsa | Sarsa | 3-step Q-Learning | 3-step E-Sarsa | 3-step Sarsa | True Online Sarsa (original) | True Online Sarsa (expected) | True Online Sarsa (Q-learning)
Figures: Value Iteration v(s) | Action Value Iteration and optimal policy
Figures: Monte Carlo q(s,a) | ESarsa q(s,a) | QLearning q(s,a) | Monte Carlo Exploring Starts
Figures: AV-Iteration | TabularQPlan | Q-Learning | E-Sarsa | Sarsa | Monte Carlo
Paths obtained using value iteration:
Figures: track 1 | track 2
Figures: Action Value Iteration | TabularQPlan
Figures: Action Value Iteration | Q-Learning | TabularQPlan | Expected Sarsa
Figures: Action Value Iteration | Prioritized Sweeping
Figures: exact expected reward of two adversarial optimistic agents depending on their initial configuration
Figures: exact expected reward of two adversarial Upper-Confidence-Bound agents depending on their initial configuration
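For context, Upper-Confidence-Bound action selection follows Section 2.7 of the book: pick the action that maximizes Q(a) + c * sqrt(ln t / N(a)), where N(a) counts how often action a has been selected. The sketch below is standalone and not the subare implementation.

// standalone sketch of UCB action selection (Section 2.7); not the subare implementation
class UcbSketch {
  // returns the index of the action maximizing q[a] + c * sqrt(ln(t) / count[a]);
  // t is the current time step (t >= 1), c > 0 controls exploration
  static int ucbAction(double[] q, int[] count, int t, double c) {
    int best = 0;
    double bestValue = Double.NEGATIVE_INFINITY;
    for (int a = 0; a < q.length; ++a) {
      double value = count[a] == 0 //
          ? Double.POSITIVE_INFINITY // untried actions are selected first
          : q[a] + c * Math.sqrt(Math.log(t) / count[a]);
      if (value > bestValue) {
        bestValue = value;
        best = a;
      }
    }
    return best;
  }
}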
From time to time, a version is deployed and made available for Maven integration. Specify the repository and the dependency of the subare library in the pom.xml file of your Maven project:
<dependencies>
  <!-- other dependencies -->
  <dependency>
    <groupId>ch.alpine</groupId>
    <artifactId>subare</artifactId>
    <version>0.4.3</version>
  </dependency>
</dependencies>

<repositories>
  <!-- other repositories -->
  <repository>
    <id>subare-mvn-repo</id>
    <url>https://raw.github.com/datahaki/subare/mvn-repo/</url>
    <snapshots>
      <enabled>true</enabled>
      <updatePolicy>always</updatePolicy>
    </snapshots>
  </repository>
</repositories>
The source code is attached to every release.
The branch master always contains the latest features for Java 17, and generally does not correspond to the most recently deployed version.
Contributors: Jan Hakenberg, Christian Fluri
References:
- Learning to Operate a Fleet of Cars by Christian Fluri, Claudio Ruch, Julian Zilly, Jan Hakenberg, and Emilio Frazzoli
- Reinforcement Learning: An Introduction, 2nd edition, by Richard S. Sutton and Andrew G. Barto