This is the official repository of the ICLR 2025 paper OptionZero: Planning with Learned Options.
If you use this work for research, please consider citing our paper as follows:
@inproceedings{
huang2025optionzero,
title={OptionZero: Planning with Learned Options},
author={Po-Wei Huang and Pei-Chiun Peng and Hung Guei and Ti-Rong Wu},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=3IFRygQKGL}
}
This repository is built upon MiniZero and supports two algorithms, OptionZero and MuZero, within the atari and gridworld environments. The following provides the training results, trained models, and instructions to reproduce the experiments in the main text.
The training curves of OptionZero on 26 Atari games are shown below.
This section provides instructions for reproducing the experiments described in the OptionZero main text, including training OptionZero with a maximum option length of 9 on gridworld, as well as training OptionZero with maximum option lengths of 1, 3, and 6 on atari games.
The OptionZero program requires a Linux platform with at least one NVIDIA GPU to operate.
First, clone this repository:
git clone --branch option-public git@github.com:rlglab/minizero-dev.git minizero-option
cd minizero-option
Then, train models using the script tools/quick-run.sh:
tools/quick-run.sh train GAME_TYPE mz END_ITER -conf_file CONFIG_FILE -conf_str CONF_STR
- GAME_TYPE sets the target game, including GridWorld and Atari games, e.g., gridworld, breakout, ms_pacman. See below for a list of supported Atari games.
- END_ITER sets the total number of iterations for training, e.g., 300.
- CONFIG_FILE specifies a configuration file; we have provided configuration files for GridWorld (cfg/gridworld.cfg) and Atari (cfg/atari.cfg).
- CONF_STR sets additional configurations based on the configuration file, e.g., option_seq_length: the maximum option length of OptionZero.
Commands for reproducing experiments:
# Section 5.1: train GridWorld with maximum option length 9 for 500 iterations
tools/quick-run.sh train gridworld mz 500 -conf_file cfg/gridworld.cfg -conf_str option_seq_length=9
# For supported Atari games, please refer to "env_atari_name" in cfg/atari.cfg
# Section 5.2: train Ms.Pac-Man with maximum option length 1 for 300 iterations
tools/quick-run.sh train ms_pacman mz 300 -conf_file cfg/atari.cfg -conf_str option_seq_length=1
# Section 5.2: train Ms.Pac-Man with maximum option length 3 for 300 iterations
tools/quick-run.sh train ms_pacman mz 300 -conf_file cfg/atari.cfg -conf_str option_seq_length=3
# Section 5.2: train Ms.Pac-Man with maximum option length 6 for 300 iterations
tools/quick-run.sh train ms_pacman mz 300 -conf_file cfg/atari.cfg -conf_str option_seq_length=6
Other Tips
- For detailed arguments, run tools/quick-run.sh train -h.
- For more customized training settings, modify the hyperparameters in the provided configuration files.
- For detailed training instructions, please refer to the training document in MiniZero.
- To add a new environment, please follow the development document in MiniZero.
- Our implementation of the prior in the option selection stage differs slightly from the method described in the paper, though the two approaches are theoretically equivalent; see the sketch after this list.
  - In the paper, the prior of the primitive node is computed by subtracting the predicted probabilities from the policy network and the option network: $\tilde{P}(s^k, a^{k+1})=\max(0, P(s^k, a^{k+1})-P(s^k, o^{k+1}))$.
  - In our implementation, we exclude the first action's probability and let the option network predict the cumulative probability starting from the second action. Consequently, the prior of the primitive node is calculated as: $\tilde{P}(s^k, a^{k+1})=1-P(s^k, o^{k+1})$.
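The following is a minimal numeric sketch (not code from this repository) of how the two formulations relate. The probability values and variable names are hypothetical, and it assumes the option child's prior scales by the same factor in both formulations.

```python
# Hypothetical numbers illustrating the two primitive-prior formulations above.
p_first = 0.6   # P(s^k, a^{k+1}): policy probability of the option's first action
p_rest = 0.7    # cumulative probability of the remaining actions (the implementation's option output)

# Paper formulation: the option network outputs the full option probability,
# i.e., the first action's probability times that of the remaining actions.
p_option_full = p_first * p_rest                  # 0.42
prior_paper = max(0.0, p_first - p_option_full)   # 0.6 - 0.42 = 0.18

# Implementation: the option output excludes the first action, so the
# primitive prior is simply its complement.
prior_impl = 1.0 - p_rest                         # 0.3

# The two priors differ only by the factor p_first (0.18 == 0.6 * 0.3), so the
# relative weighting of the primitive child versus the option child is the same.
print(prior_paper, prior_impl, p_first * prior_impl)
```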
This section introduces the files generated by the training process. For example, the following command demonstrates training OptionZero with a maximum option length of 3 in Ms.Pac-Man:
# train Ms.Pac-Man with maximum option length 3 for 300 iterations
tools/quick-run.sh train ms_pacman mz 300 -conf_file cfg/atari.cfg -conf_str option_seq_length=3
You will obtain a folder:
# Format of folder name:
# "atari_ms_pacman": game name
# "mz": muzero algorithm
# "2bx64": network architecture, 2b represents 2 residual blocks and 64 represents 64 filters
# "n50": number of simulations used in MCTS
# "e0bfd9": git commit hash
# "l3": maximum option length
# "20250108112746": timestamp
atari_ms_pacman_mz_2bx64_n50-e0bfd9_l3_20250108112746
├── atari_ms_pacman_mz_2bx64_n50-e0bfd9_l3_20250108112746.cfg # configuration file
├── analysis/ # figures of the training process
│ ├── accuracy_policy.png # accuracy for policy network
│ ├── Lengths.png # self-play game lengths
│ ├── loss_option.png # loss for option network
│ ├── loss_policy.png # loss for policy network
│ ├── loss_reward.png # loss for reward network
│ ├── loss_state_consistency.png # loss for state consistency network
│ ├── loss_value.png # loss for value network
│ ├── Returns.png # self-play game returns
│ └── Time.png # elapsed training time
├── model/ # all network models produced by each optimization step
│ ├── *.pkl # include training step, parameters, optimizer, scheduler
│ └── *.pt # model parameters only (use for testing)
├── option_analysis/ # statistics of the usage of options in the last 100 completed games
│ ├── latest_100_games_sgf/ # last 100 completed self-play games
│ │ └── *.sgf # naming format: `[source iter]-[completed line number].sgf`, e.g., `300-249.sgf` means the extracted game is completed at line 249 of the 300th iteration
│ ├── stats/ # the raw statistics of the usage of options
│ ├── option_in_games.csv # proportions of options in games used in **Table 2**
│ └── option_in_trees.csv # proportions of options in search trees used in **Table 3**
├── sgf/ # self-play games of each iteration
│ └── *.sgf # `1.sgf`, `2.sgf`, ... for the 1st, the 2nd, ... iteration, respectively
├── sgf_record/ # statistics for the last n completed games of each iteration. If an iteration contains fewer than n games, the statistics are supplemented with games from the previous iteration. By default, n is set to 100.
│ ├── count.png # the average simulation number of using options in each search tree
│ ├── ratio.png # the proportions of each primitive action (labeled with numbers) and option (labeled as 'OP') used in the environment
│ ├── sgf_record.csv # statistics including min score, max score, median score, mean score, standard error, average game length, and total game number; the mean score is used in **Table 1**
│ ├── sgf_record.npz # the statistics as numpy arrays
│ └── sgf_record.png # average score curve during training
├── op.log # the optimization worker log
├── Training.log # the main training log
└── Worker.log # the worker connection log
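For custom analysis beyond the generated figures, the saved statistics can be loaded directly. Below is a minimal sketch (not part of the repository); it only assumes the sgf_record.npz path shown in the tree above, and lists whatever arrays sgf_analysis.py stored, since the key names are defined by that tool.

```python
# Minimal sketch: inspect the arrays saved in sgf_record.npz.
import numpy as np

run_dir = "atari_ms_pacman_mz_2bx64_n50-e0bfd9_l3_20250108112746"
data = np.load(f"{run_dir}/sgf_record/sgf_record.npz", allow_pickle=True)

# List the stored keys and array shapes before using any of them.
for key in data.files:
    print(key, np.asarray(data[key]).shape)
```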
There are some additional tools for further analysis:
- to-video.py: converts the self-play records of Atari games into videos. The videos are saved as *.mp4 in [OUTPUT_DIR].
# cat [INPUT_SGF] | tools/to-video.py -out_dir [OUTPUT_DIR]
cat atari_ms_pacman_mz_2bx64_n50-e0bfd9_l3_20250108112746/sgf/300.sgf | tools/to-video.py -out_dir atari_video
- sgf_analysis.py: analyzes the data in the self-play games stored in sgf/ and stores the statistics in sgf_record/. It runs automatically after each iteration during training.
# tools/sgf_analysis.py -in_dir [TRAINING_DIR] -out_dir [OUTPUT_DIR] -n [GAME_NUM] --save
tools/sgf_analysis.py -in_dir atari_ms_pacman_mz_2bx64_n50-e0bfd9_l3_20250108112746 -out_dir sgf_record -n 100 --save
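The exported CSV can then be post-processed with standard tooling. This is a minimal sketch (not part of the repository); the column layout is whatever sgf_analysis.py writes, so the header row is printed first rather than assuming specific column names.

```python
# Minimal sketch: read the CSV produced by sgf_analysis.py and show the header
# together with the last row (e.g., the most recent iteration's statistics).
import csv

with open("sgf_record/sgf_record.csv", newline="") as f:
    rows = list(csv.reader(f))

print(rows[0])   # header written by sgf_analysis.py
print(rows[-1])  # last row of statistics
```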