This is the official repository of the ICLR 2025 paper OptionZero: Planning with Learned Options.
If you use this work for research, please consider citing our paper as follows:
@inproceedings{
huang2025optionzero,
title={OptionZero: Planning with Learned Options},
author={Po-Wei Huang and Pei-Chiun Peng and Hung Guei and Ti-Rong Wu},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=3IFRygQKGL}
}
This repository is built upon MiniZero and supports two algorithms, OptionZero and MuZero, within the atari and gridworld environments. The following provides the training results, trained models, and instructions to reproduce the experiments in the main text.
The training curves of OptionZero on 26 Atari games are shown below.
This section provides instructions for reproducing the experiments described in the OptionZero main text, including training OptionZero with a maximum option length of 9 on gridworld, as well as training OptionZero with maximum option lengths of 1, 3, and 6 on atari games.
The OptionZero program requires a Linux platform with at least one NVIDIA GPU to operate.
First, clone this repository:
git clone --branch option-public git@github.com:rlglab/minizero-dev.git minizero-option
cd minizero-option
Then, train models using the script tools/quick-run.sh:
tools/quick-run.sh train GAME_TYPE mz END_ITER -conf_file CONFIG_FILE -conf_str CONF_STR
- GAME_TYPE sets the target game, including GridWorld and Atari games, e.g., gridworld, breakout, ms_pacman. See below for a list of supported Atari games.
- END_ITER sets the total number of iterations for training, e.g., 300.
- CONFIG_FILE specifies a configuration file; we have provided configuration files for GridWorld (cfg/gridworld.cfg) and Atari (cfg/atari.cfg).
- CONF_STR sets additional configurations based on the configuration file, e.g., option_seq_length: the maximum option length of OptionZero.
Commands for reproducing experiments:
# Section 5.1: train GridWorld with maximum option length 9 for 500 iterations
tools/quick-run.sh train gridworld mz 500 -conf_file cfg/gridworld.cfg -conf_str option_seq_length=9
# For supported Atari games, please refer to "env_atari_name" in cfg/atari.cfg
# Section 5.2: train Ms.Pac-Man with maximum option length 1 for 300 iterations
tools/quick-run.sh train ms_pacman mz 300 -conf_file cfg/atari.cfg -conf_str option_seq_length=1
# Section 5.2: train Ms.Pac-Man with maximum option length 3 for 300 iterations
tools/quick-run.sh train ms_pacman mz 300 -conf_file cfg/atari.cfg -conf_str option_seq_length=3
# Section 5.2: train Ms.Pac-Man with maximum option length 6 for 300 iterations
tools/quick-run.sh train ms_pacman mz 300 -conf_file cfg/atari.cfg -conf_str option_seq_length=6
Other Tips
- For detailed arguments, run tools/quick-run.sh train -h.
- For more customized training settings, modify the hyperparameters in the provided configuration files.
- For detailed training instructions, please refer to the training document in MiniZero.
- To add a new environment, please follow the development document in MiniZero.
- Our implementation of the prior in the option selection stage differs slightly from the method described in the paper, though the two approaches are theoretically equivalent; see the sketch after this list.
  - In the paper, the prior of the primitive node is computed by subtracting the predicted probabilities from the policy network and the option network: $\tilde{P}(s^k, a^{k+1})=\max(0, P(s^k, a^{k+1})-P(s^k, o^{k+1}))$.
  - In our implementation, we exclude the first action's probability and let the option network predict the cumulative probability starting from the second action. Consequently, the prior of the primitive node is calculated as: $\tilde{P}(s^k, a^{k+1})=1-P(s^k, o^{k+1})$.
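The following is a minimal numeric sketch (not code from this repository) of how the two formulations relate. The probability values and variable names are hypothetical, and it assumes the option child's prior scales by the same factor in both formulations.

```python
# Hypothetical numbers illustrating the two primitive-prior formulations above.
p_first = 0.6   # P(s^k, a^{k+1}): policy probability of the option's first action
p_rest = 0.7    # cumulative probability of the remaining actions (the implementation's option output)

# Paper formulation: the option network outputs the full option probability,
# i.e., the first action's probability times that of the remaining actions.
p_option_full = p_first * p_rest                  # 0.42
prior_paper = max(0.0, p_first - p_option_full)   # 0.6 - 0.42 = 0.18

# Implementation: the option output excludes the first action, so the
# primitive prior is simply its complement.
prior_impl = 1.0 - p_rest                         # 0.3

# The two priors differ only by the factor p_first (0.18 == 0.6 * 0.3), so the
# relative weighting of the primitive child versus the option child is the same.
print(prior_paper, prior_impl, p_first * prior_impl)
```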
This section introduces the files generated by the training process. For example, the following command demonstrates training OptionZero with a maximum option length of 3 in Ms.Pac-Man:
# train Ms.Pac-Man with maximum option length 3 for 300 iterations
tools/quick-run.sh train ms_pacman mz 300 -conf_file cfg/atari.cfg -conf_str option_seq_length=3
You will obtain a folder:
# Format of folder name:
# "atari_ms_pacman": game name
# "mz": muzero algorithm
# "2bx64": network architecture, 2b represents 2 residual blocks and 64 represents 64 filters
# "n50": number of simulations used in MCTS
# "e0bfd9": git commit hash
# "l3": maximum option length
# "20250108112746": timestamp
atari_ms_pacman_mz_2bx64_n50-e0bfd9_l3_20250108112746
├── atari_ms_pacman_mz_2bx64_n50-e0bfd9_l3_20250108112746.cfg # configuration file
├── analysis/ # figures of the training process
│ ├── accuracy_policy.png # accuracy for policy network
│ ├── Lengths.png # self-play game lengths
│ ├── loss_option.png # loss for option network
│ ├── loss_policy.png # loss for policy network
│ ├── loss_reward.png # loss for reward network
│ ├── loss_state_consistency.png # loss for state consistency network
│ ├── loss_value.png # loss for value network
│ ├── Returns.png # self-play game returns
│ └── Time.png # elapsed training time
├── model/ # all network models produced by each optimization step
│ ├── *.pkl # include training step, parameters, optimizer, scheduler
│ └── *.pt # model parameters only (use for testing)
├── option_analysis/ # statistics of the usage of options in the last 100 completed games
│ ├── latest_100_games_sgf/ # last 100 completed self-play games
│ │ └── *.sgf # naming format: `[source iter]-[completed line number].sgf`, e.g., `300-249.sgf` means the extracted game is completed at line 249 of the 300th iteration
│ ├── stats/ # the raw statistics of the usage of options
│ ├── option_in_games.csv # proportions of options in games used in **Table 2**
│ └── option_in_trees.csv # proportions of options in search trees used in **Table 3**
├── sgf/ # self-play games of each iteration
│ └── *.sgf # `1.sgf`, `2.sgf`, ... for the 1st, the 2nd, ... iteration, respectively
├── sgf_record/ # statistics for the last n completed games of each iteration. If an iteration contains fewer than n games, the statistics are supplemented with games from the previous iteration. By default, n is set to 100.
│ ├── count.png # the average simulation number of using options in each search tree
│ ├── ratio.png # the proportions of each primitive action (labeled with numbers) and option (labeled as 'OP') used in the environment
│ ├── sgf_record.csv # statistics including min score, max score, median score, mean score, standard error, average game length, and total game number; the mean score is used in **Table 1**
│ ├── sgf_record.npz # the statistics as numpy arrays
│ └── sgf_record.png # average score curve during training
├── op.log # the optimization worker log
├── Training.log # the main training log
└── Worker.log # the worker connection log
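For custom analysis beyond the generated figures, the saved statistics can be loaded directly. Below is a minimal sketch (not part of the repository); it only assumes the sgf_record.npz path shown in the tree above, and lists whatever arrays sgf_analysis.py stored, since the key names are defined by that tool.

```python
# Minimal sketch: inspect the arrays saved in sgf_record.npz.
import numpy as np

run_dir = "atari_ms_pacman_mz_2bx64_n50-e0bfd9_l3_20250108112746"
data = np.load(f"{run_dir}/sgf_record/sgf_record.npz", allow_pickle=True)

# List the stored keys and array shapes before using any of them.
for key in data.files:
    print(key, np.asarray(data[key]).shape)
```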
There are some additional tools for further analysis:
- to-video.py: converts the self-play records of Atari games into videos. The videos are saved as *.mp4 in [OUTPUT_DIR].
# cat [INPUT_SGF] | tools/to-video.py -out_dir [OUTPUT_DIR]
cat atari_ms_pacman_mz_2bx64_n50-e0bfd9_l3_20250108112746/sgf/300.sgf | tools/to-video.py -out_dir atari_video
- sgf_analysis.py: analyzes the data in the self-play games stored in sgf/ and stores the statistics in sgf_record/. It runs automatically after each iteration during training.
# tools/sgf_analysis.py -in_dir [TRAINING_DIR] -out_dir [OUTPUT_DIR] -n [GAME_NUM] --save
tools/sgf_analysis.py -in_dir atari_ms_pacman_mz_2bx64_n50-e0bfd9_l3_20250108112746 -out_dir sgf_record -n 100 --save
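The exported CSV can then be post-processed with standard tooling. This is a minimal sketch (not part of the repository); the column layout is whatever sgf_analysis.py writes, so the header row is printed first rather than assuming specific column names.

```python
# Minimal sketch: read the CSV produced by sgf_analysis.py and show the header
# together with the last row (e.g., the most recent iteration's statistics).
import csv

with open("sgf_record/sgf_record.csv", newline="") as f:
    rows = list(csv.reader(f))

print(rows[0])   # header written by sgf_analysis.py
print(rows[-1])  # last row of statistics
```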