Towards Understanding How Machines Can Learn Causal Overhypotheses (Submission: NeurIPS Baselines 2022)
This repository hosts the code for the blicket-environment baselines presented in the paper Towards Understanding How Machines Can Learn Causal Overhypotheses.
The environment is a standard gym environment located at envs/causal_env_v0.py.
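The environment follows the standard gym API. The sketch below shows a minimal interaction loop; the class name and constructor arguments are assumptions, so check envs/causal_env_v0.py for the actual interface.

```python
# Minimal sketch of interacting with the blicket environment via the gym API.
# The class name and constructor arguments are assumptions -- see
# envs/causal_env_v0.py for the actual interface.
from envs.causal_env_v0 import CausalEnv_v0

env = CausalEnv_v0()  # constructor arguments, if any, are defined in envs/causal_env_v0.py
obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random actions, for illustration only
    obs, reward, done, info = env.step(action)
env.close()
```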
The following instructions describe how to run the experiments from the paper.
To install the environment, we require Python <= 3.7 and tensorflow/tensorflow-gpu < 2 (usually 1.15.5). This is due to a dependency on the old version of stable-baselines (the newer version does not support LSTM policies). Dependencies can be installed with:
pip install tensorflow==1.15.5 stable-baselines gym==0.21.0 protobuf==3.20 tqdm
To run the Q-learning experiments, use the command python models/q_learning.py with the appropriate options (an example invocation follows the usage listing):
usage: q_learning.py [-h] [--num NUM] [--alpha ALPHA] [--discount DISCOUNT] [--epsilon EPSILON]
Train a q-learner
optional arguments:
-h, --help show this help message and exit
--num NUM Number of times to experiment
--alpha ALPHA Learning rate
--discount DISCOUNT Discount factor
--epsilon EPSILON Epsilon-greedy exploration rate
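For example, the following command runs 100 independent Q-learning experiments; the hyperparameter values are illustrative, not the settings used in the paper:
python models/q_learning.py --num 100 --alpha 0.1 --discount 0.99 --epsilon 0.1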
To train the standard RL models, use the command python driver.py with the appropriate options (an example invocation follows the option listings):
usage: driver.py [-h] [--alg ALG] [--policy POLICY] [--lstm_units LSTM_UNITS]
[--num_steps NUM_STEPS]
[--quiz_disabled_steps QUIZ_DISABLED_STEPS]
[--holdout_strategy HOLDOUT_STRATEGY]
[--reward_structure REWARD_STRUCTURE]
Train a model
optional arguments:
-h, --help show this help message and exit
--alg ALG Algorithm to use
--policy POLICY Policy to use
--lstm_units LSTM_UNITS
Number of LSTM units
--num_steps NUM_STEPS
Number of training steps
--quiz_disabled_steps QUIZ_DISABLED_STEPS
Number of quiz disabled steps (-1 for no forced
exploration)
--holdout_strategy HOLDOUT_STRATEGY
Holdout strategy
--reward_structure REWARD_STRUCTURE
Reward structure
Algorithm Choices: [a2c, ppo2]
Policy Choices: [mlp, mlp_lstm, mlp_lnlstm]
Holdout Strategy Choices: [none, disjunctive_train (Only disjunctive overhypotheses), conjunctive_train (Only conjunctive overhypotheses), disjunctive_loo (Only disjunctive, leave one out), conjunctive_loo (Only conjunctive, leave one out), both_loo (Leave one out for both)]
Reward Structure Choices: [baseline (Light up the blicket detector), quiz (Determine which are blickets), quiz-type (Determine which are blickets + Causal vs. Non-Causal), quiz-typeonly (Causal vs. Non-Causal only)]
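For example, the following command trains an A2C agent with an LSTM policy on the quiz reward structure and no forced exploration; the number of LSTM units shown is illustrative:
python driver.py --alg a2c --policy mlp_lstm --lstm_units 32 --num_steps 3000000 --quiz_disabled_steps -1 --holdout_strategy none --reward_structure quiz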
This will produce a model output file {model_name}.zip based on the options chosen during the training process. Evaluation data is printed during training, and training is terminated when 3M training steps are reached.
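A saved checkpoint can later be reloaded for evaluation through the standard stable-baselines (v2.x) loading API. The sketch below is illustrative; the checkpoint file name is a placeholder.

```python
# Sketch: reload a saved checkpoint with the stable-baselines (v2.x) API.
# The checkpoint file name is a placeholder.
from stable_baselines import A2C  # use PPO2 if the model was trained with --alg ppo2

model = A2C.load("model_name.zip")
# model.predict(obs, state=state, mask=[done]) can then be used for evaluation.
# Note: recurrent (LSTM) policies expect observations batched to the number of
# environments used at training time.
```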
Training the decision transformer consists of two (or three) steps. The first step is to generate trajectories for training. This can be done with scripts/collect_trajectories.py, which has the following signature (an example invocation follows the listing):
usage: collect_trajectories.py [-h] [--env ENV] [--model_path MODEL_PATH]
[--num_trajectories NUM_TRAJECTORIES]
[--max_steps MAX_STEPS]
[--quiz_disabled_steps QUIZ_DISABLED_STEPS]
[--output_path OUTPUT_PATH]
Collect Trajectories from Causal Environments
optional arguments:
-h, --help show this help message and exit
--env ENV Environment to use
--model_path MODEL_PATH
Path to model
--num_trajectories NUM_TRAJECTORIES
Number of trajectories to collect
--max_steps MAX_STEPS
Maximum number of steps per trajectory
--quiz_disabled_steps QUIZ_DISABLED_STEPS
Number of steps to disable quiz
--output_path OUTPUT_PATH
Path to output file
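For example, to collect trajectories from a trained checkpoint (the environment name, checkpoint path, and counts are placeholders):
python scripts/collect_trajectories.py --env causal --model_path model_name.zip --num_trajectories 1000 --max_steps 30 --output_path trajectories.pkl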
The model path generated by the training step for the standard RL models can be passed to this script to generate samples from the pre-trained model. This will generate a file trajectories.pkl. This pkl file should be renamed to take the form causal-<name>-v2.pkl and then moved to the models/decision-transformer/data folder.
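For example, for trajectories collected from an MLP policy (the name mlp is illustrative):
mv trajectories.pkl models/decision-transformer/data/causal-mlp-v2.pkl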
To train the decision transformer, follow the instructions in models/decision-transformer to install the conda environment, and then run python experiment.py with the option --env causal and a --dataset option corresponding to the chosen value for <name> above (for example, if the trajectories are named causal-mlp-v2.pkl, the command would be python experiment.py --env causal --dataset mlp --batch_size 128 --K 30 --model dt). The batch size, K, and model (dt: decision transformer, bc: behavior cloning) can be adjusted as desired.
This code is licensed under the MIT license by the Regents of the University of California. The code we use for the Decision Transformer benchmark is also licensed under the MIT license by the authors of https://arxiv.org/abs/2106.01345.