Here we'll describe in detail the full set of command line flags available for preprocessing, training, and sampling.
The preprocessing script `scripts/preprocess.py` accepts the following command-line flags:
- `--input_txt`: Path to the text file to be used for training. Default is the `tiny-shakespeare.txt` dataset.
- `--input_folder`: Path to a folder containing `.txt` files to use for training. Overrides the `--input_txt` option.
- `--output_h5`: Path to the HDF5 file where preprocessed data should be written.
- `--output_json`: Path to the JSON file where preprocessed data should be written.
- `--val_frac`: What fraction of the data to use as a validation set; default is `0.1`.
- `--test_frac`: What fraction of the data to use as a test set; default is `0.1`.
- `--quiet`: If you pass this flag then no output will be printed to the console.
- `--use_words`: Passing this flag preprocesses the input as word tokens rather than characters. Using it activates the additional options below (they are ignored otherwise).
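For example, a character-level preprocessing run might look like the following sketch (the input and output paths are placeholders):

```bash
# Character-level preprocessing; my_data.txt and the output paths are placeholders
python scripts/preprocess.py \
  --input_txt my_data.txt \
  --output_h5 my_data.h5 \
  --output_json my_data.json \
  --val_frac 0.1 --test_frac 0.1
```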
## Preprocessing Word Tokens
- `--case_sensitive`: Makes word tokens case-sensitive. The default is to convert everything to lowercase for words; character tokens are ALWAYS case-sensitive.
- `--min_occurrences`: Minimum number of times a word needs to be seen to be given a token. Default is 20.
- `--min_documents`: Minimum number of documents a word needs to be seen in to be given a token. Default is 1.
- `--use_ascii`: Convert the input files to ASCII by removing all non-ASCII characters. Default is Unicode.
- `--wildcard_rate`: Number of wildcards generated as a fraction of ignored words. For example, `0.01` will generate 1 percent of the number of ignored words as wildcards. Default is `0.01`.
- `--wildcard_max`: If set, the maximum number of wildcards that will be generated. Default is unlimited.
- `--wildcard_min`: Minimum number of wildcards that will be generated. Cannot be less than 1. Default is 10.
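A word-level preprocessing run over a folder of text files might look like this sketch (the corpus and output paths are placeholders):

```bash
# Word-level preprocessing over a folder of .txt files; all paths are placeholders
python scripts/preprocess.py \
  --input_folder data/my_corpus \
  --use_words --min_occurrences 20 --wildcard_rate 0.01 \
  --output_h5 my_corpus.h5 \
  --output_json my_corpus.json
```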
The training script `train.lua` accepts the following command-line flags:
Data options:
- `-input_h5`, `-input_json`: Paths to the HDF5 and JSON files output from the preprocessing script.
- `-batch_size`: Number of sequences to use in a minibatch; default is 50.
- `-seq_length`: Number of timesteps for which the recurrent network is unrolled for backpropagation through time.
Model options:
- `-resume_from`: Path to a resume checkpoint from a previous run of `train.lua`. Use this to pick up training with the EXACT same options and state from a previous training run that was interrupted. If this flag is passed then ALL OTHER options will be ignored.
- `-init_from`: Path to a checkpoint file from a previous run of `train.lua`. Use this to continue training from an existing checkpoint; if this flag is passed then ONLY the other flags in THIS SECTION will be ignored, and the architecture from the existing checkpoint will be used instead. See the sketch after this list.
- `-reset_iterations`: Set this to 0 to restore the iteration counter of a previous run. Default is 1 (do not restore the iteration counter). Only applicable if the `-init_from` option is used.
- `-model_type`: The type of recurrent network to use; either `lstm` (default) or `rnn`. `lstm` is slower but better.
- `-wordvec_size`: Dimension of the learned word vector embeddings; default is 64. You probably won't need to change this.
- `-rnn_size`: The number of hidden units in the RNN; default is 128. Larger values (256 or 512) are commonly used to learn more powerful models and for bigger datasets, but this will significantly slow down computation.
- `-dropout`: Amount of dropout regularization to apply after each RNN layer; must be in the range `0 <= dropout < 1`. Setting `dropout` to 0 disables dropout, and higher numbers give a stronger regularizing effect.
- `-num_layers`: The number of layers present in the RNN; default is 2.
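A short sketch of the two ways to continue training (all paths are placeholders, and it is assumed here that `-resume_from` expects the `_resume.json` file saved alongside each checkpoint):

```bash
# Pick up an interrupted run with the exact same options and state;
# assumes the *_resume.json file written next to each checkpoint is what -resume_from expects
th train.lua -resume_from cv/checkpoint_4000_resume.json

# Continue from an existing checkpoint with new optimization settings,
# keeping the iteration counter from the previous run
th train.lua -input_h5 my_data.h5 -input_json my_data.json \
  -init_from cv/checkpoint_4000.t7 -reset_iterations 0 -learning_rate 1e-3
```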
Optimization options:
- `-max_epochs`: How many training epochs to use for optimization. Default is 50.
- `-learning_rate`: Learning rate for optimization. Default is `2e-3`.
- `-grad_clip`: Maximum value for gradients; default is 5. Set to 0 to disable gradient clipping.
- `-lr_decay_every`: How often to decay the learning rate, in epochs; default is 5.
- `-lr_decay_factor`: How much to decay the learning rate. After every `lr_decay_every` epochs, the learning rate will be multiplied by `lr_decay_factor`; default is 0.5.
Output options:
- `-print_every`: How often to print status messages, in iterations. Follow a number with 'e' to specify epochs rather than batches, e.g. `-print_every 0.1e`. Default is 1000.
- `-checkpoint_name`: Base filename for saving checkpoints; default is `cv/checkpoint`. This will create checkpoints named `cv/checkpoint_1000.t7`, `cv/checkpoint_1000_log.json`, `cv/checkpoint_1000_resume.json`, etc.
- `-checkpoint_every`: How often to save intermediate checkpoints. Default is 1000; set to 0 to disable intermediate checkpointing. Follow a number with 'e' to specify epochs rather than batches, e.g. `-checkpoint_every 1e`. Note that we always save a checkpoint on the final iteration of training.
- `-checkpoint_log`: Set to 0 to disable log checkpoints, which contain the loss function history for every minibatch and epoch.
Benchmark options:
- `-speed_benchmark`: Set this to 1 to test the speed of the model at every iteration. This is disabled by default because it requires synchronizing the GPU at every iteration, which incurs a performance overhead. Speed benchmarking results will be printed and also stored in saved checkpoints.
- `-memory_benchmark`: Set this to 1 to test the GPU memory usage at every iteration. This is disabled by default because, like speed benchmarking, it requires GPU synchronization. Memory benchmarking results will be printed and also stored in saved checkpoints. Only available when running in GPU mode.
Backend options:
- `-gpu`: The ID of the GPU to use (zero-indexed). Default is 0. Set this to -1 to run in CPU-only mode.
- `-gpu_backend`: The GPU backend to use; either `cuda` or `opencl`. Default is `cuda`.
- `-cudnn`: Set to 1 to enable CUDNN support. Note: this prevents the network from EVER being used on CPU or OpenCL.
- `-cudnn_fastest`: Set to 1 to enable the fastest algorithm CUDNN offers. This will usually use more memory than allowing CUDNN to make that choice by default.
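Putting these together, a full training run might look like the following sketch (the data paths and checkpoint name are placeholders):

```bash
# Train a 2-layer LSTM on GPU 0, checkpointing every 1000 iterations; paths are placeholders
th train.lua \
  -input_h5 my_data.h5 -input_json my_data.json \
  -model_type lstm -num_layers 2 -rnn_size 256 -dropout 0.25 \
  -batch_size 50 -seq_length 50 \
  -max_epochs 50 -learning_rate 2e-3 -lr_decay_every 5 -lr_decay_factor 0.5 \
  -checkpoint_name cv/my_checkpoint -checkpoint_every 1000 \
  -gpu 0 -gpu_backend cuda
```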
The sampling script `sample.lua` accepts the following command-line flags:
- `-checkpoint`: Path to a `.t7` checkpoint file from `train.lua`.
- `-length`: The length of the generated text, in characters.
- `-start_text`: You can optionally start off the generation process with a string; if this is provided, the start text will be processed by the trained network before we start sampling. Without this flag or the `-start_tokens` flag, the first character is chosen randomly.
- `-start_tokens`: As an alternative to `-start_text` for word-based tokenizing, accepts a JSON file generated by `scripts/tokenize.py` which contains tokens for the start text. Without this flag or the `-start_text` flag, the first character is chosen randomly.
- `-sample`: Set this to 1 to sample from the next-character distribution at each timestep; set to 0 to instead just pick the argmax at every timestep. Sampling tends to produce more interesting results.
- `-temperature`: Softmax temperature to use when sampling; default is 1. Higher temperatures give noisier samples. Not used with argmax sampling (`-sample` set to 0).
- `-gpu`: The ID of the GPU to use (zero-indexed). Default is 0. Set this to -1 to run in CPU-only mode.
- `-gpu_backend`: The GPU backend to use; either `cuda` or `opencl`. Default is `cuda`.
- `-verbose`: By default just the sampled text is printed to the console. Set this to 1 to also print some diagnostic information.
- `-stream`: By default the sampled text is buffered and printed in one go. Set this to 1 to disable buffering and stream the sampled text one character at a time.
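A typical sampling invocation might look like this sketch (the checkpoint path is a placeholder):

```bash
# Sample 2000 characters from a trained checkpoint; the checkpoint path is a placeholder
th sample.lua \
  -checkpoint cv/my_checkpoint_10000.t7 \
  -length 2000 -temperature 0.9 \
  -start_text "To be, or not to be" \
  -gpu 0
```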
The tokenizing script `scripts/tokenizeWords.py` accepts the following command-line flags:
- `--input_str`: The string to tokenize, as a quoted block, e.g. `--input_str "lorem ipsum"`.
- `--input_txt`: Path to the text file to be used for training. Default is the `tiny-shakespeare.txt` dataset.
- `--input_folder`: Path to a folder containing `.txt` files to use for training. Overrides the `--input_txt` option.
- `--input_json`: The JSON output from `scripts/preprocessWords.py` to use to tokenize the string.
- `--output_json`: Optional. The output JSON file to save the tokenization to.
- `--output_h5`: Optional. The path to the HDF5 file where preprocessed data should be written.
- `--val_frac`: What fraction of the data to use as a validation set; default is `0.1`.
- `--test_frac`: What fraction of the data to use as a test set; default is `0.1`.
- `--quiet`: If you pass this flag then no output will be printed to the console except in case of error.
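For example, tokenizing a short start string against an existing vocabulary JSON might look like this sketch (the JSON paths are placeholders); the resulting file can then be passed to `sample.lua` via `-start_tokens`:

```bash
# Tokenize a start string using an existing vocabulary; the JSON paths are placeholders
python scripts/tokenizeWords.py \
  --input_str "lorem ipsum" \
  --input_json my_corpus.json \
  --output_json start_tokens.json
```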