This is a Texar implementation of the OpenAI GPT-2 (Generative Pre-Training) language model, which allows loading the official pre-trained model parameters, generating samples, fine-tuning the model, etc.
With Texar, building the GPT-2 model is as simple as creating a `TransformerDecoder` instance. We can initialize the parameters of the `TransformerDecoder` using a pre-trained GPT-2 checkpoint by calling `init_gpt2_checkpoint(path_to_gpt2_checkpoint)`.
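For orientation, here is a minimal sketch of that workflow. The config attribute names (`vocab_size`, `embed`, `position_size`, `pos_embed`, `decoder`) and the import path of `init_gpt2_checkpoint` are placeholders; see `configs/config_model_117M.py` and the example code for the exact names and signature.

```python
import tensorflow as tf
import texar as tx

from configs import config_model_117M as gpt2_config  # hparams (attribute names below are placeholders)

# GPT-2 input representation: token embedding + position embedding
word_embedder = tx.modules.WordEmbedder(
    vocab_size=gpt2_config.vocab_size, hparams=gpt2_config.embed)
pos_embedder = tx.modules.PositionEmbedder(
    position_size=gpt2_config.position_size, hparams=gpt2_config.pos_embed)

# The GPT-2 Transformer stack, with the output layer tied to the input embedding
decoder = tx.modules.TransformerDecoder(
    output_layer=tf.transpose(word_embedder.embedding),
    hparams=gpt2_config.decoder)

# Load the downloaded OpenAI weights into the Texar variables
# (helper provided by this example; exact import path/signature may differ)
init_gpt2_checkpoint("gpt2_pretrained_models/model_117M/model.ckpt")
```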
In sum, this example showcases:

- Constructing and using pre-trained GPT-2 models in Texar
- Using GPT-2 to generate text samples with or without context
- Training or fine-tuning the model with distributed GPUs
- Examples of other use cases
Download the GPT-2 `117M` model checkpoint with the following command:

```
sh gpt2_pretrained_models/download_model.sh model_117M
```

By default, it will download a pretrained model named `model_117M` to `gpt2_pretrained_models/`.

To download the GPT-2 `345M` model checkpoint, use:

```
sh gpt2_pretrained_models/download_model.sh model_345M
```
**WARNING: Samples are unfiltered and may contain offensive content.**
The interactive mode launches an interface that lets users type in a context sentence; the model then generates a continuation of the context. Top-k sample decoding is used. By default, the GPT-2 `117M` model is used.
```
python gpt2_generate_main.py --is_interactive \
  --max_decoding_length=100 \
  --temperature=0.7 \
  --top_k=40
```
Here:

- `is_interactive`: Specifies interactive mode.
- `max_decoding_length`: The maximum number of tokens in a sample. Note that this includes tokens in the context.
- `temperature`: Softmax temperature for top-k sample decoding. Larger values (above 1.0) result in more random samples, while smaller values push the sampling distribution towards the argmax. Must be strictly greater than 0. Defaults to `0.7`.
- `top_k`: Number of top most likely candidates from the vocabulary distribution at each decoding step. Defaults to `40` (see the sketch below this list for how `top_k` and `temperature` interact).
- `nsamples`: Number of samples to generate for each input.
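To make `temperature` and `top_k` concrete: at each decoding step, the logits are rescaled by the temperature and a token is sampled from the k most likely candidates only. A small NumPy illustration (not the actual code path of `gpt2_generate_main.py`):

```python
import numpy as np

def top_k_sample(logits, top_k=40, temperature=0.7, rng=np.random):
    """Sample one token id from the top_k highest logits after temperature scaling."""
    logits = np.asarray(logits, dtype=np.float64) / temperature
    top_ids = np.argsort(logits)[-top_k:]      # ids of the k most likely tokens
    probs = np.exp(logits[top_ids] - logits[top_ids].max())
    probs /= probs.sum()                       # softmax restricted to the top-k candidates
    return int(rng.choice(top_ids, p=probs))

# Lower temperature -> closer to the argmax; higher temperature -> more random samples.
print(top_k_sample(logits=[2.0, 1.0, 0.5, -1.0], top_k=2, temperature=0.7))
```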
To use the GPT-2 `345M` model, specify `--config_model`, `--pretrain_checkpoint`, and `--pretrain_model_dir`:
```
python gpt2_generate_main.py --is_interactive \
  --max_decoding_length=100 \
  --temperature=0.7 \
  --top_k=40 \
  --config_model=configs.config_model_345M \
  --pretrain_checkpoint=gpt2_pretrained_models/model_345M/model.ckpt \
  --pretrain_model_dir=gpt2_pretrained_models/model_345M
```
Here:

- `pretrain_checkpoint`: Path to the model checkpoint. Defaults to `gpt2_pretrained_models/model_117M/model.ckpt`.
- `config_model`: Model configuration file. Defaults to `configs.config_model_117M`.
- `pretrain_model_dir`: The directory of the pretrained model, used for loading the vocabulary, etc. Defaults to `gpt2_pretrained_models/model_117M`.
Example input:

```
Model input >>> Micheal Jordan is the greatest player in history !
```
Example output:

```
======================================== SAMPLE 1 ========================================

He's the one who has made all the difference. He's a true legend. He's a great athlete,
a great athlete. He's a great athlete. I'm so happy for him. I'm so happy for his family,
the family, and I'm so happy for him. I'm so happy for his teammates, his teammates, and
I'm so happy for him.

The last time we saw him on stage, he
================================================================================
```
Without the `--is_interactive` flag, the script generates a batch of samples from scratch:
```
python gpt2_generate_main.py \
  --nsamples=1 \
  --batch_size=1 \
  --max_decoding_length=100 \
  --temperature=0.7 \
  --top_k=40
```
Here:

- `nsamples`: Total number of samples to generate; must be divisible by `batch_size`.
- `batch_size`: Number of samples generated in each iteration.
To use the GPT-2 `345M` model, specify `--pretrain_checkpoint`, `--config_model`, and `--pretrain_model_dir` as above.
Example output:

```
"A new government and a healthy economy have a chance to take this up."

After he said the election's outcome in the House was important and had helped to build
confidence in the House, former Ukip leader Nigel Farage spoke about working to boost
the economy, saying the vote for the "lefties" and others "were bad optics for Labour
in this way".
```
This section shows how we can fine-tune the pre-trained GPT-2 model and use the resulting model for generation.
First of all, download the pre-trained model as above.
We first preprocess data with the GPT-2 BPE encoding. A toy dataset is provided under `data/toy/`, which includes `train.txt`, `dev.txt`, and `test.txt`. This example will fit the GPT-2 model on `train.txt`, evaluate perplexity on `dev.txt`, and do continuation generation using `test.txt` as the context.
Run the following command to transform the data into TFRecord format and perform processing such as truncation, BPE encoding, adding special tokens, etc.:

```
python prepare_data.py --data_dir data/toy
  [--max_seq_length=128]
  [--tfrecord_output_dir=data/toy]
  [--pretrain_model_dir=gpt2_pretrained_models/model_117M]
```
- `data_dir`: The directory of raw data, wherein the data files must be named `train.txt`, `dev.txt`, or `test.txt`. It is not necessary to provide all three files.
- `max_seq_length`: The maximum length of a sequence after BPE encoding. This includes the GPT-2 special tokens that are added automatically. Longer sequences are truncated.
- `tfrecord_output_dir`: The output path where the resulting TFRecord files are written. By default, it is set to the same as `data_dir`.
- `pretrain_model_dir`: The downloaded pretrained model directory, whose vocabulary files are used for data processing.
The above command will output TFRecord files in the specified output directory. E.g., if `train.txt` is provided under `data_dir`, the output file `train.tf_record` will be produced under `tfrecord_output_dir`.
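To sanity-check the output, you can list the feature keys stored in the first record (a small TF 1.x sketch; the feature names are determined by `prepare_data.py`, so they are printed rather than assumed):

```python
import tensorflow as tf

# Path follows the defaults above; adjust if you changed --tfrecord_output_dir
record_path = "data/toy/train.tf_record"

for serialized in tf.python_io.tf_record_iterator(record_path):
    example = tf.train.Example.FromString(serialized)
    print(sorted(example.features.feature.keys()))  # names of the stored features
    break
```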
For single-GPU training (and evaluation), run the following command, which fine-tunes the pre-trained GPT-2 parameters and evaluates perplexity on the dev set.
```
python gpt2_train_main.py --do_train --do_eval
  [--config_train=configs.config_train]
  [--output_dir=output]
```
Here:

- `config_train`: Configurations of GPT-2 training, including data and optimization hyperparameters. By default, the config file `configs/config_train.py` is used. Remember to specify the correct data path if you are using your own data.
- `output_dir`: The output path where checkpoints are saved.
By default, the GPT-2 `117M` model is used. To use the GPT-2 `345M` model instead, specify the relevant FLAGS as below:
```
python gpt2_train_main.py --do_train --do_eval \
  --config_model=configs.config_model_345M \
  --pretrain_model_dir=gpt2_pretrained_models/model_345M \
  --pretrain_checkpoint=gpt2_pretrained_models/model_345M/model.ckpt \
  [--config_train=configs.config_train]
  [--output_dir=output]
```
where `--pretrain_checkpoint` is necessary only when you want to load the pretrained checkpoint, and is ignored if `--checkpoint` is specified.
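For example, if you have already fine-tuned the model and saved a checkpoint under the default `output_dir`, passing `--checkpoint=output/model.ckpt` (the same path used in the test command below) should load that checkpoint instead of the pretrained weights.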
Please see the FLAGS in the code for more options.
For multi-GPU training on one or multiple machines, first install the prerequisite OpenMPI and Horovod packages, as detailed in the distributed_gpu example.
Then run the following command for training and evaluation. The command trains the model locally with 2 GPUs; evaluation is performed on the single rank-0 GPU.
```
mpirun -np 2 \
  -H localhost:2 \
  -bind-to none -map-by slot \
  -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
  -mca pml ob1 -mca btl tcp,self \
  -mca btl_tcp_if_include ens3 \
  python gpt2_train_main.py --do_train --do_eval --distributed
  [--config_train=configs.config_train]
  [--output_dir=output]
```
The key configurations of multi-GPU training:

- `-np`: The total number of processes.
- `-H`: The IP addresses of the different servers and the number of processes used on each server. For example, `-H 192.168.11.22:1,192.168.33.44:1`.
- `-mca`: Sets the MPI communication interface. Use the setting specified above to avoid possible multiprocessing and network communication issues.
- The above configuration uses the `ens3` network interface. If this interface does not work in your environment (e.g., it yields the error message `Unknown interfance name`), you may want to use a different interface (run `ifconfig` to see the alternative interfaces in your environment).
Please refer to the distributed_gpu example for more details on the other multi-GPU configurations.
Make sure to specify the `--distributed` flag as above for multi-GPU training.
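If distributed runs fail, it can help to first verify that Horovod and MPI work independently of this example. A minimal, hypothetical sanity-check script (not part of this example) that can be launched with the same `mpirun` options:

```python
# check_horovod.py -- e.g.: mpirun -np 2 python check_horovod.py
import horovod.tensorflow as hvd

hvd.init()
# Each process should report a distinct rank in [0, size)
print("rank %d of %d (local rank %d)" % (hvd.rank(), hvd.size(), hvd.local_rank()))
```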
To restore the fine-tuned model and generate continuations for the test set, run:

```
python gpt2_train_main.py --do_test --checkpoint=output/model.ckpt [--config_train=configs.config_train] [--output_dir=output]
```
The output is saved by default to `output/test_samples.tsv`, where each line contains the context text and the generated continuation (separated by a tab).
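Since it is plain tab-separated text, the file can be inspected with a few lines of Python (a small sketch assuming the default `output_dir` and two fields per line):

```python
import csv

with open("output/test_samples.tsv", "r") as f:
    for row in csv.reader(f, delimiter="\t"):
        context, continuation = row[0], row[1]   # context text, generated continuation
        print("CTX:", context)
        print("GEN:", continuation)
```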
Texar's `TransformerDecoder` (and other RNN-based decoders) easily supports common, advanced, or customized uses, such as:
- Sample or continuation generation
- Greedy / (top-k) sample / Gumbel-softmax / beam-search / ... / your-customized decoding
- Training / fine-tuning in (un)conditional settings
- Perplexity evaluation
For example, after creating the language model

```python
def _embedding_fn(ids, times):
    # Decoder input = token embedding + position embedding
    return word_embedder(ids) + pos_embedder(times)

decoder = TransformerDecoder(
    output_layer=tf.transpose(word_embedder.embedding),  # tie output layer to input embedding
    hparams=gpt2_hparams)
```
we can do, for example:

Ex. Use 1): Continuation generation w/ greedy decoding

```python
# Greedily continue the given context `ctx` (token ids of shape [batch, ctx_len])
output, output_length = decoder(
    context=ctx,
    context_sequence_length=ctx_len,
    decoding_strategy='infer_greedy',
    end_token=end_token,
    embedding=_embedding_fn)

sample_id = output.sample_id
logits = output.logits
```
Ex. Use 2): Top-k sample decoding

```python
# Helper that samples from the k most likely tokens at each step
topk_helper = tx.modules.TopKSampleEmbeddingHelper(
    embedding=_embedding_fn,
    start_tokens=ctx[:, 0],
    end_token=end_token,
    top_k=20,
    softmax_temperature=0.7)

output, output_length = decoder(
    context=ctx,
    context_sequence_length=ctx_len,
    helper=topk_helper)
```
Ex. Use 3): Fine-tuning for conditional generation

```python
# Teacher-forcing: feed the ground-truth target (shifted right) as decoder input
tgt_embed = word_embedder(truth_target[:, :-1]) + pos_embedder(sequence_length=tgt_len - 1)

output = decoder(
    memory=source_hidden_states,
    memory_sequence_length=src_len,
    inputs=tgt_embed,
    decoding_strategy='train_greedy')  # teacher-forcing decoding

loss = tx.losses.sequence_sparse_softmax_cross_entropy(
    labels=truth_target[:, 1:],
    logits=output.logits,
    sequence_length=tgt_len - 1)
```
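The last use case above, perplexity evaluation, follows directly from the same teacher-forcing setup: perplexity is the exponential of the average per-token cross-entropy. A minimal sketch building on Ex. 3 (the reduction arguments below are assumptions; check the Texar API docs for the exact names):

```python
# Average negative log-likelihood per token on the evaluation data
avg_nll = tx.losses.sequence_sparse_softmax_cross_entropy(
    labels=truth_target[:, 1:],
    logits=output.logits,
    sequence_length=tgt_len - 1,
    average_across_timesteps=True,  # assumed reduction flags: average over tokens,
    sum_over_timesteps=False)       # not sum, so that tf.exp gives perplexity

perplexity = tf.exp(avg_nll)
```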