This is the source code for Fact-Enhanced Synthetic News Generation
cd gpt2 && python download_model.py 124M
We provide the raw file under ../data directory.
0. Move to the work directory
cd gpt2
- You should first encode the test into byte.
python encode_text.py --directory /path/to/data/directory
- Zip the training data together
python preprocess.py -train_src /path/to/data/directorytrain.txt.src.bpe -train_tgt /path/to/data/directory/train.txt.tgt.bpe -valid_src /path/to/data/directory/val.txt.src.bpe -valid_tgt data/cnndm/val.txt.tgt.bpe -save_data data/cnndm/CNNDM_BPE_COPY -src_seq_length_trunc 400 -tgt_seq_length_trunc 100 -src_vocab gpt2/vocab.txt -tgt_vocab gpt2/vocab.txt -dynamic_dict -fixed_vocab
This stage only trains with claim and news content.
0. Move to new work Directory after preprocess
cd ../
- Pre-train the Language Model, you can specify the training hyper-parameters in ./config/*.config file.
python train.py -config config/transformer_cnndm_psa.yml -run_name psa -gpt2_params_path gpt2/models/124M/ -gpt2_init_embanddec
- Move to the work directory
cd onmt/modules/
- Pre-train the Fact Reconstructor
python clsAttention.py --file_path /path/to/raw/src/file
- Move to the work directory
cd ../../
- Train LM and FR together. Please specify train_from and clf_path in the configure file. It should one of the checkpoint in Pre-trained language model and fact reconstructor path respectively.
python train.py -config config/transformer_cnndm_psa_Fact.yml -run_name psa -gpt2_params_path gpt2/models/124M/ -gpt2_init_embanddec
- Please specify train_from and data in the configuration file. train_from should be new checkpoint from last step and data file should contain the queried fact information.
python train.py -config config/transformer_cnndm_psa_Fact_Query.yml -run_name psa -gpt2_params_path gpt2/models/124M/ -gpt2_init_embanddec
Generation is performed via top-k/random sampling.
python translate.py \
-beam_size 1 \
-model ./output/checkpoints/model_step_21000.pt \
-src ../data/news_corpus/cnndm/test.txt.tgt.bpe \
-min_length 200 \
-max_length 300 \
-random_sampling_topk 100 -random_sampling_temp 0.9 \
-batch_size 40
Please check AutomaticEvaluation directory for three automatic text quality evaluation scripts.
Please check Detection directory for Neural Generation Detection
Please cite our paper if you use this code in your own work:
@misc{shu2020factenhanced,
title={Fact-Enhanced Synthetic News Generation},
author={Kai Shu and Yichuan Li and Kaize Ding and Huan Liu},
year={2020},
eprint={2012.04778},
archivePrefix={arXiv},
primaryClass={cs.CL}
}