Code and data for "Improving Temporal Generalization of Pre-trained Language Models with Lexical Semantic Change" (EMNLP2022)

zhaochen0110/LMLM

Improving Temporal Generalization of Pre-trained Language Models with Lexical Semantic Change

This repository contains the code and pre-trained models for our paper "Improving Temporal Generalization of Pre-trained Language Models with Lexical Semantic Change".

Quick Links

  • Overview
  • Datasets
  • Baseline
  • Train
    • Requirements
    • Candidate Words Selection
    • Representation and Clustering
    • Train LMLM
    • Finetuning
  • Results

Overview

We propose LMLM, a simple yet effective lexical-level masking strategy for post-training pre-trained language models (PLMs).

We investigate the temporal misalignment of PLMs at the lexical level and observe that words with salient lexical semantic change contribute significantly to the temporal problems.

Datasets

The PLM is adapted to temporality using unlabeled data, fine-tuned on the labeled downstream data, and then evaluated on test data from the same time period as the pre-training data. We run our experiments with pre-trained models from the Hugging Face transformers package.

We choose the ARXIV dataset for the scientific domain and the Reddit Time Corpus (RTC) dataset for the political domain. We also use two different datasets with similar distributions for pre-training and fine-tuning, respectively: the WMT News Crawl (WMT) dataset, which contains news covering various topics (e.g., finance and politics), as unlabeled data, and the PoliAff dataset in the political domain as labeled data.

Please download the datasets from their websites and put them under the data/ directory.

Baseline

Method     Source
TADA       Röttger and Pierrehumbert, 2021
PERL       Ben-David et al., 2020
DILBERT    Lekhtman et al., 2021

Train

Requirements

Run the following command to install the dependencies first.

pip install -r requirements.txt

This section introduces the lexical semantic change detection process.

Candidate Words Selection

python src/important_word.py

Representation and Clustering

python src/main_extract.py

Semantic Change Quantification

python src/jsd.py
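
As a rough illustration of the quantification step, the sketch below computes the Jensen-Shannon divergence (JSD) between the cluster distributions of one word in two time periods. It is a minimal sketch and not the code in src/jsd.py: it assumes that each occurrence of a word has already been assigned a cluster id by the clustering step, and the function names and the base-2 logarithm are our own illustrative choices.

```python
import math
from collections import Counter


def cluster_distribution(labels, num_clusters):
    """Turn a list of cluster ids (one per word occurrence) into a probability vector."""
    if not labels:
        raise ValueError("labels must contain at least one cluster id")
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(k, 0) / total for k in range(num_clusters)]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence between two distributions, bounded in [0, 1] with log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical cluster assignments for one word in an early and a late period.
early = cluster_distribution([0, 0, 0, 1], num_clusters=2)
late = cluster_distribution([1, 1, 1, 0], num_clusters=2)
score = jensen_shannon_divergence(early, late)
```

Under this assumption, a word whose occurrences fall into the same clusters in both periods scores close to 0, while a word whose usage moves to entirely different clusters scores close to 1.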

This section describes how to post-train the model with the lexical-level Masked Language Model (LMLM) objective and fine-tune it on the downstream task:

Train LMLM

Train LMLM using run_mlm.py. This script is similar to Hugging Face's language modeling training script and introduces three new arguments, pivot_file, mlm_prob and pivot_num, which can be used to set a custom probability for semantic masking.

bash sh/run_mlm.sh
python run_mlm.py \
    --model_name_or_path bert-base-uncased \
    --train_file  \
    --validation_file  \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 64 \
    --line_by_line \
    --do_train \
    --do_eval \
    --num_train_epochs 5 \
    --pivot_file data/arxiv/arxiv_word/jsd_09_21.txt \
    --mlm_prob ${mlm_prob} \
    --pivot_num ${pivot_num} \
    --max_seq_length 128 \
    --overwrite_output_dir \
    --save_steps 100000 \
    --output_dir saved_models/mlm_model/poliaff16-${mlm_prob}-${pivot_num}-mlm-${gpu_num}gpu
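
One way to picture how the pivot_file, mlm_prob and pivot_num arguments could interact is sketched below. This is an assumption-laden illustration, not the logic of run_mlm.py: it assumes that the pivot file lists one word per line with the most strongly changed words first, that pivot_num keeps the first entries of that list, and that mlm_prob is the probability of masking each pivot token. The helper names and the whole-word token matching are hypothetical.

```python
import random


def load_pivots(lines, pivot_num):
    """Keep the first `pivot_num` words; assumes one word (plus optional score) per line."""
    words = [line.split()[0] for line in lines if line.strip()]
    return set(words[:pivot_num])


def lexical_mask(tokens, pivots, mlm_prob, rng, mask_token="[MASK]"):
    """Mask only pivot tokens, each with probability `mlm_prob`.

    Returns the masked sequence and the labels (original token where masked, else None).
    """
    masked, labels = [], []
    for token in tokens:
        if token in pivots and rng.random() < mlm_prob:
            masked.append(mask_token)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


# Hypothetical pivot list and sentence, for illustration only.
pivot_lines = ["transformer 0.91", "attention 0.85", "network 0.40"]
pivots = load_pivots(pivot_lines, pivot_num=2)
tokens = "the transformer uses attention in the network".split()
masked, labels = lexical_mask(tokens, pivots, mlm_prob=1.0, rng=random.Random(0))
```

With mlm_prob set to 1.0, every selected pivot word in the sentence is masked and all other tokens are left untouched; lower values mask only a random subset of the pivot words.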

Finetuning

To fine-tune the LMLM or the original pre-trained model, we use a script based on Hugging Face's example scripts. We report the average over five random seeds as our final results:

bash sh/finetuning.sh
seed_list=(5 10 15 20 42)
for loop in $(seq 0 4); do
  python finetuning.py \
    --model_name_or_path saved_models/mlm_model/poliaff${year}-${mlm_prob}-${pivot_num}-mlm-${gpu_num}gpu \
    --train_file data/poliaff/train/20${year}.csv \
    --validation_file data/poliaff/test/2017.csv \
    --do_train \
    --do_eval \
    --seed ${seed_list[${loop}]} \
    --per_device_eval_batch_size 128 \
    --per_device_train_batch_size 128 \
    --output_dir saved_models/finetuning_model/-${mlm_prob}-${pivot_num}-${year}-17-${gpu_num}gpu/${seed_list[$loop]} \
    --output saved_models/finetuning_model/arxiv-${mlm_prob}-${pivot_num}-${year}-17-${gpu_num}gpu/${seed_list[$loop]} \
    --overwrite_output_dir \
    --save_steps 100000 \
    --num_train_epochs 5 \
    --max_seq_length 128 \
    --use_special_tokens
done
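
The seed averaging described above can be done with a few lines of Python once each run has produced a score. This is only a sketch: how finetuning.py stores its evaluation metrics is not shown here, so the mapping from seed to score is assumed to be collected by hand, and the function name is hypothetical.

```python
from statistics import mean, stdev


def summarize_seeds(scores_by_seed):
    """Return (mean, sample standard deviation) of the scores across seeds."""
    values = list(scores_by_seed.values())
    if not values:
        raise ValueError("at least one seed result is required")
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread


# Placeholder values for illustration only; these are not results from the paper.
placeholder_scores = {5: 0.5, 10: 0.7}
average, spread = summarize_seeds(placeholder_scores)
```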

Results

Here we list the results of different masking strategies of LMLM on the ARXIV dataset. More details on the experiments and results can be found in our paper.
