LREBench: A low-resource relation extraction benchmark.

This repo is the official implementation of the EMNLP 2022 (Findings) paper Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study [poster].

This paper presents an empirical study of building relation extraction systems in low-resource settings. Based upon recent PLMs, three schemes are comprehensively investigated to evaluate performance in low-resource settings: (i) different types of prompt-based methods with few-shot labeled data; (ii) diverse balancing methods to address the long-tailed distribution issue; (iii) data augmentation techniques and self-training to generate more labeled in-domain data.



Environment

To install requirements:

>> conda create -n LREBench python=3.9
>> conda activate LREBench
>> pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu113

Datasets

We provide 8 benchmark datasets and prompts used in our experiments.

All processed full-shot datasets can be downloaded and need to be placed in the dataset folder. Each dataset folder is expected to contain rel2id.json, train.json and test.json.
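Each processed dataset is plain JSON. As a minimal sketch (the helper name and the check itself are ours, not part of the repo), the layout can be verified like this:

import json
from pathlib import Path

# Hypothetical helper: confirm a dataset folder contains the three expected
# files (rel2id.json, train.json, test.json) and report the relation count.
def check_dataset(folder):
    folder = Path(folder)
    for name in ("rel2id.json", "train.json", "test.json"):
        if not (folder / name).exists():
            raise FileNotFoundError(f"missing {folder / name}")
    with open(folder / "rel2id.json", encoding="utf-8") as f:
        rel2id = json.load(f)
    print(f"{folder.name}: {len(rel2id)} relation types")

check_dataset("dataset/semeval")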

Normal Prompt-based Tuning


1 Initialize Answer Words

Use the command below to get answer words first.

>> python get_label_word.py --modelpath roberta-large --dataset semeval

The file {modelpath}_{dataset}.pt will be saved in the dataset folder; set modelpath and dataset to the names of the pre-trained language model and the dataset you want to use.
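The exact structure of this file is defined by get_label_word.py; as a hedged illustration only, it is an ordinary PyTorch-serialized object and can be inspected with torch.load:

import torch

# Answer-word file produced by the command above; the path follows the
# {modelpath}_{dataset}.pt naming convention described in the text.
label_words = torch.load("dataset/roberta-large_semeval.pt")

# Assumption: the object holds one entry of label-word ids per relation,
# so its length should match the number of relations in rel2id.json.
print(type(label_words), len(label_words))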

2 Split Datasets

We provide sampling scripts for obtaining the 8-shot (sample_8shot.py) and 10% (sample_10.py) datasets; the remaining instances are kept as unlabeled data for self-training. When sampling 8-shot datasets, classes with fewer than 8 instances are removed from the training and test sets, producing new_test.json and new_rel2id.json.

>> python sample_8shot.py -h
    usage: sample_8shot.py [-h] --input_dir INPUT_DIR --output_dir OUTPUT_DIR

    optional arguments:
      -h, --help            show this help message and exit
      --input_dir INPUT_DIR, -i INPUT_DIR
                            The directory of the training file.
      --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                            The directory of the sampled files.
>> python sample_10.py -h
    usage: sample_10.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR

    optional arguments:
      -h, --help            show this help message and exit
      --input_file INPUT_FILE, -i INPUT_FILE
                            The directory of the training file.
      --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                            The directory of the sampled files.

For example:

>> python sample_8shot.py -i dataset/semeval -o dataset/semeval/8-shot
>> cd dataset/semeval
>> mkdir 8-1
>> cp 8-shot/new_rel2id.json 8-1/rel2id.json
>> cp 8-shot/new_test.json 8-1/test.json
>> cp 8-shot/train_8_1.json 8-1/train.json
>> cp 8-shot/unlabel_8_1.json 8-1/label.json
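For orientation, here is a hedged sketch of the sampling logic described above; sample_8shot.py is the reference implementation, and the instance format (a JSON list with a "relation" field) is an assumption:

import json
import random
from collections import defaultdict

K = 8  # shots kept per relation

# Assumption: train.json is a JSON list of instances, each carrying a "relation" field.
with open("dataset/semeval/train.json", encoding="utf-8") as f:
    train = json.load(f)

by_rel = defaultdict(list)
for inst in train:
    by_rel[inst["relation"]].append(inst)

few_shot, unlabeled = [], []
for rel, insts in by_rel.items():
    if len(insts) < K:
        continue  # relations with fewer than 8 instances are dropped entirely
    random.shuffle(insts)
    few_shot.extend(insts[:K])    # goes into the 8-shot training split
    unlabeled.extend(insts[K:])   # kept as unlabeled data for self-training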

3 Prompt-based Tuning

All running scripts for each dataset are in the scripts folder. For example, train KnowPrompt on SemEval, CMeIE and ChemProt with the following commands:

>> bash scripts/semeval.sh  # RoBERTa-large
>> bash scripts/CMeIE.sh    # Chinese RoBERTa-large
>> bash scripts/ChemProt.sh # BioBERT-large

4 Different Prompts


Simply add parameters to the scripts.

Template Prompt: --use_template_words 0

Schema Prompt: --use_template_words 0 --use_schema_prompt True

PTR: refer to PTR

Balancing


1 Re-sampling

  • Create the re-sampled training file based on the 10% training set with resample.py (a rough sketch of one re-sampling strategy follows the example below).

    >> python resample.py -h
        usage: resample.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR --rel_file REL_FILE
    
        optional arguments:
          -h, --help            show this help message and exit
          --input_file INPUT_FILE, -i INPUT_FILE
                                The path of the training file.
          --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                                The directory of the sampled files.
          --rel_file REL_FILE, -r REL_FILE
                                the path of the relation file

    For example,

    >> mkdir dataset/semeval/10sa-1
    >> python resample.py -i dataset/semeval/10/train10per_1.json -r dataset/semeval/rel2id.json -o dataset/semeval/sa
    >> cd dataset/semeval
    >> cp rel2id.json test.json 10sa-1/
    >> cp sa/sa_1.json 10sa-1/train.json
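resample.py is the reference implementation; the following is only a rough sketch of one common re-sampling strategy (oversampling tail relations with replacement until every class matches the head class), with the instance format assumed as above:

import json
import random
from collections import defaultdict

# Assumption: the 10% training file is a JSON list of instances with a "relation" field.
with open("dataset/semeval/10/train10per_1.json", encoding="utf-8") as f:
    train = json.load(f)

by_rel = defaultdict(list)
for inst in train:
    by_rel[inst["relation"]].append(inst)

target = max(len(v) for v in by_rel.values())  # size of the largest (head) class
resampled = []
for rel, insts in by_rel.items():
    extra = random.choices(insts, k=target - len(insts))  # oversample the tail
    resampled.extend(insts + extra)
random.shuffle(resampled)

with open("dataset/semeval/sa/sa_1.json", "w", encoding="utf-8") as f:
    json.dump(resampled, f, ensure_ascii=False)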

2 Re-weighting Loss

Simply add the --useloss parameter to the script to choose a re-weighting loss.

For example: --useloss MultiFocalLoss (choices: MultiDSCLoss, MultiFocalLoss, GHMC_Loss, LDAMLoss).
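The losses above ship with the repo. As a hedged reminder of what such re-weighting does, here is a generic multi-class focal loss in PyTorch (not the repo's MultiFocalLoss implementation):

import torch
import torch.nn.functional as F

def focal_loss(logits, labels, gamma=2.0):
    """Generic multi-class focal loss: down-weights easy, well-classified examples."""
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, labels.unsqueeze(1)).squeeze(1)  # log p of the true class
    pt = log_pt.exp()
    return (-((1.0 - pt) ** gamma) * log_pt).mean()

# Toy example: a batch of 4 instances over 19 relation types (as in SemEval).
logits = torch.randn(4, 19)
labels = torch.tensor([0, 3, 7, 18])
print(focal_loss(logits, labels))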

Data Augmentation


1 Prepare the environment

>> pip install nlpaug nlpcda

Please follow the instructions from nlpaug and nlpcda for more information (Thanks a lot!).

2 Try different DA methods

We provide several data augmentation methods:

  • English (nlpaug): TF-IDF, contextual word embeddings (BERT and RoBERTa), and WordNet synonyms (-lan en, chosen with -d).

  • Chinese (nlpcda): synonym replacement (-lan cn).

  • All DA methods can be applied to contexts, entities, or both (--locations).

  • Generate augmented data

    >> python DA.py -h
        usage: DA.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR --language {en,cn}
                      [--locations {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...]]
                      [--DAmethod {word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}]
                      [--model_dir MODEL_DIR] [--model_name MODEL_NAME] [--create_num CREATE_NUM] [--change_rate CHANGE_RATE]
    
        optional arguments:
          -h, --help            show this help message and exit
          --input_file INPUT_FILE, -i INPUT_FILE
                                the training set file
          --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                                The directory of the sampled files.
          --language {en,cn}, -lan {en,cn}
                                DA for English or Chinese
          --locations {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...], -l {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...]
                                List of positions that you want to manipulate
          --DAmethod {word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}, -d {word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}
                                Data augmentation method
          --model_dir MODEL_DIR, -m MODEL_DIR
                                the path of pretrained models used in DA methods
          --model_name MODEL_NAME, -mn MODEL_NAME
                                model from huggingface
          --create_num CREATE_NUM, -cn CREATE_NUM
                                The number of samples augmented from one instance.
          --change_rate CHANGE_RATE, -cr CHANGE_RATE
                                the changing rate of text

    Take context-level DA based on contextual word embeddings on ChemProt as an example (a standalone nlpaug sketch follows this list):

    python DA.py \
        -i dataset/ChemProt/10/train10per_1.json \
        -o dataset/ChemProt/aug \
        -d word_embedding_bert \
        -mn dmis-lab/biobert-large-cased-v1.1 \
        -l sent1 sent2 sent3
  • Delete repeated instances and get the final augmented data

    >> python merge_dataset.py -h
    usage: merge_dataset.py [-h] [--input_files INPUT_FILES [INPUT_FILES ...]] [--output_file OUTPUT_FILE]
    
    optional arguments:
      -h, --help            show this help message and exit
      --input_files INPUT_FILES [INPUT_FILES ...], -i INPUT_FILES [INPUT_FILES ...]
                            List of input files containing datasets to merge
      --output_file OUTPUT_FILE, -o OUTPUT_FILE
                            Output file containing merged dataset

    For example:

    python merge_dataset.py \
        -i dataset/ChemProt/train10per_1.json dataset/ChemProt/aug/aug.json \
        -o dataset/ChemProt/aug/merge.json
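DA.py wraps these libraries; as a minimal standalone illustration of the contextual-word-embedding method (the model name and sentence are placeholders, and DA.py's exact configuration may differ), nlpaug can be used directly:

import nlpaug.augmenter.word as naw

# Contextual word-embedding augmentation, the same family of methods selected
# with -d word_embedding_bert in DA.py.
aug = naw.ContextualWordEmbsAug(
    model_path="bert-base-uncased",  # placeholder; e.g. dmis-lab/biobert-large-cased-v1.1 for ChemProt
    action="substitute",             # replace tokens with contextually plausible alternatives
    aug_p=0.1,                       # fraction of tokens changed, in the spirit of --change_rate
)
text = "The compound inhibits the phosphorylation of the target protein."
print(aug.augment(text))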

Self-training for Semi-supervised Learning

  • Train a teacher model on a small amount of labeled data (8-shot or 10%).
  • Place the unlabeled data label.json in the corresponding dataset folder.
  • Assign pseudo labels using the trained teacher model: add --labeling True to the script to obtain the pseudo-labeled dataset label2.json.
  • Put the gold-labeled data and pseudo-labeled data together (a hedged sketch of this combination step follows the list). For example:
    >> python self-train_combine.py -g dataset/semeval/10-1/train.json -p dataset/semeval/10-1/label2.json -la dataset/semeval/10la-1
    >> cd dataset/semeval
    >> cp rel2id.json test.json 10la-1/
  • Train the final student model: add --stutrain True to the script.
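self-train_combine.py performs the combination step; a hedged sketch of the same idea, assuming both files are JSON lists with identical instance formats:

import json
import os

def combine(gold_path, pseudo_path, out_dir):
    # Merge gold-labeled and pseudo-labeled instances into a new training file.
    with open(gold_path, encoding="utf-8") as f:
        gold = json.load(f)
    with open(pseudo_path, encoding="utf-8") as f:
        pseudo = json.load(f)
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "train.json"), "w", encoding="utf-8") as f:
        json.dump(gold + pseudo, f, ensure_ascii=False)

combine("dataset/semeval/10-1/train.json",
        "dataset/semeval/10-1/label2.json",
        "dataset/semeval/10la-1")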

Standard Fine-tuning Baseline


Fine-tuning

Citation

If you use the code, please cite the following paper:

@inproceedings{xu-etal-2022-towards-realistic,
    title = "Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study",
    author = "Xu, Xin  and
      Chen, Xiang  and
      Zhang, Ningyu  and
      Xie, Xin  and
      Chen, Xi  and
      Chen, Huajun",
    editor = "Goldberg, Yoav  and
      Kozareva, Zornitsa  and
      Zhang, Yue",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.29",
    doi = "10.18653/v1/2022.findings-emnlp.29",
    pages = "413--427"
}