This repository contains a PyTorch implementation of the data augmentation methods for NER introduced in our COLING 2020 paper:
Xiang Dai and Heike Adel. 2020. An Analysis of Simple Data Augmentation for Named Entity Recognition. In COLING, Online.
Please cite this paper if you use this code. The paper is available from the ACL Anthology and from arXiv.
This software is a research prototype, solely developed for and published as part of the publication cited above. It will neither be maintained nor monitored in any way.
Note that the dataset provided in data/ contains only sample files that illustrate the required format.
cp /data/dai031/Experiments/2020-06-03-01/50/* data/
python main.py --data_folder data --embedding_type bert --pretrained_dir /data/dai031/Corpora/SciBERT/scibert_scivocab_cased --result_filepath baseline.json
python main.py --data_folder data --embedding_type bert --pretrained_dir /data/dai031/Corpora/SciBERT/scibert_scivocab_cased --augmentation LwTR --result_filepath lwtr.json
python main.py --data_folder data --embedding_type bert --pretrained_dir /data/dai031/Corpora/SciBERT/scibert_scivocab_cased --augmentation SR --result_filepath sr.json
python main.py --data_folder data --embedding_type bert --pretrained_dir /data/dai031/Corpora/SciBERT/scibert_scivocab_cased --augmentation MR --result_filepath mr.json
python main.py --data_folder data --embedding_type bert --pretrained_dir /data/dai031/Corpora/SciBERT/scibert_scivocab_cased --augmentation SiS --result_filepath sis.json
python main.py --data_folder data --embedding_type bert --pretrained_dir /data/dai031/Corpora/SciBERT/scibert_scivocab_cased --augmentation MR LwTR SiS SR --result_filepath all.json
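The runs above differ only in the values passed to `--augmentation` and `--result_filepath`. As a minimal sketch, the same command lines can be generated from a small table of configurations. The flags are the ones shown above; the helper name `build_command`, the `RUNS` mapping, and the `PRETRAINED_DIR` placeholder are assumptions made for illustration and are not part of this repository.

```python
import shlex

# Placeholder: point this at your local copy of the pretrained model.
PRETRAINED_DIR = "path/to/scibert_scivocab_cased"

# Result file -> augmentation names, mirroring the commands listed above.
RUNS = {
    "baseline.json": [],
    "lwtr.json": ["LwTR"],
    "sr.json": ["SR"],
    "mr.json": ["MR"],
    "sis.json": ["SiS"],
    "all.json": ["MR", "LwTR", "SiS", "SR"],
}


def build_command(result_filepath, augmentations,
                  data_folder="data", pretrained_dir=PRETRAINED_DIR):
    """Return the argument list for one training run (hypothetical helper)."""
    cmd = [
        "python", "main.py",
        "--data_folder", data_folder,
        "--embedding_type", "bert",
        "--pretrained_dir", pretrained_dir,
    ]
    # The baseline run passes no --augmentation flag at all.
    if augmentations:
        cmd += ["--augmentation", *augmentations]
    cmd += ["--result_filepath", result_filepath]
    return cmd


if __name__ == "__main__":
    # Print the commands; pass each list to subprocess.run to execute them.
    for result_filepath, augmentations in RUNS.items():
        print(shlex.join(build_command(result_filepath, augmentations)))
```

Printing the commands first makes it easy to check the paths before starting any training run.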
Method | F1 score |
---|---|
No augmentation | 37.9 |
Label-wise token replacement | 40.8 |
Synonym replacement | 40.8 |
Mention replacement | 41.2 |
Shuffle within segments | 38.1 |
All | 42.5 |
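
To give an intuition for one of the methods in the table, the following is a minimal sketch of shuffling within segments. It assumes BIO-style labels, treats each mention (or each run of consecutive `O` tokens) as one segment, and shuffles the tokens inside a segment with probability `p` while leaving the label sequence unchanged. The function name, the parameter `p`, and the example sentence are illustrative assumptions; this is not the code used in this repository.

```python
import random


def entity_type(label):
    """Strip a BIO prefix: 'B-problem' -> 'problem', 'O' -> 'O'."""
    return label.split("-", 1)[1] if "-" in label else label


def split_into_segments(labels):
    """Return (start, end) index pairs, one per mention or run of O tokens."""
    segments = []
    start = 0
    for i in range(1, len(labels)):
        new_mention = labels[i].startswith("B-")
        type_changed = entity_type(labels[i]) != entity_type(labels[i - 1])
        if new_mention or type_changed:
            segments.append((start, i))
            start = i
    if labels:
        segments.append((start, len(labels)))
    return segments


def shuffle_within_segments(tokens, labels, p=0.5, rng=None):
    """Shuffle token order inside each segment with probability p.

    Labels are returned unchanged, so every token stays inside a span
    that carries the same entity type as before.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    rng = rng or random.Random()
    new_tokens = list(tokens)
    for start, end in split_into_segments(labels):
        if rng.random() < p:
            segment = new_tokens[start:end]
            rng.shuffle(segment)
            new_tokens[start:end] = segment
    return new_tokens, list(labels)


if __name__ == "__main__":
    tokens = ["She", "did", "not", "complain", "of", "headache", "or",
              "any", "other", "neurological", "symptoms", "."]
    labels = ["O", "O", "O", "O", "O", "B-problem", "O",
              "B-problem", "I-problem", "I-problem", "I-problem", "O"]
    augmented, _ = shuffle_within_segments(tokens, labels, p=0.5,
                                           rng=random.Random(0))
    print(" ".join(augmented))
```

Because only the order inside a segment changes, the augmented sentence keeps the original label sequence and mention boundaries.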
The code in this repository is open-sourced under the Apache 2.0 license. See the LICENSE file for details. For a list of other open source components included in this project, see the file 3rd-party-licenses.txt.