Towards Robust Extraction of Named Entities in Economics
This is a PyTorch Implementation
This repository contains our code for reproducing the paper EconBERTa: Towards Robust Extraction of Named Entities in Economics by Karim Lasri, Pedro Vitor Quinta de Castro, Mona Schirmer, Luis Eduardo San Martin, Linxi Wang, Tomáš Dulka, Haaya Naushan, John Pougué-Biyong, Arianna Legovini, and Samuel Fraiberger, published in Findings of EMNLP 2023.
This implementation differs substantially from the authors' original one, which relies on AllenNLP for most of the pretraining and finetuning. Ours instead uses PyTorch and Hugging Face Transformers. You can find the authors' implementation here.
This code demonstrates how to perform Named Entity Recognition (NER) using various transformer models and a Conditional Random Field (CRF) layer. The code is written in Python and utilizes the PyTorch and Transformers libraries.
Authors: Ashutosh Pathak, Chaithra Bekal, Vikas Velagapudi
- Python 3.9
- PyTorch
- Transformers
- scikit-learn
- pandas
- pytorch-crf (imported as torchcrf in the code)
Make sure to install the required dependencies before running the code.
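Assuming a standard pip environment, the dependencies above can be installed with the usual PyPI package names (exact version pins are not specified here):

```shell
# torch provides PyTorch; pytorch-crf is imported as torchcrf in the code
pip install torch transformers scikit-learn pandas pytorch-crf
```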
The main notebook is `notebook/EconBERTa.ipynb`.
- Install the required dependencies.
- Place the dataset files (`train.conll`, `dev.conll`, `test.conll`) in the appropriate directory.
- Set the desired `model_name` and other hyperparameters in the code.
- Run the code cells in the provided order.
- The training progress, validation performance, and test performance will be printed.
- The trained model checkpoint will be saved for future use.
Feel free to experiment with different models and hyperparameters to achieve the best performance for your specific NER task.
The code supports the following transformer models:
- `worldbank/econberta` (default)
- `bert-base-uncased`
- `roberta-base`
- `mdeberta-v3-base`

To use a different model, simply replace the `model_name` variable with the desired model name and re-run the cells from that point onwards.
The code assumes the dataset is in CoNLL format and expects the following files:
- `train.conll`: Training data
- `dev.conll`: Validation data
- `test.conll`: Test data
Make sure to place these files in the appropriate directory before running the code.
The code performs the following preprocessing steps:
- Reading the CoNLL files and converting them into pandas DataFrames.
- Tokenizing the words using the specified transformer model's tokenizer.
- Encoding the labels using a one-hot encoding scheme.
- Creating PyTorch datasets by combining the tokenized input IDs, attention masks, and encoded labels.
The preprocessing steps ensure that the data is in the appropriate format for training and evaluation.
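Because subword tokenizers split words into several pieces, per-word labels must be expanded to per-token labels. A common scheme (a sketch of one standard approach, not necessarily the notebook's exact choice) keeps the label on the first subword of each word and masks the rest with -100 so the loss ignores them:

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Map word-level labels to subword-level labels.

    `word_ids` is the per-token word index as returned by a Hugging Face
    fast tokenizer's `word_ids()` method (None for special tokens such as
    [CLS]). Only the first subword of each word keeps its label; all
    continuation subwords and special tokens get `ignore_index`.
    """
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:            # special token
            aligned.append(ignore_index)
        elif wid != previous:      # first subword of a new word
            aligned.append(word_labels[wid])
        else:                      # continuation subword
            aligned.append(ignore_index)
        previous = wid
    return aligned
```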
The code performs hyperparameter search by trying different learning rates specified in the `learning_rates` list. The training loop runs for a specified number of epochs (`max_epochs`) and uses the AdamW optimizer with a linear learning rate scheduler.
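The linear schedule (as provided by Transformers' `get_linear_schedule_with_warmup`) warms the learning rate up and then decays it linearly to zero; the multiplier it applies at each step can be sketched as:

```python
def linear_schedule(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier for linear warmup followed by linear decay.

    Rises from 0 to 1 over `num_warmup_steps`, then decays linearly to 0
    at `num_training_steps` (mirrors the Transformers linear schedule).
    """
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    return max(
        0.0,
        (num_training_steps - step) / max(1, num_training_steps - num_warmup_steps),
    )
```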
During training, the code evaluates the model on the validation set after each epoch and prints the classification report and entity-level metrics.
After training, the code loads the best model checkpoint and evaluates it on the test set. It prints the classification report and entity-level metrics for the test set.
The code calculates the following entity-level metrics:
- Exact Match (EM)
- Exact Boundary (EB)
- Partial Match (PM)
- Partial Boundaries (PB)
- Missed Label (ML)
- False Alarm (FA)
These metrics provide a detailed evaluation of the model's performance in recognizing named entities.
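As a simplified illustration of a few of these categories (the paper's full error taxonomy has more nuance), exact matches, boundary matches, misses, and false alarms can be counted from gold and predicted (type, start, end) spans:

```python
def entity_match_counts(gold, pred):
    """Count simple entity-level outcomes from span sets.

    `gold` and `pred` are sets of (type, start, end) tuples with half-open
    [start, end) boundaries. Exact Match (EM) requires identical type and
    boundaries; Exact Boundary (EB) requires the same boundaries with a
    different type; Missed Label (ML) and False Alarm (FA) count gold and
    predicted spans with no boundary overlap on the other side. This is a
    simplified sketch, not the full taxonomy used in the code.
    """
    def overlaps(a, b):
        return a[1] < b[2] and b[1] < a[2]

    em = len(gold & pred)
    eb = sum(1 for g in gold for p in pred
             if g[1:] == p[1:] and g[0] != p[0])
    ml = sum(1 for g in gold if not any(overlaps(g, p) for p in pred))
    fa = sum(1 for p in pred if not any(overlaps(p, g) for g in gold))
    return {"EM": em, "EB": eb, "ML": ml, "FA": fa}
```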
The code includes a function `analyze_generalization` that analyzes the model's generalization ability. It groups the entities by their length and by whether they were seen or unseen during training, then calculates the entity-level metrics for each group and prints the results.
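The grouping step can be sketched as follows (a hypothetical helper; the notebook's `analyze_generalization` may differ in its details):

```python
from collections import defaultdict

def group_entities(entities, train_entities):
    """Bucket entities by (length, seen/unseen) for generalization analysis.

    `entities` is an iterable of entity surface forms as token tuples,
    e.g. ("cash", "transfers"); `train_entities` is the set of surface
    forms observed in the training split.
    """
    groups = defaultdict(list)
    for ent in entities:
        seen = "seen" if ent in train_entities else "unseen"
        groups[(len(ent), seen)].append(ent)
    return groups
```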
We were able to reproduce the trend in F1 scores across models that the authors report in the paper. We also implemented CheckList tests to assess the models' robustness.