This repository contains the code that achieved 4th place in Feedback Prize - Evaluating Student Writing. A detailed explanation of the 4th place solution is available in this post, and you can also check the final private leaderboard.
This code requires the following libraries:
- numpy
- omegaconf
- pandas
- pytorch_lightning
- scikit_learn
- sentencepiece
- torch==1.10.2+cu113
- transformers
- iterative-stratification
- text_unidecode
- wandb
Instead of installing the above modules one by one, you can install them all at once with:
$ pip install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html
This repository supports NVIDIA Apex. The apex module is detected automatically, and if it is found, some training procedures are replaced with the highly optimized and fused operations from apex. Run the commands below in the terminal to install apex and enable the performance boost:
$ git clone https://github.com/NVIDIA/apex
$ sed -i "s/or (bare_metal_minor != torch_binary_minor)//g" apex/setup.py
$ pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" apex/
$ rm -rf apex
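Internally this follows the usual optional-import pattern. The sketch below is illustrative only (FusedAdam is just one example of a fused apex operation; the repository's actual detection logic may differ):

# Illustrative sketch of optional apex detection, not the repository's exact code.
try:
    from apex.optimizers import FusedAdam as AdamW  # fused CUDA optimizer
    APEX_AVAILABLE = True
except ImportError:
    from torch.optim import AdamW  # plain PyTorch fallback
    APEX_AVAILABLE = False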
Before training the models, you need to download the dataset from the competition page. Alternatively, you can download and unzip the dataset with the Kaggle CLI:
$ pip install --upgrade kaggle
$ export KAGGLE_USERNAME=[your kaggle username]
$ export KAGGLE_KEY=[your kaggle api key]
$ kaggle competitions download -c feedback-prize-2021
$ unzip -qq feedback-prize-2021.zip -d feedback-prize-2021
$ rm feedback-prize-2021.zip
Make sure the dataset directory name is feedback-prize-2021.
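After unzipping, the layout should look roughly like the sketch below; train.csv and the train directory are the paths the configurations below expect, and the test directory holds the essays used at prediction time:

feedback-prize-2021/
├── train.csv
├── train/    # one .txt file per training essay
└── test/     # essay text files used at prediction time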
In this repository, we provide 4 finetuning configurations which are used in the solution:
- deberta-large.yaml
- deberta-v2-xlarge.yaml
- deberta-v3-large.yaml
- deberta-xlarge.yaml
Of course you can write your own finetuning configuration:
dataset:
  dataset_filename: ./feedback-prize-2021/train.csv
  textfile_dir: ./feedback-prize-2021/train
  max_length: 2048
  normalize: true
  num_folds: 5
  fold_index: 0
  dataloader_workers: -1
  random_seed: 42

model:
  transformer:
    pretrained_model_name_or_path: ...
  decoding:
    beam_size: 4
    minimum_lengths:
      Lead: 9
      Claim: 3
      Position: 5
      Rebuttal: 4
      Evidence: 14
      Counterclaim: 6
      Concluding Statement: 11
    minimum_probs:
      Lead: 0.70
      Claim: 0.55
      Position: 0.55
      Rebuttal: 0.55
      Evidence: 0.65
      Counterclaim: 0.50
      Concluding Statement: 0.70
  num_reinit_layers: 0
  random_seed: 42

optim:
  optimizer:
    lr: ...
    betas: [0.9, 0.999]
    eps: 1e-6
    weight_decay: 0.01
  scheduler:
    name: linear
    num_warmup_steps: ...
    num_training_steps: ...

train:
  name: ...
  batch_size: ...
  accumulate_grads: ...
  max_grad_norm: ...
  gradient_checkpointing: ...
  validation_interval: ...
  logging_interval: 10
  evaluate_after_steps: ...
  save_best_checkpoint: true
  precision: 16
  gpus: 1
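The decoding block controls post-processing of the model outputs. As a hypothetical illustration (not the repository's actual code), per-class thresholds like these are typically applied after span extraction, keeping an entity only if it is both long enough and confident enough for its class:

# Hypothetical illustration of how minimum_lengths / minimum_probs
# from the configuration above are typically applied.
MINIMUM_LENGTHS = {"Lead": 9, "Claim": 3, "Position": 5, "Rebuttal": 4,
                   "Evidence": 14, "Counterclaim": 6, "Concluding Statement": 11}
MINIMUM_PROBS = {"Lead": 0.70, "Claim": 0.55, "Position": 0.55, "Rebuttal": 0.55,
                 "Evidence": 0.65, "Counterclaim": 0.50, "Concluding Statement": 0.70}

def keep_entity(entity_class, num_words, prob):
    # Drop spans that are too short or too uncertain for their class.
    return (num_words >= MINIMUM_LENGTHS[entity_class]
            and prob >= MINIMUM_PROBS[entity_class])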
Here are descriptions of the main hyperparameters:
- model.transformer.pretrained_model_name_or_path: name of the backbone transformer
- optim.optimizer.lr: learning rate of the optimizer
- optim.scheduler.num_warmup_steps: warmup steps for linear learning rate decay
- optim.scheduler.num_training_steps: total number of training steps
- train.name: name of the finetuning experiment
- train.batch_size: batch size for a single training step
- train.accumulate_grads: number of gradient accumulation steps
- train.max_grad_norm: maximum gradient norm for clipping
- train.gradient_checkpointing: whether to enable gradient checkpointing
- train.validation_interval: interval of validations per epoch
- train.evaluate_after_steps: validations will only be performed after this many steps
After writing your own configuration, run the command below to train the model:
$ python src/train.py config/... dataset.num_folds=... dataset.fold_index=...
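For example, to train deberta-large on the first of five folds (the configuration file name comes from the list above):
$ python src/train.py config/deberta-large.yaml dataset.num_folds=5 dataset.fold_index=0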
Note that you can override the hyperparameters on the command line. This is especially useful for changing the target fold index with dataset.fold_index=....
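Such dotted overrides follow OmegaConf's standard dotlist-merge semantics. A minimal sketch of that behavior is below; the configuration filename is just an example, and src/train.py may wire this up differently:

from omegaconf import OmegaConf

# Load a base config, then merge command-line style dotted overrides into it.
config = OmegaConf.load("config/deberta-large.yaml")
overrides = OmegaConf.from_dotlist(["dataset.num_folds=5", "dataset.fold_index=2"])
config = OmegaConf.merge(config, overrides)
print(config.dataset.fold_index)  # 2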
In addition, we recommend logging in to wandb to record the metrics.
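If you have not logged in yet, you can do so with the Weights & Biases CLI:
$ wandb login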
Once training is complete and the model weights are ready, it is time to create a submission.
$ python src/predict.py ... --output_name=submission.csv --textfile_dir=... --batch_size=4 --max_length=1024 --beam_size=4
You have to pass the test directory path to the --textfile_dir=... option. The prediction script supports a simple model ensemble, in which the predictions of multiple models are averaged before the entities are extracted. However, we strongly recommend predicting with each model separately and ensembling the results with the script described in the next section. The --return_confidence option adds the confidence of each entity to the submission; it is required when you use the ensemble script below.
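For example, to produce one prediction file per fold with confidence scores (the model weight argument is elided here as above, and the test directory name assumes the unzipped competition data):
$ python src/predict.py ... --output_name=deberta-large-fold0.csv --textfile_dir=feedback-prize-2021/test --batch_size=4 --max_length=1024 --beam_size=4 --return_confidence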
You can ensemble the predictions to improve performance. Because different architectures use different subword tokenizations, a simple probability-averaging ensemble is impossible. Instead, this script gathers similar entities, sorts them by confidence, and averages the word ranges within each group. If you want to use this script, make sure the submission files contain the confidence column produced by the --return_confidence option of the prediction script.
$ python src/ensemble.py \
deberta-large-fold0.csv \
deberta-large-fold1.csv \
deberta-large-fold2.csv \
deberta-large-fold3.csv \
deberta-large-fold4.csv \
deberta-v3-large-fold0.csv \
deberta-v3-large-fold1.csv \
deberta-v3-large-fold2.csv \
deberta-v3-large-fold3.csv \
deberta-v3-large-fold4.csv \
--min_threshold=5
In our experiments, --group_strategy=mean with --min_threshold=[half of the submissions] performed best.
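For reference, here is a rough sketch of the grouping idea: a hypothetical re-implementation, assuming submission files with id, class, predictionstring, and confidence columns. src/ensemble.py is the authoritative version and its details (overlap measure, group strategies) may differ:

import pandas as pd

def overlap(a, b):
    # Intersection-over-union of two word-index lists.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def ensemble(frames, min_threshold=5, iou=0.5):
    rows = pd.concat(frames, ignore_index=True)
    results = []
    for (doc_id, cls), grp in rows.groupby(["id", "class"]):
        # Parse each predictionstring into word indices, highest confidence first.
        spans = sorted(
            (([int(i) for i in ps.split()], conf)
             for ps, conf in zip(grp["predictionstring"], grp["confidence"])),
            key=lambda s: -s[1],
        )
        used = [False] * len(spans)
        for i, (anchor, _) in enumerate(spans):
            if used[i]:
                continue
            used[i] = True
            group = [anchor]
            # Gather similar entities around the most confident span.
            for j in range(i + 1, len(spans)):
                if not used[j] and overlap(anchor, spans[j][0]) >= iou:
                    used[j] = True
                    group.append(spans[j][0])
            if len(group) < min_threshold:  # agreed on by too few submissions
                continue
            # Average the word ranges within the group.
            start = round(sum(s[0] for s in group) / len(group))
            end = round(sum(s[-1] for s in group) / len(group))
            results.append({"id": doc_id, "class": cls,
                            "predictionstring": " ".join(map(str, range(start, end + 1)))})
    return pd.DataFrame(results)

# usage: ensemble([pd.read_csv(f) for f in files], min_threshold=5)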