BART-IT: Italian pretraining for the BART sequence-to-sequence model

This repository contains the code for pretraining BART-IT, an efficient and accurate sequence-to-sequence model for the Italian language.

Notes

As pointed out by the IT5 co-author (@gsarti_, thanks!), the IT5 model compared in the paper was not trained with multi-task learning, but with the regular span-masking objective (as adopted in newer versions of T5).

Table of Contents

  • Model Tokenizer
  • Model Pretraining
  • Model Fine-tuning
  • Demo
  • Citation and acknowledgments

Model Tokenizer

The code for training the tokenizer is self-contained in the train_tokenizer.py script. The tokenizer is based on the BPE algorithm and is trained on the Italian portion of mC4, a large multilingual corpus, using the tokenizers library.

The following parameters are used to train the tokenizer:

  • vocab_size: 52,000
  • min_frequency: 10
  • special_tokens: <s>, </s>, <pad>, <unk>, <mask>

The tokenizer is saved in the tokenizer_bart_it folder.
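The full data loading and training logic lives in train_tokenizer.py; the snippet below is only a minimal sketch of the training call, assuming the Italian mC4 text has been dumped to a plain-text file (the file path is a placeholder):

```python
# Minimal sketch of BPE tokenizer training with the `tokenizers` library.
# "mc4_it.txt" is a placeholder path; see train_tokenizer.py for the real pipeline.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["mc4_it.txt"],
    vocab_size=52_000,
    min_frequency=10,
    special_tokens=["<s>", "</s>", "<pad>", "<unk>", "<mask>"],
)
tokenizer.save_model("tokenizer_bart_it")  # writes vocab.json and merges.txt
```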

Model Pretraining

The main script for pretraining the model is pretrain_base.py. The model is trained following the same denoising pretraining strategy used for BART. Model parameters are reported in the table below.

Parameter Value
VOCAB_SIZE 52,000
MAX_POSITION_EMBEDDINGS 1,024
ENCODER_LAYERS 6
ENCODER_FFN_DIM 3,072
ENCODER_ATTENTION_HEADS 12
DECODER_LAYERS 6
DECODER_FFN_DIM 3,072
DECODER_ATTENTION_HEADS 12
D_MODEL 768
DROPOUT 0.1
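These hyperparameters map directly onto a Hugging Face BartConfig; the following is a sketch of how such a base-sized model can be instantiated (the repository's actual setup is in pretrain_base.py):

```python
# Instantiating a BART model with the hyperparameters listed above.
from transformers import BartConfig, BartForConditionalGeneration

config = BartConfig(
    vocab_size=52_000,
    max_position_embeddings=1024,
    encoder_layers=6,
    encoder_ffn_dim=3072,
    encoder_attention_heads=12,
    decoder_layers=6,
    decoder_ffn_dim=3072,
    decoder_attention_heads=12,
    d_model=768,
    dropout=0.1,
)
model = BartForConditionalGeneration(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")
```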

The model is trained on 2 NVIDIA RTX A6000 GPUs for a total of 1.7 million steps. The pre-trained model is released for the community on the Hugging Face Hub (BART-IT).
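The released checkpoint can be loaded with the standard transformers API. Note that the Hub identifier below is an assumption based on the author's Hub account; check the Hub page linked above for the exact id:

```python
# Loading the pre-trained BART-IT checkpoint from the Hugging Face Hub.
# The model id is assumed; verify it on the Hub page linked above.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "morenolq/bart-it"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
```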

Model Fine-tuning

The model is fine-tuned on the abstractive summarization task using the parameters reported in the table below.

Parameter Value
MAX_NUM_EPOCHS 10
BATCH_SIZE 32
LEARNING_RATE 1e-5
MAX_INPUT_LENGTH 1024
MAX_TARGET_LENGTH 128

For more information about the model parameters, please refer to the summarization/finetune_summarization.py script and to the paper cited below.
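The table above corresponds roughly to the following Seq2SeqTrainer setup. This is only a sketch under assumed dataset column names and Hub id, not the repository's exact code:

```python
# Illustrative fine-tuning setup matching the hyperparameters above.
# The Hub id and the dataset columns ("source"/"target") are assumptions.
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_id = "morenolq/bart-it"  # assumed Hub id of the pre-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def preprocess(batch):
    # Tokenize articles and reference summaries with the lengths from the table.
    model_inputs = tokenizer(batch["source"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

training_args = Seq2SeqTrainingArguments(
    output_dir="bart-it-summarization",
    num_train_epochs=10,
    per_device_train_batch_size=32,
    learning_rate=1e-5,
    predict_with_generate=True,
)

# With `train_ds` / `eval_ds` tokenized via `preprocess`:
# trainer = Seq2SeqTrainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_ds,
#     eval_dataset=eval_ds,
#     data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
# )
# trainer.train()
```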

The model is fine-tuned on several summarization datasets; the weights for each dataset are released on the Hugging Face Hub and listed below:

  • News summarization: FanPage, model weights bart-it-fanpage (dataset paper: "Two New Datasets for Italian-Language Abstractive Text Summarization")
  • News summarization: IlPost, model weights bart-it-ilpost (dataset paper: "Two New Datasets for Italian-Language Abstractive Text Summarization")
  • Wikipedia summarization: WITS, model weights bart-it-WITS (dataset paper: "WITS: Wikipedia for Italian Text Summarization")
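Any of these checkpoints can be used for summarization out of the box. A usage sketch follows; the Hub id and the input text are placeholders, so check the list above for the exact checkpoint names:

```python
# Summarizing an Italian article with one of the fine-tuned checkpoints.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "morenolq/bart-it-fanpage"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

article = "Testo dell'articolo da riassumere..."  # placeholder input
inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt")
summary_ids = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```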

BART-IT is an efficient and accurate sequence-to-sequence model for the Italian language. Its performance is reported using both ROUGE and BERTScore metrics; please refer to the paper cited below for more details.

The script for evaluating the model on the summarization task is summarization/evaluate_summarization.py.
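The ROUGE and BERTScore computation can be reproduced with the evaluate library; a minimal sketch with placeholder predictions and references (the repository's own evaluation code is in summarization/evaluate_summarization.py):

```python
# Sketch of ROUGE / BERTScore computation with the `evaluate` library.
import evaluate

predictions = ["riassunto generato dal modello"]  # placeholder model outputs
references = ["riassunto di riferimento"]         # placeholder gold summaries

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

print(rouge.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="it"))
```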

Demo

A demo for the summarization of Italian text is available on Hugging Face Spaces. You can try it out there or run the app.py script available in the repository (you may need to install the gradio library to run it locally).
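A minimal Gradio app along these lines is sketched below; it is illustrative only, not the repository's app.py, and the Hub id is an assumption:

```python
# Minimal Gradio demo for Italian summarization (illustrative only).
import gradio as gr
from transformers import pipeline

summarizer = pipeline("summarization", model="morenolq/bart-it-fanpage")  # assumed Hub id

def summarize(text: str) -> str:
    return summarizer(text, max_length=128, truncation=True)[0]["summary_text"]

demo = gr.Interface(
    fn=summarize,
    inputs=gr.Textbox(lines=10, label="Testo da riassumere"),
    outputs=gr.Textbox(label="Riassunto"),
)
demo.launch()
```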

Citation and acknowledgments

If you use this code or the pre-trained model, please cite the following paper:

@Article{BARTIT,
    AUTHOR = {La Quatra, Moreno and Cagliero, Luca},
    TITLE = {BART-IT: An Efficient Sequence-to-Sequence Model for Italian Text Summarization},
    JOURNAL = {Future Internet},
    VOLUME = {15},
    YEAR = {2023},
    NUMBER = {1},
    ARTICLE-NUMBER = {15},
    URL = {https://www.mdpi.com/1999-5903/15/1/15},
    ISSN = {1999-5903},
    DOI = {10.3390/fi15010015}
}

If you use the FanPage or IlPost datasets, please cite the corresponding dataset paper (Two New Datasets for Italian-Language Abstractive Text Summarization).

If you use the WITS dataset, please cite the corresponding dataset paper (WITS: Wikipedia for Italian Text Summarization).

If you use the mC4 dataset, please refer to the original mT5 paper; if you are interested in the cleaned version of the dataset, please refer to the IT5 paper and to the cleaned mC4 repository.
