ArabicT5: Efficient Adaptation of T5 on Arabic Language

Model Description

This model adapts T5 on the Arabic Language by pre-training T5 on :

Arabic Wikipedia.
Marefa encyclopedia.
Hindawi Books.
a collection of Arabic News.
OSCAR Dataset (32GB)

This model uses an efficient implementation of T5 which reduces the fine-tuning and memory used Link and uses T5x for pre-training Link

Pre-training Settings and Results on TyDi QA Development Dataset ( Model in this card is highlighted in bold )

Model	Hidden Layer	Atten. head	Atten. Layers	Vocab	Hardware	Training Steps	Batch	Train x Batch Factor	Corpora
AraT5-base	768	12	12	110K	TPUv3-8	1M	128	1.0x	248GB 29B tokens (MSA + Tweets)
AraT5-msa-base	768	12	12	110K	TPUv3-8	1M	128	1.0x	70GB (MSA)
AraT5-tweets-base	768	12	12	110K	TPUv3-8	1M	128	1.0x	178GB (Tweets)
AraBART-base	768	12	12	50K	128 V100 GPUs (60h)	25 epochs	-	-	73GB (MSA)
mT5-base	768	12	12	250K	TPUv3-32	1M	1024	8.0x	6.3T tokens (mC4)
ArabicT5-17GB-small	512	8	20	32K	TPUv3-32	256K	256	0.5x	17GB (MSA)
ArabicT5-49GB-small	512	8	16	32K	TPUv3-64	500K	256	1.0x	49GB (MSA + OSCAR)
ArabicT5-17GB-base	768	12	16	32K	TPUv3-128	500K	512	2.0x	17GB (MSA)
ArabicT5-49GB-base	768	12	16	32K	TPUv3-64	500K	256	1.0x	49GB (MSA + OSCAR)
ArabicT5-17GB-large	768	12	36	32K	TPUv3-128	500K	512	2.0x	17GB (MSA)

Results on TyDi QA, HARD, Sentiment Analysis, Sarcasm Detection ( Best Score is highlighted in bold )

Model	TyDi QA	HARD	ArSarcasm-v2-Sentiment	ArSarcasm-v2-Sarcasm	XL-SUM
AraT5-base	70.4/84.2	96.5	69.7/72.6	60.4	30.3
AraT5-msa-base	70.9/84.0	96.5	70.0/72.7	60.7	27.4
AraT5-tweets-base	65.1/79.0	96.3	70.7/73.5	61.1	25.1
mT5-base	72.2/84.1	96.2	67.3/68.8	52.2	25.7
AraBART-base	48.8/71.2	96.1	66.2/68.2	56.3	31.2
ArabicT5-17GB-small	70.8/84.8	96.4	68.9/71.2	58.9	29.2
ArabicT5-49GB-small	72.4/85.1	96.4	70.2/73.4	61.0	30.2
ArabicT5-17GB-base	73.3/86.1	96.4	70.4/73.0	59.8	30.3
ArabicT5-49GB-base	72.1/85.1	96.5	71.3/74.1	60.4	30.9
ArabicT5-17GB-large	75.5/87.1	96.5	72.2/75.2	61.7	31.7

Evaluation Metrics: TyDi QA (EM/F1), HARD (Accuracy), Sentiment Analysis (Accuracy / F1-PN positive-negative), Sarcasm Detection (F1-sarcastic), XL-SUM (Rouge-L with Stemmer).

You can download the full details of our grid search for all models in all tasks above from this link: https://github.com/salrowili/ArabicT5/raw/main/ArabicT5_Grid_Search.zip

For the XL-Sum task, we choose our best run for each model using the eval set. We use the official evaluation script from XL-Sum, which uses the stemmer function, which may show better results than papers that don't use the stemmer function. The official XL-Sum paper uses a stemmer function.

FineTuning our efficient ArabicT5-49GB-Small model with Torch on 3070 laptop GPU

If you are running your code on a laptop GPU (e.g., a gaming laptop) or limited GPU memory, we recommended using our ArabicT5-49GB-Small model, which was the only model from the list that we were able to run on 3070 Laptop card with a batch size of 8. We manage to achieve an F1 score of 85.391 (slightly better than our FLAX code ) on the TyDi QA task.

FineTuning our ArabicT5 model on generative and abstractive tasks with FLAX

FineTuning ArabicT5 on TPUv3-8 with free Kaggle

https://www.kaggle.com/code/sultanalrowili/arabict5-on-tydi-with-free-tpuv3-8-with-kaggle

Continual Pre-Training of ArabicT5 with T5x

if you want to continue pre-training ArabicT5 on your own data, we have uploaded the raw t5x checkpoint to this link https://huggingface.co/sultan/ArabicT5-49GB-base/blob/main/arabict5_49GB_base_t5x.tar.gz We will soon share a tutorial on how you can do that for free with Kaggle TPU

Acknowledgment

We want to acknowledge the support we have from The TPU Research Cloud (TRC) team to grant us access to TPUv3 units.

Paper

Generative Approach for Gender-Rewriting Task with ArabicT5

Citation

@inproceedings{alrowili-shanker-2022-generative,
    title = "Generative Approach for Gender-Rewriting Task with {A}rabic{T}5",
    author = "Alrowili, Sultan  and
      Shanker, Vijay",
    booktitle = "Proceedings of the The Seventh Arabic Natural Language Processing Workshop (WANLP)",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.wanlp-1.55",
    pages = "491--495",
    abstract = "Addressing the correct gender in generative tasks (e.g., Machine Translation) has been an overlooked issue in the Arabic NLP. However, the recent introduction of the Arabic Parallel Gender Corpus (APGC) dataset has established new baselines for the Arabic Gender Rewriting task. To address the Gender Rewriting task, we first pre-train our new Seq2Seq ArabicT5 model on a 17GB of Arabic Corpora. Then, we continue pre-training our ArabicT5 model on the APGC dataset using a newly proposed method. Our evaluation shows that our ArabicT5 model, when trained on the APGC dataset, achieved competitive results against existing state-of-the-art methods. In addition, our ArabicT5 model shows better results on the APGC dataset compared to other Arabic and multilingual T5 models.",
}

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
ArabicT5_49GB_Small_on_3070_Laptop_GPU.ipynb		ArabicT5_49GB_Small_on_3070_Laptop_GPU.ipynb
ArabicT5_Grid_Search.zip		ArabicT5_Grid_Search.zip
FineTuning_ArabicT5_with_FLAX_and_TPU.ipynb		FineTuning_ArabicT5_with_FLAX_and_TPU.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ArabicT5: Efficient Adaptation of T5 on Arabic Language

Model Description

Pre-training Settings and Results on TyDi QA Development Dataset ( Model in this card is highlighted in bold )

Results on TyDi QA, HARD, Sentiment Analysis, Sarcasm Detection ( Best Score is highlighted in bold )

FineTuning our efficient ArabicT5-49GB-Small model with Torch on 3070 laptop GPU

FineTuning our ArabicT5 model on generative and abstractive tasks with FLAX

FineTuning ArabicT5 on TPUv3-8 with free Kaggle

Continual Pre-Training of ArabicT5 with T5x

Acknowledgment

Paper

Citation

About

Releases

Packages

Languages

salrowili/ArabicT5

Folders and files

Latest commit

History

Repository files navigation

ArabicT5: Efficient Adaptation of T5 on Arabic Language

Model Description

Pre-training Settings and Results on TyDi QA Development Dataset ( Model in this card is highlighted in bold )

Results on TyDi QA, HARD, Sentiment Analysis, Sarcasm Detection ( Best Score is highlighted in bold )

FineTuning our efficient ArabicT5-49GB-Small model with Torch on 3070 laptop GPU

FineTuning our ArabicT5 model on generative and abstractive tasks with FLAX

FineTuning ArabicT5 on TPUv3-8 with free Kaggle

Continual Pre-Training of ArabicT5 with T5x

Acknowledgment

Paper

Citation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages