Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition

This repository shares our labeled datasets, extracted corpora, code and scripts for the exploratory analysis, the multivariate machine learning classifiers and clustering models, and the implementation and deployment of the best-performing classifier as a web-based detection system called "Egyptian Arabic Wikipedia Scanner". All of these are introduced in our accepted paper, Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition, presented at the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT6), co-located with LREC-COLING 2024, 20-25 May 2024.

Exploratory Analysis:
Experimental Setups:
- Dataset Filtering, Labeling, and Cleaning
- Dataset Encoding Using Spark-NLP & CAMeLBERT:
  - Encoding with Spark-NLP (Egyptian Word2Vec-CBOW 300D)
  - Encoding with CAMeLBERT (CAMeLBERT-Mix POS-EGY Model)
Template Translation Detection:
- Supervised Classification Algorithms:
- Unsupervised Clustering Algorithms:
Web-based Detection System/Application:
1. Best-performing Classifier, XGBoost
2. Egyptian Arabic Wikipedia Scanner:
  - Streamlit Community Cloud
  - Hugging Face Spaces
Corpora and Datasets:
- Arabic Wikipedia Corpora:
- Egyptian Template-translated Articles:
Paper Citations:

Saied Alshahrani, Hesham Haroon, Ali Elfilali, Mariama Njie, and Jeanna Matthews. 2024. Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition. In Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024, pages 31–45, Torino, Italia. ELRA and ICCL.
