Twitter dialect classification

Description

In this project I used the QADI dataset to classify the different dialects of the Arabic language. I used machine learning methods, specifically logistic regression, linear SVM, and Naive Bayes, and compared them with deep learning models, namely LSTMs and word embeddings. I ran most of my experiments locally on my machine, and I deployed the project using Flask.
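For illustration, the classical pipeline can be sketched as TF-IDF features feeding a linear SVM. This is a minimal sketch, not the exact code in Model_training.py; the file name and the text/dialect column names are placeholders.

```python
# Minimal sketch of the classical baseline: TF-IDF features + linear SVM.
# "qadi_train.csv" and the "text"/"dialect" column names are placeholders.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

df = pd.read_csv("qadi_train.csv")
X_train, X_val, y_train, y_val = train_test_split(
    df["text"], df["dialect"], test_size=0.2, random_state=42)

# Word 1-2 grams weighted by TF-IDF, as in the two-gram SVM results below.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
clf = LinearSVC().fit(vectorizer.fit_transform(X_train), y_train)

pred = clf.predict(vectorizer.transform(X_val))
print("accuracy:", accuracy_score(y_val, pred))
print("macro F1:", f1_score(y_val, pred, average="macro"))  # averaging method is an assumption
```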
There are 4 main files in this project:

  • 4 .py scripts (Data_fetching.py, Data_pre_processing.py, Model_training.py and app.py) with the final results/code

There are also:

  • 2 Jupyter notebooks (fetching and processing.ipynb and ML and DL training.ipynb) with the detailed code of all the experiments I ran
  • a dicts.py file that helps with prediction during the Flask deployment; it contains a processing function to clean the user's input text (see the sketch after this list)
  • a picture, final model.png, of the final deep learning model
  • the Flask app's files, such as the HTML and CSS files
  • a presentation file, Tweeter dialect classification.pptx
  • the dataset
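To give an idea of what that processing function might look like, here is a hedged sketch using PyArabic; the actual steps in dicts.py may differ.

```python
# Sketch of a tweet-cleaning helper in the spirit of dicts.py; the real
# function in this repo may apply different or additional steps.
import re
from pyarabic import araby

def process_text(text: str) -> str:
    text = re.sub(r"https?://\S+|@\w+|#", " ", text)  # drop URLs, mentions, '#'
    text = araby.strip_tashkeel(text)                 # remove diacritics
    text = araby.strip_tatweel(text)                  # remove elongation (tatweel)
    text = re.sub(r"[^\u0600-\u06FF\s]", " ", text)   # keep Arabic letters only
    return re.sub(r"\s+", " ", text).strip()
```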

Getting Started

Dependencies

  • Anaconda is a must
  • tensorflow
  • flask
  • farasapy (you need to install Java for it to work)
  • PyArabic
  • gensim

Installing

  • This code was run successfully on my Windows machine.
  • It's recommended to create a new Anaconda environment with
conda create -n tf tensorflow
conda activate tf
  • Then install the dependencies:
conda install pandas
conda install scikit-learn 
conda install -c anaconda flask
conda install -c anaconda gensim
conda install tensorflow
conda install -c conda-forge matplotlib
pip install farasapy
pip install PyArabic
  • If you face any problem trying to use Jupyter and get the message "'jupyter' is not recognized as an internal or external command", install Jupyter:
    pip install notebook
  • Please install Java for farasapy to work.
  • In order to train the models that require pretrained word embeddings, you need to download the embeddings from the sources below (a loading sketch follows this list):
  1. Mazajak, specifically the CBOW vectors that were trained on 100M tweets (required to run Model_training.py)
  2. AraVec, specifically the Unigrams CBOW Models with a vector size of 100
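A minimal sketch of loading such pretrained vectors with gensim and turning them into an embedding matrix for a Keras Embedding layer; the file name, the binary word2vec format, and the vector size of 100 are assumptions, so check each download page for the actual format.

```python
# Sketch: load pretrained CBOW vectors and build a Keras embedding matrix.
# "cbow_100.bin" is a placeholder; whether the file is in binary word2vec
# format (and its vector size) depends on the download you chose.
import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("cbow_100.bin", binary=True)

def build_embedding_matrix(word_index, dim=100):
    """word_index maps tokens to integer ids, e.g. from a Keras Tokenizer."""
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, i in word_index.items():
        if word in vectors:
            matrix[i] = vectors[word]  # words missing from the vectors stay zero
    return matrix
```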

Executing program

  • You need to run the .py scripts in the right order from the command line (see the example order after this list).
  • If you're interested, you can open the Jupyter notebooks for the full, detailed code of all the experiments.
  • Type flask run in the command line to run the Flask app in your browser.
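For reference, a plausible run order inferred from the script names (adjust if your setup differs):

python Data_fetching.py
python Data_pre_processing.py
python Model_training.py
flask run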

Results

  • Results of deep learning on the validation set

| Model name | Accuracy | F1 score |
| --- | --- | --- |
| Embedding layer without LSTM, from scratch | 0.523 | 0.494 |
| LSTM from scratch | 0.455 | 0.399 |
| Embedding layer with finetuned AraVec | 0.524 | 0.493 |
| Embedding layer with finetuned Mazajak | 0.526 | 0.497 |
| LSTM with fixed pretrained Mazajak embedding | 0.313 | 0.181 |
| LSTM with fixed pretrained AraVec embedding | 0.125 | 0.012 |

  • Results of the machine learning models on the validation set

| Model name | Accuracy | F1 score |
| --- | --- | --- |
| Uni-gram (TF-IDF) SVM | 0.512 | 0.478 |
| Two-gram (TF-IDF) SVM | 0.538 | 0.507 |

  • Comparison of deep learning and machine learning on the test set

| Model name | Accuracy | F1 score |
| --- | --- | --- |
| Two-gram SVM | 0.5388 | 0.5072 |
| Embedding layer with finetuned Mazajak | 0.5288 | 0.5024 |

Author

name: Bassel Ali Mahmoud
email: [email protected]

Acknowledgments