In this project I used the QADI dataset to classify the different dialects of Arabic language. I used machine learning methods specifically logestic regression, linear SVM and naive bayes. I also tried to compare those methods with deep learning models namely LSTM and word embeddings. I did many my experiments locally on my machine. I deployed my project using Flask
There are 4 main files in this project
- 4
.py
scripts (Data_fetching.py
,Data_pre_processing.py
,Model_training.py
andapp.py
) with the final results/code
There are also
- 2 Jupyter notebooks
fetching and processing.ipynb
andML and DL training.ipynb
with detailed code of all the experiments that I did - a
dicts.py
file which is there just to help in predicting during Flask deployment. it contains a processing function to process the user's input text - a picture
final model.png
of the final deep learning model - the Flask App's files like HTML and CSS files
- a presentation file
Tweeter dialect classification.pptx
- The dataset
- Anaconda is a must
- tensorflow
- flask
- farasapy (you need to install java in order to work)
- PyArabic
- gensim
- this code was run successfully on my windows machine.
- it's recommended to create a new anaconda environment with
conda create -n tf tensorflow
conda activate tf
- you need to install the dependencies
conda install pandas
conda install scikit-learn
conda install -c anaconda flask
conda install -c anaconda gensim
conda install tensorflow
conda install -c conda-forge matplotlib
pip install farasapy
pip install PyArabic
- if you faced any problem trying to use jupyter with this message "'jupyter' is not recognized as an internal or external command", install jupyter
pip install notebook
- please install java for
farasapy
to work - in order to train the models that require pretrained word embedding you need to download word embedding from
- Mazajak specifically the CBOW words that were trained on 100M tweets (this is required to run
Model_training.py
) - AraVec specifically the Unigrams CBOW Models with vector size of 100
- you need to run
.py
scripts in the right order in the command line. - if you're intersted you can open the jupyter notebook for full detailed code/experiments
- type
flask run
in the command line to run the flask app in your browser
- Results of deep learning on the validation set
Model name | Accuracy | F1 score |
---|---|---|
Embedding layer without lstm from scratch | 0.523 | 0.494 |
LSTM from scratch | 0.455 | 0.399 |
Embedding layer with finetuned AraVec | 0.524 | 0.493 |
Embedding layer with finetuned Mazajak | 0.526 | 0.497 |
LSTM with fixed pretrained embedding Mazajak | 0.313 | 0.181 |
LSTM with fixed pretrained embedding AraVec | 0.125 | 0.012 |
- results of the machine learning models on the validation set
Model name | Accuracy | F1 score |
---|---|---|
uni-gram(Tf-idf) SVM | 0.512 | 0.478 |
two-gram(Tf-idf) SVM | 0.538 | 0.507 |
- comparison of the deep learning and machine learning on the test set
Model name | Accuracy | F1 score |
---|---|---|
two gram SVM | 0.5388 | 0.5072 |
Embedding layer with finetuned Mazajak | 0.5288 | 0.5024 |
name: Bassel Ali Mahmoud
email: [email protected]