Implementation of NLP models in Python for classifying the sentiment of Persian comments.
See requirements.txt and run pip install -r requirements.txt to install the dependencies.
- numpy for some calculations
- pandas for data read/write
- scikit-learn for classifiers
- gensim for the Word2Vec model
Sentiment analysis (also known as opinion mining or emotion AI) is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine [Wiki]. The following figure shows the workflow of sentiment analysis [monkeylearn].
-
This directory contains data used in the project.
  - train.csv has labeled comments for training (we split it into training and validation sets to find the best model).
  - test.csv has unlabeled comments; we predict the label of each row using our proposed model and save the result in test-labeled.csv.
-
This directory contains some files for being used in NLP models (e.g. stop-words).
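
As a rough illustration of how a stop-word file might be used, the sketch below drops stop-words from a whitespace-tokenized comment. The helper names, the tiny stop-word list, and the whitespace tokenizer are all assumptions made for this example; the project's actual files and preprocessing may differ.

```python
def load_stop_words(lines):
    """Build a stop-word set from an iterable of lines (e.g. an open file)."""
    return {line.strip() for line in lines if line.strip()}


def remove_stop_words(text, stop_words):
    """Split on whitespace and drop tokens that appear in stop_words."""
    return [token for token in text.split() if token not in stop_words]


# Hypothetical stop-word list; a real one would be read from a file,
# one word per line.
stop_words = load_stop_words(["و", "در", "به", "از", ""])
tokens = remove_stop_words("کیفیت این محصول از نظر من عالی بود", stop_words)
print(tokens)
```

With a real file, `load_stop_words(open(path, encoding="utf-8"))` would build the same kind of set, where `path` points at whichever stop-word file is used.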
-
Contains the steps for finding the best method to apply to the data. In this file, we use only train.csv and split it into training and validation sets. Here are the final results of all 18 models employed in this phase (nine classifiers, each with TF-IDF and Word2Vec features):
Classifiers | TF-IDF | Word2Vec |
---|---|---|
KNN(n=4) | 0.4619 | 0.6131 |
KNN(n=8) | 0.6940 | 0.6298 |
KNN(n=16) | 0.7524 | 0.6238 |
SVM(linear) | 0.7905 | 0.4774 |
SVM(poly) | 0.6417 | 0.6452 |
SVM(rbf) | 0.7905 | 0.6810 |
XGB(n=50) | 0.7214 | 0.6571 |
XGB(n=100) | 0.7298 | 0.6786 |
XGB(n=150) | 0.7381 | 0.6714 |
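
A comparison like the one in the table could be sketched as follows: fit a TF-IDF vectorizer on the training split, then score several scikit-learn classifiers on the validation split. The toy English data, the 75/25 split, and the use of accuracy as the score are assumptions for illustration, not details taken from the project. The XGB and Word2Vec columns are left out to keep the example short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Toy stand-in data (hypothetical); the project reads comments from train.csv.
texts = [
    "good product", "very good quality", "great and good", "really great quality",
    "bad product", "very bad quality", "awful and bad", "really awful quality",
] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

# Hold out part of the labeled data for validation.
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Fit the vectorizer on the training split only, then reuse it for validation.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

models = {
    "KNN(n=4)": KNeighborsClassifier(n_neighbors=4),
    "SVM(linear)": SVC(kernel="linear"),
    "SVM(rbf)": SVC(kernel="rbf"),
}

scores = {}
for name, model in models.items():
    model.fit(X_train_vec, y_train)
    # score() returns mean accuracy on the validation split.
    scores[name] = model.score(X_val_vec, y_val)
    print(f"{name}: {scores[name]:.4f}")
```

Fitting the vectorizer on the training split alone keeps validation vocabulary and document frequencies from leaking into the features.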
-
This file implements the proposed method found in phase1. The result of applying the proposed model to test.csv can be seen in test-labeled.csv.
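
The final step might look roughly like the sketch below: train one model on all labeled comments, predict labels for the unlabeled ones, and write the result as CSV. The column names `comment` and `label` and the in-memory stand-ins for train.csv and test.csv are assumptions. A linear SVM on TF-IDF features is used only because it is among the top scores in the phase-1 table; the source does not state which model was actually selected.

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical in-memory stand-ins for train.csv and test.csv.
# The column names "comment" and "label" are assumptions.
train_df = pd.DataFrame({
    "comment": ["good product", "great quality", "bad product", "awful quality"],
    "label": [1, 1, 0, 0],
})
test_df = pd.DataFrame({"comment": ["good quality", "bad quality"]})

# Vectorizer and classifier chained so both are fitted on the training data.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(train_df["comment"], train_df["label"])

# Attach the predicted labels to a copy of the unlabeled rows.
labeled = test_df.copy()
labeled["label"] = model.predict(test_df["comment"])

# Written to a buffer here; a real run would pass a file path instead.
buffer = io.StringIO()
labeled.to_csv(buffer, index=False)
print(buffer.getvalue())
```

In an actual run, `labeled.to_csv("test-labeled.csv", index=False)` would produce the output file directly.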