The objective of this project is to implement a fake news detection model with Apache Spark, using PySpark in Google Colab.
The libraries below are installed to implement this project. PySpark is installed so the project can be written for Apache Spark in Python. Elephas is installed to integrate Keras with Spark. Elephas supports only certain versions of Keras and TensorFlow, so Keras 2.2.4 and TensorFlow 1.14.0 are installed.
pip install pyspark
pip install -q keras==2.2.4
pip install -q tensorflow==1.14.0
pip install elephas
The fake news dataset is taken from Kaggle: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset. It contains two separate CSV files, one for real news and one for fake news. The two files are combined into a single dataset by adding a label to each row: Real or Fake. Each file has title, text, subject and date columns. Fake news detection is performed on three different input features: the title, the text of the news article, and the concatenation of title and text.
Text processing is performed using RegexTokenizer, Word2Vec and StringIndexer. Fake news detection is implemented by training three algorithms on the data: Decision Tree, Gradient Boosting and a Neural Network. The Gradient Boosting parameters are tuned using 3-fold cross-validation. The neural network is built in Keras and run on Apache Spark through Elephas. The models are evaluated on accuracy, AUC, F1 score and the confusion matrix.
Among the three models, the neural network performed best on all three features: title, text and title-text. The results for the three models are shown below.
Feature | Model | Accuracy | No. of false positives | No. of false negatives |
---|---|---|---|---|
Title | Decision Tree | 88.07% | 477 | 601 |
Title | Gradient Boosting | 90.06% | 430 | 468 |
Title | Neural Network | 91.41% | 405 | 371 |
Text | Decision Tree | 90.08% | 571 | 313 |
Text | Gradient Boosting | 93.83% | 256 | 294 |
Text | Neural Network | 98.53% | 48 | 83 |
Title-text | Decision Tree | 91% | 358 | 457 |
Title-text | Gradient Boosting | 93.79% | 262 | 300 |
Title-text | Neural Network | 98.60% | 61 | 66 |
Comparing how each model performed across the individual features, Decision Tree performed best with the title-text feature, Gradient Boosting with the text feature, and the Neural Network with both the text and title-text features.
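For reference, accuracy and F1 score can both be derived from the four cells of a confusion matrix. The helper below is a small plain-Python sketch of those formulas. The class and function names are hypothetical, the counts in the usage note are illustrative only and are not taken from the project's results, and which class (Real or Fake) is treated as positive is left to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Cells of a binary confusion matrix."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def accuracy(c: ConfusionCounts) -> float:
    """Share of all predictions that were correct."""
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def precision(c: ConfusionCounts) -> float:
    """Share of positive predictions that were correct."""
    predicted_pos = c.tp + c.fp
    return c.tp / predicted_pos if predicted_pos else 0.0


def recall(c: ConfusionCounts) -> float:
    """Share of actual positives that were found."""
    actual_pos = c.tp + c.fn
    return c.tp / actual_pos if actual_pos else 0.0


def f1_score(c: ConfusionCounts) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

With illustrative counts such as `ConfusionCounts(tp=90, fp=10, tn=85, fn=15)`, `accuracy` gives 0.875 and `precision` gives 0.9. Reporting false positives and false negatives alongside accuracy, as the table does, shows whether a model's errors lean toward one class.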