Fake-News-Detection-using-Spark

The objective of this project is to build a model that detects fake news, using Apache Spark through PySpark in Google Colab.

Libraries Installed:

The libraries below are installed with the following commands. PySpark is installed so the project can be written for Apache Spark in Python. Elephas is installed to integrate Keras with Spark. Elephas supports only certain versions of Keras and TensorFlow, so Keras 2.2.4 and TensorFlow 1.14.0 are installed.

pip install pyspark

pip install -q keras==2.2.4

pip install -q tensorflow==1.14.0

pip install elephas

Dataset

The fake news dataset is taken from Kaggle: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset. It comes as two separate CSV files, one for real news and one for fake news. The two files are combined into a single dataset by adding a label with the values Real and Fake. Each file has title, text, subject and date features. Fake news detection is performed on three different features: the title, the text of the news article, and the concatenation of title and text.

Modelling

Text processing is performed using RegexTokenizer, Word2Vec and StringIndexer. Fake news detection is implemented by training three algorithms on the data: Decision Tree, Gradient Boosting and a Neural Network. Parameter tuning for Gradient Boosting is done using 3-fold cross-validation. The neural network is built in Keras and run on Apache Spark through Elephas. The models are evaluated on accuracy, AUC, F1 score and the confusion matrix.

Results

Among the three models, the neural network performed best on all three features: title, text and title-text. The results for the three models are shown below.

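| Feature | Model | Accuracy | No. of false positives | No. of false negatives |
|---|---|---|---|---|
| Title | Decision Tree | 88.07% | 477 | 601 |
| Title | Gradient Boosting | 90.06% | 430 | 468 |
| Title | Neural Network | 91.41% | 405 | 371 |
| Text | Decision Tree | 90.08% | 571 | 313 |
| Text | Gradient Boosting | 93.83% | 256 | 294 |
| Text | Neural Network | 98.53% | 48 | 83 |
| Title-text | Decision Tree | 91% | 358 | 457 |
| Title-text | Gradient Boosting | 93.79% | 262 | 300 |
| Title-text | Neural Network | 98.60% | 61 | 66 |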

Comparing each model's performance across the individual features, Decision Tree performed best with the title-text feature, Gradient Boosting with the text feature, and the Neural Network about equally well with the text and title-text features.
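The comparison above can be checked directly from the reported numbers. The short script below copies the values from the results table and picks, for each model, the feature with the highest accuracy; the dictionary layout and the helper name `best_feature` are just conveniences for this example.

```python
# (feature, model) -> (accuracy in %, false positives, false negatives),
# copied from the results table.
results = {
    ("Title", "Decision Tree"): (88.07, 477, 601),
    ("Title", "Gradient Boosting"): (90.06, 430, 468),
    ("Title", "Neural Network"): (91.41, 405, 371),
    ("Text", "Decision Tree"): (90.08, 571, 313),
    ("Text", "Gradient Boosting"): (93.83, 256, 294),
    ("Text", "Neural Network"): (98.53, 48, 83),
    ("Title-text", "Decision Tree"): (91.00, 358, 457),
    ("Title-text", "Gradient Boosting"): (93.79, 262, 300),
    ("Title-text", "Neural Network"): (98.60, 61, 66),
}


def best_feature(model):
    """Return the feature on which the given model reached its highest accuracy."""
    scores = {feat: acc for (feat, m), (acc, _, _) in results.items() if m == model}
    return max(scores, key=scores.get)


for model in ("Decision Tree", "Gradient Boosting", "Neural Network"):
    print(model, "->", best_feature(model))

# Gap between the neural network's two best features, in percentage points.
nn_gap = round(
    results[("Title-text", "Neural Network")][0] - results[("Text", "Neural Network")][0], 2
)
```

For the neural network the text and title-text accuracies differ by only 0.07 percentage points, which is why the two features are treated as performing about equally.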

Reference:

https://spark.apache.org/docs/latest/ml-guide.html

https://github.com/maxpumperla/elephas
