BigData_Amazon_reviews_ETL_Cloud

Performing cloud-based ETL and analyzing the data with SQL, a Natural Language Processing (NLP) pipeline, and several Machine Learning models.

Project Background

Many Amazon shoppers depend on product reviews when deciding what to buy, so analyzing the large volume of Amazon reviews is increasingly important. This project takes one of the public Amazon datasets hosted in AWS S3 buckets (Beauty product reviews, with over 5 million records), extracts, transforms, and loads it into an AWS RDS database, and analyzes the data statistically to determine whether Vine reviews are biased. The Amazon Vine program is supposed to offer unbiased reviews to consumers; vendors should have no influence on these Vine reviews.

This project was performed with Spark, entirely in the cloud, using Google Colab Notebook and AWS RDS.

Cloud-Based ETL Part

  1. Extract
  2. Transform
  • Cleaned and transformed the dataset to fit the provided schema of the AWS RDS database.
  3. Load
  • Wrote the data directly into the corresponding tables in the Amazon RDS instance.
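The three steps above can be sketched in PySpark roughly as follows. This is a minimal illustration, not the project's notebook code: it assumes the RDS instance runs PostgreSQL, that the source file is a tab-separated file with a header row, and that the target table keeps the columns shown. The function names, the column subset, and the connection details are hypothetical.

```python
def jdbc_url(host, port, database):
    """Build a JDBC URL (assumes a PostgreSQL RDS instance)."""
    return f"jdbc:postgresql://{host}:{port}/{database}"


def extract_reviews(spark, source_path):
    """Extract: read a tab-separated review file into a Spark DataFrame.

    `spark` is an existing SparkSession; `source_path` must be a location
    that Spark can read in your environment.
    """
    return spark.read.csv(source_path, sep="\t", header=True, inferSchema=True)


def transform_reviews(df):
    """Transform: keep the columns of a hypothetical target table, drop incomplete rows."""
    columns = ["review_id", "star_rating", "helpful_votes", "total_votes", "vine"]
    return df.select(*columns).dropna()


def load_to_rds(df, table, url, user, password):
    """Load: append the DataFrame to an RDS table over JDBC."""
    properties = {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",  # assumption: PostgreSQL JDBC driver is on the classpath
    }
    df.write.jdbc(url=url, table=table, mode="append", properties=properties)
```

In a Colab notebook, these would be chained as `load_to_rds(transform_reviews(extract_reviews(spark, path)), "vine_table", url, user, password)`, where `vine_table` is a placeholder table name.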

Analysis Part

  1. SQL statistical Analysis

Performed a basic statistical analysis using SQL.
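As an illustration of the kind of query such an analysis might use, the sketch below compares Vine and non-Vine reviews by review count, average star rating, and share of five-star reviews. It runs against an in-memory SQLite database so it is self-contained; the table name `vine_table` is an assumption, and the handful of rows are made-up placeholders, not project data or results.

```python
import sqlite3

# SQL that summarizes ratings separately for Vine ('Y') and non-Vine ('N') reviews.
VINE_SUMMARY_SQL = """
SELECT vine,
       COUNT(*)                                             AS n_reviews,
       AVG(star_rating)                                     AS avg_stars,
       AVG(CASE WHEN star_rating = 5 THEN 1.0 ELSE 0.0 END) AS five_star_share
FROM vine_table
GROUP BY vine
ORDER BY vine;
"""


def vine_summary(conn):
    """Return one summary row per value of the `vine` flag."""
    return conn.execute(VINE_SUMMARY_SQL).fetchall()


# Toy setup: placeholder rows only, to make the query runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vine_table (review_id TEXT, star_rating INTEGER, vine TEXT)")
conn.executemany(
    "INSERT INTO vine_table VALUES (?, ?, ?)",
    [("r1", 5, "Y"), ("r2", 3, "Y"),
     ("r3", 5, "N"), ("r4", 4, "N"), ("r5", 4, "N"), ("r6", 3, "N")],
)

for row in vine_summary(conn):
    print(row)
```

The same `SELECT` statement should carry over to the RDS database with little or no change, since it uses only standard aggregate functions.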

Conclusion: SQLAnalysisConclusion.md

  2. Machine Learning Analysis in Colab Notebook

Built statistical classification and regression models for prediction and evaluation with PySpark, presented in a Google Colab Notebook.

Conclusion: ML_AnalysisConclusion.md


Natural Language Processing (NLP) with Naive Bayes ML model:

Built an NLP preprocessing pipeline consisting of StringIndexer, Tokenizer, StopWordsRemover, HashingTF, and IDF, then fed its output into a NaiveBayes model to predict whether or not a review is a Vine review.
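A pipeline with these stages could be assembled along the following lines. This is a sketch under assumptions rather than the notebook's actual code: the input column names (`review_body` for the text, `vine` for the label), the intermediate column names, and the helper name `build_vine_pipeline` are illustrative. The PySpark imports sit inside the function so the definition can be loaded before Spark is set up; calling it requires an active SparkSession.

```python
def build_vine_pipeline(text_col="review_body", label_col="vine", num_features=1 << 18):
    """Assemble StringIndexer -> Tokenizer -> StopWordsRemover -> HashingTF -> IDF -> NaiveBayes."""
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import NaiveBayes
    from pyspark.ml.feature import (
        HashingTF, IDF, StopWordsRemover, StringIndexer, Tokenizer,
    )

    stages = [
        # Encode the string label (e.g. 'Y'/'N') as a numeric index.
        StringIndexer(inputCol=label_col, outputCol="label"),
        # Split the review text into lowercase tokens.
        Tokenizer(inputCol=text_col, outputCol="tokens"),
        # Drop common stop words.
        StopWordsRemover(inputCol="tokens", outputCol="filtered"),
        # Hash tokens into a fixed-size term-frequency vector.
        HashingTF(inputCol="filtered", outputCol="tf", numFeatures=num_features),
        # Reweight term frequencies by inverse document frequency.
        IDF(inputCol="tf", outputCol="features"),
        # Multinomial Naive Bayes on the non-negative TF-IDF features.
        NaiveBayes(featuresCol="features", labelCol="label"),
    ]
    return Pipeline(stages=stages)
```

A typical use would be to split the data with `train_df, test_df = df.randomSplit([0.7, 0.3])`, fit with `model = build_vine_pipeline().fit(train_df)`, and score with `model.transform(test_df)`. Rows with a null review text should be dropped first, since `Tokenizer` cannot process null input.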

Classification and Regression: