This is my repository for the Amazon review helpfulness prediction model project, using PySpark for machine learning and data cleaning.
Last updated: 10/29/2017
The motivation here is to repeat and improve on what I did for the Amazon review helpfulness prediction project, this time entirely in PySpark.
- Spark should make data cleaning, TF-IDF, and machine learning faster by distributing the computation.
- It is a learning experience: I want to learn more about Spark, Hive, and SQL.
- Rewrite and improve the original Python code in PySpark.
- Compare the original results to the PySpark results (accuracy and computation time).
- Connect to EMR and run with AWS support on an even bigger dataset (Home & Kitchen data?).
Unlike my previous project, I will post lots of notes in this repo.
I find that there aren't many Stack Overflow posts on PySpark compared to the other languages that Spark supports (Scala, Java).
So I want to put notes in this repo in the hope that they help someone who is new to Spark (just like me).
notes_on_data_cleaning_process.md
- Notes on the first step, the data cleaning process.
notes_on_tfidf_nmf_process.md
- Notes on the TF-IDF, NMF, and NLP-related functions.
notes_on_ml.md (in the future)
- Notes on the ML library used in this project.
notes_on_EMR.md (in the future)
- Notes on EMR and how to set it up.
In the previous version, null values in the Rank_Value feature were replaced with the average value.
Because the Rank_Value feature represents the rank of products, imputing it with the average value is not realistic: a product with no rank would be treated as a mid-ranked one.
Instead, I am imputing these nulls with (max of Rank_Value) + 1.
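
The imputation rule above can be sketched in plain Python. This is only an illustration of the logic, not the code from this repo: the function name and the example values are hypothetical, and in PySpark the same idea would probably be an aggregate max followed by `DataFrame.fillna`.

```python
from typing import List, Optional


def impute_rank_with_max_plus_one(ranks: List[Optional[int]]) -> List[int]:
    """Replace missing ranks (None) with one more than the largest observed rank."""
    observed = [r for r in ranks if r is not None]
    if not observed:
        # Without any observed rank there is no max to build the fill value from.
        raise ValueError("cannot impute: no observed rank values")
    fill_value = max(observed) + 1
    return [fill_value if r is None else r for r in ranks]


# Hypothetical Rank_Value column with two nulls.
example_ranks = [12, None, 3050, 47, None]
imputed_ranks = impute_rank_with_max_plus_one(example_ranks)
print(imputed_ranks)  # [12, 3051, 3050, 47, 3051]
```

Because the fill value is larger than every observed rank, unranked products sort after all ranked ones instead of landing in the middle.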
Previously, I imputed the price feature with the average value and then ran the train-test split.
Since the price feature was the most important feature, I went back to check whether I had made a mistake.
The mistake was imputing with the average value before the train-test split: an average computed over the full dataset lets information from the test data leak into the training data.
Because the average values differ between the train and test data, I needed some other way to impute the price feature.
What I ended up doing was converting the price values into categorical values:
- below20
- below50
- below100
- below300
- above300
- unknown (nulls)
With these two changes, I can do the train-test split after running my data cleaning process.
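
A minimal plain-Python sketch of the price bucketing follows. The category labels come from the list above, but the exact boundary handling is my assumption (each bucket excludes its upper bound, so a price of exactly 300 counts as above300), and the function and constant names are hypothetical. In PySpark this would likely be a chain of `when(...).otherwise(...)` expressions instead.

```python
from typing import Optional

# (exclusive upper bound, label) pairs, checked in ascending order.
# Boundary handling is an assumption, not taken from the original code.
PRICE_BUCKETS = [
    (20.0, "below20"),
    (50.0, "below50"),
    (100.0, "below100"),
    (300.0, "below300"),
]


def price_to_category(price: Optional[float]) -> str:
    """Map a price to its bucket label; nulls get their own 'unknown' label."""
    if price is None:
        return "unknown"
    for upper_bound, label in PRICE_BUCKETS:
        if price < upper_bound:
            return label
    return "above300"


# Hypothetical prices, including a null and two boundary values.
example_prices = [9.99, 20.0, 75.5, 299.99, 300.0, None]
categories = [price_to_category(p) for p in example_prices]
print(categories)
# ['below20', 'below50', 'below100', 'below300', 'above300', 'unknown']
```

Since each row is mapped using only its own value, nothing is computed across rows, which is why the split can safely happen after cleaning.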
In the previous model, I had a TF-IDF matrix with the top 500 words as features (500 features).
In addition to these features, I had NMF results calculated from a TF-IDF matrix with the top 10,000 words as features (8 features).
My question regarding this structure:
Are these 500 TF-IDF terms causing multicollinearity and influencing the feature importances of my RF model?
Since I have both TF-IDF and NMF features, and the NMF features are derived from TF-IDF terms, there should be some correlation between them.
To be safe and simple, I'm removing the TF-IDF terms and working with the NMF latent features only (instead of 8, let's try 15).
Video Game Product reviews:
- Ran the original version on my laptop (MacBook Pro 2015).
- Ran the PySpark version with 4 CPUs on my laptop and with 9 r3.xlarge EMR instances (1 master, 8 nodes).
(Data cleaning reduces the data to about 4,300 samples; training: 2,800 samples, test: 636 samples.)
| Step | Original ver. | PySpark laptop | PySpark EMR |
|---|---|---|---|
| Data cleaning | 00:06:12:38 | 00:10:39:36 | 00:00:00:00 |
| TF-IDF + NMF | 00:04:46:44 | Memory prob. | 00:00:00:00 |
| Random Forest | 00:00:22:00 | 00:00:00:00 | 00:00:00:00 |
| Total | 00:11:20:82 | 00:00:00:00 | 00:00:00:00 |
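
The totals in the timing table can be reproduced with a small helper. This sketch assumes the four fields are hours, minutes, seconds, and hundredths of a second, which is my reading of the format (it is consistent with the Total row for the original version); the function names are hypothetical.

```python
def to_centiseconds(stamp: str) -> int:
    """Convert an 'HH:MM:SS:CC' stamp to hundredths of a second (assumed format)."""
    hours, minutes, seconds, centis = (int(part) for part in stamp.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 100 + centis


def to_stamp(total_centis: int) -> str:
    """Convert hundredths of a second back to an 'HH:MM:SS:CC' stamp."""
    seconds, centis = divmod(total_centis, 100)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}:{centis:02d}"


# Step timings from the "Original ver." column of the table above.
original_steps = ["00:06:12:38", "00:04:46:44", "00:00:22:00"]
total = to_stamp(sum(to_centiseconds(step) for step in original_steps))
print(total)  # 00:11:20:82
```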
TOP10 Important Features
enjoy : 0.907305699071%
percent_GROUP_4 : 0.960251319721%
sort : 1.10448132333%
die : 1.12857298946%
surprise : 1.22353217363%
price : 1.42303649998%
gamecube : 1.61354571494%
rank_values : 3.45245920816%
text_length : 6.00004820027%
overall : 8.50613117299%
| | NOT HELPFUL TRUE | HIGHLY HELPFUL TRUE |
|---|---|---|
| NOT HELPFUL PRED | 205.0 | 50.0 |
| HIGHLY HELPFUL PRED | 37.0 | 344.0 |
LOW prediction rate: 84.71% HIGH prediction rate: 87.31%
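
As a worked check, the two prediction rates follow from the confusion matrix if each rate is read as the share of a true class that was predicted correctly (per-class recall). That reading is my interpretation, but it reproduces both percentages; the variable names are hypothetical.

```python
# Counts from the confusion matrix above: columns are true labels, rows are predictions.
true_low_pred_low = 205    # not helpful, predicted not helpful
true_high_pred_low = 50    # highly helpful, predicted not helpful
true_low_pred_high = 37    # not helpful, predicted highly helpful
true_high_pred_high = 344  # highly helpful, predicted highly helpful

# Share of each true class that was predicted correctly.
low_rate = true_low_pred_low / (true_low_pred_low + true_low_pred_high)
high_rate = true_high_pred_high / (true_high_pred_high + true_high_pred_low)

# The four cells add up to the 636 test samples.
test_samples = (true_low_pred_low + true_high_pred_low
                + true_low_pred_high + true_high_pred_high)

print(f"LOW prediction rate: {low_rate:.2%}")    # 84.71%
print(f"HIGH prediction rate: {high_rate:.2%}")  # 87.31%
print(f"test samples: {test_samples}")           # 636
```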
I used 4 CPUs on my laptop (2015 MBP) to run the same process in PySpark.
The data cleaning process was much slower than running it without Spark.
The TF-IDF calculation worked just fine. However, when I set up a rating matrix for NMF,
I started to have memory issues. The sparse matrix containing the TF-IDF terms had about 70,000 samples,
and I saw more and more error messages as I prepared the rating matrix for NMF via collaborative filtering
(in fact, ALS did not run with my rating matrix).
So I decided to use AWS EMR instances instead.
Home and Kitchen Product reviews:
- Ran the original version on an AWS EC2 instance (m4.2xlarge).
- Ran the PySpark version on AWS EMR instances (1 master, 8 nodes).
(Data cleaning reduces the data to about 81,775 samples.)
| Step | Original ver. | PySpark ver. |
|---|---|---|
| Data cleaning | 00:12:58:33 | 00:00:00:00 |
| TF-IDF + NMF | 00:28:28:17 | 00:00:00:00 |
| Random Forest | 00:03:49:10 | 00:00:00:00 |
| Total | 00:45:15:60 | 00:00:00:00 |