Big Data Analysis and Recommender System for Amazon

Problem Statement:

Perform Data Analysis on 130M+ records and build a recommender system.

Deployed a single-node Hadoop cluster to perform the analysis using MapReduce.

Carried out data analysis on big data.

/**

author Sayali Walke

**/

Dataset used for analysis:

The Amazon Customer Reviews Dataset is a large dataset of more than 20 GB. For this analysis, however, I used a subset of it named “amazon_reviews_us_Beauty_v1_00.tsv”.

Here is a detailed description of the dataset's columns.

marketplace: 2 letter country code of the marketplace where the review was written.

customer_id: Random identifier that can be used to aggregate reviews written by a single author.

review_id: The unique ID of the review.

product_id: The unique Product ID the review pertains to. In the multilingual dataset the reviews for the same product in different countries can be grouped by the same product_id.

product_parent: Random identifier that can be used to aggregate reviews for the same product.

product_title: Title of the product.

product_category: Broad product category that can be used to group reviews (also used to group the dataset into coherent parts).

star_rating: The 1-5 star rating of the review.

helpful_votes: Number of helpful votes.

total_votes: Number of total votes the review received.

vine: Review was written as part of the Vine program.

verified_purchase: The review is on a verified purchase.

review_headline: The title of the review.

review_body: The review text.

review_date: The date the review was written.
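The column layout above can be turned into a small reader. The following is a minimal sketch, assuming the TSV file starts with a header row naming these columns and that the three numeric columns are worth converting to integers. The function name `read_reviews` and the two sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io
from typing import Dict, Iterator

# Column names in the order they are described above.
COLUMNS = [
    "marketplace", "customer_id", "review_id", "product_id",
    "product_parent", "product_title", "product_category",
    "star_rating", "helpful_votes", "total_votes", "vine",
    "verified_purchase", "review_headline", "review_body", "review_date",
]

# Columns converted to int in this sketch (an assumption, not a dataset rule).
INT_COLUMNS = ("star_rating", "helpful_votes", "total_votes")


def read_reviews(lines) -> Iterator[Dict[str, object]]:
    """Yield one dict per review from tab-separated lines with a header row."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    for row in reader:
        if len(row) != len(header):
            continue  # skip rows with a missing or extra field
        record = dict(zip(header, row))
        try:
            for name in INT_COLUMNS:
                record[name] = int(record[name])
        except ValueError:
            continue  # skip rows whose numeric fields do not parse
        yield record


# Hypothetical two-row sample, invented for illustration only.
SAMPLE = (
    "\t".join(COLUMNS) + "\n"
    "US\t1001\tR1\tB001\t501\tLip Balm\tBeauty\t5\t3\t4\tN\tY\tGreat\tLoved it\t2015-08-31\n"
    "US\t1002\tR2\tB001\t501\tLip Balm\tBeauty\t4\t0\t1\tN\tY\tGood\tWorks fine\t2015-08-30\n"
)
reviews = list(read_reviews(io.StringIO(SAMPLE)))
```

To read the real file, an open file handle can be passed in place of the `io.StringIO` object.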

Apache Mahout

Created a recommender system using Mahout. It recommends similar products based on the items bought by other customers ("People who bought this also bought this...").
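The "people who bought this also bought this" idea can be illustrated with a simple co-occurrence count. The sketch below is plain Python and does not use Mahout or reproduce its API. It assumes that each (customer, product) pair stands for one purchase, and the function names and sample data are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def build_cooccurrence(purchases):
    """Count, for each ordered product pair, how many customers bought both."""
    baskets = defaultdict(set)
    for customer_id, product_id in purchases:
        baskets[customer_id].add(product_id)

    counts = defaultdict(lambda: defaultdict(int))
    for products in baskets.values():
        for a, b in permutations(products, 2):
            counts[a][b] += 1
    return counts


def also_bought(counts, product_id, top_n=3):
    """Return up to top_n products most often bought together with product_id."""
    related = counts.get(product_id, {})
    # Highest co-occurrence first; ties broken by product id for stable output.
    ranked = sorted(related.items(), key=lambda kv: (-kv[1], kv[0]))
    return [product for product, _ in ranked[:top_n]]


# Hypothetical purchases, invented for illustration only.
PURCHASES = [
    ("c1", "A"), ("c1", "B"),
    ("c2", "A"), ("c2", "B"), ("c2", "C"),
    ("c3", "A"), ("c3", "C"), ("c3", "D"),
]
counts = build_cooccurrence(PURCHASES)
suggestions = also_bought(counts, "A", top_n=2)
```

Raw co-occurrence favors products that are popular overall. A similarity measure that normalizes for popularity would be a natural refinement, which this sketch leaves out.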



Hadoop

I created a single-node Hadoop cluster on a virtual machine and carried out the following data analysis using various MapReduce design patterns:

1. Top 100 products based on average rating (Filtering Pattern)

2. Averaging and sorting of reviews using chained jobs (Chaining and Sorting)

3. Number of reviews per product (Numerical Summarization Pattern)

4. Inner join of the average rating and the number of reviews for each product (Inner Join)

5. Customer list for each product (Inverted Index)

6. Five bins, one for each rating from 1 to 5: Binning (Organization Pattern)

7. Data partitioned into a separate file for each day in 2015-08: Partitioning (Organization Pattern)

8. Distinct reviews counter (Numerical Summarization Pattern)

9. Percentage of helpful votes (Numerical Summarization Pattern)
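Several of the patterns listed above share the same map, group, and reduce shape. The sketch below imitates that shape in memory with plain Python. It is not Hadoop code and not the project's implementation, and the helper `map_reduce`, the mapper and reducer names, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over records, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


# Hypothetical sample rows: (customer_id, product_id, star_rating).
REVIEWS = [
    ("c1", "B001", 5),
    ("c2", "B001", 4),
    ("c1", "B002", 2),
    ("c3", "B002", 3),
    ("c3", "B003", 5),
]


# Numerical summarization: review count and average rating per product.
def rating_mapper(review):
    _, product_id, stars = review
    yield product_id, stars


def summary_reducer(product_id, ratings):
    return {"count": len(ratings), "average": sum(ratings) / len(ratings)}


summary = map_reduce(REVIEWS, rating_mapper, summary_reducer)


# Filtering (top N): keep the N products with the highest average rating.
def top_n(summary, n):
    ranked = sorted(summary.items(), key=lambda kv: (-kv[1]["average"], kv[0]))
    return [product_id for product_id, _ in ranked[:n]]


# Inverted index: product -> sorted list of customers who reviewed it.
def customer_mapper(review):
    customer_id, product_id, _ = review
    yield product_id, customer_id


def index_reducer(product_id, customers):
    return sorted(set(customers))


index = map_reduce(REVIEWS, customer_mapper, index_reducer)


# Binning: one bin per star rating, each holding the full review rows.
def bin_mapper(review):
    yield review[2], review


bins = map_reduce(REVIEWS, bin_mapper, lambda stars, rows: rows)
```

In a real Hadoop job the grouping step is done by the framework's shuffle and sort phase, and binning or partitioning would write each group to its own output file rather than keep it in a dictionary.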

Apache Hive

1. Top 10 products based on average rating

2. Most valuable customer based on number of products bought

3. Most popular product based on the number of times it was bought

4. Number of products bought per day

5. Number of products per rating
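Queries of this kind are grouped aggregations. The sketch below runs three of them with Python's built-in `sqlite3` module as a stand-in for Hive. The SQL is kept generic, but it has not been run against Hive, and the table name, column subset, and sample rows are assumptions. Each review is treated as one purchase, which is also an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews ("
    "customer_id TEXT, product_id TEXT, star_rating INTEGER, review_date TEXT)"
)

# Hypothetical sample rows, invented for illustration only.
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?, ?)",
    [
        ("c1", "B001", 5, "2015-08-30"),
        ("c2", "B001", 4, "2015-08-30"),
        ("c1", "B002", 2, "2015-08-31"),
        ("c3", "B002", 3, "2015-08-31"),
        ("c3", "B003", 5, "2015-08-31"),
    ],
)

# Top 10 products by average rating (ties broken by product id).
top_products = conn.execute(
    "SELECT product_id, AVG(star_rating) AS avg_rating "
    "FROM reviews GROUP BY product_id "
    "ORDER BY avg_rating DESC, product_id LIMIT 10"
).fetchall()

# Customer with the most distinct products (ties broken by customer id).
top_customer = conn.execute(
    "SELECT customer_id, COUNT(DISTINCT product_id) AS products "
    "FROM reviews GROUP BY customer_id "
    "ORDER BY products DESC, customer_id LIMIT 1"
).fetchone()

# Number of reviews for each star rating.
per_rating = conn.execute(
    "SELECT star_rating, COUNT(*) FROM reviews "
    "GROUP BY star_rating ORDER BY star_rating"
).fetchall()
```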

Apache Pig

1. Number of reviews given per day.

2. Number of reviews given per product.
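Both counts follow the same steps: load the rows, group them by one column, and count each group. The sketch below shows those steps in plain Python rather than Pig Latin. The helper `count_by` and the sample rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def count_by(rows, column):
    """Group rows by one column and count the rows in each group."""
    key = itemgetter(column)
    ordered = sorted(rows, key=key)  # groupby only merges adjacent equal keys
    return {value: sum(1 for _ in group) for value, group in groupby(ordered, key=key)}


# Hypothetical sample rows, invented for illustration only.
ROWS = [
    {"review_date": "2015-08-30", "product_id": "B001"},
    {"review_date": "2015-08-30", "product_id": "B002"},
    {"review_date": "2015-08-31", "product_id": "B001"},
]

reviews_per_day = count_by(ROWS, "review_date")
reviews_per_product = count_by(ROWS, "product_id")
```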
