Project Introduction

Lilian Sun, 12/6/2023

Timeline

Start Date: September 2023
End Date: December 2023

Overview

The project ran from September 2023 to December 2023. It covered building a secure data lake on Amazon S3, ingesting the data into SageMaker Studio, and exploring it with AWS Data Wrangler. Biases in the data were addressed through data profiling and SageMaker Clarify. Feature engineering and hyperparameter tuning were run at scale to improve model performance. Models were trained and evaluated on SageMaker Autopilot, and the resulting sentiment analysis models were deployed and continuously monitored for accuracy. This README summarizes the project's scope and results.

Project Documents

Project Paper: Big Data Machine Learning System for Product Review Sentiment Analysis.pdf

The other related documents are:

1 code-data transformation & data profiling (data distribution)
2 code-data profiling (bias detection)
3 code-hyperparameter tuning
4 code-model training and tuning on SageMaker Autopilot
Comparative Analysis of Technologies for the Big Data ML System.xlsx

Each directory contains the following artifacts:

outputs: Content and resources generated by the program during execution.
datasets: Related datasets, including processed data and balanced data.
code scripts: Jupyter notebooks (.ipynb files).
screenshot: Snapshots of the program's running status and results on the AWS cloud services where the big data machine learning system is built.

The Excel spreadsheet contains manually created comparative analysis plots based on official resources and technology blogs.

Toolkits

Amazon S3:
- Established centralized and secure data lakes for efficient storage and retrieval of growing datasets.
Amazon SageMaker:
- Ingested raw data into SageMaker Studio for development using S3 commands.
- Conducted feature engineering and hyperparameter tuning at scale.
- Deployed sentiment analysis models using SageMaker's hosting capabilities.
AWS Glue:
- Organized and cataloged data within the S3 data lake, facilitating streamlined accessibility.
- Conducted data profiling to uncover key patterns and trends.
AWS Data Wrangler:
- Executed SQL queries on Amazon Athena for in-depth exploration of datasets.
Matplotlib and Seaborn:
- Created static, animated, and interactive visualizations to enhance dataset understanding.
SageMaker Clarify:
- Detected statistical data biases, focusing on metrics such as class imbalance and Difference in Proportions of Labels (DPL).
Amazon Athena:
- Used for additional data exploration and querying.
AutoML on SageMaker Autopilot:
- Employed for training models with both built-in algorithms and custom BERT models.
- Evaluated models based on training accuracy and loss to optimize performance.
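
The two bias metrics named under SageMaker Clarify can be illustrated with a small, self-contained calculation. This is a minimal sketch, not the project's notebook code: the facet values ("Dresses", "Blouses"), the sample records, and the function names are hypothetical, and it assumes the standard definitions CI = (n_a - n_d) / (n_a + n_d) and DPL = q_a - q_d, where facet d is the group being checked and facet a is everything else.

```python
# Hypothetical (product_category, sentiment) records; 1 = positive review.
records = [
    ("Dresses", 1), ("Dresses", 1), ("Dresses", 1),
    ("Dresses", 0), ("Dresses", 1), ("Dresses", 0),
    ("Blouses", 1), ("Blouses", 0),
]


def class_imbalance(rows, facet_d):
    """CI = (n_a - n_d) / (n_a + n_d); values near 0 mean balanced facets."""
    n_d = sum(1 for facet, _ in rows if facet == facet_d)
    n_a = len(rows) - n_d
    if n_a + n_d == 0:
        raise ValueError("no records supplied")
    return (n_a - n_d) / (n_a + n_d)


def difference_in_proportions_of_labels(rows, facet_d, positive_label=1):
    """DPL = q_a - q_d, the gap in positive-label rates between the facets."""
    labels_d = [label for facet, label in rows if facet == facet_d]
    labels_a = [label for facet, label in rows if facet != facet_d]
    if not labels_d or not labels_a:
        raise ValueError("both facets need at least one record")
    q_d = sum(1 for label in labels_d if label == positive_label) / len(labels_d)
    q_a = sum(1 for label in labels_a if label == positive_label) / len(labels_a)
    return q_a - q_d


ci = class_imbalance(records, facet_d="Blouses")
dpl = difference_in_proportions_of_labels(records, facet_d="Blouses")
print(f"CI  = {ci:.3f}")
print(f"DPL = {dpl:.3f}")
```

A CI far from 0 suggests one facet dominates the dataset, and a DPL far from 0 suggests the facets receive positive labels at different rates; either can motivate rebalancing the data before training.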

Why did I utilize AWS Glue and Amazon Athena in this project?

When constructing highly intricate analytical queries to process not only gigabytes but potentially terabytes or petabytes of data, Athena eliminates concerns about compute and memory resources needed for supporting such queries. Athena seamlessly and automatically scales out, breaking down the query into simpler components that run in parallel against the extensive dataset.
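
As an illustration of the kind of query described above, the following sketch builds an aggregate SQL statement and submits it to Athena through AWS Data Wrangler (the `awswrangler` package). The table name `reviews`, the column `sentiment`, and the database name `product_reviews_db` are hypothetical placeholders rather than names taken from this project; the database is assumed to be registered in the AWS Glue Data Catalog.

```python
def build_sentiment_count_query(table: str) -> str:
    """Return SQL that counts reviews per sentiment label (hypothetical schema)."""
    return (
        f"SELECT sentiment, COUNT(*) AS review_count "
        f"FROM {table} "
        f"GROUP BY sentiment "
        f"ORDER BY review_count DESC"
    )


def run_query(sql: str, database: str):
    """Run the query on Athena and return the result as a pandas DataFrame."""
    # Imported lazily so the query builder works without AWS dependencies.
    import awswrangler as wr

    # `database` refers to a database in the Glue Data Catalog.
    return wr.athena.read_sql_query(sql=sql, database=database)


sql = build_sentiment_count_query("reviews")
print(sql)

# Requires AWS credentials and an existing table, so it is left commented out:
# df = run_query(sql, database="product_reviews_db")
```

The caller only supplies SQL and a catalog database; no cluster size, memory, or parallelism settings appear anywhere in the code, which is the practical consequence of Athena handling scaling itself.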

lilian-swen/ProductsReviewSentimentAnalysis
