End-to-End AWS Data Pipeline for Spotify Streaming Analytics

Overview

This project is an end-to-end data pipeline built around personal Spotify streaming data for 2023. The data, exported in .csv format, is uploaded to Amazon S3. A database is then created in the Glue Data Catalog, and an AWS Glue Crawler analyzes the streaming data stored in S3 to determine its schema. An AWS Glue ETL job then orchestrates the pipeline, running an Apache Spark script that converts the .csv data into the more efficient .parquet format. The transformed data is stored back in S3, where Amazon Athena is used to run interactive queries and extract meaningful insights from the Parquet dataset.

Extract

The personal streaming data was requested from Spotify and stored as a .csv file in Amazon S3. The required IAM role and policies were set up so that AWS Glue could access the data for the jobs moving forward.
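As a minimal sketch, the upload step with boto3 might look like the following; the file, bucket, and key names are placeholders, not the values actually used in this project:

```python
import boto3

s3 = boto3.client("s3")

# Upload the exported Spotify streaming history to S3
# (file, bucket, and key names are hypothetical)
s3.upload_file(
    Filename="StreamingHistory_2023.csv",
    Bucket="spotify-streaming-raw",
    Key="raw/StreamingHistory_2023.csv",
)
```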

Transform

To start the transformation, a database and table were created in the Glue Data Catalog. This table stores the metadata associated with the source object, which in this case is the .csv file. An AWS Glue Crawler was used to infer the schema of the S3 object.
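A sketch of that setup with boto3, assuming placeholder names for the database, crawler, IAM role, and bucket:

```python
import boto3

glue = boto3.client("glue")

# Create the Glue Data Catalog database (name is a placeholder)
glue.create_database(DatabaseInput={"Name": "spotify_db"})

# Create a crawler that scans the raw .csv in S3 and infers its schema
glue.create_crawler(
    Name="spotify_csv_crawler",            # placeholder crawler name
    Role="AWSGlueServiceRole-spotify",     # placeholder IAM role
    DatabaseName="spotify_db",
    Targets={"S3Targets": [{"Path": "s3://spotify-streaming-raw/raw/"}]},
)
glue.start_crawler(Name="spotify_csv_crawler")
```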

After the table was successfully populated with the necessary schema, a Glue ETL job was created. Its main task was to read the source CSV file through the Glue Data Catalog, transform it, upload the modified data frame to S3 in Parquet format, and create a corresponding target table in the Data Catalog that keeps the metadata of the target object.
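The standard skeleton of such a Glue job script in Python is sketched below; the database and table names are placeholders for the ones produced by the crawler:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard AWS Glue job boilerplate
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the source .csv through the Glue Data Catalog (placeholder names)
source_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="spotify_db",
    table_name="raw",
)
```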

A Spark script was developed that converts Glue's DynamicFrame into a Spark DataFrame in order to drop unnecessary columns, remove NULL values, rename the columns to more meaningful names, and extract month and day details from the timestamp column.
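A sketch of those transformations, assuming field names from Spotify's extended streaming history export; the columns in the actual script may differ:

```python
from pyspark.sql import functions as F

# DynamicFrame -> Spark DataFrame for column-level transformations
df = source_dyf.toDF()

df = (
    df.drop("ip_addr_decrypted", "user_agent_decrypted")    # assumed unneeded columns
      .dropna(subset=["master_metadata_track_name"])        # drop rows with NULL track names
      .withColumnRenamed("master_metadata_track_name", "track_name")
      .withColumnRenamed("master_metadata_album_artist_name", "artist_name")
      .withColumn("ts", F.to_timestamp("ts"))               # parse the timestamp column
      .withColumn("month", F.month("ts"))                   # extract month
      .withColumn("day", F.dayofmonth("ts"))                # extract day of month
)
```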

Load

The final Spark DataFrame was converted back into a Glue DynamicFrame and loaded back into S3 as .parquet files, along with the corresponding table in the Glue Data Catalog.
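One way to do this in a Glue script is the getSink API with catalog updates enabled; a sketch with placeholder path and table names:

```python
from awsglue.dynamicframe import DynamicFrame

# Spark DataFrame -> Glue DynamicFrame
out_dyf = DynamicFrame.fromDF(df, glueContext, "out_dyf")

# Write parquet to S3 and create/update the target table in the Data Catalog
sink = glueContext.getSink(
    connection_type="s3",
    path="s3://spotify-streaming-processed/parquet/",   # placeholder output path
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
)
sink.setCatalogInfo(catalogDatabase="spotify_db", catalogTableName="streaming_parquet")
sink.setFormat("glueparquet")
sink.writeFrame(out_dyf)

job.commit()
```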

Querying

With the Glue Data Catalog updated after running the Spark script, Amazon Athena was used to run queries on the data and extract relevant insights.
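For example, a query like the following could surface the most-played artists for the year; here it is submitted through boto3, with placeholder database, table, and result-bucket names:

```python
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="""
        SELECT artist_name, COUNT(*) AS plays
        FROM streaming_parquet
        GROUP BY artist_name
        ORDER BY plays DESC
        LIMIT 10
    """,
    QueryExecutionContext={"Database": "spotify_db"},
    ResultConfiguration={"OutputLocation": "s3://spotify-athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution for status/results
```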
