End-to-End AWS Data Pipeline for Spotify Streaming Analytics

Overview

This project consists of an end-to-end data pipeline in which personal Spotify streaming data, stored in .csv format, is uploaded to Amazon S3. An AWS Glue crawler was then used to analyze the streaming data in S3, determine its schema, and record it in a Glue Data Catalog database. Next, an AWS Glue ETL job ran an Apache Spark script to convert the .csv data into the more efficient .parquet format. The transformed data was stored back in S3, where Amazon Athena was used to run interactive queries against the Parquet data and extract insights.

Extract

The personal streaming data was requested from Spotify and then stored as a .csv file in Amazon S3. An IAM role with the required IAM policies was set up so that AWS Glue could access the data in the jobs that follow.
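The upload step can be sketched with boto3 as follows. This is only an illustration: the bucket name, the key prefix `raw/streaming_history`, and the helper names are assumptions and do not come from the project's scripts.

```python
from pathlib import PurePosixPath


def build_s3_key(local_path: str, prefix: str = "raw/streaming_history") -> str:
    """Return the object key for a local CSV file (the prefix is a hypothetical choice)."""
    filename = PurePosixPath(local_path.replace("\\", "/")).name
    return f"{prefix.strip('/')}/{filename}"


def upload_csv(local_path: str, bucket: str, prefix: str = "raw/streaming_history") -> str:
    """Upload the CSV to S3 and return its s3:// URI."""
    import boto3  # imported lazily so build_s3_key works without AWS libraries

    key = build_s3_key(local_path, prefix)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"
```

Calling `upload_csv("data/streaming_history/history.csv", "my-bucket")` would require AWS credentials with write access to the (hypothetical) bucket `my-bucket`.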

Transform

To start the transformation job, a database and a table were created in the AWS Glue Data Catalog. The table stores the metadata associated with the S3 object, which in this case is the .csv file. An AWS Glue crawler was used to infer the schema of the S3 object.
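A crawler like the one described above could be defined through boto3's Glue client. The sketch below assumes the Data Catalog database already exists and that the IAM role can read the S3 path; the crawler name, role ARN, and paths passed in are placeholders chosen by the caller, not values from the project.

```python
def crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Build the keyword arguments for glue.create_crawler."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # catalog database that receives the inferred table
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request: dict) -> None:
    """Create the crawler and start one run that infers the CSV schema."""
    import boto3  # imported lazily; needs AWS credentials at call time

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
```

Keeping the request as a plain dictionary makes it easy to inspect or log the crawler definition before anything is created in AWS.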

Once the table had the necessary schema, a Glue ETL job was created. Its main purpose was to read the source CSV file through the Glue Data Catalog, modify it, upload the modified data frame to S3 in Parquet format, and create a corresponding target Data Catalog table that holds the metadata of the target object.
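The skeleton of such a Glue job might look like the sketch below. It only covers job setup and reading the source table through the Data Catalog; the job argument names `source_database` and `source_table` are assumptions made for illustration.

```python
def read_source(glue_context, database: str, table_name: str):
    """Read the crawled CSV table from the Data Catalog as a Glue DynamicFrame."""
    return glue_context.create_dynamic_frame.from_catalog(
        database=database, table_name=table_name
    )


def main(argv):
    # These modules are only available inside the AWS Glue runtime.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # Hypothetical job parameters, passed as --source_database and --source_table.
    args = getResolvedOptions(argv, ["JOB_NAME", "source_database", "source_table"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    source = read_source(glue_context, args["source_database"], args["source_table"])
    print(f"Read {source.count()} records from the catalog table")

    job.commit()

# Inside the Glue job script, finish with: main(sys.argv) after `import sys`.
```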

A Spark script was developed that converted Glue's dynamic frame into a Spark DataFrame in order to drop unnecessary columns, remove NULL values, rename the columns with relevant names, and extract month and day details from the timestamp columns.
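The transformation steps listed above could be written roughly as follows. The README does not name the columns, so every column name here (`endTime`, `artistName`, `trackName`, `msPlayed`, `unused_column`), the timestamp format, and the choice of day-of-month for the "day" detail are assumptions for illustration only.

```python
# Hypothetical column plan; adjust to the schema the crawler actually inferred.
DROP_COLUMNS = ["unused_column"]
RENAMES = {
    "endTime": "played_at",
    "artistName": "artist",
    "trackName": "track",
    "msPlayed": "ms_played",
}
TIMESTAMP_FORMAT = "yyyy-MM-dd HH:mm"  # assumed format of the timestamp strings


def output_columns(input_columns):
    """Preview the column names the transform produces, without needing Spark."""
    kept = [c for c in input_columns if c not in DROP_COLUMNS]
    return [RENAMES.get(c, c) for c in kept] + ["month", "day"]


def transform(dynamic_frame):
    """Apply the drop / null-removal / rename / date-part steps to a Glue DynamicFrame."""
    from pyspark.sql import functions as F  # available in the Glue Spark runtime

    df = dynamic_frame.toDF()          # Glue DynamicFrame -> Spark DataFrame
    df = df.drop(*DROP_COLUMNS)        # drop unnecessary columns
    df = df.na.drop()                  # remove rows that contain any NULL
    for old, new in RENAMES.items():
        df = df.withColumnRenamed(old, new)

    played_at = F.to_timestamp(F.col("played_at"), TIMESTAMP_FORMAT)
    return (
        df.withColumn("month", F.month(played_at))
          .withColumn("day", F.dayofmonth(played_at))
    )
```

`output_columns` is a plain-Python convenience for checking the rename plan locally before running the job.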

Load

The final Spark DataFrame was converted back into a Glue dynamic frame and loaded into S3 as a .parquet file, along with the corresponding table in the Glue Data Catalog.
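One way to write the Parquet output and register the target table in a single step is Glue's catalog-updating sink, sketched below. The README does not say how the target table was created, so treat this as one possible approach; the frame name, target path, database, and table names are placeholders.

```python
def sink_options(target_path: str, partition_keys=()) -> dict:
    """Options for a Glue S3 sink that also creates or updates the catalog table."""
    return {
        "connection_type": "s3",
        "path": target_path,
        "enableUpdateCatalog": True,
        "updateBehavior": "UPDATE_IN_DATABASE",
        "partitionKeys": list(partition_keys),
    }


def load(df, glue_context, target_path: str, database: str, table_name: str) -> None:
    """Convert the Spark DataFrame to a DynamicFrame and write it to S3 as Parquet."""
    from awsglue.dynamicframe import DynamicFrame  # only available in the Glue runtime

    # "streaming_history_parquet" is a hypothetical name for the frame.
    dyf = DynamicFrame.fromDF(df, glue_context, "streaming_history_parquet")

    sink = glue_context.getSink(**sink_options(target_path))
    sink.setFormat("glueparquet")
    sink.setCatalogInfo(catalogDatabase=database, catalogTableName=table_name)
    sink.writeFrame(dyf)
```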

Querying

With the Glue Data Catalog table created by the Spark script, Amazon Athena was used to run queries on the data and extract relevant insights.
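As an illustration of the kind of query that could be run, the sketch below builds a "top artists by listening time" query and submits it through boto3's Athena client. The table and column names (`artist`, `ms_played`) and the results location are assumptions, not values taken from the project.

```python
def top_artists_query(table: str, limit: int = 10) -> str:
    """Build a query ranking artists by total minutes played (column names are assumed)."""
    return (
        "SELECT artist, SUM(ms_played) / 60000.0 AS minutes_played "
        f'FROM "{table}" '
        "GROUP BY artist "
        "ORDER BY minutes_played DESC "
        f"LIMIT {int(limit)}"
    )


def run_query(sql: str, database: str, output_location: str) -> str:
    """Submit the query to Athena and return its execution id."""
    import boto3  # imported lazily; needs AWS credentials at call time

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Athena runs queries asynchronously, so the returned execution id would still need to be polled (for example with `get_query_execution`) before the results are read.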

nikitgoku/aws_data_engineering_e2e
