An end-to-end data pipeline that ingests simulated music streaming data; structures, cleans, and models the raw data; and performs analytics on the clean data.
Eventsim is a top music streaming company. Its management is working on a new feature tailored to users' preferences. To support the development of this feature, the developers need to understand users' streaming habits, so they came up with the following use cases and questions:
- What is the total number of active users, their total stream hours, and their geographic distribution?
- What is the overall gender composition of users, and how does it break down across the top artists?
- What are the top songs and who are the top artists that users listen to?
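The questions above can be sketched as simple aggregations. This is only an illustration in plain Python: in the pipeline itself these metrics are modelled in the warehouse, and the field names (`userId`, `gender`, `state`, `artist`, `song`, `duration` in seconds) as well as the sample rows are assumptions, not the project's actual schema or data.

```python
from collections import Counter

# Hypothetical cleaned events; field names and values are assumptions.
events = [
    {"userId": "u1", "gender": "F", "state": "CA", "artist": "A", "song": "S1", "duration": 1800.0},
    {"userId": "u1", "gender": "F", "state": "CA", "artist": "B", "song": "S2", "duration": 1800.0},
    {"userId": "u2", "gender": "M", "state": "NY", "artist": "A", "song": "S1", "duration": 3600.0},
]

# Active users: distinct user IDs that produced at least one event.
active_users = {e["userId"] for e in events}

# Total stream hours: durations are assumed to be in seconds.
stream_hours = sum(e["duration"] for e in events) / 3600

# Geographic distribution and gender composition, counting each user once.
users_by_state = Counter(state for _, state in {(e["userId"], e["state"]) for e in events})
users_by_gender = Counter(gender for _, gender in {(e["userId"], e["gender"]) for e in events})

# Top songs and artists by number of plays.
top_songs = Counter(e["song"] for e in events).most_common(1)
top_artists = Counter(e["artist"] for e in events).most_common(1)

print(len(active_users), stream_hours, dict(users_by_state), dict(users_by_gender))
print(top_songs, top_artists)
```

Counting each user once per state or gender avoids weighting heavy listeners more than light ones; counting plays instead would answer a different question.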
- Eventsim API produces the streaming data which are then consumed by Kafka.
- Stream data are read from Kafka with Spark Streaming.
- Spark Streaming structures the data and writes it to the data lake (Cloud Storage) as flat files.
- Data is loaded from the data lake (Cloud Storage) into the data warehouse (BigQuery) and transformed there with dbt (ELT), orchestrated with Airflow.
- Stream analytics are built and published with Google Data Studio.
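As a rough illustration of the "structure and write as flat file" step, the sketch below decodes raw JSON messages and serializes them as CSV. It is a plain-Python stand-in, not the project's Spark Streaming job: the field list, the sample messages, and the choice of CSV as the flat-file format are all assumptions.

```python
import csv
import io
import json

# Hypothetical field names; the real Eventsim schema may differ.
FIELDS = ["ts", "userId", "artist", "song", "duration"]


def structure_event(raw: bytes) -> dict:
    """Decode one raw Kafka-style message and keep only the expected fields."""
    event = json.loads(raw.decode("utf-8"))
    # Missing fields become None; unexpected fields are dropped.
    return {field: event.get(field) for field in FIELDS}


def to_flat_file(rows: list[dict]) -> str:
    """Serialize structured rows as CSV text (assumed flat-file format)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up messages standing in for what a Kafka consumer would receive.
messages = [
    b'{"ts": 1, "userId": "u1", "artist": "A", "song": "S1", "duration": 200.0, "extra": "x"}',
    b'{"ts": 2, "userId": "u2", "artist": "B", "song": "S2"}',
]
rows = [structure_event(m) for m in messages]
flat = to_flat_file(rows)
print(flat)
```

Fixing the column set up front keeps every output file on the same schema, which makes the later load into the warehouse simpler.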
Eventsim is a program that generates event data to replicate page requests for a fake music website. The results look like real usage data but are entirely fake. The Docker image is borrowed from viirya's fork, as the original project has gone unmaintained for a few years.
Eventsim uses song data from the Million Song Dataset to generate events. I have used a subset of 10,000 songs.
Click here to view latest version on Data Studio
- Clone this repo to the ~/musicaly-project directory:

      git clone https://github.com/topefolorunso/musicaly-project.git ~/musicaly-project && \
      cd ~/musicaly-project