Hello! We are IS3107 Group 14.
This repository contains a data pipeline built on Google Cloud Platform that works by:
- Extracting data from the YouTube Data API v3,
- Transforming it using Python, and
- Loading it into BigQuery, using Google Cloud Composer to provide the Airflow environment and Google Cloud Storage to store our DAG files.
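To give a feel for the transform step, here is a minimal sketch in Python. It is not the code from our DAG files. It assumes the video items come back from the API with `snippet`, `statistics`, and `contentDetails` parts, where counts arrive as strings and the duration as an ISO 8601 string such as `PT15M33S`. The function names and the output column names are hypothetical and chosen only for illustration.

```python
import re

# ISO 8601 durations such as "PT1H2M3S" or "P1DT2H" (the format assumed
# for contentDetails.duration in this sketch).
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(duration):
    """Convert an ISO 8601 duration string to a number of seconds."""
    match = _DURATION_RE.match(duration)
    if match is None:
        raise ValueError(f"Unrecognized duration: {duration!r}")
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


def _to_int(value):
    """Counts arrive as strings; missing counts stay None."""
    return int(value) if value is not None else None


def video_item_to_row(item):
    """Flatten one video item into a flat dict (hypothetical column names)."""
    snippet = item.get("snippet", {})
    stats = item.get("statistics", {})
    details = item.get("contentDetails", {})
    duration = details.get("duration")
    return {
        "video_id": item["id"],
        "title": snippet.get("title", ""),
        "category_id": snippet.get("categoryId"),
        "description_length": len(snippet.get("description", "")),
        "duration_seconds": duration_to_seconds(duration) if duration else None,
        "view_count": _to_int(stats.get("viewCount")),
        "like_count": _to_int(stats.get("likeCount")),
    }


# A made-up item, shaped like the assumed API response.
sample_item = {
    "id": "abc123",
    "snippet": {"title": "Demo", "categoryId": "10", "description": "Hello world"},
    "statistics": {"viewCount": "1500", "likeCount": "42"},
    "contentDetails": {"duration": "PT15M33S"},
}
sample_row = video_item_to_row(sample_item)
```

Rows flattened this way have one fixed set of keys, which makes it easier to keep them consistent with a BigQuery table schema.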
Content creation has become more prevalent in recent years. However, becoming a content creator comes with many challenges, one of which is the unpredictable video algorithm (Mittal & Liu, 2021). This is why there are so many aspiring YouTubers but only a few succeed. In this project, we would like to help YouTubers and talent agencies determine whether the algorithm really is unpredictable, and identify what they could do to reach a larger audience, or even get their videos onto the trending chart and become the next popular star.
The insights that we would like to help identify from this project include:
- Analysis of overall viewer engagement (views, likes, subscriptions, etc.) across different categories.
- Analysis of the correlation between video duration and the number of viewers.
- Analysis of the correlation between view count and the number of subscribers.
- Analysis of the correlation between video count and the number of subscribers.
- Analysis of the correlation between channel description length and the number of subscribers.
- Analysis of commonly used words in video descriptions.
- Analysis of the correlation between video description length and the number of subscribers.
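As an illustration of the analyses listed above, the following sketch computes a Pearson correlation coefficient and a word frequency count using only the Python standard library. It is an assumption about how such an analysis could be done, not our actual implementation. The function names are hypothetical, and in practice the input lists would come from the BigQuery tables rather than being typed in by hand.

```python
import math
import re
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("Need two sequences of equal length (at least 2 values).")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("Correlation is undefined for a constant sequence.")
    return cov / math.sqrt(var_x * var_y)


def top_words(descriptions, n=10):
    """Return the n most common lowercase words across all descriptions."""
    counts = Counter()
    for text in descriptions:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(n)
```

For example, `pearson(view_counts, subscriber_counts)` gives a value between -1 and 1, where values near 1 suggest that channels with more views also tend to have more subscribers. A real analysis of common words would probably also filter out stop words such as "the" and "and", which this sketch does not do.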
To set up your own data pipeline with our DAG files, follow the instructions in this document: https://docs.google.com/document/d/1NPHpKnZhjBq8kG5n58RQ4bfmcKBqTZSwHa8C2wx6AAM/edit?usp=sharing
Thank you for reading!