This project extracts Reddit data from r/popular through a combination of web scraping and API calls. The final output, which is shown below, is a Google Data Studio report showcasing the popular subreddits in each of the available regions of Southeast Asia (Malaysia, Philippines, Singapore, and Thailand).
- The Google Data Studio dashboard can be accessed through this link.
This project was motivated by an interest in how redditors from Southeast Asia (SEA) spend their time on the platform. Initial results have yielded the following observations:
- Redditors from the Philippines tend to participate more in their own local subreddits.
  - A potential reason may be the abundance of local subreddits catering to specific topics, whereas other SEA countries might only have a few local subreddits where topics are mixed.
- Other SEA countries have a few local subreddits with diverse topics.
  - Malaysia has r/Malaysia and r/Bolehland as its popular local subreddits.
  - Thailand has r/Thailand as its popular local subreddit.
- Mobile gaming subreddits (Genshin Impact, Zenless Zone Zero, and Honkai Star Rail) are popular in Malaysia, the Philippines, and Thailand.
  - Mobile gaming is very popular in SEA because of the large supply of cheap phones.
- Singapore's own subreddit r/Singapore is not popular in the region.
  - Singaporeans may be using another social media platform to discuss local topics; redditors from Singapore instead use Reddit to participate in international subreddits.
This project was also made with the desire to practice several tools for handling data. Frankly, some of the tools, such as cloud-based storage, overcomplicate the project; at the same time, using them provides a good opportunity to develop skills with such tools.
- Extract data using the Reddit API
- Load into GCP buckets
- Copy into GCP BigQuery
- Transform using dbt
- Create a Google Data Studio dashboard
- Orchestrate the workflow using Apache Airflow inside a Docker container
- Manage GCP resources using Terraform
1. The pipeline starts by scraping the post URLs from r/popular, since the Reddit API currently does not support this feature (to my knowledge). There is this endpoint, but I do not know whether it works with specific regions or only with the subreddit as a whole. Each post URL is then passed through the Reddit API to obtain the post details. The API returns a JSON object, and the script extracts the necessary details. The post details are turned into a pandas DataFrame and then saved into a Parquet file. A rough sketch of this step is shown below.
   - See extract.py
   - NOTE: If you wish to change the regions, change the variable `countries` in line 98
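The following is a minimal sketch of step 1, assuming PRAW is used as the Reddit API client; the actual extract.py may call the API differently, and the credentials, post URL, and file name here are placeholders.

```python
# Illustrative version of the extract step: fetch post details for scraped
# URLs, build a pandas DataFrame, and write it to a Parquet file.
import pandas as pd
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # from your Reddit app settings
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="sea-popular-pipeline",
)

def fetch_post_details(post_urls):
    """Return a DataFrame with one row of post details per scraped URL."""
    rows = []
    for url in post_urls:
        submission = reddit.submission(url=url)
        rows.append({
            "title": submission.title,
            "subreddit": submission.subreddit.display_name,
            "score": submission.score,
            "num_comments": submission.num_comments,
            "created_utc": submission.created_utc,
        })
    return pd.DataFrame(rows)

# Example usage with a placeholder post URL.
df = fetch_post_details(["https://www.reddit.com/r/singapore/comments/abc123/example/"])
df.to_parquet("reddit_posts_sg.parquet", index=False)
```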
2. The Parquet files are then uploaded into a GCP bucket, and the files in the bucket are copied into a GCP BigQuery table. A rough sketch of this step is shown below.
   - See load_bucket.py and load_bq.py
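The following is a minimal sketch of step 2 using the google-cloud-storage and google-cloud-bigquery client libraries; the bucket, project, dataset, and table names are placeholders, and load_bucket.py / load_bq.py may differ in the details.

```python
# Illustrative version of the load steps: upload the Parquet file to GCS,
# then load it from the bucket into a BigQuery table.
from google.cloud import bigquery, storage

# 1) Upload the local Parquet file to a GCS bucket.
storage_client = storage.Client()
bucket = storage_client.bucket("reddit-sea-bucket")  # placeholder bucket name
bucket.blob("raw/reddit_posts_sg.parquet").upload_from_filename("reddit_posts_sg.parquet")

# 2) Copy the file from the bucket into a BigQuery table.
bq_client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
load_job = bq_client.load_table_from_uri(
    "gs://reddit-sea-bucket/raw/reddit_posts_sg.parquet",
    "your-project.reddit_dataset.posts_raw",  # placeholder table id
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```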
3. The data in the initial BigQuery table is transformed through dbt Cloud.
   - See the dbt folder
Steps 1 and 2 compose the single DAG run through Apache Airflow. The orchestration tool runs inside a Docker container defined by the docker-compose.yaml file. The dbt transformation is scheduled through its own job on the dbt Cloud platform. A sketch of such a DAG is shown below.
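The following is a minimal sketch of what such a DAG could look like, assuming the extract and load scripts expose callable entry points (`run_extract`, `run_load_bucket`, and `run_load_bq` are placeholder names); the actual DAG in this repository may be structured differently.

```python
# Illustrative Airflow DAG chaining the extract and load steps.
# Task ids, the schedule, and the imported entry points are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from extract import run_extract            # hypothetical entry point
from load_bucket import run_load_bucket    # hypothetical entry point
from load_bq import run_load_bq            # hypothetical entry point

with DAG(
    dag_id="reddit_sea_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_reddit_posts", python_callable=run_extract)
    to_bucket = PythonOperator(task_id="load_to_gcs", python_callable=run_load_bucket)
    to_bq = PythonOperator(task_id="load_to_bigquery", python_callable=run_load_bq)

    extract >> to_bucket >> to_bq
```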
- GCP Account
- Docker
- Python
- Terraform
- Reddit Account
- dbt Cloud Account
- NOTE: The files below are renamed for .gitignore purposes, so that the copies containing your credentials are not committed
- Run the commands below, then fill in the required variables in the .env file

  ```
  cd .\airflow\
  mv dev.env .env
  ```
- Download your GCP service account key and store it in google_keys.json, then run the commands below (admittedly, this is not the best practice)

  ```
  cd .\airflow\
  mv google_keys.json keys.json
  ```
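For reference, here is a minimal sketch of how the GCP Python clients can pick up a service account key file; the project's scripts may instead rely on the GOOGLE_APPLICATION_CREDENTIALS environment variable set in the .env file, so treat the file name and approach here as assumptions.

```python
# Illustrative check that the renamed key file is usable by the GCP clients.
from google.cloud import storage

client = storage.Client.from_service_account_json("keys.json")
print([bucket.name for bucket in client.list_buckets()])  # should list your buckets
```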
- Fill in var.tf with the required GCP variables, then run the commands below

  ```
  mv var.tf variables.tf
  terraform init
  terraform apply
  ```
- Run the commands below

  ```
  cd .\airflow\
  docker compose build
  docker compose up airflow-init
  docker compose up -d
  ```
- NOTE: Only run `docker compose up -d` on subsequent runs
- Go to `localhost:8080` to trigger the DAG, or schedule it in a cloud service
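As an alternative to the web UI, the DAG can also be triggered from a script through the Airflow REST API. The sketch below assumes the stable REST API with basic auth is enabled in the docker-compose setup, and that the default airflow/airflow credentials and the DAG id from the earlier sketch are used; adjust these to match your deployment.

```python
# Illustrative DAG trigger via the Airflow 2 stable REST API.
import requests

response = requests.post(
    "http://localhost:8080/api/v1/dags/reddit_sea_pipeline/dagRuns",  # placeholder dag_id
    auth=("airflow", "airflow"),  # default docker-compose credentials (assumption)
    json={"conf": {}},
)
print(response.status_code, response.json())
```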
- Search YouTube for a tutorial or follow DE Zoomcamp Week 4
- Search YouTube for a tutorial