This project extracts Reddit data from r/popular through a combination of web scraping and API calls. The final output, which is shown below, is a Google Data Studio report showcasing the popular subreddits in each of the available regions of Southeast Asia (Malaysia, Philippines, Singapore, and Thailand).
- The Google Data Studio dashboard can be accessed through this link.
This project was motivated by an interest in how redditors from Southeast Asia (SEA) spend their time on the platform. Initial results have yielded the following observations:
- Redditors from the Philippines tend to participate more in their own local subreddits.
  - A potential reason may be the abundance of local subreddits catering to specific topics, whereas other SEA countries might only have a few local subreddits where topics are mixed.
- Other SEA countries have a few local subreddits with diverse topics.
  - Malaysia has r/Malaysia and r/Bolehland as its popular local subreddits.
  - Thailand has r/Thailand as its popular local subreddit.
- Mobile gaming subreddits (Genshin Impact, Zenless Zone Zero, and Honkai Star Rail) are popular in Malaysia, the Philippines, and Thailand.
  - Mobile gaming is very popular in SEA because of the large supply of cheap phones.
- Singapore's own subreddit r/Singapore is not popular in the region.
  - Singaporeans may be using another social media platform to discuss local topics; redditors from Singapore instead use Reddit to participate in international subreddits.
This project was also made with the desire to practice several tools for handling data. Frankly, some of the tools, such as cloud-based storage, overcomplicate the project; at the same time, using them provides a good opportunity to develop skills with such tools.
- Extract data using the Reddit API
- Load into GCP buckets
- Copy into GCP BigQuery
- Transform using dbt
- Create a Google Data Studio dashboard
- Orchestrate the workflow using Apache Airflow inside a Docker container
- Manage GCP resources using Terraform
1. The pipeline starts by scraping the post URLs from r/popular, since the Reddit API currently does not support this feature (to my knowledge). There is this endpoint, but I do not know whether it works with specific regions or only with the subreddit as a whole. Each post URL is then passed through the Reddit API to obtain the post details. The API returns a JSON object, and the script extracts the necessary details. The post details are turned into a pandas DataFrame and then saved into a Parquet file. A rough sketch of this step is shown below.
   - See extract.py
   - NOTE: If you wish to change the regions, change the variable `countries` in line 98
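The following is a minimal sketch of step 1, assuming PRAW is used as the Reddit API client; the actual extract.py may call the API differently, and the credentials, post URL, and file name here are placeholders.

```python
# Illustrative version of the extract step: fetch post details for scraped
# URLs, build a pandas DataFrame, and write it to a Parquet file.
import pandas as pd
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # from your Reddit app settings
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="sea-popular-pipeline",
)

def fetch_post_details(post_urls):
    """Return a DataFrame with one row of post details per scraped URL."""
    rows = []
    for url in post_urls:
        submission = reddit.submission(url=url)
        rows.append({
            "title": submission.title,
            "subreddit": submission.subreddit.display_name,
            "score": submission.score,
            "num_comments": submission.num_comments,
            "created_utc": submission.created_utc,
        })
    return pd.DataFrame(rows)

# Example usage with a placeholder post URL.
df = fetch_post_details(["https://www.reddit.com/r/singapore/comments/abc123/example/"])
df.to_parquet("reddit_posts_sg.parquet", index=False)
```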
2. The Parquet files are then uploaded into a GCP bucket, and the files in the bucket are copied into a GCP BigQuery table. A rough sketch of this step is shown below.
   - See load_bucket.py and load_bq.py
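The following is a minimal sketch of step 2 using the google-cloud-storage and google-cloud-bigquery client libraries; the bucket, project, dataset, and table names are placeholders, and load_bucket.py / load_bq.py may differ in the details.

```python
# Illustrative version of the load steps: upload the Parquet file to GCS,
# then load it from the bucket into a BigQuery table.
from google.cloud import bigquery, storage

# 1) Upload the local Parquet file to a GCS bucket.
storage_client = storage.Client()
bucket = storage_client.bucket("reddit-sea-bucket")  # placeholder bucket name
bucket.blob("raw/reddit_posts_sg.parquet").upload_from_filename("reddit_posts_sg.parquet")

# 2) Copy the file from the bucket into a BigQuery table.
bq_client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
load_job = bq_client.load_table_from_uri(
    "gs://reddit-sea-bucket/raw/reddit_posts_sg.parquet",
    "your-project.reddit_dataset.posts_raw",  # placeholder table id
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```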
3. The data in the initial BigQuery table is transformed through dbt Cloud.
   - See the dbt folder
Steps 1 and 2 compose the single DAG run through Apache Airflow. The orchestration tool runs inside a Docker container defined by the docker-compose.yaml file. The dbt transformation is scheduled through its own job on the dbt Cloud platform. A sketch of such a DAG is shown below.
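The following is a minimal sketch of what such a DAG could look like, assuming the extract and load scripts expose callable entry points (`run_extract`, `run_load_bucket`, and `run_load_bq` are placeholder names); the actual DAG in this repository may be structured differently.

```python
# Illustrative Airflow DAG chaining the extract and load steps.
# Task ids, the schedule, and the imported entry points are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from extract import run_extract            # hypothetical entry point
from load_bucket import run_load_bucket    # hypothetical entry point
from load_bq import run_load_bq            # hypothetical entry point

with DAG(
    dag_id="reddit_sea_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_reddit_posts", python_callable=run_extract)
    to_bucket = PythonOperator(task_id="load_to_gcs", python_callable=run_load_bucket)
    to_bq = PythonOperator(task_id="load_to_bigquery", python_callable=run_load_bq)

    extract >> to_bucket >> to_bq
```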
- GCP Account
- Docker
- Python
- Terraform
- Reddit Account
- dbt Cloud Account
- NOTE: The files below are renamed for .gitignore purposes, so that the copies containing your credentials are not committed
- Run the commands below, then fill in the required variables in the .env file

  ```
  cd .\airflow\
  mv dev.env .env
  ```
- Download your GCP service account key and store it in google_keys.json, then run the commands below (admittedly, this is not the best practice)

  ```
  cd .\airflow\
  mv google_keys.json keys.json
  ```
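For reference, here is a minimal sketch of how the GCP Python clients can pick up a service account key file; the project's scripts may instead rely on the GOOGLE_APPLICATION_CREDENTIALS environment variable set in the .env file, so treat the file name and approach here as assumptions.

```python
# Illustrative check that the renamed key file is usable by the GCP clients.
from google.cloud import storage

client = storage.Client.from_service_account_json("keys.json")
print([bucket.name for bucket in client.list_buckets()])  # should list your buckets
```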
- Fill in var.tf with the required GCP variables, then run the commands below

  ```
  mv var.tf variables.tf
  terraform init
  terraform apply
  ```
- Run the commands below

  ```
  cd .\airflow\
  docker compose build
  docker compose up airflow-init
  docker compose up -d
  ```
- NOTE: Only run `docker compose up -d` on subsequent runs
- Go to `localhost:8080` to trigger the DAG, or schedule it in a cloud service
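As an alternative to the web UI, the DAG can also be triggered from a script through the Airflow REST API. The sketch below assumes the stable REST API with basic auth is enabled in the docker-compose setup, and that the default airflow/airflow credentials and the DAG id from the earlier sketch are used; adjust these to match your deployment.

```python
# Illustrative DAG trigger via the Airflow 2 stable REST API.
import requests

response = requests.post(
    "http://localhost:8080/api/v1/dags/reddit_sea_pipeline/dagRuns",  # placeholder dag_id
    auth=("airflow", "airflow"),  # default docker-compose credentials (assumption)
    json={"conf": {}},
)
print(response.status_code, response.json())
```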
- Search YouTube for a tutorial or follow DE Zoomcamp Week 4
- Search YouTube for a tutorial