This project focuses on building a Data Lake for Facebook post comments. The pipeline covers *data extraction, **cleaning, **dimension modeling, and *automatic loading into a relational database using Talend. The goal is to enable structured analysis and reporting on user engagement data.
-
Tool: Facepager
-
Source: Facebook post (Post ID: 10102577175875681)
-
Extracted all comments and metadata including:
- Profile ID
- User name
- Date/time of comment
- Likes
- Live video timestamp
- Image URLs
- Comment text
The extracted data was saved as an Excel file.
-
Tool: Python (pandas)
-
Process:
- Removed duplicates.
- Created dim_user table with a surrogate key for each user.
- Created dim_date table with a surrogate key and derived attributes (year, month, day, hour, minute, weekday).
- Added a fixed post_key for the post.
- Merged tables to form a fact_comment table with only relevant columns.
- Exported cleaned data to comments_cleaned.xlsx.
python
dim_user["user_key"] = dim_user.index + 1
✅ Output: Cleaned dataset ready for loading into the database.
- Implemented a star schema with dimensions and a fact table:
sql -- User Dimension CREATE TABLE dim_user ( user_key INT PRIMARY KEY, profile_id BIGINT UNIQUE, user_name VARCHAR(255) );
-- Date Dimension CREATE TABLE dim_date ( date_key INT PRIMARY KEY, -- e.g. YYYYMMDDHHMMSS full_datetime DATETIME, year INT, month TINYINT, day TINYINT, hour TINYINT, minute TINYINT, weekday TINYINT );
sql -- Fact Table for Comments CREATE TABLE fact_comment ( fact_id BIGINT PRIMARY KEY AUTO_INCREMENT, user_key INT, date_key INT, likes INT, live_video_timestamp VARCHAR(50), comment_text TEXT, has_image BOOLEAN, FOREIGN KEY (user_key) REFERENCES dim_user(user_key), FOREIGN KEY (date_key) REFERENCES dim_date(date_key) );
-
Tool: Talend Open Studio 7 (with JDK 8)
-
Process:
- Created Talend Jobs to read cleaned Excel/CSV files.
- Mapped source columns to corresponding tables (dim_user, dim_date, fact_comment).
- Automated insertion of data into MySQL database in phpMyAdmin.
- Ensured foreign key relationships were maintained.
✅ Result: Automatic population of the star schema tables from Talend.
- End-to-end pipeline from data extraction to database population.
- Implemented a multimodal star schema for user engagement analytics.
- Fully automated ETL process using Talend.
- Ready for reporting and analytics on Facebook comment data.
| Layer | Technology |
|---|---|
| Data Extraction | Facepager |
| Data Cleaning | Python (pandas) |
| Data Storage | MySQL (phpMyAdmin) |
| ETL / Automation | Talend Open Studio 7, JDK 8 |
| Schema Design | Star Schema (Dim + Fact) |
This is a screenshot of the dim_date table after populating it with Facebook comments data:
The table includes surrogate keys, full datetime, and derived attributes like year, month, day, hour, minute, and weekday.
