
projet_data_lake

Facebook Comments Data Lake Project

Overview

This project focuses on building a **Data Lake** for Facebook post comments. The pipeline covers **data extraction**, **cleaning**, **dimension modeling**, and **automatic loading** into a relational database using **Talend**. The goal is to enable structured analysis and reporting on user engagement data.


Project Workflow

1. Data Extraction

  • Tool: Facepager

  • Source: Facebook post (Post ID: 10102577175875681)

  • Extracted all comments and metadata including:

    • Profile ID
    • User name
    • Date/time of comment
    • Likes
    • Live video timestamp
    • Image URLs
    • Comment text

The extracted data was saved as an Excel file.
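Facepager writes its export as an Excel workbook, so a quick way to verify the extraction before cleaning is to load it with pandas. This is only a sketch; the file name `facepager_export.xlsx` is an assumption, not the actual export name:

```python
import pandas as pd

# Hypothetical file name for the Facepager export; adjust to the real path.
raw = pd.read_excel("facepager_export.xlsx")

print(len(raw), "comments extracted")
print(raw.columns.tolist())  # profile id, user name, date/time, likes, text, ...
```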


2. Data Cleaning

  • Tool: Python (pandas)

  • Process:

    1. Removed duplicates.
    2. Created dim_user table with a surrogate key for each user.
    3. Created dim_date table with a surrogate key and derived attributes (year, month, day, hour, minute, weekday).
    4. Added a fixed post_key for the post.
    5. Merged tables to form a fact_comment table with only relevant columns.
    6. Exported cleaned data to comments_cleaned.xlsx.

```python
# Example: creating the surrogate key for dim_user
dim_user["user_key"] = dim_user.index + 1
```

✅ Output: Cleaned dataset ready for loading into the database.
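A fuller sketch of steps 1–6 is shown below. It is illustrative only: the source column names (`profile_id`, `user_name`, `created_time`, `image_url`, ...) are assumptions about the Facepager export and may differ from the real file.

```python
import pandas as pd

# Illustrative sketch of the cleaning steps; source column names are assumptions.
raw = pd.read_excel("facepager_export.xlsx").drop_duplicates()  # step 1

# Step 2: dim_user with one row and one surrogate key per user
dim_user = (raw[["profile_id", "user_name"]]
            .drop_duplicates(subset="profile_id")
            .reset_index(drop=True))
dim_user["user_key"] = dim_user.index + 1

# Step 3: dim_date with a surrogate key and derived attributes
dim_date = raw[["created_time"]].drop_duplicates().reset_index(drop=True)
dim_date["date_key"] = dim_date.index + 1
ts = pd.to_datetime(dim_date["created_time"])
dim_date["year"] = ts.dt.year
dim_date["month"] = ts.dt.month
dim_date["day"] = ts.dt.day
dim_date["hour"] = ts.dt.hour
dim_date["minute"] = ts.dt.minute
dim_date["weekday"] = ts.dt.weekday

# Steps 4–5: fixed post_key, merge the surrogate keys back, keep relevant columns
fact = (raw.merge(dim_user[["profile_id", "user_key"]], on="profile_id")
           .merge(dim_date[["created_time", "date_key"]], on="created_time"))
fact["post_key"] = 1
fact["has_image"] = fact["image_url"].notna()
fact_comment = fact[["user_key", "date_key", "post_key", "likes",
                     "live_video_timestamp", "comment_text", "has_image"]]

# Step 6: export the cleaned fact table
fact_comment.to_excel("comments_cleaned.xlsx", index=False)
```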


3. Star Schema Design

  • Implemented a star schema with dimensions and a fact table:

Dimensions

```sql
-- User Dimension
CREATE TABLE dim_user (
    user_key   INT PRIMARY KEY,
    profile_id BIGINT UNIQUE,
    user_name  VARCHAR(255)
);

-- Date Dimension
CREATE TABLE dim_date (
    date_key      INT PRIMARY KEY,  -- e.g. YYYYMMDDHHMMSS
    full_datetime DATETIME,
    year          INT,
    month         TINYINT,
    day           TINYINT,
    hour          TINYINT,
    minute        TINYINT,
    weekday       TINYINT
);
```

Fact Table

```sql
-- Fact Table for Comments
CREATE TABLE fact_comment (
    fact_id              BIGINT PRIMARY KEY AUTO_INCREMENT,
    user_key             INT,
    date_key             INT,
    likes                INT,
    live_video_timestamp VARCHAR(50),
    comment_text         TEXT,
    has_image            BOOLEAN,
    FOREIGN KEY (user_key) REFERENCES dim_user(user_key),
    FOREIGN KEY (date_key) REFERENCES dim_date(date_key)
);
```
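The comment on `date_key` in the DDL above suggests the key can also encode the full timestamp as `YYYYMMDDHHMMSS`. A minimal illustration of that encoding (not taken from the project code):

```python
from datetime import datetime

# Illustration: encode a datetime as an integer date_key in YYYYMMDDHHMMSS form.
# Note: 14-digit keys exceed MySQL's INT range, so BIGINT would be needed for them.
dt = datetime(2015, 7, 3, 14, 25, 9)
date_key = int(dt.strftime("%Y%m%d%H%M%S"))
print(date_key)  # 20150703142509
```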


4. Data Loading Automation

  • Tool: Talend Open Studio 7 (with JDK 8)

  • Process:

    1. Created Talend Jobs to read cleaned Excel/CSV files.
    2. Mapped source columns to corresponding tables (dim_user, dim_date, fact_comment).
    3. Automated insertion of data into MySQL database in phpMyAdmin.
    4. Ensured foreign key relationships were maintained.

✅ Result: Automatic population of the star schema tables by the Talend jobs.
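Because the fact rows reference the dimension keys, a simple pre-load check helps catch orphan keys before the Talend inserts run. The snippet below is a hypothetical pandas check, not part of the actual Talend jobs; the dimension file names are assumptions (only `comments_cleaned.xlsx` is named in step 2):

```python
import pandas as pd

# Hypothetical pre-load check: every foreign key in the fact table must exist
# in its dimension, otherwise the FOREIGN KEY constraints reject the inserts.
fact_comment = pd.read_excel("comments_cleaned.xlsx")
dim_user = pd.read_excel("dim_user.xlsx")   # assumed export name
dim_date = pd.read_excel("dim_date.xlsx")   # assumed export name

assert fact_comment["user_key"].isin(dim_user["user_key"]).all(), "orphan user_key"
assert fact_comment["date_key"].isin(dim_date["date_key"]).all(), "orphan date_key"
```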


5. Key Achievements

  • End-to-end pipeline from data extraction to database population.
  • Implemented a star schema (dimension and fact tables) for user engagement analytics.
  • Fully automated ETL process using Talend.
  • Ready for reporting and analytics on Facebook comment data.

6. 🛠 Technologies Used

| Layer            | Technology                   |
|------------------|------------------------------|
| Data Extraction  | Facepager                    |
| Data Cleaning    | Python (pandas)              |
| Data Storage     | MySQL (phpMyAdmin)           |
| ETL / Automation | Talend Open Studio 7, JDK 8  |
| Schema Design    | Star Schema (Dim + Fact)     |

7. Dimension Date (dim_date) – Example from the Project

This is a screenshot of the dim_date table after populating it with Facebook comments data:

dim_date_example

The table includes surrogate keys, full datetime, and derived attributes like year, month, day, hour, minute, and weekday.

