
projet_data_lake

Facebook Comments Data Lake Project

Overview

This project focuses on building a **Data Lake** for Facebook post comments. The pipeline covers **data extraction**, **cleaning**, **dimension modeling**, and **automatic loading** into a relational database using **Talend**. The goal is to enable structured analysis and reporting on user engagement data.


Project Workflow

1. Data Extraction

  • Tool: Facepager

  • Source: Facebook post (Post ID: 10102577175875681)

  • Extracted all comments and metadata including:

    • Profile ID
    • User name
    • Date/time of comment
    • Likes
    • Live video timestamp
    • Image URLs
    • Comment text

The extracted data was saved as an Excel file.
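Facepager writes its export as an Excel workbook, so a quick way to verify the extraction before cleaning is to load it with pandas. This is only a sketch; the file name `facepager_export.xlsx` is an assumption, not the actual export name:

```python
import pandas as pd

# Hypothetical file name for the Facepager export; adjust to the real path.
raw = pd.read_excel("facepager_export.xlsx")

print(len(raw), "comments extracted")
print(raw.columns.tolist())  # profile id, user name, date/time, likes, text, ...
```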


2. Data Cleaning

  • Tool: Python (pandas)

  • Process:

    1. Removed duplicates.
    2. Created dim_user table with a surrogate key for each user.
    3. Created dim_date table with a surrogate key and derived attributes (year, month, day, hour, minute, weekday).
    4. Added a fixed post_key for the post.
    5. Merged tables to form a fact_comment table with only relevant columns.
    6. Exported cleaned data to comments_cleaned.xlsx.

```python
# Example: creating the surrogate key for dim_user
dim_user["user_key"] = dim_user.index + 1
```

✅ Output: Cleaned dataset ready for loading into the database.
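A fuller sketch of steps 1–6 is shown below. It is illustrative only: the source column names (`profile_id`, `user_name`, `created_time`, `image_url`, ...) are assumptions about the Facepager export and may differ from the real file.

```python
import pandas as pd

# Illustrative sketch of the cleaning steps; source column names are assumptions.
raw = pd.read_excel("facepager_export.xlsx").drop_duplicates()  # step 1

# Step 2: dim_user with one row and one surrogate key per user
dim_user = (raw[["profile_id", "user_name"]]
            .drop_duplicates(subset="profile_id")
            .reset_index(drop=True))
dim_user["user_key"] = dim_user.index + 1

# Step 3: dim_date with a surrogate key and derived attributes
dim_date = raw[["created_time"]].drop_duplicates().reset_index(drop=True)
dim_date["date_key"] = dim_date.index + 1
ts = pd.to_datetime(dim_date["created_time"])
dim_date["year"] = ts.dt.year
dim_date["month"] = ts.dt.month
dim_date["day"] = ts.dt.day
dim_date["hour"] = ts.dt.hour
dim_date["minute"] = ts.dt.minute
dim_date["weekday"] = ts.dt.weekday

# Steps 4–5: fixed post_key, merge the surrogate keys back, keep relevant columns
fact = (raw.merge(dim_user[["profile_id", "user_key"]], on="profile_id")
           .merge(dim_date[["created_time", "date_key"]], on="created_time"))
fact["post_key"] = 1
fact["has_image"] = fact["image_url"].notna()
fact_comment = fact[["user_key", "date_key", "post_key", "likes",
                     "live_video_timestamp", "comment_text", "has_image"]]

# Step 6: export the cleaned fact table
fact_comment.to_excel("comments_cleaned.xlsx", index=False)
```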


3. Star Schema Design

  • Implemented a star schema with dimensions and a fact table:

Dimensions

```sql
-- User Dimension
CREATE TABLE dim_user (
    user_key   INT PRIMARY KEY,
    profile_id BIGINT UNIQUE,
    user_name  VARCHAR(255)
);

-- Date Dimension
CREATE TABLE dim_date (
    date_key      INT PRIMARY KEY,  -- e.g. YYYYMMDDHHMMSS
    full_datetime DATETIME,
    year          INT,
    month         TINYINT,
    day           TINYINT,
    hour          TINYINT,
    minute        TINYINT,
    weekday       TINYINT
);
```

Fact Table

```sql
-- Fact Table for Comments
CREATE TABLE fact_comment (
    fact_id              BIGINT PRIMARY KEY AUTO_INCREMENT,
    user_key             INT,
    date_key             INT,
    likes                INT,
    live_video_timestamp VARCHAR(50),
    comment_text         TEXT,
    has_image            BOOLEAN,
    FOREIGN KEY (user_key) REFERENCES dim_user(user_key),
    FOREIGN KEY (date_key) REFERENCES dim_date(date_key)
);
```
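The comment on `date_key` in the DDL above suggests the key can also encode the full timestamp as `YYYYMMDDHHMMSS`. A minimal illustration of that encoding (not taken from the project code):

```python
from datetime import datetime

# Illustration: encode a datetime as an integer date_key in YYYYMMDDHHMMSS form.
# Note: 14-digit keys exceed MySQL's INT range, so BIGINT would be needed for them.
dt = datetime(2015, 7, 3, 14, 25, 9)
date_key = int(dt.strftime("%Y%m%d%H%M%S"))
print(date_key)  # 20150703142509
```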


4. Data Loading Automation

  • Tool: Talend Open Studio 7 (with JDK 8)

  • Process:

    1. Created Talend Jobs to read cleaned Excel/CSV files.
    2. Mapped source columns to corresponding tables (dim_user, dim_date, fact_comment).
    3. Automated insertion of data into MySQL database in phpMyAdmin.
    4. Ensured foreign key relationships were maintained.

✅ Result: Automatic population of the star schema tables by the Talend jobs.
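Because the fact rows reference the dimension keys, a simple pre-load check helps catch orphan keys before the Talend inserts run. The snippet below is a hypothetical pandas check, not part of the actual Talend jobs; the dimension file names are assumptions (only `comments_cleaned.xlsx` is named in step 2):

```python
import pandas as pd

# Hypothetical pre-load check: every foreign key in the fact table must exist
# in its dimension, otherwise the FOREIGN KEY constraints reject the inserts.
fact_comment = pd.read_excel("comments_cleaned.xlsx")
dim_user = pd.read_excel("dim_user.xlsx")   # assumed export name
dim_date = pd.read_excel("dim_date.xlsx")   # assumed export name

assert fact_comment["user_key"].isin(dim_user["user_key"]).all(), "orphan user_key"
assert fact_comment["date_key"].isin(dim_date["date_key"]).all(), "orphan date_key"
```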


5. Key Achievements

  • End-to-end pipeline from data extraction to database population.
  • Implemented a star schema (dimension and fact tables) for user engagement analytics.
  • Fully automated ETL process using Talend.
  • Ready for reporting and analytics on Facebook comment data.

6. 🛠 Technologies Used

| Layer            | Technology                   |
|------------------|------------------------------|
| Data Extraction  | Facepager                    |
| Data Cleaning    | Python (pandas)              |
| Data Storage     | MySQL (phpMyAdmin)           |
| ETL / Automation | Talend Open Studio 7, JDK 8  |
| Schema Design    | Star Schema (Dim + Fact)     |

7. Dimension Date (dim_date) – Example from the Project

This is a screenshot of the dim_date table after populating it with Facebook comments data:

dim_date_example

The table includes surrogate keys, full datetime, and derived attributes like year, month, day, hour, minute, and weekday.

