
LinkedIn Company Data ETL System


Overview

This repository contains an ETL (Extract, Transform, Load) pipeline for processing LinkedIn company profile data from JSON files and storing it in a normalized MySQL database. The system covers conceptual, logical, and physical design aspects to ensure data consistency and referential integrity.

Key Features

  • Data Modeling
    • Conceptual, logical, and physical diagrams in Crow's Foot notation.
    • Schema normalized to Third Normal Form (3NF).
  • Staging
    • Loads raw JSON data into a staging database (staging) using Pentaho Data Integration.
  • Transformation
    • Normalizes data into intermediate tables, correcting inconsistencies and handling duplicates.
  • Loading
    • Moves transformed data into the final company database with proper constraints and relationships (see the schema sketch after this list).
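
A minimal sketch of this staging-to-3NF progression, assuming hypothetical table and column names (the actual DDL lives in staging.sql; the final table names match the example query below):

-- Staging: a weakly typed landing zone for the raw JSON fields (hypothetical columns)
CREATE TABLE staging_company (
    company_id  VARCHAR(64),
    name        TEXT,
    industry    TEXT,
    specialties TEXT  -- delivered as a delimited list in the source JSON
);

-- Final: 3NF tables with enforced keys
CREATE TABLE companies (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL
);

CREATE TABLE specialties (
    id             INT AUTO_INCREMENT PRIMARY KEY,
    specialty_name VARCHAR(255) NOT NULL UNIQUE
);

CREATE TABLE company_specialties (
    company_id   INT NOT NULL,
    specialty_id INT NOT NULL,
    PRIMARY KEY (company_id, specialty_id),
    FOREIGN KEY (company_id) REFERENCES companies (id),
    FOREIGN KEY (specialty_id) REFERENCES specialties (id)
);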

Technology Stack

  • MySQL for relational storage
  • Pentaho Data Integration (Spoon) for ETL workflows
  • Python 3.x for scripting and data manipulation
  • JSON as source format

Repository Structure

sql_db/
├── README.md                     # This documentation
├── instructions.md               # Setup and execution instructions
├── normalization_script.py       # Python data processing script
├── stage-json-to-table.ktr       # Pentaho transformation definition
├── staging.sql                   # SQL schema and normalization queries
└── company-profile/              # Source JSON data files
    └── [company_*.json]          # Multiple company profile files

Data Flow

  1. Extract
    • JSON files from the company-profile directory are identified and read.
  2. Staging
    • Using Pentaho Data Integration (Spoon), data is inserted into staging tables in MySQL.
  3. Transform
    • Primary Method: Run the SQL statements in staging.sql for normalization:
      • First execute the table creation section to establish the schema
      • Then execute the remaining transformation queries in order
    • Optional: run normalization_script.py only if additional custom logic is required.
  4. Load
    • Final normalized data is moved into the company database with enforced primary/foreign keys (see the transformation sketch after this list).
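
As an illustration of steps 3 and 4, the deduplicating inserts might look like the following. This is a sketch only: staging_company comes from the hypothetical schema sketch above, and staging_company_specialty is a hypothetical intermediate table assumed to hold one specialty per row.

-- Transform: deduplicate company names on the way out of staging
INSERT INTO companies (name)
SELECT DISTINCT TRIM(name)
FROM staging_company
WHERE name IS NOT NULL AND TRIM(name) <> '';

-- Transform: build the specialty lookup; INSERT IGNORE skips rows that
-- collide with the UNIQUE constraint on specialty_name
INSERT IGNORE INTO specialties (specialty_name)
SELECT DISTINCT TRIM(specialty_name)
FROM staging_company_specialty;

-- Load: resolve natural keys to surrogate ids when building the link table
INSERT INTO company_specialties (company_id, specialty_id)
SELECT DISTINCT c.id, s.id
FROM staging_company_specialty scs
JOIN companies c ON c.name = TRIM(scs.company_name)
JOIN specialties s ON s.specialty_name = TRIM(scs.specialty_name);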

Getting Started

Prerequisites

  • MySQL Server 5.7+
  • Pentaho Data Integration 8.0+
  • Python 3.6+ (optional, only if using the Python script)

Quick Setup

  1. Clone this repository.
  2. Launch Pentaho Data Integration with ./spoon.sh (or Spoon.bat on Windows).
  3. Configure database connections in Pentaho.
  4. Run the Pentaho transformation to load staging data.
  5. Normalize the data:
    • Run the table-creation section of staging.sql first, then the remaining transformation queries in order:
      mysql -u <user> -p <database> < staging.sql
    • Note: normalization_script.py is optional and only needed for specialized transformations.
  6. Confirm the final structure in the company database (see the verification queries below).
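
For step 6, a few spot checks along these lines can confirm the load (written against the table names used in the example query below; adjust to the actual schema):

-- Row counts after the load
SELECT COUNT(*) AS company_count FROM companies;
SELECT COUNT(*) AS link_count FROM company_specialties;

-- Orphan check: link rows without a parent company (should return no rows)
SELECT cs.company_id
FROM company_specialties cs
LEFT JOIN companies c ON c.id = cs.company_id
WHERE c.id IS NULL;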

Example Query

-- Retrieve companies with their specialties
SELECT c.name, GROUP_CONCAT(s.specialty_name) AS specialties
FROM companies c
JOIN company_specialties cs ON c.id = cs.company_id
JOIN specialties s ON cs.specialty_id = s.id
GROUP BY c.id, c.name;  -- group by id so same-named companies stay distinct
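
Note that GROUP_CONCAT truncates its result at group_concat_max_len (1024 characters by default); raise the session variable if companies carry long specialty lists.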

Additional Documentation

See instructions.md for more details on the SQL normalization process and how the ERD was developed to achieve proper 3NF design.
