This repository contains the code for an ETL (Extract, Transform, Load) pipeline focused on World Bank data. The project involves cleaning and combining datasets from various sources, including CSV, JSON, and XML files. The goal is to create a unified dataset for predicting World Bank Project total costs using a machine learning model.
- Ensure you have Python installed (version 3.6 or later).
- Install the required packages:

  ```
  pip install -r requirements.txt
  ```
- Clone the repository:

  ```
  gh repo clone AdityaDwivediAtGit/World-Bank-ETL-Pipeline
  ```
- Navigate to the project directory, install the prerequisites, and unzip the data files:

  ```
  cd World-Bank-ETL-Pipeline
  pip install -r requirements.txt
  unzip archive_etl.zip
  ```
- Run the ETL pipeline:

  ```
  python main.py
  ```
- `cleaned_files/`: Contains the cleaned CSV files. (This directory is generated automatically after you run `main.py`.)
- `population_data.db`, `projects_data.csv`, ...: Raw data files that appear after you unzip `archive_etl.zip`; they sit in the same folder as `main.py`.
- `Documentation/ETL_PySpark_task3.ipynb`: A Jupyter notebook that documents, step by step, how I completed the ETL process.
- `Documentation/ETL_fullCodeTest.ipynb`: A Jupyter notebook with a detailed test run of the ETL process.
- The ETL pipeline uses PySpark for efficient data processing, so make sure Java is installed on your machine. (A minimal sketch of this kind of PySpark step is shown after this list.)
- Modify the `debug` variable in `main.py` to toggle debugging information.
- After loading finishes, `combined_data_db.sqlite` is generated as output and contains all of the tables (see the `sqlite3` snippet after this list for a quick way to inspect it).
- You can read more about what happens behind the scenes in `ETL_PySpark_task3.ipynb` (in the `Documentation/` directory).
- For a detailed test run of the ETL process, refer to `ETL_fullCodeTest.ipynb` (in the `Documentation/` directory).
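To make the PySpark note above concrete, here is a minimal, illustrative sketch of a read/clean/write step. It is not the repository's exact code: only `projects_data.csv` (one of the raw files from `archive_etl.zip`) is shown, the cleaning is reduced to a placeholder, and the output path under `cleaned_files/` is an assumption.

```python
from pyspark.sql import SparkSession

# Start a local Spark session (this is why Java must be installed).
spark = SparkSession.builder.appName("WorldBankETL").getOrCreate()

# Extract: read one of the raw files into a DataFrame.
projects = spark.read.csv("projects_data.csv", header=True, inferSchema=True)

# Transform: drop rows where every column is null (a stand-in for the real
# cleaning logic in main.py).
projects_cleaned = projects.dropna(how="all")

# Load: write the cleaned result into the cleaned_files/ directory.
projects_cleaned.write.mode("overwrite").csv(
    "cleaned_files/projects_cleaned", header=True
)

spark.stop()
```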
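Once a run completes, you can sanity-check the generated `combined_data_db.sqlite` with Python's built-in `sqlite3` module. The snippet below only lists the table names; it makes no assumptions about their schemas.

```python
import sqlite3

# Open the SQLite database produced by the pipeline.
conn = sqlite3.connect("combined_data_db.sqlite")

# List every table stored in the database.
tables = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print([name for (name,) in tables])

conn.close()
```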
Feel free to provide feedback or suggestions! Contributions are welcome.
Author: Aditya Dwivedi
Note: Ensure you have the necessary dependencies installed before running the code.