Built an ETL pipeline using Python, Pandas, Python dictionary methods, and regular expressions to extract and transform the data. Created four CSV files and used the CSV file data to create an ERD and a table schema. Finally, uploaded the CSV data into a Postgres database.

helenaschatz/Crowdfunding_ETL

Crowdfunding ETL

A collaborative project to build an ETL pipeline using Python, Pandas, and PostgreSQL for extracting, transforming, and loading crowdfunding data from Excel files into a relational database.

ETL (Extract, Transform, Load) is an automated data-processing procedure that extracts relevant information from raw data, converts it into a format that fulfills business requirements, and loads it into a target system.
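As a toy illustration of the three stages (the data here is invented, not the project's):

```python
# Minimal ETL sketch: extract raw records, transform them, load into a target.
raw = ["  Alice,100 ", "Bob,250", "  Carol,175"]        # extract: raw text records
rows = [line.strip().split(",") for line in raw]        # transform: clean and parse
target = {name: int(amount) for name, amount in rows}   # load: into a target store
```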

Instructions

The instructions for this mini project are divided into the following subsections:

  1. Create the Category and Subcategory DataFrames
  2. Create the Campaign DataFrame
  3. Create the Contacts DataFrame
  4. Create the Crowdfunding Database
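As a sketch of step 1, a combined category column can be split into Category and Subcategory lookup DataFrames. The column name, sample values, and id formats below are assumptions for illustration, not necessarily the project's exact ones:

```python
import pandas as pd

# Hypothetical sample mirroring a combined "category & sub-category" column
# in crowdfunding.xlsx (column name and values are assumptions).
df = pd.DataFrame({"category & sub-category": ["film & video/documentary",
                                               "music/rock",
                                               "technology/web"]})

# Split the combined column into separate category and subcategory columns.
df[["category", "subcategory"]] = df["category & sub-category"].str.split("/", n=1, expand=True)

# Build unique lookup DataFrames with sequential ids.
categories = sorted(df["category"].unique())
category_df = pd.DataFrame({
    "category_id": [f"cat{i}" for i in range(1, len(categories) + 1)],
    "category": categories,
})
subcategories = sorted(df["subcategory"].unique())
subcategory_df = pd.DataFrame({
    "subcategory_id": [f"subcat{i}" for i in range(1, len(subcategories) + 1)],
    "subcategory": subcategories,
})
```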

Features

  • Use Python and Pandas to:
    • Extract and transform crowdfunding and contact data from Excel files
    • Create and export Category, Subcategory, Campaign, and Contact DataFrames as CSV files
  • Use PostgreSQL to:
    • Design an ERD and table schema for the database
    • Create and populate PostgreSQL database tables
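The project description mentions both Python dictionary methods and regular expressions for the transform step. Here is a minimal sketch of the two approaches on hypothetical contact rows; the stringified-dictionary format is an assumption about how contacts.xlsx stores its data:

```python
import json
import re

# Hypothetical raw rows mimicking stringified dictionaries in contacts.xlsx
# (the exact format in the source file is an assumption).
raw_rows = [
    '{"contact_id": 4661, "name": "Cecilia Velasco", "email": "cecilia.velasco@rodrigues.fr"}',
    '{"contact_id": 3765, "name": "Mariana Ellis", "email": "mariana.ellis@rossi.org"}',
]

# Option 1: Python dictionary methods - parse each row as JSON.
contacts = [json.loads(row) for row in raw_rows]

# Option 2: regular expressions - pull each field out with a pattern.
pattern = re.compile(r'"contact_id":\s*(\d+),\s*"name":\s*"([^"]+)",\s*"email":\s*"([^"]+)"')
contacts_re = [pattern.search(row).groups() for row in raw_rows]

# Split "name" into first_name and last_name, as a Contacts table typically expects.
for c in contacts:
    c["first_name"], c["last_name"] = c["name"].split(" ", 1)
```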

Data

  1. For our analysis, we extracted and transformed data from the following datasets, available in the Resources folder:

    • contacts.xlsx
    • crowdfunding.xlsx
  2. Then, we loaded the transformed data into the crowdfunding_db PostgreSQL database from the following CSV files:

    • category.csv
    • subcategory.csv
    • contacts.csv
    • campaign.csv
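Each transformed DataFrame is exported with pandas' to_csv. A minimal round-trip sketch (the DataFrame contents are invented, and a temporary directory stands in for the Resources folder):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the transformed Category DataFrame.
category_df = pd.DataFrame({"category_id": ["cat1", "cat2"],
                            "category": ["film & video", "music"]})

# A temp directory stands in for the Resources folder here;
# in the project this would be e.g. "Resources/category.csv".
out_dir = Path(tempfile.mkdtemp())
category_df.to_csv(out_dir / "category.csv", index=False)

# Read the file back to confirm the export preserved the data.
roundtrip = pd.read_csv(out_dir / "category.csv")
```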

How to Run

  • Clone the repository.

  • Install PostgreSQL.

  • Install the required Python packages: pandas, numpy, and openpyxl.

  • To run the analysis, first select the Python environment (Python 3.9.13) in Jupyter Notebook.

  • Select "Run All" in the ETL_Mini_Project.ipynb file, which extracts and transforms the data and creates the CSV files.

    • All CSV files will be located in the Resources folder.
  • Set up a PostgreSQL server to create a new database called crowdfunding_db.

  • In the Crowdfunding DB folder, use the provided crowdfunding_db_schema.sql file to create the tables in the PostgreSQL crowdfunding_db database.

    • Import each CSV file into its corresponding SQL table in the following order: category, subcategory, contacts, and campaign.
    • Query the database to verify the data has been loaded correctly by running a SELECT statement for each table.
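The load and verification steps can also be done from Python with pandas and SQLAlchemy. In this sketch an in-memory SQLite database stands in for the project's PostgreSQL crowdfunding_db so the snippet is self-contained; for Postgres the connection URL would look like "postgresql://user:password@localhost:5432/crowdfunding_db":

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for the PostgreSQL crowdfunding_db here.
engine = create_engine("sqlite:///:memory:")

# Hypothetical stand-in for Resources/category.csv.
category_df = pd.DataFrame({"category_id": ["cat1", "cat2"],
                            "category": ["film & video", "music"]})

# Load tables in foreign-key order: category, subcategory, contacts, campaign.
category_df.to_sql("category", engine, index=False, if_exists="append")

# Verify with a SELECT, as the instructions suggest.
count = pd.read_sql("SELECT COUNT(*) AS n FROM category", engine)["n"][0]
```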

Outcome

In this project, we demonstrated an ETL pipeline using Python, Pandas, and a combination of Python dictionary methods and regular expressions for data extraction and transformation. We successfully generated four CSV files, used the data in these files to design an ERD and a table schema, and then loaded the data from the CSV files into a PostgreSQL database via DataFrames.

Resources and Libraries

Programming Languages: Python
Database: PostgreSQL
Libraries: Pandas, numpy
Software: QuickDBD, Jupyter Notebook
Modules: datetime as dt, pprint, json

Team Members:

Helena Fedorenko
Jason Barbagallo
Xing Ying Chen
Nancy Santiago
Anthony Parry
