# Building an Efficient ETL Pipeline for Property Records in Real Estate

**Specialization:** Data Engineering
**Business Focus:** Real Estate (Zipco Real Estate Agency)
**Tools:** Python · PostgreSQL · SQLAlchemy · GitHub
**Project Level:** Beginner
- Business Context
- Problem Statement
- Project Objectives
- Assumptions
- Architecture
- Data Extraction
- Data Transformation
- Data Schema
- Data Loading
- Automation & Scheduling
- Project Structure
- Setup & Run Locally
- Environment Variables
- Contributing
- License & Contact
## Business Context

Zipco Real Estate Agency operates in the fast-paced, competitive world of real estate. Their success hinges on:
- Deep knowledge of local market dynamics in the South Atlantic Division (DE, FL, GA, MD, NC, SC, VA, WV)
- Exceptional customer service and robust online presence
- Up-to-date property listings (no older than 5 days)
Timely access to accurate data is crucial for securing the hottest deals and maintaining a competitive edge.
## Problem Statement

Current challenges at Zipco:
- Inefficient Data Processing: Manual workflows delay access to critical property information.
- Disparate & Inconsistent Datasets: Multiple sources and formats complicate analysis and reporting.
- Compromised Data Quality: Inaccuracies and outdated records lead to poor decision-making.
- High Operational Costs: Manual data reconciliation diverts resources from growth activities.
## Project Objectives

- Automate the full ETL pipeline (Extract → Transform → Load) using Python and SQLAlchemy.
- Standardize and cleanse diverse property records.
- Load structured data into a PostgreSQL database with a 3NF schema.
- Schedule batch runs every 5 days to ensure the freshest listings.
## Assumptions

- Zipco deals in both rental and sales listings.
- Operations are focused on the South Atlantic Division (DE, FL, GA, MD, NC, SC, VA, WV).
- Only properties listed within the last 5 days are ingested to maintain recency.
## Architecture

A high-level overview can be found in the `schema_and_architecture/` folder (created in Canva).
```text
┌────────────┐   ┌─────────────┐   ┌────────────┐
│  RentCast  │ → │ Extraction  │ → │ Pre‑Clean  │
└────────────┘   └─────────────┘   └────────────┘
                                         ↓
                                  ┌──────────────┐
                                  │ CleaningJob  │
                                  └──────────────┘
                                         ↓
                                  ┌─────────────┐
                                  │   Loading   │ → PostgreSQL
                                  └─────────────┘
```
## Data Extraction

- Source: RentCast API (rental & sales endpoints)
- Challenges: 500-row limit per call, one state per request
- Solution: Loop through the 8 states and both endpoints, filter by `days_listed ≤ 5`, and aggregate the results (see the sketch below).
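A minimal sketch of that extraction loop, assuming the `requests` library is used; the endpoint paths, the `X-Api-Key` header, the query parameters, and the `days_listed` field are illustrative assumptions rather than the exact calls in `data.py`:

```python
# Sketch of the extraction loop: 8 states x 2 endpoints, 500-row cap per call.
# Endpoint paths, header name, query parameters and the days_listed field are
# assumptions; see data.py and the RentCast API docs for the exact values.
import os
import requests

STATES = ["DE", "FL", "GA", "MD", "NC", "SC", "VA", "WV"]
ENDPOINTS = {
    "sales": "https://api.rentcast.io/v1/listings/sale",                # assumed path
    "rentals": "https://api.rentcast.io/v1/listings/rental/long-term",  # assumed path
}

def extract_listings(api_key: str, max_days_listed: int = 5) -> dict:
    """Pull recent listings for every state and both endpoints as raw JSON records."""
    headers = {"X-Api-Key": api_key}
    results = {"sales": [], "rentals": []}
    for kind, url in ENDPOINTS.items():
        for state in STATES:
            resp = requests.get(
                url,
                headers=headers,
                params={"state": state, "limit": 500},  # one state per request, 500 rows max
                timeout=30,
            )
            resp.raise_for_status()
            for record in resp.json():  # assumes the body is a JSON array of listings
                # Keep only listings added within the last `max_days_listed` days.
                if record.get("days_listed", 0) <= max_days_listed:
                    results[kind].append(record)
    return results

if __name__ == "__main__":
    listings = extract_listings(os.environ["API_KEY"])
    print(len(listings["sales"]), "sale and", len(listings["rentals"]), "rental listings")
```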
## Data Transformation

- Pre-cleaning: Parse the raw JSON into sub-datasets:
  - Sales: `sales_info`, `property_history`, `agent_info`, `officer_info`
  - Rentals: `rental_info`, `property_history`
- CleaningJob (see the sketch below):
  - Load the sub-datasets into pandas DataFrames
  - Drop duplicates & unnecessary columns
  - Fill selective missing values
  - Convert types (e.g. dates) & strip whitespace/symbols
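A minimal sketch of one such cleaning pass with pandas; the column names used here (`city`, `listed_date`, `price`, `raw_json`) and the `clean_sales_info` helper are illustrative assumptions, not the exact logic in `cleaning_job.py`:

```python
# Illustrative cleaning pass over one sub-dataset; column names are assumptions.
import pandas as pd

def clean_sales_info(records: list[dict]) -> pd.DataFrame:
    """Turn raw sales_info records into a tidy, typed DataFrame."""
    df = pd.DataFrame(records)

    # Drop exact duplicates and columns that are not needed downstream.
    df = df.drop_duplicates()
    df = df.drop(columns=[c for c in ("raw_json",) if c in df.columns])

    # Fill selective missing values rather than dropping whole rows.
    if "city" in df.columns:
        df["city"] = df["city"].fillna("Unknown")

    # Convert types: listing dates to datetime, prices to numeric.
    if "listed_date" in df.columns:
        df["listed_date"] = pd.to_datetime(df["listed_date"], errors="coerce")
    if "price" in df.columns:
        df["price"] = pd.to_numeric(df["price"], errors="coerce")

    # Strip stray whitespace from free-text columns.
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.strip()
    return df
```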
## Data Schema

- Normalization: 3NF
- Sales Tables (4): `sales_info` (central), `property_history`, `agent_info`, `officer_info`
- Rentals Tables (2): `rental_info` (central), `property_history`

Foreign keys enforce 1-to-many relationships between the central tables and their history/sub-tables.
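To illustrate the 1-to-many layout, here is a minimal SQLAlchemy ORM sketch of the central sales table and its history table; the column names and the `property_id` key are assumptions, not the exact schema created by `load.py`:

```python
# Sketch of the sales side of the 3NF schema with SQLAlchemy ORM.
# Column names and the property_id key are illustrative assumptions.
from sqlalchemy import Column, Date, ForeignKey, Integer, Numeric, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class SalesInfo(Base):
    """Central sales table; one row per listed property."""
    __tablename__ = "sales_info"
    property_id = Column(String, primary_key=True)
    price = Column(Numeric)
    state = Column(String(2))
    listed_date = Column(Date)
    history = relationship("PropertyHistory", back_populates="listing")

class PropertyHistory(Base):
    """Many history events per property (1-to-many via the foreign key)."""
    __tablename__ = "property_history"
    id = Column(Integer, primary_key=True, autoincrement=True)
    property_id = Column(String, ForeignKey("sales_info.property_id"))
    event = Column(String)
    event_date = Column(Date)
    listing = relationship("SalesInfo", back_populates="history")
```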
## Data Loading

- Database: PostgreSQL
- ORM: SQLAlchemy
- Batch Logic (see the sketch below):
  - First run: `if_exists='replace'`
  - Subsequent runs (every 5 days): `if_exists='append'`
Note: Future PRs are welcome to propose upserts or merge logic to handle duplicates more gracefully.
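A minimal sketch of that batch logic with pandas `to_sql`; the `load_frame` helper and the first-run check via SQLAlchemy's `inspect()` are assumptions about how `load.py` might do it, not an excerpt:

```python
# Illustrative batch loading: 'replace' on the first run, 'append' afterwards.
# The load_frame helper and the inspect()-based first-run check are assumptions.
import pandas as pd
from sqlalchemy import inspect
from sqlalchemy.engine import Engine

def load_frame(df: pd.DataFrame, table: str, engine: Engine) -> None:
    """Write one cleaned DataFrame to PostgreSQL with the 5-day batch logic."""
    # First run: the target table does not exist yet, so create it ('replace').
    # Later batches (every 5 days): append the freshly extracted listings.
    mode = "append" if inspect(engine).has_table(table) else "replace"
    df.to_sql(table, engine, if_exists=mode, index=False)
```

The engine itself can be built from the `.env` variables shown under Setup & Run Locally.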
## Automation & Scheduling

Use a scheduler (e.g., cron, Airflow) to trigger the pipeline every 5 days:

```bash
# Example cron entry: run at midnight on every 5th day of the month
0 0 */5 * * cd /path/to/Real-cast && python load.py
```
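One caveat: `*/5` in the day-of-month field restarts on the 1st of every month, so the gap between a month's last run and the next month's first run can be shorter than 5 days. If strict 5-day spacing matters, a daily cron entry could instead call a small guard script like the hypothetical `run_if_due.py` sketched below (not part of this repo), which only launches `load.py` when 5 days have elapsed since the last successful run:

```python
# Hypothetical guard script (not part of this repo): only trigger the pipeline
# if at least 5 days have passed since the last successful run.
import subprocess
import sys
from datetime import datetime, timedelta
from pathlib import Path

STAMP = Path(".last_run")       # timestamp file written after each successful run
INTERVAL = timedelta(days=5)

def due() -> bool:
    if not STAMP.exists():
        return True
    last = datetime.fromisoformat(STAMP.read_text().strip())
    return datetime.now() - last >= INTERVAL

if __name__ == "__main__":
    if due():
        subprocess.run([sys.executable, "load.py"], check=True)
        STAMP.write_text(datetime.now().isoformat())
```

The daily cron line would then be `0 0 * * * cd /path/to/Real-cast && python run_if_due.py`.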
## Project Structure

```text
Real-cast/
├── clean_data/                 # Parquet files from the first batch run
├── schema_and_architecture/    # Architecture diagrams (Canva exports)
├── data.py                     # Handles API extraction
├── precleaning.py              # JSON parsing & sub-dataset extraction
├── cleaning_job.py             # DataFrame cleaning & transformation
├── load.py                     # ORM logic to load into PostgreSQL
├── requirements.txt            # Python dependencies
└── README.md                   # This file
```
## Setup & Run Locally

1. **Fork & clone** this repo:

   ```bash
   git clone https://github.com/Nel-zi/Real-cast-project
   cd Real-cast-project
   ```
2. **Create & activate** a virtual environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate   # macOS/Linux
   .\.venv\Scripts\activate    # Windows
   ```

3. **Install** dependencies:

   ```bash
   pip install -r requirements.txt
   ```
4. **Obtain** your RentCast API key (free tier) from https://app.rentcast.io/
5. **Create** a `.env` file in the project root:
   ```dotenv
   API_KEY="<your_api_key>"
   DB_NAME="<your_db_name>"
   DB_USER="<your_db_user>"
   DB_PASSWORD="<your_db_password>"
   DB_HOST="localhost"
   DB_PORT="5432"
   ```

6. **Create** your PostgreSQL database and update `.env` accordingly.

7. **Run** the pipeline:

   ```bash
   python load.py
   ```
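For reference, here is a minimal sketch of how these environment variables can be turned into a database connection, assuming `python-dotenv` and the `psycopg2` driver are installed; the actual connection logic in `load.py` may differ:

```python
# Minimal sketch: read the .env values and build a PostgreSQL engine.
# Assumes python-dotenv and psycopg2 are installed; load.py may differ.
import os
from dotenv import load_dotenv
from sqlalchemy import create_engine, text

load_dotenv()  # pulls API_KEY, DB_NAME, DB_USER, ... into the environment

engine = create_engine(
    f"postgresql+psycopg2://{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
    f"@{os.environ['DB_HOST']}:{os.environ['DB_PORT']}/{os.environ['DB_NAME']}"
)

with engine.connect() as conn:
    print(conn.execute(text("SELECT version()")).scalar())  # quick connectivity check
```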
## Contributing
Contributions, issues, and feature requests are welcome!
Feel free to open a Pull Request or an Issue with your suggested improvements, especially if you add visualizations or refine the loading logic.
## License & Contact
This project is open source under the MIT License.
For questions or support, reach out via my GitHub profile: [@Nel-zi](https://github.com/Nel-zi)