# Building an Efficient ETL Pipeline for Property Records in Real Estate

**Specialization:** Data Engineering
**Business Focus:** Real Estate (Zipco Real Estate Agency)
**Tools:** Python · PostgreSQL · SQLAlchemy · GitHub
**Project Level:** Beginner
- Business Context
- Problem Statement
- Project Objectives
- Assumptions
- Architecture
- Data Extraction
- Data Transformation
- Data Schema
- Data Loading
- Automation & Scheduling
- Project Structure
- Setup & Run Locally
- Environment Variables
- Contributing
- License & Contact
## Business Context

Zipco Real Estate Agency operates in the fast-paced, competitive world of real estate. Their success hinges on:
- Deep knowledge of local market dynamics in the South Atlantic Division (DE, FL, GA, MD, NC, SC, VA, WV)
- Exceptional customer service and robust online presence
- Up-to-date property listings (no older than 5 days)
Timely access to accurate data is crucial for securing the hottest deals and maintaining a competitive edge.
## Problem Statement

Current challenges at Zipco:
- Inefficient Data Processing: Manual workflows delay access to critical property information.
- Disparate & Inconsistent Datasets: Multiple sources and formats complicate analysis and reporting.
- Compromised Data Quality: Inaccuracies and outdated records lead to poor decision-making.
- High Operational Costs: Manual data reconciliation diverts resources from growth activities.
## Project Objectives

- Automate the full ETL pipeline (Extract → Transform → Load) using Python and SQLAlchemy.
- Standardize and cleanse diverse property records.
- Load structured data into a PostgreSQL database with a 3NF schema.
- Schedule batch runs every 5 days to ensure the freshest listings.
## Assumptions

- Zipco deals in both rental and sales listings.
- Operations are focused on the South Atlantic Division (DE, FL, GA, MD, NC, SC, VA, WV).
- Only properties listed within the last 5 days are ingested to maintain recency.
## Architecture

A high-level overview can be found in the `schema_and_architecture/` folder (created in Canva).
```text
┌────────────┐   ┌─────────────┐   ┌────────────┐
│  RentCast  │ → │ Extraction  │ → │ Pre‑Clean  │
└────────────┘   └─────────────┘   └────────────┘
                                         ↓
                                  ┌──────────────┐
                                  │ CleaningJob  │
                                  └──────────────┘
                                         ↓
                                  ┌─────────────┐
                                  │   Loading   │ → PostgreSQL
                                  └─────────────┘
```
## Data Extraction

- Source: RentCast API (rental & sales endpoints)
- Challenges: 500-row limit per call, one state per request
- Solution: Loop through the 8 states and both endpoints, filter by `days_listed ≤ 5`, and aggregate the results (see the sketch below).
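A minimal sketch of that extraction loop, assuming the `requests` library is used; the endpoint paths, the `X-Api-Key` header, the query parameters, and the `days_listed` field are illustrative assumptions rather than the exact calls in `data.py`:

```python
# Sketch of the extraction loop: 8 states x 2 endpoints, 500-row cap per call.
# Endpoint paths, header name, query parameters and the days_listed field are
# assumptions; see data.py and the RentCast API docs for the exact values.
import os
import requests

STATES = ["DE", "FL", "GA", "MD", "NC", "SC", "VA", "WV"]
ENDPOINTS = {
    "sales": "https://api.rentcast.io/v1/listings/sale",                # assumed path
    "rentals": "https://api.rentcast.io/v1/listings/rental/long-term",  # assumed path
}

def extract_listings(api_key: str, max_days_listed: int = 5) -> dict:
    """Pull recent listings for every state and both endpoints as raw JSON records."""
    headers = {"X-Api-Key": api_key}
    results = {"sales": [], "rentals": []}
    for kind, url in ENDPOINTS.items():
        for state in STATES:
            resp = requests.get(
                url,
                headers=headers,
                params={"state": state, "limit": 500},  # one state per request, 500 rows max
                timeout=30,
            )
            resp.raise_for_status()
            for record in resp.json():  # assumes the body is a JSON array of listings
                # Keep only listings added within the last `max_days_listed` days.
                if record.get("days_listed", 0) <= max_days_listed:
                    results[kind].append(record)
    return results

if __name__ == "__main__":
    listings = extract_listings(os.environ["API_KEY"])
    print(len(listings["sales"]), "sale and", len(listings["rentals"]), "rental listings")
```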
## Data Transformation

- Pre-cleaning: Parse the raw JSON into sub-datasets:
  - Sales: `sales_info`, `property_history`, `agent_info`, `officer_info`
  - Rentals: `rental_info`, `property_history`
- CleaningJob (see the sketch below):
  - Load the sub-datasets into pandas DataFrames
  - Drop duplicates & unnecessary columns
  - Fill selective missing values
  - Convert types (e.g. dates) & strip whitespace/symbols
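A minimal sketch of one such cleaning pass with pandas; the column names used here (`city`, `listed_date`, `price`, `raw_json`) and the `clean_sales_info` helper are illustrative assumptions, not the exact logic in `cleaning_job.py`:

```python
# Illustrative cleaning pass over one sub-dataset; column names are assumptions.
import pandas as pd

def clean_sales_info(records: list[dict]) -> pd.DataFrame:
    """Turn raw sales_info records into a tidy, typed DataFrame."""
    df = pd.DataFrame(records)

    # Drop exact duplicates and columns that are not needed downstream.
    df = df.drop_duplicates()
    df = df.drop(columns=[c for c in ("raw_json",) if c in df.columns])

    # Fill selective missing values rather than dropping whole rows.
    if "city" in df.columns:
        df["city"] = df["city"].fillna("Unknown")

    # Convert types: listing dates to datetime, prices to numeric.
    if "listed_date" in df.columns:
        df["listed_date"] = pd.to_datetime(df["listed_date"], errors="coerce")
    if "price" in df.columns:
        df["price"] = pd.to_numeric(df["price"], errors="coerce")

    # Strip stray whitespace from free-text columns.
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.strip()
    return df
```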
## Data Schema

- Normalization: 3NF
- Sales Tables (4): `sales_info` (central), `property_history`, `agent_info`, `officer_info`
- Rentals Tables (2): `rental_info` (central), `property_history`

Foreign keys enforce 1-to-many relationships between the central tables and their history/sub-tables.
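To illustrate the 1-to-many layout, here is a minimal SQLAlchemy ORM sketch of the central sales table and its history table; the column names and the `property_id` key are assumptions, not the exact schema created by `load.py`:

```python
# Sketch of the sales side of the 3NF schema with SQLAlchemy ORM.
# Column names and the property_id key are illustrative assumptions.
from sqlalchemy import Column, Date, ForeignKey, Integer, Numeric, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class SalesInfo(Base):
    """Central sales table; one row per listed property."""
    __tablename__ = "sales_info"
    property_id = Column(String, primary_key=True)
    price = Column(Numeric)
    state = Column(String(2))
    listed_date = Column(Date)
    history = relationship("PropertyHistory", back_populates="listing")

class PropertyHistory(Base):
    """Many history events per property (1-to-many via the foreign key)."""
    __tablename__ = "property_history"
    id = Column(Integer, primary_key=True, autoincrement=True)
    property_id = Column(String, ForeignKey("sales_info.property_id"))
    event = Column(String)
    event_date = Column(Date)
    listing = relationship("SalesInfo", back_populates="history")
```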
## Data Loading

- Database: PostgreSQL
- ORM: SQLAlchemy
- Batch Logic (see the sketch below):
  - First run: `if_exists='replace'`
  - Subsequent runs (every 5 days): `if_exists='append'`
Note: Future PRs are welcome to propose upserts or merge logic to handle duplicates more gracefully.
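A minimal sketch of that batch logic with pandas `to_sql`; the `load_frame` helper and the first-run check via SQLAlchemy's `inspect()` are assumptions about how `load.py` might do it, not an excerpt:

```python
# Illustrative batch loading: 'replace' on the first run, 'append' afterwards.
# The load_frame helper and the inspect()-based first-run check are assumptions.
import pandas as pd
from sqlalchemy import inspect
from sqlalchemy.engine import Engine

def load_frame(df: pd.DataFrame, table: str, engine: Engine) -> None:
    """Write one cleaned DataFrame to PostgreSQL with the 5-day batch logic."""
    # First run: the target table does not exist yet, so create it ('replace').
    # Later batches (every 5 days): append the freshly extracted listings.
    mode = "append" if inspect(engine).has_table(table) else "replace"
    df.to_sql(table, engine, if_exists=mode, index=False)
```

The engine itself can be built from the `.env` variables shown under Setup & Run Locally.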
## Automation & Scheduling

Use a scheduler (e.g., cron, Airflow) to trigger the pipeline every 5 days:

```bash
# Example cron entry: run at midnight on every 5th day of the month
0 0 */5 * * cd /path/to/Real-cast && python load.py
```
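One caveat: `*/5` in the day-of-month field restarts on the 1st of every month, so the gap between a month's last run and the next month's first run can be shorter than 5 days. If strict 5-day spacing matters, a daily cron entry could instead call a small guard script like the hypothetical `run_if_due.py` sketched below (not part of this repo), which only launches `load.py` when 5 days have elapsed since the last successful run:

```python
# Hypothetical guard script (not part of this repo): only trigger the pipeline
# if at least 5 days have passed since the last successful run.
import subprocess
import sys
from datetime import datetime, timedelta
from pathlib import Path

STAMP = Path(".last_run")       # timestamp file written after each successful run
INTERVAL = timedelta(days=5)

def due() -> bool:
    if not STAMP.exists():
        return True
    last = datetime.fromisoformat(STAMP.read_text().strip())
    return datetime.now() - last >= INTERVAL

if __name__ == "__main__":
    if due():
        subprocess.run([sys.executable, "load.py"], check=True)
        STAMP.write_text(datetime.now().isoformat())
```

The daily cron line would then be `0 0 * * * cd /path/to/Real-cast && python run_if_due.py`.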
## Project Structure

```text
Real-cast/
├── clean_data/                 # Parquet files from the first batch run
├── schema_and_architecture/    # Architecture diagrams (Canva exports)
├── data.py                     # Handles API extraction
├── precleaning.py              # JSON parsing & sub-dataset extraction
├── cleaning_job.py             # DataFrame cleaning & transformation
├── load.py                     # ORM logic to load into PostgreSQL
├── requirements.txt            # Python dependencies
└── README.md                   # This file
```
## Setup & Run Locally

1. **Fork & clone** this repo:

   ```bash
   git clone https://github.com/Nel-zi/Real-cast-project
   cd Real-cast-project
   ```
2. **Create & activate** a virtual environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate   # macOS/Linux
   .\.venv\Scripts\activate    # Windows
   ```

3. **Install** dependencies:

   ```bash
   pip install -r requirements.txt
   ```
4. **Obtain** your RentCast API key (free tier) from https://app.rentcast.io/
5. **Create** a `.env` file in the project root:
   ```dotenv
   API_KEY="<your_api_key>"
   DB_NAME="<your_db_name>"
   DB_USER="<your_db_user>"
   DB_PASSWORD="<your_db_password>"
   DB_HOST="localhost"
   DB_PORT="5432"
   ```

6. **Create** your PostgreSQL database and update `.env` accordingly.

7. **Run** the pipeline:

   ```bash
   python load.py
   ```
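For reference, here is a minimal sketch of how these environment variables can be turned into a database connection, assuming `python-dotenv` and the `psycopg2` driver are installed; the actual connection logic in `load.py` may differ:

```python
# Minimal sketch: read the .env values and build a PostgreSQL engine.
# Assumes python-dotenv and psycopg2 are installed; load.py may differ.
import os
from dotenv import load_dotenv
from sqlalchemy import create_engine, text

load_dotenv()  # pulls API_KEY, DB_NAME, DB_USER, ... into the environment

engine = create_engine(
    f"postgresql+psycopg2://{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
    f"@{os.environ['DB_HOST']}:{os.environ['DB_PORT']}/{os.environ['DB_NAME']}"
)

with engine.connect() as conn:
    print(conn.execute(text("SELECT version()")).scalar())  # quick connectivity check
```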
## Contributing
Contributions, issues, and feature requests are welcome!
Feel free to open a Pull Request or an Issue with your suggested improvements, especially if you add visualizations or refine the loading logic.
## License & Contact
This project is open source under the MIT License.
For questions or support, reach out via my GitHub profile: [@Nel-zi](https://github.com/Nel-zi)