sehroz/heart-failure-prediction

UofT | DSI - Team Project: 🔎 Heart Failure Prediction

🔍 Business Question

Can heart failure be accurately predicted from demographic and baseline pre-stress-test data alone, without conducting an exercise stress test? The goal is to explore whether machine learning models can estimate heart disease risk using only the basic data available before stress testing.

🛠️ Why Address This Problem?

Value to Patients

  • Accessible Screening: Early detection for high-risk patients without stress testing, benefiting remote or resource-limited settings.
  • Reduced Burden: Stress tests can be physically and emotionally taxing; using baseline data streamlines diagnosis and supports proactive care.

Value to Healthcare Providers

  • Efficiency: Baseline data models allow quick preliminary screening, saving time for acute cases.
  • Data-Driven Decisions: Machine learning insights enhance diagnostic accuracy and care quality.

Value to the Healthcare System

  • Cost Savings: Reducing unnecessary stress tests can save significant resources (e.g., Ontario spends ~C$300M annually on ~500,000 non-invasive cardiac tests).

🎯 Project Overview

This project analyzes a dataset containing clinical and demographic features to predict heart disease events. The goal is to create a machine learning model capable of predicting the likelihood of heart disease using only basic patient data, improving accessibility to heart disease screening and potentially reducing mortality rates.


📊 Dataset

We use the Heart Failure Prediction Dataset, which includes:

  • 11 clinical features (2 demographic and 9 medical measurements)
  • Binary classification target (Heart Disease: Yes/No)

🏗️ Project Folder Structure

|-- data
|   |-- processed
|   |-- raw
|-- experiments
|   |-- model_development
|-- models
|   |-- logistic_regression
|   |-- neural_networks
|   |-- xgboost
|-- reports
|   |-- figures

🏁 Setup Instructions

  1. Clone the repository:
     git clone https://github.com/sehroz/heart-failure-prediction.git
     cd heart-failure-prediction
  2. Create and activate the conda environment:
     conda env create -f environment.yml
     conda activate heart-ml
  3. Run Jupyter Notebook:
     jupyter notebook

💡 Project Context

Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year, or 31% of all deaths worldwide. Four out of five CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs, and this dataset contains 11 features that can be used to predict possible heart disease.

People with cardiovascular disease or at high cardiovascular risk (due to one or more risk factors such as hypertension, diabetes, or hyperlipidemia, or to already established disease) need early detection and management, which is where a machine learning model can help.


📋 Attribute Information

  • Age: Age of the patient [years]
  • Sex: Sex of the patient [M: Male, F: Female]
  • ChestPainType: Chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
  • RestingBP: Resting blood pressure [mm Hg]
  • Cholesterol: Serum cholesterol [mg/dl]
  • FastingBS: Fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
  • RestingECG: Resting electrocardiogram results [Normal: Normal, ST: ST-T wave abnormality, LVH: Left ventricular hypertrophy]
  • MaxHR: Maximum heart rate achieved [Numeric]
  • ExerciseAngina: Exercise-induced angina [Y: Yes, N: No]
  • Oldpeak: ST depression induced by exercise [Numeric]
  • ST_Slope: The slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
  • HeartDisease: Target class [1: heart disease, 0: Normal]
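
The business question depends on which of these columns are available before a stress test. A minimal Python sketch of that split follows. The grouping of MaxHR, ExerciseAngina, Oldpeak, and ST_Slope as exercise-derived is inferred from the attribute descriptions above, and the constant names, function name, and example row are hypothetical rather than taken from the project code.

```python
# Columns from the attribute list. Grouping them into baseline vs.
# stress-test features is an assumption based on the descriptions.
BASELINE_FEATURES = ["Age", "Sex", "ChestPainType", "RestingBP",
                     "Cholesterol", "FastingBS", "RestingECG"]
STRESS_TEST_FEATURES = ["MaxHR", "ExerciseAngina", "Oldpeak", "ST_Slope"]
TARGET = "HeartDisease"


def split_record(record):
    """Return (baseline_features, label) for one patient record (a dict)."""
    baseline = {name: record[name] for name in BASELINE_FEATURES}
    return baseline, int(record[TARGET])


# Hypothetical example row, for illustration only (not real patient data).
example = {
    "Age": "54", "Sex": "M", "ChestPainType": "ASY", "RestingBP": "130",
    "Cholesterol": "240", "FastingBS": "0", "RestingECG": "Normal",
    "MaxHR": "140", "ExerciseAngina": "Y", "Oldpeak": "1.5",
    "ST_Slope": "Flat", "HeartDisease": "1",
}
features, label = split_record(example)
print(sorted(features))
print(label)
```

Keeping the two feature groups in separate lists makes it straightforward to train one model on the baseline columns only and another on all columns, then compare them.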

📋 Main Findings

Insights

  1. Stress Testing: Baseline data offers reasonable predictions, but incorporating stress tests enhances diagnostic accuracy.
  2. Model Performance: XGBoost outperformed other models, showing robust predictive ability. Neural Networks required more tuning but captured complex relationships.
  3. Minimizing False Negatives: Prioritized recall to reduce the risk of undiagnosed heart disease.
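
Insight 3 can be made concrete with a small Python sketch. One common way to favor recall is to lower the probability threshold at which a patient is flagged. Whether the project did this or prioritized recall some other way is not stated, so the thresholds, scores, and function names below are illustrative assumptions.

```python
def recall(y_true, y_pred):
    """Fraction of actual positives that were predicted positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    positives = true_pos + false_neg
    return true_pos / positives if positives else 0.0


def classify(probabilities, threshold):
    """Flag a patient as positive when the predicted risk meets the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Made-up labels and model scores, for illustration only.
y_true = [1, 1, 1, 0, 0, 1]
scores = [0.9, 0.45, 0.35, 0.4, 0.1, 0.7]

for threshold in (0.5, 0.3):
    y_pred = classify(scores, threshold)
    print(threshold, recall(y_true, y_pred))
```

On this toy data, lowering the threshold from 0.5 to 0.3 catches every positive case but also flags one healthy patient, which is the usual trade-off when false negatives are treated as the costlier error.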

Challenges & Solutions

  • Feature Encoding: Used one-hot encoding for categorical variables like ChestPainType.
  • New Model Adoption: Successfully implemented XGBoost using documentation and GridSearch, despite limited prior experience.
  • Explainability: Enhanced interpretability of Neural Networks using SHAP values.
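
The one-hot encoding mentioned above can be sketched in plain Python. This is a dependency-free illustration, not the project's code, which may well use a library encoder instead. The function name `one_hot` and the output column naming are assumptions.

```python
# Category codes for ChestPainType, as listed under Attribute Information.
CHEST_PAIN_TYPES = ["TA", "ATA", "NAP", "ASY"]


def one_hot(value, categories, prefix):
    """Map one categorical value to a dict of 0/1 indicator columns."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {f"{prefix}_{c}": int(c == value) for c in categories}


encoded = one_hot("NAP", CHEST_PAIN_TYPES, "ChestPainType")
print(encoded)
```

Each category becomes its own indicator column, so the model does not read an artificial ordering into codes such as TA, ATA, NAP, and ASY.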

🧪 Results and Insights

Model Performance:

The XGBoost model demonstrated the highest accuracy (92%) and F1-score, making it the best-performing algorithm among the models tested (Logistic Regression, Neural Networks, and XGBoost).
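
For reference, accuracy and F1-score can be computed from true and predicted labels as in the sketch below. The labels are made up for illustration and do not reproduce the 92% figure, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Share of all predictions that match the true label."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))
print(f1_score(y_true, y_pred))
```

Accuracy counts every correct prediction equally, while F1 ignores true negatives and so reflects how well the positive (heart disease) class is handled.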

Feature Importance:

The top predictors of heart disease include:

  1. Age: Higher age groups show increased risk.
  2. ST_Slope: A flat or down-sloping ST segment strongly correlates with heart disease.
  3. ChestPainType: Asymptomatic cases show the highest risk.
  4. ExerciseAngina: Strongly linked to higher disease likelihood.
  5. Oldpeak: ST depression during exercise was a significant indicator.

📊 Data Analysis & Results

Figures: Age Distribution, Gender Distribution, Correlation Heatmap, Baseline Feature Importance, All Features Importance, ROC Curve (Baseline), ROC Curve (All Features), SHAP Values Analysis

🎧 Project Overview Audio

Listen to the project report:

Heart.Failure.Project.Report.1.webm

To learn more about our methodology, results, and insights, view the complete project report.

