Can heart failure be accurately predicted using only demographic and baseline pre-stress test data, without conducting exercise stress tests? The goal is to explore whether machine learning models can predict heart disease risk using only the basic data available before stress testing.
- Accessible Screening: Early detection for high-risk patients without stress testing, benefiting remote or resource-limited settings.
- Reduced Burden: Stress tests can be physically and emotionally taxing; using baseline data streamlines diagnosis and supports proactive care.
- Efficiency: Baseline data models allow quick preliminary screening, saving time for acute cases.
- Data-Driven Decisions: Machine learning insights enhance diagnostic accuracy and care quality.
- Cost Savings: Reducing unnecessary stress tests can save significant resources (e.g., Ontario spends ~C$300M annually on ~500,000 non-invasive cardiac tests).
This project analyzes a dataset containing clinical and demographic features to predict heart disease events. The goal is to create a machine learning model capable of predicting the likelihood of heart disease using only basic patient data, improving accessibility to heart disease screening and potentially reducing mortality rates.
We use the Heart Failure Prediction Dataset, which includes:
- 11 clinical features (2 demographic and 9 medical measurements)
- Binary classification target (Heart Disease: Yes/No)
|-- data
| |-- processed
| |-- raw
|-- experiments
| |-- model_development
|-- models
| |-- logistic_regression
| |-- neural_networks
| |-- xgboost
|-- reports
| |-- figures
- Clone the repository:
git clone https://github.com/sehroz/heart-failure-prediction.git
cd heart-failure-prediction
- Create and activate conda environment:
conda env create -f environment.yml
conda activate heart-ml
- Run Jupyter Notebook:
jupyter notebook
Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Four out of five CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs, and this dataset contains 11 features that can be used to predict possible heart disease.
People with cardiovascular disease or at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidemia, or already established disease) need early detection and management, where a machine learning model can be of great help.
- Age: Age of the patient [years]
- Sex: Sex of the patient [M: Male, F: Female]
- ChestPainType: Chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
- RestingBP: Resting blood pressure [mm Hg]
- Cholesterol: Serum cholesterol [mg/dl]
- FastingBS: Fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
- RestingECG: Resting electrocardiogram results [Normal: Normal, ST: ST-T wave abnormality, LVH: Left ventricular hypertrophy]
- MaxHR: Maximum heart rate achieved [Numeric]
- ExerciseAngina: Exercise-induced angina [Y: Yes, N: No]
- Oldpeak: ST depression induced by exercise [Numeric]
- ST_Slope: The slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
- HeartDisease: Target class [1: heart disease, 0: Normal]
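Because the project question is whether pre-stress-test data alone is enough, it can help to make the split between baseline and exercise-derived columns explicit. The sketch below shows one possible grouping, not necessarily the one used in the notebooks. It assumes that MaxHR, ExerciseAngina, Oldpeak, and ST_Slope are only available after a stress test; the names BASELINE_FEATURES, STRESS_TEST_FEATURES, and select_baseline are hypothetical.

```python
# Assumed grouping of the dataset columns listed above; names are illustrative.
BASELINE_FEATURES = [
    "Age", "Sex", "ChestPainType", "RestingBP",
    "Cholesterol", "FastingBS", "RestingECG",
]
# Assumption: these four are measured during or after an exercise stress test.
STRESS_TEST_FEATURES = ["MaxHR", "ExerciseAngina", "Oldpeak", "ST_Slope"]
TARGET = "HeartDisease"


def select_baseline(record):
    """Return only the pre-stress-test fields of one patient record (a dict)."""
    return {name: record[name] for name in BASELINE_FEATURES}


# Made-up record for illustration only, not a row from the dataset.
example = {
    "Age": 54, "Sex": "M", "ChestPainType": "ASY", "RestingBP": 130,
    "Cholesterol": 240, "FastingBS": 0, "RestingECG": "Normal",
    "MaxHR": 150, "ExerciseAngina": "N", "Oldpeak": 1.0,
    "ST_Slope": "Flat", "HeartDisease": 1,
}
baseline_only = select_baseline(example)
```

Training one model on the baseline columns and another on all 11 features would then give a direct comparison of how much the stress-test measurements add.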
- Stress Testing: Baseline data offers reasonable predictions, but incorporating stress tests enhances diagnostic accuracy.
- Model Performance: XGBoost outperformed other models, showing robust predictive ability. Neural Networks required more tuning but captured complex relationships.
- Minimizing False Negatives: Prioritized recall to reduce the risk of undiagnosed heart disease.
- Feature Encoding: Used one-hot encoding for categorical variables like ChestPainType.
- New Model Adoption: Successfully implemented XGBoost using documentation and GridSearch, despite limited prior experience.
- Explainability: Enhanced interpretability of Neural Networks using SHAP values.
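As a rough illustration of the one-hot encoding mentioned in the list above, the dependency-free sketch below turns a single ChestPainType value into indicator columns. The helper one_hot and the column naming scheme are assumptions made for this example; the project code may name its columns differently.

```python
# Categories of ChestPainType as listed in the dataset description.
CHEST_PAIN_TYPES = ["TA", "ATA", "NAP", "ASY"]


def one_hot(value, categories, prefix):
    """Map one categorical value to a dict of 0/1 indicator columns."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {f"{prefix}_{c}": int(value == c) for c in categories}


# Exactly one indicator is set to 1; the rest are 0.
encoded = one_hot("NAP", CHEST_PAIN_TYPES, "ChestPainType")
```

In practice, pandas.get_dummies or scikit-learn's OneHotEncoder would apply the same idea to whole columns at once.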
The XGBoost model demonstrated the highest accuracy (92%) and F1-score, making it the best-performing algorithm among the models tested (Logistic Regression, Neural Networks, and XGBoost).
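For readers less familiar with the metrics reported here, the sketch below shows how accuracy, precision, recall, and F1 relate to the confusion-matrix counts. Recall is the share of true heart disease cases the model catches, so it is the metric that falls when false negatives rise. The function classification_metrics and the toy labels are hypothetical and do not reproduce the project's results.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = heart disease)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # disease correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # disease missed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only; not results from this project.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

scikit-learn provides the same quantities through accuracy_score, precision_score, recall_score, and f1_score in sklearn.metrics.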
The top predictors of heart disease include:
- Age: Higher age groups show increased risk.
- ST_Slope: A flat or down-sloping ST segment strongly correlates with heart disease.
- ChestPainType: Asymptomatic cases are associated with the highest risk.
- ExerciseAngina: Strongly linked to higher disease likelihood.
- Oldpeak: ST depression during exercise was a significant indicator.
Listen to the project report:
Heart.Failure.Project.Report.1.webm
To learn more about our methodology, results, and insights, view the complete project report.