
[ML]CVD_Predict

Streamlit Python Spyder NumPy Pandas scikit-learn love

Is your heart at RISK?

Whether you live in the peaceful countryside or amid the hustle and bustle of a big city, you are subjected to stress at some level. Keeping your body in good shape isn't enough to cope with daily tasks; your heart needs to LIVE through the day too! As part of my assessment I have built a Streamlit app that gives a binary output (Positive/Negative), aimed at predicting the risk of cardiovascular disease (CVD) based on several features:

  • 'thalachh' (Maximum heart rate achieved),
  • 'oldpeak' (ST depression induced by exercise relative to rest),
  • 'caa' (Number of major vessels),
  • 'cp' (Type of chest pain), and
  • 'thall' (Thalium Stress Test result)
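
A minimal sketch of how the app's input-and-predict flow might look (the pickle file name `best_pipeline.pkl`, the input ranges, and the defaults are assumptions, not taken from this repo):

```python
import pickle

import numpy as np
import streamlit as st

# Load the trained pipeline (the file name is an assumption).
with open('best_pipeline.pkl', 'rb') as f:
    model = pickle.load(f)

st.title('Is your heart at RISK?')

# Collect the five selected features from the user.
thalachh = st.number_input('Maximum heart rate achieved', 60, 220, 150)
oldpeak = st.number_input('ST depression (oldpeak)', 0.0, 7.0, 1.0)
caa = st.selectbox('Number of major vessels (caa)', [0, 1, 2, 3, 4])
cp = st.selectbox('Chest pain type (cp)', [0, 1, 2, 3])
thall = st.selectbox('Thalium Stress Test result (thall)', [1, 2, 3])

if st.button('Predict'):
    X_new = np.array([[thalachh, oldpeak, caa, cp, thall]])
    result = 'Positive' if model.predict(X_new)[0] == 1 else 'Negative'
    st.write(f'CVD risk: {result}')
```

Such a script is launched with `streamlit run app.py`.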

I would also like to thank Rashik Rahman for providing the dataset to work on; a description of the data can be obtained here.

WebApp Interface

With the help of Streamlit I have managed to create my first WebApp in my favourite clean n' crisp setting, shown here: st_interface

Model Accuracy

This app is built on an ML model with 87% accuracy; the model's performance is summarised below:

eval_acc

Classification Report Confusion Matrix
eval_CR eval_cm
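
The summary above can be reproduced with scikit-learn's metric helpers; a minimal sketch, assuming the fitted `best_pipeline` and the held-out split `X_test`/`y_test` built in the later sections:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Evaluate the fitted pipeline on the held-out test set.
y_pred = best_pipeline.predict(X_test)

print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')  # ~0.87
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```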

Testing model with new observations

Prior to deploying the model I also tested it on several new observations, as shown below: test_data

The model achieves 80% accuracy on these new observations.

Classification Report Confusion Matrix
test_CR test_cm
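
A sketch of how such a test could be scored, assuming the new observations live in a hypothetical `new_observations.csv` with the same five feature columns and an 'output' label:

```python
import pandas as pd

# Hypothetical CSV of new observations, same schema as the training data.
new_df = pd.read_csv('new_observations.csv')
X_new = new_df[['thalachh', 'oldpeak', 'caa', 'cp', 'thall']]
y_new = new_df['output']

print(f'Accuracy on new data: {best_pipeline.score(X_new, y_new):.2f}')  # ~0.80
```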

Dataset Overview

The dataset consists of 303 observations and 14 columns (303, 14), all of which are numeric: df_desc
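
A sketch of the initial load and overview (the `heart.csv` file name is an assumption):

```python
import pandas as pd

# File name is an assumption; the dataset is the 303-row heart disease dataset.
df = pd.read_csv('heart.csv')

print(df.shape)       # (303, 14)
print(df.describe())  # all columns are numeric
```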

Data Inspection/ Cleaning

The dataset is fairly balanced:

output

Upon inspection, 1 duplicate observation and 2 null values in 'thall' are observed: df_dup df_null

Null values are represented by 0 and are hence filled using median imputation.
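
A sketch of these inspection and cleaning steps, assuming the DataFrame `df` from the overview above:

```python
import numpy as np

# Class balance, duplicates, and hidden nulls ('thall' == 0).
print(df['output'].value_counts())
print('Duplicates:', df.duplicated().sum())

df = df.drop_duplicates()

# 'thall' encodes nulls as 0: convert to NaN, then impute with the median.
df['thall'] = df['thall'].replace(0, np.nan)
df['thall'] = df['thall'].fillna(df['thall'].median())
```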

Feature selection

Numeric data

Logistic Regression is used to infer the correlation of each selected numeric column with the target feature 'output': feature_num

Numeric features with an R-squared value less than 0.65 are filtered out.
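
One way this per-feature screening might be implemented, fitting a single-feature Logistic Regression per numeric column (the column list is an assumption; note that `score()` on a classifier returns mean accuracy):

```python
from sklearn.linear_model import LogisticRegression

# Fit a one-feature model per numeric column and print its score.
for col in ['age', 'trtbps', 'chol', 'thalachh', 'oldpeak']:
    lr = LogisticRegression()
    lr.fit(df[[col]], df['output'])
    print(f"{col}: {lr.score(df[[col]], df['output']):.3f}")
```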

Categorical data

Cramér's V is used to study the correlation of categorical features with the target feature 'output': feature_cat

Since most categorical features show low correlation with the target column, only 'cp' and 'thall' are selected; see the sketch after the summary below.

feature
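
A sketch of a bias-corrected Cramér's V computation over the categorical columns (the column list is an assumption):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Bias-corrected Cramér's V between two categorical series."""
    confusion = pd.crosstab(x, y).to_numpy()
    chi2 = chi2_contingency(confusion)[0]
    n = confusion.sum()
    phi2 = chi2 / n
    r, k = confusion.shape
    phi2corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    rcorr = r - (r - 1) ** 2 / (n - 1)
    kcorr = k - (k - 1) ** 2 / (n - 1)
    return np.sqrt(phi2corr / min(kcorr - 1, rcorr - 1))

for col in ['sex', 'cp', 'fbs', 'restecg', 'exng', 'slp', 'caa', 'thall']:
    print(f"{col}: {cramers_v(df[col], df['output']):.3f}")
```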

Train and test data

The dataset is split into train and test sets at a 7:3 ratio.
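
A sketch of the split, keeping the five selected features (`random_state=42` is an arbitrary choice for reproducibility):

```python
from sklearn.model_selection import train_test_split

X = df[['thalachh', 'oldpeak', 'caa', 'cp', 'thall']]
y = df['output']

# 7:3 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```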

Pipeline Building

Pipelines with different combinations of scalers and classification models are built and tested, as summarised below: pipeline
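
A sketch of how such a scaler/classifier grid might be assembled; the exact set of scalers and models tried in the repo is not listed here, so the combinations below are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

scalers = {'mms': MinMaxScaler(), 'ss': StandardScaler()}
models = {'lr': LogisticRegression(), 'dt': DecisionTreeClassifier(),
          'rf': RandomForestClassifier()}

# Fit every scaler/classifier combination and keep the best test score.
best_score, best_pipeline = 0.0, None
for s_name, scaler in scalers.items():
    for m_name, model in models.items():
        pipe = Pipeline([(s_name, scaler), (m_name, model)])
        pipe.fit(X_train, y_train)
        score = pipe.score(X_test, y_test)
        print(f'{s_name} + {m_name}: {score:.3f}')
        if score > best_score:
            best_score, best_pipeline = score, pipe
```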

Best pipeline

Upon testing, the optimum pipeline is found to be StandardScaler() with LogisticRegression():

best_pl

Finetune pipeline

GridSearchCV with cross-validation (cv=5) is applied to finetune the optimum pipeline. The hyperparameters C and penalty are tuned: pl_finetune
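
A sketch of the finetuning step; `liblinear` is used as the solver since it supports both 'l1' and 'l2' penalties (the solver choice and the grid values are assumptions):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Rebuild the best pipeline with a solver that accepts both penalties.
pipe = Pipeline([('ss', StandardScaler()),
                 ('lr', LogisticRegression(solver='liblinear'))])

param_grid = {'lr__C': [0.01, 0.1, 1, 10, 100],  # grid values are assumptions
              'lr__penalty': ['l1', 'l2']}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_)
```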

Optimum hyperparameters:

best_param

Discussion

The model achieves 87% accuracy during best-pipeline evaluation and 80% when tested on new observations. Suggestions to improve the model:

  • Train the model on a larger dataset.
  • Apply ensembling methods.