Stroke prediction

This repository tackles an imbalanced, supervised binary classification task: predicting which patients will have a stroke from sociological and biological factors, using the Kaggle stroke prediction dataset at https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset.

Tools used

Python, NumPy, pandas, scikit-learn, seaborn, imblearn, Optuna

The main code is split across three notebooks:

1 - Exploratory data analysis

In the notebook 1-eda.ipynb, I perform exploratory data analysis on the dataset:

  • Check for missing values and data types
  • Draw the blueprint for the data pipeline
  • Perform univariate and bivariate data analysis
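
As a minimal sketch of the first bullet, the snippet below summarizes data types, missing values and class balance with pandas. The small DataFrame is made-up data for illustration, the column names (`age`, `bmi`, `smoking_status`, `stroke`) are assumed to mirror the Kaggle dataset, and `summarize` is a hypothetical helper rather than code taken from the notebook.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and missing-value counts."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "pct_missing": (df.isna().mean() * 100).round(1),
        }
    )


# Made-up rows for illustration only; not taken from the real dataset.
df = pd.DataFrame(
    {
        "age": [67.0, 45.0, 80.0, 33.0, 59.0],
        "bmi": [36.6, np.nan, 32.5, 24.1, np.nan],
        "smoking_status": ["never smoked", "smokes", "Unknown", "smokes", "never smoked"],
        "stroke": [1, 0, 0, 0, 0],
    }
)

summary = summarize(df)
print(summary)

# Share of each class in the target, useful for spotting imbalance.
class_balance = df["stroke"].value_counts(normalize=True)
print(class_balance)
```

With the real data, `df` would presumably come from `pd.read_csv` on the downloaded Kaggle file; the file path is left out here because it depends on the local setup.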

2 - Model selection and tuning

This notebook, 2-model.ipynb, compares many different classifiers and tunes the best-performing ones using cross-validation.

The following approach is used:

  • Creating a data pipeline
  • Selecting the best models using cross-validation
  • Performing cross-validated hyperparameter tuning on the best models using the optuna package
  • Saving the best model pipelines for later evaluation
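
The first two bullets can be sketched roughly as follows with scikit-learn. This is an illustration under assumptions, not the notebook's code: the data is synthetic (`make_classification` with about 5% positives), the two candidate classifiers are hypothetical, and ROC AUC is assumed as the selection metric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (about 5% positives) standing in for the real dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Hypothetical candidates; the notebook's actual model list may differ.
candidates = {
    "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
}

# Stratified folds keep the class ratio roughly constant in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

mean_scores = {}
for name, clf in candidates.items():
    # Scaling lives inside the pipeline so it is refit on each training fold.
    pipe = Pipeline([("scale", StandardScaler()), ("clf", clf)])
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
    mean_scores[name] = scores.mean()

# Keep the candidate with the highest mean cross-validated score.
best_name = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_name)
```

Putting the preprocessing inside the `Pipeline` means each fold is scaled using only its own training split, which avoids leaking validation data into the fit.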

3 - Model evaluation

Notebook 3-eval.ipynb evaluates the tuned models from the previous notebook and benchmarks them on the test set across several metrics.

The evaluation covers the following:

  • Accuracy, ROC AUC and F1 score
  • Confusion matrix
  • ROC curve
  • Precision-recall curve
  • True vs predicted distributions
  • Threshold tuning using F1 score
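
To illustrate the last bullet, here is a minimal sketch of threshold tuning by F1 score using NumPy only. The labels and probabilities are toy values invented for the example, and `f1_at_threshold` and `best_f1_threshold` are hypothetical helper names; the notebook may implement this step differently.

```python
import numpy as np


def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when predicting positive for probabilities >= threshold."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_prob) >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there is nothing to count.
    return float(2 * tp / denom) if denom > 0 else 0.0


def best_f1_threshold(y_true, y_prob, thresholds):
    """Return the (threshold, F1) pair with the highest F1 on the given grid."""
    scores = [f1_at_threshold(y_true, y_prob, t) for t in thresholds]
    best = int(np.argmax(scores))
    return float(thresholds[best]), scores[best]


# Toy labels and predicted probabilities, invented for illustration.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.03, 0.08, 0.12, 0.22, 0.31, 0.37, 0.42, 0.47, 0.63, 0.81])

thresholds = np.linspace(0.05, 0.95, 19)  # grid with a step of 0.05
threshold, score = best_f1_threshold(y_true, y_prob, thresholds)
print(f"best threshold={threshold:.2f}, F1={score:.3f}")
```

In this toy case, lowering the threshold from the default 0.5 to 0.40 recovers the third positive at the cost of one false positive, which raises F1. With a fitted scikit-learn pipeline, `y_prob` would typically be `predict_proba(X)[:, 1]`.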