
Welcome to my data cleaning and storytelling project. The dataset represents clinical care at 130 US hospitals (1999-2008), where each row is a hospital record of a patient diagnosed with diabetes who underwent laboratory tests and received medications. The goal is to predict whether a patient is readmitted early, that is, within 30 days of discharge.


duynlq/storytelling-diabetes


Problem Statement

  • Diabetes is a serious health problem that affects many Americans. In clinical care, many risk factors directly affect the likelihood that a patient with diabetes is readmitted within 30 days of discharge.
  • A logistic regression model with SMOTE was fit to classify these patients and reached an accuracy of only 62% (formal report). Even so, it was a good starting point for pinpointing the 10 most important features that push a patient toward being classified as readmitted.

Key Findings: The 10 most important features are listed in the table below, along with their logistic regression weights (coefficients).

| Num | Feature | Weight | Note |
|---|---|---|---|
| 1 | discharge_disposition_id_8 | 0.612653 | Transferred to home under care of a Home IV provider |
| 2 | admission_source_id_7 | 0.269167 | Admitted through the Emergency Room |
| 3 | gender_Female | 0.268416 | The proportion of females is slightly lower than that of males |
| 4 | metformin-rosiglitazone_No | 0.175598 | Medicine combination used to treat type 2 diabetes |
| 5 | admission_source_id_8 | 0.169067 | Admitted through Court/Law Enforcement |
| 6 | num_procedures | 0.113014 | Number of procedures performed |
| 7 | max_glu_serum_>300 | 0.070908 | Simple and direct single test for diabetes |
| 8 | diag_3_Diabetes | 0.060908 | Diabetes as one of the patient's diagnoses |
| 9 | miglitol_No | 0.058262 | Oral anti-diabetic drug that slows the breakdown of complex carbohydrates into glucose |
| 10 | glipizide-metformin_Steady | 0.043848 | Medicine combination used to treat high blood sugar levels caused by type 2 diabetes |

A Tableau dashboard was created to visualize these 10 important features.

Data Source

Data Cleaning Summary

  • Replace '?' values with NumPy NaN
  • Remove 'encounter_id' and 'patient_nbr' since they are unique IDs
  • Remove 'weight' since it is 97% missing
  • Remove 'payer_code' and 'medical_specialty' since they are 40% and 49% missing, respectively
  • Remove 'examide' and 'citoglipton' since each has only 1 value (zero variance)
  • Impute missing 'race' values randomly, in proportion to the observed category frequencies
  • Group the code ranges in 'diag_1', 'diag_2', and 'diag_3' into diagnosis categories using the 2nd link above
  • Map 'admission_type_id', 'admission_source_id', and 'discharge_disposition_id' to their descriptions in the provided IDs_mapping.csv, merging some categories to prepare for visualization
  • Convert the response variable 'readmitted' to binary ('NO': 0, '>30': 0, '<30': 1)
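Several of the steps above can be sketched in pandas as follows. This is an illustration under stated assumptions rather than the project's actual notebook: the tiny `raw` frame is made up, `clean` and `categorize_diag` are hypothetical helper names, and the diagnosis grouping covers only three ICD-9 ranges (diabetes 250.xx, circulatory 390-459, respiratory 460-519) instead of the full mapping from the linked source.

```python
import numpy as np
import pandas as pd

# Made-up sample rows that reuse the dataset's column names.
raw = pd.DataFrame({
    "encounter_id": [1, 2, 3, 4],
    "patient_nbr": [10, 11, 12, 13],
    "race": ["Caucasian", "?", "AfricanAmerican", "Caucasian"],
    "weight": ["?", "?", "?", "[75-100)"],
    "payer_code": ["?", "MC", "?", "HM"],
    "medical_specialty": ["?", "Cardiology", "?", "?"],
    "diag_1": ["250.83", "428", "V45", "486"],
    "examide": ["No"] * 4,
    "citoglipton": ["No"] * 4,
    "readmitted": ["NO", ">30", "<30", "NO"],
})

def categorize_diag(code):
    """Map an ICD-9 code to a coarse category (partial, illustrative mapping)."""
    if pd.isna(code):
        return code  # leave missing values untouched
    if code[0] in ("V", "E"):
        return "Other"  # supplementary V/E codes
    number = int(float(code))
    if number == 250:
        return "Diabetes"
    if 390 <= number <= 459:
        return "Circulatory"
    if 460 <= number <= 519:
        return "Respiratory"
    return "Other"

def clean(df, seed=0):
    df = df.replace("?", np.nan)
    # Drop identifiers, mostly-missing columns, and zero-variance columns.
    df = df.drop(columns=["encounter_id", "patient_nbr", "weight",
                          "payer_code", "medical_specialty",
                          "examide", "citoglipton"])
    # Impute missing race by sampling from the observed proportions.
    rng = np.random.default_rng(seed)
    missing = df["race"].isna()
    props = df["race"].value_counts(normalize=True)
    df.loc[missing, "race"] = rng.choice(
        props.index.to_numpy(), size=int(missing.sum()), p=props.to_numpy()
    )
    df["diag_1"] = df["diag_1"].map(categorize_diag)
    # Only readmission within 30 days counts as the positive class.
    df["readmitted"] = df["readmitted"].map({"NO": 0, ">30": 0, "<30": 1})
    return df

cleaned = clean(raw)
print(cleaned)
```

Sampling from the observed proportions keeps the overall race distribution roughly unchanged, which a single-value fill such as the mode would not.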
