duynlq/storytelling-diabetes: Welcome to my data cleaning and storytelling project. The dataset covers clinical care at 130 US hospitals (1999-2008). Each row is a hospital record of a patient diagnosed with diabetes who underwent laboratory tests and received medications. The goal is to predict early readmission, meaning readmission within 30 days of discharge.

Problem Statement

Diabetes is a serious health problem that affects many Americans. In clinical care, many risk factors directly affect the likelihood that a patient with diabetes is readmitted within 30 days of discharge.
A Logistic Regression model with SMOTE was trained to classify patients in this category. It reached an accuracy of only 62% (see the formal report), but it was a good starting point for pinpointing the top 10 features whose weights most increase the predicted likelihood of hospital readmission.

Num	Feature	Weight	Note
1	discharge_disposition_id_8	0.612653	Transferred to home under care of Home IV provider
2	admission_source_id_7	0.269167	Admission by Emergency Room
3	gender_Female	0.268416	Females are slightly less represented than males
4	metformin-rosiglitazone_No	0.175598	Medicine combination used to treat type 2 diabetes
5	admission_source_id_8	0.169067	Admission by Court/Law Enforcement
6	num_procedures	0.113014	Number of procedures done
7	max_glu_serum_>300	0.070908	Simple and direct single test for diabetes
8	diag_3_Diabetes	0.060908	Diabetes as one of patient’s diagnoses
9	miglitol_No	0.058262	Oral anti-diabetic drug that slows the breakdown of complex carbohydrates into glucose
10	glipizide-metformin_Steady	0.043848	Medicine combination used to treat high blood sugar levels caused by type 2 diabetes
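
As a rough illustration of how such a ranking can be produced, the sketch below oversamples the minority class, fits a logistic regression, and sorts features by coefficient. It is not the code from ANALYSIS_SCRIPT.py: the data is synthetic, the feature names (`feat_a` to `feat_d`) are hypothetical, and `smote_like` is a simplified NumPy stand-in for SMOTE written for this example (a real project would more likely use the `SMOTE` class from imbalanced-learn, applied to the training split only).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic, imbalanced stand-in for the encoded feature matrix (names are hypothetical).
feature_names = ["feat_a", "feat_b", "feat_c", "feat_d"]
n_major, n_minor = 300, 60
X_major = rng.normal(0.0, 1.0, size=(n_major, 4))
X_minor = rng.normal(0.0, 1.0, size=(n_minor, 4))
X_minor[:, 0] += 2.0  # only feat_a separates the two classes
X = np.vstack([X_major, X_minor])
y = np.array([0] * n_major + [1] * n_minor)


def smote_like(X_min, n_new, k=5, rng=rng):
    """Simplified SMOTE: interpolate each sampled minority row toward
    one of its k nearest minority neighbours."""
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


# Oversample the minority class until both classes have the same size.
X_syn = smote_like(X[y == 1], n_major - n_minor)
X_bal = np.vstack([X, X_syn])
y_bal = np.concatenate([y, np.ones(len(X_syn), dtype=int)])

model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

# Rank features by signed coefficient, largest positive weight first.
weights = model.coef_[0]
order = np.argsort(weights)[::-1]
ranked = [(feature_names[i], float(weights[i])) for i in order]
for name, weight in ranked:
    print(f"{name}\t{weight:.6f}")
```

Sorting by the signed coefficient, rather than its absolute value, matches the table above, which lists only features that push the prediction toward readmission.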

A Tableau Dashboard was created to visualize these 10 important features.

The dataset, obtained from the Center for Machine Learning and Intelligent Systems at the University of California, Irvine, represents 10 years (1999-2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes 100,000+ entries and 50 features describing patient and hospital attributes: patient identification, diagnosis codes, admission type and source, discharge disposition, risk-related medications and test results, and numerous other hospitalization indicators.
UCI Diabetes 130-US hospitals for years 1999-2008
Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records

Replace '?' values with numpy nan
Remove 'encounter_id' and 'patient_nbr' since they are unique IDs
Remove 'weight' since it is 97% missing
Remove 'payer_code' and 'medical_specialty' since they are 40% and 49% missing, respectively
Remove 'examide' and 'citoglipton' since both have only 1 value (zero variance)
Impute 'race' randomly based on its categorical proportions
Classify code ranges in 'diag_1', 'diag_2', and 'diag_3' into diagnosis categories, following the second link above
Classify 'admission_type_id', 'admission_source_id', and 'discharge_disposition_id' according to the provided IDs_mapping.csv, merging various categories to prepare for visualization
Convert response variable 'readmitted' into binary ('NO': 0, '>30': 0, '<30': 1)
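
The steps above can be sketched in pandas as follows. This is a minimal illustration on a made-up six-row frame that reuses the original column names, not the code from ANALYSIS_SCRIPT.py. The ICD-9 ranges in `categorize_diag` are an assumption based on the groupings in the HbA1c paper and should be checked against it, and the ID mapping step is omitted because it depends on the contents of IDs_mapping.csv.

```python
import numpy as np
import pandas as pd

# Toy frame reusing the original column names; all values are made up.
df = pd.DataFrame({
    "encounter_id": [1, 2, 3, 4, 5, 6],
    "patient_nbr": [10, 11, 12, 13, 14, 15],
    "race": ["Caucasian", "?", "AfricanAmerican", "Caucasian", "?", "Caucasian"],
    "weight": ["?", "?", "?", "?", "?", "[75-100)"],
    "payer_code": ["?", "MC", "?", "HM", "?", "MC"],
    "medical_specialty": ["?", "Cardiology", "?", "?", "Surgery", "?"],
    "examide": ["No"] * 6,
    "citoglipton": ["No"] * 6,
    "diag_1": ["250.83", "428", "V57", "486", "820", "715"],
    "readmitted": ["NO", ">30", "<30", "NO", "<30", ">30"],
})

# Replace '?' placeholders with NaN so pandas treats them as missing.
df = df.replace("?", np.nan)

# Drop ID columns, mostly-missing columns, and zero-variance columns.
df = df.drop(columns=[
    "encounter_id", "patient_nbr", "weight",
    "payer_code", "medical_specialty", "examide", "citoglipton",
])

# Impute 'race' by sampling from its observed category proportions.
rng = np.random.default_rng(0)
proportions = df["race"].value_counts(normalize=True)
missing = df["race"].isna()
df.loc[missing, "race"] = rng.choice(
    proportions.index.to_numpy(),
    size=int(missing.sum()),
    p=proportions.to_numpy(),
)


def categorize_diag(code):
    """Map an ICD-9 code string to a coarse group (ranges are an assumption)."""
    if pd.isna(code):
        return code  # leave missing diagnoses untouched
    if code.startswith(("V", "E")):
        return "Other"  # supplementary codes have no numeric range
    value = float(code)
    if 250 <= value < 251:
        return "Diabetes"
    if 390 <= value <= 459 or value == 785:
        return "Circulatory"
    if 460 <= value <= 519 or value == 786:
        return "Respiratory"
    if 520 <= value <= 579 or value == 787:
        return "Digestive"
    if 800 <= value <= 999:
        return "Injury"
    if 710 <= value <= 739:
        return "Musculoskeletal"
    if 580 <= value <= 629 or value == 788:
        return "Genitourinary"
    if 140 <= value <= 239:
        return "Neoplasms"
    return "Other"


df["diag_1"] = df["diag_1"].map(categorize_diag)

# Binary response: only readmission within 30 days counts as positive.
df["readmitted"] = df["readmitted"].map({"NO": 0, ">30": 0, "<30": 1})
print(df)
```

Sampling 'race' from the observed proportions keeps the category distribution roughly unchanged, whereas filling every gap with the most frequent value would inflate that one category.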
