Skip to content

sb01022006/Heart-Disease-Prediction-Using-Logistic-Regression

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 

Repository files navigation

Heart-Disease-Prediction-Using-Logistic-Regression

Introduction

World Health Organization has estimated 12 million deaths occur worldwide, every year due to Heart diseases. Half the deaths in the United States and other developed countries are due to cardio vascular diseases. The early prognosis of cardiovascular diseases can aid in making decisions on lifestyle changes in high risk patients and in turn reduce the complications. This research intends to pinpoint the most relevant/risk factors of heart disease as well as predict the overall risk using logistic regression Data Preparation

Source

The dataset is publically available on the Kaggle website, and it is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. The classification goal is to predict whether the patient has 10-year risk of future coronary heart disease (CHD).The dataset provides the patients’ information. It includes over 4,000 records and 15 attributes. Variables Each attribute is a potential risk factor. There are both demographic, behavioral and medical risk factors.

Demographic:

• Sex: male or female(Nominal) • Age: Age of the patient;(Continuous - Although the recorded ages have been truncated to whole numbers, the concept of age is continuous)

Behavioral

• Current Smoker: whether or not the patient is a current smoker (Nominal) • Cigs Per Day: the number of cigarettes that the person smoked on average in one day.(can be considered continuous as one can have any number of cigarettes, even half a cigarette.)

Medical( history)

• BP Meds: whether or not the patient was on blood pressure medication (Nominal) • Prevalent Stroke: whether or not the patient had previously had a stroke (Nominal) • Prevalent Hyp: whether or not the patient was hypertensive (Nominal) • Diabetes: whether or not the patient had diabetes (Nominal)

Medical(current)

• Tot Chol: total cholesterol level (Continuous) • Sys BP: systolic blood pressure (Continuous) • Dia BP: diastolic blood pressure (Continuous) • BMI: Body Mass Index (Continuous) • Heart Rate: heart rate (Continuous - In medical research, variables such as heart rate though in fact discrete, yet are considered continuous because of large number of possible values.) • Glucose: glucose level (Continuous)

Predict variable (desired target)

• 10 year risk of coronary heart disease CHD (binary: “1”, means “Yes”, “0” means “No”)

Logistic Regression

Logistic regression is a type of regression analysis in statistics used for prediction of outcome of a categorical dependent variable from a set of predictor or independent variables. In logistic regression the dependent variable is always binary. Logistic regression is mainly used to for prediction and also calculating the probability of success. The results above show some of the attributes with P value higher than the preferred alpha(5%) and thereby showing low statistically significant relationship with the probability of heart disease. Backward elimination approach is used here to remove those attributes with highest P-value one at a time followed by running the regression repeatedly until all attributes have P Values less than 0.05.

Feature Selection: Backward elimination (P-value approach)

Logistic regression equation

P=eβ0+β1X1/1+eβ0+β1X1P=eβ0+β1X1/1+eβ0+β1X1 When all features plugged in: logit(p)=log(p/(1−p))=β0+β1∗Sexmale+β2∗age+β3∗cigsPerDay+β4∗totChol+β5∗sysBP+β6∗glucoselogit(p)=log(p/(1−p))=β0+β1∗Sexmale+β2∗age+β3∗cigsPerDay+β4∗totChol+β5∗sysBP+β6∗glucose

Interpreting the results: Odds Ratio, Confidence Intervals and P-values

• This fitted model shows that, holding all other features constant, the odds of getting diagnosed with heart disease for males (sex_male = 1)over that of females (sex_male = 0) is exp(0.5815) = 1.788687. In terms of percent change, we can say that the odds for males are 78.8% higher than the odds for females. • The coefficient for age says that, holding all others constant, we will see 7% increase in the odds of getting diagnosed with CDH for a one year increase in age since exp(0.0655) = 1.067644. • Similarly , with every extra cigarette one smokes thers is a 2% increase in the odds of CDH. • For Total cholesterol level and glucose level there is no significant change.

• There is a 1.7% increase in odds for every unit increase in systolic Blood Pressure.

Model Evaluation - Statistics

From the above statistics it is clear that the model is highly specific than sensitive. The negative values are predicted more accurately than the positives. Predicted probabilities of 0 (No Coronary Heart Disease) and 1 ( Coronary Heart Disease: Yes) for the test data with a default classification threshold of 0.5 lower the threshold Since the model is predicting Heart disease too many type II errors is not advisable. A False Negative ( ignoring the probability of disease when there actually is one) is more dangerous than a False Positive in this case. Hence in order to increase the sensitivity, threshold can be lowered.

Conclusions

• All attributes selected after the elimination process show P-values lower than 5% and thereby suggesting significant role in the Heart disease prediction. • Men seem to be more susceptible to heart disease than women. Increase in age, number of cigarettes smoked per day and systolic Blood Pressure also show increasing odds of having heart disease • Total cholesterol shows no significant change in the odds of CHD. This could be due to the presence of 'good cholesterol(HDL) in the total cholesterol reading. Glucose too causes a very negligible change in odds (0.2%) • The model predicted with 0.88 accuracy. The model is more specific than sensitive. Overall model could be improved with more data

Significance of the Code

1: Importing Necessary Libraries: We will import Numpy, Pandas, Matplotlib, Seaborn, Statsmodels and sklearn library in python. statsmodels: for statistical modeling for fitting logistic regression. sklearn: Provides tools for machine learning modeling. Screenshot 2025-07-09 at 8 14 48 PM 2: Data Preparation The dataset is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. The classification goal is to predict whether the patient has 10-year risk of future coronary heart disease (CHD). The dataset provides the patients information. It includes over 4,000 records and 15 attributes.

2.1 Loading and Handling Missing Values from the Dataset We will load the dataset and drop the irrelevant features from the the dataset like "education" and rename columns also. disease_df.pd.read_csv(): This is used to read the contents of CSV file. disease_df, dropna(axis=0, inplace=True): This removes any rows with missing values (NaN) from the DataFrame. disease_df.TenYearCHD.value_counts(): This prints the count of unique values in the TenYearCHD column which likely indicates whether a patient has heart disease.

Screenshot 2025-07-09 at 8 15 51 PM Screenshot 2025-07-09 at 8 16 37 PM

3: Splitting the Dataset into Test and Train Sets We will split the dataset into training and testing portions. But before that we will transform our data by scaling all the features using StandardScaler.

X=preprocessing.StandardScaler().fit(X).transform(X): This scales the features in X to have a mean of 0 and standard deviation of 1 using StandardScaler. Scaling is important for many machine learning models, especially when the features have different units or magnitudes. Training set (70% of data, X_train and y_train) Test set (30% of data, X_test and y_test) random_state=4 ensures the split is reproducible. Screenshot 2025-07-09 at 8 17 16 PM

4: Exploratory Data Analysis of Heart Disease Dataset In Exploratory Data Analysis (EDA) we perform EDA on the heart disease dataset to understand and gain insights into the dataset before building a predictive model for heart disease.

4.1: Ten Year's CHD Record of all the patients available in the dataset: sns.countplot(x='TenYearCHD', data=disease_df, palettte="BuGn_r"): creates a count plot using Seaborn which visualizes the distribution of the values in the TenYearCHD column showing how many individuals have heart disease (1) vs. how many don’t (0).

Screenshot 2025-07-09 at 8 17 48 PM The count plot shows a high imbalance in the dataset where the majority of individuals (over 3000) do not have heart disease (label 0) while only a small number (around 500) have heart disease (label 1).

4.2: Counting number of patients affected by CHD where (0= Not Affected; 1= Affected)

Screenshot 2025-07-09 at 8 18 16 PM

Blue bars: Indicate the absence of heart disease. White space (gaps): These represent the presence of heart disease.

5: Fitting Logistic Regression Model for Heart Disease Prediction

We will create a simple logistic regression model for prediction. logreg=LogisticRegression(): This creates an instance of the LogisticRegression model. logreg.fit(X_train, y_train): This trains the logistic regression model using the training data (X_train for features and y_train for the target). y_pred=logreg.predict(X_test): This uses the trained logistic regression model to make predictions on the test set (X_test). The predicted values are stored in y_pred.

Screenshot 2025-07-09 at 8 18 57 PM

6: Evaluating Logistic Regression Model

Screenshot 2025-07-09 at 8 19 18 PM

Plotting Confusion Matrix

Confusion Matrix is a performance evaluation tool used to assess the accuracy of a classification model. It is used to evaluate the performance of our logistic regression model in predicting heart disease helping us understand how well the model distinguishes between positive and negative cases.

cm=confusion_matrix(y_test, y_pred): Compute confusion matrix by comparing the actual values (y_test) with the predicted values(y_pred). It returns a 2x2 matrix showing true positives, true negatives, false positives and false negatives.

Screenshot 2025-07-09 at 8 19 54 PM Screenshot 2025-07-09 at 8 20 05 PM

The model performs well at predicting no heart disease (class 0) but poorly predicts heart disease (class 1) result in an imbalanced classification performance. To enhance model performance techniques such as class balancing, adjust thresholds or experiment with different algorithms help to achieve better results to correctly identify individuals with heart disease.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published