diff --git a/Predicting_FL_Depression.html b/Predicting_FL_Depression.html new file mode 100644 index 0000000..b5fc808 --- /dev/null +++ b/Predicting_FL_Depression.html @@ -0,0 +1,17489 @@ + + +
+ + +SimplyAnalytics is a web-based data mapping application that incorporates information from several nationally representative data sources on a variety of subjects. One such source is the CDC PLACES project, primarily concerned with health outcomes (https://www.cdc.gov/places/index.html). Unfortunately, the CDC's data on rates of depression is missing for all counties in Florida. In this notebook, I attempt to predict what those depression rates are using machine learning.
+I use various machine learning models to predict what this missing data is likely to be, drawing on the abundance of recorded data for other measures I deemed relevant to depression rates in the US. The important thing to note is that depression data is missing for Florida but recorded for every other state's counties, while the explanatory variables I selected for this series of analyses are available for Florida and non-Florida counties alike. Because of this commonality, I can train the models on counties that have both the explanatory variables and depression rates, and then apply them to counties that have only the explanatory variables.
+Please look below for my code.
+ +import warnings
+warnings.filterwarnings("ignore", message="X does not have valid feature names")
+
#Backbone of data organization
+import pandas as pd
+import numpy as np
+#Machine learning tools from SciKit-Learn
+from sklearn.linear_model import LogisticRegression, LinearRegression
+from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
+from sklearn import metrics
+from sklearn.naive_bayes import GaussianNB
+from sklearn.metrics import confusion_matrix, classification_report, precision_score, get_scorer_names, mean_squared_error, r2_score, roc_auc_score, ConfusionMatrixDisplay, accuracy_score, f1_score, recall_score
+from sklearn.model_selection import train_test_split, LeaveOneOut, cross_val_score, KFold, GridSearchCV, RandomizedSearchCV
+from sklearn.inspection import DecisionBoundaryDisplay
+from sklearn.neighbors import KNeighborsClassifier
+from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, plot_tree, export_text
+from sklearn.preprocessing import StandardScaler, PolynomialFeatures, SplineTransformer
+from sklearn.feature_selection import SequentialFeatureSelector
+from sklearn.ensemble import VotingClassifier, BaggingRegressor, BaggingClassifier, RandomForestRegressor, RandomForestClassifier, AdaBoostClassifier, AdaBoostRegressor, GradientBoostingRegressor, GradientBoostingClassifier
+from sklearn.svm import LinearSVC, SVC
+#Statistical Testing
+from scipy import stats
+#Visualization tools
+import ipywidgets as widgets
+from ipywidgets import interact
+import seaborn as sns
+import matplotlib.pyplot as plt
+import plotly.graph_objects as go
+import plotly.express as px
+from plotly.subplots import make_subplots
+import plotly.io as pio
+pio.renderers.default = 'iframe'
+#Helps keep track of training times
+import time
+
From SimplyAnalytics, I downloaded data on various explanatory variables in Shapefile format, then ran them through ArcGIS Online to export them as .csv files, which were joined together to create my datasets. So as not to risk infringing on SimplyAnalytics' policies, I do not include the data I used in this repository, but anyone can retrieve the same data with their own SimplyAnalytics account and the codebook I have included. To do this, export each of the variables specified in the codebook at the US county level, rename them to the variable names I used for my analysis, and join them together on the counties' spatial IDs.
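For reference, here is a minimal sketch of that final join step, assuming hypothetical file names and that every exported .csv shares a spatial_id column (an illustration, not the exact script I used):

from functools import reduce

# Hypothetical per-variable exports from ArcGIS Online
files = ["median_income.csv", "poverty_rate.csv", "smoking_rate.csv"]  # ...one file per codebook variable
tables = [pd.read_csv(f) for f in files]
# Merge every table on the shared county spatial ID, then save the combined dataset
joined = reduce(lambda left, right: pd.merge(left, right, on="spatial_id"), tables)
joined.to_csv("analyzedata.csv", index=False)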
+Analyzedata.csv contains all of the data I utilized in my analysis. Depression itself is recorded via the "% population, with depression" column. The CDC PLACES website gives a more detailed explanation of this value (https://www.cdc.gov/places/measure-definitions/health-outcomes/index.html#depression). It records depression as the percentage of adults aged 18 years or older in the county who report ever having been told by a healthcare provider that they have a depressive disorder.
+There are some limitations to this variable. The data does not, for instance, discriminate between major and minor depression, so people's depressive symptoms are counted equally regardless of their intensity. Additionally, even someone who is not depressed now is counted if a healthcare provider ever told them they had a depressive disorder at any point in their life. This means that some of the data may not reflect current trends in nationwide depression rates. Still, this variable should capture a suitable estimate of the scope of depression in the US.
+ +analyzedata = pd.read_csv("analyzedata.csv")
+
analyzedata.head()
+
+ | Unnamed: 0 | +spatial_id | +name | +perc_population_lacking_health_insurance | +perc_population_noncitizens | +perc_population_without_internet_access | +perc_population_without_highschool_education | +perc_population_whose_commute_is_90_minutes | +Median_income | +perc_population_male | +... | +perc_population_cognitive_disability | +perc_population_individual_living_disability | +average_selfreported_mental_health | +average_selfreported_physical_health | +perc_population_binge_drinking | +perc_population_smoking | +perc_population_without_physical_activity | +perc_population_less_than_7_hours_of_sleep | +perc_population_in_poverty | +perc_population_with_depression | +
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | +0 | +51610 | +Falls Church city, VA | +3.8285 | +4.2811 | +1.6937 | +3.6473 | +1.3051 | +158005.583721 | +49.0126 | +... | +7.1765 | +3.7250 | +11.5805 | +6.9313 | +18.4176 | +7.9498 | +13.5138 | +30.2433 | +2.0871 | +19.8510 | +
1 | +1 | +51678 | +Lexington city, VA | +7.6005 | +3.3659 | +6.0019 | +4.5010 | +0.7884 | +63925.395035 | +60.4235 | +... | +15.8981 | +8.1917 | +19.5995 | +9.9059 | +19.0079 | +12.5000 | +21.1924 | +30.8101 | +26.2336 | +24.5904 | +
2 | +2 | +72033 | +Cataño Municipio, PR | +0.0000 | +0.0000 | +52.0332 | +24.0707 | +5.0907 | +19630.975519 | +46.9006 | +... | +0.0000 | +0.0000 | +0.0000 | +0.0000 | +0.0000 | +0.0000 | +0.0000 | +0.0000 | +45.0673 | +0.0000 | +
3 | +3 | +51685 | +Manassas Park city, VA | +12.1139 | +15.3833 | +4.0519 | +24.1794 | +10.5785 | +90256.312061 | +51.6286 | +... | +13.2299 | +6.5991 | +16.4425 | +9.7563 | +18.7213 | +15.1132 | +23.1366 | +35.5040 | +2.8343 | +20.9052 | +
4 | +4 | +51580 | +Covington city, VA | +8.4537 | +0.0351 | +3.2832 | +11.8706 | +1.4455 | +42783.823622 | +45.1805 | +... | +14.7383 | +9.3237 | +17.0748 | +13.5251 | +15.2101 | +19.3440 | +27.4994 | +35.5650 | +12.9587 | +23.4779 | +
5 rows × 37 columns
+The analyzedata dataset is split into two parts: miss_data, which holds the counties in Florida (along with Puerto Rico, which also lacks depression data but is additionally missing values for most of the explanatory variables), and precomp_data, which holds all of the other counties and will be used for training and testing the models.
+ +miss_data = analyzedata[analyzedata['perc_population_with_depression'] == 0]
+
precomp_data = analyzedata[analyzedata['perc_population_with_depression'] != 0]
+
With precomp_data I can do some basic exploration of depression in the US before starting the analysis. Sorting the data by depression rate shows that Honolulu County, Hawaii is the single least depressed county in the US (12.03%), followed by several other Hawaiian counties and Prince George's County, Maryland, while Magoffin County, Kentucky is the single most depressed county in the US (33.20%), followed by several other Kentucky counties.
+ +(precomp_data.sort_values(by='perc_population_with_depression', ascending = True)).head()
+
+ | Unnamed: 0 | +spatial_id | +name | +perc_population_lacking_health_insurance | +perc_population_noncitizens | +perc_population_without_internet_access | +perc_population_without_highschool_education | +perc_population_whose_commute_is_90_minutes | +Median_income | +perc_population_male | +... | +perc_population_cognitive_disability | +perc_population_individual_living_disability | +average_selfreported_mental_health | +average_selfreported_physical_health | +perc_population_binge_drinking | +perc_population_smoking | +perc_population_without_physical_activity | +perc_population_less_than_7_hours_of_sleep | +perc_population_in_poverty | +perc_population_with_depression | +
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3148 | +3148 | +15003 | +Honolulu County, HI | +5.2956 | +7.2010 | +5.6249 | +8.1999 | +2.6695 | +92824.581663 | +50.5588 | +... | +9.3545 | +5.2851 | +12.8306 | +8.1250 | +17.2830 | +11.0493 | +20.6669 | +41.6486 | +8.6468 | +12.0295 | +
70 | +70 | +15005 | +Kalawao County, HI | +5.5556 | +0.3906 | +0.0000 | +0.7752 | +0.0000 | +81665.653931 | +59.3750 | +... | +10.7692 | +7.6923 | +12.3077 | +9.2308 | +16.9231 | +13.8462 | +21.5385 | +36.9231 | +1.9841 | +12.3077 | +
3086 | +3086 | +15009 | +Maui County, HI | +6.2804 | +8.1524 | +7.3821 | +9.0461 | +1.4498 | +89603.826535 | +49.8382 | +... | +10.4216 | +5.8887 | +13.1815 | +8.2612 | +18.3690 | +12.9121 | +18.8404 | +35.7029 | +9.0900 | +13.0596 | +
2010 | +2010 | +24033 | +Prince George's County, MD | +11.7676 | +13.2117 | +3.9966 | +13.8406 | +4.1999 | +91744.863839 | +48.2978 | +... | +11.4939 | +6.8346 | +14.1653 | +9.0093 | +12.5228 | +11.4578 | +23.3100 | +40.3515 | +9.2147 | +13.1289 | +
2594 | +2594 | +15007 | +Kauai County, HI | +6.0408 | +7.8048 | +4.7208 | +9.0390 | +0.5492 | +88415.786691 | +49.7258 | +... | +10.3278 | +6.0378 | +13.2342 | +9.5351 | +17.3332 | +12.8307 | +20.4056 | +38.2744 | +8.5223 | +13.8287 | +
5 rows × 37 columns
+(precomp_data.sort_values(by='perc_population_with_depression', ascending = False)).head()
+
+ | Unnamed: 0 | +spatial_id | +name | +perc_population_lacking_health_insurance | +perc_population_noncitizens | +perc_population_without_internet_access | +perc_population_without_highschool_education | +perc_population_whose_commute_is_90_minutes | +Median_income | +perc_population_male | +... | +perc_population_cognitive_disability | +perc_population_individual_living_disability | +average_selfreported_mental_health | +average_selfreported_physical_health | +perc_population_binge_drinking | +perc_population_smoking | +perc_population_without_physical_activity | +perc_population_less_than_7_hours_of_sleep | +perc_population_in_poverty | +perc_population_with_depression | +
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1151 | +1151 | +21153 | +Magoffin County, KY | +10.8973 | +0.6770 | +18.3244 | +29.3527 | +8.6843 | +28278.600000 | +50.0391 | +... | +24.7120 | +14.5200 | +22.4909 | +19.2631 | +13.1707 | +30.8251 | +38.7234 | +40.0519 | +29.5216 | +33.2019 | +
190 | +190 | +21129 | +Lee County, KY | +12.6859 | +0.6922 | +18.3061 | +28.2597 | +18.3665 | +28848.907609 | +58.7194 | +... | +26.6484 | +16.2740 | +23.1233 | +21.0046 | +13.0594 | +33.4612 | +41.8265 | +41.2237 | +30.0706 | +32.3653 | +
1917 | +1917 | +21205 | +Rowan County, KY | +8.2695 | +0.7073 | +5.8783 | +13.5629 | +3.6119 | +45760.072048 | +47.9568 | +... | +22.4782 | +12.0390 | +22.5940 | +15.0123 | +14.6300 | +22.9914 | +30.4825 | +38.0792 | +26.4521 | +32.2433 | +
1226 | +1226 | +21131 | +Leslie County, KY | +11.0912 | +0.1171 | +7.5329 | +16.5167 | +7.0366 | +36659.224346 | +50.8883 | +... | +24.8793 | +15.0836 | +22.8731 | +20.0248 | +13.0526 | +30.9226 | +40.2353 | +41.1022 | +34.0897 | +31.9257 | +
1632 | +1632 | +21051 | +Clay County, KY | +13.3266 | +0.4504 | +21.7026 | +32.8616 | +3.2357 | +32320.219341 | +52.6453 | +... | +26.5801 | +15.8982 | +23.2358 | +20.2658 | +13.2651 | +33.4935 | +39.9763 | +42.5158 | +31.2620 | +31.8525 | +
5 rows × 37 columns
+Given the low predictive power of any single variable and the challenges of multiple linear regression with a large number of independent variables, predicting an exact depression rate for each Florida county is a difficult task. It is easier to split the depression statistic into percentile ranges, treating depression as a categorical variable so that tools such as logistic regression and linear discriminant analysis can be used. Since the goal is a simple estimate of depression in these counties, a categorical outcome can serve just as well as a numerical one. At any rate, the prediction is only meaningful when compared against the data for other counties, which this approach accomplishes.
+The quartileranges function reports the range of depression rates covered by each category. In this case, since I split the data into quartiles, there are four classifications, and the function prints the minimum and maximum depression rate observed in each. 'Q1' stands for the first quartile, the category containing the lowest rates of depression, and 'Q4' stands for the fourth quartile, containing the highest.
+ +def quartileranges(x):
+ quartiles = []
+ for i in range(1, x + 1):
+ quartiles.append("Q" + str(i))
+ comp_data = precomp_data.copy()
+ comp_data['dep_qt'] = pd.qcut(precomp_data['perc_population_with_depression'], q=x, labels=quartiles)
+ for i in quartiles:
+ comp_databd = comp_data.query(f'dep_qt == "{i}"')
+ print([i, comp_databd.perc_population_with_depression.min(), comp_databd.perc_population_with_depression.max()])
+
quartileranges(4)
+
['Q1', 12.0295, 20.9249] +['Q2', 20.9275, 22.8042] +['Q3', 22.8045, 24.979] +['Q4', 24.9818, 33.2019] ++
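If preferred, pd.qcut can also return these cut points directly via its retbins argument; a quick equivalent sketch (note that adjacent quartiles share a boundary here, rather than showing each group's observed minimum and maximum):

_, edges = pd.qcut(precomp_data['perc_population_with_depression'], q=4, retbins=True)
print(edges)  # five boundary values delimiting Q1 through Q4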
The last step before the analysis is to create a new dataset, "comp_data", from precomp_data, in which I finalize the quartile split. I then separate the new dependent variable, dep_qt, which becomes "y," and drop unnecessary columns from comp_data, which becomes "X." These dataframes are passed to the train-test split function, with 35% of the counties held out for testing. Finally, I standardize the independent variables so that they all operate on a common scale in the models.
+ +comp_data = precomp_data.copy()
+comp_data['dep_qt'] = pd.qcut(precomp_data['perc_population_with_depression'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
+
y = comp_data['dep_qt']
+X = comp_data.drop(['Unnamed: 0', 'spatial_id', 'name', 'perc_population_with_depression', 'dep_qt'], axis=1)
+
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=1)
+
scaler = StandardScaler()
+X_train_scaled = scaler.fit_transform(X_train)
+X_train = pd.DataFrame(X_train_scaled, columns=X_train.columns)
+X_test_scaled = scaler.transform(X_test)
+X_test = pd.DataFrame(X_test_scaled, columns=X_test.columns)
+
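As an aside, a common alternative to scaling the data up front is to wrap the scaler and estimator in a scikit-learn Pipeline, so that GridSearchCV re-fits the scaler inside each cross-validation fold; a minimal sketch using the same classes (not used for the results below):

from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=100000))
])
# Hyperparameters of pipeline steps are addressed with the step name as a prefix
pipe_search = GridSearchCV(pipe, {'clf__C': [0.01, 0.1, 1, 10, 100, 1000]}, cv=5, scoring='accuracy')
# pipe_search.fit(X_train, y_train)  # would tune C with per-fold scaling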
I can optimize the logistic regression in one of three ways based on the penalty assigned during hyperparameter tuning: l1, l2, or elasticnet, which combines the l1 and l2 penalties. Since each penalty is only compatible with certain solvers, I hold the penalty constant and use GridSearchCV to search over the solvers and C values in three separate cells (the l1 and elasticnet searches are left commented out below).
+Comparing each penalty individually, the best performance for logistic regression comes from the LBFGS solver with the l2 penalty, at around 69% accuracy. This model, like many of the others, does exceptionally well at predicting the most extreme quartiles but not as well at predicting those in between, suffering roughly a 20-point drop in accuracy on the second and third quartiles compared to the first and fourth.
+ +# start = time.time()
+# logreg = LogisticRegression(solver = 'liblinear', penalty = 'l1', max_iter = 100000)
+# param_grid = {'C': [0.01, 0.1, 1, 10, 100, 1000]}
+# grid_search = GridSearchCV(logreg, param_grid, cv=5, scoring='accuracy')
+# grid_search.fit(X_train, y_train)
+# best_params = grid_search.best_params_
+# print("Best Hyperparameters:", best_params)
+# end = time.time()
+# print(f"Time elapsed: {end-start:.4f}")
+
# Starts timer
+start = time.time()
+# Instantiates the model with the l2 penalty held constant; the solver given here is a placeholder, since GridSearchCV searches over solvers below
+logreg = LogisticRegression(solver = 'newton-cg', penalty = 'l2', max_iter = 100000)
+# Outlines the hyperparameter values for GridSearchCV to search over
+param_grid = {
+ 'C': [0.01, 0.1, 1, 10, 100, 1000],
+ 'solver': ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag']}
+grid_search = GridSearchCV(logreg, param_grid, cv=5, scoring='accuracy')
+# Finds, prints optimal combination
+grid_search.fit(X_train, y_train)
+best_params = grid_search.best_params_
+print("Best Hyperparameters:", best_params)
+# Ends timer
+end = time.time()
+# Provides total time needed to complete model
+print(f"Time elapsed: {end-start:.4f}")
+
Best Hyperparameters: {'C': 1, 'solver': 'lbfgs'} +Time elapsed: 65.0526 ++
# start = time.time()
+# logreg = LogisticRegression(solver = 'saga', penalty = 'elasticnet', max_iter = 100000)
+# param_grid = {
+# 'C': [0.01, 0.1, 1, 10, 100, 1000],
+# 'l1_ratio': [0.1, 0.5, 0.7, 0.9]
+# }
+# grid_search = GridSearchCV(logreg, param_grid, cv=5, scoring='accuracy')
+# grid_search.fit(X_train, y_train)
+# best_params = grid_search.best_params_
+# print("Best Hyperparameters:", best_params)
+# end = time.time()
+# print(f"Time elapsed: {end-start:.4f}")
+
# References model found by GridSearchCV
+logregbm = grid_search.best_estimator_
+# Generates predictions for the testing data
+lr_y_pred = logregbm.predict(X_test)
+# Assesses and prints the accuracy, precision, F1, and recall of the model's predictions against the actual quartiles
+logregacc = accuracy_score(y_test, lr_y_pred)
+logregpre = precision_score(y_test, lr_y_pred, average='weighted')
+logregf1 = f1_score(y_test, lr_y_pred, average='weighted')
+logregrecall = recall_score(y_test, lr_y_pred, average='weighted')
+print(f"Accuracy: {logregacc:.4f}")
+print(f"Precision: {logregpre:.4f}")
+print(f"F1: {logregf1:.4f}")
+print(f"Recall: {logregrecall:.4f}")
+# Creates confusion matrices and a classification report detailing the strengths and weaknesses of the model's predictions
+ConfusionMatrixDisplay.from_estimator(logregbm, X_test, y_test, cmap = plt.cm.Reds, normalize = 'true')
+plt.show()
+print(confusion_matrix(y_test, lr_y_pred))
+print(classification_report(y_test, lr_y_pred))
+
Accuracy: 0.6899 +Precision: 0.6907 +F1: 0.6903 +Recall: 0.6899 ++
[[203 56 10 1] + [ 56 154 50 7] + [ 12 51 170 39] + [ 1 9 42 216]] + precision recall f1-score support + + Q1 0.75 0.75 0.75 270 + Q2 0.57 0.58 0.57 267 + Q3 0.62 0.62 0.62 272 + Q4 0.82 0.81 0.81 268 + + accuracy 0.69 1077 + macro avg 0.69 0.69 0.69 1077 +weighted avg 0.69 0.69 0.69 1077 + ++
The Linear Discriminant Analysis (LDA) model does not do as well as the logistic regression, but its deficit is mostly in predicting the extreme quartiles. Accurate predictions of the middle quartiles are harder to come by, and thus more valuable, and in that regard the LDA performs roughly the same as the logistic regression, if not slightly better.
+The LDA can be optimized by solver and by shrinkage. The SVD solver, which is the most accurate in this case, is not compatible with shrinkage, so a distinction is made in the code. This model has around 67% accuracy overall.
+ +start = time.time()
+lda = LinearDiscriminantAnalysis()
+param_grid = [
+ {'solver': ['svd'], 'shrinkage': [None]},
+ {'solver': ['lsqr', 'eigen'], 'shrinkage': ['auto', 0.1, 0.5, 0.9]}
+]
+grid_search = GridSearchCV(lda, param_grid, cv=5, scoring='accuracy')
+grid_search.fit(X_train, y_train)
+best_params = grid_search.best_params_
+print("Best Hyperparameters:", best_params)
+end = time.time()
+print(f"Time elapsed: {end-start:.4f}")
+
Best Hyperparameters: {'shrinkage': None, 'solver': 'svd'} +Time elapsed: 3.0342 ++
ldabm = grid_search.best_estimator_
+lda_y_pred = ldabm.predict(X_test)
+ldaacc = accuracy_score(lda_y_pred, y_test)
+print(f"Accuracy: {ldaacc:.4f}")
+ConfusionMatrixDisplay.from_estimator(ldabm, X_test, y_test, cmap = plt.cm.Blues, normalize = 'true')
+plt.show()
+print(confusion_matrix(y_test, lda_y_pred))
+print(classification_report(y_test, lda_y_pred))
+
Accuracy: 0.6676 ++
[[189 64 17 0] + [ 50 161 53 3] + [ 8 61 167 36] + [ 1 5 60 202]] + precision recall f1-score support + + Q1 0.76 0.70 0.73 270 + Q2 0.55 0.60 0.58 267 + Q3 0.56 0.61 0.59 272 + Q4 0.84 0.75 0.79 268 + + accuracy 0.67 1077 + macro avg 0.68 0.67 0.67 1077 +weighted avg 0.68 0.67 0.67 1077 + ++
The Gaussian Naive Bayes model performs the worst out of all the models, at 41% accuracy, and has limited room for hyperparameter tuning. It becomes progressively more accurate as the depression quartile increases, but it fails to surpass the prior models' predictions in any quartile. The model's lackluster performance may be due to the fact that many of my independent variables are not normally distributed, as can be seen in my interactive scatterplot.
+ +start = time.time()
+NBayes = GaussianNB()
+NBayes.fit(X_train, y_train)
+NB_y_pred = NBayes.predict(X_test)
+NBayesacc = accuracy_score(NB_y_pred, y_test)
+print(f"Accuracy: {NBayesacc:.4f}")
+ConfusionMatrixDisplay.from_estimator(NBayes, X_test, y_test, cmap = plt.cm.Greens, normalize = 'true')
+plt.show()
+print(confusion_matrix(y_test, NB_y_pred))
+print(classification_report(y_test, NB_y_pred))
+end = time.time()
+print(f"Time elapsed: {end-start:.4f}")
+
Accuracy: 0.4132 ++
[[ 42 136 81 11] + [ 23 75 124 45] + [ 8 35 152 77] + [ 8 15 69 176]] + precision recall f1-score support + + Q1 0.52 0.16 0.24 270 + Q2 0.29 0.28 0.28 267 + Q3 0.36 0.56 0.44 272 + Q4 0.57 0.66 0.61 268 + + accuracy 0.41 1077 + macro avg 0.43 0.41 0.39 1077 +weighted avg 0.43 0.41 0.39 1077 + +Time elapsed: 0.4085 ++
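As a rough check on the normality concern raised above, one can inspect the skewness of each explanatory variable; values far from zero point to distributions that strain the Gaussian assumption behind this model. A quick sketch using the unscaled feature matrix:

# Skewness per feature (0 would indicate a symmetric, roughly normal-looking distribution)
feature_skew = X.skew(numeric_only=True).sort_values(ascending=False)
print(feature_skew.head(10))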
I first find the optimal K for my K-Nearest Neighbors classifier using the function and graph below. For each point, K is the number of most similar data points used to determine which quartile that point belongs to. In this case, the optimal K is 57 - around 5% of the 1,077 counties in the test set.
+Like the logistic regression and LDA models, this model does well at predicting the extreme quartiles but suffers in the middle quartiles. Its overall accuracy is also lower, at around 61%.
+ +# Finds optimal "K", or the number of nearby data points needed to classify another data point
+def optknn(X, y):
+ k_values = range(1, 100)
+ cv_scores = []
+    # Finds the cross-validated accuracy for models with K ranging from 1 to 99 given the supplied data
+ for k in k_values:
+ knn = KNeighborsClassifier(n_neighbors=k)
+ scores = cross_val_score(knn, X, y, cv=4, scoring='accuracy')
+ cv_scores.append(scores.mean())
+ optimal_k = k_values[np.argmax(cv_scores)]
+ print(f"The optimal K is: {optimal_k}")
+ # Creates visualization between accuracy and K for 100 models built
+ plt.plot(k_values, cv_scores)
+ plt.xlabel('K')
+ plt.ylabel('Cross-validated Accuracy')
+ plt.title('Cross-validated Accuracy vs K (KNN)')
+ plt.grid(True)
+ plt.show()
+optknn(X, y)
+
The optimal K is: 57 ++
start = time.time()
+knn = KNeighborsClassifier(n_neighbors=57)
+cv_scores = cross_val_score(knn, X_train, y_train, cv=5)
+print("Cross-validation scores:", cv_scores)
+print(f"Average cross-validation score: {cv_scores.mean():.4f}", )
+knn.fit(X_train, y_train)
+knn_y_pred = knn.predict(X_test)
+# Assesses accuracy of the model's predictions on the test set
+knnacc = accuracy_score(y_test, knn_y_pred)
+print(f"Accuracy: {knnacc:.4f}")
+ConfusionMatrixDisplay.from_estimator(knn, X_test, y_test, cmap = plt.cm.Greys, normalize = 'true')
+plt.show()
+print(confusion_matrix(y_test, knn_y_pred))
+print(classification_report(y_test, knn_y_pred))
+end = time.time()
+print(f"Time elapsed: {end-start:.4f}")
+
Cross-validation scores: [0.5975 0.52 0.565 0.5625 0.5112782] +Average cross-validation score: 0.5513 +Accuracy: 0.6091 ++
[[193 58 18 1] + [ 62 106 83 16] + [ 23 32 157 60] + [ 7 7 54 200]] + precision recall f1-score support + + Q1 0.68 0.71 0.70 270 + Q2 0.52 0.40 0.45 267 + Q3 0.50 0.58 0.54 272 + Q4 0.72 0.75 0.73 268 + + accuracy 0.61 1077 + macro avg 0.61 0.61 0.60 1077 +weighted avg 0.61 0.61 0.60 1077 + +Time elapsed: 0.5556 ++
The Support Vector Classifier (SVC) has the best accuracy of all the models thus far, at 72%. Tuning over kernel types and C values, the best combination is an RBF kernel with a C value of 10. The polynomial and sigmoid kernels have further parameters that could be tuned, but including them is computationally expensive, and the RBF kernel is already the most accurate. This model achieves decent accuracy across all four quartiles.
+ +start = time.time()
+svc = SVC()
+param_grid = {
+ 'C': [0.1, 1, 10, 100],
+ 'kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
+}
+grid_search = GridSearchCV(svc, param_grid, cv=5, scoring='accuracy')
+grid_search.fit(X_train, y_train)
+best_params = grid_search.best_params_
+print("Best Hyperparameters:", best_params)
+end = time.time()
+print(f"Time elapsed: {end-start:.4f}")
+
Best Hyperparameters: {'C': 10, 'kernel': 'rbf'} +Time elapsed: 66.8091 ++
svcbm = grid_search.best_estimator_
+svc_y_pred = svcbm.predict(X_test)
+svcacc = accuracy_score(svc_y_pred, y_test)
+print(f"Accuracy: {svcacc:.4f}")
+ConfusionMatrixDisplay.from_estimator(svcbm, X_test, y_test, cmap = plt.cm.Oranges, normalize = 'true')
+plt.show()
+print(confusion_matrix(y_test, svc_y_pred))
+print(classification_report(y_test, svc_y_pred))
+
Accuracy: 0.7205 ++
[[209 53 6 2] + [ 36 183 42 6] + [ 6 62 169 35] + [ 1 7 45 215]] + precision recall f1-score support + + Q1 0.83 0.77 0.80 270 + Q2 0.60 0.69 0.64 267 + Q3 0.65 0.62 0.63 272 + Q4 0.83 0.80 0.82 268 + + accuracy 0.72 1077 + macro avg 0.73 0.72 0.72 1077 +weighted avg 0.73 0.72 0.72 1077 + ++
For the Random Forest Classifier, searching over too many hyperparameters with GridSearchCV becomes too computationally expensive for my server to handle, even with the use of a GPU. Several hyperparameters are therefore fixed at values I judged suitable for this analysis: holding the number of trees constant and leaving the tree depth unrestricted should be fine, so long as the minimum samples per split and per leaf can still be tuned. At 70% accuracy, this model is slightly less accurate overall than the SVC, but it is better at predicting counties in the third quartile.
+Below the confusion matrix, I use the model's feature importances to assess which independent variables carry the most weight in predicting depression. This model appears to place the most emphasis on county-wide health statistics such as rates of arthritis, diabetes, and stroke.
+ +start = time.time()
+rf = RandomForestClassifier(max_depth = None, n_estimators = 500, criterion = 'gini', max_features = 'sqrt', bootstrap = False, random_state = 1)
+param_grid = {
+ 'min_samples_split': [2, 5, 10],
+ 'min_samples_leaf': [1, 2, 4]
+}
+grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy')
+grid_search.fit(X_train, y_train)
+best_params = grid_search.best_params_
+print("Best Hyperparameters:", best_params)
+end = time.time()
+print(f"Time elapsed: {end-start:.4f}")
+
Best Hyperparameters: {'min_samples_leaf': 1, 'min_samples_split': 2} +Time elapsed: 218.1307 ++
rfbm = grid_search.best_estimator_
+rf_y_pred = rfbm.predict(X_test)
+rfacc = accuracy_score(rf_y_pred, y_test)
+print(f"Accuracy: {rfacc:.4f}")
+ConfusionMatrixDisplay.from_estimator(rfbm, X_test, y_test, cmap = plt.cm.Purples, normalize = 'true')
+plt.show()
+print(confusion_matrix(y_test, rf_y_pred))
+print(classification_report(y_test, rf_y_pred))
+
Accuracy: 0.7029 ++
[[207 53 7 3] + [ 50 162 46 9] + [ 10 50 183 29] + [ 2 9 52 205]] + precision recall f1-score support + + Q1 0.77 0.77 0.77 270 + Q2 0.59 0.61 0.60 267 + Q3 0.64 0.67 0.65 272 + Q4 0.83 0.76 0.80 268 + + accuracy 0.70 1077 + macro avg 0.71 0.70 0.70 1077 +weighted avg 0.71 0.70 0.70 1077 + ++
importances = pd.Series(data = rfbm.feature_importances_,
+ index = X_train.columns)
+
+importances_sorted = importances.sort_values()
+# Keeps the 20 most important features (the largest values sit at the tail after an ascending sort)
+importances_sorted = importances_sorted.tail(20)
+
+fig = plt.figure(figsize = (25,20))
+ax = importances_sorted.plot(kind='barh', color='purple')
+plt.title('Feature Importances', fontsize=50) # Title font size
+plt.xlabel('Importance', fontsize=40) # X-axis label font size
+plt.ylabel('Features', fontsize=40) # Y-axis label font size
+ax.tick_params(axis='x', labelsize=30) # X-axis tick label size
+ax.tick_params(axis='y', labelsize=30)
+plt.show()
+
missX = miss_data.drop(['Unnamed: 0', 'spatial_id', 'name', 'perc_population_with_depression'], axis=1)
+# Reuses the scaler fitted on the training data so these counties are on the same scale the models were trained on
+missX_scaled = scaler.transform(missX)
+missX = pd.DataFrame(missX_scaled, columns=missX.columns)
+
Lastly, I assemble all of the predictions for the missing data into a dataframe. The "predmatcher" function gathers the predictions from each model and aligns them with the names of the counties they correspond to. I then tally how many times the models predicted each outcome (in this case, which depression quartile the county falls into) and record the most common prediction in the "consensus" column.
+ +def predmatcher(model):
+    # Returns the model's predictions for the missing-data counties as a list
+    return list(model.predict(missX_scaled))
+
modelpreds = pd.DataFrame({
+ 'County_Name':list(miss_data.name),
+ f'logregpreds':predmatcher(logregbm),
+ f'ldapreds':predmatcher(ldabm),
+# f'NBpreds':predmatcher(NBayes),
+ f'knn':predmatcher(knn),
+ f'svcpreds':predmatcher(svcbm),
+ f'rfpreds':predmatcher(rfbm)
+ })
+for i in ['Q1', 'Q2', 'Q3', 'Q4']:
+ modelpreds[f'{i}_counts'] = modelpreds.apply(lambda row: (row == i).sum(), axis=1)
+modelpreds['max'] = modelpreds[['Q1_counts', 'Q2_counts', 'Q3_counts', 'Q4_counts']].max(axis=1)
+modelpreds['consensus'] = modelpreds.mode(axis=1)[0]
+modelpreds['consensus'] = modelpreds['consensus'].replace(2, 'Tie')
+modelpredsFL = modelpreds[modelpreds['County_Name'].str.contains('FL')]
+
votedf = pd.DataFrame({
+ f'logregpreds':lr_y_pred,
+ f'ldapreds':lda_y_pred,
+ f'knn':knn_y_pred,
+ f'svcpreds':svc_y_pred,
+ f'rfpreds':rf_y_pred
+ })
+for i in ['Q1', 'Q2', 'Q3', 'Q4']:
+ votedf[f'{i}_counts'] = votedf.apply(lambda row: (row == i).sum(), axis=1)
+votedf['max'] = votedf[['Q1_counts', 'Q2_counts', 'Q3_counts', 'Q4_counts']].max(axis=1)
+votedf['consensus'] = votedf.mode(axis=1)[0]
+votedf['consensus'] = votedf['consensus'].replace(2, 'Q2')
+
cm = confusion_matrix(list(y_test), list(votedf['consensus']), labels=['Q1', 'Q2', 'Q3', 'Q4'])
+disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Q1', 'Q2', 'Q3', 'Q4'])
+disp.plot(cmap=plt.cm.BuPu)
+accuracy = accuracy_score(list(y_test), list(votedf['consensus']))
+print(f"Accuracy: {accuracy}")
+plt.show()
+
Accuracy: 0.7000928505106778 ++
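The same hard-vote scheme can also be expressed with scikit-learn's VotingClassifier (imported earlier but otherwise unused). Note that it re-fits clones of the estimators on X_train rather than reusing the already-fitted models, so its tallies may differ slightly from the manual count above; a minimal sketch:

voting = VotingClassifier(
    estimators=[('logreg', logregbm), ('lda', ldabm), ('knn', knn),
                ('svc', svcbm), ('rf', rfbm)],
    voting='hard')
voting.fit(X_train, y_train)
print(f"VotingClassifier accuracy: {accuracy_score(y_test, voting.predict(X_test)):.4f}")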
For Floridian counties, the SVC - the most accurate model - predicts an overwhelming majority in the second quartile, with six counties in the first quartile. This would mean that most Floridian counties have lower rates of depression than the rest of the country, and that a few in particular - Lee, Orange, Palm Beach, Hillsborough, Miami-Dade, and Escambia - have exceptionally low rates of depression.
+However, the RF model, which had higher accuracy for detecting counties in the third quartile, tells a different story: it predicts the majority of counties as belonging to the third quartile, with just three in the second. The two models thus paint substantially different pictures of depression in Florida.
+The consensus column, which factors in the predictions of all the models covered thus far (save for the Naive Bayes), places most counties in the third quartile, some in the second, one tie between the second and third, and a single county - Miami-Dade - in the first.
+ +modelpredsFL.head()
+
+ | County_Name | +logregpreds | +ldapreds | +knn | +svcpreds | +rfpreds | +Q1_counts | +Q2_counts | +Q3_counts | +Q4_counts | +max | +consensus | +
---|---|---|---|---|---|---|---|---|---|---|---|---|
77 | +Union County, FL | +Q2 | +Q2 | +Q2 | +Q2 | +Q3 | +0 | +4 | +1 | +0 | +4 | +Q2 | +
78 | +Gilchrist County, FL | +Q3 | +Q2 | +Q3 | +Q2 | +Q3 | +0 | +2 | +3 | +0 | +3 | +Q3 | +
79 | +Seminole County, FL | +Q3 | +Q3 | +Q3 | +Q2 | +Q3 | +0 | +1 | +4 | +0 | +4 | +Q3 | +
80 | +Hardee County, FL | +Q2 | +Q2 | +Q3 | +Q2 | +Q3 | +0 | +3 | +2 | +0 | +3 | +Q2 | +
81 | +DeSoto County, FL | +Q3 | +Q2 | +Q3 | +Q2 | +Q3 | +0 | +2 | +3 | +0 | +3 | +Q3 | +
modelpredsFL['svcpreds'].value_counts()
+
svcpreds +Q2 61 +Q1 6 +Name: count, dtype: int64+
modelpredsFLQ1 = modelpredsFL[modelpredsFL['svcpreds'] == 'Q1']
+modelpredsFLQ1.head(6)
+
+ | County_Name | +logregpreds | +ldapreds | +knn | +svcpreds | +rfpreds | +Q1_counts | +Q2_counts | +Q3_counts | +Q4_counts | +max | +consensus | +
---|---|---|---|---|---|---|---|---|---|---|---|---|
115 | +Lee County, FL | +Q3 | +Q3 | +Q3 | +Q1 | +Q3 | +1 | +0 | +4 | +0 | +4 | +Q3 | +
122 | +Orange County, FL | +Q3 | +Q3 | +Q1 | +Q1 | +Q3 | +2 | +0 | +3 | +0 | +3 | +Q3 | +
129 | +Palm Beach County, FL | +Q4 | +Q3 | +Q3 | +Q1 | +Q3 | +1 | +0 | +3 | +1 | +3 | +Q3 | +
131 | +Hillsborough County, FL | +Q3 | +Q3 | +Q3 | +Q1 | +Q3 | +1 | +0 | +4 | +0 | +4 | +Q3 | +
141 | +Miami-Dade County, FL | +Q1 | +Q3 | +Q1 | +Q1 | +Q3 | +3 | +0 | +2 | +0 | +3 | +Q1 | +
143 | +Escambia County, FL | +Q3 | +Q2 | +Q3 | +Q1 | +Q3 | +1 | +1 | +3 | +0 | +3 | +Q3 | +
modelpredsFL['rfpreds'].value_counts()
+
rfpreds +Q3 64 +Q2 3 +Name: count, dtype: int64+
modelpredsFLQ2 = modelpredsFL[modelpredsFL['rfpreds'] == 'Q2']
+modelpredsFLQ2.head()
+
+ | County_Name | +logregpreds | +ldapreds | +knn | +svcpreds | +rfpreds | +Q1_counts | +Q2_counts | +Q3_counts | +Q4_counts | +max | +consensus | +
---|---|---|---|---|---|---|---|---|---|---|---|---|
93 | +Gadsden County, FL | +Q3 | +Q3 | +Q2 | +Q2 | +Q2 | +0 | +3 | +2 | +0 | +3 | +Q2 | +
95 | +Madison County, FL | +Q2 | +Q2 | +Q3 | +Q2 | +Q2 | +0 | +4 | +1 | +0 | +4 | +Q2 | +
104 | +Hamilton County, FL | +Q2 | +Q2 | +Q3 | +Q2 | +Q2 | +0 | +4 | +1 | +0 | +4 | +Q2 | +
modelpredsFL['consensus'].value_counts()
+
consensus +Q3 49 +Q2 16 +Tie 1 +Q1 1 +Name: count, dtype: int64+
modelpredsFLQ1 = modelpredsFL[modelpredsFL['consensus'] == 'Q1']
+modelpredsFLQ1.head()
+
+ | County_Name | +logregpreds | +ldapreds | +knn | +svcpreds | +rfpreds | +Q1_counts | +Q2_counts | +Q3_counts | +Q4_counts | +max | +consensus | +
---|---|---|---|---|---|---|---|---|---|---|---|---|
141 | +Miami-Dade County, FL | +Q1 | +Q3 | +Q1 | +Q1 | +Q3 | +3 | +0 | +2 | +0 | +3 | +Q1 | +
modelpredsFLTie = modelpredsFL[modelpredsFL['consensus'] == 'Tie']
+modelpredsFLTie.head()
+
+ | County_Name | +logregpreds | +ldapreds | +knn | +svcpreds | +rfpreds | +Q1_counts | +Q2_counts | +Q3_counts | +Q4_counts | +max | +consensus | +
---|---|---|---|---|---|---|---|---|---|---|---|---|
125 | +Hendry County, FL | +Q4 | +Q3 | +Q2 | +Q2 | +Q3 | +0 | +2 | +2 | +1 | +2 | +Tie | +
In conclusion, the most accurate individual model (the SVC) places most Florida counties below the national median rate of depression, while the consensus of the five viable models leans toward the third quartile, slightly above the national median. Both approaches, however, single out Miami-Dade County as having an exceptionally low predicted rate of depression.
+In battling growing rates of depression across the country, it is important to understand the impact of county-specific circumstances, cultures, demographics, and more. Based on the results of this prediction, research on depression in Florida could start with Miami-Dade County, to see whether there are special qualities of the county that explain why it is predicted to have such a low depression rate compared to the rest of the state. From there, an empirically supported approach to supporting mental health in Florida can be developed.
+ +