dxk613/la_youth_homeless

Problem Statement


California leads the country in youth homelessness. To help address this growing issue in Los Angeles, this project uses binary classification models trained on public youth homelessness survey data to predict who is at risk of becoming homeless, based on family conditions and geographic area. The goal is to help the LA city government better understand who is most at risk. To improve model performance, I incorporate domain expertise and external research into feature selection, identifying the factors most relevant to the likelihood of becoming homeless.

Methodology


Data Collection and Cleaning

For data collection, I gathered three datasets of individual homeless youth surveys from 2017-2019, published by the Los Angeles Continuum of Care Homeless Counts. The main challenge was reformatting the survey questions in a uniform fashion so that the three datasets could be concatenated, which involved consulting each year's data dictionary. After reformatting, I examined 119 columns and used a 2015 report, which covers common reasons why young people end up homeless (such as domestic issues with their families), to choose my features.
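The reformat-and-concatenate step could look roughly like the sketch below. It is not the code used in this project: the per-year column names in `RENAME_BY_YEAR`, the `harmonize` helper, the `survey_year` column, and the tiny frames are all hypothetical, since the real mappings come from each year's data dictionary.

```python
import pandas as pd

# Hypothetical per-year column names. The real ones come from each year's
# data dictionary and are not listed in this README.
RENAME_BY_YEAR = {
    2017: {"neglect_17": "dv_neglect", "spa_17": "SPA"},
    2018: {"Neglect": "dv_neglect", "SPA_2018": "SPA"},
    2019: {"dv_neglect": "dv_neglect", "SPA": "SPA"},
}
SHARED_COLUMNS = ["dv_neglect", "SPA"]


def harmonize(frames_by_year):
    """Rename each year's columns to a shared schema, then stack the rows."""
    cleaned = []
    for year, frame in frames_by_year.items():
        renamed = frame.rename(columns=RENAME_BY_YEAR[year])
        # Keep only the shared columns so the frames line up exactly.
        renamed = renamed[SHARED_COLUMNS].copy()
        renamed["survey_year"] = year  # remember which survey a row came from
        cleaned.append(renamed)
    return pd.concat(cleaned, ignore_index=True)


# Tiny made-up frames standing in for the three survey files.
frames = {
    2017: pd.DataFrame({"neglect_17": [0, 1], "spa_17": [4, 6]}),
    2018: pd.DataFrame({"Neglect": [1], "SPA_2018": [2]}),
    2019: pd.DataFrame({"dv_neglect": [0], "SPA": [8]}),
}
combined = harmonize(frames)
print(combined)
```

Keeping the rename maps in one dictionary makes it easy to audit each year's mapping against its data dictionary before the rows are stacked.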

This became my final dataset to train the classification models:

| Variable | Data Type | Value Count | Description |
|---|---|---|---|
| hmlsmorethan1Yr | int64 | 2577 | 0 - No, 1 - Yes: homeless more than 1 year this time |
| dv_neglect | int64 | 2577 | 0 - No, 1 - Yes, 2 - R, 3 - S, 4 - D, 5 - C, 6 - N: faced neglect |
| dv_physical | int64 | 2577 | 0 - No, 1 - Yes, 2 - R, 3 - S, 4 - D, 5 - C, 6 - N: faced physical abuse |
| dv_physical_rel | int64 | 2577 | 0 - No, 1 - Yes, 2 - R, 3 - S, 4 - D, 5 - C, 6 - N: faced physical abuse by parents |
| dv_sexual_rel | int64 | 2577 | 0 - No, 1 - Yes, 2 - R, 3 - S, 4 - D, 5 - C, 6 - N: faced sexual abuse by parents |
| subsabuse | int64 | 2577 | 0 - No, 1 - Yes, 2 - M, 3 - Y, 4 - A: substance abuse problem of long duration (18+ only) |
| drugabuse | int64 | 2577 | 0 - No, 1 - Yes, 2 - M: drug abuse |
| SPA | int64 | 2577 | 1 - 8: Service Planning Areas |

Model Summary

To predict who is at risk of becoming homeless, I used five classification models: Logistic Regression, K-Nearest Neighbors (KNN), Random Forest, AdaBoost, and SVC. Because my target variable, hmlsmorethan1Yr, is imbalanced, I also applied RandomOverSampler, SMOTEN, ADASYN, and class weighting that favors the minority class to counter the imbalance. Of all the combinations, the SVC model with RandomOverSampler performed best, with a balanced accuracy of roughly 58 percent, above the null model's 50 percent baseline. Despite tuning the SVC hyperparameters, I was unable to push performance beyond 58 percent.
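To make two of the imbalance remedies concrete, here is a standard-library sketch of random oversampling and of minority-class weighting. It is a hand-rolled stand-in for illustration, not imbalanced-learn's RandomOverSampler and not the code used in this project. The helper names and toy data are hypothetical, and the weight formula is the "balanced" heuristic documented by scikit-learn, which I am assuming is close to what the weighted models used.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen minority rows until every class is equally large."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):  # zero iterations for the majority class
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    return {cls: n_samples / (len(counts) * c) for cls, c in counts.items()}


# Made-up toy data: six "No" (0) rows and two "Yes" (1) rows.
X = [[0, 1], [1, 1], [0, 0], [1, 0], [0, 3], [1, 4], [1, 1], [0, 1]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_res, y_res = random_oversample(X, y)
weights = balanced_class_weights(y)
print(Counter(y_res))
print(weights)
```

Resampling like this should be applied to the training split only, so that the test set keeps the original class distribution.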

Logistic Regression

| Model | Balanced Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|
| logr | 0.504139 | 0.011236 | 0.666667 | 0.022099 |
| RandomOverSampler | 0.533592 | 0.460674 | 0.381395 | 0.417303 |
| SMOTEN | 0.532262 | 0.455056 | 0.380282 | 0.414322 |
| ADASYN | 0.526644 | 0.443820 | 0.374408 | 0.406170 |
| WeightedLogr | 0.545409 | 0.516854 | 0.389831 | 0.444444 |

KNN

| Model | Balanced Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|
| KNN | 0.522754 | 0.089888 | 0.516129 | 0.153110 |
| RandomOverSampler | 0.562180 | 0.331461 | 0.457364 | 0.384365 |
| SMOTEN | 0.544728 | 0.320225 | 0.422222 | 0.364217 |
| ADASYN | 0.549315 | 0.314607 | 0.434109 | 0.364821 |
| Weightedknn | 0.539010 | 0.146067 | 0.530612 | 0.229075 |

SVC

| Model | Balanced Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|
| SVC | 0.504139 | 0.011236 | 0.666667 | 0.022099 |
| RandomOverSampler | 0.578685 | 0.544944 | 0.425439 | 0.477833 |
| SMOTEN | 0.556496 | 0.544944 | 0.399177 | 0.460808 |
| ADASYN | 0.536932 | 0.668539 | 0.371875 | 0.477912 |
| Weightedsvc | 0.540523 | 0.533708 | 0.383065 | 0.446009 |

Random Forest

| Model | Balanced Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|
| RF | 0.533974 | 0.168539 | 0.468750 | 0.247934 |
| RandomOverSampler | 0.554302 | 0.460674 | 0.407960 | 0.432718 |
| SMOTEN | 0.542916 | 0.443820 | 0.395000 | 0.417989 |
| ADASYN | 0.534456 | 0.539326 | 0.376471 | 0.443418 |
| Weightedrf | 0.535204 | 0.511236 | 0.379167 | 0.435407 |

AdaBoost

| Model | Balanced Accuracy | Recall | Precision | F1 Score |
|---|---|---|---|---|
| ADA | 0.504720 | 0.044944 | 0.400000 | 0.080808 |
| RandomOverSampler | 0.535520 | 0.443820 | 0.385366 | 0.412533 |
| SMOTEN | 0.519696 | 0.426966 | 0.367150 | 0.394805 |
| ADASYN | 0.516272 | 0.500000 | 0.360324 | 0.418824 |
| Weightedada | 0.531082 | 0.443820 | 0.379808 | 0.409326 |
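As a reading aid for the metrics above, the sketch below computes balanced accuracy, recall, precision, and F1 from confusion-matrix counts. The counts are an assumption: they are not reported in this project, but were back-calculated as one set of whole numbers that reproduces the unresampled Logistic Regression row (logr). The function name is hypothetical.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the four reported metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)          # share of actual "Yes" cases that were caught
    specificity = tn / (tn + fp)     # share of actual "No" cases that were caught
    precision = tp / (tp + fp)       # share of predicted "Yes" that were correct
    f1 = 2 * precision * recall / (precision + recall)
    balanced_accuracy = (recall + specificity) / 2
    return {
        "balanced_accuracy": balanced_accuracy,
        "recall": recall,
        "precision": precision,
        "f1_score": f1,
    }


# Inferred counts (an assumption, not taken from the project's output):
# 178 actual "Yes" rows and 338 actual "No" rows in the test split.
metrics = classification_metrics(tp=2, fn=176, fp=1, tn=337)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Under these assumed counts, the model predicts "Yes" only three times, so it is right on nearly every "No" row while missing almost every "Yes" row. Balanced accuracy averages the two per-class rates, which is why it stays near the 50 percent baseline in that situation.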

Next Steps

As next steps, I plan to examine and select additional columns that were not used in the modeling process, collect more data, and observe how model performance changes. I also plan to make the Streamlit app for predicting homelessness more user-friendly.
