California leads the country in youth homelessness. To help address this growing issue in Los Angeles, this project uses binary classification models trained on public youth homelessness survey data to predict who is at risk of becoming homeless, based on family conditions and geographic location. The goal is to help the LA city government better understand who is most at risk. To improve model performance, I incorporate domain expertise and external research into feature selection, identifying the factors most relevant to the likelihood of becoming homeless.
For data collection, I gathered three datasets of individual homeless youth surveys from 2017 to 2019 from the Los Angeles Continuum of Care Homeless Counts. The main challenge was reformatting the survey questions in a uniform fashion so that the three datasets could be concatenated, which involved consulting each dataset's data dictionary. After reformatting, I examined 119 columns and selected my features using a 2015 report that covers common reasons young people become homeless, such as domestic issues within their families.
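The harmonization step can be sketched roughly as follows. This is a minimal standard-library illustration, not the project's actual code: the per-year source column names, the `survey_year` field, and the toy rows are all hypothetical, and the real mappings would come from each year's data dictionary.

```python
# Hypothetical per-year column names mapped to the shared schema.
# The real mappings come from each year's data dictionary.
RENAME_BY_YEAR = {
    2017: {"neglect": "dv_neglect", "spa": "SPA"},
    2018: {"DV_Neglect": "dv_neglect", "SPA": "SPA"},
    2019: {"dv_neglect": "dv_neglect", "Spa": "SPA"},
}


def harmonize(rows, year):
    """Rename one year's columns to the shared schema and tag rows with the year."""
    mapping = RENAME_BY_YEAR[year]
    harmonized = []
    for row in rows:
        # Keep only mapped columns; unmapped survey questions are dropped.
        renamed = {mapping[key]: value for key, value in row.items() if key in mapping}
        renamed["survey_year"] = year
        harmonized.append(renamed)
    return harmonized


def combine(surveys):
    """Concatenate the harmonized rows of every survey year into one list."""
    combined = []
    for year in sorted(surveys):
        combined.extend(harmonize(surveys[year], year))
    return combined


# Toy rows standing in for the real survey files.
surveys = {
    2017: [{"neglect": 1, "spa": 4, "unused_question": 9}],
    2018: [{"DV_Neglect": 0, "SPA": 6}],
    2019: [{"dv_neglect": 1, "Spa": 2}],
}
combined = combine(surveys)
```

Keeping the mappings in one dictionary makes each year's renaming decisions easy to audit against its data dictionary.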
The final dataset used to train the classification models:
Variable | Data Type | Value Count | Description |
---|---|---|---|
hmlsmorethan1Yr | int64 | 2577 | 0 - No, 1 - Yes: homeless more than 1 year this time |
dv_neglect | int64 | 2577 | 0 - No, 1 - Yes, 2 - R, 3 - S, 4 - D, 5 - C, 6 - N: faced neglect |
dv_physical | int64 | 2577 | 0 - No, 1 - Yes, 2 - R, 3 - S, 4 - D, 5 - C, 6 - N: faced physical abuse |
dv_physical_rel | int64 | 2577 | 0 - No, 1 - Yes, 2 - R, 3 - S, 4 - D, 5 - C, 6 - N: faced physical abuse by parents |
dv_sexual_rel | int64 | 2577 | 0 - No, 1 - Yes, 2 - R, 3 - S, 4 - D, 5 - C, 6 - N: faced sexual abuse by parents |
subsabuse | int64 | 2577 | 0 - No, 1 - Yes, 2 - M, 3 - Y, 4 - A: substance abuse problem of long duration (18+ only) |
drugabuse | int64 | 2577 | 0 - No, 1 - Yes, 2 - M: Drug Abuse |
SPA | int64 | 2577 | 1 - 8: Service Planning Areas |
To predict who is at risk of becoming homeless, I used five classification models: Logistic Regression, K-Nearest Neighbors (KNN), Random Forest, AdaBoost, and SVC. Because the target variable, hmlsmorethan1Yr, is imbalanced, I also applied RandomOverSampler, SMOTEN, ADASYN, and class weighting that favors the minority class to counter the imbalance. Of the five models, SVC with RandomOverSampler performed best, with a balanced accuracy of roughly 58 percent against the null model's baseline of 50 percent. Hyperparameter tuning did not push the SVC model beyond 58 percent.
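To make the resampling idea concrete, here is a minimal sketch of random oversampling using only the standard library. The names RandomOverSampler, SMOTEN, and ADASYN match classes in the imbalanced-learn package, but this hand-rolled function is an illustration of the general technique, not that library's implementation or the project's code. The function name `random_oversample` and the toy data are assumptions.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        # Indices of the original rows that belong to this class.
        indices = [i for i, current in enumerate(y) if current == label]
        for _ in range(target - count):
            picked = rng.choice(indices)  # sample with replacement
            X_out.append(X[picked])
            y_out.append(label)
    return X_out, y_out


# Toy data: four majority rows (0) and two minority rows (1).
X = [[0, 1], [1, 1], [0, 2], [1, 3], [0, 4], [1, 5]]
y = [0, 0, 0, 0, 1, 1]
X_res, y_res = random_oversample(X, y)
balanced_counts = Counter(y_res)
```

Resampling should be applied to the training split only, so that the test set keeps the original class distribution and the reported metrics stay honest.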
Model | Balanced Accuracy | Recall | Precision | F1 Score |
---|---|---|---|---|
Logistic Regression | ||||
logr | 0.504139 | 0.011236 | 0.666667 | 0.022099 |
RandomOverSampler | 0.533592 | 0.460674 | 0.381395 | 0.417303 |
SMOTEN | 0.532262 | 0.455056 | 0.380282 | 0.414322 |
ADASYN | 0.526644 | 0.443820 | 0.374408 | 0.406170 |
WeightedLogr | 0.545409 | 0.516854 | 0.389831 | 0.444444 |
Model | Balanced Accuracy | Recall | Precision | F1 Score |
---|---|---|---|---|
KNN | 0.522754 | 0.089888 | 0.516129 | 0.153110 |
RandomOverSampler | 0.562180 | 0.331461 | 0.457364 | 0.384365 |
SMOTEN | 0.544728 | 0.320225 | 0.422222 | 0.364217 |
ADASYN | 0.549315 | 0.314607 | 0.434109 | 0.364821 |
Weightedknn | 0.539010 | 0.146067 | 0.530612 | 0.229075 |
Model | Balanced Accuracy | Recall | Precision | F1 Score |
---|---|---|---|---|
SVC | 0.504139 | 0.011236 | 0.666667 | 0.022099 |
RandomOverSampler | 0.578685 | 0.544944 | 0.425439 | 0.477833 |
SMOTEN | 0.556496 | 0.544944 | 0.399177 | 0.460808 |
ADASYN | 0.536932 | 0.668539 | 0.371875 | 0.477912 |
Weightedsvc | 0.540523 | 0.533708 | 0.383065 | 0.446009 |
Model | Balanced Accuracy | Recall | Precision | F1 Score |
---|---|---|---|---|
RF | 0.533974 | 0.168539 | 0.468750 | 0.247934 |
RandomOverSampler | 0.554302 | 0.460674 | 0.407960 | 0.432718 |
SMOTEN | 0.542916 | 0.443820 | 0.395000 | 0.417989 |
ADASYN | 0.534456 | 0.539326 | 0.376471 | 0.443418 |
Weightedrf | 0.535204 | 0.511236 | 0.379167 | 0.435407 |
Model | Balanced Accuracy | Recall | Precision | F1 Score |
---|---|---|---|---|
ADA | 0.504720 | 0.044944 | 0.400000 | 0.080808 |
RandomOverSampler | 0.535520 | 0.443820 | 0.385366 | 0.412533 |
SMOTEN | 0.519696 | 0.426966 | 0.367150 | 0.394805 |
ADASYN | 0.516272 | 0.500000 | 0.360324 | 0.418824 |
Weightedada | 0.531082 | 0.443820 | 0.379808 | 0.409326 |
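For readers less familiar with the metrics in these tables, the sketch below shows how they follow from confusion-matrix counts, and why a model that always predicts the majority class scores exactly 0.5 on balanced accuracy. The function names and the counts passed to the null model are illustrative assumptions. The precision and recall values in the last line are taken from the SVC with RandomOverSampler row.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the reported metrics from confusion-matrix counts (class 1 is positive)."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = f1_from(precision, recall)
    return {
        # Balanced accuracy is the mean of recall on each class.
        "balanced_accuracy": (recall + specificity) / 2,
        "recall": recall,
        "precision": precision,
        "f1_score": f1,
    }


def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A null model that predicts class 0 for everyone (hypothetical counts):
# recall is 0 and specificity is 1, so balanced accuracy is 0.5.
null_model = classification_metrics(tp=0, fp=0, tn=70, fn=30)

# Recompute F1 from the precision and recall in the SVC + RandomOverSampler row.
svc_ros_f1 = f1_from(0.425439, 0.544944)
```

The recomputed F1 agrees with the 0.477833 reported in the table to within rounding.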
As next steps, I plan to examine and select additional columns that were not used in modeling, collect more data, and observe how model performance changes. I also plan to make the Streamlit app for predicting homelessness more user-friendly.