Problem Statement

California leads the country in youth homelessness. To help address this growing issue in Los Angeles, this project uses binary classification models trained on public youth homelessness survey data to predict who is at risk of becoming homeless, based on family conditions and geographic considerations. The goal is to help the LA city government better understand who is most at risk. To improve model accuracy, I incorporate domain expertise and external research into feature selection, identifying relevant factors that influence the likelihood of becoming homeless.

Methodology

Data Collection and Cleaning

For data collection, I gathered three datasets of individual homeless youth surveys from 2017-2019 from the Los Angeles Continuum of Care Homeless Counts. The main challenge was reformatting the questions in a uniform fashion so that the three datasets could be concatenated, which involved consulting their respective data dictionaries. After reformatting the datasets, I examined 119 columns and selected my features using this 2015 report, which covers common reasons young people become homeless, such as domestic issues with their families.
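The reformat-then-concatenate step can be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's actual cleaning code: the raw column names in `COLUMN_MAPS`, the helper `harmonize`, and the sample rows are all hypothetical, and only the shared names `hmlsmorethan1Yr` and `SPA` come from the final dataset.

```python
# Hypothetical per-year mappings from raw survey column names to shared names.
# The raw names are invented for illustration; the real ones come from each
# year's data dictionary.
COLUMN_MAPS = {
    2017: {"Homeless_1yr_plus": "hmlsmorethan1Yr", "spa": "SPA"},
    2018: {"hmls_more_than_1yr": "hmlsmorethan1Yr", "SPA_ID": "SPA"},
    2019: {"hmlsmorethan1Yr": "hmlsmorethan1Yr", "SPA": "SPA"},
}


def harmonize(rows, column_map, year):
    """Keep only the mapped columns, rename them, and tag each row with its year."""
    out = []
    for row in rows:
        renamed = {new: row[old] for old, new in column_map.items()}
        renamed["year"] = year
        out.append(renamed)
    return out


# Made-up sample rows, one per survey year.
raw = {
    2017: [{"Homeless_1yr_plus": 1, "spa": 4, "extra": "x"}],
    2018: [{"hmls_more_than_1yr": 0, "SPA_ID": 6}],
    2019: [{"hmlsmorethan1Yr": 1, "SPA": 2}],
}

# Concatenate the three harmonized datasets into one list of rows.
combined = []
for year, rows in raw.items():
    combined.extend(harmonize(rows, COLUMN_MAPS[year], year))

print(combined)
```

The same idea carries over to pandas with `DataFrame.rename` followed by `pandas.concat`.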

This became my final dataset to train the classification models:

Variable	Data Type	Value Count	Description
hmlsmorethan1Yr	int64	2577	0 - No, 1 - Yes: homeless more than 1 year this time
dv_neglect	int64	2577	0 - No, 1 - Yes, 2 - R, 3 - S, 4 - D, 5 - C, 6 - N: faced neglect
dv_physical	int64	2577	0 - No, 1 - Yes, 2 - R, 3 - S, 4 - D, 5 - C, 6 - N: faced physical abuse
dv_physical_rel	int64	2577	0 - No, 1 - Yes, 2 - R, 3 - S, 4 - D, 5 - C, 6 - N: faced physical abuse by parents
dv_sexual_rel	int64	2577	0 - No, 1 - Yes, 2 - R, 3 - S, 4 - D, 5 - C, 6 - N: faced sexual abuse by parents
subsabuse	int64	2577	0 - No, 1 - Yes, 2 - M, 3 - Y, 4 - A: substance abuse problem of long duration (18+ only)
drugabuse	int64	2577	0 - No, 1 - Yes, 2 - M: Drug Abuse
SPA	int64	2577	1 - 8: Service Planning Areas

Model Summary

To predict who is at risk of becoming homeless, I used five classification models: Logistic Regression, K-Nearest Neighbors (KNN), Random Forest, AdaBoost, and SVC. Because my target variable, hmlsmorethan1Yr, is imbalanced, I also applied RandomOverSampler, SMOTEN, ADASYN, and class weighting that favors the minority class to counter the imbalance. Of the five models, the SVC model with RandomOverSampler performed best, with a balanced accuracy of roughly 58 percent, compared with the null model's baseline of 50 percent. Despite tuning the hyperparameters of the SVC model, I was unable to push its performance above 58 percent.
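To illustrate what random oversampling does, here is a dependency-free sketch of the idea: rows from the minority class are duplicated at random, with replacement, until the classes are the same size. The function `random_oversample` and the toy data are hypothetical; this is not the imbalanced-learn implementation used in the project, only an approximation of its behavior.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate smaller-class rows at random until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = rng.choices(pool, k=target - n)  # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy training split with made-up feature values: 6 "no" (0) and 2 "yes" (1) rows.
X_train = [[0, 1], [1, 1], [0, 3], [2, 5], [0, 8], [1, 2], [1, 4], [1, 7]]
y_train = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = random_oversample(X_train, y_train)
print(Counter(y_bal))
```

Resampling is normally applied to the training split only, so that the test split keeps the original class proportions and the reported scores are not inflated by duplicated rows.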

Logistic Regression
Model	Balanced Accuracy	Recall	Precision	F1 Score
logr	0.504139	0.011236	0.666667	0.022099
RandomOverSampler	0.533592	0.460674	0.381395	0.417303
SMOTEN	0.532262	0.455056	0.380282	0.414322
ADASYN	0.526644	0.443820	0.374408	0.406170
WeightedLogr	0.545409	0.516854	0.389831	0.444444

Model	Balanced Accuracy	Recall	Precision	F1 Score
KNN	0.522754	0.089888	0.516129	0.153110
RandomOverSampler	0.562180	0.331461	0.457364	0.384365
SMOTEN	0.544728	0.320225	0.422222	0.364217
ADASYN	0.549315	0.314607	0.434109	0.364821
Weightedknn	0.539010	0.146067	0.530612	0.229075

Model	Balanced Accuracy	Recall	Precision	F1 Score
SVC	0.504139	0.011236	0.666667	0.022099
RandomOverSampler	0.578685	0.544944	0.425439	0.477833
SMOTEN	0.556496	0.544944	0.399177	0.460808
ADASYN	0.536932	0.668539	0.371875	0.477912
Weightedsvc	0.540523	0.533708	0.383065	0.446009

Model	Balanced Accuracy	Recall	Precision	F1 Score
RF	0.533974	0.168539	0.468750	0.247934
RandomOverSampler	0.554302	0.460674	0.407960	0.432718
SMOTEN	0.542916	0.443820	0.395000	0.417989
ADASYN	0.534456	0.539326	0.376471	0.443418
Weightedrf	0.535204	0.511236	0.379167	0.435407

Model	Balanced Accuracy	Recall	Precision	F1 Score
ADA	0.504720	0.044944	0.400000	0.080808
RandomOverSampler	0.535520	0.443820	0.385366	0.412533
SMOTEN	0.519696	0.426966	0.367150	0.394805
ADASYN	0.516272	0.500000	0.360324	0.418824
Weightedada	0.531082	0.443820	0.379808	0.409326
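As a reading aid for the tables above, the sketch below shows how the four reported metrics follow from confusion-matrix counts, and why a null model that always predicts the majority class scores 0.5 on balanced accuracy. The counts are an assumption: they are back-calculated from the reported ratios (178 positive and 338 negative test rows reproduce the SVC with RandomOverSampler row), not figures published with the project. The helper `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return balanced accuracy, recall, precision, and F1 from confusion counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    # Precision is set to 0 when the model predicts no positives at all.
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return {
        "balanced_accuracy": (recall + specificity) / 2,
        "recall": recall,
        "precision": precision,
        "f1_score": f1,
    }


# Inferred counts (assumption): SVC with RandomOverSampler on the test split.
svc_ros = binary_metrics(tp=97, fp=131, fn=81, tn=207)

# Null model: always predicts the majority class (0), so it finds no positives.
null_model = binary_metrics(tp=0, fp=0, fn=178, tn=338)

print({name: round(value, 6) for name, value in svc_ros.items()})
print(null_model["balanced_accuracy"])  # 0.5
```

Balanced accuracy averages recall on each class, so a model must identify some of the minority class to rise above 0.5, which plain accuracy would not require on imbalanced data.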

Next Steps

As next steps, I plan to examine and select additional columns that were not used in the modeling process, collect more data, and observe any resulting changes in model performance. I also plan to look for ways to make the Streamlit app for predicting homelessness more user-friendly.
