GitHub page: https://ucb-stat-159-s23.github.io/project-Group11/
This project analyzes the factors that affect student retention at universities in a reproducible manner. We conduct exploratory data analysis, compute feature importance, and fit logistic regression models for prediction. The data come from the U.S. Department of Education College Scorecard, specifically the institution-level data files for 1996-97 through 2020-21, which contain aggregate data for each institution. The dataset includes information on institutional characteristics, enrollment, student aid, costs, and student outcomes. You can find more information here: https://collegescorecard.ed.gov/data/
- To run all tests for utility functions, run `pytest tools` from the terminal in the root directory.
- To create the environment, run `make env` from the terminal in the root directory.
- In addition, you may run `make html` to build the jupyter-book html, `make clean` to remove figures and html files, and `make all` to run all the aforementioned `make` commands.
- `/data`: cleaned data csv files
- `/figures`: all generated figures as png files
- Jupyter Notebooks:
  - `main.ipynb`: the main narrative notebook of the research project
  - `EDA.ipynb`: introduction, data description, and some basic exploratory data analysis of retention rates
  - `EDA_control_of_school`: exploratory data analysis of the control of the school, with the related figures
  - `EDA_in_out_state_tuition`: exploratory data analysis of in-state and out-of-state tuition and fees, with the related figures
  - `EDA_rece_loans`: exploratory data analysis of the percentage of individuals receiving federal loans, with the related figures
  - `EDA_parent_edu`: exploratory data analysis of parent education, with the related figures
  - `EDA_program_offered`: exploratory data analysis of programs offered, with the related figures
  - `EDA_num_var`: feature analysis on numerical variables
  - `Variable_Analysis_1.ipynb`: code for the logistic regression model of retention rates at four-year institutions, with the corresponding figures
  - `Variable_Analysis_2.ipynb`: code for the logistic regression model of retention rates at less-than-four-year institutions, with the corresponding figures
- Python utility package tools:
  - `/tools`: code and tests for python package
- Setup files: `setup.py`, `setup.cfg`, `pyproj.toml`
- Environment files: `environment.yml`, `envsetup.sh`
- Jupyter Book: `_config.yml`, `_toc.yml`, `conf.py`, `postBuild`, `requirements.txt`
- `contribution_statement.md`: authors' contributions
- `UNITID`: Unit ID for institution
- `CONTROL`: Control of institution
- `CCUGPROF`: Carnegie Classification -- undergraduate profile
- `CCSIZSET`: Carnegie Classification -- size and setting
- `ADM_RATE`: Admission rate
- `SAT_AVG`: Average SAT equivalent score of students admitted
- `UG`: Enrollment of all undergraduate students
- `UGDS_[...]`: Total share of enrollment of undergraduate degree-seeking students who are [...]
  - `UGDS_WHITE` (white), `UGDS_BLACK` (black), `UGDS_HISP` (Hispanic), `UGDS_ASIAN` (Asian), `UGDS_AIAN` (American Indian/Alaska Native), `UGDS_NHPI` (Native Hawaiian/Pacific Islander), `UGDS_2MOR` (two or more races), `UGDS_NRA` (non-resident aliens), `UGDS_UNKN` (unknown), `UGDS_WHITENH` (white non-Hispanic)
- `NPT4_PUB`: Average net price for Title IV institutions (public institutions)
- `NPT4_PRIV`: Average net price for Title IV institutions (private for-profit and nonprofit institutions)
- `TUITIONFEE_IN`: In-state tuition and fees
- `TUITIONFEE_OUT`: Out-of-state tuition and fees
- `AVGFACSAL`: Average faculty salary
- `PCTPELL`: Percentage of undergraduates who receive a Pell grant
- `RET_FT4`: First-time, full-time student retention rate at four-year institutions
- `RET_FTL4`: First-time, full-time student retention rate at less-than-four-year institutions
- `PCTFLOAN`: Percent of all undergraduate students receiving a federal student loan
- `PAR_ED_PCT_MS`: Percent of students whose parents' highest educational level is middle school
- `PAR_ED_PCT_HS`: Percent of students whose parents' highest educational level is high school
- `PAR_ED_PCT_PS`: Percent of students whose parents' highest educational level is some form of postsecondary education
- `DEP_INC_AVG`: Average family income of dependent students in real 2015 dollars
- `IND_INC_AVG`: Average family income of independent students in real 2015 dollars
- `GRAD_DEBT_MDN`: The median debt for students who have completed
- `WDRAW_DEBT_MDN`: The median debt for students who have not completed
- `FAMINC`: Average family income
- `MD_FAMINC`: Median family income
- `PRGMOFR`: Number of programs offered
For more information on the additional variables, please refer to the Data Dictionary. Make sure to download the data dictionary for the Most Recent Institution-Level Data; the descriptions of all the variables are in the `Institution_Data_Dictionary` tab.
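As a rough illustration of how the variables listed above might be pulled out of an institution-level file, here is a minimal sketch using only the Python standard library. It is not the project's own loading code. The sample rows are made up for illustration, the helper names (`to_float`, `load_rows`) are hypothetical, and the sketch assumes that missing or suppressed cells appear as the strings `NULL` or `PrivacySuppressed`.

```python
import csv
import io

# Column names taken from the variable list; which ones to keep is an assumption.
CODE_COLUMNS = {"UNITID", "CONTROL"}                  # identifiers / category codes
NUMERIC_COLUMNS = {"ADM_RATE", "PCTPELL", "RET_FT4"}  # rates stored as fractions

# Assumed missing-value markers; adjust to whatever the downloaded file uses.
MISSING = {"", "NULL", "PrivacySuppressed"}

# Made-up rows for illustration only, not real Scorecard data.
SAMPLE = """UNITID,INSTNM,CONTROL,ADM_RATE,PCTPELL,RET_FT4,UGDS_WHITE,UGDS_BLACK
1,Example A,1,0.50,0.30,0.80,0.6,0.1
2,Example B,2,NULL,0.45,PrivacySuppressed,0.4,0.2
"""


def to_float(value):
    """Return a float, or None when the cell holds a missing-value marker."""
    return None if value in MISSING else float(value)


def load_rows(handle):
    """Keep the code columns as strings and the numeric and UGDS_* columns as floats."""
    rows = []
    for record in csv.DictReader(handle):
        row = {}
        for name, value in record.items():
            if name in CODE_COLUMNS:
                row[name] = value
            elif name in NUMERIC_COLUMNS or name.startswith("UGDS_"):
                row[name] = to_float(value)
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE))

# Rows without a four-year retention rate cannot be used to model RET_FT4.
complete = [row for row in rows if row["RET_FT4"] is not None]
```

To read a real download, replace `io.StringIO(SAMPLE)` with an open file handle, for example `open(path, newline="")`, where `path` points at the CSV you downloaded.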
- To guarantee that the correct environment is in place, execute the `make env` command. If an outdated version of the environment is already set up, or if you need to modify the environment, activate a separate environment using `conda activate` and then execute `make remove-env` first to remove the prior setup.
- After running `conda activate final_proj`, run `conda install -c anaconda pytest`.
- Run `pytest tools`.
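For readers unfamiliar with pytest, the sketch below shows the general shape of a utility function and the tests that `pytest tools` would collect. The function `retention_gap` is hypothetical and is not taken from the project's `tools` package. With its default settings, pytest picks up functions named `test_*` in files named `test_*.py` or `*_test.py`.

```python
# Hypothetical utility, standing in for a function that might live in tools/.
def retention_gap(rate_a, rate_b):
    """Return the absolute difference between two retention rates in [0, 1]."""
    for rate in (rate_a, rate_b):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"retention rate out of range: {rate}")
    return abs(rate_a - rate_b)


def test_retention_gap_is_symmetric():
    # Swapping the arguments must not change the result.
    assert retention_gap(0.8, 0.6) == retention_gap(0.6, 0.8)


def test_retention_gap_rejects_bad_input():
    # Rates outside [0, 1] should raise instead of returning a number.
    try:
        retention_gap(1.2, 0.5)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for a rate above 1")
```

The second test uses a plain `try`/`except` so the sketch runs without importing pytest. In a real test file, `pytest.raises(ValueError)` is the more idiomatic form.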
To run the main notebook, you have two options: use the provided Binder link, or clone the repository and run the notebook locally. If you run the notebook locally, the included Makefile helps streamline the process.
The project is released under the BSD 3-clause License.