This repository collects our team's work exploring and analyzing the correlation between cancer-related death rates and socioeconomic indicators in the United States. Our main goal is to identify the factors that contribute most to cancer fatalities, using statistical techniques and machine learning models. The datasets used in this repository come from the public-access data website data.world: https://data.world/nrippner/ols-regression-challenge.
GitHub Pages Link: https://ucb-stat-159-s23.github.io/project-group10/Main.html
Instructions for replication:
Install cancerolstools package
The custom package cancerolstools can be installed using pip install . (note the trailing dot).
You can run tests on the package using the command pytest cancerolstools
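After installing, a quick check from Python can confirm that the package is visible on the import path. This is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and is not part of cancerolstools.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if `package_name` can be located on the import path."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    name = "cancerolstools"
    if is_installed(name):
        print(f"{name} is importable")
    else:
        print(f"{name} was not found; try running 'pip install .' again")
```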
Makefile support
The Makefile supports 5 operations: creating an environment, building the JupyterBook, running all the notebooks, cleaning up the folders, and printing documentation.
Project structure:
The package cancerolstools contains 3 scripts. Each maps roughly to the functions required in one of the 3 following notebooks.
The notebook Data-Preparation.ipynb will provide the steps to preprocess and merge the data, including a preliminary step to map the anomalies in the visualization.
The notebook Data-Visualization.ipynb will provide 2 different visualizations intended to guide the analysis.
The notebook Regression-AnalysisV2.ipynb is a Python adaptation of original R code that runs several linear regression techniques on the data. It will create a basic model, apply LASSO penalization, and conduct a nonparametric bootstrap.
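To illustrate the three techniques named above, here is a minimal sketch on synthetic data. It is not the notebook's code: the data, the function names (fit_ols, fit_lasso, bootstrap_ols), and the penalty value are all assumptions made for illustration, and the LASSO is fitted with a small hand-written coordinate descent rather than whatever library the notebook uses. Only NumPy is required.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the project's dataset): 200 rows, 5 predictors,
# only the first two of which actually drive the response.
n, p = 200, 5
X = rng.normal(size=(n, p))
true_beta = np.array([3.0, -2.0, 0.0, 0.0, 0.0])
y = X @ true_beta + rng.normal(scale=1.0, size=n)


def fit_ols(X, y):
    """Basic model: ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]


def soft_threshold(z, t):
    """Shrink z toward zero by t, setting it to zero if |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def fit_lasso(X, y, alpha, n_iter=200):
    """LASSO by cyclic coordinate descent.

    Minimizes (1 / (2n)) * ||y - b0 - X b||^2 + alpha * ||b||_1.
    """
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centering handles the intercept
    col_sq = (Xc ** 2).sum(axis=0) / n
    b = np.zeros(p)
    resid = yc.copy()
    for _ in range(n_iter):
        for j in range(p):
            resid += Xc[:, j] * b[j]         # partial residual excluding feature j
            rho = Xc[:, j] @ resid / n
            b[j] = soft_threshold(rho, alpha) / col_sq[j]
            resid -= Xc[:, j] * b[j]         # put the updated contribution back
    return y_mean - x_mean @ b, b


def bootstrap_ols(X, y, n_boot=500, seed=1):
    """Nonparametric (pairs) bootstrap: resample rows with replacement, refit OLS."""
    boot_rng = np.random.default_rng(seed)
    n = len(y)
    coefs = np.empty((n_boot, X.shape[1]))
    for i in range(n_boot):
        idx = boot_rng.integers(0, n, size=n)
        _, coefs[i] = fit_ols(X[idx], y[idx])
    return coefs


intercept, ols_coefs = fit_ols(X, y)
_, lasso_coefs = fit_lasso(X, y, alpha=0.5)
boot = bootstrap_ols(X, y)
# 95% percentile intervals for each OLS coefficient
ci_low, ci_high = np.percentile(boot, [2.5, 97.5], axis=0)

print("OLS coefficients:  ", np.round(ols_coefs, 3))
print("LASSO coefficients:", np.round(lasso_coefs, 3))
print("Bootstrap 95% CI lower:", np.round(ci_low, 3))
print("Bootstrap 95% CI upper:", np.round(ci_high, 3))
```

In this toy setup the LASSO penalty shrinks the two real effects and drives the three irrelevant coefficients to zero, while the bootstrap intervals give a model-free sense of how variable each OLS coefficient is.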
LICENSE contains information on the license.
environment.yml provides requirements to build the environment to replicate the results.
_config.yml, _toc.yml, and requirements.txt are used for building the JupyterBook.
setup.cfg, setup.py, and pyproj.toml are files for the package cancerolstools.