CovidInfectionAnalysis

A collaborative project examining the likelihood of Covid-19 infection among vaccinated and unvaccinated people in the United States.

Segment 1 Deliverable Presentation

Topic: What is the likelihood of being infected by Covid-19? How is infection affected by factors such as vaccination rates, gender, and ethnicity?

Purpose: It is important to analyze future trends following the Covid-19 pandemic to understand the prevalence of infection within the American population.

Data Source: We gathered data from reliable organizations such as Johns Hopkins University and the Centers for Disease Control and Prevention (CDC), which publish their findings as CSV files.

Questions to be answered: Are certain populations more likely to be infected than others? How do these factors affect one another? What other factors should be considered in identifying risks of infection?

Communication Protocols: Our group name is Endless Knot. Members exchange information on Slack and document notes in Google Docs. Group meetings are held virtually on Zoom. We collaborate on our code through GitHub, using our repository (CovidInfectionAnalysis), branches, commits, and pull requests.

Machine Learning Flowchart

• Data Wrangling: We checked the data quality by sorting and filtering with Python. We dropped several unnecessary columns, located null values and removed them with dropna(), removed outliers, and converted strings to numbers.
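A minimal sketch of the wrangling step described above, assuming pandas. The sample table, the column names (state, cases, notes), and the helper clean() are hypothetical stand-ins, not the project's actual files or code.

```python
import pandas as pd

# Hypothetical sample standing in for one of the project's CSV files.
raw = pd.DataFrame({
    "state": ["CA", "TX", "NY", None],
    "cases": ["1,200", "950", None, "400"],
    "notes": ["a", "b", "c", "d"],  # an unneeded column
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop unneeded columns and null rows, then convert counts to numbers."""
    out = df.drop(columns=["notes"])
    out = out.dropna().copy()  # remove rows with any missing value
    # Strip thousands separators, then convert the strings to integers.
    out["cases"] = pd.to_numeric(out["cases"].str.replace(",", "", regex=False))
    return out.reset_index(drop=True)

cleaned = clean(raw)
print(cleaned)
```

Dropping rows before the numeric conversion keeps the conversion simple, because no missing values remain in the column at that point.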

Data Preprocessing

We ended up working with three datasets from the following sources: Covid Data Tracker (CDC), Covid-19 cases by state (Johns Hopkins University), and GenderSci Lab.

Preprocessing – we viewed and examined each CSV file using Excel.

Data Processing:

Using VLOOKUP, we converted the state abbreviations to full names. We identified the common fields to serve as primary keys, mapped the relationships in an Entity Relationship Diagram (ERD), and then merged and analyzed the datasets using SQL and pandas in Jupyter Notebook.
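The lookup and merge can also be done in pandas. The sketch below is an illustration under assumptions: the three small tables, their column names, and their values are made-up placeholders, and the layout of the project's real CSV files may differ.

```python
import pandas as pd

# Hypothetical lookup table and datasets with placeholder values.
states = pd.DataFrame({"abbr": ["CA", "TX"], "state": ["California", "Texas"]})
vax = pd.DataFrame({"abbr": ["CA", "TX"], "pct_fully_vaccinated": [1.0, 2.0]})
cases = pd.DataFrame({"state": ["California", "Texas"], "cases": [10, 20]})

# Equivalent of the VLOOKUP step: attach the full state name to each row.
vax_named = vax.merge(states, on="abbr", how="left")

# Join the two datasets on the shared key (the full state name).
merged = vax_named.merge(cases, on="state", how="inner")
print(merged)
```

A left join in the first step keeps every vaccination row even if an abbreviation is missing from the lookup table, which makes unmatched states easy to spot as null names.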

• Preparing data for machine learning
• Importing libraries and functions: pandas, NumPy, seaborn, matplotlib, and scikit-learn (sklearn.datasets, train_test_split, r2_score, mean_squared_error), plus statsmodels.tsa.arima.model for ARIMA

Steps to run ML algorithm

• Read the dataset
• Activate the ML environment (mlev) in Jupyter Notebook

Data transformation

• Convert strings to numbers using pd.get_dummies
• Split the data into training and testing sets
• Scale the data using StandardScaler(): fit the scaler on the training data, then apply X_train_scaled = X_scaler.transform(X_train) and X_test_scaled = X_scaler.transform(X_test)
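The three transformation steps above can be sketched as follows. The data frame, its column names, and its values are hypothetical placeholders; only pd.get_dummies, train_test_split, StandardScaler, and the X_scaler variable names come from the steps listed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data with one categorical and one numeric feature.
df = pd.DataFrame({
    "region": ["west", "south", "west", "east", "south", "east", "west", "east"],
    "pct_vaccinated": [70.0, 55.0, 68.0, 62.0, 50.0, 65.0, 72.0, 60.0],
    "cases": [100, 180, 110, 140, 200, 130, 95, 150],
})

# 1. Convert the string column to numeric indicator columns.
X = pd.get_dummies(df.drop(columns=["cases"]), dtype=float)
y = df["cases"]

# 2. Split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 3. Fit the scaler on the training data only, then transform both sets.
X_scaler = StandardScaler().fit(X_train)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
```

Fitting the scaler on the training set alone avoids leaking information from the test set into the model.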

• We tried different machine learning algorithms. Since our data is labeled, we used supervised learning. We focused on regression models because we are predicting a continuous value.

Supervised learning and Models:

We used several models:

• Ordinary Least Squares (OLS)
• Linear regression
• Support Vector Machine (SVM)
• ARIMA for time series

The OLS model predicts an output value, within an acceptable error margin, from a set of known input parameters.

Linear regression: coefficient of determination (R²): 0.57037
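The coefficient of determination is defined as R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the total sum of squares around the mean. The sketch below shows how such a score could be computed with scikit-learn. It uses synthetic data, so it does not reproduce the value reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(1)
X = rng.uniform(40, 80, size=(100, 1))
y = 300.0 - 2.5 * X[:, 0] + rng.normal(0, 15, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Score the predictions on the held-out test set.
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

# The same R² computed directly from its definition.
ss_res = ((y_test - y_pred) ** 2).sum()
ss_tot = ((y_test - y_test.mean()) ** 2).sum()
r2_manual = 1 - ss_res / ss_tot
```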

Support Vector Machine (SVM): SVM is a supervised learning model used for classification and regression problems. With a suitable kernel it can solve both linear and non-linear problems, and it works well for many practical problems.
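For regression, scikit-learn provides the SVR estimator. The sketch below compares a linear kernel with an RBF kernel on a deliberately non-linear synthetic target. The data and the kernel choices are assumptions for illustration, not the configuration used in the project.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic non-linear target: a noisy sine wave.
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(80, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=80)

# SVR is sensitive to feature scale, so scale inside a pipeline.
linear_svr = make_pipeline(StandardScaler(), SVR(kernel="linear")).fit(X, y)
rbf_svr = make_pipeline(StandardScaler(), SVR(kernel="rbf")).fit(X, y)

# In-sample R² for each kernel.
linear_r2 = linear_svr.score(X, y)
rbf_r2 = rbf_svr.score(X, y)
print(f"linear kernel R2: {linear_r2:.3f}, RBF kernel R2: {rbf_r2:.3f}")
```

On a curved target like this one, the RBF kernel can follow the shape while the linear kernel cannot, which is the sense in which SVM handles both linear and non-linear problems.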

Time Series for Machine Learning Model :

An ARIMA model is a class of statistical models for analyzing and forecasting time series data. ARIMA stands for Autoregressive Integrated Moving Average. It is a generalization of the simpler Autoregressive Moving Average and adds the notion of integration.

The summary below reports the fitted coefficient values and the skill of the fit on the in-sample observations. The ARIMA model used is ARIMA(5, 1, 0): a lag order of 5 for the autoregression, one order of differencing, and no moving average term.

Next, we plot the density of the residual errors. The plot suggests the errors are Gaussian but may not be centered on zero, which indicates a bias in the prediction (a non-zero mean in the residuals).

The line plot below shows the expected values (blue) compared with the rolling forecast predictions (red). The predictions follow the trend and are in the correct scale.

Plotly - Interactive Visualization

Plotly is an interactive visualization library that we used to visualize the Covid factors in this project. The two factors we wanted to showcase through maps were infections by gender and the total percentage of vaccinations. Two maps show the states with the highest total infections among men and among women. California had the highest rate of infections for both men and women. The maps show that in Texas, men were more likely to be infected than women. Overall, in the states with the highest number of cases, both genders were about equally likely to be infected.

Another map visualizes the percentage of fully vaccinated people in each state. This lets us see which states have the most vaccinations and which have the fewest. California, Oregon, and Washington have a high percentage of vaccinations, while North Carolina has the lowest.

