- HTML: https://www.opencasestudies.org/ocs-bp-diet
- GitHub: https://github.com/opencasestudies/ocs-bp-diet
- Bloomberg American Health Initiative: https://americanhealth.jhu.edu/open-case-studies
The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given dataset, and should not be used in the context of making policy decisions without external consultation from scientific experts.
This case study is part of the OpenCaseStudies project. This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.
To cite this case study please use:
Wright, Carrie and Jager, Leah and Taub, Margaret and Hicks, Stephanie. (2020). https://github.com/opencasestudies/ocs-bp-diet. Exploring global patterns of dietary behaviors associated with health risk (Version v1.0.0).
We would like to acknowledge Jessica Fanzo for assisting in framing the major direction of the case study, as well as Ashkan Afshin and Erin Mullany for giving us access to the data.
We would also like to acknowledge the Bloomberg American Health Initiative for funding this work.
Exploring global patterns of dietary behaviors associated with health risk
According to this article that evaluated food consumption patterns in 185 countries for 15 dietary risk factors with probable associations with non-communicable disease:
“High intake of sodium …, low intake of whole grains …, and low intake of fruits … were the leading dietary risk factors for deaths and DALYs globally and in many countries.”
In this case study we evaluate the data used in this article to explore regional, age, and gender specific differences in dietary consumption patterns around the world in 2017. We particularly focus on dietary consumption patterns within the United States (US) and how these compare to those of other countries.
Our main questions:
- What are the global trends for potentially harmful diets?
- How do males and females compare?
- How do different age groups compare for these dietary factors?
- How do different countries compare? In particular, how does the US compare to other countries in terms of diet trends?
In this case study we will be using data that we requested from the Global Burden of Disease (GBD) about consumption of dietary factors associated with health risks.
We will also be using data from a PDF of an article about the optimal consumption guidelines for these dietary factors.
Their methods for identifying and authenticating incidents are outlined here.
Previously according to their website:
“The database compiles information from more than 25 different sources including peer-reviewed studies, government reports, mainstream media, non-profits, private websites, blogs, and crowd-sourced lists that have been analyzed, filtered, deconflicted, and cross-referenced. All of the information is based on open-source information and 3rd party reporting… and may include reporting errors.”
The skills, methods, and concepts that students will be familiar with by the end of this case study are:
Data Science Learning Objectives:
- Importing/extracting data from PDF (`dplyr`, `stringr`)
- How to reshape data by pivoting between “long” and “wide” formats (`tidyr`)
- Perform functions on all columns of a tibble (`purrr`)
- Data cleaning with regular expressions (`stringr`)
- Specific data value reassignment
- Separate data within a column into multiple columns (`tidyr`)
- Methods to compare data (`dplyr`)
- Combining data from two sources (`dplyr`)
- Make interactive plots (`ggiraph`)
- Make a zoom facet for plot (`ggforce`)
- Combine plots together (`cowplot`)
Statistical Learning Objectives:
- Understanding of how the t-test and the ANOVA are specialized regressions
- Basic understanding of the utility of a regression analysis
- How to implement a linear regression analysis in R
- How to interpret regression coefficients
- Awareness of t-test assumptions
- Awareness of linear regression assumptions
- How to use Q-Q plots to check for normality
- Difference between fixed effects and random effects
- How to perform paired t-test
- How to perform a linear mixed effects regression
In this case study we demonstrate how to import data from a CSV file and from a PDF.
This case study also covers many of the `stringr` functions to manipulate character strings, including `str_split()`, `str_subset()`, `str_replace()`, `str_replace_all()`, `str_which()`, `str_count()`, `str_remove_all()`, and `str_trim()`.
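As a rough illustration of how several of these `stringr` functions fit together, here is a minimal sketch. The text in `raw_text` is hypothetical and only stands in for lines extracted from a PDF; it is not the case study's data or its actual cleaning code.

```r
library(stringr)

# Hypothetical text resembling lines extracted from a PDF table (invented values)
raw_text <- "Fruits  250 g per day\nVegetables  360 g per day\n\nSodium  3 g per day"

lines <- str_split(raw_text, "\n")[[1]]  # one element per line
lines <- str_subset(lines, "\\S")        # keep only lines with non-whitespace content

sodium_row <- str_which(lines, "Sodium")  # position of the line mentioning sodium
n_numbers  <- str_count(lines, "[0-9]+")  # number of digit runs in each line

cleaned <- str_replace_all(lines, " {2,}", ";")  # turn runs of spaces into a delimiter
cleaned <- str_remove_all(cleaned, " per day")   # drop the repeated unit suffix
cleaned <- str_trim(cleaned)                     # strip leading/trailing whitespace
cleaned
```

`str_replace()` works like `str_replace_all()` but only changes the first match in each string.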
This case study also covers how to use `tidyr` functions such as `pivot_wider()` and `pivot_longer()` for reshaping data and the `separate()` function for creating new columns from an existing column.
In addition, the case study covers how to replace `NA` values with a specific value using the `replace_na()` function.
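The reshaping steps described here can be sketched as follows. The tibble `diet_wide`, its column names, and its values are all invented for illustration, and replacing `NA` with `0` is only a demonstration of `replace_na()`, not a recommendation for the real data.

```r
library(tibble)
library(tidyr)

# Hypothetical wide data: one column per food/unit combination (invented values)
diet_wide <- tibble(
  location = c("A", "B"),
  Fruits_g = c(100, NA),
  Sodium_g = c(4, 5)
)

# Wide to long: one row per location and food/unit combination
diet_long <- pivot_longer(diet_wide, cols = -location,
                          names_to = "food_unit", values_to = "mean_intake")

# Split "Fruits_g" into separate "food" and "unit" columns
diet_long <- separate(diet_long, food_unit, into = c("food", "unit"), sep = "_")

# Replace missing intake values with a specific value (0 here, for illustration only)
diet_long <- replace_na(diet_long, list(mean_intake = 0))

# Long back to wide: one column per food
diet_back <- pivot_wider(diet_long, names_from = food, values_from = mean_intake)
```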
This case study also goes over how to use many of the `dplyr` functions to modify, select, and filter data, such as `rename()`, `mutate()`, `arrange()`, `select()`, and `filter()`, as well as functions to compare data, such as `setequal()`, `all_equal()`, and `setdiff()`, and the related `intersect()` function for finding overlap. The case study describes the differences between these functions. We also introduce how to recode data using the `if_else()` and `case_when()` functions and how to join data using the `full_join()` function.
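To make the comparison, recoding, and joining functions more concrete, here is a small sketch. The tibbles `observed` and `guideline` and all of their values are hypothetical; they are not the GBD estimates or the published guidelines.

```r
library(dplyr)
library(tibble)

# Hypothetical inputs (invented values, not real intake data or guidelines)
observed  <- tibble(food = c("Fruits", "Sodium", "Red meat"), intake  = c(100, 4, 30))
guideline <- tibble(food = c("Fruits", "Sodium", "Legumes"),  optimal = c(250, 3, 60))

same_foods <- setequal(observed$food, guideline$food)   # do both sources list the same foods?
shared     <- intersect(observed$food, guideline$food)  # foods present in both sources
only_obs   <- setdiff(observed$food, guideline$food)    # foods only in the observed data

# Keep every food from either source; unmatched values become NA
combined <- full_join(observed, guideline, by = "food")

# Recode each row into a label based on several conditions, checked in order
combined <- mutate(combined, status = case_when(
  is.na(intake) | is.na(optimal) ~ "unmatched",
  intake >= optimal              ~ "at or above guideline",
  TRUE                           ~ "below guideline"
))
```

Note that `setdiff()` is not symmetric: `setdiff(guideline$food, observed$food)` would return the foods found only in the guideline data instead.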
We also cover how to use the `purrr` package `map()` function to apply the same function to multiple columns in a tibble.
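A minimal sketch of this pattern, assuming a hypothetical tibble `df` whose columns were read in as untidy character strings:

```r
library(purrr)
library(tibble)

# Hypothetical character columns with stray whitespace (invented values)
df <- tibble(a = c(" 1", "2 "), b = c("3", " 4 "))

# map() applies the same function to every column and returns a named list;
# as_tibble() turns that list back into a tibble
df_num <- as_tibble(map(df, ~ as.numeric(trimws(.x))))
```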
In this case study we show how to make faceted plots, as well as plots with a facet that is zoomed in using the `facet_zoom()` function of the `ggforce` package. We cover how to highlight specific data points, as well as how to add annotations and horizontal lines to make the plot more interpretable.
We also demonstrate how to make interactive plots where the data points link you to other websites using the `ggiraph` package. Finally, we demonstrate how to combine plots using the `cowplot` package.
We also cover how to use the `viridis` package to make plots that are more interpretable for those who are colorblind.
This case study has a particularly thorough analysis section, which examines the data using methods of increasing complexity. We describe how the t-test and the ANOVA are actually specialized forms of regression analysis.
We provide an introduction to regression analysis.
We also describe paired data and how to interpret this using both a paired t-test and a linear model with fixed effects or a linear model with mixed effects. We also describe the difference between random and fixed effects.
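The equivalence between these tests and regression can be checked numerically. The sketch below uses simulated data (all variable names and values are invented, not the case study's data) and base R only: an equal-variance two-sample t-test gives the same t statistic as `lm()` with a two-level factor, and a paired t-test gives the same t statistic as `lm()` with a fixed effect for each subject.

```r
set.seed(1)

# Two independent groups (simulated)
group <- factor(rep(c("A", "B"), each = 20))
y <- rnorm(40, mean = ifelse(group == "A", 10, 12))

tt  <- t.test(y ~ group, var.equal = TRUE)  # classic two-sample t-test
fit <- summary(lm(y ~ group))               # regression on a two-level factor

# t.test() reports A minus B and lm() reports B minus A, so only the sign differs
t_two_sample <- unname(tt$statistic)
t_regression <- fit$coefficients["groupB", "t value"]

# Paired data: the same 15 subjects measured twice (simulated)
n <- 15
before <- rnorm(n, mean = 50, sd = 5)
after  <- before + rnorm(n, mean = 2, sd = 1)

pt <- t.test(after, before, paired = TRUE)

# Long format with a fixed effect (one coefficient) per subject
subject <- factor(rep(seq_len(n), times = 2))
time    <- factor(rep(c("before", "after"), each = n), levels = c("before", "after"))
value   <- c(before, after)
pfit    <- summary(lm(value ~ time + subject))

t_paired <- unname(pt$statistic)
t_fixed  <- pfit$coefficients["timeafter", "t value"]
```

A mixed effects model would instead treat `subject` as a random effect, for example with the `lmerTest` package listed among the packages for this case study.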
See this other case study for more introductory material about comparing groups, hypothesis testing, probability, distributions, normality, paired data, and the paired t-test.
RStudio
Cheatsheet on RStudio IDE
Other RStudio cheatsheets
RStudio projects
String manipulation cheatsheet
Table formats
Terms and concepts covered:
Interpunct
Regular expressions
Inference
Regression
Different types of regression
Ordinary least squares method
Residual
t-tests
ANOVA
t-tests and ANOVA are equivalent to regression (also see here and here about how many commonly known statistical tests are specialized forms of regression)
Normal Distribution
Q-Q plot
Guide to residual diagnostic plots and Examples
Residual vs fitted plot
Scale-location plot
Homoscedasticity
Heteroscedasticity
Interpreting `lm()` output
Coefficients
Linear mixed effects regression
Satterthwaite formula
Mood’s Two-Sample Scale Test
Standard deviation
Homogeneity of Variances assumption
polyunsaturated fatty acids
Tests of Homogeneity of Variance for 3 or more groups:
Bartlett’s test
Fligner-Killeen
Levene’s test
Other helpful links:
Long and Wide Data Formats
Distributions
Skewed Distributions
Bimodal Distribution
ggplot2
Shapiro-Wilk Test
Paired Data
Welch’s t-test
Parametric and Nonparametric Methods
Variance
Balanced Study Design
Independent Observations
Transformation
Permutation/Resampling Methods
Central Limit Theorem
Wilcoxon Signed Rank Test
Wilcoxon Rank Sum Test
Two-sample Kolmogorov-Smirnov Test
Type 1 Error
p-value
Multiple Testing
Bonferroni Method of Multiple Testing Correction
Packages used in this case study:
Package | Use in this case study |
---|---|
here | to easily load and save data |
readr | to import the CSV file data |
dplyr | to arrange/filter/select/compare specific subsets of the data |
skimr | to get an overview of data |
pdftools | to read a PDF into R |
stringr | to manipulate the text within the PDF of the data |
magrittr | to use the %<>% piping operator |
purrr | to perform functions on all columns of a tibble |
tibble | to create data objects that we can manipulate with dplyr/stringr/tidyr/purrr |
tidyr | to separate data within a column into multiple columns |
ggplot2 | to make visualizations with multiple layers |
ggpubr | to easily add regression line equations to plots |
forcats | to change details about factors (categorical variables) |
lmerTest | to perform linear mixed model testing |
car | to perform Levene’s Test of Homogeneity of Variances |
ggiraph | to make plots interactive |
ggforce | to modify facets in plots |
viridis | to plot in a color palette that is more interpretable for those who are colorblind |
cowplot | to allow plots to be combined |
There is a `Makefile` in this folder that allows you to type `make` to knit the case study contained in `index.Rmd` to `index.html`; it will also knit `README.Rmd` to a markdown file (`README.md`).
Users can skip the Data Import and Data Wrangling sections to start with the Data Analysis and Visualization section if they wish.
Instructors can skip the Data Import and Data Wrangling sections and start with either the Data Exploration, Data Analysis, or Data Visualization sections if they wish.
This case study is appropriate for those new to R programming. It is also appropriate for more advanced R users who are new to the Tidyverse. This particular case study may require some introductory knowledge of R programming, particularly for creating visualizations.
Students can evaluate consumption estimates of another dietary factor besides red meat.