```{r, echo=FALSE, warning=FALSE, message=FALSE}
## Run but not shown
## Getting data ready for the examples
library(foreign)
library(furniture)
library(tidyverse)
load("~/Dropbox/GitHub/blog_rstats/assets/Data/NHANES_2012.rda")
```
# Chapter 5: Generalized Linear Models {-}
> "You must stick to your conviction, but be ready to abandon your assumptions." --- Dennis Waitley
Generalized linear models (GLMs) extend linear regression to settings where the assumptions of normality and homoskedasticity do not hold. There are several kinds of GLMs, each suited to a different type and distribution of outcome. We are going to walk through several of the most common.
This chapter introduces each method briefly and demonstrates how to run it in `R`. We do not delve deeply into the details of each method; instead, we focus on the quirks of the coding.
We discuss:
1. Logistic Regression
2. Poisson Regression
3. GLM with Gamma distribution
4. Negative binomial
5. Beta Regression
## Logistic Regression {-}
For binary outcomes (e.g., yes or no, correct or incorrect, sick or healthy), logistic regression is a fantastic tool that provides useful and interpretable information. Much like simple and multiple linear regression, logistic regression[^logistic] uses dummy coding and provides coefficients that tell us the relationship between the outcome and the independent variables.
Since the outcome is binary, we model a transformation of it so that things work well: the coefficients come out on the "log-odds" scale. Simply exponentiating the coefficients gives us "odds ratios," which are very common in many fields that use binary data.
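Concretely, for a probability $p$ of the outcome, logistic regression models the log-odds as a linear function of the predictors:

$$
\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k
$$

so that $e^{\beta_j}$ is the odds ratio associated with a one-unit increase in $X_j$.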
Luckily, running a logistic regression is simple in `R`. We first create the binary outcome variable, which we will call `dep2`, using a new function called `mutate()` (we could do this a number of ways, but this is probably the cleanest).
```{r}
## First creating binary depression variable
df <- df %>%
  mutate(dep = dpq010 + dpq020 + dpq030 + dpq040 + dpq050 +
           dpq060 + dpq070 + dpq080 + dpq090) %>%
  mutate(dep2 = ifelse(dep >= 10, 1,
                ifelse(dep < 10, 0, NA)))
```
Note that we added the values of the nine depression items to get an overall depression score (`dep`). We then use `ifelse()` to create a binary version of depression called `dep2`, with a cutoff of $\geq 10$ meaning depressed. Because there are missing values denoted as `NA` in this variable, we use a "nested ifelse" (a tidier `case_when()` alternative is sketched after the list) to say:
1. IF depression $\geq 10$ then dep2 is 1,
2. IF depression $< 10$, then dep2 is 0,
3. ELSE dep2 is NA.
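As a side note, the same recode can be written without nesting using `case_when()` from `dplyr` (already loaded with the tidyverse). This is a minimal, hypothetical sketch of the equivalent logic, not the approach used in the rest of the chapter:

```{r, eval=FALSE}
## Equivalent recode without nesting; unmatched cases become NA by default
df <- df %>%
  mutate(dep2 = case_when(dep >= 10 ~ 1,
                          dep <  10 ~ 0))
```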
Note that these nested `ifelse()` (or `case_when()`) conditions can be as long as you want. We further need to clean up the asthma and sedentary variables.
```{r, message=FALSE, warning=FALSE}
## Fix some placeholders
df <- df %>%
  mutate(asthma = washer(mcq010, 9),
         asthma = washer(asthma, 2, value = 0)) %>%
  mutate(sed = washer(pad680, 9999, 7777))
```
Now let's run the logistic regression:
```{r, message=FALSE, warning=FALSE}
l_fit <- glm(dep2 ~ asthma + sed + race + famsize,
             data = df,
             family = "binomial")
summary(l_fit)
```
We used `glm()`, which stands for generalized linear model. Since `glm()` can also fit an ordinary linear model (using maximum likelihood instead of the least squares of `lm()`), the key to making it logistic is `family = "binomial"`. This tells `R` to run a logistic regression.
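Since the coefficients from `summary(l_fit)` are on the log-odds scale, a common next step (as mentioned above) is to exponentiate them to get odds ratios. A quick sketch; note that `confint()` uses profile likelihood for GLMs, so it can take a moment:

```{r, eval=FALSE}
## Odds ratios and their 95% confidence intervals
exp(coef(l_fit))
exp(confint(l_fit))
```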
## Poisson Regression {-}
As we did in logistic regression, we will use the `glm()` function. The difference here is we will be using an outcome that is a count variable. For example, the sedentary variable (`sed`) that we have in `df` is a count of the minutes of sedentary activity.
```{r, message=FALSE, warning=FALSE}
p_fit <- glm(sed ~ asthma + race + famsize,
             data = df,
             family = "poisson")
summary(p_fit)
```
The sedentary variable may be over-dispersed, as the plot below suggests,
```{r, echo=FALSE, message=FALSE, warning=FALSE}
qplot(df$sed, alpha = .75, binwidth = 75) +
  theme_bw() +
  labs(x = "Minutes of Sedentary Behavior") +
  scale_alpha(guide = FALSE)
```
and so other methods related to Poisson regression may be necessary. For this book, we are not going to delve into these in depth, but we introduce some below.
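As a rough numeric check, the Poisson model assumes the variance equals the mean, so the residual deviance should be close to its residual degrees of freedom. A quick sketch using the model above:

```{r, eval=FALSE}
## Ratio near 1 is consistent with the Poisson assumption;
## values well above 1 suggest over-dispersion
deviance(p_fit) / df.residual(p_fit)
```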
### Gamma {-}
Regression with a gamma distribution is often used when analyzing costs in dollars. It is very similar to Poisson regression but does not require integers and can handle more dispersion. However, the outcome must have values $> 0$. Just for demonstration:
```{r, message=FALSE, warning=FALSE}
## Adjust sed
df$sed_gamma <- df$sed + .01
g_fit <- glm(sed_gamma ~ asthma + race + famsize,
             data = df,
             family = "Gamma")
summary(g_fit)
```
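One detail worth knowing: `family = "Gamma"` in `R` defaults to the inverse link, so the coefficients above are on the inverse scale. A log link is often easier to interpret; the following variant is a sketch, not part of the original example:

```{r, eval=FALSE}
## Gamma regression with a log link; exponentiated coefficients are
## multiplicative effects on the mean of the outcome
g_fit_log <- glm(sed_gamma ~ asthma + race + famsize,
                 data = df,
                 family = Gamma(link = "log"))
exp(coef(g_fit_log))
```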
### Two-Part or Hurdle Models {-}
We are going to use the `pscl` package to run a hurdle model. These models are built for situations where there is a count variable with many zeros ("zero-inflated"). The hurdle model makes slightly different assumptions regarding the zeros than the negative binomial model we present next. It consists of two parts: one modeling whether the count is zero or greater than zero (binomial), and one modeling how large the count is when it is greater than zero (truncated Poisson).
To illustrate, we first make a sedentary variable with many more zeros and then run the hurdle model.
```{r, message=FALSE, warning=FALSE}
## Zero-inflated sedentary (don't worry too much about the specifics)
df$sed_zero <- ifelse(sample(1:100,
                             size = length(df$sed),
                             replace = TRUE) %in% c(5, 10, 11, 20:25), 0,
                      df$sed)
## Hurdle model
library(pscl)
h_fit <- hurdle(sed_zero ~ asthma + race + famsize,
                data = df)
summary(h_fit)
```
Notice that the output has two parts: "Count model coefficients (truncated poisson with log link):" and "Zero hurdle model coefficients (binomial with logit link):". Together they tell us about the relationship between the predictors and a count variable with many zeros.
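Because the count part uses a log link and the zero part a logit link, exponentiating the coefficients again yields interpretable ratios (rate ratios and odds ratios, respectively). A quick sketch:

```{r, eval=FALSE}
## "count_" coefficients become rate ratios;
## "zero_" coefficients become odds ratios
exp(coef(h_fit))
```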
### Negative Binomial {-}
Similar to the hurdle model above, negative binomial regression is used for count variables, particularly over-dispersed ones (which often go hand in hand with many zeros). It makes slightly different assumptions than the hurdle model and does not use a two-part approach. To run a negative binomial model, we'll use the `MASS` package and the `glm.nb()` function.
```{r, eval=FALSE, warning=FALSE, message=FALSE}
library(MASS)
fit_nb <- glm.nb(sed_zero ~ asthma + race + famsize,
                 data = df)
summary(fit_nb)
```
Note that this model is not really appropriate here because our zero-inflated data are contrived.
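With real data, one way to judge whether the negative binomial's extra dispersion parameter is worthwhile is to fit it alongside the Poisson model on the same outcome and compare AIC values. A hypothetical sketch using the original `sed` variable (the model name `nb_sed` is ours):

```{r, eval=FALSE}
## Compare the Poisson fit from earlier to a negative binomial fit;
## lower AIC suggests a better fit-complexity trade-off
## (note: MASS masks dplyr's select(), so load order can matter later)
library(MASS)
nb_sed <- glm.nb(sed ~ asthma + race + famsize,
                 data = df)
AIC(p_fit, nb_sed)
```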
## Beta Regression {-}
For outcomes that are bounded between a lower and upper limit, beta regression is a great method. For example, test scores bounded between 0 and 100 can be analyzed this way after rescaling them to fall strictly between 0 and 1. It is a very flexible method and also allows for some extra analysis regarding the variation.
For this, we are going to use the `betareg` package. But first, we are going to reach a little and create a fictitiously bounded variable in the data set.
```{r, message=FALSE, warning=FALSE}
## Variable bound between 0 and 1
df$beta_var <- sample(seq(.05, .99, by = .01),
                      size = length(df$asthma),
                      replace = TRUE)
library(betareg)
fit_beta <- betareg(beta_var ~ asthma + race + famsize,
                    data = df)
summary(fit_beta)
```
The output provides the mean-model coefficients as well as the "phi" (precision) coefficient. Both are important parts of using beta regression, but we are not going to discuss them here.
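The "extra analysis regarding the variation" mentioned earlier comes from the fact that `betareg()` accepts a two-part formula: the mean model before the `|` and a model for the precision (phi) after it. A hypothetical sketch (choosing `asthma` as the precision predictor purely for illustration):

```{r, eval=FALSE}
## Mean model before "|", precision (phi) model after it
fit_beta2 <- betareg(beta_var ~ asthma + race + famsize | asthma,
                     data = df)
summary(fit_beta2)
```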
There are many resources available to learn more about beta regression and each of these GLMs. For now, we are going to move on to more complex modeling where there is clustering or repeated measures in the data.
## Apply It {-}
[This link](https://tysonbarrett.com/DataR/Chapter4.zip) contains a folder complete with an Rstudio project file, an RMarkdown file, and a few data files. Download it and unzip it to do the following steps.
### Step 1 {-}
Open the `Chapter4.Rproj` file. This will open up RStudio for you.
### Step 2 {-}
Once RStudio has started, in the panel on the lower-right, there is a `Files` tab. Click on that to see the project folder. You should see the data files and the `Chapter4.Rmd` file. Click on the `Chapter4.Rmd` file to open it. In this file, import the data and run each type of statistical analysis presented in this chapter (there are others that are presented in Chapter 4---that you may have done already---and Chapters 6 and 7 that you do not need to do yet).
Once that code is in the file, click the `Knit` button. This will create an HTML file with the code and output knitted together into one nice document. It can be opened in any browser and used to show your work in a clean document.
## Conclusions {-}
One of the great things about `R` is that most modeling functions work much like the basic `lm()` function. In all of these GLMs, the arguments are nearly the same: a formula, the data, and a model family. As you'll see in the chapters on multilevel and other models, this does not change much. A good grasp of basic models and GLMs prepares you for nearly every other modeling type in `R`.
[^logistic]: Technically, logistic regression is still a linear model: it is linear on the log-odds scale.