Streamline model diagnostics plots #45
Comments
How I see these plots:
I'm open to discussing also including residuals vs fitted. Given the above, though, the motivation for including it doesn't arise until the multiple/logistic regression chapter, because its benefits (that I'm aware of; please let me know if I'm missing something) are only evident for models with several variables. However, I'm very apprehensive about removing residuals vs x, which I think is a superior diagnostic approach. I also think residuals vs x is conceptually easier, making it preferable for a first course in statistics. Students get familiar with seeing predictors along the x-axis, and this plot only swaps out one variable (y --> residuals), so it shouldn't feel as new as residuals vs fitted.
In practice, the switch from res vs. x to res vs. predicted feels abrupt to students. They get confused and try to come up with rules like "for SLR you must use res vs. x, while for MLR you must use res vs. predicted." So the conceptual ease (which I agree with) tends to come at the cost of a cognitive burden later. That being said, I'm happy to keep this conversation open for OS and reflect on our experience with how we're framing things in IMS as we update OS to the next edition (which won't happen very soon anyway). I mostly wanted to file this here to not lose the thread.
I think that this is a great question and I'm glad that you are discussing it. Given the increasing importance of multivariate thinking, I suspect that it will merit your continued attention.

I spent a fair amount of time thinking about this as I approached teaching my intro stats class this January. As you know, I dive fairly deep into descriptive multiple regression early in the course, then return to inferential multiple regression at the end (with students undertaking projects where they analyze and interpret data from a multiple regression model).

My prior approach was to have students plot k+1 scatterplots (with superimposed line and smoother) when they had k quantitative predictors: resid vs fitted plus a histogram (with superimposed normal) of the residuals. My experience is that they would get lost in a sea of plots and lose the forest for the trees. It was very common for students to say "my regression assumptions aren't met, so what's the point?"

I would encourage you to focus their attention on the residuals vs. fitted plot in multiple regression land to avoid a profusion of diagnostic plots. To help prepare them for this, I'd encourage you to note that for single-variable models, the two methods communicate nearly identical information (@DavidDiez point 3 above). Here's an example that I used for exactly that purpose.

suppressPackageStartupMessages(library(mosaic))
# Fit the model and attach .fitted and .resid columns to the data
mod1 <- lm(cesd ~ mcs, data = HELPrct) %>%
  broom::augment()

# Plot 1: response vs. predictor
gf_point(cesd ~ mcs, data = mod1) %>%
  gf_smooth() %>%
  gf_lm()
#> `geom_smooth()` using method = 'loess'

# Plot 2: residuals vs. predictor
gf_point(.resid ~ mcs, data = mod1) %>%
  gf_smooth() %>%
  gf_lm() %>%
  gf_labs(y = "residual")
#> `geom_smooth()` using method = 'loess'

# Plot 3: residuals vs. fitted values
gf_point(.resid ~ .fitted, data = mod1) %>%
  gf_smooth() %>%
  gf_lm() %>%
  gf_labs(y = "residual", x = "fitted")
#> `geom_smooth()` using method = 'loess'

Created on 2021-03-03 by the reprex package (v1.0.0)

Sample description: here we demonstrate three ways to explore the linearity and equal variance assumptions for our model with a single quantitative predictor. Note that plots 1 and 2 are quite similar; the only difference is that the negative slope has been regressed out, so the best-fitting straight line for the residuals as a function of the predictor is horizontal. Plot 3 replaces the predictor value with the fitted (predicted) value from the model. Since the slope is negative, this flips the values left to right (see, for example, how the three points on the right of plot 2 are now on the left-hand side of plot 3). Plots 2 and 3 also communicate very similar information because there is only one predictor in the model.

Then add a note that for models with more than one predictor, one can also generate plots of the residuals vs. the individual predictors. Perhaps close with a reminder that we are dealing in a multivariate world and that the model won't be perfect, but we want to detect important deviations from the assumptions if we are to trust its results.

Thanks as always for your efforts on this project: it's enormously valuable.
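For anyone who wants to check the "plots 2 and 3 carry the same information" claim numerically, here is a minimal sketch in base R. It is not part of the reprex above: it assumes the built-in `mtcars` data (`mpg ~ wt`, which also has a negative slope) as a stand-in for `HELPrct`, so that it runs without extra packages. All variable names are illustrative.

```r
# Sketch only: mtcars stands in for HELPrct (an assumption made here so the
# example has no package dependencies).
fit <- lm(mpg ~ wt, data = mtcars)      # one predictor, negative slope

x           <- mtcars$wt
res         <- unname(resid(fit))
fitted_vals <- unname(fitted(fit))
b           <- unname(coef(fit))        # b[1] = intercept, b[2] = slope

# With one predictor, the fitted values are an exact linear function of x,
# so the two horizontal axes differ only by a shift and a rescale.
same_axis <- isTRUE(all.equal(fitted_vals, b[1] + b[2] * x))

# Because the slope is negative, the left-to-right order is exactly reversed.
flipped <- b[2] < 0 && isTRUE(all.equal(cor(x, fitted_vals), -1))

# Least-squares residuals are uncorrelated with both x and the fitted values,
# which is why the straight-line fit in each residual plot is horizontal.
flat_vs_x      <- abs(cor(res, x)) < 1e-8
flat_vs_fitted <- abs(cor(res, fitted_vals)) < 1e-8

# Side-by-side versions of plots 2 and 3.
op <- par(mfrow = c(1, 2))
plot(x, res, xlab = "wt", ylab = "residual")
abline(h = 0, lty = 2)
plot(fitted_vals, res, xlab = "fitted", ylab = "residual")
abline(h = 0, lty = 2)
par(op)
```

The two panels should look like mirror images of each other. With a positive slope they would match without the flip.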
In IMS we're using residuals vs. predicted. In multiple regression chapters of this book we're also using residuals vs. predicted. We should use those (as opposed to residuals vs. x) in the simple linear chapter of this book too.
OpenIntroStat/ims#61 mentions it would be nice to keep this consistent across books.
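As a rough illustration of the consistency argument, the sketch below uses one helper to draw residuals vs. predicted for both a simple and a multiple regression fit. The helper name `resid_vs_fitted` and the `mtcars` models are hypothetical and chosen only for illustration; neither comes from the books.

```r
# Hypothetical helper (not from OS or IMS): one residuals-vs-fitted plot for
# any lm fit, regardless of how many predictors the model has.
resid_vs_fitted <- function(fit) {
  d <- data.frame(
    fitted   = unname(fitted(fit)),
    residual = unname(resid(fit))
  )
  plot(d$fitted, d$residual, xlab = "fitted", ylab = "residual")
  abline(h = 0, lty = 2)   # reference line at zero
  invisible(d)             # return the plotted values for inspection
}

# The same call works for simple and multiple regression.
slr <- resid_vs_fitted(lm(mpg ~ wt, data = mtcars))
mlr <- resid_vs_fitted(lm(mpg ~ wt + hp + qsec, data = mtcars))
```

A residuals vs. x plot, by contrast, would need one panel per predictor in the second model.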