[paper] writing #332

Open. Wants to merge 20 commits into `main`. Changes from 11 commits.
6 changes: 6 additions & 0 deletions README.Rmd
@@ -127,6 +127,12 @@ ggplot(iris, aes(x = Species, y = Sepal.Width)) +
theme_minimal()
```

You can also get a "quick" plot using the `plot()` function:

```{r}
plot(means)
```



### Contrast analysis
20 changes: 20 additions & 0 deletions paper/example.R
@@ -0,0 +1,20 @@
library(easystats)

model <- lm(Petal.Width ~ Petal.Length * Species, data = iris)

parameters::parameters(model)


estimate_relation(model, by=c("Petal.Length", "Species"), length=100) |>
plot()


estimate_means(model, by="Species")

estimate_contrasts(model, contrast="Species")

estimate_slopes(model, trend = "Petal.Length", by="Species")

estimate_contrasts(model, contrast="Petal.Length", by="Species", backend="marginaleffects")
# estimate_contrasts(model, contrast="Petal.Length", by="Species")

165 changes: 159 additions & 6 deletions paper/paper.md
@@ -56,28 +56,181 @@ In line with the `easystats`' *raison d'être*, the `modelbased` package is desi
## Predictions

At a fundamental level, `modelbased` and similar packages leverage model *predictions*.
These predictions can be of different types, depending on the model and the question at hand.
For instance, for linear regressions, predictions can be associated with **confidence intervals** (`predict="expectation"`) or **prediction intervals** (`predict="prediction"`).
**Comment (Member):** Not only linear regression? Remove that subordinate sentence?

**Comment (Member):** Yes
The former corresponds to the uncertainty around the "relationship" (i.e., the estimates of the model's conditional parameters), while the latter provides information about the range in which observed values can actually fall (prediction intervals are typically wider than confidence intervals).
For generalized linear models (GLMs), while the distinction between prediction and confidence intervals does not apply in the same way, predictions can be made on the **response scale** (`predict="response"`) or the **link scale** (`predict="link"`).
For logistic models, for instance, this corresponds to predictions in terms of probabilities (response scale) or log-odds (link scale).
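As a minimal sketch of this distinction (using base R's `predict()` on a logistic model of the built-in `mtcars` data, independently of `modelbased`), the two scales are connected by the inverse-link function:

```r
# Illustrative logistic regression: transmission type as a function of mileage
mod <- glm(am ~ mpg, data = mtcars, family = binomial)

# Response scale: predicted probabilities, bounded between 0 and 1
p <- predict(mod, type = "response")

# Link scale: predicted log-odds, unbounded
eta <- predict(mod, type = "link")

# The response-scale predictions are the inverse-logit of the link-scale ones
all.equal(p, plogis(eta), check.attributes = FALSE)
```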

These different types of predictions can be obtained for the original data, which is useful to assess the model's goodness-of-fit, or for new data (typically a "data grid"), which is useful for visualization.

For convenience, the `modelbased` package includes four related functions that mostly differ in their default arguments for `data` and `predict`:

- `estimate_prediction()`: original data, prediction intervals.
- `estimate_expectation()`: original data, confidence intervals.
- `estimate_relation()`: data grid, predictions on the response scale.
- `estimate_link()`: data grid, predictions on the link scale.

*Note:* if the defaults are changed, then these functions can become redundant. For instance, `estimate_relation(..., predict="link")` will be equivalent to `estimate_link(...)`.
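This equivalence can be checked directly. The sketch below assumes `modelbased` is installed and uses a logistic model so that the link (log-odds) and response (probability) scales actually differ:

```r
library(modelbased)

# Logistic model on the built-in mtcars data (illustrative only)
model <- glm(am ~ mpg, data = mtcars, family = binomial)

# Overriding the default `predict` argument makes the two calls equivalent
link1 <- estimate_link(model)
link2 <- estimate_relation(model, predict = "link")
```

Both functions build the same data grid by default, so once `predict` is overridden only the function name differs.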

These functions belong to the same family, and their relevance depends on the model and the research question.


## Marginal effects

The concept of "marginal" in statistical modeling refers to the effect of one variable when all other variables are held constant at specific values (e.g., their reference value, or their empirical or theoretical average).
This is crucial for interpreting how individual predictors influence the response variable in complex models.

- `estimate_means()`: computes **Marginal Means**, i.e., the average predictions for each level of a categorical predictor, averaged across all levels of other predictors.
- `estimate_contrasts()`: computes **Marginal Contrasts**, i.e., comparisons between the marginal means of different levels of a factor, to assess differences or effects.
- `estimate_slopes()`: computes **Marginal Slopes**, i.e., the change in the response variable for a one-unit change in a predictor, holding all other predictors constant. They are essentially the partial derivatives of the response with respect to each predictor, useful for continuous predictors.

The `modelbased` package simplifies the extraction of these effects, providing a clear interface to understand how different predictors interact with outcomes in various scenarios.

[details about types of marginalization]
**Comment (Member Author):** @strengejacke I think here we could mention our 2 main types of marginalization (essentially copypasta our nice docs on that)

**Comment (@strengejacke, Jan 28, 2025):** I can start working/helping on this paper by mid of February, or earlier if it's just minor stuff.

Marginal means, contrasts and slopes.

## Technical details

The package leverages the `get_datagrid()` function from the `insight` package (REF) to intuitively generate an appropriate grid of data points for which predictions will be computed.

The algorithmic heavy lifting is done by its two backend packages, `emmeans` and `marginaleffects`.

Differences?
- `emmeans` (REF) was originally known as `lsmeans`, which stands for "Least-Squares Means".
This term was coined by researchers in the statistical community to describe what are now more commonly referred to as "Estimated Marginal Means" (EMMs), which are essentially predictions averaged over specified levels of factors in the model, while holding continuous variables at their means or other specified values.
The term "Least-Squares Means" was somewhat misleading, as it suggested a method specific to least-squares estimation; hence the package was renamed `emmeans` in 2016 to clarify its applicability to a wider range of models, including generalized linear models, mixed models, and Bayesian models.
- `marginaleffects` (REF) was more recently introduced and also employs the delta method to approximate variance estimation. It is compatible with a wider range of models and allows for more flexibility in the specification of the marginal effects to be estimated.

[What's the difference / benefits / drawbacks of using one or the other?]
**Comment (Member Author):** @mattansb any other interesting facts to mention here?

**Comment (@strengejacke, Jan 28, 2025):** Reference for marginaleffects would be https://www.jstatsoft.org/article/view/v111i09

Arel-Bundock, V., Greifer, N., & Heiss, A. (2024). How to Interpret Statistical Models Using marginaleffects for R and Python. Journal of Statistical Software, 111(9), 1–32. https://doi.org/10.18637/jss.v111.i09

**Comment (Member):** Oh, there's a lot that I can add.

I think the biggest difference would be:

- `emmeans` uses a reference grid (that has by default a "complete" combination of all predictors) to generate expected values, by default conditioning on the mean of numeric variables and marginalizing over categorical/binary variables, using a linear function of the model's coefficients (and variance-covariance matrix to get SEs) to give "predictions at the mean" (predictions for an average observation).
- `marginaleffects` uses unit-level predictions to generate two counterfactual values, the difference of which is then taken (with SEs computed using the delta method) and averaged across all units. By default, the original model frame is used.

Of course, `emmeans` can also use the delta method and can build complex reference grids (that aren't actually "grid"-like), and `marginaleffects` can also generate linear predictions at the mean.

Using the delta method is more computationally costly than using a linear combination (though `marginaleffects` is very efficient). Using linear combinations with orthogonal "grids" also often means that results from `emmeans` directly correspond to a model's coefficients (which is a benefit for those who are used to looking at regression tables to understand their models; this can be shown with an example).

...

Would you like me to add all of this in? Just some of this?

**Comment (Member Author):**

> Would you like me to add all of this in? Just some of this?

Anything you think is relevant, but I think it'll be good to be quite thorough and detailed here, as the inner workings of marginal stuff are not often clearly explained, so having details is good!

**Comment (Member):** Alright, I'll give it some time tomorrow.


Something about the delta method?

This backend can be set as a global option with e.g., `options(modelbased_backend = "marginaleffects")`.

# Examples

Imagine the following linear model, in which we predict the flowers' `Petal.Width` from the interaction between `Petal.Length` and `Species`.

```r
library(easystats)

model <- lm(Petal.Width ~ Petal.Length * Species, data = iris)

parameters::parameters(model)
```
```
Parameter | Coefficient | SE | 95% CI | t(144) | p
------------------------------------------------------------------------------------------
(Intercept) | -0.05 | 0.21 | [-0.47, 0.38] | -0.22 | 0.823
Petal Length | 0.20 | 0.15 | [-0.09, 0.49] | 1.38 | 0.170
Species [versicolor] | -0.04 | 0.32 | [-0.66, 0.59] | -0.11 | 0.909
Species [virginica] | 1.18 | 0.33 | [ 0.52, 1.84] | 3.54 | < .001
Petal Length × Species [versicolor] | 0.13 | 0.16 | [-0.18, 0.44] | 0.83 | 0.405
Petal Length × Species [virginica] | -0.04 | 0.15 | [-0.34, 0.26] | -0.27 | 0.789
```

The **parameters** of this model can be a bit hard to interpret and do not offer us all the insights that we could get from this model.

## Visualize relationship

We can start by easily visualizing the relationship between our response variable and our predictors.

```r
estimate_relation(model, by=c("Petal.Length", "Species"), length=100) |>
plot()
```
**ADD PLOT**

But what is the **average value** of `Petal.Width` for each species?

## Marginal Means

The marginal means can be computed, which are the mean predictions for each level of a categorical predictor, *averaged across* all levels of other predictors (`Petal.Length` in this case).

```r
estimate_means(model, by="Species")
```
```
Estimated Marginal Means

Species | Mean | SE | 95% CI
---------------------------------------
setosa | 0.71 | 0.34 | [0.04, 1.37]
versicolor | 1.16 | 0.04 | [1.09, 1.23]
virginica | 1.74 | 0.09 | [1.57, 1.91]

Marginal means estimated at Species
```

But are these different species **significantly different** from each other?

## Marginal Contrasts

We can estimate all the pairwise contrasts between the levels of the `Species` factor.

```r
estimate_contrasts(model, contrast="Species")
```
```
Marginal Contrasts Analysis

Level1 | Level2 | Difference | 95% CI | SE | t(144) | p
------------------------------------------------------------------------------
setosa | versicolor | -0.45 | [-1.27, 0.37] | 0.34 | -1.34 | 0.183
setosa | virginica | -1.03 | [-1.87, -0.19] | 0.35 | -2.97 | 0.007
versicolor | virginica | -0.58 | [-0.81, -0.35] | 0.09 | -6.18 | < .001

Marginal contrasts estimated at Species
p-value adjustment method: Holm (1979)
```

As we can see, the average difference between *setosa* and *versicolor* is not significant.

## Marginal Slopes

Similarly, we can compute the marginal effect of `Petal.Length` (i.e., the "slope") for each species.

```r
estimate_slopes(model, trend="Petal.Length", by="Species")
```
```
Estimated Marginal Effects

Species | Coefficient | SE | 95% CI | t(144) | p
-----------------------------------------------------------------
setosa | 0.20 | 0.15 | [-0.09, 0.49] | 1.38 | 0.170
versicolor | 0.33 | 0.05 | [ 0.22, 0.44] | 6.14 | < .001
virginica | 0.16 | 0.05 | [ 0.07, 0.25] | 3.49 | < .001

Marginal effects estimated for Petal.Length
```

This shows that there is a significant positive relationship between `Petal.Length` and `Petal.Width` for all species but *setosa*.

## Marginal Contrasts of Slopes

Finally, we can even compute the contrasts between the slopes of `Petal.Length` for each species.

```r
estimate_contrasts(model, contrast="Petal.Length", by="Species")
```
```
Marginal Contrasts Analysis

Parameter | Coefficient | SE | 95% CI | t | p
---------------------------------------------------------------------------
setosa - versicolor | -0.13 | 0.16 | [-0.43, 0.17] | -0.83 | 0.808
setosa - virginica | 0.04 | 0.15 | [-0.26, 0.34] | 0.27 | 0.808
versicolor - virginica | 0.17 | 0.07 | [ 0.03, 0.31] | 2.41 | 0.048

Marginal contrasts estimated at Petal.Length
p-value adjustment method: Holm (1979)
```

The effect of `Petal.Length` on `Petal.Width` is significantly stronger in *virginica* compared to *versicolor*.

# Conclusion

The `modelbased` package provides a simple and intuitive interface to extract and visualize important information contained within statistical models.

# Acknowledgements
