
Adds data argument to effectsize.htest/cohens_d #522

Merged
35 commits merged from htest_data into easystats:main on May 11, 2024

Conversation

@rempsyc (Member) commented Oct 16, 2022

Fixes easystats/report#245


# Load effectsize
devtools::load_all(".../forks/effectsize")

# "Right" method
y <- t.test(mtcars$mpg ~ mtcars$vs)
cohens_d(y)
#> Cohen's d |         95% CI
#> --------------------------
#> -1.70     | [-2.55, -0.82]
#> 
#> - Estimated using un-pooled SD.

# Approximation method
x <- t.test(mpg ~ vs, data = mtcars)
cohens_d(x)
#> Warning: Unable to retrieve data from htest object. Returning an approximate
#>   effect size using t_to_d().
#> d     |         95% CI
#> ----------------------
#> -1.96 | [-2.94, -0.95]

# New method with data specified
cohens_d(x, data = mtcars)
#> Cohen's d |         95% CI
#> --------------------------
#> -1.70     | [-2.55, -0.82]
#> 
#> - Estimated using un-pooled SD.

# Comparison
cohens_d(y) == cohens_d(x, data = mtcars)
#>                                                 Cohens_d   CI CI_low CI_high
#> difference in means between group 0 and group 1     TRUE TRUE   TRUE    TRUE

# Approximation method
report::report_table(x)
#> Warning: Unable to retrieve data from htest object. Returning an approximate
#>   effect size using t_to_d().
#> Welch Two Sample t-test
#> 
#> Parameter | Group | Mean_Group1 | Mean_Group2 | Difference |          95% CI | t(22.72) |      p |     d |          d  CI
#> -------------------------------------------------------------------------------------------------------------------------
#> mpg       |    vs |       16.62 |       24.56 |      -7.94 | [-11.46, -4.42] |    -4.67 | < .001 | -1.96 | [-2.94, -0.95]
#> 
#> Alternative hypothesis: two.sided

# New method with data specified
report::report_table(x, data = mtcars)
#> Welch Two Sample t-test
#> 
#> Parameter | Group | Mean_Group1 | Mean_Group2 | Difference |          95% CI | t(22.72) |      p | Cohen's d |   Cohens_d  CI
#> -----------------------------------------------------------------------------------------------------------------------------
#> mpg       |    vs |       16.62 |       24.56 |      -7.94 | [-11.46, -4.42] |    -4.67 | < .001 |     -1.70 | [-2.55, -0.82]
#> 
#> Alternative hypothesis: two.sided

Created on 2022-10-16 with reprex v2.0.2

@codecov-commenter commented Oct 16, 2022

Codecov Report

Attention: Patch coverage is 90.78947%, with 7 lines in your changes missing coverage. Please review.

Project coverage is 90.75%. Comparing base (16d832b) to head (1dc280f).
Report is 1 commit behind head on main.

❗ Current head 1dc280f differs from pull request most recent head a9c77ec. Consider uploading reports for the commit a9c77ec to get more accurate results

Files                 | Patch % | Lines
R/effectsize.htest.R  |  90.14% | 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #522      +/-   ##
==========================================
+ Coverage   90.71%   90.75%   +0.04%     
==========================================
  Files          57       56       -1     
  Lines        3563     3440     -123     
==========================================
- Hits         3232     3122     -110     
+ Misses        331      318      -13     


@rempsyc rempsyc requested review from DominiqueMakowski and removed request for strengejacke October 16, 2022 23:20
@mattansb (Member)
I haven't given a close look yet, but this seems like it does not cover the full scope of possible garbage htest data.name patterns...

See @strengejacke code here: https://github.com/easystats/insight/blob/main/R/utils_get_data.R
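For context, a minimal illustration (added here, not part of the original comment) of why these patterns are hard to handle: an htest object stores only a deparsed string describing its inputs, and the shape of that string depends entirely on how the test was called:

```r
# data.name is just a deparsed string describing the call, not the data itself
t.test(mpg ~ vs, data = mtcars)$data.name
#> [1] "mpg by vs"

dat <- mtcars
t.test(dat$mpg[dat$am == 1], dat$mpg[dat$am == 0])$data.name
#> [1] "dat$mpg[dat$am == 1] and dat$mpg[dat$am == 0]"
```

Recovering the original variable names (let alone the data frame) from arbitrary strings like the second one requires fragile string parsing, which is the concern raised above.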

@mattansb mattansb added the WIP 👷‍♂️ work in progress label Oct 17, 2022
@rempsyc (Member Author) commented Oct 17, 2022

Ok. (When you have time) would it be possible to give me an example of what you mean by "htest data.name patterns"? Do you mean supporting the different htest object types (it wasn't clear to me from the link you provided)? I have only added support for Cohen's d and t-tests for now, but am happy to add support for more if useful.

@rempsyc (Member Author) commented Dec 16, 2022

  • chisq.test (no data argument so formula interface not possible)
  • fisher.test (no data argument so formula interface not possible)
  • mcnemar.test (no data argument so formula interface not possible)
  • cor.test (already works)
  • oneway.test (already works)
  • t.test (done)
  • wilcox.test (done)
  • friedman.test (done)
  • kruskal.test (done)
  • chisq.test_gof (not sure what function to use to test this object type)

@rempsyc (Member Author) commented Dec 16, 2022

devtools::load_all("D:/github/forks/effectsize")
#> ℹ Loading effectsize

# t.test
x <- t.test(mpg ~ vs, data = mtcars)
effectsize(x)
#> Warning: Unable to retrieve data from htest object. Returning an approximate
#>   effect size using t_to_d().
#> d     |         95% CI
#> ----------------------
#> -1.96 | [-2.94, -0.95]
effectsize(x, data = mtcars)
#> Cohen's d |         95% CI
#> --------------------------
#> -1.70     | [-2.55, -0.82]
#> 
#> - Estimated using un-pooled SD.

# cor.test
# no need to specify the data argument
x <- cor.test(~ qsec + drat, data = mtcars)
effectsize(x)
#> Warning: This 'htest' method is not (yet?) supported.
#>   Returning 'parameters::model_parameters(model)'.
#> Pearson's product-moment correlation
#> 
#> Parameter1 | Parameter2 |    r |        95% CI | t(30) |     p
#> --------------------------------------------------------------
#> qsec       |       drat | 0.09 | [-0.27, 0.43] |  0.50 | 0.620
#> 
#> Alternative hypothesis: true correlation is not equal to 0

# wilcox.test
x <- wilcox.test(mpg ~ vs, data = mtcars)
#> Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
#> compute exact p-value with ties
effectsize(x)
#> Error: Unable to retrieve data from htest object.
#>   Try using 'rb()' directly.
effectsize(x, data = mtcars)
#> Warning: 'y' is numeric but has only 2 unique values.
#>   If this is a grouping variable, convert it to a factor.
#> r (rank biserial) |       95% CI
#> --------------------------------
#> 1.00              | [1.00, 1.00]

# chisq.test
# No data argument, so it is not possible to use the formula interface.

# fisher.test
# No data argument, so it is not possible to use the formula interface.

# chisq.test_gof
# ???

# mcnemar.test
# No data argument, so it is not possible to use the formula interface.

# friedman.test
wb <- aggregate(warpbreaks$breaks, by = list(
  w = warpbreaks$wool, t = warpbreaks$tension), FUN = mean)
wb
#>   w t        x
#> 1 A L 44.55556
#> 2 B L 28.22222
#> 3 A M 24.00000
#> 4 B M 28.77778
#> 5 A H 24.55556
#> 6 B H 18.77778
x <- friedman.test(x ~ w | t, data = wb)
effectsize(x)
#> Error: Unable to retrieve data from htest object.
#>   Try using 'kendalls_w()' directly.
effectsize(x, data = wb)
#> Kendall's W |       95% CI
#> --------------------------
#> 0.11        | [0.11, 1.00]
#> 
#> - One-sided CIs: upper bound fixed at [1.00].

# chisq.test_gof
# no need to specify the data argument

# oneway.test
# no need to specify the data argument
x <- oneway.test(extra ~ group, data = sleep, var.equal = TRUE)
effectsize(x)
#> Eta2 |       95% CI
#> -------------------
#> 0.16 | [0.00, 1.00]
#> 
#> - One-sided CIs: upper bound fixed at [1.00].

# kruskal.test
# Cannot get it to work
airquality2 <- airquality
airquality2$Month <- as.factor(airquality2$Month)
x <- kruskal.test(Ozone ~ Month, data = airquality2)
effectsize(x)
#> Warning in grepl("Kruskal-Wallis", x$method, fixed = TRUE) &&
#> grepl("^list\\(", : 'length(x) = 2 > 1' in coercion to 'logical(1)'
#> Error: Unable to retrieve data from htest object.
#>   Try using 'rank_epsilon_squared()' directly.
effectsize(x, data = airquality2)
#> Warning in grepl("Kruskal-Wallis", x$method, fixed = TRUE) &&
#> grepl("^list\\(", : 'length(x) = 2 > 1' in coercion to 'logical(1)'
#> Warning: Missing values detected. NAs dropped.
#> Epsilon2 (rank) |       95% CI
#> ------------------------------
#> 0.25            | [0.15, 1.00]
#> 
#> - One-sided CIs: upper bound fixed at [1.00].

Created on 2022-12-16 with reprex v2.0.2

@Larissa-Cury: this comment was marked as outdated.

@mattansb: this comment was marked as outdated.

@mattansb (Member) commented Jul 3, 2023

@rempsyc chisq.test doesn't need any data - it is the only htest that actually contains the data 🥳
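A quick sketch of that point (added for illustration): the observed counts are stored on the chisq.test result itself, so no external data is needed:

```r
# chisq.test keeps the observed (and expected) counts inside the htest object
tab <- table(mtcars$am, mtcars$vs)
x <- chisq.test(tab)
x$observed  # the original contingency table, recoverable from the object alone
```

Because the counts travel with the object, an effect size (e.g. Cramer's V) can be computed from `x` without any data argument.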

To merge this, we would need a lot of unit tests that show that this feature can be used as intended.
Here are three examples that fail because the data is unrecoverable; can they be made to work? (Given here for t.test, but applicable to anything, really.)

library(effectsize)
library(testthat)

tt1 <- t.test(mpg ~ I(am + cyl == 4), data = mtcars)
dd1 <- cohens_d(mpg ~ I(am + cyl == 4), data = mtcars, pooled_sd = FALSE)

dat <- mtcars
tt2 <- t.test(dat$mpg[dat$am==1], dat$mpg[dat$am==0])
dd2 <- cohens_d(dat$mpg[dat$am==1], dat$mpg[dat$am==0], pooled_sd = FALSE)
rm("dat")

col.y <- "mpg"
tt3 <- t.test(mtcars[[col.y]])
dd3 <- cohens_d(mtcars[[col.y]])
rm("col.y")

expect_equal(effectsize(tt1, data = mtcars)[[1]], dd1[[1]])
#> Error in `[.data.frame`(dots$data, , vars) : undefined columns selected

expect_equal(effectsize(tt2, data = mtcars)[[1]], dd2[[1]])
#> Error in `[.data.frame`(dots$data, , vars) : undefined columns selected

expect_equal(effectsize(tt3, data = mtcars)[[1]], dd3[[1]])
#> Error in `[.data.frame`(dots$data, , vars) : undefined columns selected
  • Use internal functions to not repeat yourself in code
  • If some examples cannot be made to work:
    • Is this feature really worth the trouble just for a formula interface (that sucks)? If so...
    • Add informative errors to catch these "extreme" cases.

@rempsyc (Member Author) commented Jul 3, 2023

Thanks, that's very useful and specific! I'll look into implementing these suggestions :)

@rempsyc (Member Author) commented Jul 3, 2023

  • unit tests
  • internal functions to avoid repeating code
  • informative messages

It doesn't work well with advanced formula interfaces for now, but I return NULL for data when it can't be found, so it defaults to the same behaviour as not providing the argument (plus a message about it). I'm working on fixing the formula parsing with regex right now, but I'm not sure that's the right approach…

Also, to fix the lints, I was going to rename all the .effectsize_t.test functions to .effectsize_t_test, etc. (name style should match snake_case). But do you even want that?

@rempsyc (Member Author) commented Aug 20, 2023

> For this to work you will literally need to re-write about 60% of the effectsize.htest.R file

Are you sure? Below I reproduce your example. Do you think it needs something more?

library(effectsize)

some_data <- mtcars
some_data$mpg[1] <- NA

args <- alist(
  mpg ~ am,
  data = some_data,
  alternative = "less", 
  mu = 1,
  pooled_sd = TRUE,
  var.equal = TRUE,
  subset = cyl == 4, 
  na.action = na.omit
)

tt <- do.call(args, what = t.test)

d1 <- do.call(args, what = cohens_d)

d2 <- do.call(c(alist(tt), args[-1]), what = effectsize)

all.equal(d1$Cohens_d, d2$Cohens_d)
#> [1] TRUE

Created on 2023-08-19 with reprex v2.0.2

> we would have to bring the coverage there to 100% first to make sure we don't break anything

I'm happy to start on that first, with your approval. Worst case scenario it's not wasted even if we don't end up merging this PR.

> For this to work - even if just for very standard formula interface - the user will need to supply data, subset, na.action

True, but in my experience these arguments are rarely used; even so, we support them if the user really wants them, so I think that's OK.

> is this worth it?

I think so: with the above example and all the new tests I included, it adds value for the user. But maybe there is more that I am missing, and hopefully you can point it out to me :P

@mattansb (Member)
Without looking at the code, but judging from the tests, this looks pretty good.
Unfortunately I won't really have time to delve into this over the summer, but I think maybe we can have it done for a fall release?

@rempsyc rempsyc removed the WIP 👷‍♂️ work in progress label Apr 1, 2024
@rempsyc rempsyc marked this pull request as ready for review April 1, 2024 10:06
@rempsyc rempsyc added the enhancement 🔥 New feature or request label Apr 1, 2024
@rempsyc rempsyc added the WIP 👷‍♂️ work in progress label Apr 18, 2024
@rempsyc rempsyc marked this pull request as draft April 18, 2024 10:16
@rempsyc rempsyc marked this pull request as ready for review April 28, 2024 12:49
@rempsyc (Member Author) commented Apr 28, 2024

Mattan, there are still some (new) lints, but I can address them with your approval (since it involves changing a lot of variable names that collide with base R functions)

@mattansb (Member)
I still just don't like this as a solution - it requires the user to do so much work that they might as well just re-run the code to not use a formula (all of these tests run in a fraction of a second), no?

Does this fix anything upstream (e.g. in report)?

If you really think this is advantageous, go ahead and make the fixes and merge (:
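The alternative suggested above can be sketched as follows (assuming the standard cohens_d formula interface; per the comparison earlier in this thread, both routes give the same estimate):

```r
library(effectsize)

# Route 1: recover the data from the htest object (this PR's feature)
x <- t.test(mpg ~ vs, data = mtcars)
cohens_d(x, data = mtcars)

# Route 2: skip the htest object and compute the effect size directly
cohens_d(mpg ~ vs, data = mtcars, pooled_sd = FALSE)  # un-pooled SD matches Welch's t-test
```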

@rempsyc (Member Author) commented May 4, 2024

Yes, I understand that, but the way I see it, this adds functionality without removing any, and it does fix some issues upstream with report :) So I will go ahead with the fixes and merge, thanks!

@rempsyc (Member Author) commented May 4, 2024

Failing tests are unrelated to this PR

@rempsyc rempsyc requested a review from mattansb May 9, 2024 20:10
@rempsyc rempsyc removed the WIP 👷‍♂️ work in progress label May 11, 2024
@rempsyc (Member Author) commented May 11, 2024

I think I need the approval of a reviewer to merge this, if I want to avoid bypassing the branch protections

@mattansb mattansb merged commit 7cbed4f into easystats:main May 11, 2024
21 of 27 checks passed
@rempsyc rempsyc deleted the htest_data branch May 12, 2024 07:33

Successfully merging this pull request may close these issues.

Warning: Unable to retrieve data from htest object. Using t_to_d() approximation.