Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

slice_sample() uses statistically incorrect sampling algorithms #6848

Closed
skolenik opened this issue May 12, 2023 · 4 comments
Closed

slice_sample() uses statistically incorrect sampling algorithms #6848

skolenik opened this issue May 12, 2023 · 4 comments

Comments

@skolenik
Copy link

skolenik commented May 12, 2023

slice_sample() relies on base::sample.int() which in turn relies on incorrect sampling algorithms. The difficulty of obtaining appropriate unequal probability samples is obscure and is not known outside of survey statistics. I don't want to go into base R code although they have to fix that, too (somewhat more surprising given that Thomas Lumley, one of the core R people, is a survey statistician, at least part time). The appropriate algorithms are implemented in sampling::UPmaxentropy().

# example from Tille (2006)
set.seed(25)
uneq_p = c(0.07,0.17,0.41,0.61,0.83,0.91)
sim_uneq_p <- as.data.frame(matrix(rep(0,6*20000),ncol=6))
for (k in 1:nrow(sim_uneq_p)) {
  this <- sample.int(n=length(uneq_p),size=sum(uneq_p),prob=uneq_p)
  sim_uneq_p[k,this] <- 1
}
colSums(sim_uneq_p)/nrow(sim_uneq_p)
uneq_p
# done right
sim_uneq_p_done_right <- as.data.frame(matrix(rep(0,6*20000),ncol=6))
for (k in 1:nrow(sim_uneq_p_done_right)) {
   sim_uneq_p_done_right[k,] <- sampling::UPmaxentropy(uneq_p)
}
colSums(sim_uneq_p_done_right)/nrow(sim_uneq_p_done_right)
uneq_p

@krlmlr @DavisVaughan

Code in sampling is (kinda) ugly, and the development most likely does not satisfy the tidyverse standards. I don't know if you want to rely on this as a dependency. Relevant parts may need to be taken over and internalized. (I have written the unequal probability sampling code from scratch in Stata, so I am closely familiar with the methodology and what it takes to code it.)

@DavisVaughan
Copy link
Member

Thanks for the report! We don't currently have any plans to switch away from the base R algorithms here. I'd encourage you to advocate for this change in base R itself, and see if they will be open to changing it there.

@krlmlr
Copy link
Member

krlmlr commented Nov 3, 2023

Thanks. I also remember reading in R's changelog that a problem in that area has been fixed in a recent-ish version, might be worth revisiting the changelog.

@DavisVaughan
Copy link
Member

This from 3.6, i think https://github.com/wch/r-source/blob/0987da15dd567ad07f91745238588c7873844d4c/doc/NEWS.3#L284

@apeterson91
Copy link
Contributor

I have a +1 to @skolenik's original issue. I understand that it doesn't make sense to change the algorithm but added documentation can help here. To that end, I've submitted #7052 . PTAL and let me know what you think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants