slice_sample() uses statistically incorrect sampling algorithms #6848
Comments
Thanks for the report! We don't currently have any plans to switch away from the base R algorithms here. I'd encourage you to advocate for this change in base R itself, and see if they will be open to changing it there.
Thanks. I also remember reading in R's changelog that a problem in that area was fixed in a recent-ish version; it might be worth revisiting the changelog.
slice_sample() relies on base::sample.int(), which in turn relies on incorrect sampling algorithms. Drawing proper unequal probability samples is harder than it looks, and the difficulty is little known outside of survey statistics. I don't want to go into the base R code, although they have to fix that, too (somewhat surprising, given that Thomas Lumley, one of the core R people, is a survey statistician, at least part time). The appropriate algorithms are implemented in sampling::UPmaxentropy(). @krlmlr @DavisVaughan
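To make the claim concrete, here is a minimal sketch in Python (standard library only, not R). It assumes two things that are not spelled out above: that the intended property is an inclusion probability of n * w_i / sum(w) for each unit, and that weighted draws without replacement work the way ?sample describes, with each draw proportional to the weights of the units still remaining. The function name and the weights are made up for illustration.

```python
from fractions import Fraction
from itertools import permutations


def sequential_inclusion_probs(weights, n):
    """Exact inclusion probabilities when n units are drawn one at a time,
    each draw proportional to the weights of the units still remaining."""
    w = [Fraction(x) for x in weights]
    incl = [Fraction(0)] * len(w)
    # Enumerate every ordered sample of size n and add up its probability.
    for order in permutations(range(len(w)), n):
        remaining = sum(w)
        p = Fraction(1)
        for i in order:
            p *= w[i] / remaining   # chance of drawing unit i at this step
            remaining -= w[i]       # unit i is removed from the urn
        for i in order:
            incl[i] += p
    return incl


weights = [1, 2, 3, 4]  # illustrative weights, not from any real data
n = 2
actual = sequential_inclusion_probs(weights, n)
target = [Fraction(n * x, sum(weights)) for x in weights]

for x, a, t in zip(weights, actual, target):
    print(f"weight {x}: sequential {float(a):.4f}, proportional target {float(t):.4f}")
```

Under these assumptions the smallest unit is included more often than its weight warrants and the largest unit less often, so estimators that assume inclusion probabilities proportional to the weights would be biased.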
Code in sampling is (kinda) ugly, and its development most likely does not satisfy the tidyverse standards. I don't know if you want to rely on this as a dependency. Relevant parts may need to be taken over and internalized. (I have written the unequal probability sampling code from scratch in Stata, so I am closely familiar with the methodology and what it takes to code it.)
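As a rough idea of how small an internalized routine could be, here is a hedged Python sketch of systematic unequal probability sampling. This is a swapped-in, simpler design, not the maximum entropy algorithm behind sampling::UPmaxentropy(): it hits the target inclusion probabilities n * w_i / sum(w) exactly, but it depends on the order of the units and gives some pairs zero joint inclusion probability. The function name and the explicit `u` argument are hypothetical, added so the behaviour can be checked deterministically.

```python
import random
from fractions import Fraction


def systematic_pps(weights, n, u=None):
    """Systematic unequal probability sample of fixed size n.
    Returns the 0-based indices of the selected units."""
    w = [Fraction(x) for x in weights]
    total = sum(w)
    pik = [n * x / total for x in w]  # target inclusion probabilities
    if any(p > 1 for p in pik):
        raise ValueError("some unit has n * w_i / sum(w) > 1")
    if u is None:
        u = random.random()           # random start in [0, 1)
    chosen = []
    lower = Fraction(0)
    point = u                         # selection points are u, u+1, ..., u+n-1
    for i, p in enumerate(pik):
        upper = lower + p
        if lower <= point < upper:    # a selection point falls in unit i's interval
            chosen.append(i)
            point += 1
        lower = upper
    return chosen


print(systematic_pps([1, 2, 3, 4], 2))
```

Each unit owns an interval whose length equals its target inclusion probability, and a uniformly placed start lands in it with exactly that probability. That is the property the sequential scheme does not have.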