slice_sample() uses statistically incorrect sampling algorithms #6848

skolenik · 2023-05-12T21:05:01Z

slice_sample() relies on base::sample.int(), which in turn relies on sampling algorithms that are incorrect for unequal probability sampling. The difficulty of drawing proper unequal probability samples is obscure and little known outside of survey statistics. I don't want to go into the base R code, although it needs fixing, too (somewhat surprising, given that Thomas Lumley, one of the core R people, is a survey statistician, at least part time). Appropriate algorithms are implemented in sampling::UPmaxentropy().

# example from Tillé (2006)
set.seed(25)
uneq_p <- c(0.07, 0.17, 0.41, 0.61, 0.83, 0.91)  # target inclusion probabilities, summing to 3
n_sim <- 20000
# round() guards against sum(uneq_p) landing just below 3 in floating point
n_draw <- round(sum(uneq_p))
sim_uneq_p <- matrix(0, nrow = n_sim, ncol = length(uneq_p))
for (k in seq_len(n_sim)) {
  this <- sample.int(n = length(uneq_p), size = n_draw, prob = uneq_p)
  sim_uneq_p[k, this] <- 1
}
colMeans(sim_uneq_p)  # empirical inclusion probabilities from sample.int()
uneq_p
# done right
sim_uneq_p_done_right <- matrix(0, nrow = n_sim, ncol = length(uneq_p))
for (k in seq_len(n_sim)) {
  sim_uneq_p_done_right[k, ] <- sampling::UPmaxentropy(uneq_p)
}
colMeans(sim_uneq_p_done_right)  # empirical inclusion probabilities from UPmaxentropy()
uneq_p
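
The mismatch in the simulation above can also be checked without Monte Carlo noise. `?sample` documents that, when sampling without replacement, the `prob` weights are applied sequentially: each next draw is proportional to the weights of the remaining items. The sketch below enumerates every ordered draw under that rule and adds up the exact inclusion probabilities. `seq_inclusion_prob()` is a hypothetical helper written for this illustration, not part of base R, dplyr, or sampling, and it only models the documented sequential rule rather than calling `sample.int()` itself.

```r
# Exact inclusion probabilities when n units are drawn one at a time,
# each draw proportional to the weights of the units not yet chosen.
# Hypothetical helper for illustration; feasible only for small N and n.
seq_inclusion_prob <- function(w, n) {
  N <- length(w)
  incl <- numeric(N)
  recurse <- function(chosen, prob) {
    if (length(chosen) == n) {
      # a complete ordered sample: credit its probability to each member
      incl[chosen] <<- incl[chosen] + prob
      return(invisible(NULL))
    }
    remaining <- setdiff(seq_len(N), chosen)
    total <- sum(w[remaining])
    for (k in remaining) {
      recurse(c(chosen, k), prob * w[k] / total)
    }
  }
  recurse(integer(0), 1)
  incl
}

uneq_p <- c(0.07, 0.17, 0.41, 0.61, 0.83, 0.91)
exact <- seq_inclusion_prob(uneq_p, n = 3)
# first row: requested probabilities; second row: what sequential draws deliver
print(round(rbind(target = uneq_p, sequential = exact), 3))
```

The two rows both sum to 3, but the sequential row gives the largest unit less than its target and the smallest unit more than its target.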

@krlmlr @DavisVaughan

Code in sampling is (kinda) ugly, and the development most likely does not satisfy the tidyverse standards. I don't know if you want to rely on this as a dependency. Relevant parts may need to be taken over and internalized. (I have written the unequal probability sampling code from scratch in Stata, so I am closely familiar with the methodology and what it takes to code it.)
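
To give a sense of how small a self-contained implementation could be, here is a minimal sketch of systematic unequal probability sampling in base R. This is a different and simpler design than the maximum entropy design behind sampling::UPmaxentropy(): it reproduces the requested first-order inclusion probabilities, but many pairs of units can never appear together, so it is not a substitute where joint inclusion probabilities matter. `up_systematic()` is a hypothetical name used only for this illustration, and the sketch assumes all probabilities lie in [0, 1] and sum to an integer.

```r
# Systematic unequal probability sampling (illustrative sketch only).
# pik: target inclusion probabilities in [0, 1] whose sum is an integer n.
# Returns the indices of the n selected units.
up_systematic <- function(pik) {
  n <- round(sum(pik))
  stopifnot(all(pik >= 0), all(pik <= 1), abs(sum(pik) - n) < 1e-8)
  upper <- cumsum(pik)                      # right ends of each unit's interval
  points <- runif(1) + seq_len(n) - 1       # one random start, then steps of 1
  # unit k is selected when a point falls in [upper[k - 1], upper[k])
  idx <- findInterval(points, upper) + 1L
  pmin(idx, length(pik))                    # guard against floating-point overshoot
}

set.seed(25)
uneq_p <- c(0.07, 0.17, 0.41, 0.61, 0.83, 0.91)
one_sample <- up_systematic(uneq_p)
print(one_sample)
```

Because each unit owns an interval of length equal to its probability and the selection points are exactly one apart, a unit is hit with probability equal to its interval length and never more than once.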

DavisVaughan · 2023-11-03T16:43:39Z

Thanks for the report! We don't currently have any plans to switch away from the base R algorithms here. I'd encourage you to advocate for this change in base R itself, and see if they will be open to changing it there.

krlmlr · 2023-11-03T17:28:26Z

Thanks. I also remember reading in R's changelog that a problem in that area was fixed in a recent-ish version; it might be worth revisiting the changelog.

DavisVaughan · 2023-11-03T17:36:54Z

This is from 3.6, I think: https://github.com/wch/r-source/blob/0987da15dd567ad07f91745238588c7873844d4c/doc/NEWS.3#L284

apeterson91 · 2024-07-14T18:02:42Z

+1 to @skolenik's original issue. I understand that it doesn't make sense to change the algorithm, but added documentation can help here. To that end, I've submitted #7052. PTAL and let me know what you think.

DavisVaughan closed this as completed Nov 3, 2023

apeterson91 mentioned this issue Jul 14, 2024

Add documentation clarifying appropriate use of weights in slice_sample() #7052

Merged
