Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

when using bind_rows on a list of data.tables with keys, with either map_dfr or a do.call, the keys are not removed This is a problem, because the keys are not correct anymore, which means that later queries are using an index that is wrong. #5587

Open
emilBeBri opened this issue Feb 1, 2023 · 2 comments

Comments

@emilBeBri
Copy link

not sure if I should report this under data.table or dplyr, but since the issue is in bin_rows, I guess it is dplyr - but I think it's an issue for data.table users mostly - and I think it is pretty serious, since it can completely mess up data integrity, with no errors, no warnings, no way of knowing it happened except by external means, e.g. seeing that the data output just can't be right - this happened to me last week and it made a pretty big mess of things.

here is the issue I filed under dplyr



library(data.table)
library(dplyr)




data <- data.table(var1=c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'), var2=c(rep(2016, 4), rep(2018, 4)), key=c('var1', 'var2'))


test1 <- lapply(c(2018, 2016), function(x) data[var2 %in% c(x-1, x)])


test2 <- do.call(bind_rows, test1)
# wrong 
test2[var1 %in% c('c', 'e'), .N]
# right
sum( test2$var1 %in% c('c', 'e') )


same thing happens w. map_dfr,that uses bind_rows under the hood. This is how I found the issue. It is very common in my workflow, and I suspect I am not the one.

doing the same with rbind removes the keys, which is the expected behaviour - as the index from the key are no longer valid.


test2_rbind <- do.call(rbind, test1) 
# here it is fine - because the keys not there, giving a wrong picture of the structure of the DT
test2_rbind[var1 %in% c('c', 'e'), .N]

# when removing the keys, everything is fine. 
setkey(test2, NULL)
test2[var1 %in% c('c', 'e'), .N]


curiosly, using split() instead of lapply circumvents the issue - but this is not a solution, since typically you need to apply a function to the data, - splitting it and then recombining it with split() makes no sense. But I am including it here for completeness.



test1 <- split(data, data$var2)
test2 <- do.call(bind_rows, test1)
# right 
test2[var1 %in% c('c', 'e'), .N]

sessionInfo


R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.11.0
LAPACK: /usr/lib/liblapack.so.3.11.0

locale:
 [1] LC_CTYPE=en_DK.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_DK.UTF-8        LC_COLLATE=en_DK.UTF-8
 [5] LC_MONETARY=en_DK.UTF-8    LC_MESSAGES=en_DK.UTF-8
 [7] LC_PAPER=en_DK.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_DK.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] dplyr_1.0.10      data.table_1.14.6

loaded via a namespace (and not attached):
 [1] fansi_1.0.4      assertthat_0.2.1 utf8_1.2.2       R6_2.5.1
 [5] DBI_1.1.3        lifecycle_1.0.3  magrittr_2.0.3   pillar_1.8.1
 [9] rlang_1.0.6      cli_3.6.0        vctrs_0.5.2      generics_0.1.3
[13] glue_1.6.2       compiler_4.2.2   pkgconfig_2.0.3  tidyselect_1.2.0
[17] tibble_3.1.8



@ben-schwen
Copy link
Member

Similar to #5569

@emilBeBri
Copy link
Author

emilBeBri commented Feb 3, 2023

Well yes, except that in this case, we DON'T want to preserve the key attribute, so it's a good thing rbindlist doesn't do that in this case

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants