Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bind_rows() on a list of data.tables creates corrupt output #6676

Open
emilBeBri opened this issue Feb 1, 2023 · 5 comments
Open

bind_rows() on a list of data.tables creates corrupt output #6676

emilBeBri opened this issue Feb 1, 2023 · 5 comments
Labels
bug an unexpected problem or unintended behavior

Comments

@emilBeBri
Copy link

emilBeBri commented Feb 1, 2023

when using bind_rows on a list of data.tables with keys, with either map_dfr or a do.call, the keys are not removed This is a problem, because the keys are not correct anymore, which means that later queries are using an index that is wrong.

not sure if I should report this under data.table or dplyr, but since the issue is in bin_rows, I guess it is dplyr.

here is the issue I filed under data.table



library(data.table)
library(dplyr)




data <- data.table(var1=c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'), var2=c(rep(2016, 4), rep(2018, 4)), key=c('var1', 'var2'))


test1 <- lapply(c(2018, 2016), function(x) data[var2 %in% c(x-1, x)])


test2 <- do.call(bind_rows, test1)
# wrong 
test2[var1 %in% c('c', 'e'), .N]
# right
sum( test2$var1 %in% c('c', 'e') )


same thing happens w. map_dfr,that uses bind_rows under the hood. This is how I found the issue. It is very common in my workflow, and I suspect I am not the one.

doing the same with rbind removes the keys, which is the expected behaviour - as the index from the key are no longer valid.


test2_rbind <- do.call(rbind, test1) 
# here it is fine - because the keys not there, giving a wrong picture of the structure of the DT
test2_rbind[var1 %in% c('c', 'e'), .N]

# when removing the keys, everything is fine. 
setkey(test2, NULL)
test2[var1 %in% c('c', 'e'), .N]


curiosly, using split() instead of lapply circumvents the issue - but this is not a solution, since typically you need to apply a function to the data, - splitting it and then recombining it with split() makes no sense. But I am including it here for completeness.



test1 <- split(data, data$var2)
test2 <- do.call(bind_rows, test1)
# right 
test2[var1 %in% c('c', 'e'), .N]

sessionInfo


R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.11.0
LAPACK: /usr/lib/liblapack.so.3.11.0

locale:
 [1] LC_CTYPE=en_DK.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_DK.UTF-8        LC_COLLATE=en_DK.UTF-8
 [5] LC_MONETARY=en_DK.UTF-8    LC_MESSAGES=en_DK.UTF-8
 [7] LC_PAPER=en_DK.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_DK.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] dplyr_1.0.10      data.table_1.14.6

loaded via a namespace (and not attached):
 [1] fansi_1.0.4      assertthat_0.2.1 utf8_1.2.2       R6_2.5.1
 [5] DBI_1.1.3        lifecycle_1.0.3  magrittr_2.0.3   pillar_1.8.1
 [9] rlang_1.0.6      cli_3.6.0        vctrs_0.5.2      generics_0.1.3
[13] glue_1.6.2       compiler_4.2.2   pkgconfig_2.0.3  tidyselect_1.2.0
[17] tibble_3.1.8



@hadley
Copy link
Member

hadley commented Feb 1, 2023

Can you please provide a minimal reprex (reproducible example)? The goal of a reprex is to make it as easy as possible for me to recreate your problem so that I can fix it: please help me help you! If you've never heard of a reprex before, start by reading about the reprex package, including the advice further down the page. Please make sure your reprex is created with the reprex package as it gives nicely formatted output and avoids a number of common pitfalls.

@hadley hadley changed the title when using bind_rows on a list of data.tables with keys, with either map_dfr or a do.call, the keys are not removed This is a problem, because the keys are not correct anymore, which means that later queries are using an index that is wrong. bind_rows() on a list of data.tables creates corrupt output Feb 1, 2023
@emilBeBri
Copy link
Author

emilBeBri commented Feb 2, 2023

So the reprex package gives me this:


Error in `reprex_render()`:
! This reprex appears to crash R. Call `reprex()` again with
  `std_out_err = TRUE` to get more info.
Run `rlang::last_error()` to see where the error occurred.

I already spend a lot of time reproducing this - it might not be using the reprex package but the example is solid, isn't it? I work on high security servers with no internet access, on data with social security numbers that I would be fired and possibly prosecuted for disclosing. Reproducing it outside those took a while, as the bug is so sublte. And I already spend 5 days finding the bug and correcting the issues caused by it. And there is more work to be done making sure that the bug hasn't caused errors other places in other projects. I don't have any more time to give, I'm sorry. It would have to wait a couple of weeks then.

@mgirlich
Copy link

mgirlich commented Feb 2, 2023

Here's a reprex

library(data.table)
library(dplyr, warn.conflicts = FALSE)

data <- data.table(
  var1 = c("a", "b"),
  key = "var1"
)
data[var1 %in% c('a', 'b')]
#>    var1
#> 1:    a
#> 2:    b

# doesn't work
dt <- bind_rows(data[2, ], data[1, ])
dt[var1 %in% c('a', 'b')]
#>    var1
#> 1:    b

# works
dt <- bind_rows(data[1, ], data[2, ])
dt[var1 %in% c('a', 'b')]
#>    var1
#> 1:    a
#> 2:    b

Created on 2023-02-02 with reprex v2.0.2

@DavisVaughan DavisVaughan added this to the 1.1.1 milestone Feb 7, 2023
@DavisVaughan
Copy link
Member

A bit more information. It looks like dplyr_reconstruct()'s default behavior isn't great for data.tables

library(data.table)
library(dplyr, warn.conflicts = FALSE)

data <- data.table(
  var1 = c("a", "b"),
  key = "var1"
)

# Knows it is sorted by `var1`
attributes(data)
#> $names
#> [1] "var1"
#> 
#> $row.names
#> [1] 1 2
#> 
#> $class
#> [1] "data.table" "data.frame"
#> 
#> $.internal.selfref
#> <pointer: 0x7fc95800d4e0>
#> 
#> $sorted
#> [1] "var1"

# These are still sorted
attributes(data[1,])$sorted
#> [1] "var1"
attributes(data[2,])$sorted
#> [1] "var1"

# As an example, this is no longer sorted
attributes(data[2:1,])$sorted
#> NULL

both <- bind_rows(
  data[2, ], 
  data[1, ]
)

# But this looks sorted because of attribute reconstruction by `dplyr_reconstruct()`
attributes(both)$sorted
#> [1] "var1"

# gforce's fast chin must use the `sorted` attribute so this is wrong
both[var1 %in% c("a", "b"), verbose = TRUE]
#> Optimized subsetting with key 'var1'
#> forder.c received 2 rows and 1 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> i.var1 has same type (character) as x.var1. No coercion needed.
#> on= matches existing key, using key
#> Starting bmerge ...
#> bmerge done in 0.000s elapsed (0.000s cpu) 
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
#>    var1
#> 1:    b

Ironically vec_rbind() gives the right thing here, and it is only the usage of dplyr_reconstruct() in bind_rows() that messes us up by re-adding attributes.

library(data.table)

data <- data.table(
  var1 = c("a", "b"),
  key = "var1"
)

data <- vctrs::vec_rbind(
  data[2,],
  data[1,]
)

# No longer sorted, correct
attributes(data)$sorted
#> NULL

Here is another example with mutate(), which also ends up using dplyr_reconstruct() after going through dplyr_col_select()

library(data.table)
library(dplyr, warn.conflicts = FALSE)

data <- data.table(
  var1 = c("a", "b"),
  key = "var1"
)

# Knows it is sorted by `var1`
attributes(data)
#> $names
#> [1] "var1"
#> 
#> $row.names
#> [1] 1 2
#> 
#> $class
#> [1] "data.table" "data.frame"
#> 
#> $.internal.selfref
#> <pointer: 0x7fd6b700d4e0>
#> 
#> $sorted
#> [1] "var1"

data[var1 %in% c("a", "b")]
#>    var1
#> 1:    a
#> 2:    b

data <- mutate(data, var1 = rev(var1))

# Oops, still thinks it is sorted
attributes(data)$sorted
#> [1] "var1"

data[var1 %in% c("a", "b")]
#>    var1
#> 1:    b

It is possible we need a dplyr_reconstruct() method for data.table that just calls as.data.table() on data.

We'd have to consider whether or not the reconstruct method should try and restore "miscellaneous" attributes that look unrelated to data.table, but that might be fragile. We try to do that here, but I feel like we should reconsider how wise that is:

dplyr/R/generics.R

Lines 240 to 244 in e8702df

# Patch base data frames and data.table (#6171) to restore extra attributes that `[.data.frame` drops.
# We require `[` methods to keep extra attributes for all data frame subclasses.
if (identical(class(.data), "data.frame") || identical(class(.data), c("data.table", "data.frame"))) {
out <- dplyr_reconstruct(out, .data)
}

@DavisVaughan DavisVaughan removed this from the 1.1.1 milestone Feb 14, 2023
@r2evans
Copy link

r2evans commented Aug 14, 2023

A little more context, prompted by https://stackoverflow.com/q/76895536/3358272: the user does not have to explicitly (or even knowingly) added an index to a data.table for it to be present.

Using the example from that SO question:

mydt1 = data.table(year=rep(2017:2018, each=3), month=rep(1:3, times=2))
mydt2 = data.table(year=rep(2016:2017, each=3), month=rep(4:6, times=2))
mydt1[year == 2018]    # this appears to assigns `year` as an index to mydt1 

rbindlist(list(mydt1, mydt2))[year == 2017]
### produces expected output:
#    year month
# 1: 2017     1
# 2: 2017     2
# 3: 2017     3
# 4: 2017     4
# 5: 2017     5
# 6: 2017     6

subset(bind_rows(mydt1, mydt2), year == 2017)
### produces the same output as above

bind_rows(mydt1, mydt2)[year == 2017]
#    year month
# 1: 2017     1
# 2: 2017     2
# 3: 2017     3

The last output is incorrect because bind_rows passes the "index" attribute from the first argument. While I agree that it is likely good for bind_rows to preserve attributes it knows nothing about, this one (at least) is an attribute that should be dropped when combining data.table objects.

As stated in the answer in the Stack question and in https://cran.r-project.org/web/packages/data.table/vignettes/datatable-secondary-indices-and-auto-indexing.html, setting

options(datatable.use.index = FALSE)

prevents auto-indexing like this (though I think that only has explanatory value here, no benefit to a change in dplyr behavior).

@DavisVaughan DavisVaughan added the bug an unexpected problem or unintended behavior label Nov 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

5 participants