`bind_rows()` on a list of data.tables creates corrupt output #6676

emilBeBri · 2023-02-01T18:29:11Z

when using bind_rows on a list of data.tables with keys, with either map_dfr or a do.call, the keys are not removed This is a problem, because the keys are not correct anymore, which means that later queries are using an index that is wrong.

not sure if I should report this under data.table or dplyr, but since the issue is in bin_rows, I guess it is dplyr.

here is the issue I filed under data.table



library(data.table)
library(dplyr)




data <- data.table(var1=c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'), var2=c(rep(2016, 4), rep(2018, 4)), key=c('var1', 'var2'))


test1 <- lapply(c(2018, 2016), function(x) data[var2 %in% c(x-1, x)])


test2 <- do.call(bind_rows, test1)
# wrong 
test2[var1 %in% c('c', 'e'), .N]
# right
sum( test2$var1 %in% c('c', 'e') )

same thing happens w. map_dfr,that uses bind_rows under the hood. This is how I found the issue. It is very common in my workflow, and I suspect I am not the one.

doing the same with rbind removes the keys, which is the expected behaviour - as the index from the key are no longer valid.


test2_rbind <- do.call(rbind, test1) 
# here it is fine - because the keys not there, giving a wrong picture of the structure of the DT
test2_rbind[var1 %in% c('c', 'e'), .N]

# when removing the keys, everything is fine. 
setkey(test2, NULL)
test2[var1 %in% c('c', 'e'), .N]

curiosly, using split() instead of lapply circumvents the issue - but this is not a solution, since typically you need to apply a function to the data, - splitting it and then recombining it with split() makes no sense. But I am including it here for completeness.



test1 <- split(data, data$var2)
test2 <- do.call(bind_rows, test1)
# right 
test2[var1 %in% c('c', 'e'), .N]

sessionInfo


R version 4.2.2 (2022-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Manjaro Linux

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.11.0
LAPACK: /usr/lib/liblapack.so.3.11.0

locale:
 [1] LC_CTYPE=en_DK.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_DK.UTF-8        LC_COLLATE=en_DK.UTF-8
 [5] LC_MONETARY=en_DK.UTF-8    LC_MESSAGES=en_DK.UTF-8
 [7] LC_PAPER=en_DK.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_DK.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] dplyr_1.0.10      data.table_1.14.6

loaded via a namespace (and not attached):
 [1] fansi_1.0.4      assertthat_0.2.1 utf8_1.2.2       R6_2.5.1
 [5] DBI_1.1.3        lifecycle_1.0.3  magrittr_2.0.3   pillar_1.8.1
 [9] rlang_1.0.6      cli_3.6.0        vctrs_0.5.2      generics_0.1.3
[13] glue_1.6.2       compiler_4.2.2   pkgconfig_2.0.3  tidyselect_1.2.0
[17] tibble_3.1.8

The text was updated successfully, but these errors were encountered:

hadley · 2023-02-01T19:24:57Z

Can you please provide a minimal reprex (reproducible example)? The goal of a reprex is to make it as easy as possible for me to recreate your problem so that I can fix it: please help me help you! If you've never heard of a reprex before, start by reading about the reprex package, including the advice further down the page. Please make sure your reprex is created with the reprex package as it gives nicely formatted output and avoids a number of common pitfalls.

emilBeBri · 2023-02-02T07:51:53Z

So the reprex package gives me this:


Error in `reprex_render()`:
! This reprex appears to crash R. Call `reprex()` again with
  `std_out_err = TRUE` to get more info.
Run `rlang::last_error()` to see where the error occurred.

I already spend a lot of time reproducing this - it might not be using the reprex package but the example is solid, isn't it? I work on high security servers with no internet access, on data with social security numbers that I would be fired and possibly prosecuted for disclosing. Reproducing it outside those took a while, as the bug is so sublte. And I already spend 5 days finding the bug and correcting the issues caused by it. And there is more work to be done making sure that the bug hasn't caused errors other places in other projects. I don't have any more time to give, I'm sorry. It would have to wait a couple of weeks then.

mgirlich · 2023-02-02T11:59:41Z

Here's a reprex

library(data.table)
library(dplyr, warn.conflicts = FALSE)

data <- data.table(
  var1 = c("a", "b"),
  key = "var1"
)
data[var1 %in% c('a', 'b')]
#>    var1
#> 1:    a
#> 2:    b

# doesn't work
dt <- bind_rows(data[2, ], data[1, ])
dt[var1 %in% c('a', 'b')]
#>    var1
#> 1:    b

# works
dt <- bind_rows(data[1, ], data[2, ])
dt[var1 %in% c('a', 'b')]
#>    var1
#> 1:    a
#> 2:    b

^{Created on 2023-02-02 with reprex v2.0.2}

DavisVaughan · 2023-02-07T21:10:05Z

A bit more information. It looks like dplyr_reconstruct()'s default behavior isn't great for data.tables

library(data.table)
library(dplyr, warn.conflicts = FALSE)

data <- data.table(
  var1 = c("a", "b"),
  key = "var1"
)

# Knows it is sorted by `var1`
attributes(data)
#> $names
#> [1] "var1"
#> 
#> $row.names
#> [1] 1 2
#> 
#> $class
#> [1] "data.table" "data.frame"
#> 
#> $.internal.selfref
#> <pointer: 0x7fc95800d4e0>
#> 
#> $sorted
#> [1] "var1"

# These are still sorted
attributes(data[1,])$sorted
#> [1] "var1"
attributes(data[2,])$sorted
#> [1] "var1"

# As an example, this is no longer sorted
attributes(data[2:1,])$sorted
#> NULL

both <- bind_rows(
  data[2, ], 
  data[1, ]
)

# But this looks sorted because of attribute reconstruction by `dplyr_reconstruct()`
attributes(both)$sorted
#> [1] "var1"

# gforce's fast chin must use the `sorted` attribute so this is wrong
both[var1 %in% c("a", "b"), verbose = TRUE]
#> Optimized subsetting with key 'var1'
#> forder.c received 2 rows and 1 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> i.var1 has same type (character) as x.var1. No coercion needed.
#> on= matches existing key, using key
#> Starting bmerge ...
#> bmerge done in 0.000s elapsed (0.000s cpu) 
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
#>    var1
#> 1:    b

Ironically vec_rbind() gives the right thing here, and it is only the usage of dplyr_reconstruct() in bind_rows() that messes us up by re-adding attributes.

library(data.table)

data <- data.table(
  var1 = c("a", "b"),
  key = "var1"
)

data <- vctrs::vec_rbind(
  data[2,],
  data[1,]
)

# No longer sorted, correct
attributes(data)$sorted
#> NULL

Here is another example with mutate(), which also ends up using dplyr_reconstruct() after going through dplyr_col_select()

library(data.table)
library(dplyr, warn.conflicts = FALSE)

data <- data.table(
  var1 = c("a", "b"),
  key = "var1"
)

# Knows it is sorted by `var1`
attributes(data)
#> $names
#> [1] "var1"
#> 
#> $row.names
#> [1] 1 2
#> 
#> $class
#> [1] "data.table" "data.frame"
#> 
#> $.internal.selfref
#> <pointer: 0x7fd6b700d4e0>
#> 
#> $sorted
#> [1] "var1"

data[var1 %in% c("a", "b")]
#>    var1
#> 1:    a
#> 2:    b

data <- mutate(data, var1 = rev(var1))

# Oops, still thinks it is sorted
attributes(data)$sorted
#> [1] "var1"

data[var1 %in% c("a", "b")]
#>    var1
#> 1:    b

It is possible we need a dplyr_reconstruct() method for data.table that just calls as.data.table() on data.

We'd have to consider whether or not the reconstruct method should try and restore "miscellaneous" attributes that look unrelated to data.table, but that might be fragile. We try to do that here, but I feel like we should reconsider how wise that is:

dplyr/R/generics.R

Lines 240 to 244 in e8702df

    
           # Patch base data frames and data.table (#6171) to restore extra attributes that `[.data.frame` drops. 
        
           # We require `[` methods to keep extra attributes for all data frame subclasses. 
        
           if (identical(class(.data), "data.frame") || identical(class(.data), c("data.table", "data.frame"))) { 
        
             out <- dplyr_reconstruct(out, .data) 
        
           }

r2evans · 2023-08-14T20:03:16Z

A little more context, prompted by https://stackoverflow.com/q/76895536/3358272: the user does not have to explicitly (or even knowingly) added an index to a data.table for it to be present.

Using the example from that SO question:

mydt1 = data.table(year=rep(2017:2018, each=3), month=rep(1:3, times=2))
mydt2 = data.table(year=rep(2016:2017, each=3), month=rep(4:6, times=2))
mydt1[year == 2018]    # this appears to assigns `year` as an index to mydt1 

rbindlist(list(mydt1, mydt2))[year == 2017]
### produces expected output:
#    year month
# 1: 2017     1
# 2: 2017     2
# 3: 2017     3
# 4: 2017     4
# 5: 2017     5
# 6: 2017     6

subset(bind_rows(mydt1, mydt2), year == 2017)
### produces the same output as above

bind_rows(mydt1, mydt2)[year == 2017]
#    year month
# 1: 2017     1
# 2: 2017     2
# 3: 2017     3

The last output is incorrect because bind_rows passes the "index" attribute from the first argument. While I agree that it is likely good for bind_rows to preserve attributes it knows nothing about, this one (at least) is an attribute that should be dropped when combining data.table objects.

As stated in the answer in the Stack question and in https://cran.r-project.org/web/packages/data.table/vignettes/datatable-secondary-indices-and-auto-indexing.html, setting

options(datatable.use.index = FALSE)

prevents auto-indexing like this (though I think that only has explanatory value here, no benefit to a change in dplyr behavior).

DavisVaughan added this to the 1.1.1 milestone Feb 7, 2023

DavisVaughan removed this from the 1.1.1 milestone Feb 14, 2023

DavisVaughan added the bug an unexpected problem or unintended behavior label Nov 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`bind_rows()` on a list of data.tables creates corrupt output #6676

`bind_rows()` on a list of data.tables creates corrupt output #6676

emilBeBri commented Feb 1, 2023 •

edited

Loading

hadley commented Feb 1, 2023

emilBeBri commented Feb 2, 2023 •

edited

Loading

mgirlich commented Feb 2, 2023

DavisVaughan commented Feb 7, 2023

r2evans commented Aug 14, 2023

bind_rows() on a list of data.tables creates corrupt output #6676

bind_rows() on a list of data.tables creates corrupt output #6676

Comments

emilBeBri commented Feb 1, 2023 • edited Loading

hadley commented Feb 1, 2023

emilBeBri commented Feb 2, 2023 • edited Loading

mgirlich commented Feb 2, 2023

DavisVaughan commented Feb 7, 2023

r2evans commented Aug 14, 2023

`bind_rows()` on a list of data.tables creates corrupt output #6676

`bind_rows()` on a list of data.tables creates corrupt output #6676

emilBeBri commented Feb 1, 2023 •

edited

Loading

emilBeBri commented Feb 2, 2023 •

edited

Loading