Replies: 4 comments 7 replies
-
Interesting pattern.

Unfortunately, it might take more engineering, but you might write a wrapper that accepts a hash and a table name and returns the table in memory. In the process, you could even check the expected hash against the hash of the data you just queried, then error out if they disagree. That could help with the previous point.
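A minimal sketch of such a wrapper, assuming a DBI connection and `rlang::hash()` for the comparison (the function name is hypothetical, not something from this thread):

```r
# Hypothetical helper: read a table and verify it against an expected hash.
# Errors out if the data in the database no longer matches what the upstream
# target recorded, so stale reads fail loudly instead of silently.
read_table_checked <- function(con, table, expected_hash) {
  data <- DBI::dbReadTable(con, table)
  actual_hash <- rlang::hash(data)
  if (!identical(actual_hash, expected_hash)) {
    stop(
      "Hash mismatch for table '", table, "': expected ", expected_hash,
      ", got ", actual_hash,
      call. = FALSE
    )
  }
  data
}
```

If the upstream loading target returns the hash, a downstream command could call `read_table_checked(con, "mtcars", table_mtcars)`, which also makes the dependency on `table_mtcars` explicit.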
You may wish to write special wrappers around DB queries which create and destroy the connection object on each query.
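A sketch of what that could look like for duckdb (the wrapper name and file path are assumptions):

```r
# Hypothetical wrapper: open a fresh connection for a single query and make
# sure it is closed again even if the query fails.
with_duckdb <- function(code, path = "data.duckdb") {
  con <- DBI::dbConnect(duckdb::duckdb(), dbdir = path)
  on.exit(DBI::dbDisconnect(con, shutdown = TRUE), add = TRUE)
  code(con)
}

# usage: each call gets its own short-lived connection
# with_duckdb(function(con) DBI::dbGetQuery(con, "SELECT COUNT(*) FROM mtcars"))
```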
Time stamps are much faster than hashes.
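One cheap way to use time stamps with targets, assuming the whole database lives in a single file such as `data.duckdb` (the file name and the downstream function are illustrative):

```r
library(targets)

list(
  # re-checked on every tar_make(); downstream targets are invalidated only
  # when the returned modification time actually changes
  tar_target(db_mtime, file.mtime("data.duckdb"), cue = tar_cue(mode = "always")),
  # hypothetical downstream step that reads from the database
  tar_target(db_report, summarize_db("data.duckdb", db_mtime))
)
```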
-
We have dealt with a similar use case for a while, and although it may not be straightforward (for the reasons wlandau mentioned), we ended up writing a target factory for it. It's still very much experimental, but it's working nicely for us. So, nowadays we would write your example as follows:

```r
library(targets)
library(tarchetypes)
# we have the target factory in this package, so you can take a look;
# it could be refactored somewhere else later, though
library(flowme)

# loads a file from disk into the database. If a table exists, delete the table first
load_table <- function(con, file, table) {
  data <- readr::read_csv(file, col_types = readr::cols())
  if (table %in% DBI::dbListTables(con)) {
    DBI::dbExecute(con, paste("DROP TABLE", table))
  }
  DBI::dbWriteTable(con, table, data)
  # don't store the whole table in targets, instead store only the hash of the table
  hash_table(con, table)
}

# returns the hash of a table, so that we don't store the contents of the table again
hash_table <- function(con, table) {
  # this is still suboptimal, as the data needs to be returned to R
  data <- DBI::dbReadTable(con, table)
  # this could be optimized using heuristics to get a hash of the table in a better way
  rlang::hash(data)
}
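
# Alternative sketch (an addition, not part of the original example): build a
# cheap fingerprint in SQL instead of pulling every row back into R. It only
# reacts to changes that move the row count or the per-column MIN/MAX values,
# so it is a heuristic rather than a true hash.
hash_table_heuristic <- function(con, table) {
  fields <- DBI::dbListFields(con, table)
  aggregates <- paste(sprintf("MIN(%s), MAX(%s)", fields, fields), collapse = ", ")
  fingerprint <- DBI::dbGetQuery(
    con,
    sprintf("SELECT COUNT(*), %s FROM %s", aggregates, table)
  )
  rlang::hash(fingerprint)
}
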
# list all tables and their row-count
# the ... are used only to build dependencies in targets
report_table <- function(con, ...) {
  tables <- data.frame(table = DBI::dbListTables(con))
  tables$rows <- sapply(
    tables$table,
    function(t) DBI::dbGetQuery(con, paste("SELECT COUNT(*) FROM", t))[[1]]
  )
  tables
}

#' write some data once
#' readr::write_csv(mtcars, "mtcars.csv")
#' readr::write_csv(ggplot2::mpg, "mpg.csv")
#' readr::write_csv(ggplot2::diamonds, "diamonds.csv")

list(
  # define the files to be used
  tar_target(file_mtcars, "mtcars.csv", format = "file"),
  tar_target(file_mpg, "mpg.csv", format = "file"),
  tar_target(file_diamonds, "diamonds.csv", format = "file"),
  # load the datasets into the SQL tables
  tar_duck_r(table_mtcars, load_table(db, file_mtcars, "mtcars")),
  tar_duck_r(table_mpg, load_table(db, file_mpg, "mpg")),
  tar_duck_r(table_diamonds, load_table(db, file_diamonds, "diamonds")),
  # get an overview of the datasets, combining all targets
  tar_duck_r(res, report_table(db, table_mtcars, table_mpg, table_diamonds))
)
```

So, that's pretty much the same code as in your example, but using the target factory.
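For readers who have not used target factories, the rough idea, illustrated below, is to wrap `targets::tar_target_raw()` so that every target gets a `db` connection opened before its command runs and closed afterwards. This sketch is only an illustration of the pattern, not the actual implementation of `flowme::tar_duck_r()`:

```r
# Illustrative factory, NOT flowme::tar_duck_r(): inject a duckdb connection
# named `db` around the user's command. The file path handling is a guess.
tar_duck_sketch <- function(name, command, path = "data.duckdb") {
  command <- substitute(
    {
      db <- DBI::dbConnect(duckdb::duckdb(), dbdir = path)
      on.exit(DBI::dbDisconnect(db, shutdown = TRUE), add = TRUE)
      command
    },
    env = list(command = substitute(command), path = path)
  )
  targets::tar_target_raw(deparse(substitute(name)), command)
}

# usage inside the target list:
# tar_duck_sketch(table_mtcars, load_table(db, file_mtcars, "mtcars"))
```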
We are currently only using the hash to make sure that changes get propagated downstream. But if you modify the duckdb file outside of the pipeline, the stored hash is not updated until the corresponding target runs again.
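If outside edits do need to be picked up, one blunt workaround (my assumption, not something proposed here) is to invalidate the loading targets manually so they re-run and re-hash their tables:

```r
# force the table-loading targets to re-run and re-hash on the next build
targets::tar_invalidate(starts_with("table_"))
targets::tar_make()
```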
I feel the pain of this. Back in the day, we could get a hash from the tables in Postgres, but it was rather slow. Also, bringing all the data back into R is not a possibility for us, because most of the tables do not fit in memory. That's part of the reason we ended up with the approach of one-table-per-duckdb-file, so we could hash the entire file/table and not the whole database.
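A sketch of that one-table-per-duckdb-file idea, assuming a `file_mtcars` target like in the example above (file names and the `overwrite = TRUE` choice are my own, not the poster's actual code): the loading target returns the path of the .duckdb file and declares `format = "file"`, so targets hashes the file on disk instead of pulling the data back into R:

```r
library(targets)

tar_target(
  mtcars_db,
  {
    con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "mtcars.duckdb")
    on.exit(DBI::dbDisconnect(con, shutdown = TRUE), add = TRUE)
    data <- readr::read_csv(file_mtcars, col_types = readr::cols())
    DBI::dbWriteTable(con, "mtcars", data, overwrite = TRUE)
    "mtcars.duckdb"  # the file path is the target's value, so targets hashes the file
  },
  format = "file"
)
```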
-
We use Dolt, which is an SQL database (MySQL-like) with git-like versioning. It's not as performant as DuckDB, but every commit to the database has a hash, so one can check the current hash of the database and use that to trigger targets. There isn't a hash for each table AFAICT, but you can get them in different ways, such as storing a table per branch or querying the table history to see if there have been changes.
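For example, the latest commit hash could be exposed as a cheap trigger target (a sketch; the connection details are placeholders, and I am assuming Dolt's `dolt_log` commit history system table here):

```r
library(targets)

list(
  # re-checked on every tar_make(); downstream targets only invalidate when
  # the latest commit hash actually changes
  tar_target(
    dolt_head,
    {
      con <- DBI::dbConnect(RMariaDB::MariaDB(), host = "127.0.0.1", port = 3306, dbname = "mydb")
      on.exit(DBI::dbDisconnect(con), add = TRUE)
      DBI::dbGetQuery(con, "SELECT commit_hash FROM dolt_log ORDER BY date DESC LIMIT 1")[[1]]
    },
    cue = tar_cue(mode = "always")
  )
)
```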
-
It's been 2 weeks since the last reply. I can reopen the discussion if there are new things to add.
-
Help
Description
I want to use targets as a local, file-based ETL tool.
That is, I have some data in some format, which I want to read in, transform into the right format, and load into a database. As this involves many files, each step can take a long time, and there are questions around reproducibility in general, I want to use targets as an orchestration tool.
With a couple of workarounds and suboptimal points, an MWE looks like this:
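A sketch of what such an MWE could look like, assuming the `load_table()`, `hash_table()`, and `report_table()` helpers shown in the earlier reply, a single duckdb file, and a `tar_hook_before()` hook to open the connection (all names here are illustrative):

```r
library(targets)
library(tarchetypes)

tar_option_set(packages = c("DBI", "duckdb", "readr", "rlang"))

targets <- list(
  # track the input files
  tar_target(file_mtcars, "mtcars.csv", format = "file"),
  tar_target(file_mpg, "mpg.csv", format = "file"),
  # load each file into the database; these targets only store the table hash
  tar_target(table_mtcars, load_table(con, file_mtcars, "mtcars")),
  tar_target(table_mpg, load_table(con, file_mpg, "mpg")),
  # summarize all tables; the table_X arguments only establish dependencies
  tar_target(res, report_table(con, table_mtcars, table_mpg))
)

# open the connection before each command runs (the part that feels a bit hackish)
tar_hook_before(
  targets,
  hook = {
    con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "data.duckdb")
  }
)
```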
A couple of things that I find suboptimal in this setup:

- The dependencies of the `res` target are not used directly, as the `table_X` targets themselves are more or less placeholders (one possible workaround is sketched below).
- `tar_hook_before()` feels a bit hackish; there might be a more elegant way to make sure that the connection is handled correctly, but it's ok overall.
- The `hash_table()` function will take a long time, as all data is returned back into R before it is hashed. Unfortunately, SQL does not have a hash function for a table. While SQL Server and sqlite seem to have hashing functionality, others like duckdb do not have it (yet). This can be optimized by heuristically summarizing the table in some other way, though.

It would be great to have something like `tar_target(table_X, ..., format = "dbi-table")` or something similar that allows checking whether the data in the table has changed, but which can also be used like `DBI::dbReadTable(con, table_X)`, so that `report_table()` actually uses the dependencies.

Note: I have posted a similar question on SO, but this example here is better imo.
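One possible workaround for the first point above (my own sketch, not something proposed in the thread): have `load_table()` return the table name alongside the hash, so that `report_table()` can read exactly the tables it depends on:

```r
# load_table() variant that returns the table name alongside the hash
load_table2 <- function(con, file, table) {
  data <- readr::read_csv(file, col_types = readr::cols())
  DBI::dbWriteTable(con, table, data, overwrite = TRUE)
  list(table = table, hash = rlang::hash(data))
}

# report only the tables that are actual dependencies of this target
report_table2 <- function(con, ...) {
  tables <- vapply(list(...), function(x) x$table, character(1))
  data.frame(
    table = tables,
    rows = vapply(
      tables,
      function(t) as.numeric(DBI::dbGetQuery(con, paste("SELECT COUNT(*) FROM", t))[[1]]),
      numeric(1)
    )
  )
}
```

The hash keeps the invalidation behaviour of the original example, while the table name lets the reporting target use its dependencies directly instead of listing every table in the database.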
What are your thoughts on this? Is this something that you think belongs in targets and adds value for other users?
I am happy to help on this.