Replies: 4 comments 7 replies
-
Interesting pattern.

Unfortunately, it might take more engineering, but you might write a wrapper that accepts a hash and a table name and returns the table in memory. In the process, you could even check the expected hash against the hash of the data you just queried, then error out if they disagree. That could help with the previous point.
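A minimal sketch of such a wrapper, assuming a DBI connection and `rlang::hash()` for the comparison (the function name is hypothetical, not something from this thread):

```r
# Hypothetical helper: read a table and verify it against an expected hash.
# Errors out if the data in the database no longer matches what the upstream
# target recorded, so stale reads fail loudly instead of silently.
read_table_checked <- function(con, table, expected_hash) {
  data <- DBI::dbReadTable(con, table)
  actual_hash <- rlang::hash(data)
  if (!identical(actual_hash, expected_hash)) {
    stop(
      "Hash mismatch for table '", table, "': expected ", expected_hash,
      ", got ", actual_hash,
      call. = FALSE
    )
  }
  data
}
```

If the upstream loading target returns the hash, a downstream command could call `read_table_checked(con, "mtcars", table_mtcars)`, which also makes the dependency on `table_mtcars` explicit.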
You may wish to write special wrappers around DB queries which create and destroy the connection object on each query.
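A sketch of what that could look like for duckdb (the wrapper name and file path are assumptions):

```r
# Hypothetical wrapper: open a fresh connection for a single query and make
# sure it is closed again even if the query fails.
with_duckdb <- function(code, path = "data.duckdb") {
  con <- DBI::dbConnect(duckdb::duckdb(), dbdir = path)
  on.exit(DBI::dbDisconnect(con, shutdown = TRUE), add = TRUE)
  code(con)
}

# usage: each call gets its own short-lived connection
# with_duckdb(function(con) DBI::dbGetQuery(con, "SELECT COUNT(*) FROM mtcars"))
```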
Time stamps are much faster than hashes.
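One cheap way to use time stamps with targets, assuming the whole database lives in a single file such as `data.duckdb` (the file name and the downstream function are illustrative):

```r
library(targets)

list(
  # re-checked on every tar_make(); downstream targets are invalidated only
  # when the returned modification time actually changes
  tar_target(db_mtime, file.mtime("data.duckdb"), cue = tar_cue(mode = "always")),
  # hypothetical downstream step that reads from the database
  tar_target(db_report, summarize_db("data.duckdb", db_mtime))
)
```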
-
We have dealt with a similar use case for a while, and although it may not be straightforward (for the reasons wlandau mentioned), we ended up writing a target factory for it. It's still very much experimental, but it's working nicely for us. So, nowadays we would write your example as follows:

```r
library(targets)
library(tarchetypes)
# we have the target factory in this package, so you can take a look;
# it could be refactored somewhere else later, though
library(flowme)

# loads a file from disk into the database. If a table exists, delete the table first
load_table <- function(con, file, table) {
  data <- readr::read_csv(file, col_types = readr::cols())
  if (table %in% DBI::dbListTables(con)) {
    DBI::dbExecute(con, paste("DROP TABLE", table))
  }
  DBI::dbWriteTable(con, table, data)
  # don't store the whole table in targets, instead store only the hash of the table
  hash_table(con, table)
}

# returns the hash of a table, so that we don't store the contents of the table again
hash_table <- function(con, table) {
  # this is still suboptimal, as the data needs to be returned to R
  data <- DBI::dbReadTable(con, table)
  # this could be optimized using heuristics to get a hash of the table in a better way
  rlang::hash(data)
}
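
# Alternative sketch (an addition, not part of the original example): build a
# cheap fingerprint in SQL instead of pulling every row back into R. It only
# reacts to changes that move the row count or the per-column MIN/MAX values,
# so it is a heuristic rather than a true hash.
hash_table_heuristic <- function(con, table) {
  fields <- DBI::dbListFields(con, table)
  aggregates <- paste(sprintf("MIN(%s), MAX(%s)", fields, fields), collapse = ", ")
  fingerprint <- DBI::dbGetQuery(
    con,
    sprintf("SELECT COUNT(*), %s FROM %s", aggregates, table)
  )
  rlang::hash(fingerprint)
}
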
# list all tables and their row-count
# the ... are used only to build dependencies in targets
report_table <- function(con, ...) {
  tables <- data.frame(table = DBI::dbListTables(con))
  tables$rows <- sapply(
    tables$table,
    function(t) DBI::dbGetQuery(con, paste("SELECT COUNT(*) FROM", t))[[1]]
  )
  tables
}

#' write some data once
#' readr::write_csv(mtcars, "mtcars.csv")
#' readr::write_csv(ggplot2::mpg, "mpg.csv")
#' readr::write_csv(ggplot2::diamonds, "diamonds.csv")

list(
  # define the files to be used
  tar_target(file_mtcars, "mtcars.csv", format = "file"),
  tar_target(file_mpg, "mpg.csv", format = "file"),
  tar_target(file_diamonds, "diamonds.csv", format = "file"),
  # load the datasets into the SQL tables
  tar_duck_r(table_mtcars, load_table(db, file_mtcars, "mtcars")),
  tar_duck_r(table_mpg, load_table(db, file_mpg, "mpg")),
  tar_duck_r(table_diamonds, load_table(db, file_diamonds, "diamonds")),
  # get an overview of the datasets, combining all targets
  tar_duck_r(res, report_table(db, table_mtcars, table_mpg, table_diamonds))
)
```

So, that's pretty much the same code as in your example, but using the target factory.
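For readers who have not used target factories, the rough idea, illustrated below, is to wrap `targets::tar_target_raw()` so that every target gets a `db` connection opened before its command runs and closed afterwards. This sketch is only an illustration of the pattern, not the actual implementation of `flowme::tar_duck_r()`:

```r
# Illustrative factory, NOT flowme::tar_duck_r(): inject a duckdb connection
# named `db` around the user's command. The file path handling is a guess.
tar_duck_sketch <- function(name, command, path = "data.duckdb") {
  command <- substitute(
    {
      db <- DBI::dbConnect(duckdb::duckdb(), dbdir = path)
      on.exit(DBI::dbDisconnect(db, shutdown = TRUE), add = TRUE)
      command
    },
    env = list(command = substitute(command), path = path)
  )
  targets::tar_target_raw(deparse(substitute(name)), command)
}

# usage inside the target list:
# tar_duck_sketch(table_mtcars, load_table(db, file_mtcars, "mtcars"))
```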
We are currently only using the hash to make sure that changes get propagated downstream. But if you modify the duckdb file outside of the pipeline, the stored hash is not updated until the corresponding target runs again.
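If outside edits do need to be picked up, one blunt workaround (my assumption, not something proposed here) is to invalidate the loading targets manually so they re-run and re-hash their tables:

```r
# force the table-loading targets to re-run and re-hash on the next build
targets::tar_invalidate(starts_with("table_"))
targets::tar_make()
```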
I feel the pain of this. Back in the day, we could get a hash from the tables in Postgres, but it was rather slow. Also, bringing all the data back into R is not a possibility for us, because most of the tables do not fit in memory. That's part of the reason we ended up with the approach of one-table-per-duckdb-file, so we could hash the entire file/table and not the whole database.
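A sketch of that one-table-per-duckdb-file idea, assuming a `file_mtcars` target like in the example above (file names and the `overwrite = TRUE` choice are my own, not the poster's actual code): the loading target returns the path of the .duckdb file and declares `format = "file"`, so targets hashes the file on disk instead of pulling the data back into R:

```r
library(targets)

tar_target(
  mtcars_db,
  {
    con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "mtcars.duckdb")
    on.exit(DBI::dbDisconnect(con, shutdown = TRUE), add = TRUE)
    data <- readr::read_csv(file_mtcars, col_types = readr::cols())
    DBI::dbWriteTable(con, "mtcars", data, overwrite = TRUE)
    "mtcars.duckdb"  # the file path is the target's value, so targets hashes the file
  },
  format = "file"
)
```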
-
We use Dolt, which is an SQL database (MySQL-like) with git-like versioning. It's not as performant as DuckDB, but every commit to the database has a hash, so one can check the current hash of the database and use that to trigger targets. There isn't a hash for each table AFAICT, but you can get them in different ways, such as storing a table per branch or querying the table history to see if there have been changes.
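For example, the latest commit hash could be exposed as a cheap trigger target (a sketch; the connection details are placeholders, and I am assuming Dolt's `dolt_log` commit history system table here):

```r
library(targets)

list(
  # re-checked on every tar_make(); downstream targets only invalidate when
  # the latest commit hash actually changes
  tar_target(
    dolt_head,
    {
      con <- DBI::dbConnect(RMariaDB::MariaDB(), host = "127.0.0.1", port = 3306, dbname = "mydb")
      on.exit(DBI::dbDisconnect(con), add = TRUE)
      DBI::dbGetQuery(con, "SELECT commit_hash FROM dolt_log ORDER BY date DESC LIMIT 1")[[1]]
    },
    cue = tar_cue(mode = "always")
  )
)
```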
-
It's been 2 weeks since the last reply. I can reopen the discussion if there are new things to add.
-
Help
Description
I want to use targets as a local, file-based ETL tool.
That is, I have some data in some format, which I want to read in, transform into the right format, and load into a database. As this involves many files, each step can take a long time, and there are questions around reproducibility in general, I want to use targets as an orchestration tool.
With a couple of workarounds and suboptimal points, an MWE looks like this:
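A sketch of what such an MWE could look like, assuming the `load_table()`, `hash_table()`, and `report_table()` helpers shown in the earlier reply, a single duckdb file, and a `tar_hook_before()` hook to open the connection (all names here are illustrative):

```r
library(targets)
library(tarchetypes)

tar_option_set(packages = c("DBI", "duckdb", "readr", "rlang"))

targets <- list(
  # track the input files
  tar_target(file_mtcars, "mtcars.csv", format = "file"),
  tar_target(file_mpg, "mpg.csv", format = "file"),
  # load each file into the database; these targets only store the table hash
  tar_target(table_mtcars, load_table(con, file_mtcars, "mtcars")),
  tar_target(table_mpg, load_table(con, file_mpg, "mpg")),
  # summarize all tables; the table_X arguments only establish dependencies
  tar_target(res, report_table(con, table_mtcars, table_mpg))
)

# open the connection before each command runs (the part that feels a bit hackish)
tar_hook_before(
  targets,
  hook = {
    con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "data.duckdb")
  }
)
```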
A couple of things that I find suboptimal in this setup:

- The dependencies of the `res` target are not used directly, as the `table_X` targets themselves are more or less placeholders (one possible workaround is sketched below).
- `tar_hook_before()` feels a bit hackish; there might be a more elegant way to make sure that the connection is handled correctly, but it's ok overall.
- The `hash_table()` function will take a long time, as all data is returned back into R before it is hashed. Unfortunately, SQL does not have a hash function for a table. While SQL Server and sqlite seem to have hashing functionality, others like duckdb do not have it (yet). This can be optimized by heuristically summarizing the table in some other way, though.

It would be great to have something like `tar_target(table_X, ..., format = "dbi-table")` or something similar that allows checking whether the data in the table has changed, but which can also be used like `DBI::dbReadTable(con, table_X)`, so that `report_table()` actually uses the dependencies.

Note: I have posted a similar question on SO, but this example here is better imo.
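One possible workaround for the first point above (my own sketch, not something proposed in the thread): have `load_table()` return the table name alongside the hash, so that `report_table()` can read exactly the tables it depends on:

```r
# load_table() variant that returns the table name alongside the hash
load_table2 <- function(con, file, table) {
  data <- readr::read_csv(file, col_types = readr::cols())
  DBI::dbWriteTable(con, table, data, overwrite = TRUE)
  list(table = table, hash = rlang::hash(data))
}

# report only the tables that are actual dependencies of this target
report_table2 <- function(con, ...) {
  tables <- vapply(list(...), function(x) x$table, character(1))
  data.frame(
    table = tables,
    rows = vapply(
      tables,
      function(t) as.numeric(DBI::dbGetQuery(con, paste("SELECT COUNT(*) FROM", t))[[1]]),
      numeric(1)
    )
  )
}
```

The hash keeps the invalidation behaviour of the original example, while the table name lets the reporting target use its dependencies directly instead of listing every table in the database.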
What are your thoughts on this? Is this something that you think belongs in targets and adds value for other users?
I am happy to help on this.