I'm a member of the Observational Health Data Sciences and Informatics (OHDSI) community (ohdsi.org). OHDSI is a community focused on generating medical evidence from patient-level data, leveraging an agreed-upon standard data model and a set of R-based method libraries. One area where we have not yet found a solution is reproducible pipeline execution. I was recently tasked with evaluating the landscape of pipeline tooling available for R-based analyses, and, as you might suspect, I came across targets as part of that journey.
I have tried to script out the scenario the OHDSI community is often tasked with: generating evidence at scale across a distributed community of data partners who each hold person-level data. This task requires multiple steps, and a reliable, reproducible process is our goal. Here is a draft stub script, with some embedded questions, that I thought might help get the discussion started.
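In outline, the shape is roughly this (a minimal sketch only; `create_cohorts()`, `run_analytics()`, and `package_results()` are placeholder stubs standing in for real OHDSI method-library calls, not actual functions):

```r
# _targets.R: rough skeleton only. create_cohorts(), run_analytics(),
# and package_results() are placeholders for real OHDSI method-library
# calls executed at each data partner site.
library(targets)

# Placeholder implementations so the skeleton runs end to end.
create_cohorts <- function() {
  data.frame(cohort_id = 1:2, name = c("cases", "controls"))
}
run_analytics <- function(cohorts) {
  data.frame(cohort_id = cohorts$cohort_id, estimate = c(1.2, 0.9))
}
package_results <- function(results) {
  saveRDS(results, "results_bundle.rds")
  "results_bundle.rds" # return the path so targets can track the file
}

list(
  # Step 1: define study cohorts against the site's standardized data.
  tar_target(cohorts, create_cohorts()),
  # Step 2: execute the analytic method against those cohorts.
  tar_target(analysis, run_analytics(cohorts)),
  # Step 3: package aggregate results for return to the study coordinator.
  tar_target(bundle, package_results(analysis), format = "file")
)
```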
Please let me know if there are any details that I left out that would aid in the discussion.
Thanks!
Replies: 1 comment
-
I am not sure I completely follow the use case yet, but I can begin with some comments. It looks like you are getting your initial data from a database, and it would be ideal to automatically invalidate targets when that upstream data changes. Parallel computing and integration with cloud storage are documented in the targets user manual; cloud storage specifically at https://books.ropensci.org/targets/data.html#cloud-storage. Writing to a database is tricky: if you write to the same database table as the data you start with, then your pipeline is circular, which is not a good fit for the directed acyclic graph that targets expects.
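For the invalidation piece, one common pattern is to fingerprint the upstream table in a target that always runs, so downstream targets rebuild only when the data actually changes. A minimal sketch, assuming a local SQLite file and a hypothetical `person` table:

```r
# _targets.R: minimal sketch, assuming a local SQLite database
# "example.sqlite" with a hypothetical "person" table. Swap in your
# own DBI backend and credentials.
library(targets)

read_person <- function(path) {
  con <- DBI::dbConnect(RSQLite::SQLite(), path)
  on.exit(DBI::dbDisconnect(con))
  DBI::dbReadTable(con, "person")
}

list(
  # Reruns on every tar_make() because of the "always" cue, but
  # downstream targets invalidate only if the hash value changes.
  tar_target(
    person_fingerprint,
    digest::digest(read_person("example.sqlite")),
    cue = tar_cue(mode = "always")
  ),
  # Referencing person_fingerprint creates the dependency, so this
  # target reruns only when the upstream table actually changed.
  tar_target(
    person_data,
    {
      person_fingerprint
      read_person("example.sqlite")
    }
  )
)
```

With this in place, tar_make() re-hashes the table on every run, and the rest of the pipeline stays untouched unless the data changed.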