diff --git a/README.md b/README.md
index a8106ff..05c8d81 100644
--- a/README.md
+++ b/README.md
@@ -2,7 +2,7 @@

[![Build and publish a Docker image](https://github.com/Sage-Bionetworks/recover-parquet-external/actions/workflows/docker-build.yml/badge.svg?branch=main)](https://github.com/Sage-Bionetworks/recover-parquet-external/actions/workflows/docker-build.yml)

-This repo hosts the code for the egress pipeline used by the MHDR (SageBionetworks) for data enrichment and i2b2 summarization of data.
+This repo hosts the code for the egress pipeline used by the Digital Health Data Repository (Sage Bionetworks) for data enrichment and summarization for i2b2.

## Requirements

@@ -15,27 +15,23 @@ A Synapse authentication token is required for use of the Synapse APIs (e.g. the

## Usage

-There are two methods to run this pipeline: 1) **Docker container** or 2) **Manual Job**. Please refer to the [Docker Container](#docker-container) and [Manual Job](#manual-job) sections for their respective usage instructions.
+There are two methods to run this pipeline:
+1) [**Docker container**](#docker-container), or
+2) [**Manual Job**](#manual-job)

-### Docker Container
-
-For the Docker method, there is a pre-published docker image available at [Packages](https://github.com/orgs/Sage-Bionetworks/packages/container/package/recover-pipeline-i2b2).
-
-The primary purpose of using the Docker method is that the docker image published from this repo contains instructions to:
+### Set Synapse Personal Access Token

-1. Create a computing environment with the dependencies needed by the machine running the pipeline
-2. Install the packages needed in order to run the pipeline
-3. Run a script containing the instructions for the pipeline
+Regardless of which method you use, you need to set your Synapse Personal Access Token somewhere in your environment. See the examples below.

-#### Use the pre-built Docker image
-
-1. Add your Synapse personal access token to the environment
+1. For only the current shell session:

```Shell
-# Option 1: For only the current shell session:
export SYNAPSE_AUTH_TOKEN=
+```

-# Option 2: For all future shell sessions (modify your shell profile)
+2. For all future shell sessions (modify your shell profile):
+
+```Shell
# Open the profile file
nano ~/.bash_profile

@@ -47,127 +43,94 @@ export SYNAPSE_AUTH_TOKEN

# Save the file
source ~/.bash_profile
```

-2. Pull the docker image
+### Docker Container
+
+For the Docker method, there is a pre-published docker image available: [ghcr.io/sage-bionetworks/recover-pipeline-i2b2:main](https://github.com/orgs/Sage-Bionetworks/packages/container/package/recover-pipeline-i2b2)
+
+The primary purpose of using the Docker method is that the docker image published from this repo contains instructions to:
+
+1. Create a computing environment with the dependencies needed by the machine running the pipeline
+2. Install the packages needed in order to run the pipeline
+3. Run a script containing the instructions for the pipeline
+
+If you do not want to use the pre-built Docker image, skip to the next section ([**Build the Docker image yourself**](#build-the-docker-image-yourself)).
+
+#### Use the pre-built Docker image
+
+1. Pull the docker image

```Shell
-docker pull ghcr.io/Sage-Bionetworks/recover-pipeline-i2b2:main
+docker pull ghcr.io/sage-bionetworks/recover-pipeline-i2b2:main
```

-3. Run the docker container
+2. Run the docker container

```Shell
docker run \
-    --name \
+    --name container-name \
    -e SYNAPSE_AUTH_TOKEN=$SYNAPSE_AUTH_TOKEN \
-    -e ONTOLOGY_FILE_ID= \
-    -e PARQUET_DIR_ID= \
-    -e DATASET_NAME_FILTER= \
-    -e CONCEPT_REPLACEMENTS= \
-    -e CONCEPT_FILTER_COL= \
-    -e SYN_FOLDER_ID= \
-    ghcr.io/Sage-Bionetworks/recover-pipeline-i2b2:main
+    ghcr.io/sage-bionetworks/recover-pipeline-i2b2:main
```

-For an explanation of the various environment variables required in the `docker run` command, please see [Environment Variables](#environment-variables).
+For an explanation of the various config parameters used in the pipeline, please see [Config Parameters](#config-parameters).
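+
+If you run the container in the background, something like the following can be used to watch progress (a sketch; the container name `recover-i2b2` is illustrative, not required):
+
+```Shell
+# Run detached, then follow the pipeline logs
+docker run -d --name recover-i2b2 \
+    -e SYNAPSE_AUTH_TOKEN=$SYNAPSE_AUTH_TOKEN \
+    ghcr.io/sage-bionetworks/recover-pipeline-i2b2:main
+docker logs -f recover-i2b2
+```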

#### Build the Docker image yourself

-1. Add your Synapse personal access token to the environment
-
-```Shell
-# Option 1: For only the current shell session:
-export SYNAPSE_AUTH_TOKEN=
-
-# Option 2: For all future shell sessions (modify your shell profile)
-# Open the profile file
-nano ~/.bash_profile
-
-# Append the following
-SYNAPSE_AUTH_TOKEN=
-export SYNAPSE_AUTH_TOKEN
-
-# Save the file
-source ~/.bash_profile
-```
-
-2. Clone this repo
+1. Clone this repo

```Shell
git clone https://github.com/Sage-Bionetworks/recover-pipeline-i2b2.git
```

-4. Build the docker image
+2. Build the docker image

```Shell
# Option 1: From the directory containing the Dockerfile
-cd /path/to/Dockerfile
-docker build -t .
+cd /path/to/Dockerfile/
+docker build -t image-name .

# Option 2: From anywhere
-docker build -t -f .
+docker build -t image-name -f /path/to/Dockerfile .
```

-4. Run the docker container
+3. Run the docker container

```Shell
docker run \
-    --name \
+    --name container-name \
    -e SYNAPSE_AUTH_TOKEN=$SYNAPSE_AUTH_TOKEN \
-    -e ONTOLOGY_FILE_ID= \
-    -e PARQUET_DIR_ID= \
-    -e DATASET_NAME_FILTER= \
-    -e CONCEPT_REPLACEMENTS= \
-    -e CONCEPT_FILTER_COL= \
-    -e SYN_FOLDER_ID= \
-    
+    image-name
```

-For an explanation of the various environment variables required in the `docker run` command, please see [Environment Variables](#environment-variables).
+For an explanation of the various config parameters used in the pipeline, please see [Config Parameters](#config-parameters).

### Manual Job

-If you would like to run the pipeline manually, or pass just a single script to a job scheduler, please follow the instructions in this section.
+If you would like to run the pipeline manually, please follow the instructions in this section.

-1. Add your Synapse personal access token to the environment
-
-```Shell
-# Option 1: For only the current shell session:
-export SYNAPSE_AUTH_TOKEN=
-
-# Option 2: For all future shell sessions (modify your shell profile)
-# Open the profile file
-nano ~/.bash_profile
-
-# Append the following
-SYNAPSE_AUTH_TOKEN=
-export SYNAPSE_AUTH_TOKEN
-
-# Save the file
-source ~/.bash_profile
-```
-
-2. Clone this repo or get just the [run-pipeline.R](run-pipeline.R) file
+1. Clone this repo

```Shell
git clone https://github.com/Sage-Bionetworks/recover-pipeline-i2b2.git
```

-2. Modify the variables and parameters in [run-pipeline.R](run-pipeline.R). If you do not modify how the values of the variables in [run-pipeline.R](run-pipeline.R) are read in, then you will need to set the values of those variables as environment variables. If you want to set the values of those variables in an R session, then either use the `Sys.setenv()` function or modify the file itself to assign values to variables the normal way in R, e.g. `var <- val`.
-4. Run [run-pipeline.R](run-pipeline.R)
+2. Modify the parameters in the [config](config/config.yml) as needed
+
+3. Run [run-pipeline.R](pipeline/run-pipeline.R)
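+
+For example, assuming R and the packages the pipeline loads are installed (a sketch, run from wherever you cloned the repo):
+
+```Shell
+cd recover-pipeline-i2b2
+Rscript pipeline/run-pipeline.R
+```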

-### Environment Variables
+### Config Parameters

-The environment variables passed to `docker run ...` are the input arguments of `recoversummarizer::summarize_pipeline()`, and as such must be provided in order to use the docker method. Please refer to the [recoverSummarizeR](https://github.com/Sage-Bionetworks/recoverSummarizeR) R package for more information on the `recoverSummarizeR` package and its functions.
+The parameters in [config/config.yml](config/config.yml) are the input arguments of `recoversummarizer::summarize_pipeline()`, and must be provided whichever method you use. Please refer to the [recoverSummarizeR](https://github.com/Sage-Bionetworks/recoverSummarizeR) R package for more information on the `recoverSummarizeR` package and its functions.

-Variable | Definition | Example
+Parameter | Definition | Example
---|---|---
-| `ONTOLOGY_FILE_ID` | A Synapse ID for a CSV file stored in Synapse. For RECOVER, this file is the i2b2 concepts map. | syn12345678
-| `PARQUET_DIR_ID` | A Synapse ID for a folder entity in Synapse where the data is stored. For RECOVER, this would be the folder housing the post-ETL parquet data. | syn12345678
-| `DATASET_NAME_FILTER` | A string found in the names of the files to be read. This acts like a filter to include only the files that contain the string in their names. | fitbit
-| `CONCEPT_REPLACEMENTS` | A named vector of strings and their replacements. The names must be valid values of the `concept_filter_col` column of the `concept_map` data frame. For RECOVER, `concept_map` is the ontology file data frame. | "c('mins' = 'minutes', 'avghr' = 'averageheartrate', 'spo2' = 'spo2\_', 'hrv' = 'hrv_dailyrmssd', 'restinghr' = 'restingheartrate', 'sleepbrth' = 'sleepsummarybreath')"<br>*Must surround `c(…)` in parentheses (as indicated above) in `docker run …`*
-| `CONCEPT_FILTER_COL` | The column of the `concept_map` data frame that contains "approved concepts" (column names of dataset data frames that are not to be excluded). For RECOVER, `concept_map` is the ontology file data frame. | concept_cd
-| `SYN_FOLDER_ID` | A Synapse ID for a folder entity in Synapse where you want to store a file. | syn12345678
-| `method` | Either `synapse` or `sts` to specify the method to use in getting the parquet datasets. `synapse` will get files directly from a synapse project or folder using the synapse client, while `sts` will use sts-token access to get objects from an sts-enabled storage location, such as an S3 bucket. | synapse
-| `s3bucket` | The name of the S3 bucket to access when `method=sts`. | my-bucket
-| `s3basekey` | The base key of the S3 bucket to access when `method=sts`. | main/parquet/
-| `downloadLocation` | The location to download input files to. | ./parquet
+| `ontologyFileID` | A Synapse ID for the i2b2 concepts map ontology file stored in Synapse. | syn12345678
+| `parquetDirID` | A Synapse ID for a folder entity in Synapse where the input data is stored. This should be the folder housing the post-ETL parquet data. | syn12345678
+| `concept_replacements` | A named vector of strings and their replacements. The names must be valid values of the `concept_filter_col` column of the `concept_map` data frame. For RECOVER, `concept_map` is the ontology file data frame. | R example:<br>`c('mins' = 'minutes', 'avghr' = 'averageheartrate', 'spo2' = 'spo2\_', 'hrv' = 'hrv_dailyrmssd', 'restinghr' = 'restingheartrate', 'sleepbrth' = 'sleepsummarybreath')`
+| `concept_filter_col` | The column of the `concept_map` data frame that contains "approved concepts" (column names of dataset data frames that are not to be excluded). For RECOVER, `concept_map` is the ontology file data frame. | concept_cd
+| `synFolderID` | A Synapse ID for a folder entity in Synapse where you want to store the final output files. | syn12345678
+| `s3bucket` | The name of the S3 bucket containing input data. | recover-bucket
+| `s3basekey` | The base key of the S3 bucket containing input data. | main/archive/2024-.../
+| `downloadLocation` | The location to sync input files to. | ./parquet
+| `selectedVarsFileID` | A Synapse ID for the CSV file listing which datasets and variables have been selected for use in this pipeline. | syn12345678
+| `outputConceptsDir` | The location to save intermediate and final i2b2 summary files to. | ./output-concepts
+
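+If you use the Docker method and want to change these parameters without rebuilding the image, one option is to mount your edited config over the copy baked into the image (a sketch; the in-image path `/pipeline/config/config.yml` is an assumption, not a documented location):
+
+```Shell
+docker run \
+    --name container-name \
+    -e SYNAPSE_AUTH_TOKEN=$SYNAPSE_AUTH_TOKEN \
+    -v $(pwd)/config/config.yml:/pipeline/config/config.yml \
+    ghcr.io/sage-bionetworks/recover-pipeline-i2b2:main
+```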

diff --git a/config/config.yml b/config/config.yml
index 43c36bd..47a146f 100644
--- a/config/config.yml
+++ b/config/config.yml
@@ -16,7 +16,7 @@ prod:
    "sleependtime" = "enddate")
  concept_filter_col: CONCEPT_CD
  synFolderID: syn52504335
-  method: sts
+  # method: sts
  s3bucket: recover-main-project
  s3basekey: main/archive/2024-02-29/
  downloadLocation: ./temp-parquet