
Commit

Merge pull request #22 from pranavanba/main
Update README Instructions
pranavanba authored May 1, 2024
2 parents 8a3389a + 0fabbb7 commit 87c190b
Showing 2 changed files with 56 additions and 93 deletions.
147 changes: 55 additions & 92 deletions README.md

[![Build and publish a Docker image](https://github.com/Sage-Bionetworks/recover-parquet-external/actions/workflows/docker-build.yml/badge.svg?branch=main)](https://github.com/Sage-Bionetworks/recover-parquet-external/actions/workflows/docker-build.yml)

This repo hosts the code for the egress pipeline used by the Digital Health Data Repository (Sage Bionetworks) for data enrichment and summarization for i2b2.

## Requirements

A Synapse authentication token is required for use of the Synapse APIs.

## Usage

There are two methods to run this pipeline:
1) [**Docker container**](#docker-container), or
2) [**Manual Job**](#manual-job)

### Set Synapse Personal Access Token

Regardless of which method you use, you need to set your Synapse personal access token in your environment. See the examples below.

1. Option 1: For only the current shell session:

```Shell
export SYNAPSE_AUTH_TOKEN=<your-token>
```

2. Option 2: For all future shell sessions (modify your shell profile)

```Shell
# Open the profile file
nano ~/.bash_profile

# Append the following
SYNAPSE_AUTH_TOKEN=<your-token>
export SYNAPSE_AUTH_TOKEN

# Save the file
source ~/.bash_profile
```
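
Before running either method, it is worth confirming that the token is actually visible to new processes. A minimal check, assuming a bash-compatible shell:

```Shell
# Prints a confirmation without echoing the token itself
if [ -n "$SYNAPSE_AUTH_TOKEN" ]; then
  echo "SYNAPSE_AUTH_TOKEN is set (${#SYNAPSE_AUTH_TOKEN} characters)"
else
  echo "SYNAPSE_AUTH_TOKEN is NOT set"
fi
```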

### Docker Container

For the Docker method, there is a pre-published docker image available: [ghcr.io/sage-bionetworks/recover-pipeline-i2b2:main](https://github.com/orgs/Sage-Bionetworks/packages/container/package/recover-pipeline-i2b2)

The primary benefit of the Docker method is that the image published from this repo contains instructions to:

1. Create a computing environment with the dependencies needed by the machine running the pipeline
2. Install the packages needed to run the pipeline
3. Run a script containing the instructions for the pipeline

If you do not want to use the pre-built Docker image, skip to the next section ([**Build the Docker image yourself**](#build-the-docker-image-yourself)).

#### Use the pre-built Docker image

1. Pull the docker image

```Shell
docker pull ghcr.io/sage-bionetworks/recover-pipeline-i2b2:main
```

2. Run the docker container

```Shell
docker run \
  --name container-name \
  -e SYNAPSE_AUTH_TOKEN=$SYNAPSE_AUTH_TOKEN \
  -e ONTOLOGY_FILE_ID=<synapseID> \
  -e PARQUET_DIR_ID=<synapseID> \
  -e DATASET_NAME_FILTER=<string> \
  -e CONCEPT_REPLACEMENTS=<named-vector-in-parentheses> \
  -e CONCEPT_FILTER_COL=<concept-map-column-name> \
  -e SYN_FOLDER_ID=<synapseID> \
  ghcr.io/sage-bionetworks/recover-pipeline-i2b2:main
```

For an explanation of the various config parameters used in the pipeline, please see [Config Parameters](#config-parameters).
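
Once the container is running, you can monitor it from outside with standard Docker commands (using the `container-name` set in `docker run` above):

```Shell
# Stream the pipeline's log output as it runs
docker logs -f container-name

# After the container stops, check whether the pipeline exited cleanly (0 = success)
docker inspect container-name --format '{{.State.ExitCode}}'
```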

#### Build the Docker image yourself

1. Clone this repo

```Shell
git clone https://github.com/Sage-Bionetworks/recover-pipeline-i2b2.git
```

2. Build the docker image

```Shell
# Option 1: From the directory containing the Dockerfile
cd /path/to/dockerfile-directory/
docker build <optional-arguments> -t image-name .

# Option 2: From anywhere (point -f at the Dockerfile itself)
docker build <optional-arguments> -t image-name -f /path/to/Dockerfile .
```

3. Run the docker container

```Shell
docker run \
  --name container-name \
  -e SYNAPSE_AUTH_TOKEN=$SYNAPSE_AUTH_TOKEN \
  -e ONTOLOGY_FILE_ID=<synapseID> \
  -e PARQUET_DIR_ID=<synapseID> \
  -e DATASET_NAME_FILTER=<string> \
  -e CONCEPT_REPLACEMENTS=<named-vector-in-parentheses> \
  -e CONCEPT_FILTER_COL=<concept-map-column-name> \
  -e SYN_FOLDER_ID=<synapseID> \
  image-name
```

For an explanation of the various config parameters used in the pipeline, please see [Config Parameters](#config-parameters).
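
As an alternative to a long list of `-e` flags, Docker's standard `--env-file` option reads the same settings from a file. The file name here is just an illustration:

```Shell
# pipeline.env (hypothetical file), one KEY=value pair per line:
#   SYNAPSE_AUTH_TOKEN=<your-token>
#   ONTOLOGY_FILE_ID=<synapseID>
#   ...
docker run --name container-name --env-file pipeline.env image-name
```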

### Manual Job

If you would like to run the pipeline manually, please follow the instructions in this section.

1. Clone this repo

```Shell
git clone https://github.com/Sage-Bionetworks/recover-pipeline-i2b2.git
```

2. Modify the parameters in the [config](config/config.yml) as needed

3. Run [run-pipeline.R](pipeline/run-pipeline.R)
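
For example, from the root of the cloned repo, assuming R and the pipeline's package dependencies are installed:

```Shell
Rscript pipeline/run-pipeline.R
```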

### Config Parameters

The parameters used by the pipeline are set in [config/config.yml](config/config.yml).

Parameter | Definition | Example
---|---|---
| `ontologyFileID` | A Synapse ID for the i2b2 concepts map ontology file stored in Synapse. | syn12345678
| `parquetDirID` | A Synapse ID for a folder entity in Synapse where the input data is stored. This should be the folder housing the post-ETL parquet data. | syn12345678
| `concept_replacements` | A named vector of strings and their replacements. The names must be valid values of the `concept_filter_col` column of the `concept_map` data frame. For RECOVER, `concept_map` is the ontology file data frame. | c('mins' = 'minutes', 'avghr' = 'averageheartrate', 'spo2' = 'spo2\_', 'hrv' = 'hrv_dailyrmssd', 'restinghr' = 'restingheartrate', 'sleepbrth' = 'sleepsummarybreath')
| `concept_filter_col` | The column of the `concept_map` data frame that contains "approved concepts" (column names of dataset data frames that are not to be excluded). For RECOVER, `concept_map` is the ontology file data frame. | concept_cd
| `synFolderID` | A Synapse ID for a folder entity in Synapse where you want to store the final output files. | syn12345678
| `s3bucket` | The name of the S3 bucket containing input data. | recover-bucket
| `s3basekey` | The base key of the S3 bucket containing input data. | main/archive/2024-.../
| `downloadLocation` | The location to sync input files to. | ./parquet
| `selectedVarsFileID` | A Synapse ID for the CSV file listing which datasets and variables have been selected for use in this pipeline. | syn12345678
| `outputConceptsDir` | The location to save intermediate and final i2b2 summary files to. | ./output-concepts
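
The top-level `prod:` key in [config/config.yml](config/config.yml) is a profile name. Assuming the pipeline reads its settings with R's `config` package (which the layout of the file suggests), the active profile can be chosen with the standard `R_CONFIG_ACTIVE` environment variable before launching the run:

```Shell
export R_CONFIG_ACTIVE=prod
Rscript pipeline/run-pipeline.R
```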

2 changes: 1 addition & 1 deletion config/config.yml

```YAML
# (within the prod profile of config/config.yml)
"sleependtime" = "enddate")
concept_filter_col: CONCEPT_CD
synFolderID: syn52504335
# method: sts
s3bucket: recover-main-project
s3basekey: main/archive/2024-02-29/
downloadLocation: ./temp-parquet
```
