
Commit

Merge pull request #22 from pranavanba/main
Update README Instructions
pranavanba authored May 1, 2024
2 parents 8a3389a + 0fabbb7 commit 87c190b
Showing 2 changed files with 56 additions and 93 deletions.
147 changes: 55 additions & 92 deletions README.md

[![Build and publish a Docker image](https://github.com/Sage-Bionetworks/recover-parquet-external/actions/workflows/docker-build.yml/badge.svg?branch=main)](https://github.com/Sage-Bionetworks/recover-parquet-external/actions/workflows/docker-build.yml)

This repo hosts the code for the egress pipeline used by the Digital Health Data Repository (Sage Bionetworks) for data enrichment and summarization for i2b2.

## Requirements

A Synapse authentication token is required for use of the Synapse APIs.

## Usage

There are two methods to run this pipeline:
1) [**Docker container**](#docker-container), or
2) [**Manual Job**](#manual-job)

### Set Synapse Personal Access Token

Regardless of which method you use, you need to set your Synapse personal access token in your environment. See the examples below.

1. Option 1: For only the current shell session:

```Shell
export SYNAPSE_AUTH_TOKEN=<your-token>
```

2. Option 2: For all future shell sessions (modify your shell profile)

```Shell
# Open the profile file
nano ~/.bash_profile

# Append the following
SYNAPSE_AUTH_TOKEN=<your-token>
export SYNAPSE_AUTH_TOKEN

# Save the file
source ~/.bash_profile
```
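
Before running either method, it is worth confirming that the token is actually visible to new processes. A minimal check, assuming a bash-compatible shell:

```Shell
# Prints a confirmation without echoing the token itself
if [ -n "$SYNAPSE_AUTH_TOKEN" ]; then
  echo "SYNAPSE_AUTH_TOKEN is set (${#SYNAPSE_AUTH_TOKEN} characters)"
else
  echo "SYNAPSE_AUTH_TOKEN is NOT set"
fi
```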

### Docker Container

For the Docker method, there is a pre-published docker image available: [ghcr.io/sage-bionetworks/recover-pipeline-i2b2:main](https://github.com/orgs/Sage-Bionetworks/packages/container/package/recover-pipeline-i2b2)

The primary benefit of the Docker method is that the image published from this repo contains instructions to:

1. Create a computing environment with the dependencies needed by the machine running the pipeline
2. Install the packages needed to run the pipeline
3. Run a script containing the instructions for the pipeline

If you do not want to use the pre-built Docker image, skip to the next section ([**Build the Docker image yourself**](#build-the-docker-image-yourself)).

#### Use the pre-built Docker image

1. Pull the docker image

```Shell
docker pull ghcr.io/sage-bionetworks/recover-pipeline-i2b2:main
```

2. Run the docker container

```Shell
docker run \
  --name container-name \
  -e SYNAPSE_AUTH_TOKEN=$SYNAPSE_AUTH_TOKEN \
  -e ONTOLOGY_FILE_ID=<synapseID> \
  -e PARQUET_DIR_ID=<synapseID> \
  -e DATASET_NAME_FILTER=<string> \
  -e CONCEPT_REPLACEMENTS=<named-vector-in-parentheses> \
  -e CONCEPT_FILTER_COL=<concept-map-column-name> \
  -e SYN_FOLDER_ID=<synapseID> \
  ghcr.io/sage-bionetworks/recover-pipeline-i2b2:main
```

For an explanation of the various config parameters used in the pipeline, please see [Config Parameters](#config-parameters).
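
Once the container is running, you can monitor it from outside with standard Docker commands (using the `container-name` set in `docker run` above):

```Shell
# Stream the pipeline's log output as it runs
docker logs -f container-name

# After the container stops, check whether the pipeline exited cleanly (0 = success)
docker inspect container-name --format '{{.State.ExitCode}}'
```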

#### Build the Docker image yourself

1. Clone this repo

```Shell
git clone https://github.com/Sage-Bionetworks/recover-pipeline-i2b2.git
```

2. Build the docker image

```Shell
# Option 1: From the directory containing the Dockerfile
cd /path/to/dockerfile-directory/
docker build <optional-arguments> -t image-name .

# Option 2: From anywhere (point -f at the Dockerfile itself)
docker build <optional-arguments> -t image-name -f /path/to/Dockerfile .
```

3. Run the docker container

```Shell
docker run \
  --name container-name \
  -e SYNAPSE_AUTH_TOKEN=$SYNAPSE_AUTH_TOKEN \
  -e ONTOLOGY_FILE_ID=<synapseID> \
  -e PARQUET_DIR_ID=<synapseID> \
  -e DATASET_NAME_FILTER=<string> \
  -e CONCEPT_REPLACEMENTS=<named-vector-in-parentheses> \
  -e CONCEPT_FILTER_COL=<concept-map-column-name> \
  -e SYN_FOLDER_ID=<synapseID> \
  image-name
```

For an explanation of the various config parameters used in the pipeline, please see [Config Parameters](#config-parameters).
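
As an alternative to a long list of `-e` flags, Docker's standard `--env-file` option reads the same settings from a file. The file name here is just an illustration:

```Shell
# pipeline.env (hypothetical file), one KEY=value pair per line:
#   SYNAPSE_AUTH_TOKEN=<your-token>
#   ONTOLOGY_FILE_ID=<synapseID>
#   ...
docker run --name container-name --env-file pipeline.env image-name
```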

### Manual Job

If you would like to run the pipeline manually, please follow the instructions in this section.

1. Clone this repo

```Shell
git clone https://github.com/Sage-Bionetworks/recover-pipeline-i2b2.git
```

2. Modify the parameters in the [config](config/config.yml) as needed

3. Run [run-pipeline.R](pipeline/run-pipeline.R)
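
For example, from the root of the cloned repo, assuming R and the pipeline's package dependencies are installed:

```Shell
Rscript pipeline/run-pipeline.R
```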

### Config Parameters

The parameters used by the pipeline are set in [config/config.yml](config/config.yml).

Parameter | Definition | Example
---|---|---
| `ontologyFileID` | A Synapse ID for the i2b2 concepts map ontology file stored in Synapse. | syn12345678
| `parquetDirID` | A Synapse ID for a folder entity in Synapse where the input data is stored. This should be the folder housing the post-ETL parquet data. | syn12345678
| `concept_replacements` | A named vector of strings and their replacements. The names must be valid values of the `concept_filter_col` column of the `concept_map` data frame. For RECOVER, `concept_map` is the ontology file data frame. | c('mins' = 'minutes', 'avghr' = 'averageheartrate', 'spo2' = 'spo2\_', 'hrv' = 'hrv_dailyrmssd', 'restinghr' = 'restingheartrate', 'sleepbrth' = 'sleepsummarybreath')
| `concept_filter_col` | The column of the `concept_map` data frame that contains "approved concepts" (column names of dataset data frames that are not to be excluded). For RECOVER, `concept_map` is the ontology file data frame. | concept_cd
| `synFolderID` | A Synapse ID for a folder entity in Synapse where you want to store the final output files. | syn12345678
| `s3bucket` | The name of the S3 bucket containing input data. | recover-bucket
| `s3basekey` | The base key of the S3 bucket containing input data. | main/archive/2024-.../
| `downloadLocation` | The location to sync input files to. | ./parquet
| `selectedVarsFileID` | A Synapse ID for the CSV file listing which datasets and variables have been selected for use in this pipeline. | syn12345678
| `outputConceptsDir` | The location to save intermediate and final i2b2 summary files to. | ./output-concepts
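
The top-level `prod:` key in [config/config.yml](config/config.yml) is a profile name. Assuming the pipeline reads its settings with R's `config` package (which the layout of the file suggests), the active profile can be chosen with the standard `R_CONFIG_ACTIVE` environment variable before launching the run:

```Shell
export R_CONFIG_ACTIVE=prod
Rscript pipeline/run-pipeline.R
```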

2 changes: 1 addition & 1 deletion config/config.yml

```YAML
# (within the prod profile of config/config.yml)
"sleependtime" = "enddate")
concept_filter_col: CONCEPT_CD
synFolderID: syn52504335
# method: sts
s3bucket: recover-main-project
s3basekey: main/archive/2024-02-29/
downloadLocation: ./temp-parquet
```
