This Nextflow workflow is designed to process a sample sheet (`samplesheet.csv`), retrieve files from Synapse based on `entityId`, and upload them to an AWS S3 bucket. The workflow consists of three main steps:
- `SAMPLESHEET_SPLIT`: Filters and samples rows from the input CSV based on file size and retrieves the relevant metadata.
- `SYNAPSE_GET`: Downloads the files from Synapse using the `entityId` from the sample sheet.
- `CDS_UPLOAD`: Uploads the downloaded files to a specified AWS S3 bucket.
- `params.input`: The path to the sample sheet CSV file (`samplesheet.csv`). This file should include columns like `entityId`, `file_url_in_cds`, and `File_Size`.
The `SAMPLESHEET_SPLIT` workflow reads the input CSV file, filters the rows based on file size, randomly samples five rows, and maps the necessary metadata.

- `samplesheet.csv`: A CSV file with columns including `entityId`, `file_url_in_cds`, and `File_Size`.
- Filters rows where `File_Size` is less than 50 MB.
- Randomly samples 5 rows from the filtered data.
- Maps `entityId` and `aws_uri` (the file's destination URL in S3) from the CSV.
- A set of tuples containing `entityid` and `aws_uri`.
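The steps described above could be sketched as a DSL2 subworkflow. This is an assumed shape, not the actual implementation; in particular, it assumes `File_Size` is given in bytes:

```nextflow
// Sketch of SAMPLESHEET_SPLIT; column names taken from the docs above,
// File_Size assumed to be in bytes
workflow SAMPLESHEET_SPLIT {
    take:
    samplesheet   // path to samplesheet.csv

    main:
    rows = Channel
        .fromPath( samplesheet )
        .splitCsv( header: true )
        .filter { row -> (row.File_Size as long) < 50 * 1024 * 1024 }  // < 50 MB
        .randomSample( 5 )
        .map { row -> tuple( row.entityId, row.file_url_in_cds ) }     // (entityid, aws_uri)

    emit:
    rows
}
```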
The `SYNAPSE_GET` process downloads files from Synapse based on the `entityId` retrieved from the sample sheet.
- `entityid`: The Synapse `entityId` used to identify and download files.
- A tuple containing the `meta` information and the downloaded file path.
- Requires a Synapse authentication token (`SYNAPSE_AUTH_TOKEN`).
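A process with this interface might look like the following sketch. The container tag and output glob are assumptions, not the actual definitions; `synapseclient` picks up `SYNAPSE_AUTH_TOKEN` from the environment:

```nextflow
// Sketch of SYNAPSE_GET (assumed shape; container tag and output glob are guesses)
process SYNAPSE_GET {
    container 'sagebionetworks/synapsepythonclient:latest'
    secret 'SYNAPSE_AUTH_TOKEN'

    input:
    val meta

    output:
    tuple val(meta), path('*')

    script:
    """
    # synapseclient reads SYNAPSE_AUTH_TOKEN from the environment
    synapse get ${meta.entityid}
    """
}
```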
The `CDS_UPLOAD` process uploads the files downloaded from Synapse to an AWS S3 bucket using the AWS CLI.
- `meta`: The metadata associated with the file, including the destination S3 URI.
- `entity`: The path to the file that will be uploaded.
- A tuple containing the `meta` information and the file path after upload.
- Requires AWS credentials (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and optionally `AWS_SESSION_TOKEN`). Set these as Nextflow secrets, for example:

```
nextflow secrets set AWS_ACCESS_KEY_ID <your_access_key_id>
```
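A process with this interface might look like the following sketch, assuming the AWS secrets above have been registered; the container tag is a guess:

```nextflow
// Sketch of CDS_UPLOAD (assumed shape; container tag is a guess)
process CDS_UPLOAD {
    container 'amazon/aws-cli:latest'
    secret 'AWS_ACCESS_KEY_ID'
    secret 'AWS_SECRET_ACCESS_KEY'

    input:
    tuple val(meta), path(entity)

    output:
    tuple val(meta), path(entity)

    script:
    """
    aws s3 cp ${entity} ${meta.aws_uri}
    """
}
```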
- Ensure Nextflow is installed.
- Ensure you have access to the necessary containers (`synapseclient`, `awscli`).
- Ensure you have the appropriate credentials for Synapse and AWS.
Run the workflow with the following command:

```
nextflow run main.nf --input path/to/samplesheet.csv
```

For example, if the sample sheet is in the current directory:

```
nextflow run main.nf --input samplesheet.csv
```
- The final output will be the files successfully uploaded to the specified AWS S3 bucket.
The following environment variables should be set with your credentials:

- `SYNAPSE_AUTH_TOKEN`: Synapse authentication token.
- `AWS_ACCESS_KEY_ID`: AWS access key ID.
- `AWS_SECRET_ACCESS_KEY`: AWS secret access key.
- `AWS_SESSION_TOKEN` (optional): AWS session token for temporary credentials.
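If you export them directly in your shell (as an alternative to Nextflow secrets), the setup looks like this; the values below are placeholders to substitute with your own:

```shell
# Example credential setup (placeholder values -- substitute your own)
export SYNAPSE_AUTH_TOKEN="your-synapse-token"
export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"
export AWS_SESSION_TOKEN="your-session-token"   # optional, for temporary credentials
```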
This workflow is provided as-is without any warranties. Modify and use it at your own risk.
This documentation should provide you with a clear understanding of how the workflow operates, the inputs it requires, and how to run it effectively.