Merge pull request #115 from naobservatory/dev
v2.5.2
willbradshaw authored Nov 27, 2024
2 parents 8c6809d + 3e72a7e commit b75ddc6
Showing 75 changed files with 862 additions and 209 deletions.
46 changes: 46 additions & 0 deletions .github/workflows/end-to-end.yml
@@ -0,0 +1,46 @@
name: End-to-end MGS workflow test

on: [pull_request]

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Set up JDK 11
        uses: actions/setup-java@v4
        with:
          java-version: '11'
          distribution: 'adopt'

      - name: Setup Nextflow latest-edge
        uses: nf-core/setup-nextflow@v1
        with:
          version: "latest-edge"

      - name: Install nf-test
        run: |
          wget -qO- https://get.nf-test.com | bash
          sudo mv nf-test /usr/local/bin/
      - name: Run index workflow
        run: nf-test test --tag index --verbose

      - name: Clean docker for more space
        run: |
          docker kill $(docker ps -q) 2>/dev/null || true
          docker rm $(docker ps -a -q) 2>/dev/null || true
          docker rmi $(docker images -q) -f 2>/dev/null || true
          docker system prune -af --volumes
      - name: Clean up nf-test dir
        run: sudo rm -rf .nf-test

      - name: Run run workflow
        run: nf-test test --tag run --verbose

      - name: Run run_validation workflow
        run: nf-test test --tag validation --verbose
3 changes: 3 additions & 0 deletions .gitignore
@@ -7,3 +7,6 @@ test/output
test/.nextflow*
*.Rhistory
pipeline_report.txt

.nf-test/
.nf-test.log
28 changes: 27 additions & 1 deletion CHANGELOG.md
@@ -1,3 +1,29 @@
# v2.5.2
- Changes to default read filtering:
- Relaxed FASTP quality filtering (`--cut_mean_quality` and `--average_qual` reduced from 25 to 20).
- Relaxed BBDUK viral filtering (switched from requiring three 21-mer matches to a single 24-mer match).
- Overhauled BLAST validation functionality:
- BLAST now runs on forward and reverse reads independently
- BLAST output filtering no longer assumes specific filename suffixes
- Paired BLAST output includes more information
- RUN_VALIDATION can now directly take in FASTA files instead of a virus read DB
- Fixed issues with publishing BLAST output under new Nextflow version
- Implemented nf-test for end-to-end testing of pipeline functionality
- Implemented test suite in `tests/main.nf.test`
- Reconfigured INDEX workflow to enable generation of miniature index directories for testing
- Added GitHub Actions workflow in `.github/workflows/end-to-end.yml`
- Pull requests will now fail if any of INDEX, RUN, or RUN_VALIDATION crashes when run on test data.
- Generated first version of new, curated test dataset for testing RUN workflow. Samplesheet and config file are available in `test-data`. The previous test dataset in `test` has been removed.
- Implemented S3 auto-cleanup:
- Added tags to published files to facilitate S3 auto-cleanup
- Added S3 lifecycle configuration file to `ref`, along with a script in `bin` to add it to an S3 bucket
- Minor changes
- Added logic to check whether the `grouping` variable in `nextflow.config` matches the input samplesheet; if it doesn't, the pipeline throws an error.
- Externalized resource specifications to `resources.config`, removing hardcoded CPU/memory values
- Renamed `index-params.json` to `params-index.json` to avoid a clash with GitHub Actions
- Removed redundant subsetting statement from TAXONOMY workflow.
- Added `--group_across_illumina_lanes` option to `generate_samplesheet.sh`

# v2.5.1
- Enabled extraction of BBDuk-subset putatively-host-viral raw reads for downstream chimera detection.
- Added back viral read fields accidentally being discarded by COLLAPSE_VIRUS_READS.
@@ -16,7 +42,7 @@
- Reconfigured QC subworkflow to run FASTQC and MultiQC on each pair of input files separately (fixes bug arising from allowing arbitrary filenames for forward and reverse read files).

# v2.4.0
- Created a new output directory called `logging` for log files.
- Added the trace file from Nextflow to the `logging` directory, which can be used to understand CPU and memory usage as well as other information such as runtime. After running the pipeline, `plot-timeline-script.R` can be used to generate a useful summary plot of the runtime for each process in the pipeline.
- Removed CONCAT_GZIPPED.
- Replaced the sample input format with something more similar to nf-core, called `samplesheet.csv`. This new input file can be generated using the script `generate_samplesheet.sh`.
63 changes: 49 additions & 14 deletions README.md
@@ -179,6 +179,7 @@ To run this workflow with full functionality, you need access to the following d
2. **Docker:** To install Docker Engine for command-line use, follow the installation instructions available [here](https://docs.docker.com/engine/install/) (or [here](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/install-docker.html) for installation on an AWS EC2 instance).
3. **AWS CLI:** If not already installed, install the AWS CLI by following the instructions available [here](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).
4. **Git:** To install the Git version control tool, follow the installation instructions available [here](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git).
5. **nf-test**: To install nf-test, follow the install instructions available [here](https://www.nf-test.com/docs/getting-started/).
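   A minimal local install sketch, mirroring the commands used in the CI workflow above (adjust the install location as needed):

   ```
   wget -qO- https://get.nf-test.com | bash   # download the nf-test launcher
   sudo mv nf-test /usr/local/bin/            # put it on your PATH
   nf-test version                            # confirm the install
   ```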

#### 2. Configure AWS & Docker

@@ -245,35 +246,42 @@ Wait for the workflow to run to completion; this is likely to take several hours

### Testing & validation

To confirm that the pipeline works in your hands, we provide a small test dataset (`test/raw`) to run through the run workflow. This can be used to test any of the pipeline profiles described above.
To confirm that the pipeline works in your hands, we provide a small test dataset (`s3://nao-testing/gold-standard-test/raw/`) to run through the run workflow. This can be used to test any of the pipeline profiles described above.

If your EC2 instance has the resources to handle it, the simplest way to start using the pipeline is to run the test data through it locally on that instance (i.e. without using S3). To do this:

1. Navigate to the `test` directory.
2. Edit `nextflow.config` to set `params.ref_dir` to the index directory you chose or created above (specifically `PATH_TO_REF_DIR/output`).
3. Still within the `test` directory, run `nextflow run -profile ec2_local .. -resume`.
4. Wait for the workflow to finish. Inspect the `output` directory to view the processed output files.
1. Create a new directory outside the repo directory and copy over the run workflow config file as `nextflow.config` in that directory:

```
mkdir launch
cd launch
cp REPO_DIR/configs/run.config nextflow.config
```

2. Edit `nextflow.config` to set `params.ref_dir` to the index directory you chose or created above (specifically `PATH_TO_REF_DIR/output`).
3. In the same file, set the samplesheet path (`params.sample_sheet`) to the test dataset samplesheet: `${projectDir}/test-data/samplesheet.csv`.
4. Within this directory, run `nextflow run REPO_DIR -profile ec2_local -resume` (where `REPO_DIR` is the path to your local copy of this repository). Wait for the workflow to finish.
5. Inspect the `output` directory to view the processed output files.
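Put together, a local test launch might look like the following sketch (paths are placeholders: `REPO_DIR` is your local clone of this repository and `PATH_TO_REF_DIR` is the index directory from above):

```
# Set up a clean launch directory outside the repo
mkdir launch && cd launch
cp REPO_DIR/configs/run.config nextflow.config

# Edit nextflow.config so that:
#   params.ref_dir      points to PATH_TO_REF_DIR/output
#   params.sample_sheet points to REPO_DIR/test-data/samplesheet.csv

# Launch the run workflow locally, resuming any cached work
nextflow run REPO_DIR -profile ec2_local -resume

# Inspect the processed output files
ls output/
```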

If this is successful, the next level of complexity is to run the workflow with a working directory on S3. To do this:

1. Within the `test` directory, edit `nextflow.config` to set `params.base_dir` to the S3 directory of your choice.
2. Still within that directory, run `nextflow run -profile ec2_s3 .. -resume`.
1. Edit `nextflow.config` to set `params.base_dir` to the S3 directory of your choice.
2. Still within that directory, run `nextflow run REPO_DIR -profile ec2_s3 -resume`.
3. Wait for the workflow to finish, and inspect the output on S3.

Finally, you can run the test dataset through the pipeline on AWS Batch. To do this, configure Batch as described [here](https://data.securebio.org/wills-public-notebook/notebooks/2024-06-11_batch.html) (steps 1-3), then:

1. Within the `test` directory, edit `nextflow.config` to set `params.base_dir` to a different S3 directory of your choice and `process.queue` to the name of your Batch job queue.
2. Still within that directory, run `nextflow run -profile batch .. -resume` (or simply `nextflow run .. -resume`).
1. Edit `nextflow.config` to set `params.base_dir` to a different S3 directory of your choice and `process.queue` to the name of your Batch job queue.
2. Still within that directory, run `nextflow run REPO_DIR -profile batch -resume` (or simply `nextflow run REPO_DIR -resume`).
3. Wait for the workflow to finish, and inspect the output on S3.

### Running on new data

To run the workflow on another dataset, you need:

1. Accessible raw data files in Gzipped FASTQ format, named appropriately.
2. A sample sheet file specifying the samples, along with paths to the forward and reverse read files for each sample.
2. A sample sheet file specifying the samples, along with paths to the forward and reverse read files for each sample. `generate_samplesheet.sh` (see below) can make this for you.
3. A config file in a clean launch directory, pointing to:
- The directory containing the raw data (`params.raw_dir`).
- The base directory in which to put the working and output directories (`params.base_dir`).
- The directory containing the outputs of the reference workflow (`params.ref_dir`).
- The sample sheet (`params.sample_sheet`).
@@ -285,18 +293,45 @@ To run the workflow on another dataset, you need:
> - Second column: Path to FASTQ file 1 which should be the forward read for this sample
> - Third column: Path to FASTQ file 2 which should be the reverse read for this sample
>
> The easiest way to get this file is by using the `generate_samplesheet.sh` script. As input, this script takes a path to raw FASTQ files (`dir_path`), and forward (`forward_suffix`) and reverse (`reverse_suffix`) read suffixes, both of which support regex, and an optional output path (`output_path`). Those using data from s3 should make sure to pass the `s3` parameter. Those who would like to group samples by some metadata can pass a path to a CSV file containing a header column named `sample,group`, where each row gives the sample name and the group to group by (`group_file`) or edit the samplesheet manually after generation (since manually editing the samplesheet will be easier when the groups CSV isn't readily available). As output, the script generates a CSV file named (`samplesheet.csv` by default), which can be used as input for the pipeline.
> The easiest way to get this file is by using the `generate_samplesheet.sh` script. As input, this script takes a path to the raw FASTQ files (`dir_path`), forward (`forward_suffix`) and reverse (`reverse_suffix`) read suffixes (both of which support regex), and an optional output path (`output_path`). Those using data from S3 should make sure to pass the `--s3` flag. To group samples by some metadata, you can either pass a CSV file with the header `sample,group`, where each row gives a sample name and the group it belongs to (`group_file`); edit the samplesheet manually after generation (often the easiest option when a groups CSV isn't readily available); or pass the `--group_across_illumina_lanes` option if each library was split across multiple lanes of a single Illumina flowcell. As output, the script generates a CSV file (named `samplesheet.csv` by default), which can be used as input for the pipeline.
>
> For example:
> ```
> ../bin/generate_samplesheet.sh \
>     --s3 \
>     --dir_path s3://nao-restricted/MJ-2024-10-21/raw/ \
>     --forward_suffix _1 \
>     --reverse_suffix _2
> ```
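> To group lanes of the same library instead, leave the lane token out of the suffixes (so it remains part of the sample name) and pass `--group_across_illumina_lanes`. A sketch with a hypothetical bucket and suffixes:
> ```
> ../bin/generate_samplesheet.sh \
>     --s3 \
>     --dir_path s3://my-bucket/raw/ \
>     --forward_suffix _R1_001 \
>     --reverse_suffix _R2_001 \
>     --group_across_illumina_lanes
> ```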
If running on Batch, a good process for starting the pipeline on a new dataset is as follows:
1. Process the raw data to have appropriate filenames (see above) and deposit it in an accessible S3 directory.
2. Create a clean launch directory and copy `configs/run.config` to a file named `nextflow.config` in that directory.
3. Create a library metadata file in that launch directory, specifying library/sample mappings and any other metadata (see above).
3. Create a sample sheet in that launch directory (see above)
4. Edit `nextflow.config` to specify each item in `params` as appropriate, as well as setting `process.queue` to the appropriate Batch queue.
5. Run `nextflow run PATH_TO_REPO_DIR -resume`.
6. Navigate to `{params.base_dir}/output` to view and download output files.
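As a condensed sketch of that process (bucket paths, suffixes, and the queue name below are placeholders; `REPO_DIR` is your local clone of this repository):
```
# Steps 2-3: clean launch directory with a copy of the run config and a samplesheet
mkdir my-dataset && cd my-dataset
cp REPO_DIR/configs/run.config nextflow.config
REPO_DIR/bin/generate_samplesheet.sh \
    --s3 \
    --dir_path s3://my-bucket/my-dataset/raw/ \
    --forward_suffix _R1_001 \
    --reverse_suffix _R2_001

# Step 4: edit nextflow.config to set params.raw_dir, params.base_dir, params.ref_dir,
# params.sample_sheet, and process.queue (your Batch job queue)

# Steps 5-6: launch on Batch, then inspect the output directory under params.base_dir
nextflow run REPO_DIR -resume
aws s3 ls s3://my-bucket/my-dataset/output/
```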
## Run tests using `nf-test` before making pull requests
During the development process, we now request that users run the pipeline with `nf-test` locally before making pull requests (a test will also run automatically on the PR, but it's often useful to run it locally first). To do this, make sure you have a big enough EC2 instance; we recommend an `m5.xlarge` with at least 32 GB of EBS storage, as this closely reflects the VMs used by GitHub Actions. Once you have an instance, run `nf-test test tests/main.nf.test`, which will run all workflows of the pipeline and check that they run to completion. To run a specific workflow, use one of the following commands:
```
nf-test test --tag index      # Runs the index workflow
nf-test test --tag run        # Runs the run workflow
nf-test test --tag validation # Runs the validation workflow
```
Importantly, make sure to periodically delete Docker images to free up space on your instance. You can do this by running the following commands, although note that they will stop all running containers and delete all Docker containers, images, and volumes:
```
docker kill $(docker ps -q) 2>/dev/null || true
docker rm $(docker ps -a -q) 2>/dev/null || true
docker rmi $(docker images -q) -f 2>/dev/null || true
docker system prune -af --volumes
```
# Troubleshooting
When attempting to run a released version of the pipeline, the most common sources of errors are AWS permission issues. Before debugging a persistent error in-depth, make sure that you have all the permissions specified in Step 0 of [our Batch workflow guide](https://data.securebio.org/wills-public-notebook/notebooks/2024-06-11_batch.html). Next, make sure Nextflow has access to your AWS credentials, such as by running `eval "$(aws configure export-credentials --format env)"`.
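For example, a quick sanity check of your AWS identity and exported credentials before launching (assuming the AWS CLI is configured):
```
aws sts get-caller-identity                              # confirm which AWS identity is active
eval "$(aws configure export-credentials --format env)"  # export credentials for Nextflow
env | grep '^AWS_' | cut -d= -f1                         # verify the credential variables are set
```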
89 changes: 89 additions & 0 deletions bin/apply-lifecycle-rules.py
@@ -0,0 +1,89 @@
#!/usr/bin/env python3

import argparse
import json
import boto3
import sys
from botocore.exceptions import ClientError

def load_lifecycle_config(config_path):
    try:
        with open(config_path, 'r') as f:
            return json.load(f)
    except json.JSONDecodeError:
        print(f"Error: {config_path} contains invalid JSON")
        sys.exit(1)
    except FileNotFoundError:
        print(f"Error: Could not find file {config_path}")
        sys.exit(1)

def print_lifecycle_rules(rules):
    if not rules:
        print("No lifecycle rules configured")
        return

    for rule in rules:
        print(f"- {rule['ID']}")
        print(f"  Status: {rule['Status']}")
        if 'Expiration' in rule:
            print(f"  Expiration: {rule['Expiration'].get('Days', 'N/A')} days")
        print()

def get_current_rules(s3, bucket_name):
    try:
        response = s3.get_bucket_lifecycle_configuration(Bucket=bucket_name)
        return response.get('Rules', [])
    except ClientError as e:
        if e.response['Error']['Code'] == 'NoSuchLifecycleConfiguration':
            return []
        raise

def apply_lifecycle_rules(bucket_name, lifecycle_config):
    s3 = boto3.client('s3')

    try:
        # First verify the bucket exists and we have access
        s3.head_bucket(Bucket=bucket_name)

        # Show current configuration
        print(f"\nCurrent lifecycle rules for bucket {bucket_name}:")
        current_rules = get_current_rules(s3, bucket_name)
        print_lifecycle_rules(current_rules)

        # Apply the new configuration
        s3.put_bucket_lifecycle_configuration(
            Bucket=bucket_name,
            LifecycleConfiguration=lifecycle_config
        )
        print(f"\nSuccessfully applied new lifecycle rules to bucket: {bucket_name}")

        # Show the updated configuration
        print("\nUpdated lifecycle rules:")
        new_rules = get_current_rules(s3, bucket_name)
        print_lifecycle_rules(new_rules)

    except ClientError as e:
        error_code = e.response.get('Error', {}).get('Code', 'Unknown')
        if error_code == '404':
            print(f"Error: Bucket {bucket_name} does not exist")
        elif error_code == '403':
            print(f"Error: Permission denied for bucket {bucket_name}")
        else:
            print(f"Error applying lifecycle rules: {str(e)}")
        sys.exit(1)

def main():
    parser = argparse.ArgumentParser(description='Apply S3 lifecycle rules to a bucket')
    parser.add_argument('config_file', help='Path to lifecycle configuration JSON file')
    parser.add_argument('bucket_name', help='Name of the S3 bucket')

    args = parser.parse_args()

    # Load the configuration
    lifecycle_config = load_lifecycle_config(args.config_file)

    # Apply the rules
    apply_lifecycle_rules(args.bucket_name, lifecycle_config)

if __name__ == '__main__':
    main()
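A usage sketch for the script above (the JSON filename under `ref/` and the bucket name are hypothetical; requires `boto3` and credentials permitting `s3:PutLifecycleConfiguration` on the bucket):
```
python3 bin/apply-lifecycle-rules.py ref/lifecycle-rules.json my-output-bucket
```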
30 changes: 27 additions & 3 deletions bin/generate_samplesheet.sh
@@ -1,5 +1,7 @@
#!/bin/bash

set -u
set -e

##### Input parameters #####

@@ -10,7 +12,7 @@ reverse_suffix=""
s3=0
output_path="samplesheet.csv" # Default output path
group_file="" # Optional parameter for the group file

group_across_illumina_lanes=false

# Parse command-line arguments
while [[ $# -gt 0 ]]; do
@@ -39,6 +41,10 @@ while [[ $# -gt 0 ]]; do
            group_file="$2"
            shift 2
            ;;
        --group_across_illumina_lanes)
            group_across_illumina_lanes=true
            shift
            ;;
        *)
            echo "Unknown option: $1"
            exit 1
@@ -58,6 +64,13 @@ if [[ -z "$dir_path" || -z "$forward_suffix" || -z "$reverse_suffix" ]]; then
    echo -e "  --s3                   Use if files are stored in S3 bucket"
    echo -e "  --output_path <path>   Output path for samplesheet [default: samplesheet.csv]"
    echo -e "  --group_file <path>    Path to group file for sample grouping [header column must have the names 'sample,group' in that order; additional columns may be included, however they will be ignored by the script]"
    echo -e "  --group_across_illumina_lanes    Create groups by assuming that files that differ only by a terminal _Lnnn are the same library split across multiple lanes."
    exit 1
fi

if $group_across_illumina_lanes && [[ -n "$group_file" ]]; then
    echo "Provide at most one of --group_file and --group_across_illumina_lanes"
    exit 1
fi

Expand All @@ -69,11 +82,12 @@ echo "reverse_suffix: $reverse_suffix"
echo "s3: $s3"
echo "output_path: $output_path"
echo "group_file: $group_file"
echo "group_across_illumina_lanes: $group_across_illumina_lanes"


#### EXAMPLES ####

# dir_path="" # Cannot share this as it's restricted, but imagine the read looks like this
# dir_path="" # Cannot share this as it's restricted, but imagine the read looks like this
# forward_suffix="_S[0-9]_L[0-9][0-9][0-9]_R1_001"
# reverse_suffix="_S[0-9]_L[0-9][0-9][0-9]_R2_001"
# s3=1
@@ -125,6 +139,17 @@ if [[ -n "$group_file" ]]; then
    # Perform left join with group file
    awk -F',' 'NR==FNR{a[$1]=$2; next} FNR==1{print $0",group"} FNR>1{print $0","(a[$1]?a[$1]:"NA")}' "$group_file" "$temp_samplesheet" > "$output_path"
    echo "CSV file '$output_path' has been created with group information."
elif $group_across_illumina_lanes; then
    cat "$temp_samplesheet" | tr ',' ' ' | \
        while read sample fastq_1 fastq_2; do
            if [[ $sample = "sample" ]]; then
                echo $sample $fastq_1 $fastq_2 "group"
            else
                echo $sample $fastq_1 $fastq_2 \
                    $(echo "$sample" | sed 's/_L[0-9][0-9][0-9]$//')
            fi
        done | tr ' ' ',' > "$output_path"
    echo "CSV file '$output_path' has been created with grouping across illumina lanes."
else
# If no group file, just use the temporary samplesheet as the final output
mv "$temp_samplesheet" "$output_path"
@@ -133,4 +158,3 @@ fi

# Remove temporary file if it still exists
[ -f "$temp_samplesheet" ] && rm "$temp_samplesheet"
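To illustrate the lane-grouping rule added above (a sketch, not part of the diff): the group is simply the sample name with any terminal lane token stripped, so lanes of the same library collapse into one group.
```
# Hypothetical sample names as they would appear in the samplesheet
for sample in LIB01_L001 LIB01_L002 LIB02_L001; do
    echo "$sample -> $(echo "$sample" | sed 's/_L[0-9][0-9][0-9]$//')"
done
# LIB01_L001 -> LIB01
# LIB01_L002 -> LIB01
# LIB02_L001 -> LIB02
```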

