Merge pull request #115 from naobservatory/dev
v2.5.2
willbradshaw authored Nov 27, 2024
2 parents 8c6809d + 3e72a7e commit b75ddc6
Showing 75 changed files with 862 additions and 209 deletions.
46 changes: 46 additions & 0 deletions .github/workflows/end-to-end.yml
@@ -0,0 +1,46 @@
name: End-to-end MGS workflow test

on: [pull_request]

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Set up JDK 11
        uses: actions/setup-java@v4
        with:
          java-version: '11'
          distribution: 'adopt'

      - name: Setup Nextflow latest-edge
        uses: nf-core/setup-nextflow@v1
        with:
          version: "latest-edge"

      - name: Install nf-test
        run: |
          wget -qO- https://get.nf-test.com | bash
          sudo mv nf-test /usr/local/bin/
      - name: Run index workflow
        run: nf-test test --tag index --verbose

      - name: Clean docker for more space
        run: |
          docker kill $(docker ps -q) 2>/dev/null || true
          docker rm $(docker ps -a -q) 2>/dev/null || true
          docker rmi $(docker images -q) -f 2>/dev/null || true
          docker system prune -af --volumes
      - name: Clean up nf-test dir
        run: sudo rm -rf .nf-test

      - name: Run run workflow
        run: nf-test test --tag run --verbose

      - name: Run run_validation workflow
        run: nf-test test --tag validation --verbose
3 changes: 3 additions & 0 deletions .gitignore
@@ -7,3 +7,6 @@ test/output
test/.nextflow*
*.Rhistory
pipeline_report.txt

.nf-test/
.nf-test.log
28 changes: 27 additions & 1 deletion CHANGELOG.md
@@ -1,3 +1,29 @@
# v2.5.2
- Changes to default read filtering:
- Relaxed FASTP quality filtering (`--cut_mean_quality` and `--average_qual` reduced from 25 to 20).
- Relaxed BBDUK viral filtering (switched from requiring three 21-mer matches to a single 24-mer match).
- Overhauled BLAST validation functionality:
- BLAST now runs on forward and reverse reads independently
- BLAST output filtering no longer assumes specific filename suffixes
- Paired BLAST output includes more information
- RUN_VALIDATION can now directly take in FASTA files instead of a virus read DB
- Fixed issues with publishing BLAST output under new Nextflow version
- Implemented nf-test for end-to-end testing of pipeline functionality
- Implemented test suite in `tests/main.nf.test`
- Reconfigured INDEX workflow to enable generation of miniature index directories for testing
- Added GitHub Actions workflow in `.github/workflows/end-to-end.yml`
- Pull requests will now fail if any of INDEX, RUN, or RUN_VALIDATION crashes when run on test data.
- Generated first version of new, curated test dataset for testing RUN workflow. Samplesheet and config file are available in `test-data`. The previous test dataset in `test` has been removed.
- Implemented S3 auto-cleanup:
- Added tags to published files to facilitate S3 auto-cleanup
- Added S3 lifecycle configuration file to `ref`, along with a script in `bin` to add it to an S3 bucket
- Minor changes
- Added logic to check whether the `grouping` variable in `nextflow.config` matches the input samplesheet; if it doesn't, the pipeline throws an error.
- Externalized resource specifications to `resources.config`, removing hardcoded CPU/memory values
- Renamed `index-params.json` to `params-index.json` to avoid a clash with GitHub Actions
- Removed redundant subsetting statement from TAXONOMY workflow.
- Added `--group_across_illumina_lanes` option to `generate_samplesheet.sh`

# v2.5.1
- Enabled extraction of BBDuk-subset putatively-host-viral raw reads for downstream chimera detection.
- Added back viral read fields accidentally being discarded by COLLAPSE_VIRUS_READS.
@@ -16,7 +42,7 @@
- Reconfigured QC subworkflow to run FASTQC and MultiQC on each pair of input files separately (fixes bug arising from allowing arbitrary filenames for forward and reverse read files).

# v2.4.0
- Created a new output directory called `logging` for log files.
- Added the trace file from Nextflow to the `logging` directory, which can be used to understand CPU and memory usage as well as other information such as runtime. After running the pipeline, `plot-timeline-script.R` can be used to generate a useful summary plot of the runtime for each process in the pipeline.
- Removed CONCAT_GZIPPED.
- Replaced the sample input format with something more similar to nf-core, called `samplesheet.csv`. This new input file can be generated using the script `generate_samplesheet.sh`.
63 changes: 49 additions & 14 deletions README.md
@@ -179,6 +179,7 @@ To run this workflow with full functionality, you need access to the following d
2. **Docker:** To install Docker Engine for command-line use, follow the installation instructions available [here](https://docs.docker.com/engine/install/) (or [here](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/install-docker.html) for installation on an AWS EC2 instance).
3. **AWS CLI:** If not already installed, install the AWS CLI by following the instructions available [here](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).
4. **Git:** To install the Git version control tool, follow the installation instructions available [here](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git).
5. **nf-test**: To install nf-test, follow the install instructions available [here](https://www.nf-test.com/docs/getting-started/).
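   A minimal local install sketch, mirroring the commands used in the CI workflow above (adjust the install location as needed):

   ```
   wget -qO- https://get.nf-test.com | bash   # download the nf-test launcher
   sudo mv nf-test /usr/local/bin/            # put it on your PATH
   nf-test version                            # confirm the install
   ```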

#### 2. Configure AWS & Docker

@@ -245,35 +246,42 @@ Wait for the workflow to run to completion; this is likely to take several hours

### Testing & validation

To confirm that the pipeline works in your hands, we provide a small test dataset (`test/raw`) to run through the run workflow. This can be used to test any of the pipeline profiles described above.
To confirm that the pipeline works in your hands, we provide a small test dataset (`s3://nao-testing/gold-standard-test/raw/`) to run through the run workflow. This can be used to test any of the pipeline profiles described above.

If your EC2 instance has the resources to handle it, the simplest way to start using the pipeline is to run the test data through it locally on that instance (i.e. without using S3). To do this:

1. Navigate to the `test` directory.
2. Edit `nextflow.config` to set `params.ref_dir` to the index directory you chose or created above (specifically `PATH_TO_REF_DIR/output`).
3. Still within the `test` directory, run `nextflow run -profile ec2_local .. -resume`.
4. Wait for the workflow to finish. Inspect the `output` directory to view the processed output files.
1. Create a new directory outside the repo directory and copy over the run workflow config file as `nextflow.config` in that directory:

```
mkdir launch
cd launch
cp REPO_DIR/configs/run.config nextflow.config
```

2. Edit `nextflow.config` to set `params.ref_dir` to the index directory you chose or created above (specifically `PATH_TO_REF_DIR/output`).
3. In the same file, set the samplesheet path (`params.sample_sheet`) to the test dataset samplesheet: `${projectDir}/test-data/samplesheet.csv`.
4. Within this directory, run `nextflow run REPO_DIR -profile ec2_local -resume` (where `REPO_DIR` is the path to your local copy of this repository). Wait for the workflow to finish.
5. Inspect the `output` directory to view the processed output files.
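Put together, a local test launch might look like the following sketch (paths are placeholders: `REPO_DIR` is your local clone of this repository and `PATH_TO_REF_DIR` is the index directory from above):

```
# Set up a clean launch directory outside the repo
mkdir launch && cd launch
cp REPO_DIR/configs/run.config nextflow.config

# Edit nextflow.config so that:
#   params.ref_dir      points to PATH_TO_REF_DIR/output
#   params.sample_sheet points to REPO_DIR/test-data/samplesheet.csv

# Launch the run workflow locally, resuming any cached work
nextflow run REPO_DIR -profile ec2_local -resume

# Inspect the processed output files
ls output/
```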

If this is successful, the next level of complexity is to run the workflow with a working directory on S3. To do this:

1. Within the `test` directory, edit `nextflow.config` to set `params.base_dir` to the S3 directory of your choice.
2. Still within that directory, run `nextflow run -profile ec2_s3 .. -resume`.
1. Edit `nextflow.config` to set `params.base_dir` to the S3 directory of your choice.
2. Still within that directory, run `nextflow run REPO_DIR -profile ec2_s3 -resume`.
3. Wait for the workflow to finish, and inspect the output on S3.

Finally, you can run the test dataset through the pipeline on AWS Batch. To do this, configure Batch as described [here](https://data.securebio.org/wills-public-notebook/notebooks/2024-06-11_batch.html) (steps 1-3), then:

1. Within the `test` directory, edit `nextflow.config` to set `params.base_dir` to a different S3 directory of your choice and `process.queue` to the name of your Batch job queue.
2. Still within that directory, run `nextflow run -profile batch .. -resume` (or simply `nextflow run .. -resume`).
1. Edit `nextflow.config` to set `params.base_dir` to a different S3 directory of your choice and `process.queue` to the name of your Batch job queue.
2. Still within that directory, run `nextflow run REPO_DIR -profile batch -resume` (or simply `nextflow run REPO_DIR -resume`).
3. Wait for the workflow to finish, and inspect the output on S3.

### Running on new data

To run the workflow on another dataset, you need:

1. Accessible raw data files in Gzipped FASTQ format, named appropriately.
2. A sample sheet file specifying the samples, along with paths to the forward and reverse read files for each sample.
2. A sample sheet file specifying the samples, along with paths to the forward and reverse read files for each sample. `generate_samplesheet.sh` (see below) can make this for you.
3. A config file in a clean launch directory, pointing to:
- The directory containing the raw data (`params.raw_dir`).
- The base directory in which to put the working and output directories (`params.base_dir`).
- The directory containing the outputs of the reference workflow (`params.ref_dir`).
- The sample sheet (`params.sample_sheet`).
@@ -285,18 +293,45 @@ To run the workflow on another dataset, you need:
> - Second column: Path to FASTQ file 1 which should be the forward read for this sample
> - Third column: Path to FASTQ file 2 which should be the reverse read for this sample
>
> The easiest way to get this file is by using the `generate_samplesheet.sh` script. As input, this script takes a path to raw FASTQ files (`dir_path`), and forward (`forward_suffix`) and reverse (`reverse_suffix`) read suffixes, both of which support regex, and an optional output path (`output_path`). Those using data from s3 should make sure to pass the `s3` parameter. Those who would like to group samples by some metadata can pass a path to a CSV file containing a header column named `sample,group`, where each row gives the sample name and the group to group by (`group_file`) or edit the samplesheet manually after generation (since manually editing the samplesheet will be easier when the groups CSV isn't readily available). As output, the script generates a CSV file named (`samplesheet.csv` by default), which can be used as input for the pipeline.
> The easiest way to get this file is by using the `generate_samplesheet.sh` script. As input, this script takes a path to the raw FASTQ files (`dir_path`), forward (`forward_suffix`) and reverse (`reverse_suffix`) read suffixes (both of which support regex), and an optional output path (`output_path`). Those using data from S3 should make sure to pass the `--s3` flag. To group samples by some metadata, you can either pass a CSV file with the header `sample,group`, where each row gives a sample name and the group it belongs to (`group_file`); edit the samplesheet manually after generation (often the easiest option when a groups CSV isn't readily available); or pass the `--group_across_illumina_lanes` option if each library was split across multiple lanes of a single Illumina flowcell. As output, the script generates a CSV file (named `samplesheet.csv` by default), which can be used as input for the pipeline.
>
> For example:
> ```
> ../bin/generate_samplesheet.sh \
>     --s3 \
>     --dir_path s3://nao-restricted/MJ-2024-10-21/raw/ \
>     --forward_suffix _1 \
>     --reverse_suffix _2
> ```
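> To group lanes of the same library instead, leave the lane token out of the suffixes (so it remains part of the sample name) and pass `--group_across_illumina_lanes`. A sketch with a hypothetical bucket and suffixes:
> ```
> ../bin/generate_samplesheet.sh \
>     --s3 \
>     --dir_path s3://my-bucket/raw/ \
>     --forward_suffix _R1_001 \
>     --reverse_suffix _R2_001 \
>     --group_across_illumina_lanes
> ```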
If running on Batch, a good process for starting the pipeline on a new dataset is as follows:
1. Process the raw data to have appropriate filenames (see above) and deposit it in an accessible S3 directory.
2. Create a clean launch directory and copy `configs/run.config` to a file named `nextflow.config` in that directory.
3. Create a library metadata file in that launch directory, specifying library/sample mappings and any other metadata (see above).
3. Create a sample sheet in that launch directory (see above)
4. Edit `nextflow.config` to specify each item in `params` as appropriate, as well as setting `process.queue` to the appropriate Batch queue.
5. Run `nextflow run PATH_TO_REPO_DIR -resume`.
6. Navigate to `{params.base_dir}/output` to view and download output files.
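As a condensed sketch of that process (bucket paths, suffixes, and the queue name below are placeholders; `REPO_DIR` is your local clone of this repository):
```
# Steps 2-3: clean launch directory with a copy of the run config and a samplesheet
mkdir my-dataset && cd my-dataset
cp REPO_DIR/configs/run.config nextflow.config
REPO_DIR/bin/generate_samplesheet.sh \
    --s3 \
    --dir_path s3://my-bucket/my-dataset/raw/ \
    --forward_suffix _R1_001 \
    --reverse_suffix _R2_001

# Step 4: edit nextflow.config to set params.raw_dir, params.base_dir, params.ref_dir,
# params.sample_sheet, and process.queue (your Batch job queue)

# Steps 5-6: launch on Batch, then inspect the output directory under params.base_dir
nextflow run REPO_DIR -resume
aws s3 ls s3://my-bucket/my-dataset/output/
```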
## Run tests using `nf-test` before making pull requests
During the development process, we now request that users run the pipeline with `nf-test` locally before making pull requests (a test will also run automatically on the PR, but it's often useful to run it locally first). To do this, make sure you have a big enough EC2 instance; we recommend an `m5.xlarge` with at least 32 GB of EBS storage, as this closely reflects the VMs used by GitHub Actions. Once you have an instance, run `nf-test test tests/main.nf.test`, which will run all workflows of the pipeline and check that they run to completion. To run a specific workflow, use one of the following commands:
```
nf-test test --tag index      # Runs the index workflow
nf-test test --tag run        # Runs the run workflow
nf-test test --tag validation # Runs the validation workflow
```
Importantly, make sure to periodically delete Docker images to free up space on your instance. You can do this by running the following commands, although note that they will stop all running containers and delete all Docker containers, images, and volumes:
```
docker kill $(docker ps -q) 2>/dev/null || true
docker rm $(docker ps -a -q) 2>/dev/null || true
docker rmi $(docker images -q) -f 2>/dev/null || true
docker system prune -af --volumes
```
# Troubleshooting
When attempting to run a released version of the pipeline, the most common sources of errors are AWS permission issues. Before debugging a persistent error in-depth, make sure that you have all the permissions specified in Step 0 of [our Batch workflow guide](https://data.securebio.org/wills-public-notebook/notebooks/2024-06-11_batch.html). Next, make sure Nextflow has access to your AWS credentials, such as by running `eval "$(aws configure export-credentials --format env)"`.
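For example, a quick sanity check of your AWS identity and exported credentials before launching (assuming the AWS CLI is configured):
```
aws sts get-caller-identity                              # confirm which AWS identity is active
eval "$(aws configure export-credentials --format env)"  # export credentials for Nextflow
env | grep '^AWS_' | cut -d= -f1                         # verify the credential variables are set
```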
89 changes: 89 additions & 0 deletions bin/apply-lifecycle-rules.py
@@ -0,0 +1,89 @@
#!/usr/bin/env python3

import argparse
import json
import boto3
import sys
from botocore.exceptions import ClientError

def load_lifecycle_config(config_path):
    try:
        with open(config_path, 'r') as f:
            return json.load(f)
    except json.JSONDecodeError:
        print(f"Error: {config_path} contains invalid JSON")
        sys.exit(1)
    except FileNotFoundError:
        print(f"Error: Could not find file {config_path}")
        sys.exit(1)

def print_lifecycle_rules(rules):
    if not rules:
        print("No lifecycle rules configured")
        return

    for rule in rules:
        print(f"- {rule['ID']}")
        print(f"  Status: {rule['Status']}")
        if 'Expiration' in rule:
            print(f"  Expiration: {rule['Expiration'].get('Days', 'N/A')} days")
        print()

def get_current_rules(s3, bucket_name):
    try:
        response = s3.get_bucket_lifecycle_configuration(Bucket=bucket_name)
        return response.get('Rules', [])
    except ClientError as e:
        if e.response['Error']['Code'] == 'NoSuchLifecycleConfiguration':
            return []
        raise

def apply_lifecycle_rules(bucket_name, lifecycle_config):
    s3 = boto3.client('s3')

    try:
        # First verify the bucket exists and we have access
        s3.head_bucket(Bucket=bucket_name)

        # Show current configuration
        print(f"\nCurrent lifecycle rules for bucket {bucket_name}:")
        current_rules = get_current_rules(s3, bucket_name)
        print_lifecycle_rules(current_rules)

        # Apply the new configuration
        s3.put_bucket_lifecycle_configuration(
            Bucket=bucket_name,
            LifecycleConfiguration=lifecycle_config
        )
        print(f"\nSuccessfully applied new lifecycle rules to bucket: {bucket_name}")

        # Show the updated configuration
        print("\nUpdated lifecycle rules:")
        new_rules = get_current_rules(s3, bucket_name)
        print_lifecycle_rules(new_rules)

    except ClientError as e:
        error_code = e.response.get('Error', {}).get('Code', 'Unknown')
        if error_code == '404':
            print(f"Error: Bucket {bucket_name} does not exist")
        elif error_code == '403':
            print(f"Error: Permission denied for bucket {bucket_name}")
        else:
            print(f"Error applying lifecycle rules: {str(e)}")
        sys.exit(1)

def main():
    parser = argparse.ArgumentParser(description='Apply S3 lifecycle rules to a bucket')
    parser.add_argument('config_file', help='Path to lifecycle configuration JSON file')
    parser.add_argument('bucket_name', help='Name of the S3 bucket')

    args = parser.parse_args()

    # Load the configuration
    lifecycle_config = load_lifecycle_config(args.config_file)

    # Apply the rules
    apply_lifecycle_rules(args.bucket_name, lifecycle_config)

if __name__ == '__main__':
    main()
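A usage sketch for the script above (the JSON filename under `ref/` and the bucket name are hypothetical; requires `boto3` and credentials permitting `s3:PutLifecycleConfiguration` on the bucket):
```
python3 bin/apply-lifecycle-rules.py ref/lifecycle-rules.json my-output-bucket
```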
30 changes: 27 additions & 3 deletions bin/generate_samplesheet.sh
@@ -1,5 +1,7 @@
#!/bin/bash

set -u
set -e

##### Input parameters #####

@@ -10,7 +12,7 @@ reverse_suffix=""
s3=0
output_path="samplesheet.csv" # Default output path
group_file="" # Optional parameter for the group file

group_across_illumina_lanes=false

# Parse command-line arguments
while [[ $# -gt 0 ]]; do
@@ -39,6 +41,10 @@ while [[ $# -gt 0 ]]; do
            group_file="$2"
            shift 2
            ;;
        --group_across_illumina_lanes)
            group_across_illumina_lanes=true
            shift
            ;;
        *)
            echo "Unknown option: $1"
            exit 1
@@ -58,6 +64,13 @@ if [[ -z "$dir_path" || -z "$forward_suffix" || -z "$reverse_suffix" ]]; then
    echo -e "  --s3                   Use if files are stored in S3 bucket"
    echo -e "  --output_path <path>   Output path for samplesheet [default: samplesheet.csv]"
    echo -e "  --group_file <path>    Path to group file for sample grouping [header column must have the names 'sample,group' in that order; additional columns may be included, however they will be ignored by the script]"
    echo -e "  --group_across_illumina_lanes    Create groups by assuming that files that differ only by a terminal _Lnnn are the same library split across multiple lanes."
    exit 1
fi

if $group_across_illumina_lanes && [[ -n "$group_file" ]]; then
    echo "Provide at most one of --group_file and --group_across_illumina_lanes"
    exit 1
fi

Expand All @@ -69,11 +82,12 @@ echo "reverse_suffix: $reverse_suffix"
echo "s3: $s3"
echo "output_path: $output_path"
echo "group_file: $group_file"
echo "group_across_illumina_lanes: $group_across_illumina_lanes"


#### EXAMPLES ####

# dir_path="" # Cannot share this as it's restricted, but imagine the read looks like this
# dir_path="" # Cannot share this as it's restricted, but imagine the read looks like this
# forward_suffix="_S[0-9]_L[0-9][0-9][0-9]_R1_001"
# reverse_suffix="_S[0-9]_L[0-9][0-9][0-9]_R2_001"
# s3=1
@@ -125,6 +139,17 @@ if [[ -n "$group_file" ]]; then
    # Perform left join with group file
    awk -F',' 'NR==FNR{a[$1]=$2; next} FNR==1{print $0",group"} FNR>1{print $0","(a[$1]?a[$1]:"NA")}' "$group_file" "$temp_samplesheet" > "$output_path"
    echo "CSV file '$output_path' has been created with group information."
elif $group_across_illumina_lanes; then
    cat "$temp_samplesheet" | tr ',' ' ' | \
        while read sample fastq_1 fastq_2; do
            if [[ $sample = "sample" ]]; then
                echo $sample $fastq_1 $fastq_2 "group"
            else
                echo $sample $fastq_1 $fastq_2 \
                    $(echo "$sample" | sed 's/_L[0-9][0-9][0-9]$//')
            fi
        done | tr ' ' ',' > "$output_path"
    echo "CSV file '$output_path' has been created with grouping across illumina lanes."
else
# If no group file, just use the temporary samplesheet as the final output
mv "$temp_samplesheet" "$output_path"
@@ -133,4 +158,3 @@ fi

# Remove temporary file if it still exists
[ -f "$temp_samplesheet" ] && rm "$temp_samplesheet"
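To illustrate the lane-grouping rule added above (a sketch, not part of the diff): the group is simply the sample name with any terminal lane token stripped, so lanes of the same library collapse into one group.
```
# Hypothetical sample names as they would appear in the samplesheet
for sample in LIB01_L001 LIB01_L002 LIB02_L001; do
    echo "$sample -> $(echo "$sample" | sed 's/_L[0-9][0-9][0-9]$//')"
done
# LIB01_L001 -> LIB01
# LIB01_L002 -> LIB01
# LIB02_L001 -> LIB02
```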

