Adding single-read functionality to RAW and CLEAN #80

Status: Open. Wants to merge 97 commits into base branch harmon_fix_gh_actions_test.
Changes from 77 commits.
15354f6
Adding single read option to raw/main.nf
simonleandergrimm Oct 21, 2024
ad2115d
Adding WIP version of run.nf to enable testing raw and clean versions…
simonleandergrimm Oct 21, 2024
03ee37a
Created separate versions of summarize-multiqc-single.R and summarize…
simonleandergrimm Oct 22, 2024
b517340
Split processes in fastp to a single read and paired-end read version.
simonleandergrimm Oct 22, 2024
01ea0c5
Split processes in MultiQC to a single read and paired-end read versi…
simonleandergrimm Oct 22, 2024
ad8faf9
Deleted summarizeMultiqcSingle, which was superseded by summarizeMultiqc
simonleandergrimm Oct 22, 2024
ef0e9c8
Split processes in truncateConcat to a single read and paired-end rea…
simonleandergrimm Oct 22, 2024
2535ccd
Created a single_end if clause in Clean to either use the single read…
simonleandergrimm Oct 22, 2024
cbcb109
Created a single_end if clause in hv_screen to either use the single …
simonleandergrimm Oct 22, 2024
c7f8c83
Created a single_end if clause in qc to either use the single read or…
simonleandergrimm Oct 22, 2024
ff0a8be
Renamed test dir to test-paired-end. Added clause in nextflow.config …
simonleandergrimm Oct 22, 2024
6048dd3
Edited gitignore to leave out test-paired-end and test-single-read ru…
simonleandergrimm Oct 22, 2024
92270e5
Fixed name of test-single-end dir to test-single-read
simonleandergrimm Oct 22, 2024
b13ac94
Created a version of test dir that allows the run of single-read data.
simonleandergrimm Oct 22, 2024
dff2302
Added script to quickly download the s3 output of test single read an…
simonleandergrimm Oct 23, 2024
64bb7f4
Added nextflow config for test paired and test single read.
simonleandergrimm Oct 23, 2024
5bd1aec
Fixed if clause in main.nf
simonleandergrimm Oct 23, 2024
c8fd3ac
Updated gen samplesheet scripts to pull in data from s3://nao-mgs-sim…
simonleandergrimm Oct 23, 2024
578fde0
Updated gitignore
simonleandergrimm Oct 23, 2024
59218b9
Activated CLEAN subworkflow in run.nf
simonleandergrimm Oct 23, 2024
fd9dc1e
Starting to adapt Will's https://data.securebio.org/wills-public-note…
simonleandergrimm Oct 23, 2024
81ff0ba
Adding ignoring mgs-results to gitignore
simonleandergrimm Oct 23, 2024
590b2c3
Adding Will's auxiliary scripts to run his quarto notebooks.
simonleandergrimm Oct 23, 2024
6a650b4
Merge branch 'master' into single-read-raw
simonleandergrimm Oct 23, 2024
9f1eb03
Amended qmd somewhat so data imports work.
simonleandergrimm Oct 24, 2024
9622004
Added a flag to summarize-multiqc-single.R that provides info on the…
simonleandergrimm Oct 25, 2024
c61ed0c
Amended logic of split_sample, so it does not split and pull out read…
simonleandergrimm Oct 25, 2024
f8d9c28
Deleting seperate version of summarize-multiqc I created for paired r…
simonleandergrimm Oct 25, 2024
8e1c7b5
Revert "Split processes in MultiQC to a single read and paired-end re…
simonleandergrimm Oct 25, 2024
0ba0552
Revert "Deleted summarizeMultiqcSingle, which was superseded by summa…
simonleandergrimm Oct 25, 2024
8bafee8
Revert "Created a single_end if clause in qc to either use the single…
simonleandergrimm Oct 25, 2024
68c7c50
Amended main.nf of summarizeMultiqcSingle, clean, qc, and raw, to pro…
simonleandergrimm Oct 25, 2024
1656b33
Amended summarize-multiqc-single.R's basic_info_fastqc so it also sub…
simonleandergrimm Oct 25, 2024
4ec6788
Switched the --paired flag to instead be --read_type, and have it be …
simonleandergrimm Oct 25, 2024
f2bb836
Merge branch 'dev' into single-read-raw
simonleandergrimm Oct 25, 2024
e13acc6
Deleted a directory with testing scripts that was superseded by https…
simonleandergrimm Oct 25, 2024
9c62aa4
this script is now in https://github.com/naobservatory/simon-analysis…
simonleandergrimm Oct 25, 2024
be46ee9
Adding normal test dataset back in.
simonleandergrimm Oct 26, 2024
17d61ff
removing new versions of generate_samplesheet.sh (will add two differ…
simonleandergrimm Oct 26, 2024
0ba23fb
Reinstating dev version of run.nf, and creating new version of run.nf…
simonleandergrimm Oct 26, 2024
118378c
Adding run_dev_se to main.nf, a run specifically used for checking if…
simonleandergrimm Oct 26, 2024
8cd5239
Fixing default value for --read_type in summarize-multiqc-single.R. A…
simonleandergrimm Oct 26, 2024
7d3e725
Dropping commented out sections in split_sample
simonleandergrimm Oct 26, 2024
c107e91
Pulling in newest version of generate_samplesheet.sh
simonleandergrimm Oct 26, 2024
2d07ae6
Fixing single vs paired end read logic in hv_screen
simonleandergrimm Oct 26, 2024
74cb53a
Turned generate_samplesheet.sh back into dev version. Will and single…
simonleandergrimm Oct 28, 2024
3b0a11c
Adding read_type information to run.nf so the correct processes are p…
simonleandergrimm Oct 28, 2024
8f6beda
Extended generate_samplesheet.sh so it also takes in single-read data.
Nov 12, 2024
69f404c
Merge branch 'master' into single-read-raw-clean
simonleandergrimm Nov 19, 2024
654dd1c
Amended subworkflows to take in single end data.
simonleandergrimm Nov 19, 2024
2a01243
Merge branch 'master' into single-read-raw-clean
simonleandergrimm Nov 19, 2024
e9f7384
Reworked summarize_multiqc_pair.R to take in single_end data.
simonleandergrimm Nov 19, 2024
793a061
Made run_dev_se.nf follow updates to run.nf, and fixed single_end det…
simonleandergrimm Nov 19, 2024
fdf81af
Dropped two versions of FASTP, created conditional statement instead.
simonleandergrimm Nov 19, 2024
95dcf91
Dropped two different versions of the truncate_concat and added condi…
simonleandergrimm Nov 19, 2024
ada8c5e
dropped conditional selsection of processes.
simonleandergrimm Nov 19, 2024
a448dc9
Fixed single_end variable passing
simonleandergrimm Nov 19, 2024
e9b89be
Added new single read flagging in run.nf
simonleandergrimm Nov 19, 2024
eb82a32
removed old summarize-multiqc file
simonleandergrimm Nov 19, 2024
00ddcfc
fixed index in nextflow.config for paired end data.
simonleandergrimm Nov 19, 2024
e5b5ec5
added grouping and ndew index info to test-single-read config
simonleandergrimm Nov 19, 2024
8e201e7
Adding improved configs
simonleandergrimm Nov 23, 2024
591138d
dropped single end definition in run file.
simonleandergrimm Nov 23, 2024
27244bd
Adding params to single end variable invocation
simonleandergrimm Nov 23, 2024
517961f
removed whitespace
simonleandergrimm Nov 23, 2024
c28749f
updating nextflow.config of test
simonleandergrimm Nov 23, 2024
e132ec4
fixed single_end config in normal run workflow
simonleandergrimm Nov 23, 2024
51b9cf3
make single-end variable logical.
simonleandergrimm Nov 23, 2024
12c3fdd
Reverted to old gitignore structure.
simonleandergrimm Nov 23, 2024
4fd3ce6
Changed test dirs to only have one dir for run_dev_se.
simonleandergrimm Nov 24, 2024
d460813
Adding WIP progress
simonleandergrimm Nov 24, 2024
f412b07
Merge branch 'dev' into single-read-raw-clean
simonleandergrimm Nov 24, 2024
3d10bb0
Fixing single_end being unbound.
simonleandergrimm Nov 24, 2024
7899979
Merge branch 'dev' into single-read-raw-clean
simonleandergrimm Nov 29, 2024
dd942fa
Took into account new testing setup
simonleandergrimm Nov 29, 2024
50c2edc
adding single end info to config
simonleandergrimm Nov 29, 2024
61ea369
Moved single end eval from config to run files
simonleandergrimm Nov 29, 2024
ad640c6
Update nextflow.config
simonleandergrimm Dec 3, 2024
e85dd45
Merge remote-tracking branch 'origin/harmon_fix_gh_actions_test' into…
simonleandergrimm Dec 3, 2024
3a6f6b5
Merge remote-tracking branch 'origin/harmon_fix_gh_actions_test' into…
simonleandergrimm Dec 4, 2024
a0f5f32
Put single_end into profiles.config
simonleandergrimm Dec 4, 2024
d14da14
fixed run-dev-se config in tests
simonleandergrimm Dec 4, 2024
3fe2bd2
Creating a new config for read_type flag.
simonleandergrimm Dec 4, 2024
d0375ab
added run dev se to end-to-end yml
simonleandergrimm Dec 4, 2024
f5cf80a
Made rundevse index and outputs look the same as run.nf
simonleandergrimm Dec 4, 2024
3dc323e
Fixing setup of run_dev_se test config.
simonleandergrimm Dec 5, 2024
1904931
Update .gitignore (dropped new line)
simonleandergrimm Dec 5, 2024
e24d79e
Setting profiles.config back to original
simonleandergrimm Dec 5, 2024
b38b93d
Updated comments in main.nf to represent the posiblity of not not ala…
simonleandergrimm Dec 9, 2024
21b15b8
Fixed duplicate par statement in fastp.
simonleandergrimm Dec 9, 2024
9d717b7
Responding to Harmon's comments.
simonleandergrimm Dec 9, 2024
ee7baf4
dropped unncessary single-end variable.
simonleandergrimm Dec 9, 2024
034914b
fixed faulty paired-end fastp
simonleandergrimm Dec 10, 2024
c5454b9
added end-to-end-se.yml
simonleandergrimm Dec 10, 2024
4b966d8
adedd subworkflow to create samplesheet
simonleandergrimm Dec 11, 2024
7a3a59b
split truncate concat into two processes/
simonleandergrimm Dec 11, 2024
6ad3ce2
removed run dev se from end to end yml.
simonleandergrimm Dec 11, 2024
2 changes: 1 addition & 1 deletion .gitignore
Contributor:
@simonleandergrimm can you sync up with @harmonbhasin re naming here? I think he's going to rename the test directory anyway due to conflict with nf-test.

FWIW I'd prefer something like test/single/... and test/paired/... to keep the main directory clean.

Contributor:

Also what are these?

analysis_files/*
mgs-results/

Collaborator (author):

@harmonbhasin What are your thoughts regarding having a test dataset for paired-end and single-end data? Could you rejig your test dataset by e.g., simply keeping the forward reads?

@@ -9,4 +9,4 @@ test/.nextflow*
pipeline_report.txt

.nf-test/
.nf-test.log
.nf-test.log
91 changes: 74 additions & 17 deletions bin/generate_samplesheet.sh
Contributor:

@harmonbhasin to review changes to this file

Collaborator (author):

@harmonbhasin ping on this.

Collaborator:

I'm not sure if you're interested in this, but if you want to turn this script into python, I wouldn't be mad lol

Collaborator (author):

👀

@@ -1,5 +1,6 @@
#!/bin/bash


set -u
set -e

@@ -10,10 +11,28 @@ dir_path=""
forward_suffix=""
reverse_suffix=""
s3=0
single_end=0
output_path="samplesheet.csv" # Default output path
group_file="" # Optional parameter for the group file
group_across_illumina_lanes=false

# Function to print usage
print_usage() {
echo "Usage:"
echo "For paired-end reads:"
echo " $0 --dir_path <path> --forward_suffix <suffix> --reverse_suffix <suffix> [--s3] [--output_path <path>]"
echo "For single-end reads:"
echo " $0 --dir_path <path> --single_end [--s3] [--output_path <path>]"
echo
echo "Options:"
echo " --dir_path Directory containing FASTQ files"
echo " --forward_suffix Suffix for forward reads (required for paired-end only)"
echo " --reverse_suffix Suffix for reverse reads (required for paired-end only)"
echo " --single_end Flag for single-end data"
echo " --s3 Flag for S3 bucket access"
echo " --output_path Output path for samplesheet (default: samplesheet.csv)"
}

# Parse command-line arguments
while [[ $# -gt 0 ]]; do
case $1 in
@@ -33,10 +52,18 @@ while [[ $# -gt 0 ]]; do
s3=1
shift
;;
--single_end)
single_end=1
shift
;;
--output_path)
output_path="$2"
shift 2
;;
--help)
print_usage
exit 0
;;
--group_file) # Optional group file
group_file="$2"
shift 2
@@ -47,20 +74,22 @@ while [[ $# -gt 0 ]]; do
;;
*)
echo "Unknown option: $1"
print_usage
exit 1
;;
esac
done

# Check if all required parameters are provided
if [[ -z "$dir_path" || -z "$forward_suffix" || -z "$reverse_suffix" ]]; then
echo "Error: dir_path, forward_suffix, and reverse_suffix are required."
if [[ -z "$dir_path" || -z "$single_end" ]]; then
echo "Error: dir_path and single_end are required."
echo -e "\nUsage: $0 [options]"
echo -e "\nRequired arguments:"
echo -e " --dir_path <path> Directory containing FASTQ files"
echo -e " --forward_suffix <suffix> Suffix identifying forward reads, supports regex (e.g., '_R1_001' or '_1')"
echo -e " --reverse_suffix <suffix> Suffix identifying reverse reads, supports regex (e.g., '_R2_001' or '_2')"
echo -e " --single_end Flag for single-end data"
echo -e "\nOptional arguments:"
echo -e " --forward_suffix <suffix> When single_end is 0, suffix identifying forward reads, supports regex (e.g., '_R1_001' or '_1')"
echo -e " --reverse_suffix <suffix> When single_end is 0, suffix identifying reverse reads, supports regex (e.g., '_R2_001' or '_2')"
echo -e " --s3 Use if files are stored in S3 bucket"
echo -e " --output_path <path> Output path for samplesheet [default: samplesheet.csv]"
echo -e " --group_file <path> Path to group file for sample grouping [header column must have the names 'sample,group' in that order; additional columns may be included, however they will be ignored by the script]"
@@ -74,15 +103,28 @@ if $group_across_illumina_lanes && [[ -n "$group_file" ]]; then
exit 1
fi

if [ $single_end -eq 0 ]; then
# Paired-end validation
if [[ -z "$forward_suffix" || -z "$reverse_suffix" ]]; then
echo "Error: forward_suffix and reverse_suffix are required for paired-end reads."
print_usage
exit 1
fi
fi

# Display the parameters
echo "Parameters:"
echo "dir_path: $dir_path"
echo "forward_suffix: $forward_suffix"
echo "reverse_suffix: $reverse_suffix"
echo "single_end: $single_end"
echo "s3: $s3"
echo "output_path: $output_path"
echo "group_file: $group_file"
echo "group_across_illumina_lanes: $group_across_illumina_lanes"
if [ $single_end -eq 0 ]; then
echo "forward_suffix: $forward_suffix"
echo "reverse_suffix: $reverse_suffix"
fi



#### EXAMPLES ####
@@ -109,30 +151,45 @@
# Create a temporary file for the initial samplesheet
temp_samplesheet=$(mktemp)

echo "sample,fastq_1,fastq_2" > "$temp_samplesheet"
# Create header based on single_end flag
if [ $single_end -eq 0 ]; then
echo "sample,fastq_1,fastq_2" > "$temp_samplesheet"
else
echo "sample,fastq" > "$temp_samplesheet"
fi
echo "group_file: $group_file"


# Ensure dir_path ends with a '/'
if [[ "$dir_path" != */ ]]; then
dir_path="${dir_path}/"
fi

listing=0

# Get file listing based on s3 flag
if [ $s3 -eq 1 ]; then
listing=$(aws s3 ls ${dir_path} | awk '{print $4}')
else
listing=$(ls ${dir_path} | awk '{print $1}')
fi

echo "$listing" | grep "${forward_suffix}\.fastq\.gz$" | while read -r forward_read; do
sample=$(echo "$forward_read" | sed -E "s/${forward_suffix}\.fastq\.gz$//")
reverse_read=$(echo "$listing" | grep "${sample}${reverse_suffix}\.fastq\.gz$")
# If sample + reverse_suffix exists in s3_listing, then add to samplesheet
if [ -n "$reverse_read" ]; then
echo "$sample,${dir_path}${forward_read},${dir_path}${reverse_read}" >> "$temp_samplesheet"
fi
done
# Process files based on single_end flag
if [ $single_end -eq 0 ]; then
# Paired-end processing
echo "$listing" | grep "${forward_suffix}\.fastq\.gz$" | while read -r forward_read; do
sample=$(echo "$forward_read" | sed -E "s/${forward_suffix}\.fastq\.gz$//")
reverse_read=$(echo "$listing" | grep "${sample}${reverse_suffix}\.fastq\.gz$")
# If sample + reverse_suffix exists in s3_listing, then add to samplesheet
if [ -n "$reverse_read" ]; then
echo "$sample,${dir_path}${forward_read},${dir_path}${reverse_read}" >> "$temp_samplesheet"
fi
done
else
# Single-end processing - just process all fastq.gz files
echo "$listing" | grep "\.fastq\.gz$" | while read -r read_file; do
sample=$(echo "$read_file" | sed -E "s/\.fastq\.gz$//")
echo "$sample,${dir_path}${read_file}" >> "$temp_samplesheet"
done
fi

# Check if group file is provided
if [[ -n "$group_file" ]]; then
37 changes: 37 additions & 0 deletions configs/run_dev_se.config
@@ -0,0 +1,37 @@
/************************************************
| CONFIGURATION FILE FOR NAO VIRAL MGS WORKFLOW |
************************************************/

params {
mode = "run_dev_se"


// Directories
base_dir = "s3://nao-mgs-simon/test_single_read" // Parent for working and output directories (can be S3)
ref_dir = "s3://nao-mgs-wb/index-20241113/output" // Reference/index directory (generated by index workflow)

// Files
sample_sheet = "${launchDir}/samplesheet.csv" // Path to library TSV
adapters = "${projectDir}/ref/adapters.fasta" // Path to adapter file for adapter trimming

// Whether the underlying data is paired-end or single-end
single_end = new File(params.sample_sheet).text.readLines()[0].contains('fastq_2') ? false : true

// Numerical
grouping = false // Whether to group samples by 'group' column in samplesheet
n_reads_trunc = 0 // Number of reads per sample to run through pipeline (0 = all reads)
n_reads_profile = 1000000 // Number of reads per sample to run through taxonomic profiling
bt2_score_threshold = 20 // Normalized score threshold for HV calling (typically 15 or 20)
blast_hv_fraction = 0 // Fraction of putative HV reads to BLAST vs nt (0 = don't run BLAST)
kraken_memory = "128 GB" // Memory needed to safely load Kraken DB
quality_encoding = "phred33" // FASTQ quality encoding (probably phred33, maybe phred64)
fuzzy_match_alignment_duplicates = 0 // Fuzzy matching the start coordinate of reads for identification of duplicates through alignment (0 = exact matching; options are 0, 1, or 2)
host_taxon = "vertebrate"
}

includeConfig "${projectDir}/configs/logging.config"
includeConfig "${projectDir}/configs/containers.config"
includeConfig "${projectDir}/configs/resources.config"
includeConfig "${projectDir}/configs/profiles.config"
includeConfig "${projectDir}/configs/output.config"
process.queue = "simon-batch-queue" // AWS Batch job queue
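The config above infers `single_end` in Groovy by checking whether the samplesheet header contains a `fastq_2` column. The same check as a standalone sketch (hypothetical helper, not part of the pipeline):

```python
import csv

def is_single_end(samplesheet_path):
    # Paired-end samplesheets have a 'fastq_2' column; single-end ones do not.
    with open(samplesheet_path, newline="") as f:
        header = next(csv.reader(f))
    return "fastq_2" not in header
```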
3 changes: 3 additions & 0 deletions main.nf
@@ -1,6 +1,7 @@
include { RUN } from "./workflows/run"
include { RUN_VALIDATION } from "./workflows/run_validation"
include { INDEX } from "./workflows/index"
include { RUN_DEV_SE } from "./workflows/run_dev_se"

workflow {
if (params.mode == "index") {
@@ -9,6 +10,8 @@ workflow {
RUN()
} else if (params.mode == "run_validation") {
RUN_VALIDATION()
} else if (params.mode == "run_dev_se") {
RUN_DEV_SE()
}
}

24 changes: 20 additions & 4 deletions modules/local/fastp/main.nf
@@ -5,8 +5,13 @@ process FASTP {
// reads is a list of two files: forward/reverse reads
tuple val(sample), path(reads)
path(adapters)
val single_end
output:
tuple val(sample), path("${sample}_fastp_{1,2}.fastq.gz"), emit: reads
tuple val(sample), path({
single_end ?
"${sample}_fastp.fastq.gz" :
"${sample}_fastp_{1,2}.fastq.gz"
}), emit: reads
tuple val(sample), path("${sample}_fastp_failed.fastq.gz"), emit: failed
tuple val(sample), path("${sample}_fastp.{json,html}"), emit: log
shell:
@@ -19,13 +24,23 @@
*/
'''
# Define paths and subcommands
o1=!{sample}_fastp_1.fastq.gz
o2=!{sample}_fastp_2.fastq.gz
of=!{sample}_fastp_failed.fastq.gz
oj=!{sample}_fastp.json
oh=!{sample}_fastp.html
ad=!{adapters}
io="--in1 !{reads[0]} --in2 !{reads[1]} --out1 ${o1} --out2 ${o2} --failed_out ${of} --html ${oh} --json ${oj} --adapter_fasta ${ad}"
if [ $(echo "!{reads}" | wc -w) -eq 2 ]; then
echo "Processing paired-end reads"
o1=!{sample}_fastp_1.fastq.gz
o2=!{sample}_fastp_2.fastq.gz
io="--in1 !{reads[0]} --in2 !{reads[1]} --out1 ${o1} --out2 ${o2} --failed_out ${of} --html ${oh} --json ${oj} --adapter_fasta ${ad}"
else
echo "Processing single-end reads"
o=!{sample}_fastp.fastq.gz
io="--in1 !{reads[0]} --out1 ${o} --failed_out ${of} --html ${oh} --json ${oj} --adapter_fasta ${ad}"
fi
par="--cut_front --cut_tail --correction --detect_adapter_for_pe --trim_poly_x --cut_mean_quality 25 --average_qual 25 --qualified_quality_phred 20 --verbose --dont_eval_duplication --thread !{task.cpus} --low_complexity_filter"


par="--cut_front --cut_tail --correction --detect_adapter_for_pe --trim_poly_x --cut_mean_quality 20 --average_qual 20 --qualified_quality_phred 20 --verbose --dont_eval_duplication --thread !{task.cpus} --low_complexity_filter"
# Execute
fastp ${io} ${par}
@@ -66,3 +81,4 @@ process FASTP_NOTRIM {
fastp ${io} ${par}
'''
}

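The FASTP module above branches on the number of input files to build its `io` string. The same argument-assembly logic as a sketch (a hypothetical helper mirroring the shell block, not the pipeline's code):

```python
def fastp_io_args(sample, reads, adapters):
    """Build the fastp I/O argument list based on read count (sketch)."""
    # Shared outputs: failed reads, HTML/JSON reports, adapter FASTA
    common = ["--failed_out", f"{sample}_fastp_failed.fastq.gz",
              "--html", f"{sample}_fastp.html",
              "--json", f"{sample}_fastp.json",
              "--adapter_fasta", adapters]
    if len(reads) == 2:  # paired-end: two inputs, two outputs
        return ["--in1", reads[0], "--in2", reads[1],
                "--out1", f"{sample}_fastp_1.fastq.gz",
                "--out2", f"{sample}_fastp_2.fastq.gz"] + common
    # single-end: one input, one output
    return ["--in1", reads[0], "--out1", f"{sample}_fastp.fastq.gz"] + common
```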
5 changes: 3 additions & 2 deletions modules/local/summarizeMultiqcPair/main.nf
@@ -4,10 +4,11 @@ process SUMMARIZE_MULTIQC_PAIR {
label "single"
input:
tuple val(stage), val(sample), path(multiqc_data)
val(single_end)
output:
tuple path("${stage}_${sample}_qc_basic_stats.tsv.gz"), path("${stage}_${sample}_qc_adapter_stats.tsv.gz"), path("${stage}_${sample}_qc_quality_base_stats.tsv.gz"), path("${stage}_${sample}_qc_quality_sequence_stats.tsv.gz")
shell:
'''
summarize-multiqc-pair.R -i !{multiqc_data} -s !{stage} -S !{sample} -o ${PWD}
summarize-multiqc-pair.R -i !{multiqc_data} -s !{stage} -S !{sample} -r !{single_end} -o ${PWD}
'''
}
}
@@ -13,12 +13,24 @@ option_list = list(
help="Stage descriptor."),
make_option(c("-S", "--sample"), type="character", default=NULL,
help="Sample ID."),
make_option(c("-r", "--single_end"), type="character", default=FALSE,
help="Single-end flag."),
make_option(c("-o", "--output_dir"), type="character", default=NULL,
help="Path to output directory.")
)
opt_parser = OptionParser(option_list=option_list);
opt = parse_args(opt_parser);

# Convert single_end from string to logical
if (opt$single_end == "true") {
single_end <- TRUE
} else if (opt$single_end == "false") {
single_end <- FALSE
} else {
stop("single_end must be 'true' or 'false'")
}


# Set input paths
multiqc_json_path <- file.path(opt$input_dir, "multiqc_data.json")
fastqc_tsv_path <- file.path(opt$input_dir, "multiqc_fastqc.txt")
@@ -57,8 +69,19 @@ basic_info_fastqc <- function(fastqc_tsv, multiqc_json){
tab_tsv <- fastqc_tsv %>%
mutate(n_bases_approx = process_n_bases(`Total Bases`)) %>%
select(n_bases_approx, per_base_sequence_quality:adapter_content) %>%
summarize_all(function(x) paste(x, collapse="/")) %>%
mutate(n_bases_approx = n_bases_approx %>% str_split("/") %>% sapply(as.numeric) %>% colSums())
summarize_all(function(x) paste(x, collapse="/"))

if (single_end) {
tab_tsv <- tab_tsv %>%
mutate(n_bases_approx = n_bases_approx %>% as.numeric)
} else {
tab_tsv <- tab_tsv %>%
mutate(n_bases_approx = n_bases_approx %>%
str_split("/") %>%
sapply(as.numeric) %>%
colSums())
}

# Combine
return(bind_cols(tab_json, tab_tsv))
}
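In the R change above, paired-end FastQC rows are collapsed by joining per-file values with "/", so approximate base counts must be split and summed, while single-end rows carry a single number. A compact sketch of that branch (illustrative, not the repo's code):

```python
def total_bases(joined_value, single_end):
    # Single-end: one value per row, e.g. "100".
    # Paired-end: one value per mate joined with "/", e.g. "100/120".
    if single_end:
        return float(joined_value)
    return sum(float(v) for v in joined_value.split("/"))
```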
@@ -86,7 +109,7 @@ extract_adapter_data <- function(multiqc_json){
extract_per_base_quality_single <- function(per_base_quality_dataset){
# Convert a single JSON per-base-quality dataset into a tibble
data <- lapply(1:length(per_base_quality_dataset$name), function(n)
per_base_quality_dataset$data[[n]] %>% as.data.frame %>%
per_base_quality_dataset$data[[n]] %>% as.data.frame %>%
mutate(file=per_base_quality_dataset$name[n])) %>%
bind_rows() %>% as_tibble %>%
rename(position=V1, mean_phred_score=V2)
@@ -103,7 +126,7 @@ extract_per_base_quality <- function(multiqc_json){
extract_per_sequence_quality_single <- function(per_sequence_quality_dataset){
# Convert a single JSON per-sequence-quality dataset into a tibble
data <- lapply(1:length(per_sequence_quality_dataset$name), function(n)
per_sequence_quality_dataset$data[[n]] %>% as.data.frame %>%
per_sequence_quality_dataset$data[[n]] %>% as.data.frame %>%
mutate(file=per_sequence_quality_dataset$name[n])) %>%
bind_rows() %>% as_tibble %>%
rename(mean_phred_score=V1, n_sequences=V2)
24 changes: 19 additions & 5 deletions modules/local/truncateConcat/main.nf
Contributor:
More so than for FASTP, I think this would be better done as a single process with a conditional statement, based either on a boolean paired parameter or (better) on the length of reads.

You could even just do a for loop iterating over every file in reads.

Collaborator (author):

Done

@@ -5,16 +5,30 @@ process TRUNCATE_CONCAT {
input:
tuple val(sample), path(reads)
val n_reads
val single_end
output:
tuple val(sample), path("${sample}_trunc_{1,2}.fastq.gz"), emit: reads

tuple val(sample), path({
single_end ?
"${sample}_trunc.fastq.gz" :
"${sample}_trunc_{1,2}.fastq.gz"
}), emit: reads
shell:
'''
echo "Number of output reads: !{n_reads}"
n_lines=$(expr !{n_reads} \\* 4)
echo "Number of output lines: ${n_lines}"
o1=!{sample}_trunc_1.fastq.gz
o2=!{sample}_trunc_2.fastq.gz
zcat !{reads[0]} | head -n ${n_lines} | gzip -c > ${o1}
zcat !{reads[1]} | head -n ${n_lines} | gzip -c > ${o2}
if [ $(echo "!{reads}" | wc -w) -eq 2 ]; then
echo "Processing paired-end reads"
o1=!{sample}_trunc_1.fastq.gz
o2=!{sample}_trunc_2.fastq.gz
zcat !{reads[0]} | head -n ${n_lines} | gzip -c > ${o1}
zcat !{reads[1]} | head -n ${n_lines} | gzip -c > ${o2}
else
echo "Processing single-end reads"
o=!{sample}_trunc.fastq.gz
zcat !{reads[0]} | head -n ${n_lines} | gzip -c > ${o}
fi

'''
}
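The truncation in this module is `zcat | head -n $((n_reads * 4)) | gzip -c` per file. The same operation as a self-contained sketch (hypothetical helper, assuming well-formed 4-line FASTQ records):

```python
import gzip

def truncate_fastq_gz(in_path, out_path, n_reads):
    """Keep the first n_reads FASTQ records (4 lines each) of a gzipped file."""
    n_lines = n_reads * 4
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for i, line in enumerate(src):
            if i >= n_lines:
                break
            dst.write(line)
```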