-
Notifications
You must be signed in to change notification settings - Fork 142
Fix POOL_SHORT_READS receiving reads in wrong roder resulting in faulty pooling #894
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Fix POOL_SHORT_READS receiving reads in wrong roder resulting in faulty pooling #894
Conversation
|
|
@nf-core-bot fix linting |
…es-contain-unequal-number-of-reads
| // We have to merge reads together to match tuple structure of POOL_SHORT_READS/ | ||
| // This MUST be in a interleaved structure (s1_r1, s1_r2, s2_r1, s2_r2, ...) | ||
| // So we merge the two list of R1 and R2s, and sort them to ensure correct order above | ||
| ch_short_reads_grouped_for_pooling = ch_short_reads_grouped.map { meta, reads1, reads2 -> [meta, [reads1 + reads2].flatten().sort()] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we assume that the reads files here are standardly-named such that a sort() won't break the order?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think we can assume that because all those files are renamed for ${prefix} at that point, imho. Or is it possible to skip the complete QC so that original files names come through? Not entirely sure...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think it's possible to basically completely skip QC...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How likely do you think it would be that people don't have a _R1 / _R2, _1 / _2, _F, _R in their FASTQ files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be wary of assuming anything about file names unless we have strictly controlled it. One way to do that would be also to force a schema like the above in the samplesheet validation, so we stop early before errors.
Otherwise we have to be careful with channel order, etc.?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typical Illumina output from the sequencing facilities & companies I know is <sample>_R1_<lane>.fastq.gz. Single-end read files might not have any of those pattern to identify direction (R1/1F/whatever).
I think that makes it already more complicated to catch? I am not an regex expert though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@d4straub 's pattern the one of the most common patterns I've seen too... and other people append the adapter index sequence to the end after the lane ID too... so I really don't think this will be simply be solvable.
But for me that isnt needed, potentially we could add a comment (Warning) in the docs about the sorting issue with SPAdes & skipping all QC & file names, maybe to the co-assembly step (https://nf-co.re/mag/5.1.0/docs/usage/#the-group-column?),
I'm erring for this, but I want this to be a democracy.
@dialvarezs any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleted my previous comment, as I wasn't sure it actually worked, but I think it does:
ch_a = Channel.of(["meta", ["a", "c", "b"], ["d", "b", "f"]])
ch_a.map { meta, f1, f2 ->
def transposed_pairs = [f1, f2].transpose()
println transposed_pairs
def sorted_pairs = transposed_pairs.sort { it[0] }
println sorted_pairs
def interleaved = sorted_pairs.flatten()
return [meta, interleaved]
}.view()
transposed: [[a, d], [c, b], [b, f]]
sorted: [[a, d], [b, f], [c, b]]
output: [meta, [a, d, b, f, c, b]]
So we can just sort on fasta1's name, avoiding issues with naming entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried that code above and it seems fine to me. It also works with e.g.
ch_a = Channel.of(["meta", ["a_s1_R1_a", "c_s3_R1_c", "b_s2_R1_b"], ["d_s1_R2_d", "b_s3_R2_b", "f_s2_R2_f"]])
that is sorted to
[meta, [a_s1_R1_a, d_s1_R2_d, b_s2_R1_b, f_s2_R2_f, c_s3_R1_c, b_s3_R2_b]]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK nice thanks for the cross-validation @d4straub ! When I am more functional I will try to implement it!
…es-contain-unequal-number-of-reads
…-unequal-number-of-reads' of github.com:nf-core/mag into 890-metaspades-exit-status-21-paired-read-files-contain-unequal-number-of-reads
| // We have to merge reads together to match tuple structure of POOL_SHORT_READS/ | ||
| // This MUST be in a interleaved structure (s1_r1, s1_r2, s2_r1, s2_r2, ...) | ||
| // So we merge the two list of R1 and R2s, and sort them to ensure correct order above | ||
| ch_short_reads_grouped_for_pooling = ch_short_reads_grouped.map { meta, reads1, reads2 -> [meta, [reads1 + reads2].flatten().sort()] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ch_short_reads_grouped_for_pooling = ch_short_reads_grouped.map { meta, reads1, reads2 -> [meta, [reads1 + reads2].flatten().sort()] } | |
| ch_short_reads_grouped_for_pooling = ch_short_reads_grouped.map { meta, reads1, reads2 -> [meta, [reads1, reads2].transpose().sort { it[0].getName() }.flatten()] } |
Closes #890
--coassembly_groupparameter was missed out in the new config structuresTODO:
PR checklist
nf-core pipelines lint).nextflow run . -profile test,docker --outdir <OUTDIR>).nextflow run . -profile debug,test,docker --outdir <OUTDIR>).docs/usage.mdis updated.docs/output.mdis updated.CHANGELOG.mdis updated.README.mdis updated (including new tool citations and authors/contributors).