Save run-merged reads for single run samples #262

Closed
alexhbnr opened this issue Mar 11, 2023 · 12 comments · Fixed by #272
Labels: enhancement (Improvement for existing functionality)

@alexhbnr (Contributor)

Description of the bug

Maybe I am misunderstanding what particular parameters of the pipeline mean, but the way --save_runmerged_reads currently behaves when run merging is activated is not what I expected. I am not sure whether to file this as a bug or an enhancement, so apologies for calling it a bug.

I have the following use case: I have ten samples with multiple runs that I would like to have merged prior to taxonomic profiling, and a single sample with only one run that doesn't require merging. I would like to have the FastQ files after pre-processing and, where necessary, run merging for all eleven samples.

At the moment, enabling --save_runmerged_reads gives me the FastQ files for the ten samples that required run merging, but not for the single sample that doesn't need it. I haven't enabled any other parameters for keeping intermediate files, e.g. after host removal (--save_hostremoval_unmapped). Therefore, I am currently missing the FastQ file for this single sample.

I personally would have expected that the FastQ files of samples that don't need run merging are still exported into the results directory. If they are deliberately not exported, then I would suggest adding a parameter such as --save_reads_for_taxprofiling, or something along those lines, that allows saving the final FastQ files per sample. I would like to avoid having to export the reads both after host removal and after run merging, because that would duplicate all the data for the samples that were run merged.

Command used and terminal output

nextflow run nf-core/taxprofiler -r dev -profile eva --input samplelist.csv --databases databaselist.csv --perform_shortread_qc --shortread_qc_tool adapterremoval --shortread_qc_mergepairs --shortread_qc_includeunmerged --shortread_qc_minlength 35 --perform_shortread_hostremoval --perform_runmerging --save_runmerged_reads --run_kraken2

Relevant files

No response

System information

No response

@alexhbnr alexhbnr added the bug Something isn't working label Mar 11, 2023
@Midnighter (Collaborator)

I understand the confusion. Since you supplied that single FASTQ to the pipeline, I guess you do have it somewhere, but I can see that it would be more convenient to have all reads together after run merging.

@jfy133 jfy133 added enhancement Improvement for existing functionality and removed bug Something isn't working labels Mar 11, 2023
@jfy133 (Member) commented Mar 11, 2023

Yes, I agree it is not ideal.

Unfortunately this is an 'intrinsic' consequence of the 'reactive' dataflow design of Nextflow. It's not trivial to work out which is the 'final' FASTQ file that we could publish to a final directory, particularly when you are mixing different data inputs (in this case, single-run vs. multi-run samples).

We've had the same problem in eager too for a very long time.

The possible options I can see are:

  1. Do nothing to the code, just document this better
  2. Dump all preprocessed files into a single directory
    • This will still result in file redundancy, as you will still need to turn on saving of both possible 'final' files for run-merged and non-run-merged samples, but at least they are all in one directory
  3. Come up with some very complex logic to work out which file to publish, and when, into a final results directory
    • I keep trying different scenarios in my head but still haven't come up with a satisfactory solution that doesn't result in horrible spaghetti conditional code
    • e.g. publish A in the final directory, but only if X is on and Y and Z are turned off, and also only when A goes through run merging, else publish J. But then only publish B if Y is turned on and Z is off (etc. etc.)
  4. Add an additional dummy process that doesn't actually do anything, but allows us to 'save' the file in the directory (see the sketch after this list)

(3) might be possible but will need a lot of thought going into it, and time to test all possible scenarios.
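A minimal sketch of what option (4) could look like, assuming a channel of [meta, reads] tuples with gzipped FASTQs; the process name and output directory are illustrative, not the pipeline's actual code:

```nextflow
// Hypothetical pass-through process: does no real work, but re-emitting the
// untouched inputs lets publishDir copy them next to the run-merged reads.
process PUBLISH_READS {
    publishDir "${params.outdir}/analysis-ready-reads", mode: 'copy'

    input:
    tuple val(meta), path(reads)

    output:
    // includeInputs allows the unmodified input files to count as outputs
    tuple val(meta), path("*.fastq.gz", includeInputs: true)

    script:
    """
    echo 'passing through reads for ${meta.id}'
    """
}
```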

@jfy133 (Member) commented Mar 11, 2023

EDIT: I HAVE AN IDEA @Midnighter please validate:

what about having a .collectFile(storeDir: "${params.outdir}/analysis-ready-reads/") somewhere on the channel going into profiling?
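For context, a minimal sketch of that idea, assuming the channel feeding profiling emits [meta, reads] tuples (the channel name is illustrative):

```nextflow
// Hypothetical: copy every FASTQ entering profiling into one results folder.
// With file items and no name argument, collectFile writes each file out
// under its own name in storeDir.
ch_reads_for_profiling
    .map { meta, reads -> reads }
    .flatten()
    .collectFile(storeDir: "${params.outdir}/analysis-ready-reads")
```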

@Midnighter (Collaborator)

I need to look at the context to comment on your idea.

I was wondering: why not send all FASTQ files to the merging process and publish everything, even if they are single runs?

@jfy133 (Member) commented Mar 11, 2023

A cat process that cats a single file into a new file is also redundant: it's just a copy, so to me that's a waste of computing resources, time, and disk space.

@Midnighter (Collaborator)

It could basically be a no-op for a single input file, but would allow publishing all output files into one directory easily (rough sketch below). I'll think about your comment above tomorrow, though.
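A rough sketch of that 'no-op for single runs' idea; the process name, directory, and file naming are assumptions for illustration:

```nextflow
process MERGE_RUNS {
    publishDir "${params.outdir}/analysis-ready-reads", mode: 'copy'

    input:
    // staged under input/ so a single run cannot clash with the output name
    tuple val(meta), path(reads, stageAs: 'input/*')

    output:
    tuple val(meta), path("${meta.id}.fastq.gz")

    script:
    if (reads instanceof List && reads.size() > 1)
        """
        cat ${reads.join(' ')} > ${meta.id}.fastq.gz
        """
    else
        """
        # single run: a symlink keeps the 'merge' an effective no-op;
        # publishDir mode 'copy' still resolves it into the results folder
        ln -s ${reads} ${meta.id}.fastq.gz
        """
}
```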

@alexhbnr (Contributor, Author)

OK, I am not so familiar with Nextflow's design. I would likely end up with some sort of spaghetti of if/else statements, but if there is a smarter way then that's the way to go.

Just stating in the documentation that you have to activate both --save_hostremoval_unmapped and --save_runmerged_reads wouldn't really solve the issue raised above (that a cat process copying a single file wastes resources and disk space), because you would then duplicate all the data for the samples with multiple sequencing runs.

@jfy133 (Member) commented Mar 13, 2023

Sorry, to be clear: none of the above are actual solutions, just ways of partly addressing the problem, even if the answer ends up being 'we can't do it'.

Ultimately, working out which step is the final one that runs isn't so hard (even if it takes a lot of conditionals); the problem is tracking the exceptional samples where the file to be saved needs to come from an upstream step. In a way, this is a different type of metadata than just detecting which processes have been turned on or off.

@jfy133 (Member) commented Mar 13, 2023

so for example, the conditional to save the output from run merging would be

params.perform_run_merging && params.save_final_reads

and for host removal

( !params.perform_run_merging || ( params.perform_run_merging && !meta.multirun ) ) && params.perform_<shortread/longread>_hostremoval && params.save_final_reads

I think that should work.

But the question now is how we extract/collect the meta.multirun info
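One way to derive that flag is while grouping runs per sample; a hedged sketch, where the channel name and the id/single_end meta keys are assumptions:

```nextflow
// Hypothetical: group all runs belonging to one sample and record whether
// more than one run went into the group.
ch_reads
    .map { meta, reads -> [ meta.subMap(['id', 'single_end']), reads ] }
    .groupTuple()
    .map { meta, reads -> [ meta + [ multirun: reads.size() > 1 ], reads ] }
    .set { ch_runmerging_input }
```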

@jfy133 jfy133 added this to the 1.1.0 - Augmented Akita milestone Mar 13, 2023
@jfy133 (Member) commented Mar 14, 2023

If I can get this to work, maybe we could go one step further and at the same time directly generate a samplesheet of the processed reads that could go straight into other pipelines, e.g. mag (idea inspired by @rotifergirl).
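For instance, a sketch with collectFile; the column layout and file name here are hypothetical, not mag's actual samplesheet format:

```nextflow
// Hypothetical: one CSV row per sample, collected into a single samplesheet.
ch_final_reads
    .map { meta, reads -> "${meta.id},${reads instanceof List ? reads.join(';') : reads}\n" }
    .collectFile(
        name: 'downstream_samplesheet.csv',
        storeDir: "${params.outdir}",
        seed: 'sample,fastq\n',
        sort: true
    )
```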

@jfy133 (Member) commented Mar 14, 2023

Should also x-ref: nf-core/eager#945

@jfy133 jfy133 linked a pull request Mar 17, 2023 that will close this issue
@jfy133 jfy133 closed this as completed Apr 21, 2023