Save run-merged reads for single run samples #262

Closed
alexhbnr opened this issue Mar 11, 2023 · 12 comments · Fixed by #272
Labels: enhancement (Improvement for existing functionality)

@alexhbnr (Contributor)

Description of the bug

Maybe I am misunderstanding what particular parameters of the pipeline mean, but the way --save_runmerged_reads currently behaves when run merging is activated is not what I expected. I am not sure whether to file this as a bug or an enhancement, so apologies for calling it a bug.

I have the following use case: I have ten samples with multiple runs that I would like to have merged prior to taxonomic profiling, and a single sample with only one run that doesn't require merging. I would like to have the FastQ files after pre-processing and, where necessary, run merging for all eleven samples.

At the moment, enabling --save_runmerged_reads gives me the FastQ files for the ten samples that required run merging, but not for the single sample that doesn't need it. I haven't enabled any other parameters for keeping intermediate files, e.g. after host removal (--save_hostremoval_unmapped). Therefore, I am currently missing the FastQ file for this single sample.

I personally would have expected that the FastQ files of samples that don't need run merging are still exported into the results directory. If they are deliberately not exported, then I would suggest adding a parameter such as --save_reads_for_taxprofiling, or something along those lines, that allows saving the final FastQ files per sample. I would like to avoid having to export the reads both after host removal and after run merging, because that would duplicate all the data for the samples that were run merged.

Command used and terminal output

nextflow run nf-core/taxprofiler -r dev -profile eva --input samplelist.csv --databases databaselist.csv --perform_shortread_qc --shortread_qc_tool adapterremoval --shortread_qc_mergepairs --shortread_qc_includeunmerged --shortread_qc_minlength 35 --perform_shortread_hostremoval --perform_runmerging --save_runmerged_reads --run_kraken2

Relevant files

No response

System information

No response

@alexhbnr alexhbnr added the bug Something isn't working label Mar 11, 2023
@Midnighter (Collaborator)

I understand the confusion. Since you supplied that single FASTQ to the pipeline, I guess you do have it somewhere, but I can see that it would be more convenient to have all reads together after run merging.

@jfy133 jfy133 added enhancement Improvement for existing functionality and removed bug Something isn't working labels Mar 11, 2023
@jfy133 (Member) commented Mar 11, 2023

Yes, I agree it is not ideal.

Unfortunately this is an 'intrinsic' consequence of the 'reactive' dataflow design of Nextflow. It's not trivial to work out which is the 'final' FASTQ file that we could publish to a final directory, particularly when you are mixing different data inputs (in this case, single-run vs. multi-run samples).

We've had the same problem in eager too for a very long time.

The possible options I can see are:

  1. Do nothing to the code, just document this better
  2. Dump all preprocessed files into a single directory
    • This will still result in file redundancy, as you will still need to turn on saving of both possible 'final' files for run-merged and non-run-merged samples, but at least they are all in one directory
  3. Come up with some very complex logic to work out which file to publish, and when, into a final results directory
    • I keep trying different scenarios in my head but still haven't come up with a satisfactory solution that doesn't result in horrible spaghetti conditional code
    • e.g. publish A in the final directory, but only if X is on and Y and Z are turned off, and also only when A goes through run merging, else publish J. But then only publish B if Y is turned on and Z is off (etc. etc.)
  4. Add an additional dummy process that doesn't actually do anything, but allows us to 'save' the file in the directory (see the sketch after this list)

(3) might be possible but will need a lot of thought going into it, and time to test all possible scenarios.
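A minimal sketch of what option (4) could look like, assuming a channel of [meta, reads] tuples with gzipped FASTQs; the process name and output directory are illustrative, not the pipeline's actual code:

```nextflow
// Hypothetical pass-through process: does no real work, but re-emitting the
// untouched inputs lets publishDir copy them next to the run-merged reads.
process PUBLISH_READS {
    publishDir "${params.outdir}/analysis-ready-reads", mode: 'copy'

    input:
    tuple val(meta), path(reads)

    output:
    // includeInputs allows the unmodified input files to count as outputs
    tuple val(meta), path("*.fastq.gz", includeInputs: true)

    script:
    """
    echo 'passing through reads for ${meta.id}'
    """
}
```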

@jfy133 (Member) commented Mar 11, 2023

EDIT: I HAVE AN IDEA @Midnighter please validate:

what about having a .collectFile(storeDir: "${params.outdir}/analysis-ready-reads/") somewhere on the channel going into profiling?
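For context, a minimal sketch of that idea, assuming the channel feeding profiling emits [meta, reads] tuples (the channel name is illustrative):

```nextflow
// Hypothetical: copy every FASTQ entering profiling into one results folder.
// With file items and no name argument, collectFile writes each file out
// under its own name in storeDir.
ch_reads_for_profiling
    .map { meta, reads -> reads }
    .flatten()
    .collectFile(storeDir: "${params.outdir}/analysis-ready-reads")
```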

@Midnighter (Collaborator)

I need to look at the context to comment on your idea.

I was wondering: why not send all FASTQ files to the merging process and publish everything, even if they are single runs?

@jfy133 (Member) commented Mar 11, 2023

A cat process that cats a single file into a new file is also redundant: it's just a copy, so to me that's a waste of computing resources, time, and disk space.

@Midnighter (Collaborator)

It could basically be a no-op for a single input file, but would allow publishing all output files into one directory easily (rough sketch below). I'll think about your comment above tomorrow, though.
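A rough sketch of that 'no-op for single runs' idea; the process name, directory, and file naming are assumptions for illustration:

```nextflow
process MERGE_RUNS {
    publishDir "${params.outdir}/analysis-ready-reads", mode: 'copy'

    input:
    // staged under input/ so a single run cannot clash with the output name
    tuple val(meta), path(reads, stageAs: 'input/*')

    output:
    tuple val(meta), path("${meta.id}.fastq.gz")

    script:
    if (reads instanceof List && reads.size() > 1)
        """
        cat ${reads.join(' ')} > ${meta.id}.fastq.gz
        """
    else
        """
        # single run: a symlink keeps the 'merge' an effective no-op;
        # publishDir mode 'copy' still resolves it into the results folder
        ln -s ${reads} ${meta.id}.fastq.gz
        """
}
```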

@alexhbnr (Contributor, Author)

OK, I am not so familiar with Nextflow's design. I would likely end up with some sort of spaghetti of if/else statements, but if there is a smarter way then that's the way to go.

Just stating in the documentation that you have to activate both --save_hostremoval_unmapped and --save_runmerged_reads wouldn't really solve the issue raised above (that a cat process copying a single file wastes resources and disk space), because you would then duplicate all the data for the samples with multiple sequencing runs.

@jfy133 (Member) commented Mar 13, 2023

Sorry, to be clear: none of the above are actual solutions, just ways of partly addressing the problem, even if the answer ends up being 'we can't do it'.

Ultimately, working out which step is the final one that runs isn't so hard (even if it takes a lot of conditionals); the problem is tracking the exceptional samples where the file to be saved needs to come from an upstream step. In a way, this is a different type of metadata than just detecting which processes have been turned on or off.

@jfy133 (Member) commented Mar 13, 2023

so for example, the conditional to save the output from run merging would be

params.perform_run_merging && params.save_final_reads

and for host removal

( !params.perform_run_merging || ( params.perform_run_merging && !meta.multirun ) ) && params.perform_<shortread/longread>_hostremoval && params.save_final_reads

I think that should work.

But the question now is how we extract/collect the meta.multirun info
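One way to derive that flag is while grouping runs per sample; a hedged sketch, where the channel name and the id/single_end meta keys are assumptions:

```nextflow
// Hypothetical: group all runs belonging to one sample and record whether
// more than one run went into the group.
ch_reads
    .map { meta, reads -> [ meta.subMap(['id', 'single_end']), reads ] }
    .groupTuple()
    .map { meta, reads -> [ meta + [ multirun: reads.size() > 1 ], reads ] }
    .set { ch_runmerging_input }
```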

@jfy133 jfy133 added this to the 1.1.0 - Augmented Akita milestone Mar 13, 2023
@jfy133 (Member) commented Mar 14, 2023

If I can get this to work, maybe we could go one step further and at the same time directly generate a samplesheet of the processed reads that could go straight into other pipelines, e.g. mag (idea inspired by @rotifergirl).
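For instance, a sketch with collectFile; the column layout and file name here are hypothetical, not mag's actual samplesheet format:

```nextflow
// Hypothetical: one CSV row per sample, collected into a single samplesheet.
ch_final_reads
    .map { meta, reads -> "${meta.id},${reads instanceof List ? reads.join(';') : reads}\n" }
    .collectFile(
        name: 'downstream_samplesheet.csv',
        storeDir: "${params.outdir}",
        seed: 'sample,fastq\n',
        sort: true
    )
```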

@jfy133 (Member) commented Mar 14, 2023

Should also x-ref: nf-core/eager#945

@jfy133 jfy133 linked a pull request Mar 17, 2023 that will close this issue
@jfy133 jfy133 closed this as completed Apr 21, 2023