Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possiblity to extract non-host paired-end reads in separate FastQ files #945

Open
alexhbnr opened this issue Feb 22, 2023 · 0 comments
Open
Assignees
Labels
DSL2 enhancement New feature or request
Milestone

Comments

@alexhbnr
Copy link
Contributor

Currently, nf-core/eager does only allow to extract reads that didn't align to the provided reference genome into a single FastQ file. While this is no issue when having merged overlapping read pairs during adapter removal, non-merged sequencing data might still consist of read pairs and single reads at the same time. This simple extraction with samtools view -f 4 /path/to/bam | samtools fastq will then lead to a single file from which it is almost impossible to retrace which are the read pairs and which are the singletons.

Instead nf-core/eager should automatically switch to a different extraction command when confronted with non-merged read pairs:

samtools view -f 4 /path/to/bam | samtools fastq -1 /path/to/fwd_fastq -2 /path/to/rev_fastq -0 /path/to/singleton_fastq

The more complex (but more accurate) solution would be the following:

samtools sort -n -o <sample>.name_sorted.bam
samtools fixmate -mcu <sample>.name_sorted.bam <sample>.fixmate.bam
samtools view -uh -e '(flag.paired && (flag.unmap || flag.munmap)) || (!flag.paired && flag.unmap)' <sample>.fixmate.bam | \
samtools fastq -1 /path/to/fwd_fastq -2 /path/to/rev_fastq -0 /path/to/singleton_fastq

The first two step ensure that paired reads have complementary flags, which is particularly important with respect to the "unmapped" and "mate unmapped" flag. The samtools view command then uses a filtering expression to extract unmapped reads. It distinguishes between read pairs and singletons. The former are extracted if at least one of the mates of the pair couldn't be aligned. The latter are simply extracted when they are unmapped.

Using the more complex approach will ensure that the FastQ file with the forward reads and the FastQ file with the reverse reads of a paired-end sequencing run will have the same number of reads.

@alexhbnr alexhbnr added the enhancement New feature or request label Feb 22, 2023
@jfy133 jfy133 added the DSL2 label Feb 22, 2023
@jfy133 jfy133 self-assigned this Feb 22, 2023
@github-project-automation github-project-automation bot moved this to Todo (Hackathon General) in nf-core Hackathon March 2023 Mar 23, 2023
@jfy133 jfy133 added this to the 3.0 Augsburg milestone May 26, 2023
@ilight1542 ilight1542 mentioned this issue Oct 20, 2023
20 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DSL2 enhancement New feature or request
Projects
No open projects
Development

No branches or pull requests

2 participants