Feature request: control info-file column separator #816
Comments
Interesting. Where do these tab characters come from? Looking at some of the Nanopore data that I have here, I don’t see any. If possible, I try to avoid adding options to Cutadapt if I can just make the default behavior better. I’m wondering whether an alternative would be to replace all tab characters in the read header with a space character. I consider it a bug that this isn’t done at the moment because the output is an invalid TSV otherwise, as you found out.
@marcelm Here I copy the first line of a FASTQ file generated with the dorado basecaller v0.7, using the --emit-fastq option (so it's not a BAM file from which I later extracted the FASTQ).
Quoting from the read header you attached:
@54c591fa-c560-405b-bc82-b3cd603b84fc st:Z:2024-10-17T01:47:46.145+00:00 ***@***.*** DS:Z:gpu:NVIDIA GeForce RTX 3070 Laptop GPU
This is apparently intended to be used by a read mapper to be added to its SAM output, such as with BWA-MEM’s -C option:
-C append FASTA/FASTQ comment to SAM output
This is kind of the inverse of samtools fastq -T that Ruben mentioned.
So if you want to let Cutadapt output an info file, manipulate the info file and then write back a FASTQ file that would still be usable in this way, then something needs to be done to the tabs that is reversible. Just replacing them with spaces won’t work because then they cannot be distinguished from spaces. Even in your example, there’s already a value NVIDIA GeForce RTX 3070 Laptop GPU that contains spaces.
I’m not sure what is best here. Maybe replace tab with backslash plus t ("\\t")?
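To make the reversibility point concrete, here is a minimal Python sketch of such an escaping scheme. It is not Cutadapt code, and the function names `escape_header` and `unescape_header` are made up for illustration. The one detail it adds beyond the suggestion above is that literal backslashes have to be escaped as well, otherwise a header that already contains a backslash followed by "t" could not be told apart from an escaped tab.

```python
def escape_header(header: str) -> str:
    """Replace each tab with backslash + 't' (hypothetical helper).

    Backslashes are doubled first so that the mapping stays reversible.
    """
    return header.replace("\\", "\\\\").replace("\t", "\\t")


def unescape_header(escaped: str) -> str:
    """Invert escape_header by scanning the string left to right."""
    out = []
    i = 0
    while i < len(escaped):
        ch = escaped[i]
        if ch == "\\" and i + 1 < len(escaped):
            nxt = escaped[i + 1]
            # backslash + 't' is a tab; backslash + backslash is one backslash
            out.append("\t" if nxt == "t" else nxt)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

With this scheme, spaces inside a value such as the GPU name are left untouched, and the original tab-separated tags can be restored after the info file has been processed.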
Thanks for your reply. I ended up replacing all tabs with spaces
before feeding the FASTQ files to Cutadapt. Even though that is not
reversible (if I needed to restore the headers to their original state),
it works for the rest of the pipeline.
Many thanks,
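For anyone who wants to apply the same workaround, a minimal Python sketch follows. It assumes plain four-line FASTQ records (no wrapped sequence lines), and the function name `tabs_to_spaces_in_headers` is made up for illustration. As noted above, the replacement is lossy: tabs and pre-existing spaces cannot be told apart afterwards.

```python
from typing import Iterable, Iterator


def tabs_to_spaces_in_headers(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTQ lines with tabs in the header lines replaced by spaces.

    Assumes strict four-line records, so every fourth line starting
    from the first one is a header.
    """
    for index, line in enumerate(lines):
        if index % 4 == 0:
            yield line.replace("\t", " ")
        else:
            yield line


# Possible usage (file names are placeholders):
# with open("reads.fastq") as src, open("reads.notabs.fastq", "w") as dst:
#     dst.writelines(tabs_to_spaces_in_headers(src))
```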
…On Mon, 18 Nov 2024 at 15:57, Marcel Martin ***@***.***> wrote:
Quoting from the read header you attached:
@54c591fa-c560-405b-bc82-b3cd603b84fc st:Z:2024-10-17T01:47:46.145+00:00 ***@***.*** DS:Z:gpu:NVIDIA GeForce RTX 3070 Laptop GPU
This is apparently intended to be used by a read mapper to be added to its
SAM output, such as with BWA-MEM’s -C option:
-C append FASTA/FASTQ comment to SAM output
This is kind of the inverse of samtools fastq -T that Ruben mentioned.
So if you want to let Cutadapt output an info file, manipulate the info
file and then write back a FASTQ file that would still be usable in this
way, then something needs to be done to the tabs that is reversible. Just
replacing them with spaces won’t work because then they cannot be
distinguished from spaces. Even in your example, there’s already a value NVIDIA
GeForce RTX 3070 Laptop GPU that contains spaces.
I’m not sure what is best here. Maybe replace tab with backslash plus t (
"\\t")?
I use the info file quite often, either with awk one-liners or by importing the file into R, and it has worked very well with Illumina data. Now I am trying to use it with Nanopore data, but there is a catch: the fields in the Nanopore FASTQ header are separated by tabs, just like the columns of the info file. This makes parsing these files more troublesome. Would it be possible, in a future version, to include a way of choosing the column separator for the info file?
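Until something changes in Cutadapt, one possible way to parse such a file is to split each line from the right, since the tabs that cause trouble are all inside the read name in the first column. The Python sketch below assumes that the number of columns following the read name is known for the row being parsed. That number (`n_trailing` here) depends on the row type and on the Cutadapt version, so it has to be taken from the info-file documentation. The function name `split_info_line` is made up for illustration.

```python
def split_info_line(line: str, n_trailing: int) -> list[str]:
    """Split one info-file line into [read_name, field_1, ..., field_n].

    Splitting from the right keeps any tabs inside the read name intact.
    n_trailing is the number of columns after the read name and must be
    supplied by the caller (an assumption, not detected automatically).
    """
    fields = line.rstrip("\n").rsplit("\t", n_trailing)
    if len(fields) != n_trailing + 1:
        raise ValueError(f"expected at least {n_trailing + 1} columns")
    return fields


# Made-up row: a read name containing a tab, followed by three columns
row = "read1\tst:Z:2024-10-17\t-1\tACGT\tIIII\n"
name, *rest = split_info_line(row, 3)
```

This only works when the row type, and therefore the column count, can be determined in advance. Files that mix rows of different lengths need extra logic.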