Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

output knitted and corrected barcodes to fastq #30

Open
kh49 opened this issue Nov 2, 2021 · 7 comments
Open

output knitted and corrected barcodes to fastq #30

kh49 opened this issue Nov 2, 2021 · 7 comments
Assignees

Comments

@kh49
Copy link

kh49 commented Nov 2, 2021

Hi @moonwatcher,

This tool is super helpful as I am trying to implement a new single cell protocol called scifi-RNA-seq. I have been able to get the desired barcode manipulations and correction to output to an organized bam. Unfortunately, My immediate downstream step requires fastq input so I need to either have pheniqs output fastqs with the corrected barcodes and RNA insert in one run, or set up a second run after generating a corrected BAM that quickly takes the BAM and outputs fastqs.

I have been trying to get the first option to work, but I can only seem to address the input tokens and create fastqs with the raw barcodes. I'm not sure how I can refer to the corrected barcodes.

For the second option, I don't understand how to access sequences in specific tags from a BAM input file. I saw some tutorials addressing interleaved BAM/CRAM, but my data is not interleaved and is one main read with a bunch of barcode stuff in tags.

I am attaching a small test case below. I use scifi_main.json as the config file for generating the correcttest.bam file. scifi_main_fastq.json was my attempt at outputting FASTQs with correction which resulted in uncorrected output (R1_correcttest.fastq and R2_correcttest.fastq).

Thank you!

scifi_pheniqs.tar.gz

@moonwatcher moonwatcher self-assigned this Nov 2, 2021
@moonwatcher
Copy link
Contributor

Hi @kh49

I am so glad you find Pheniqs useful with scifi-RNA-seq. At the moment Pheniqs only writes decoded (error corrected) barcodes to SAM AUX tags. That said, there have been plans to add the functionality you describe since it can potentially make for a more seamless integration with downstream tools. I am looking into enhancing the token and transform syntax to allow what you are asking for.

I will update this ticket soon with additional details.

Regards,
Lior

@moonwatcher
Copy link
Contributor

Hi @kh49

I have a version for you test. Are you comfortable building from source? are you testing on MacOS or Linux? alternatively I can provide instructions to build a static version with the builder.

35966b8

I added the option to address the decoded (error corrected) barcode sequence as one long, concatenated, sequence. So even if you had 3 cellular barcodes and each had two segments the tokens will be addressing the result as one long set of coordinates. To use it simply substitute the input segment index in the token with s for sample, c for cellular and m for molecular.

For instance s:: is the entire sample barcode. c::10 is the first 10 nucleotides of the cellular barcode.

There is one lingering issue... Since we are now pretending the decoded, or error corrected, barcode is a real sequence and allow to embed it in the output read, the error corrected nucleotides need a quality score. This is not a trivial question.

The current version (35966b8) does not modify the quality scores of the corrected nucleotides which is probably a wrong behavior. but I wanted to allow you to test this and give feedback.

One option is to provide a config option and give those corrected nucleotides a fixed quality score, say 30. You basically say: "I trust the decoding and I don't want any downstream tool to filter the read because it thinks the barcode quality is too low".

Another option is use the overall error probability of decoding the entire barcode as the new quality but that will be probably too high for one mismatch and way too low for 3 or 4 mismatches.

We are still debating it, but what will be your preferred method?

@kh49
Copy link
Author

kh49 commented Nov 11, 2021

@moonwatcher Thank you! I will give this a try in the next few days and get back to you. I'm working on Linux and should be fine building from source, but I'll let you know if I have issues.

As far as the quality scores go, I think there would be merit in keeping both options you mentioned if it is not too much work. I think the fixed score case would be good enough in most cases, especially with the ability to tweak and fine tune correction parameters and error sensitivity in pheniqs. The second option sounds more technically correct. Maybe you can add a parameter to the config that lets users set a phred quality cutoff they'd like to use for the corrected barcodes. Then during error correction, barcodes with error probability lower than the set acceptable threshold would get assigned some quality score scaled between the maximum and the cutoff. Barcodes with error probability that is too high would get assigned some quality score scaled between the minimum and the cutoff.

@kh49
Copy link
Author

kh49 commented Nov 13, 2021

Hi @moonwatcher I tested the new version on my test case and the cell barcode redirection seems to work as intended and I can manipulate it as needed for output. However, I get no output from the molecular barcode. For example, if I have the following:

    "template": {
        "transform": {
            "token": [
                "1::","c::","m::","0:8:24"
            ],
			"knit": ["2:1","3"]
        }
    }

I'm expecting the first output file/interleaved line to have the molecular barcode and cellular barcode concatenated. Since I'm using naive molecular barcode decoding, I can always designate the UMI location in the input tokens instead. However, I also noticed that when outputting to BAM, the new version does not output the RX and QX tags, but does still provide OX and BZ. I ran the same configuration (without the new token addressing) on the new version as well as the current conda release and the conda version outputs RX and QX tags as expected but the new version does not.

@moonwatcher
Copy link
Contributor

moonwatcher commented Nov 15, 2021

Hi @kh49

Everything you said is correct :)

  1. I initially wanted to use s|c|m for decoded and S|C|M for raw barcodes but raw barcodes are addressable on the input read segment so its redundant. The naive decoder performs no error correction, It exists mostly to allow to move the molecular barcode to an auxiliary tag. So, as you mentioned, you can just address the location on the input. However for completeness sake I added the assembly step to the naive molecular decoder so it can support the m:: tokenization. NOTICE It will be slightly slower since pheniqs will first assemble a new sequence from the input segments and than, again, tokenize that sequence.

  2. I realize this is just a snippet but I feel compelled to mention it. in the template directive you posted you never use token 0 in the knit. Pheniqs still assembles it but never uses it, so you are just wasting clock cycles ;)

  3. You are correct that the new version writes the molecular barcode to OX/BZ instead of RX/QX. This is a delicate nuance which is open to interpretation... I try to follow the SAM specifications as much as I can. If you look at page 5 and 6 RX/QX state Sequence bases from the unique molecular identifier. These could be either corrected or uncorrected. while OX states Raw (uncorrected) unique molecular identifier bases. So OX/BZ is always for raw while RX/QX is for either corrected or uncorrected. If you used MDD or PAMLD Pheniqs writes the raw to OX/BZ and the corrected to RX/QX, however that is rarely possible since molecular barcodes almost always come from an unknown distribution. On the other hand, I figured that having naive write them to OX/BZ leaves room for a different application to write corrected barcodes to RX/QX. Bottom line is that while it does not defy the specs to write the raw to RX/QX it might be more correct to have them in OX/BZ. Does that interfere with downstream tools? do you prefer it to revert to the previous logic (RX/QX for naive)? I am considering allowing to specify the tags in the config if that helps making the output more compatible with other tools.

Thank you for all the feedback so far. Looking forward to hear your opinion.

@kh49
Copy link
Author

kh49 commented Nov 17, 2021

Hi @moonwatcher, this all sounds good to me!

  1. I think the framework you've laid out is solid. I think a quick update to the validation that can warn users when they are using the m:: form inefficiently would be enough.

  2. Thanks! I was messing around with the config and hadn't thought about leaving pieces in. Good to know that I should clean those up in finalized configs.

  3. For the BAM tags, I think your approach makes sense. It doesn't affect my downstream work, and it's always nice to remove some redundancy. That said, I do think allowing for tag configuration in the config would be a great way to allow for any special cases. It would be nice to have that for both input and output so you could specify additional pairs of sequence and quality tags from an input BAM to load into tokens, then output tokens to specific tag pairs as well. I have a special case with scifi-RNA-seq data where their demultiplexer creates a weird looking BAM. If you look at the example layout in that link, the final output of their BAM generation pipeline results in the quality scores of the RX and r2 tags being lumped into QX with a ~ separator. Right now, I'm using an awk script run via GNU parallel to parse the BAM and patch together new FASTQs which includes teasing out the quality scores from the QX tag. I don't know how common concatenating quality sequences with a ~ is in BAM files so it may not even be worth addressing, but general tag functionality would be great.

@moonwatcher
Copy link
Contributor

Hi @kh49

Sorry for the long delay but I am still working though this.
I added a parameter corrected quality that you can add either globally on the root or at every decoder. This is the phred quality for corrected bases when the corrected code is written to the output read. The parameter takes the numeric phred value and defaults to 30, which is the character ? or a probability value of 0.001. The validator allows values from 2 to 104 for now. I added a small script to print the phred value and their corresponding characters and probabilities.

This raised another question: What should be written when the decoder fails to decode the barcode?

One option is what is currently done on the aux tags, which is to write a sequence of = which is the SAM value for unknown nucleotide (not the same as N which means any nucleotide).

Another option is to write the original sequence with the original qualities.

The read will be marked as failing quality control, as was always the case.

Still working on the other features you mentioned ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants