Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistency among *read_assignments.tsv, *.transcript_counts.tsv, and *.gene_counts.tsv #290

Open
airichli opened this issue Feb 19, 2025 · 2 comments
Labels
question Further information is requested

Comments

@airichli
Copy link

airichli commented Feb 19, 2025

Hi,

Thanks for developing this nice tool. I met one issue that the count is inconsistent among files as list in the title.
My command line is:
./IsoQuant-3.5.0/isoquant.py -d nanopore --stranded forward --fastq fastq1 fastq2 fastq3 --reference human_genome.fa --genedb
human_genome.gtf --complete_genedb --output output --prefix human --labels a b c -t 10 --model_construction_strategy default_ont --clean_start --matching_strategy default --splice_correction_strategy default_ont --model_construction_strategy default_ont --transcript_quantification unique_only --gene_quantification unique_only

Take gene ENSG00000290425 (its isoforms include ENST00000652586 and ENST00000617243) as an example. I extract all the information related with ENSG00000290425 from read_assignments.txt file as below.

ENSG00000290425.read_assignments.txt

According to *.transcript_counts.tsv, the transcript counts for ENST00000652586 and ENST00000617243 are 36 and 4, respectively. According to *.gene_counts.tsv, the gene count for ENSG00000290425 is 146.
However, I got different values when extract from the read_assighments.txt file as below:
grep ENST00000652586 ENSG00000290425.read_assignments.txt|awk '$6~/uniq/{print}'|wc -l (this results in 19)
grep ENST00000617243 ENSG00000290425.read_assignments.txt|awk '$6~/uniq/{print}'|wc -l (this results in 3)
grep ENSG00000290425 ENSG00000290425.read_assignments.txt|awk '$6~/uniq/{print}'|wc -l (this results in 35)

Very confusing. Actually, I checked 1490 isoforms, of which 1279 isoforms have consistent count, 211 isoforms have inconsistent count.

It would be great if isoquant could provide the exact read ids that corresponds to *.transcript_counts.tsv and *.gene_counts.tsv.

Looking forward to hearing from you. Thanks.

Best,
Aifu

@airichli airichli changed the title Inconsistency between *read_assignments.tsv, *.transcript_counts.tsv, and *.gene_counts.tsv Inconsistency among *read_assignments.tsv, *.transcript_counts.tsv, and *.gene_counts.tsv Feb 19, 2025
@andrewprzh
Copy link
Collaborator

Dear @airichli

This inconsistency is expected.

To add a +1 to an isoform abundance, IsoQuant requires a read to be assigned uniquely, meaning that it is consistent with the annotation and there is no ambiguity in the assignment. However, to add +1 to a gene count, a read can be assigned to 2 or more isoforms (of this gene) ambiguously. In other words, not all reads that are unambiguously assigned to a gene can be unambiguously assigned to a transcript.
This explains inconsistency between gene counts and isoform counts.

Inconsistency between read assignments and isoforms counts are likely caused by the additional filtering imposed on counts. For example, to report a spliced isoform, it must have at least one spliced read assigned to it.

P.S. I suggest you to update you IsoQuant to the latest 3.6.3 version, there were some quantification improvements recently.

Best
Andrey

@andrewprzh andrewprzh added the question Further information is requested label Feb 20, 2025
@airichli
Copy link
Author

airichli commented Feb 20, 2025

Dear Andrey, thank you so much for your timely reply, I really appreciate it.

I would like to check:
Assuming the *.transcript_counts.tsv, and *.gene_counts.tsv files (the quantification mode is unique-only, so only uniquely assigned reads are counted) are based on *read_assignments.tsv. The unique reads for ENST00000652586 in *read_assignments.tsv are only 19, but the count for ENST00000652586 in *.transcript_counts.tsv file is 36 (if there exists filtering, the number should less than 19). Does *.transcript_counts.tsv file includes non-unique reads as well?

My purpose is just want to know the 36 read IDs for ENST00000652586 in *.transcript_counts.tsv, it seems hard for me to deduce from *read_assignments.tsv. Is there any way to know the corresponding read IDs for the count in *.transcript_counts.tsv?

I will try the latest 3.6.3 version. Thanks.

Best,
Aifu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants