1. How to cluster transcripts from multiple query GTF files? 2. Something weird about tss_id #99

Open
dudududu12138 opened this issue Dec 25, 2024 · 1 comment

dudududu12138 commented Dec 25, 2024

Hi, I used gffcompare to merge transcript assemblies from multiple samples. The command is listed below:

gffcompare -T -S $gffcompare_inputs -o $output 

I didn't provide a reference annotation because I just want to merge the assemblies from different samples. But I have two questions.

Question 1.

How does gffcompare cluster transcripts from multiple samples without a reference annotation? Your manual [https://ccb.jhu.edu/software/stringtie/gffcompare.shtml] and your paper only describe duplicate removal. In my results, I saw some transcripts with a contained_in tag. You said you want to retain alternative start sites, but some transcripts share the same tss_id and one of them has a contained_in tag. This seems weird. Below is an example:

image

Question 2.

After I found the problem above, I checked the tss_id in my results. I followed 3 steps:

  • step1: Group transcripts by tss_id

  • step2: Calculate the maximum distance between start sites among transcripts with the same tss_id

  • step3: Get the distribution of these distances
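The three steps above can be sketched roughly as follows. This is only an illustration, not the script used for the figures: the function name `tss_spread` is hypothetical, the input is assumed to be GTF with `transcript` records carrying a `tss_id` attribute, and the start site is assumed to be the 5' end of each transcript (column 4 on the `+` strand, column 5 on the `-` strand).

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def tss_spread(gtf_lines):
    """Return {tss_id: (strand, max distance between start sites)}."""
    sites = defaultdict(list)
    strands = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        tss_id = attrs.get("tss_id")
        if tss_id is None:
            continue
        start, end, strand = int(cols[3]), int(cols[4]), cols[6]
        # Assumption: the start site is the 5' end, i.e. the end
        # coordinate for transcripts on the minus strand.
        tss = end if strand == "-" else start
        sites[tss_id].append(tss)   # step 1: group by tss_id
        strands[tss_id] = strand
    # step 2: maximum distance within each group
    return {t: (strands[t], max(p) - min(p)) for t, p in sites.items()}

# Made-up records for illustration only, not taken from my results.
example = [
    'chr1\tx\ttranscript\t100\t500\t.\t+\t.\ttranscript_id "T1"; tss_id "TSS1";',
    'chr1\tx\ttranscript\t150\t500\t.\t+\t.\ttranscript_id "T2"; tss_id "TSS1";',
    'chr1\tx\ttranscript\t1000\t2000\t.\t-\t.\ttranscript_id "T3"; tss_id "TSS2";',
    'chr1\tx\ttranscript\t1200\t2000\t.\t-\t.\ttranscript_id "T4"; tss_id "TSS2";',
]
spread = tss_spread(example)
# step 3: the distances whose distribution is examined
distances = sorted(d for _, d in spread.values())
```

Note that if column 4 were used as the start site on both strands, the two minus-strand records in this toy input would appear 200 bp apart instead of 0, so the strand handling matters for this check.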

Below are two figures showing the results of this check. Although the default value of the -d parameter is 100, some transcripts that share a tss_id still have start sites more than 100 bp apart.
Below is a pie plot, where Category is the max distance: =0, >0 & <=100, or >100.

image
Below is a histogram of the distances; I first removed the cases with distance=0.
image

Thank you so much for your help with my questions. I look forward to your reply!

dudududu12138 (Author) commented

Hi, I have figured out why the two transcripts have the same tss_id even though their transcription start sites are far apart. You set a cutoff for clustering transcript start sites (-d, default 100), but it only applies to positive-strand genes; negative-strand genes don't adhere to that cutoff.
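A minimal sketch of how this strand-specific pattern could be checked, assuming the per-tss_id maximum distances are already available as (strand, distance) pairs. The function names `categorize` and `summarize_by_strand` are hypothetical, the categories mirror the ones in the pie plot above, and the example values are made up for illustration rather than taken from my results.

```python
from collections import Counter

def categorize(distance, cutoff=100):
    """Bin a max start-site distance into the three pie-plot categories."""
    if distance == 0:
        return "0"
    if distance <= cutoff:
        return f">0 & <={cutoff}"
    return f">{cutoff}"

def summarize_by_strand(spreads, cutoff=100):
    """Count categories per strand.

    spreads: iterable of (strand, max_distance) pairs, one per tss_id.
    """
    counts = {}
    for strand, dist in spreads:
        counts.setdefault(strand, Counter())[categorize(dist, cutoff)] += 1
    return counts

# Made-up values for illustration only.
example = [("+", 0), ("+", 40), ("-", 0), ("-", 250)]
summary = summarize_by_strand(example)
for strand in sorted(summary):
    print(strand, dict(summary[strand]))
```

If the `>100` category is populated only for the `-` strand in real output, that would be consistent with the observation above; it would not by itself show where in gffcompare the difference comes from.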

gpertea self-assigned this Dec 25, 2024