Question about GALBA performance vs. BRAKER3 #56

Open
aleponce4 opened this issue Sep 6, 2024 · 2 comments

@aleponce4

Hi,
First, apologies if this is not the best place to ask this question. I'm doing gene annotation for a non-model rodent, and I have tried several approaches:

  1. BRAKER2: with a large protein database,
  2. GALBA: using combined protein data from a few closely related species,
  3. BRAKER3: incorporating both the above protein data and total RNA-Seq data alignments.

I obtained considerably better results using GALBA than BRAKER3. While this is great news, I am surprised that GALBA outperformed BRAKER3 given that BRAKER3 had the additional RNA-Seq data.

Is this within expectations for these tools, or could this indicate an issue with my BRAKER3 run? (For example, could there have been a problem with the RNA-Seq alignments?) I'm mostly basing my assessment of "better" on the results I obtained using OMARK for each annotation.

*I attached a plot of my OMARK results. The braker3 bar comes from a run that used RNA data from just 1 sample, while braker3+ used all the RNA data I had available.

[Attached image: Picture1]

Thanks in advance for any insights!

@KatharinaHoff
Member

BRAKER2 is expected to perform poorly on a mammal; it is not made for large genomes.

BRAKER3 should be able to handle a vertebrate genome; however, we did not design it to work with mammals specifically. For annotating a mammal, we usually do not train AUGUSTUS at all; we use the human parameter set. For a mammal, you could also consider Tiberius (https://www.biorxiv.org/content/10.1101/2024.07.21.604459v1.full.pdf), but it does not use the RNA-Seq data, and it only predicts one isoform per locus. In that category, it may outperform Galba, though.

What you see for BRAKER3 is the result of overly stringent filtering: too many gene models without evidence were discarded from the total gene set. The models that you have are very good, but some genes are missing. You may be able to restore them manually.

Can you send me the part of braker.log where it ran best_by_compleasm.py? Also, the log file from the compleasm subfolder may be useful. (However, if you add more genes, the unknown consistency and fragments etc. may increase just to the level of the Galba run, so I am not sure whether it is even worth trying.)
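One way to pull out that part of the log is sketched below. This is only an illustration, not part of BRAKER: the helper name `extract_log_context` and the context sizes (`before`, `after`) are assumptions chosen for the example, and it simply searches the log text for the string best_by_compleasm.py. If that step never ran, it returns an empty list.

```python
def extract_log_context(log_text, needle="best_by_compleasm.py", before=2, after=20):
    """Return the log lines around every line that mentions `needle`.

    `before` and `after` are arbitrary context sizes (an assumption for
    this sketch); overlapping windows are merged and kept in file order.
    """
    lines = log_text.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if needle in line:
            start = max(0, i - before)
            stop = min(len(lines), i + after + 1)
            keep.update(range(start, stop))
    return [lines[i] for i in sorted(keep)]
```

To use it, read braker.log from your BRAKER working directory (for example with `pathlib.Path("braker.log").read_text(errors="replace")`), pass the text to `extract_log_context`, and paste the returned lines into the issue.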

For mammals, you have excellent reference protein donors, so it is not surprising that Galba did well. Also, a while ago I added the DIAMOND filter that discards false positives. It also helps with denoising, and it only works well in a data situation like your current one.

@aleponce4
Author

I appreciate the explanation!
I’ll definitely give Tiberius a try when I can (though the GPU partitions at my university are usually pretty busy, so it might take a bit of time). Based on your experience and the OMARK results I've gotten so far, it sounds like GALBA is a good fit for my current project. I just wanted to make sure the behavior I'm seeing in comparison with BRAKER3 wasn't unusual.

As for compleasm, I had some issues with it, so I had to run BRAKER3 without the --busco_lineage option. I saw a similar issue mentioned in Compleasm_to_hints error #752, but I haven’t had a chance to build a new Singularity container with the updated TSEBRA yet. I'm still climbing the learning curve.

Thanks again!
