How to merge the TEsorter repeat libraries #52

Open
manoharbisht1998 opened this issue Jan 15, 2024 · 18 comments

Comments

@manoharbisht1998

manoharbisht1998 commented Jan 15, 2024

Hey, thanks for the tool. How can I merge the output library of TEsorter with the RepeatModeler repeat library to run RepeatMasker? Also, can I use the TEsorter output library directly as input to RepeatMasker?

@zhangrengang
Owner

Yes. In the output library *.cls.lib, the sequences are identical to the input, but their IDs have been updated with the new classifications.

@manoharbisht1998
Author

Okay, thanks for answering the second part of my question. But I still have a doubt about merging the two libraries. RepeatModeler provides a consensus library with far fewer sequences than the input genome FASTA, whereas TEsorter outputs the same number of sequences as the input genome FASTA. So I am wondering: can I merge both libraries into one and then cluster the merged library using a tool like CD-HIT?

@zhangrengang
Owner

I do not understand. Are you using the -genome option to screen a whole genome with TEsorter? If not, you should not input the genome FASTA; instead, input a TE FASTA identified by, e.g., RepeatModeler.

@manoharbisht1998
Author

Thank you for the prompt reply. Yes, I used the -genome option to screen for TEs in my genome. However, I was not aware that we can also input the library obtained from RepeatModeler.
Now I will run TEsorter on the repeat library obtained from RepeatModeler with -db rexdb-plant (as my species is a plant). The result can then be fed downstream to RepeatMasker. Please correct me if I have missed anything.

@zhangrengang
Owner

You are right. Please note that the -genome option does not produce a TE library like RepeatModeler does; it outputs annotations (*.dom.gff3) and sequences (*.dom.faa) of TE protein domains across the whole genome.
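
The library-based workflow confirmed above could be sketched roughly as follows. This is only an illustration under stated assumptions: `consensi.fa` and `my_genome.fa` are placeholder file names, the `run` helper and `DRY_RUN` variable are hypothetical, and the `.cls.lib` file name assumes TEsorter's `<input>.<db>.*` output naming seen elsewhere in this thread. By default the script only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch: classify a RepeatModeler library with TEsorter, then mask with it.
# Nothing is executed unless DRY_RUN=0 is set in the environment.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"
TE_LIB="consensi.fa"      # placeholder: TE library from RepeatModeler
GENOME="my_genome.fa"     # placeholder: genome FASTA to be masked
DB="rexdb-plant"          # database used in this thread for a plant species

# Assumption: TEsorter writes <input>.<db>.cls.lib next to the input file.
CLS_LIB="${TE_LIB}.${DB}.cls.lib"

run() {
  # Hypothetical helper: print the command, run it only when DRY_RUN=0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# Step 1: classify the TE library (no -genome option here).
run TEsorter "$TE_LIB" -db "$DB"

# Step 2: use the reclassified library as a custom RepeatMasker library.
run RepeatMasker -lib "$CLS_LIB" "$GENOME"
```

Running it with `DRY_RUN=0` would execute both tools, so it is worth checking the printed commands first.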

@manoharbisht1998
Author

Okay. I am using TEsorter v1.4.6, and I did get a *.cls.lib file when using the -genome option.

@zhangrengang
Owner

That is strange. How did you install it? Is it the latest version from GitHub?

@manoharbisht1998
Author

I installed it in a conda environment.

@zhangrengang
Owner

I tested the conda version, but it outputs only four files:

$ TEsorter -genome rice6.9.5.liban -fw
$ ls
rice6.9.5.liban.rexdb.domtbl
rice6.9.5.liban.rexdb.dom.gff3
rice6.9.5.liban.rexdb.dom.faa
rice6.9.5.liban.rexdb.dom.tsv

@manoharbisht1998
Author

manoharbisht1998 commented Jan 15, 2024

Oh, it must be because I did not pass my genome with the -genome parameter; instead I ran:
TEsorter my_genome.fa -p 50 -prob 0.9
Which means TEsorter treated it as a repeat library by default, I guess.

@zhangrengang
Owner

Yes.

@manoharbisht1998
Author

Further on this: I ran TEsorter on the RepeatModeler output consensi.fa, and it took only one minute to produce the *.cls.lib output, with the following summary on screen:

Order           Superfamily    # of Sequences  # of Clade Sequences  # of Clades  # of full Domains
LTR             Copia          75              72                    8            3
LTR             Gypsy          108             80                    6            20
pararetrovirus  unknown        7               0                     0            0
LINE            unknown        22              0                     0            0
TIR             EnSpm_CACTA    4               0                     0            0
TIR             MuDR_Mutator   6               0                     0            0
TIR             PIF_Harbinger  5               0                     0            0
TIR             hAT            5               0                     0            0

Now I am wondering whether the pipeline worked or not.

@zhangrengang
Owner

It works. It is fast for a small TE library.

@manoharbisht1998
Author

manoharbisht1998 commented Jan 15, 2024

Hi, I have run RepeatMasker, and I am getting many repeats classified as "unknown", which I want to reduce. I am attaching the RepeatMasker output for my genome from both pipelines: RepeatModeler ---> RepeatMasker and RepeatModeler ---> TEsorter ---> RepeatMasker. Do you have any suggestions on how I can reduce the number of "unknown" TEs? I am also attaching the headers of the *.cls.lib file that I obtained from TEsorter and used as input to RepeatMasker.

1_Unknown#Unknown 1_Unknown ( RepeatScout Family Size = 4356, Final Multiple Alignment Size = 100, Localized to 2506 out of 2617 contigs )
AAATATGAAATAAATAAAAATAATACATGGAAATGGAAAATACNGATTATTTAATTANTA

[Attached image: Result]
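
To see how many library entries are still unclassified before masking, one could count the headers whose ID ends in `#Unknown`. The sketch below assumes headers of the form `>ID#Class description`, as in the header shown above; the function name `count_unknown` and the toy sequences are made up for illustration.

```shell
# Print "<unknown> <total>" header counts for the FASTA library given as $1.
count_unknown() {
  awk '/^>/ { total++; if ($1 ~ /#Unknown$/) unknown++ }
       END  { print unknown + 0, total + 0 }' "$1"
}

# Toy library (hypothetical IDs and sequences) to show the idea.
demo_lib="$(mktemp)"
cat > "$demo_lib" <<'EOF'
>TE1#LTR/Gypsy TE1#LTR/Gypsy
ACGTACGT
>TE2#Unknown TE2#DNA/hAT
ACGTACGT
>TE3#Unknown TE3#Unknown
ACGTACGT
EOF

count_unknown "$demo_lib"   # prints: 2 3
```

Running the same count on the RepeatModeler library and on the TEsorter *.cls.lib shows which of the two leaves fewer entries unclassified.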

@zhangrengang
Owner

You may use the union set of non-unknown TEs from RepeatModeler and TEsorter.

@manoharbisht1998
Author

I did not follow. Are you suggesting that I take only those sequences that are annotated by both RepeatModeler and TEsorter (the output we obtain after running TEsorter on the RepeatModeler library)?

@zhangrengang
Owner

I mean you may replace the unknown classifications from TEsorter with the known classifications from RepeatModeler, like:

less rice6.9.5.liban.rexdb.cls.lib | awk '{if ($1~"#Unknown"){cls=$1; $1=">"$2; $2=cls}{print}}'

It is just to reduce the number of "unknown" TEs.
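
To make the effect of that command concrete, here is a small sketch on a toy library. It assumes that in *.cls.lib the first header field is the TEsorter ID (`>ID#Class`) and the second field is the original ID with its RepeatModeler classification. The function `restore_known` is not the command above but a guarded variant of it: it swaps the two fields only when the second field actually carries a classification other than Unknown. All IDs and sequences here are made up.

```shell
# Guarded variant of the awk one-liner: for headers that TEsorter left as
# Unknown, promote the original ID (field 2) to the header ID, but only if
# field 2 has a "#Class" part that is not itself Unknown.
restore_known() {
  awk '/^>/ && $1 ~ /#Unknown$/ && $2 ~ /#/ && $2 !~ /#Unknown$/ {
         cls = $1              # TEsorter ID, e.g. ">TE2#Unknown"
         $1  = ">" $2          # original ID becomes the header ID
         $2  = substr(cls, 2)  # keep the TEsorter ID, without ">"
       }
       { print }' "$1"
}

# Toy library (hypothetical entries).
toy_lib="$(mktemp)"
cat > "$toy_lib" <<'EOF'
>TE1#LTR/Gypsy TE1#LTR/Gypsy
ACGTACGT
>TE2#Unknown TE2#DNA/hAT
ACGTACGT
>TE3#Unknown TE3#Unknown
ACGTACGT
EOF

restore_known "$toy_lib"
```

Only the `TE2` header changes, to `>TE2#DNA/hAT TE2#Unknown`; `TE3` stays Unknown because neither tool classified it. Note that awk rebuilds a modified header with single spaces between fields, and that a header whose second field has no `#Class` part (like `1_Unknown#Unknown 1_Unknown ...` earlier in this thread) is left untouched by this variant.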

@manoharbisht1998
Author

Okay, Thanks!
