GitHub - alerpaschoal/apte_pipeline: The pipeline used to build the APTE: Atlas of Plant Transposable Elements

The following tutorial will use A. thaliana to represent the work and codes used to extract and map the Transposable Elements reported in the article:

This tutorial assumes that all the software used for the analysis is installed. To check the instructions, facilities and installations, access:

LTR_retriever: https://github.com/oushujun/LTR_retriever
MGEScan non-ltr: https://mgescan.readthedocs.io/en/latest/nonltr.html
RepeatMasker: http://www.repeatmasker.org/RMDownload.html
RepeatModeler: http://www.repeatmasker.org/RepeatModeler
RepeatScout: https://bix.ucsd.edu/repeatscout
MITEHunter: http://target.iplantcollaborative.org/mite_hunter.html
HelitronScanner: https://sourceforge.net/projects/helitronscanner
PASTEClassifier: https://urgi.versailles.inra.fr/Tools/PASTEClassifier

Also, for automation purposes, some scripts have been developed that are in the "scripts" folder and will be demonstrated step by step with the results of each output of the software above.

1 - After making the annotation using LTR_retriever, you should see a number of files, there are examples in "athaliana / ltr_retriever", of this github. The command below will format in .gff3 for future treatment.

1 - To use LTR_retriever:
    installation_path/./ltr_finder Arabidopsis_thaliana.TAIR10.dna.toplevel.fa -w0 > output 
    installation_path/./LTR_retriever -threads 10 -genome Arabidopsis_thaliana.TAIR10.dna.toplevel.fa -infinder output

2 - To format the output:
    - Make sure to be at athaliana folder
    
    //code
        cd ltrretriever
    grep -v '^#' Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.out.gff > formatItem.gff
    awk -F"\t" {'print $9"\t"$3'} formatItem.gff | sort | uniq -c | sort -k1rg > ltrfamilies.txt
    awk -F" " '{print $1"\t"$2"\t"$3}' ltrfamilies.txt > ltrfamily.txt
    perl ../../codestoformat/formatOutputRetriever.pl Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.out.gff ltrfamily.txt
    //code

    - A LTR_Retriever.gff3 file will be generated.

2 - The output of MGEScan non-ltr is sorted by folder, the TE class being respective, and inside it will contain the sequences that the methodology found, there are examples in "athaliana / non-ltr". Thus, in this structure we run a script from the "script" folder to properly format the output.

1 - To use MGEScan non-ltr:
    mgescan nonltr ./athali --output=./athali/resultNonLTR

2 - To format the output:
    - Make sure to be at athaliana folder

    //code
    cd resultNonLTR
    cd info
    cd full
    #generates the output MGEScan non-ltr into fasta files.
    ls -d */ | while read d
    do
        cd $d
        perl ../../../../../codestoformat/generatingFastaNonLTR.pl ${d////}.dna ${d////}
        cd ..
    done
    #generates MGEScan non-ltr a output in .gff3.  
    cd ..
    perl ../../../codestoformat/formatOutputNonLTR.pl nonltr.gff3 nonltr.fa
    //code

    - A NonLTR.gff3 file will be generated.

3 - With this RepeatMasker methodology, using any library. It will report some files to you, following this structure, we will use a script to format the output and structure the information we have from the output.

1 - To use RepeatMasker:
    installation_path/RepeatMasker -lib RepbaseLibrary.fa -gff -cutoff 250 -nolow -div 60 Arabidopsis_thaliana.TAIR10.dna.toplevel.fa

2 - To format the output:
    - Make sure to be at athaliana folder

    //code
    cd repeatmasker
    perl ../../codestoformat/formatOutputRepeatMaskerGFF.pl Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.out
    //code

    - A RepeatMasker.gff3 file will be generated.

4 - RepeatModeler is a strategy that identifies consensus sequences within the target genome. The result is a "library" that will be used later with the RepeatMasker methodology, there are examples in "athaliana / repeatmodeler". This step is divided into two parts.

1 - To use RepeatModeler:
    perl installation_path/RepeatModeler/BuildDatabase -name Arabidopsis_thaliana.TAIR10.dna.toplevel.fa
    perl installation_path/RepeatModeler/RepeatModeler -database Arabidopsis_thaliana -engine ncbi -pa 20
    installation_path/RepeatMasker/RepeatMasker -lib *-families.fa -gff -cutoff 250 -nolow -div 60 Arabidopsis_thaliana.TAIR10.dna.toplevel.fa

2 - To format the output:
    - Make sure to be at athaliana folder

    //code
    cd repeatmodeler
    perl ../../codestoformat/formatOutputRepeatModelerGFF.pl Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.out
    //code

    - A RepeatModeler.gff3 file will be generated.

5 - RepeatScout is a methodology similar to RepeatModeler, the output steps are similar. A "library" is generated, which we then pass to the RepeatModeler for mapping the entire genome, there are examples in "athaliana / repeatscout".

1 - To use RepeatScout:
    installation_path/build_lmer_table -sequence Arabidopsis_thaliana.TAIR10.dna.toplevel.fa -freq Arabidopsis_thaliana.freq
    installation_path/RepeatScout -sequence Arabidopsis_thaliana.TAIR10.dna.toplevel.fa -output Arabidopsis_thaliana.rs -freq Arabidopsis_thaliana.freq
    cat Arabidopsis_thaliana.rs | installation_path/filter-stage-1.prl > filtered.fasta
    installation_path/RepeatMasker -lib filtered.fasta Arabidopsis_thaliana.TAIR10.dna.toplevel.fa
    cat filtered.fasta | installation_path/filter-stage-2.prl --cat Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.out > filtered2.fasta
    ## Using the library to run RepeatMasker:
    installation_path/RepeatMasker -pa 4 -s -lib filtered2.fasta -nolow -no_is -gff Arabidopsis_thaliana.TAIR10.dna.toplevel.fa

2 - To format the output:
    - Make sure to be at athaliana folder

    - part 1
        //code
        cd repeatscout
        perl ../../codestoformat/formatOutputRepeatScoutGFF.pl Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.out
        //code

        - A RepeatScout.gff3 file will be generated.

    - part 2
        //code
        cat repeatscout/*.fasta > RepeatScout.fa
        awk '/^[>;]/ { if (seq) { print seq }; seq=""; print } /^[^>;]/ { seq = seq $0 } END { print seq }' RepeatScout.fa > RepeatScout_formated.fasta fold -w 60 RepeatScout_formated.fasta > uniqueRepeatScout.fa
        //code 

        - Here, you will need to run PASTEClassifier (Topic 8). It will generate a file "ProjectName.classif"

    - part 3
        - Some manipulation file will be required.
        //code
        sed 's/R=//g' repeatscout/RepeatScout.gff3 > repeatscout/RepeatScoutR.gff3
        sort -k3,3V repeatscout/RepeatScoutR.gff3 > repeatscout/RepeatScoutOrdered.gff3
        perl ../codestoformat/formatOutputUniqueScout.pl ProjectName.classif repeatscout/RepeatScoutOrdered.gff3
        //code

        - A RepeatScout.gff3 file will be generated.

6 - MITEHunter is a strategy that searches the genome and builds consensus only for TE of the MITE type. These results are given in "groups" that later are used for formatting script.

1 - To use MITEHunter:
    perl installation_path/MITE_Hunter_manager.pl -i Arabidopsis_thaliana.TAIR10.dna.toplevel.fa -c 6 -n 6 -S 12345678

2 - To format the output:
    - Make sure to be at athaliana folder

    //code
    cd mitehunter
    for filename in *.fa; do
        perl ../..//codestoformat/formatOutputMITE.pl ${filename}
    done
        //code

    - A MITE.gff3 file will be generated.

7 - HelitronScanner works in a similar way to MITEHunter, and it is also necessary to use a script to format it for future manipulation of these records.

1 - To use HelitronScanner:
    java -jar installation_path/HelitronScanner/HelitronScanner.jar scanHead -lf installation_path/TrainingSet/head.lcvs -bs 5000000 -g Arabidopsis_thaliana.TAIR10.dna.toplevel.fa -th 6 -o installation_path/head.helitronscanner.out   
    java -jar installation_path/HelitronScanner/HelitronScanner.jar scanHead -lf installation_path/TrainingSet/head.lcvs -bs 5000000 -g Arabidopsis_thaliana.TAIR10.dna.toplevel.fa -th 6 --rc -o installation_path/head2.helitronscanner.out 
    java -jar installation_path/HelitronScanner/HelitronScanner.jar scanTail -lf installation_path/TrainingSet/tail.lcvs -bs 5000000 -g Arabidopsis_thaliana.TAIR10.dna.toplevel.fa -th 6 -o installation_path/tail.helitronscanner.out  
    java -jar installation_path/HelitronScanner/HelitronScanner.jar scanTail -lf installation_path/TrainingSet/tail.lcvs -bs 5000000 -g Arabidopsis_thaliana.TAIR10.dna.toplevel.fa -th 6 --rc -o installation_path/tail2.helitronscanner.out    
    java -jar installation_path/HelitronScanner/HelitronScanner.jar pairends -hs installation_path/head.helitronscanner.out -ts installation_path/tail.helitronscanner.out -o installation_path/pairends  
    java -jar installation_path/HelitronScanner/HelitronScanner.jar pairends -hs installation_path/head2.helitronscanner.out -ts installation_path/tail2.helitronscanner.out -o installation_path/pairends2   
    java -jar installation_path/HelitronScanner/HelitronScanner.jar draw -p installation_path/pairends -g Arabidopsis_thaliana.TAIR10.dna.toplevel.fa -o installation_path/output --pure --ext --flanking  
    java -jar installation_path/HelitronScanner/HelitronScanner.jar draw -p installation_path/pairends2 -g Arabidopsis_thaliana.TAIR10.dna.toplevel.fa -o installation_path/output_reverse --pure --ext --flanking

2 - To format the output:
    - Make sure to be at athaliana folder

    //code
    cd helitronscanner
    perl ../../codestoformat/formatOutputHelitron.pl output.ext.hel.fa +
    perl ../../codestoformat/formatOutputHelitron.pl output_reverse.ext.hel.fa -
    //code

    - A Helitrons.gff3 file will be generated.

8 - With PASTEClassifier, we label the strings that are identified by RepeatScout, since it builds the strings by stretching smaller strings. Before passing this "library" through RepeatMasker, a set of scripts is used to format and obtain results accordingly.

1 - To use PASTEClassifier:
    ./PASTEClassifier.py -i uniqueRepeatScout.fa -C PASTEClassifier.cfg >& PASTEC.log

2 - To format the output:
    - A file "ProjectName.classif" will be generated and used at Topic 5, part 3.

After all these processing and automated scripts you will need to unify all these notes in a single file. To do this use: - Make sure to be at athaliana folder

  //code
  cat LTR_Retriever.gff3 RepeatModeler.gff3 RepeatMasker.gff3 NonLTR.gff3 MITE.gff3 Helitrons.gff3 RepeatScout.gff3 > TEAnnotation.asd
  #### Remove some remaining TEs
  sed -e 's/\?//g' TEsannotation.asd > TEAnnotation.gff3
  sed -i '' "/SSR/d" TEAnnotation.gff3
  sed -i '' "/Satellite/d" TEAnnotation.gff3
  sed -i '' "/Simple_repeat/d" TEAnnotation.gff3
  sed -i '' "/Other\/Composite/d" TEAnnotation.gff3
  //code

In this way, we obtain the total set of TEs, however they still contain annotation overlap and redundancy. To work around this problem, we developed a script that calculates the overlap and removes duplications without losing information (cpp.01). Partial overlays are treated in such a way that you do not lose part of the sequence or give preference to one or the other. Partial sequences are extracted in order to preserve the smallest (Start Sequence) and largest (End Sequence) geonumeric position of the overlap.

Make sure to be at athaliana folder

  //code 
  mv TEAnnotation.gff3 original_TEAnnotation.gff3
  sort -k1,1 -k4,4n original_TEAnnotation.gff3 > teste1.gff3
  bedtools merge -i teste1.gff3 -s -c 7,1 -o distinct,count > ts.bed
  awk -F"\t" '{print $1 "\t" "." "\t" "." "\t" $2+1 "\t" $3 "\t" "." "\t" $4 "\t" "." "\t" "." "\t" $5}' ts.bed > ts.gff3
  sed -e 's/SINE2/SINE/g' original_TEAnnotation.gff3 > teste1.gff3
  sed -e 's/|/\//g' teste1.gff3 > teste2.gff3
  sed -e 's/tRNA-L1/tRNA/g' teste2.gff3 > teste1.gff3
  sed -e 's/Ginger2-TDD/Ginger/g' teste1.gff3 > teste2.gff3
  sed -e 's/MULE-MuDR/MuLE-MuDR/g' teste2.gff3 > teste1.gff3
  sed -e 's/RTE-BovB/RTE/g' teste1.gff3 > teste2.gff3
  sed -e 's/noCat/Unknown/g' teste2.gff3 > teste1.gff3
  sed -e 's/SINE\/TRIM/TRIM/g' teste1.gff3 > teste2.gff3
  sed -e 's/Transposable/Unknown/g' teste2.gff3 > teste1.gff3
  sed -e 's/rRNA/SINE\/rRNA/g' teste1.gff3 > teste2.gff3
  sed -e 's/snRNA/SINE\/snRNA/g' teste2.gff3 > teste1.gff3
  bedtools intersect -a ts.gff3 -b teste1.gff3 -wa -wb -s > intersect.gff3
  perl ../codestoformat/formatOutputCoordenation.pl ts.gff3 intersect.gff3 1
  //code

Two files will be generated "TEAnnotationlFinal-1.gff3" and "TEAnnotationCompleto-1.gff3". Basically both have the same content.

With all those steps you will be able to retrieve a complete TE annotation from any genome you want.

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
athaliana		athaliana
codestoformat		codestoformat
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Releases 1

Packages

Languages

License

alerpaschoal/apte_pipeline

Folders and files

Latest commit

History

Repository files navigation

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages