This repository is designed to evaluate the performance of TEProf2 on simulation data.

The TEProf2 code has been adapted from the original implementation available at https://github.com/twlab/TEProf2Paper. These modifications were made to ensure compatibility with the updated versions of the Gencode GTF reference file and the Repeatmasker file.

How to prepare Gencode GTF reference file

Download Gencode version GTF reference file
Sort the File

cat <GENCODE GTF>  | awk '{if($3=="transcript"||$3=="exon"||$3=="start_codon"){print}}' |  awk -F "; " '{print $0"\t"$2}' > <OUTPUT_sorted.gtf>`

Use custom script to create dictionary

genecode_to_dic.py <OUTPUT_sorted.gtf>

This step will generate (1) genecode_plus.dic and (2) genecode_minus.dic. Use them to setup the arguments.txt

How to prepare Repeatmasker file

Download RM V4.0.6 file, and convert it to the clean bed format.

clean_RM.ipynb

bgzip the file with samtools

cat | sort -k1,1 -k2,2n > <SORTED.BED>

bgzip <SORTED.BED> > <SORTED.BED.GZ>

Create tabix index

tabix -p bed rmsk.bed.gz

Use rmsk.bed.gz to setup the arguments.txt

Quick Run

After completing the annotation process for each GTF file (refer to the guidance in Run annotation on each GTF file), we created a custom wrapper script to automate the process through to the final table creation. This script is available in the file run_command.sh.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Files

README.md

Latest commit

History

README.md

File metadata and controls