Home
This is the wiki for the project BG7.
Click here for the linkable SVG file.
The headers of the FASTA file containing the RNA sequences must follow the format of the .frn files found in RefSeq; that is, they should look something like this:
>ref|NC_011283|:75804-75898|Sec tRNA| [locus_tag=KPK_0076]
You can find an example here for the RNA file of a Clostridium strain: ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Clostridium_SY8519_uid68705/NC_015737.frn
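As a rough illustration of the header layout described above, the following sketch checks a header against that layout and extracts its fields. It is not part of BG7: the class name `RnaHeaderParser`, the regular expression, and the optional `c` before the coordinates (assumed here to mark the complementary strand) are all assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of BG7: validates a .frn-style header.
final class RnaHeaderParser {

    // >ref|ACCESSION|:START-END|PRODUCT| [locus_tag=TAG]
    // The optional 'c' before the coordinates is an assumption of this sketch.
    private static final Pattern HEADER = Pattern.compile(
            "^>ref\\|([^|]+)\\|:(c?)(\\d+)-(\\d+)\\|([^|]*)\\|\\s*\\[locus_tag=([^\\]]+)\\]\\s*$");

    /** Returns {accession, start, end, product, locusTag}, or null if the header does not match. */
    static String[] parse(String header) {
        Matcher m = HEADER.matcher(header);
        if (!m.matches()) {
            return null;
        }
        return new String[] {
            m.group(1), m.group(3), m.group(4), m.group(5).trim(), m.group(6)
        };
    }

    public static void main(String[] args) {
        String[] fields = parse(">ref|NC_011283|:75804-75898|Sec tRNA| [locus_tag=KPK_0076]");
        System.out.println(String.join(", ", fields));
    }
}
```

A header that returns `null` here would need to be adjusted before being used as input, under the assumptions stated above.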
This program is the main entry point for the project. It relies on the 'executions.xml' file, which specifies the sub-programs and their arguments so that, taken together, they perform the whole annotation process. The corresponding jar file can be found in the /jars project folder.
Expected execution times:
- FixFastaHeaders: almost instantaneous
- PredictGenes: about 10 minutes
- RemoveDuplicatedGenes: 5 to 10 minutes
- SolveOverlappings: 10 to 15 minutes
- FillDataFromUniprot: directly proportional to the number of proteins (with a large number of proteins it sometimes gets stuck for a while; we suspect UniProt temporarily cuts off access from our IP)
- FillDataFromBio4j: ~ 1 minute
- GenerateCSVFile: almost instantaneous
Associates a unique ID with each FASTA header.
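The idea of tagging each FASTA header with a unique ID can be sketched as follows. This is only an illustration and not the FixFastaHeaders implementation: the class name `FastaHeaderIds`, the `prefix` parameter, and the `>prefixN|original header` output layout are hypothetical choices made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the actual FixFastaHeaders program.
final class FastaHeaderIds {

    /** Prepends a sequential ID to every header line; sequence lines pass through unchanged. */
    static List<String> assignIds(List<String> lines, String prefix) {
        List<String> out = new ArrayList<>();
        int counter = 0;
        for (String line : lines) {
            if (line.startsWith(">")) {
                counter++;
                // The ID layout below is an assumption made for illustration only.
                out.add(">" + prefix + counter + "|" + line.substring(1));
            } else {
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> fasta = Arrays.asList(">contig A", "ACGT", ">contig B", "TTGA");
        for (String line : assignIds(fasta, "seq")) {
            System.out.println(line);
        }
    }
}
```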
Completes protein data by performing HTTP requests to the UniProt site.
Completes protein data by retrieving it from the Bio4j DB.
Removes all genes that are duplicated.
Resolves every overlap found between genes and RNAs.
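Detecting an overlap between a gene and an RNA comes down to comparing two coordinate intervals. The sketch below shows only that detection step, assuming 1-based inclusive coordinates on the same contig; how BG7 then decides which feature to keep or trim is not covered here. The class and method names are hypothetical.

```java
// Hypothetical sketch of overlap detection; not the SolveOverlappings implementation.
final class OverlapCheck {

    /**
     * Returns the number of positions shared by [startA, endA] and [startB, endB],
     * assuming 1-based inclusive coordinates with start <= end. Returns 0 if they do not overlap.
     */
    static int overlapLength(int startA, int endA, int startB, int endB) {
        int overlap = Math.min(endA, endB) - Math.max(startA, startB) + 1;
        return Math.max(overlap, 0);
    }

    public static void main(String[] args) {
        // A made-up gene at 75700-75850 against an RNA at 75804-75898.
        System.out.println(overlapLength(75700, 75850, 75804, 75898));
    }
}
```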
This is one of the most important programs/steps in the semi-automatic annotation process. It carries out the gene prediction phase of the process.
Generates two multifasta files for the genes that have been predicted by the end of the process: one with the nucleotide sequences and the other with the amino acid sequences.
Generates both an XML file and a multifasta file including every intergenic sequence.
Generates the GFF file corresponding to the final XML results file.
Exports the final XML results file to a CSV file.
Creates a new annotation XML file that omits every dismissed gene present in the input annotation XML file.
Exports the final XML annotation file to EMBL format (one file for each contig).
Exports the final XML annotation file to GenBank format.
Exports the final XML annotation file to GenBank format.
Looks for <Iteration_query-def> values with unusual or incorrect syntax in BLAST output XML files, specifically values with the wrong number of '|' characters.
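The core of such a check is counting the '|' characters in each value and comparing the count with the expected one. The following is a minimal sketch under that assumption; the class name `PipeCountCheck` and the `expectedPipes` parameter are hypothetical, and the number of separators that counts as correct depends on the header format used in the pipeline, which is not stated here.

```java
// Hypothetical sketch; the expected separator count is supplied by the caller.
final class PipeCountCheck {

    /** Counts the '|' characters in the given value. */
    static int countPipes(String value) {
        int count = 0;
        for (int i = 0; i < value.length(); i++) {
            if (value.charAt(i) == '|') {
                count++;
            }
        }
        return count;
    }

    /** Flags a query definition whose number of '|' characters differs from the expected one. */
    static boolean looksMalformed(String queryDef, int expectedPipes) {
        return countPipes(queryDef) != expectedPipes;
    }

    public static void main(String[] args) {
        // Made-up values, checked against an assumed expected count of 2.
        System.out.println(looksMalformed("a|b|c", 2));
        System.out.println(looksMalformed("a|b|c|d", 2));
    }
}
```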
Generates some statistics about proteins grouped by organism.
Performs a very basic (but still useful) quality control on the final annotation results XML file.
Performs an automatic quality control on a random selection of results from the final annotation XML file.
Quality control program for the GenBank file exporter program: Export GenBank files
Quality control program for the '5 columns' GenBank file (the format used for genome submissions) exporter program: Export 5 columns GenBank files
Quality control program for the file generated by the program 'FixFastaHeaders'.