
Gene Annotation Tracks

A gene annotation track shows the locations and structures of genes and transcripts. Gene annotations come in many formats and from many sources, such as the UCSC known gene table, GTF, GFF, and BED. Currently, GIVE only supports the UCSC known gene table format; GTF/GFF support is planned for the next update. In a GIVE data source, a gene annotation track is set as the genePred type.

UCSC known gene table format

The UCSC known gene table format is the format used by the UCSC Known Genes dataset. That dataset is constructed by a fully automated process, based on protein data from Swiss-Prot/TrEMBL (UniProt) and the associated mRNA data from GenBank. The format is a tab-separated text file with 12 columns. The content of each column is described below.

name: Name of gene. This name will be shown in the gene annotation track of the GIVE genome browser.
chrom: Reference sequence chromosome or scaffold
strand: + or - for strand
txStart: Transcription start position (or end position for minus strand item)
txEnd: Transcription end position (or start position for minus strand item)
cdsStart: Coding region start (or end position for minus strand item)
cdsEnd: Coding region end (or start position for minus strand item)
exonCount: Number of exons
exonStarts: Exon start positions (or end positions for minus strand item)
exonEnds: Exon end positions (or start positions for minus strand item)
proteinID: (Currently NOT used in GIVE) UniProt ID, UniProt accession, or RefSeq protein ID
alignID: (Currently NOT used in GIVE) Unique identifier (GENCODE transcript ID for GENCODE Basic)
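
As a rough illustration of the column layout above, the following shell sketch checks a file before it is loaded into GIVE. The function name `check_known_gene` is hypothetical and is not part of GIVE or GIVE-Toolbox. It assumes that header lines, if present, start with `#`, and that exonStarts and exonEnds are comma-separated lists ending with a trailing comma, as in UCSC Table Browser output.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GIVE): report rows that do not match
# the 12-column layout described above. Exits non-zero if any row fails.
check_known_gene() {
    awk -F'\t' '
        /^#/ { next }    # assumption: header lines start with "#"
        NF < 12 {
            printf "line %d: expected at least 12 columns, found %d\n", NR, NF
            bad = 1
            next
        }
        {
            starts = $9
            ends = $10
            # assumption: lists end with a trailing comma, e.g. "100,500,"
            sub(/,$/, "", starts)
            sub(/,$/, "", ends)
            ns = split(starts, s, ",")
            ne = split(ends, e, ",")
            # exonCount (column 8) should equal the length of both lists
            if (ns != $8 + 0 || ne != $8 + 0) {
                printf "line %d (%s): exonCount is %s but found %d starts and %d ends\n", NR, $1, $8, ns, ne
                bad = 1
            }
        }
        END { exit bad + 0 }
    ' "$1"
}
```

A possible call would be `check_known_gene genePred.txt && echo "layout looks consistent"`. The check accepts rows with more than 12 columns, so it can also be run on a file that still carries an extra geneSymbol column.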

A gene annotation file in UCSC known gene table format can be downloaded from the UCSC Table Browser. By default, the first column holds the UCSC known gene ID (kgID, such as uc031tla.1), which is the name that will be shown in the genome browser. You may prefer to show the gene symbol instead of the kgID. This can be done in three steps.

  • First step: In the UCSC Table Browser, choose the genome you need, set group to Genes and Gene Predictions, track to GENCODE, table to knownGene, and output format to selected fields from primary and related tables. Enter a name for the output file, then click the get output button.
  • Second step: Check all of the knownGene fields and check geneSymbol among the kgXref fields, then click the get output button. You will get a gene annotation file with geneSymbol as the last column. The following GIF animation shows a demo of the first and second steps.
  • Third step: In the annotation file obtained in the second step, replace the first column with the last column and delete the last column. You can use awk, sed, or other tools to do this. We also provide a script, replace_kgname.sh, in GIVE-Toolbox for replacing the kgID:
    bash replace_kgname.sh -i genePred.txt -r 13 -o genePred_symbol.txt
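
If the script is not at hand, the third step can also be sketched with plain awk. This is an illustrative alternative, not the implementation of replace_kgname.sh, and the function name `symbol_to_name` is hypothetical. It assumes the file is tab-separated with geneSymbol as the last column, drops header lines starting with `#`, and keeps the kgID when no gene symbol is listed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the GIVE-Toolbox script): write the last column
# (geneSymbol) into column 1 and drop the last column.
symbol_to_name() {
    awk 'BEGIN { FS = "\t"; OFS = "\t" }
        /^#/ { next }    # assumption: header lines start with "#" and are not wanted
        NF > 1 {
            # assumption: fall back to the kgID when no gene symbol is listed
            line = ($NF != "") ? $NF : $1
            # append columns 2 .. NF-1 unchanged; the last column is left out
            for (i = 2; i < NF; i++) {
                line = line OFS $i
            }
            print line
        }
    ' "$1"
}
```

For example, `symbol_to_name genePred.txt > genePred_symbol.txt` would produce a 12-column file from the 13-column Table Browser output. The columns are rebuilt in an explicit loop instead of by decrementing `NF`, since not every awk implementation rebuilds the record when `NF` is changed.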

GTF Format

Coming soon.