gff toolbox mongo ingest
rodtheo edited this page Sep 9, 2021
·
1 revision
This command adds annotations to an existing GFF mongo database created with the gff-toolbox convert command. It follows the GFF convention and stores the annotations under Dbxref and Ontology_term inside the attributes field, recording for each annotation the external database it comes from (DBTAG field), the annotation code (ID tag) and, optionally, a description field. For details about the GFF file convention, please check this link.
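To make the DBTAG / ID / description layout concrete, here is a minimal Python sketch of how one annotation could be shaped and routed to Dbxref or Ontology_term. The function names and the rule "only GO goes to Ontology_term" are assumptions inferred from the example further down this page, not the tool's actual implementation.

```python
# Hypothetical helpers: they mirror the entry shape shown in this page's
# example output, not gff-toolbox's real code.

# Assumption: only GO identifiers are treated as ontology terms.
ONTOLOGY_TAGS = {"GO"}


def annotation_entry(annot_id, id_type, description=None):
    """Build one annotation entry with DBTAG, ID and an optional Description."""
    entry = {"DBTAG": id_type, "ID": annot_id}
    if description:  # the description is optional, so omit the key when empty
        entry["Description"] = description
    return entry


def target_attribute(id_type):
    """Pick the GFF attribute that would hold an annotation of this type."""
    return "Ontology_term" if id_type in ONTOLOGY_TAGS else "Dbxref"


print(target_attribute("GO"), annotation_entry("GO:0006810", "GO", "transport"))
print(target_attribute("PANTHER"), annotation_entry("PTHR36029", "PANTHER"))
```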
# Trigger help
gff-toolbox mongo-ingest -h
# Help
gff-toolbox:
Mongo-ingest
This command add annotations into an already created GFF mongo database.
usage:
gff-toolbox mongo-ingest --input <tsv> [--gff_feature gene --db_name <db_name> --genome_name <genome_name> --mongo_path <mongo_path> ]
gff-toolbox mongo-ingest -h | --help
options:
-h, --help Show this screen
-i, --input=<tsv> Annotation file in TSV (tab-separated values) format describing, for each line, the gff feature
(default: gene, can be changed in parameter --gff_feature) id and
corresponding annotations that should be included in mongodb database.
The annotation file must contain four columns (#locusName\tId\tIdType\tdescription). [Default: stdin].
-l, --gff_feature=<feature_type> Which GFF feature type must be annotated. [Default: gene].
-d, --db_name=<db_name> Name of existing mongodb database to update with anotations.
If database doesnt exist, create it using gff-toolbox convert module [Default: annotation_db].
-n, --genome_name=<genome_name> When loading the mongodb this will be used as collection name. [Default: Genome].
-p, --mongo_path=<mongo_path> Where to load your mongoDB? [Default: ./mongodb].
If you insert a path that already have a mongoDB in it will include (append)
the GFF as new collection (<genome_name>) in a new or existing DB (<db_name>).
example:
## Create a mongo database from a GFF, if it doesnt exist yet
## This will create the GFF mongodb collection named <genome_name> in mongodb <db_name>
## DBs are writen in the localhost 27027 mongo db connection of mongo shell
$ gff-toolbox convert --format mongodb -i Kp_ref.gff --genome_name Kp --db_name annotation_db
## Next, include annotations written in gene.functions.txt file to corresponding
## gene features in existing GFF mongo collection Kp
$ gff-toolbox mongo-ingest -i gene.functions.txt -n Kp
# Example
## Suppose we'd like to annotate our
## First, build the mongo database from a GFF or use an already created
gff-toolbox convert --format mongodb -i Kp_ref.gff --genome_name Kp --db_name annotation_db
## If we display the database entries for genes gene-KPHS_00170 and gene-KPHS_02590 in our created collection
## using pymongo (or the mongo shell, or another tool that interacts with mongodb), we get the following output in JSON format
{ ...
{'recid': 'NC_016845.1',
 'source': 'RefSeq',
 'type': 'gene',
 'start': '22533',
 'end': '22802',
 'score': '.',
 'strand': '+',
 'phase': '.',
 'attributes': {'ID': 'gene-KPHS_00170',
                'Dbxref': 'GeneID:11844995',
                'Name': 'KPHS_00170',
                'gbkey': 'Gene',
                'gene_biotype': 'protein_coding',
                'locus_tag': 'KPHS_00170'}},
{'recid': 'NC_016845.1',
 'source': 'RefSeq',
 'type': 'gene',
 'start': '298103',
 'end': '299212',
 'score': '.',
 'strand': '+',
 'phase': '.',
 'attributes': {'ID': 'gene-KPHS_02590',
                'Dbxref': 'GeneID:11845246',
                'Name': 'KPHS_02590',
                'gbkey': 'Gene',
                'gene_biotype': 'protein_coding',
                'locus_tag': 'KPHS_02590'}},
...
}
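A listing like the one above could be produced with pymongo roughly as follows. This is only a sketch: the helper names are hypothetical, the `attributes.ID` and `type` field paths are read off the displayed entries, and the host and port (localhost, 27027) are taken from the help text and may differ in your setup.

```python
from pprint import pprint


def build_gene_query(gene_ids):
    """MongoDB filter matching gene features by their GFF ID attribute."""
    return {"type": "gene", "attributes.ID": {"$in": list(gene_ids)}}


def fetch_genes(collection, gene_ids):
    """Return the matching documents, hiding MongoDB's internal _id field."""
    return list(collection.find(build_gene_query(gene_ids), {"_id": 0}))


def show_genes(db_name="annotation_db", genome_name="Kp",
               host="localhost", port=27027):
    """Connect and pretty-print two genes (needs pymongo and a running server)."""
    from pymongo import MongoClient  # imported here so the helpers work without it

    client = MongoClient(host, port)
    collection = client[db_name][genome_name]  # collection is named after the genome
    for doc in fetch_genes(collection, ["gene-KPHS_00170", "gene-KPHS_02590"]):
        pprint(doc)
```

Calling `show_genes()` after running the convert step should print entries of the shape shown above, provided the database and collection names match the ones used on the command line.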
## Notice that both genes have some attributes inherited from the GFF file
## Now, suppose we'd like to supply those genes with more information
## We can do this by generating a tab-separated annotation file like the one below
cat test/gene.functions.tsv
## output
FeatureId        AnnotId        IdType   Description
gene-KPHS_00170  PTHR30520:SF0  PANTHER  TRANSPORTER-RELATED
gene-KPHS_00170  GO:0006810     GO       transport
gene-KPHS_00170  3.4.16.2       EC       Lysosomal Pro-Xaa carboxypeptidase
gene-KPHS_00170  GO:0005215     GO       transporter activity
gene-KPHS_02590  GO:0003735     GO       structural constituent of ribosome
gene-KPHS_02590  PTHR36029      PANTHER
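Before ingesting, it can help to check that the annotation file really has the four tab-separated columns and to see which annotations belong to each feature. The sketch below does this with Python's csv module. `read_annotations` is a hypothetical helper, and it assumes the first line is a header and that a missing description simply leaves the fourth column out.

```python
import csv
import io
from collections import defaultdict

# Same rows as the file above, with real tab separators.
SAMPLE = (
    "FeatureId\tAnnotId\tIdType\tDescription\n"
    "gene-KPHS_00170\tPTHR30520:SF0\tPANTHER\tTRANSPORTER-RELATED\n"
    "gene-KPHS_00170\tGO:0006810\tGO\ttransport\n"
    "gene-KPHS_00170\t3.4.16.2\tEC\tLysosomal Pro-Xaa carboxypeptidase\n"
    "gene-KPHS_00170\tGO:0005215\tGO\ttransporter activity\n"
    "gene-KPHS_02590\tGO:0003735\tGO\tstructural constituent of ribosome\n"
    "gene-KPHS_02590\tPTHR36029\tPANTHER\n"
)


def read_annotations(handle):
    """Group (AnnotId, IdType, Description) tuples by feature id."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # assumption: the first line is a header
    grouped = defaultdict(list)
    for row in reader:
        if not row:
            continue  # skip blank lines
        row = row + [""] * (4 - len(row))  # pad a missing description
        feature_id, annot_id, id_type, description = row[:4]
        grouped[feature_id].append((annot_id, id_type, description))
    return dict(grouped)


annotations = read_annotations(io.StringIO(SAMPLE))
for feature_id, entries in annotations.items():
    print(feature_id, len(entries))
```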
## Use the annotation file as input to gff-toolbox mongo-ingest to include the information for genes gene-KPHS_00170 and gene-KPHS_02590
gff-toolbox mongo-ingest -i test/gene.functions.tsv -n Kp --db_name annotation_db
## Displaying the database entries again lets us check that the annotations were included
{
...
{'recid': 'NC_016845.1',
 'source': 'RefSeq',
 'type': 'gene',
 'start': '22533',
 'end': '22802',
 'score': '.',
 'strand': '+',
 'phase': '.',
 'attributes': {'ID': 'gene-KPHS_00170',
                'Dbxref': [{'DBTAG': 'GeneID', 'ID': '11844995'},
                           {'DBTAG': 'PANTHER',
                            'ID': 'PTHR30520:SF0',
                            'Description': 'TRANSPORTER-RELATED'},
                           {'DBTAG': 'EC',
                            'ID': '3.4.16.2',
                            'Description': 'Lysosomal Pro-Xaa carboxypeptidase'}],
                'Name': 'KPHS_00170',
                'gbkey': 'Gene',
                'gene_biotype': 'protein_coding',
                'locus_tag': 'KPHS_00170',
                'Ontology_term': [{'DBTAG': 'GO',
                                   'ID': 'GO:0006810',
                                   'Description': 'transport'},
                                  {'DBTAG': 'GO',
                                   'ID': 'GO:0005215',
                                   'Description': 'transporter activity'}]}},
{'recid': 'NC_016845.1',
 'source': 'RefSeq',
 'type': 'gene',
 'start': '298103',
 'end': '299212',
 'score': '.',
 'strand': '+',
 'phase': '.',
 'attributes': {'ID': 'gene-KPHS_02590',
                'Dbxref': [{'DBTAG': 'GeneID', 'ID': '11845246'},
                           {'DBTAG': 'PANTHER', 'ID': 'PTHR36029'}],
                'Name': 'KPHS_02590',
                'gbkey': 'Gene',
                'gene_biotype': 'protein_coding',
                'locus_tag': 'KPHS_02590',
                'Ontology_term': [{'DBTAG': 'GO',
                                   'ID': 'GO:0003735',
                                   'Description': 'structural constituent of ribosome'}]}}
...
}