Systematic analysis of the underlying genomic architecture for transcriptional-translational coupling in prokaryotes
Richa Bharti, Daniel Siebert, Bastian Blombach, Dominik G. Grimm
The aim of this study is to provide a comprehensive workflow to systematically investigate bacterial genomes for the abundance of transcription- and translation-associated genes clustered in distinct operons.
We have created a comparative genomics pipeline for screening genomic distributions of probable conserved operonic gene cassettes in bacteria (Figure 1). More details can be found in the accompanying manuscript (currently under preparation). The workflow consists of a series of steps implemented as custom Python, R and Bash scripts. The full pipeline can be run with a single Bash script, run.sh (more details can be found below).
Figure 1 gives a general overview of the different steps of the pipeline. More details can be found in the accompanying manuscript.
Basic prerequisites that need to be satisfied:
Any Linux based distro should work. We tested the scripts using:
Distributor ID: Ubuntu
Description: Ubuntu 18.04.2 LTS
Release: 18.04
Codename: bionic
(output of lsb_release -a on an Ubuntu-based system)
In order to reproduce the results, please make sure that all required libraries are installed on your local machine. Without them the workflow will fail to run.
- Clone this project:
git clone https://github.com/grimmlab/transcriptional-translational-coupling.git
- If not already installed, please install R (>4.0.3) and the taxize library on your local machine:
sudo add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/'
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
sudo apt update
gpg --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
gpg -a --export E298A3A825C0D65DFD57CBB651716619E084DAB9 | sudo apt-key add -
sudo apt install r-base r-base-core r-recommended r-base-dev
sudo Rscript -e 'install.packages("taxize",dependencies = TRUE)'
- Install all Python dependencies (we recommend setting up a virtual Python environment first):
pip3 install -r requirements.txt
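Before starting a run that may take several hours, it can help to confirm that the command-line tools used in this README are available. The following is a minimal sketch and not part of the repository: the tool list is an assumption derived from the commands shown in this README, and the helper name missing_tools is hypothetical.

```python
import shutil

# Command-line tools invoked in the setup and data steps of this README.
# This list is an assumption; adjust it to your own setup.
REQUIRED_TOOLS = ["git", "Rscript", "pip3", "zip", "unzip"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required command-line tools were found.")
```

The check only looks for executables on the PATH. It does not verify versions (for example the R version) or installed R and Python packages.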
To identify and investigate operons containing both transcriptional and translational genes, 2,071 bacterial genomes were downloaded from the DOOR2 database and corresponding annotation files were retrieved from the NCBI Assembly database using the available REST API.
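As an illustration of what a REST query against the NCBI Assembly database can look like, here is a minimal sketch using the public NCBI E-utilities esearch endpoint. This README does not state which endpoint the pipeline scripts call, so the use of E-utilities, the function names and the search term are assumptions for illustration only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities (assumed here; the pipeline may differ).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=20):
    """Build an esearch URL that searches the Assembly database for `term`."""
    params = {"db": "assembly", "term": term, "retmode": "json", "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def search_assembly_ids(term, retmax=20):
    """Return the Assembly UIDs matching `term` (requires network access)."""
    with urlopen(esearch_url(term, retmax)) as response:
        result = json.load(response)
    return result["esearchresult"]["idlist"]


if __name__ == "__main__":
    # Hypothetical example query; prints only the URL, no request is sent.
    print(esearch_url("Escherichia coli", retmax=5))
```

Fetching annotations for more than two thousand genomes this way is slow, which is why the data dump described below is the recommended route.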
We created a data dump that includes all data from the DOOR2 database and the NCBI Assembly database needed to reproduce the results from the paper. Alternatively, all data can also be fetched directly from the DOOR2 and NCBI Assembly databases (time-consuming).
To download the data dump, run the following commands in your command line.
This is only needed if you do not run the full pipeline. To run the full pipeline, have a look at the next section.
- First clone this repository:
git clone https://github.com/grimmlab/transcriptional-translational-coupling.git
- Move to the data folder of the cloned repository:
cd transcriptional-translational-coupling/data
- Merge the split data zip files:
zip -FF data_dump.zip --out data.zip
- Unzip the merged zip file to extract the data:
unzip data.zip
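Because the dump is reassembled from split parts, it can be worth checking the merged archive before relying on its contents. The sketch below uses Python's standard zipfile module and is not part of the repository; the helper name check_zip is hypothetical, and the file name data.zip is taken from the commands above.

```python
import zipfile


def check_zip(path):
    """Return (is_ok, message) after a CRC check of the archive at `path`."""
    if not zipfile.is_zipfile(path):
        return False, f"{path} is not a readable zip archive"
    with zipfile.ZipFile(path) as archive:
        bad_member = archive.testzip()  # name of the first corrupt member, or None
        if bad_member is not None:
            return False, f"corrupt member: {bad_member}"
        return True, f"{len(archive.namelist())} members passed the CRC check"


if __name__ == "__main__":
    ok, message = check_zip("data.zip")
    print("OK:" if ok else "Problem:", message)
```

The check reads every member once, so it can take a while on a large dump.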
The full pipeline can be run by executing a single bash script (running the full pipeline can take several hours):
sh run.sh
This will run the full pipeline, as illustrated in Figure 1B.
In the following, we give some details about the individual steps of the pipeline within the run.sh bash script. More details about the general pipeline can be found in the accompanying manuscript (currently under preparation):
- The environment and all PATH variables are set up by the script.
- run_uncompress_door2_and_ncbi_data: The data dumps from the GitHub repo are merged and unzipped into data/data.
- run_classification_code: This is the most time-intensive step of the whole pipeline. Here the annotations from the NCBI and DOOR2 databases are analysed. First, annotation files are extracted from the NCBI Assembly database for each genome, and a table of operons is created based on the relative proportions of operons that fall into one of the following categories:
  a) Genes in operon associated with only transcription
  b) Genes in operon associated with only translation
  c) both: Genes in operon associated with both transcription and translation
  d) Genes in operon associated with neither transcription nor translation.
  Second, count data is generated for each operon table by comparing it with a list of bacterial transcriptional and translational genes. The resulting gene list consists of gene names and their reported synonyms for each individual entry. A simultaneous keyword (gene name) and synonym-based (gene-synonym) search module is used to create a count table containing a catalogue of each of the categories.
- run_concatenate_all_classified_files: The output files from the previous task are merged into a single table, including information such as locus tag, function, gene name and COG id. In addition, the table is filtered to keep only operons with genes associated with both transcription and translation (results can be found in analyses/genome_list_containing_both.txt and analyses/Final_combined_files.txt).
- run_modify_concatenate_all_classified_files: Because genes are named and identified inconsistently, and annotations can be inconsistent or missing, an additional script is used to create a unified and cleaned-up table (output can be found in analyses/Final_combined_files_corrected.txt).
- run_occurrence_based_ranking: In this step an occurrence-based ranking is performed, in which genes, functions and COG ids are grouped and clustered based on their occurrence in the genomes. The top 18 overlapping occurrence genes were extracted and used to perform a gene enrichment and clustering-based analysis. Results of this analysis, including plots, can be found in analyses/occurrence.
- run_cooccurrence_based_gene_ranking: All co-occurring gene cassettes are grouped and counted. Results of this analysis can be found in analyses/co-occurrence/gene_cooccurrence_based_ranking_output.txt.
- run_cooccurrence_based_functional_ranking: All co-occurring functional gene cassettes are grouped and counted. Results of this analysis can be found in analyses/co-occurrence/functional_cooccurrence_based_ranking_output.txt.
- run_gene_cassette_search: The STRING v10 database is used to perform a network-based analysis for clustering gene cassettes based on gene fusion (genes reportedly existing as hybrids without any intergenic sequence(s)), gene neighborhood (genes within close proximity) and gene co-occurrence (genes existing together on the same genomic loci with intergenic sequences and/or other genes). Next, the frequency of the resulting operonic gene cassettes across all extracted genomes is computed. Results of this analysis can be found in analyses/analyses/motif_counts_for_genes.txt and in analyses/gene_motif/.
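To make the classification and co-occurrence steps more concrete, here is a minimal Python sketch of the idea. It is not the code used by the pipeline: the gene lists are tiny illustrative placeholders, all function names are hypothetical, and it assumes that an operon counts as "transcription" or "translation" when it contains at least one gene from that list and none from the other.

```python
from collections import Counter

# Illustrative placeholder lists: canonical gene name -> reported synonyms.
# The real pipeline uses its own curated gene and synonym lists.
TRANSCRIPTION_GENES = {"rpoB": set(), "nusA": set(), "nusG": set()}
TRANSLATION_GENES = {"rplK": set(), "rpsJ": {"nusE"}, "infB": set()}


def build_lookup(gene_table):
    """Map every lower-cased gene name and synonym to its canonical name."""
    lookup = {}
    for name, synonyms in gene_table.items():
        for alias in {name, *synonyms}:
            lookup[alias.lower()] = name
    return lookup


TRANSCRIPTION_LOOKUP = build_lookup(TRANSCRIPTION_GENES)
TRANSLATION_LOOKUP = build_lookup(TRANSLATION_GENES)
ALL_LOOKUP = {**TRANSCRIPTION_LOOKUP, **TRANSLATION_LOOKUP}


def classify_operon(genes):
    """Return 'transcription', 'translation', 'both' or 'neither'."""
    names = [gene.lower() for gene in genes]
    has_transcription = any(name in TRANSCRIPTION_LOOKUP for name in names)
    has_translation = any(name in TRANSLATION_LOOKUP for name in names)
    if has_transcription and has_translation:
        return "both"
    if has_transcription:
        return "transcription"
    if has_translation:
        return "translation"
    return "neither"


def count_categories(operons):
    """Count operons per category; `operons` maps operon id -> gene names."""
    return Counter(classify_operon(genes) for genes in operons.values())


def count_cassettes(operons):
    """Count each set of matched genes that co-occurs in a 'both' operon."""
    cassettes = Counter()
    for genes in operons.values():
        if classify_operon(genes) != "both":
            continue
        matched = {ALL_LOOKUP[g.lower()] for g in genes if g.lower() in ALL_LOOKUP}
        cassettes[tuple(sorted(matched))] += 1
    return cassettes


if __name__ == "__main__":
    example = {
        "operon_1": ["nusG", "rplK"],
        "operon_2": ["rpoB"],
        "operon_3": ["NusE", "infB"],
        "operon_4": ["lacZ", "lacY"],
    }
    print(count_categories(example))
    print(count_cassettes(example))
```

Matching on lower-cased names and synonyms mirrors the combined keyword and synonym search described above. Representing each cassette as a sorted tuple makes the count independent of gene order within an operon.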