Analysis accompanying the manuscript "Biosurfer for systematic tracking of regulatory mechanisms leading to protein isoform diversity"
This repository contains the steps to run the Biosurfer analysis, which reproduces the results, summary plots, and figures for the Biosurfer manuscript (bioRxiv).
Contents
- Download Biosurfer analysis repository
- Download and install Biosurfer package
- Download input data
- Run Biosurfer modules
- Global characterization of altered protein regions in the human annotation (GENCODE)
Clone the analysis repository to use the latest version of the source code.
git clone https://github.com/sheynkman-lab/biosurfer_analysis
cd biosurfer_analysis
conda create --name biosurfer-install --channel conda-forge python=3 pip
conda activate biosurfer-install
conda install --channel conda-forge graph-tool
git clone https://github.com/sheynkman-lab/biosurfer.git
Note: The Biosurfer package will be downloaded within the biosurfer_analysis directory.
The editable installation of the Biosurfer package looks for setup.py within the biosurfer directory.
pip install --editable biosurfer
Note: If you get an importlib.metadata.PackageNotFoundError, deactivate and then reactivate the conda environment.
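A quick way to confirm that the editable install is visible to the active environment is to query the package metadata directly. The snippet below is a minimal sketch; it assumes the distribution is registered under the name "biosurfer", which is not stated in this README and should be checked against setup.py.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumption: the distribution name is "biosurfer" (check setup.py).
    version = installed_version("biosurfer")
    if version is None:
        print("biosurfer not found; try deactivating and reactivating the conda env")
    else:
        print(f"biosurfer {version} is installed")
```

If the function returns None right after `pip install --editable biosurfer`, reactivating the environment as described in the note is the first thing to try.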
The input data used for the analysis and the corresponding outputs generated by Biosurfer can be found on Zenodo:
- GENCODE toy:
- Description: Toy dataset generated from GENCODE v38
- Use: This dataset can be used to test the functionality and modules of Biosurfer
- Size: 4.2 MB
- GENCODE v42:
- Description: It contains the basic gene annotation on the primary assembly sequence regions
- Use: Used for the analyses conducted in the manuscript
- Size: 1.29 GB
- WTC11:
- Description: WTC11 is a long-read RNA-seq dataset from human induced pluripotent stem cells (iPSCs) (Kreitzer et al. 2013)
- Use: Used for the analyses conducted in the manuscript.
- Size: 644 MB
for source in gencode_toy gencode_v42 wtc11
do
bash "./scripts/download_$source.sh"
done
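After the downloads finish, it can help to confirm that the input files are in place before loading them. The sketch below assumes the download scripts put the files at the relative paths used by the load_db commands in this README; the helper name `missing_inputs` is illustrative and is not part of Biosurfer.

```python
from pathlib import Path

# Relative paths taken from the load_db commands in this README.
EXPECTED_INPUTS = [
    "A_gencode_toy/biosurfer_gencode_toy_data/gencode.v38.toy.gtf",
    "A_gencode_toy/biosurfer_gencode_toy_data/gencode.v38.toy.transcripts.fa",
    "A_gencode_toy/biosurfer_gencode_toy_data/gencode.v38.toy.translations.fa",
    "A_gencode_v42/biosurfer_gencode_v42_data/gencode.v42.basic.annotation.gtf",
    "A_gencode_v42/biosurfer_gencode_v42_data/gencode.v42.pc_transcripts.fa",
    "A_gencode_v42/biosurfer_gencode_v42_data/gencode.v42.pc_translations.fa",
    "A_wtc11/biosurfer_wtc11_data/wtc11_with_cds.gtf",
    "A_wtc11/biosurfer_wtc11_data/wtc11_corrected.fasta",
    "A_wtc11/biosurfer_wtc11_data/wtc11_orf_refined.fasta",
    "A_wtc11/biosurfer_wtc11_data/wtc11_classification.txt",
]


def missing_inputs(root="."):
    """Return the expected input files that do not exist under root."""
    root = Path(root)
    return [path for path in EXPECTED_INPUTS if not (root / path).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing input files:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected input files are present.")
```

Run it from the repository root; an empty result suggests all three datasets were downloaded and unpacked where the later commands expect them.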
Note: Any GENCODE version can be used with the appropriate GTF, transcript FASTA, and translation FASTA files.
Please also note that in the code, the terms anchor and other correspond to the reference and alternative isoforms mentioned in the manuscript.
For more information on the modules, refer to the Biosurfer package repository (https://github.com/sheynkman-lab/biosurfer).
Running the load database module creates a SQLite database file under the biosurfer/databases/ directory.
biosurfer load_db \
--source=GENCODE \
--gtf A_gencode_toy/biosurfer_gencode_toy_data/gencode.v38.toy.gtf \
--tx_fasta A_gencode_toy/biosurfer_gencode_toy_data/gencode.v38.toy.transcripts.fa \
--tl_fasta A_gencode_toy/biosurfer_gencode_toy_data/gencode.v38.toy.translations.fa \
-d gencode_toy
biosurfer load_db \
--source=GENCODE \
--gtf A_gencode_v42/biosurfer_gencode_v42_data/gencode.v42.basic.annotation.gtf \
--tx_fasta A_gencode_v42/biosurfer_gencode_v42_data/gencode.v42.pc_transcripts.fa \
--tl_fasta A_gencode_v42/biosurfer_gencode_v42_data/gencode.v42.pc_translations.fa \
-d gencode_v42
Load the GENCODE v42 GTF annotations first to set the reference isoforms for WTC11 PacBio data
biosurfer load_db \
--source=GENCODE \
--gtf A_gencode_v42/biosurfer_gencode_v42_data/gencode.v42.basic.annotation.gtf \
--tx_fasta A_gencode_v42/biosurfer_gencode_v42_data/gencode.v42.pc_transcripts.fa \
--tl_fasta A_gencode_v42/biosurfer_gencode_v42_data/gencode.v42.pc_translations.fa \
-d wtc11
Load the WTC11 PacBio data
biosurfer load_db \
--source=PacBio \
--gtf A_wtc11/biosurfer_wtc11_data/wtc11_with_cds.gtf \
--tx_fasta A_wtc11/biosurfer_wtc11_data/wtc11_corrected.fasta \
--tl_fasta A_wtc11/biosurfer_wtc11_data/wtc11_orf_refined.fasta \
--sqanti A_wtc11/biosurfer_wtc11_data/wtc11_classification.txt \
-d wtc11
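To check that a load step actually produced a database, the files can be inspected with Python's standard sqlite3 module. This is a minimal sketch: it only lists table names and assumes nothing about the Biosurfer schema or about the file names inside biosurfer/databases/. The databases are opened read-only so that inspecting them cannot create or modify a file.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def list_tables(db_path):
    """Return the table names in a SQLite file, opened in read-only mode."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    # The directory comes from this README; the file names in it are not assumed.
    for db_file in sorted(Path("biosurfer/databases").glob("*")):
        if not db_file.is_file():
            continue
        try:
            print(db_file, list_tables(db_file))
        except sqlite3.DatabaseError as err:
            print(db_file, "is not a readable SQLite database:", err)
```

A database that lists no tables, or that is missing altogether, suggests the corresponding load_db command did not complete.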
mkdir B_hybrid_aln_results_toy
biosurfer hybrid_alignment \
-d gencode_toy \
-o B_hybrid_aln_results_toy \
--gencode
mkdir B_hybrid_aln_gencode_v42
biosurfer hybrid_alignment \
-d gencode_v42 \
-o B_hybrid_aln_gencode_v42 \
--gencode
mkdir B_hybrid_aln_wtc11
biosurfer hybrid_alignment \
-d wtc11 \
-o B_hybrid_aln_wtc11
Note: This step can take some time (~30 minutes), depending on the size of the input data.
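The three hybrid alignment runs differ only in the database name, the output directory, and whether `--gencode` is passed, so they can be driven from one table. The sketch below uses only the flags shown in this README; the helper names and the dry-run switch are illustrative and are not part of Biosurfer.

```python
import os
import subprocess

# (database name, output directory, pass --gencode?) as used in this README.
RUNS = [
    ("gencode_toy", "B_hybrid_aln_results_toy", True),
    ("gencode_v42", "B_hybrid_aln_gencode_v42", True),
    ("wtc11", "B_hybrid_aln_wtc11", False),
]


def hybrid_alignment_cmd(db_name, out_dir, gencode):
    """Build the argument list for one hybrid_alignment run."""
    cmd = ["biosurfer", "hybrid_alignment", "-d", db_name, "-o", out_dir]
    if gencode:
        cmd.append("--gencode")
    return cmd


def run_all(dry_run=True):
    """Print the commands (dry run) or create output dirs and execute them."""
    for db_name, out_dir, gencode in RUNS:
        cmd = hybrid_alignment_cmd(db_name, out_dir, gencode)
        if dry_run:
            print(" ".join(cmd))
            continue
        # Unlike a plain mkdir, this does not fail if the directory exists.
        os.makedirs(out_dir, exist_ok=True)
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

With `dry_run=True` the script only prints the commands, which is a cheap way to review them before committing to the longer runs.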
The script below invokes the plotting module for the CRYBG2 gene and outputs a PNG file. Users can edit the script to view the protein isoforms of any gene.
bash ./scripts/isoform_plotting.sh
The following steps reproduce the results for GENCODE v42.
pip install ipykernel xlsxwriter openpyxl plotly
Genome-wide analysis of protein isoforms in the GENCODE annotation/WTC11
python3 ./scripts/genome_wide_summary.py
python3 ./scripts/n_termini_summary.py
python3 ./scripts/internal_summary.py
python3 ./scripts/c_termini_summary.py
To reproduce the results for WTC11, comment out line 76 and uncomment line 78 in plot_config.py.
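The line swap can also be scripted rather than done by hand. The helper below is a hypothetical sketch, not part of the repository: it comments out one 1-indexed line and uncomments another in a block of Python source. It assumes the line numbers given above still match your copy of plot_config.py, so check the two lines before applying it.

```python
def toggle_comment_lines(text, comment_line, uncomment_line):
    """Comment out one 1-indexed line and uncomment another, keeping indentation."""
    lines = text.splitlines(keepends=True)

    # Comment out the first target unless it is already a comment.
    target = lines[comment_line - 1]
    body = target.lstrip(" \t")
    indent = target[: len(target) - len(body)]
    if not body.startswith("#"):
        lines[comment_line - 1] = indent + "# " + body

    # Uncomment the second target if it is currently a comment.
    target = lines[uncomment_line - 1]
    body = target.lstrip(" \t")
    indent = target[: len(target) - len(body)]
    if body.startswith("#"):
        lines[uncomment_line - 1] = indent + body[1:].lstrip(" ")

    return "".join(lines)


if __name__ == "__main__":
    # Illustrative two-line stand-in; the real contents of plot_config.py
    # are not shown in this README.
    sample = 'dataset = "first"\n# dataset = "second"\n'
    print(toggle_comment_lines(sample, comment_line=1, uncomment_line=2))
```

For the real file, read plot_config.py into a string, call the function with 76 and 78, review the result, and only then write it back.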