Modelling genetic variation effects in plant gene regulatory networks using transfer learning on genomic and transcription factor binding data
- Once you have forked this repository, download the DAP-seq peaks data from http://neomorph.salk.edu/PlantCistromeDB into the data directory and unzip. The data you need is "Peaks in narrowPeak format (FRiP >=5%)" found using the link above. The downloaded file is named dap_download_may2016_peaks.zip.
- Run the "fetch_data_and_create_subfolders.sh" script:
sh fetch_data_and_create_subfolders.sh
- Run the "prepare_tf_families.py" script:
python prepare_tf_families.py
. This script will join peaks from TFs belonging to the same families into a single bed file. Therefore, this will produce 46 bed files. - Run the "prepare_overlap_matrix.py" script:
python prepare_overlap_matrix.py
. This script creates an overlap matrix, which is a matrix that tells you which of the 46 TF families has experimental binding peaks for each 250 nt window of the Arabidopsis thaliana genome.
- Once the above steps are completed, you can now train models using the "train.py" script
python train.py
. This script will train models, di-nucleotide shuffled models (di-control) and si-nucleotide shuffled models(si-control).
We have two main steps to interpret our models with SHAP. The first step computes importance scores using the
DeepSHAP/DeepExplain implementation and uses MoDisco to generate motifs. The second step extracts these motifs and
saves them into an easier to use file for downstream R scripts that use for example MotifStack.
7. Run the "generate_predictive_motifs.py" script: "train.py" script python generate_predictive_motifs.py
. This will
generate motifs for each family using SHAP and MoDisco.
8. Run the "extract_motifs.py" script: python extract_motifs.py
To perform the mercator-predictions enrichment analysis, we need the following steps.
- Using "Arabidopsis_thaliana.TAIR10.pep.all.fa" found within the data/proteome subdirectory, compute Mercator4 functional annotations using Mercator4 ( https://www.plabipd.de/mercator_main.html ). Download the output and save it in the data directory as mercator4_output.txt.
- Run the "cluster_gene_by_prediction.py" script:
python cluster_genes_by_prediction.py
. This script will create the regulatory modules. - Run the "mercator4_cross_promoter_clusters_enrichment_analysis.py": This will perform enrichment of modules in
Mercator functional groups
python mercator4_cross_promoter_clusters_enrichment_analysis.py
.
We used SNPs from the AraGWAs catalog to perform this analysis. We have also uploaded the SNPs downloaded and used.
- Run the "SNP_effects_on_predictions.py" script:
python SNP_effects_on_predictions.py
.
We used heat stress MOA-seq data from Liang et al., 2022. This analysis compares peaks which showed a positive fold change to those which showed negative fold changes. Precisely, it looks at regions on the genome were recorded MOA-seq footprints increase or decrease and compares these two using the predicted binding profiles generated using our CNN models trained on Arabidopsis thaliana.
- Run the "stress_binding_classification.py" script:
python stress_binding_classification.py
.