GitHub - NAMlab/DeepCistrome: This repository contains code and scripts to reproduce the work on multi-label classifiers for DNA-TF binding

Modelling genetic variation effects in plant gene regulatory networks using transfer learning on genomic and transcription factor binding data

Preparing the necessary data

Once you have forked this repository, download the DAP-seq peaks data from http://neomorph.salk.edu/PlantCistromeDB into the data directory and unzip it. The dataset you need is "Peaks in narrowPeak format (FRiP >=5%)", available from the link above. The downloaded file is named dap_download_may2016_peaks.zip.
Run the "fetch_data_and_create_subfolders.sh" script: sh fetch_data_and_create_subfolders.sh
Run the "prepare_tf_families.py" script: python prepare_tf_families.py. This script joins the peaks from TFs belonging to the same family into a single bed file, producing 46 bed files in total.
Run the "prepare_overlap_matrix.py" script: python prepare_overlap_matrix.py. This script creates an overlap matrix, which records which of the 46 TF families have experimental binding peaks in each 250 nt window of the Arabidopsis thaliana genome.
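To illustrate what the overlap matrix described above contains, here is a minimal pure-Python sketch. It is not the code in prepare_overlap_matrix.py: the function name, the toy chromosome size, the family labels, and the rule that any overlap at all counts as a hit are assumptions made for this example.

```python
WINDOW = 250  # window size in nucleotides, as stated in the text above


def build_overlap_matrix(peaks_by_family, chrom_sizes, window=WINDOW):
    """Return (windows, families, matrix).

    matrix[i][j] is 1 if family j has at least one peak overlapping window i.
    Peaks are (chrom, start, end) tuples in BED convention: 0-based, half-open.
    """
    families = sorted(peaks_by_family)
    windows = [
        (chrom, start, min(start + window, size))
        for chrom, size in sorted(chrom_sizes.items())
        for start in range(0, size, window)
    ]
    index = {(chrom, start): i for i, (chrom, start, _) in enumerate(windows)}
    matrix = [[0] * len(families) for _ in windows]
    for j, family in enumerate(families):
        for chrom, p_start, p_end in peaks_by_family[family]:
            first = (p_start // window) * window      # first window touched
            last = ((p_end - 1) // window) * window   # last window touched
            for w_start in range(first, last + 1, window):
                i = index.get((chrom, w_start))
                if i is not None:
                    matrix[i][j] = 1
    return windows, families, matrix


# Toy input (hypothetical): one 1000 nt chromosome and two families.
chrom_sizes = {"Chr1": 1000}
peaks_by_family = {
    "family_A": [("Chr1", 600, 700)],
    "family_B": [("Chr1", 200, 300)],  # spans the boundary at position 250
}
windows, families, matrix = build_overlap_matrix(peaks_by_family, chrom_sizes)
for (chrom, start, end), row in zip(windows, matrix):
    print(chrom, start, end, row)
```

Each row is the multi-label target for one window, which is the form a multi-label classifier would train on. The real script may apply a minimum-overlap threshold or other filters that this sketch omits.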

Training CNN models: Multi-label classifiers

Once the above steps are completed, you can train the models using the "train.py" script: python train.py. This script trains the models together with two kinds of control models: di-nucleotide shuffled (di-control) and single-nucleotide shuffled (si-control).
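The two shuffled controls can be pictured with the simplified sketch below. It is an illustration only and is not taken from train.py: the function names are hypothetical, and the di-nucleotide variant shuffles non-overlapping 2-mer blocks, which is a simpler stand-in for an exact di-nucleotide-frequency-preserving shuffle.

```python
import random


def single_nucleotide_shuffle(seq, rng):
    """Permute individual bases: keeps base composition, destroys all order."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def dinucleotide_block_shuffle(seq, rng):
    """Permute non-overlapping 2-mer blocks.

    Assumption: this keeps the 2-mers of the chosen blocks intact, but it does
    not preserve every overlapping di-nucleotide count of the original sequence.
    """
    blocks = [seq[i:i + 2] for i in range(0, len(seq), 2)]
    rng.shuffle(blocks)
    return "".join(blocks)


rng = random.Random(0)  # fixed seed so the example is reproducible
seq = "ACGTTGCAAT"
si_control = single_nucleotide_shuffle(seq, rng)
di_control = dinucleotide_block_shuffle(seq, rng)
print(seq, si_control, di_control)
```

A model trained on such shuffled sequences should perform near chance if the real models are learning sequence motifs rather than base composition alone.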

Model interpretation with SHAP

Interpreting our models with SHAP takes two main steps. The first step computes importance scores using the DeepSHAP/DeepExplain implementation and uses MoDisco to generate motifs. The second step extracts these motifs and saves them in an easier-to-use file for downstream R scripts that use, for example, MotifStack.

Run the "generate_predictive_motifs.py" script: python generate_predictive_motifs.py. This will generate motifs for each family using SHAP and MoDisco.
Run the "extract_motifs.py" script: python extract_motifs.py.

Predictions-Mercator4 enrichment analysis

To perform the Mercator4-predictions enrichment analysis, follow these steps.

Using "Arabidopsis_thaliana.TAIR10.pep.all.fa" found within the data/proteome subdirectory, compute Mercator4 functional annotations using Mercator4 ( https://www.plabipd.de/mercator_main.html ). Download the output and save it in the data directory as mercator4_output.txt.
Run the "cluster_genes_by_prediction.py" script: python cluster_genes_by_prediction.py. This script will create the regulatory modules.
Run the "mercator4_cross_promoter_clusters_enrichment_analysis.py" script: python mercator4_cross_promoter_clusters_enrichment_analysis.py. This will test the modules for enrichment in Mercator4 functional groups.
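As a rough picture of what testing a module for enrichment in a functional group can involve, the sketch below computes a one-sided hypergeometric p-value in pure Python. The README does not state which statistical test the script uses, so the choice of test, the function names, and the toy gene identifiers are all assumptions.

```python
from math import comb


def hypergeom_upper_tail(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N genes, K annotated, n drawn)."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total


def enrichment_p(background, annotated, module):
    """p-value for over-representation of `annotated` genes within `module`."""
    annotated = annotated & background
    module = module & background
    return hypergeom_upper_tail(
        len(background), len(annotated), len(module), len(annotated & module)
    )


# Toy data (hypothetical gene IDs): 10 background genes, 5 of them annotated
# to one functional group, and a module that contains exactly those 5 genes.
background = {f"g{i}" for i in range(10)}
annotated = {"g0", "g1", "g2", "g3", "g4"}
module = {"g0", "g1", "g2", "g3", "g4"}
p_value = enrichment_p(background, annotated, module)
print(p_value)
```

When many modules and functional groups are tested, the resulting p-values would normally need a multiple-testing correction, which is left out here.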

Effects of SNPs on binding profiles

We used SNPs from the AraGWAS catalog to perform this analysis. We have also uploaded the SNPs that we downloaded and used.

Run the "SNP_effects_on_predictions.py" script: python SNP_effects_on_predictions.py.
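The general idea of scoring a SNP against predicted binding profiles can be sketched as follows: predict on the reference sequence, substitute the alternative allele, predict again, and take the per-family difference. This is not the logic of SNP_effects_on_predictions.py. The helper names are hypothetical, and toy_predict is a motif-matching stand-in for the trained CNN.

```python
def apply_snp(seq, pos, ref, alt):
    """Return seq with the base at 0-based position pos changed from ref to alt."""
    if seq[pos] != ref:
        raise ValueError(f"expected {ref!r} at position {pos}, found {seq[pos]!r}")
    return seq[:pos] + alt + seq[pos + 1:]


def snp_effect(predict, seq, pos, ref, alt):
    """Per-family change in predicted binding (alternative minus reference)."""
    ref_scores = predict(seq)
    alt_scores = predict(apply_snp(seq, pos, ref, alt))
    return [a - r for r, a in zip(ref_scores, alt_scores)]


def toy_predict(seq):
    """Stand-in model (assumption): one score per 'family', 1.0 if a motif occurs."""
    motifs = ["TTGAC", "CACGTG"]
    return [1.0 if motif in seq else 0.0 for motif in motifs]


sequence = "AATTGACCAA"
# A G>A change at position 4 disrupts the first toy motif.
delta = snp_effect(toy_predict, sequence, 4, "G", "A")
print(delta)
```

With a real model, predict would return one probability per TF family for the window containing the SNP, and large absolute differences would flag variants predicted to gain or lose binding.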

Transfer learning and heat stress classification analysis in Zea mays

We used heat stress MOA-seq data from Liang et al., 2022. This analysis compares peaks that showed a positive fold change with those that showed a negative fold change. Specifically, it takes the genomic regions where recorded MOA-seq footprints increase or decrease and compares the two groups using the binding profiles predicted by our CNN models trained on Arabidopsis thaliana.

Run the "stress_binding_classification.py" script: python stress_binding_classification.py.

