Research and Development Code for Ultrasound Sonication Induced DNA Breakpoints in Sequencing
Clone the project:
git clone https://github.com/SahakyanLab/DNAFragility_dev.git
Please follow the instructions below on how to acquire the public datasets, setup the directory stucture, and software necessary to run all the studies from the publication. At the end of this README
file, you can find two separate bash script commands that runs the majority of the setup and runs the calculations sequentially.
The resource-demanding computations were performed on a single NVIDIA RTX A6000 GPU with 40GB RAM. The developed workflows and analyses employed the R programming language 4.3.2.
Please run the below script to install the latest versions of the R packages necessary to perform the calculations and analyses.
bash ./setup/install_packages.sh
Please also download and install the below software.
- Please clone the repo from this link (Edlib >= 1.2.7). Place the edlib.h and edlib.cpp into 02_Alignment/lib/edlib folder.
Please download the kseq.h file from this link. Place this into 03_FitCurves/lib folder.
- Please clone the repo from this link. Place the contents of gtl into 03_FitCurves/lib folder.
Please note, if you are using Ubuntu, you may have trouble installing the factoextra R package. However, the below steps has worked for us.
- sudo apt-get update
- sudo apt-get install r-cran-car
- install.packages("factoextra")
This will be needed for the 04_KmericAnalysis folder.
Please download the Cytoscape
desktop application for your operating system from their website. This will be needed for the 04_KmericAnalysis folder.
You also need to install clustermaker2
, enrichmentmap
, and autoannotate
apps within Cytoscape
.
25 ultrasonication-induced breakage datasets were retrieved from NCBI’s Sequence Read Archive (SRA) Run Selector for the Bioproject PRJEB9586 with the SRAs as outlined below.
For each dataset, we used the first run. On the SRA website, we clicked Alignment
, removed any range from Set Range
, and selected the whole FASTA
run. To download the chromosome-separated files, we selected the numeric chromosome number in Choose Reference
. We repeated this for chromosomes 1
to 22
. Then, we click File
to download the whole FASTA
run.
- ERX1104497 (Simons_exp_1)
- ERX1104508 (Simons_exp_2)
- ERX1104516 (Simons_exp_3)
- ERX1104519 (Simons_exp_4)
- ERX1104521 (Simons_exp_5)
- ERX1104529 (Simons_exp_6)
- ERX1097982 (Simons_exp_7)
- ERX1097983 (Simons_exp_8)
- ERX1097984 (Simons_exp_9)
- ERX1097986 (Simons_exp_10)
- ERX1097988 (Simons_exp_11)
- ERX1097992 (Simons_exp_12)
- ERX1097994 (Simons_exp_13)
- ERX1097998 (Simons_exp_14)
- ERX1098018 (Simons_exp_15)
- ERX1098019 (Simons_exp_16)
- ERX1104476 (Simons_exp_17)
- ERX1104481 (Simons_exp_18)
- ERX1104486 (Simons_exp_19)
- ERX1104491 (Simons_exp_20)
- ERX1104498 (Simons_exp_21)
- ERX1104512 (Simons_exp_22)
- ERX1104523 (Simons_exp_23)
- ERX1104527 (Simons_exp_24)
- ERX1104532 (Simons_exp_25)
Place each of the 22 chromosome-separated FASTA files into the respective folders as indicated inside the brackets. All chromosome fasta file has to have the name structure chr$num.fasta.gz
where $num
is any number between 1
and 22
. The downloaded files will automatically be compressed as a .gz
file. Do not uncompress this as we do it automatically when processing the files.
Two nebulization-induced DNA breakage datasets were obtained from the below data accession codes from NCBI's NCBI’s Sequence Read Archive (SRA) Run Selector for the Bioproject PRJNA33851.
Place each of the 22 chromosome-separated FASTA files into the respective folders as indicated inside the brackets. Please follow the same downloading process as from the Ultrasonication DNA breakages section
Five natural decay and fossilisation datasets were retrieved from the Max Planck Institute for Evolutionary Anthropology. These files will be automatically downloaded in the appropriate folders when you run the bash script at the end of this README
file.
bash run_all_setup_files.sh
16 datasets of cell-free DNA fragments coming from individual human peripheral blood plasma were obtained from NCBI’s SRA Run Selector for the Bioproject PRJNA291063 with the SRAs as outlined below.
- SRX1120757 (Healthy)
- SRX1120768 (Breast_cancer_Invasive_infiltratingductal)
- SRX1120766 (Ovarian_cancer)
- SRX1120767 (Skin_cancer_Melanoma)
- SRX1120769 (Lung_cancer_Adenocarcinoma)
- SRX1120771 (Uterine_cancer)
- SRX1120774 (Colorectal_cancer)
- SRX1120776 (Prostate_cancer)
- SRX1120777 (Head_and_neck_cancer)
- SRX1120780 (Liver_cancer_Hepatocellular_carcinoma)
- SRX1120779 (Bladder_cancer)
- SRX1120781 (Kidney_cancer_Clear_cell)
- SRX1120782 (Testicular_cancer_Seminomatous)
- SRX1120784 (Pancreatic_cancer_Ductal_adenocarcinoma)
- SRX1120793 (Esophageal_cancer)
- SRX1120758 (Crohns_disease)
- SRX1120760 (Ulcerative_Colitis)
- SRX1120762 (Systemic_Lupus_Erythematosus)
Place each of the 22 chromosome-separated FASTA files into the respective folders as indicated inside the brackets. Please follow the same downloading process as from the Ultrasonication DNA breakages section
46 physiological breakage datasets were retrieved from various tissues and cell lines.
- GSE115623 (BrITL/MDA-MB-231/APH-ATR-inhib)
- GSE115623 (BrITL/MDA-MB-231/DMSO)
- GSE78172 (DSBCapture/NHEK_DSBs)
- GSE167257 SARseq/iNeuron
- GSE136943 (CCseq/RPE-1_Top2-linked_DNA_DSBs/WG13-WG21_RPE1_WT_G1_UT)
- GSE136943 (CCseq/RPE-1_Top2-linked_DNA_DSBs/WG14-WG22_RPE1_WT_G1_VP16)
- GSE136943 (CCseq/RPE-1_Top2-linked_DNA_DSBs/WG17-WG25A_RPE1_T2B_G1_UT)
- GSE136943 (CCseq/RPE-1_Top2-linked_DNA_DSBs/WG18-WG26_RPE1_T2B_G1_VP16)
- GSE121742 (sBLISS/CD34_Top2_mediated_DSBs/DMSO_rep1)
- GSE121742 (sBLISS/CD34_Top2_mediated_DSBs/DMSO_rep2)
- GSE121742 (sBLISS/CD34_Top2_mediated_DSBs/ETO_rep1)
- GSE121742 (sBLISS/CD34_Top2_mediated_DSBs/ETO_rep2)
- GSE121742 (sBLISS/K562_Top2_mediated_DSBs/ETO)
- GSE121742 (sBLISS/K562_Top2_mediated_DSBs/DMSO)
- GSE121742 (sBLISS/TK6_Top2_mediated_DSBs/DMSO_rep1)
- GSE121742 (sBLISS/TK6_Top2_mediated_DSBs/DMSO_rep2)
- GSE121742 (sBLISS/TK6_Top2_mediated_DSBs/ETO_rep1)
- GSE121742 (sBLISS/TK6_Top2_mediated_DSBs/ETO_rep2)
- GSE145594 (sBLISS/Colibactin_Ecoli_induced_DSBs/Caco-2_etoposide_rep1)
- GSE145594 (sBLISS/Colibactin_Ecoli_induced_DSBs/Caco-2_etoposide_rep2)
- GSE145594 (sBLISS/Colibactin_Ecoli_induced_DSBs/Caco-2_etoposide_rep3)
- GSE145594 (sBLISS/Colibactin_Ecoli_induced_DSBs/Caco-2_etoposide_rep4)
- GSE145594 (sBLISS/Colibactin_Ecoli_induced_DSBs/Caco-2_pks_minus_Ecoli_rep1)
- GSE145594 (sBLISS/Colibactin_Ecoli_induced_DSBs/Caco-2_pks_minus_Ecoli_rep2)
- GSE145594 (sBLISS/Colibactin_Ecoli_induced_DSBs/Caco-2_pks_minus_Ecoli_rep3)
- GSE145594 (sBLISS/Colibactin_Ecoli_induced_DSBs/Caco-2_pks_minus_Ecoli_rep4)
- GSE145594 (sBLISS/Colibactin_Ecoli_induced_DSBs/Caco-2_pks_plus_Ecoli_rep1)
- GSE145594 (sBLISS/Colibactin_Ecoli_induced_DSBs/Caco-2_pks_plus_Ecoli_rep2)
- GSE145594 (sBLISS/Colibactin_Ecoli_induced_DSBs/Caco-2_pks_plus_Ecoli_rep4)
For the GEO accession code GSE139011, download the tar file. For each of G5
, Imatinib
, SN
, and Talazoparib
, unpack and extract the two biological replicates for each of 6h
and 48h
treatment. Place them into their folder structures as follows:
- data/00_Breakage/SSB/G5/6h and data/00_Breakage/SSB/G5/48h
- data/00_Breakage/SSB/Imatinib/6h and data/00_Breakage/SSB/Imatinib/48h
- data/00_Breakage/SSB/SN/6h and data/00_Breakage/SSB/SN/48h
- data/00_Breakage/SSB/Talazoparib/6h and data/00_Breakage/SSB/Talazoparib/48h
For each folder, concatenate the files and retain distinct DNA strand break positions. To do this, please run the below file located in data/00_Breakage/SSB/
Rscript unpack_files.R
For the WRN-associated DNA breakages, download Source Data Fig. 1
from this paper. In the 1e
subsheet, you will find the below datasets. Copy and paste the Chromosome
, Start
, and Stop
header for each and create a .csv
. Then, rename Stop
to End
. Place them into their folder structures as follows:
- ENDseq peaks in HCT116 cells (siWRN) header into data/00_Breakage/ENDseq/WRN_loss_DSBs/HCT116_siWRN as
HCT116_siWRN.csv
- ENDseq peaks in HCT116 cells (shWRN) header into data/00_Breakage/ENDseq/WRN_loss_DSBs/HCT116_shWRN as
HCT116_shWRN.csv
- ENDseq peaks in KM12 cells (siWRN) header into data/00_Breakage/ENDseq/WRN_loss_DSBs/KM12_siWRN as
KM12_siWRN.csv
- ENDseq peaks in KM12 cells (shWRN) header into data/00_Breakage/ENDseq/WRN_loss_DSBs/KM12_shWRN as
KM12_shWRN.csv
For the remaining WRN-associated DNA breakages, download Source Data Fig. 3
from this paper. In the 3c
subsheet, you will find the below datasets. Copy and paste the Chromosome
, Start
, and Stop
header for each and create a .csv
. Then, rename Stop
to End
. Place them into their folder structures as follows:
- ENDseq peaks with APH+ATRi treatment (HCT116) header into data/00_Breakage/ENDseq/WRN_loss_DSBs/HCT116_APH-ATR-inhib as
HCT116_APH-ATR-inhib.csv
- ENDseq peaks with APH+ATRi treatment (KM12) header into data/00_Breakage/ENDseq/WRN_loss_DSBs/KM12_APH-ATR-inhib as
KM12_APH-ATR-inhib.csv
For apoptotic DNA breakages, download Table S2 from the supporting information. The tar file contains the excel file Table_S2-Apoptotic_Breakpoints.xlsx. Create a new AHH001.csv
file, and copy over the Chromosome
, Start
, and End
columns from the Ahh001
subsheet. Save this into data/00_Breakage/Apoptoseq/Apoptotic_DSBs/HL-60-leukemic/AHH001/ folder. Repeat the same for AHH002.csv
into data/00_Breakage/Apoptoseq/Apoptotic_DSBs/HL-60-leukemic/AHH002/ folder.
For the human recombination map, clone this GitHub repo into data/00_Breakage/Recombination_rates/. Then, change all file names from genetic_map_GRCh37_chr$num.txt.gz
to chr$num.txt.gz
, where $num
is any number from 1
to 22
for the autosomes, and uncompress the files.
- GSE149709 (EcoRV_HeLa_cells)
- SRX7808529 (Twist_library/C25cl48-cells). For this dataset only, please follow the same downloading process as from the Ultrasonication DNA breakages section.
For the GEO accession code GSE149709, download the GSE78172_AsiSI.tar.gz file. Unpack and uncompress the AsiSI_restriction_sites.single.bed
file and move to data/00_Breakage/Enzymatic/AsiSI_AID-DIvA_cells/. Then, inside this folder, run the below R script to process the enzymatic breaks.
Rscript Process.R
For the GEO accession code GSE139011, download the tar file. There are 6 files of interest as outlined below.
- Nt_BbvCI_K562_cells/Sap Uncompress the
GSM4126200_Nt.BbvCI.Sap.B$num.without.polyA.untailing.bed.gz
files, where$num
is a number between1
and3
. Combine all 3 files with - Nt_BbvCI_K562_cells/NT For the untreated sample, Uncompress the
GSM4126200_Nt.BbvCI.B$num.without.polyA.untailing.bed.gz
files, where$num
is a number between1
and3
.
Place them into their respective folders, where .Sap
files go into data/00_Breakage/Enzymatic/Nt_BbvCI_K562_cells/Sap/ and the other 3 files go to data/00_Breakage/Enzymatic/Nt_BbvCI_K562_cells/NT/. When done, go one level up data/00_Breakage/Enzymatic/Nt_BbvCI_K562_cells/, and run the below bash script to process the enzymatic breaks.
bash process_files.sh
We retrieved 247 core-validated vertebrate transcript factor binding sites (TFBS) from the JASPAR 2024 database.
- Download the vertebrate dataset from here
- Download the bed files from here as "jaspar_beds"
Unpack and extract the relevant files from above. Place the contents into data/TFBS/ folder.
Alternatively, run the below Rscript or run it as part the setup file at the end of this README
file.
Rscript jaspar_download.R
The quantum mechanical hexameric parameters denergy.txt.gz
can be downloaded from DNAkmerQM. Uncompress the file and place it into data/04_QM_parameters/6mer/Raw/denergy.txt folder.
ATACseq datasets
FAIREseq datasets
DNaseseq datasets
- ENCFF579UXQ (DNaseseq/Caco2)
- ENCFF240LRP (DNaseseq/HCT116)
- ENCFF024UJN (DNaseseq/HeLa_S3)
- ENCFF773SFA (DNaseseq/HL60)
- Hypersensitive clusters (DNaseseq/HyperSensitive_Clusters)
- ENCFF274YGF (DNaseseq/K562)
- ENCFF438LQM (DNaseseq/MCF7)
Chipseq datasets
Histone
- GSM5501177 (Chipseq/Histone/H3K4me2_DMSO_GBM_tumor_initiating_cell_line)
- GSM5501178 (Chipseq/Histone/H3K4me2_LSD1-inhib_GBM_tumor_initiating_cell_line)
- GSM5501179 (Chipseq/Histone/H3K4me3_DMSO_GBM_tumor_initiating_cell_line)
- GSM5501180 (Chipseq/Histone/H3K4me3_LSD1-inhib_GBM_tumor_initiating_cell_line)
- GSM5501175 (Chipseq/Histone/H3K4me1_DMSO_GBM_tumor_initiating_cell_line)
- GSM5501176 (Chipseq/Histone/H3K4me1_LSD1-inhib_GBM_tumor_initiating_cell_line)
- GSM1035424 (Chipseq/SUMO1/Proliferative_fibro_WI38_cells)
- GSM1035433 (Chipseq/SUMO1/Ras-induced-senescent_fibro_WI38_cells)
- GSM1035426 (Chipseq/SUMO2/Proliferative_fibro_WI38_cells)
- GSM1035435 (Chipseq/SUMO2/Ras-induced-senescent_fibro_WI38_cells)
- GSM1035427 (Chipseq/Ubc9/Proliferative_fibro_WI38_cells)
- GSM1035436 (Chipseq/Ubc9/Ras-induced-senescent_fibro_WI38_cells)
- GSM1035441 (Chipseq/PIASy/Proliferative_fibro_WI38_cells)
- GSM1035442 (Chipseq/PIASy/Ras-induced-senescent_fibro_WI38_cells)
Download the two Nhek H3k4me3 replicas from the GEO accession number GSM945175 rep1 and rep2. Concatenate the two files, retain unique ranges, and keep one file in Chipseq/Histone/H3K4me3_NHEK_cells folder.
The following 9 histone marks were obtained from GEO accession GSE29611 with individual files already saved in the appropriate file structure, as also used in the DSBCapture study, including: H2AZ_NHEK_cells, EZH2_NHEK_cells, H3K27ac_NHEK_cells, H3K4me2_NHEK_cells, H3K27me3_NHEK_cells, H3K79me2_NHEK_cells, H3K4me1_NHEK_cells, H3K9me3_NHEK_cells, POL2B.
Please download the ENCODE Blacklist regions for hg19 from here and for hg38 from here. Uncompress them, and name them hg19.bed
and hg38.bed
, respectively.
Place the contents into data/blacklists/.
We retrieved the hg19-annotated Hi-C subcompartment data of the K562 and HeLa cell lines from here.
Place the contents into data/HiC_annotations/.
We processed all datasets in the reference genome version used as per the deposition. For the TFBS, we lifted them over from hg38 to hg19.
Unpack and extract the relevant files. Place the contents into data/liftover/ folder.
- All cpp files are interfaced via the Rcpp library in R with
omp.h
when possible. Please ensure you have this installed. RcppArmadillo.h
andRcppEigen.h
are necessary for the feature extraction process. Please ensure you have this installed. By default, will not use it in case you have not installed it.
If you wish to run all setups, including all the aforementioned bash scripts, please run the below bash script.
bash run_all_setup_files.sh
Please note that many of the calculations were computationally intensive. Most things were run in parallel in smaller batches. However, if you submit the below bash script, it runs all scripts sequentially. This can take several months to complete. Most tasks take up several tens to hundreds of GBs worth of RAM. The entire study requires between 2-4 TB of hard drive space.
You may need to monitor your memory usage, memory cache, and swap to ensure calculations run smoothly.
bash run_dnafragility_dev.sh