DNAFragility_dev

Research and Development Code for Ultrasound Sonication Induced DNA Breakpoints in Sequencing

Setup

Clone the project:

git clone https://github.com/SahakyanLab/DNAFragility_dev.git

Please follow the instructions below on how to acquire the public datasets, set up the directory structure, and install the software necessary to run all the studies from the publication. At the end of this README file, you can find two separate bash script commands: one runs the majority of the setup, and the other runs the calculations sequentially.

1. Software requirements

The resource-demanding computations were performed on a single NVIDIA RTX A6000 GPU with 40 GB of RAM. The workflows and analyses were developed in R version 4.3.2.

Please run the below script to install the latest versions of the R packages necessary to perform the calculations and analyses.

bash ./setup/install_packages.sh

Please also download and install the below software.

Edlib

kseq.h

Please download the kseq.h file from this link. Place it into the 03_FitCurves/lib folder.

phmap.hpp via gtl

factoextra

Please note that if you are using Ubuntu, you may have trouble installing the factoextra R package. However, the steps below have worked for us.

  1. sudo apt-get update
  2. sudo apt-get install r-cran-car
  3. install.packages("factoextra")

This will be needed for the 04_KmericAnalysis folder.

Cytoscape desktop app

Please download the Cytoscape desktop application for your operating system from their website. This will be needed for the 04_KmericAnalysis folder.

You also need to install clustermaker2, enrichmentmap, and autoannotate apps within Cytoscape.

2. Public files to download

Ultrasonication DNA breakages

25 ultrasonication-induced breakage datasets were retrieved from NCBI’s Sequence Read Archive (SRA) Run Selector for the Bioproject PRJEB9586 with the SRAs as outlined below.

For each dataset, we used the first run. On the SRA website, we clicked Alignment, removed any range from Set Range, and selected the whole FASTA run. To download the chromosome-separated files, we selected the numeric chromosome number in Choose Reference and then clicked File to download the whole FASTA run. We repeated this for chromosomes 1 to 22.

Place each of the 22 chromosome-separated FASTA files into the respective folders as indicated inside the brackets. Every chromosome FASTA file must follow the naming structure chr$num.fasta.gz, where $num is a number between 1 and 22. The downloaded files are automatically compressed as .gz files. Do not uncompress them, as this is done automatically when the files are processed.
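As a quick sanity check of the naming convention above, a small shell function can report which of the 22 expected files are missing from a dataset folder. This is a minimal sketch: the function name `check_chr_files` is hypothetical and not part of the repository, and the folder passed to it is whichever dataset folder you populated.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): reports which of the
# 22 expected chr$num.fasta.gz files are missing from a dataset folder.
check_chr_files() {
    local dir="$1" missing=0 num
    for num in $(seq 1 22); do
        if [ ! -f "${dir}/chr${num}.fasta.gz" ]; then
            echo "missing: ${dir}/chr${num}.fasta.gz"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when all 22 files are present.
    [ "$missing" -eq 0 ]
}
```

Calling `check_chr_files path/to/dataset_folder` prints one line per missing file and returns a non-zero exit status if any file is absent.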

Nebulization DNA breakages

Two nebulization-induced DNA breakage datasets were obtained using the data accession codes below from NCBI’s Sequence Read Archive (SRA) Run Selector for the Bioproject PRJNA33851.

Place each of the 22 chromosome-separated FASTA files into the respective folders as indicated inside the brackets. Please follow the same downloading process as in the Ultrasonication DNA breakages section.

Natural decay DNA breakages

Five natural decay and fossilisation datasets were retrieved from the Max Planck Institute for Evolutionary Anthropology. These files are downloaded automatically into the appropriate folders when you run the bash script at the end of this README file.

bash run_all_setup_files.sh

Cell-free DNA breakages

16 datasets of cell-free DNA fragments coming from individual human peripheral blood plasma were obtained from NCBI’s SRA Run Selector for the Bioproject PRJNA291063 with the SRAs as outlined below.

Place each of the 22 chromosome-separated FASTA files into the respective folders as indicated inside the brackets. Please follow the same downloading process as in the Ultrasonication DNA breakages section.

Physiological DNA breakages

46 physiological breakage datasets were retrieved from various tissues and cell lines.

For the GEO accession code GSE139011, download the tar file. For each of G5, Imatinib, SN, and Talazoparib, unpack and extract the two biological replicates for each of 6h and 48h treatment. Place them into their folder structures as follows:

For each folder, concatenate the files and retain distinct DNA strand break positions. To do this, please run the script below, located in data/00_Breakage/SSB/:

Rscript unpack_files.R
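The repository performs this step in unpack_files.R. Purely to illustrate the idea of concatenating replicates and keeping distinct positions, here is a minimal shell sketch. It assumes uncompressed, header-free text files with one break position per line; the function name `merge_distinct` and the file names in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical illustration only; the repository does this in unpack_files.R.
# Usage: merge_distinct OUTPUT INPUT [INPUT ...]
merge_distinct() {
    local out="$1"
    shift
    # Concatenate all inputs, then sort and drop exact duplicate lines.
    cat "$@" | LC_ALL=C sort -u > "$out"
}
```

For example, `merge_distinct merged.bed rep1.bed rep2.bed` writes every distinct line from the two replicates to merged.bed. Lines that differ in any column are treated as distinct.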

For the WRN-associated DNA breakages, download Source Data Fig. 1 from this paper. In the 1e subsheet, you will find the below datasets. For each dataset, copy the Chromosome, Start, and Stop columns into a new .csv file. Then, rename the Stop column to End. Place them into their folder structures as follows:

For the remaining WRN-associated DNA breakages, download Source Data Fig. 3 from this paper. In the 3c subsheet, you will find the below datasets. For each dataset, copy the Chromosome, Start, and Stop columns into a new .csv file. Then, rename the Stop column to End. Place them into their folder structures as follows:

For apoptotic DNA breakages, download Table S2 from the supporting information. The tar file contains the Excel file Table_S2-Apoptotic_Breakpoints.xlsx. Create a new AHH001.csv file, and copy over the Chromosome, Start, and End columns from the Ahh001 subsheet. Save it into the data/00_Breakage/Apoptoseq/Apoptotic_DSBs/HL-60-leukemic/AHH001/ folder. Repeat the same for AHH002.csv, saving it into the data/00_Breakage/Apoptoseq/Apoptotic_DSBs/HL-60-leukemic/AHH002/ folder.
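For the WRN-associated .csv files described earlier, the Stop to End rename can also be done from the command line instead of a spreadsheet editor. The sketch below assumes the header is the first line of the file and contains the literal column name Stop; the function name `rename_stop_to_end` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: renames the "Stop" column to "End" in a CSV header.
rename_stop_to_end() {
    local csv="$1" tmp
    tmp="$(mktemp)"
    # Rewrite only the first line (the header); data rows are copied unchanged.
    sed '1s/Stop/End/' "$csv" > "$tmp" && mv "$tmp" "$csv"
}
```

Running `rename_stop_to_end my_dataset.csv` edits the file in place, so keep a copy of the original if you want to compare the result.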

For the human recombination map, clone this GitHub repo into data/00_Breakage/Recombination_rates/. Then, change all file names from genetic_map_GRCh37_chr$num.txt.gz to chr$num.txt.gz, where $num is any number from 1 to 22 for the autosomes, and uncompress the files.
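The renaming and uncompressing described above can be scripted. This is a minimal sketch under the assumption that the genetic_map_GRCh37_chr$num.txt.gz files sit directly in the folder you pass in; depending on how the repo was cloned, they may be in a subfolder. The function name `rename_genetic_maps` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: renames genetic_map_GRCh37_chr$num.txt.gz to
# chr$num.txt.gz for the autosomes and uncompresses each file.
rename_genetic_maps() {
    local dir="$1" num src
    for num in $(seq 1 22); do
        src="${dir}/genetic_map_GRCh37_chr${num}.txt.gz"
        # Skip chromosomes whose file is not present.
        [ -f "$src" ] || continue
        mv "$src" "${dir}/chr${num}.txt.gz"
        # Produces chr$num.txt and removes the .gz file.
        gunzip -f "${dir}/chr${num}.txt.gz"
    done
}
```

For example, `rename_genetic_maps data/00_Breakage/Recombination_rates` leaves chr1.txt to chr22.txt in that folder.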

Enzymatic DNA breakages

For the GEO accession code GSE149709, download the GSE78172_AsiSI.tar.gz file. Unpack and uncompress the AsiSI_restriction_sites.single.bed file and move to data/00_Breakage/Enzymatic/AsiSI_AID-DIvA_cells/. Then, inside this folder, run the below R script to process the enzymatic breaks.

Rscript Process.R

For the GEO accession code GSE139011, download the tar file. There are 6 files of interest as outlined below.

  • Nt_BbvCI_K562_cells/Sap: Uncompress the GSM4126200_Nt.BbvCI.Sap.B$num.without.polyA.untailing.bed.gz files, where $num is a number between 1 and 3. Combine all 3 files with
  • Nt_BbvCI_K562_cells/NT: For the untreated sample, uncompress the GSM4126200_Nt.BbvCI.B$num.without.polyA.untailing.bed.gz files, where $num is a number between 1 and 3.

Place them into their respective folders: the .Sap files go into data/00_Breakage/Enzymatic/Nt_BbvCI_K562_cells/Sap/ and the other 3 files go into data/00_Breakage/Enzymatic/Nt_BbvCI_K562_cells/NT/. When done, go one level up to data/00_Breakage/Enzymatic/Nt_BbvCI_K562_cells/ and run the below bash script to process the enzymatic breaks.

bash process_files.sh

Transcription Factor data

We retrieved 247 core-validated vertebrate transcription factor binding sites (TFBS) from the JASPAR 2024 database.

Unpack and extract the relevant files from above. Place the contents into data/TFBS/ folder.

Alternatively, run the below R script on its own, or run it as part of the setup file at the end of this README file.

Rscript jaspar_download.R

Quantum mechanical k-mer parameters

The quantum mechanical hexameric parameters denergy.txt.gz can be downloaded from DNAkmerQM. Uncompress the file and place it at data/04_QM_parameters/6mer/Raw/denergy.txt.

Epigenome marks

ATACseq datasets

FAIREseq datasets

DNaseseq datasets

Chipseq datasets

Histone

Download the two NHEK H3K4me3 replicates, rep1 and rep2, from the GEO accession number GSM945175. Concatenate the two files, retain unique ranges, and keep the single resulting file in the Chipseq/Histone/H3K4me3_NHEK_cells folder.

The following 9 histone marks were obtained from GEO accession GSE29611 with individual files already saved in the appropriate file structure, as also used in the DSBCapture study, including: H2AZ_NHEK_cells, EZH2_NHEK_cells, H3K27ac_NHEK_cells, H3K4me2_NHEK_cells, H3K27me3_NHEK_cells, H3K79me2_NHEK_cells, H3K4me1_NHEK_cells, H3K9me3_NHEK_cells, POL2B.

ENCODE Blacklist regions

Please download the ENCODE Blacklist regions for hg19 from here and for hg38 from here. Uncompress them, and name them hg19.bed and hg38.bed, respectively.

Place the contents into data/blacklists/.
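The uncompress, rename, and place steps above can be combined into one command per genome. This is a minimal sketch: the function name `install_blacklist` is hypothetical, and the name of the downloaded .gz file depends on the download source, so it is passed in as an argument.

```shell
#!/usr/bin/env bash
# Hypothetical helper: uncompresses a downloaded blacklist and stores it
# as <genome>.bed in the destination folder (default: data/blacklists).
# Usage: install_blacklist DOWNLOADED.gz GENOME [DEST_DIR]
install_blacklist() {
    local src="$1" genome="$2" dest="${3:-data/blacklists}"
    mkdir -p "$dest"
    # gunzip -c writes to stdout, so the downloaded file is left untouched.
    gunzip -c "$src" > "${dest}/${genome}.bed"
}
```

For example, `install_blacklist path/to/downloaded_hg19_file.gz hg19` creates data/blacklists/hg19.bed relative to the current directory.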

Hi-C annotations

We retrieved the hg19-annotated Hi-C subcompartment data of the K562 and HeLa cell lines from here.

Place the contents into data/HiC_annotations/.

Liftover files

We processed all datasets in the reference genome version used as per the deposition. For the TFBS, we lifted them over from hg38 to hg19.

Unpack and extract the relevant files. Place the contents into data/liftover/ folder.

Other notes

  • All cpp files are interfaced via the Rcpp library in R, with omp.h where possible. Please ensure you have these installed.
  • RcppArmadillo.h and RcppEigen.h are necessary for the feature extraction process. Please ensure you have them installed. By default, they are not used, in case you have not installed them.

Run all setup files

If you wish to run all setups, including all the aforementioned bash scripts, please run the below bash script.

bash run_all_setup_files.sh

Run the full DNAFragility_dev study

Please note that many of the calculations are computationally intensive. We ran most of them in parallel in smaller batches. However, the below bash script runs all scripts sequentially, which can take several months to complete. Most tasks use tens to hundreds of GB of RAM, and the entire study requires between 2 and 4 TB of hard drive space.

You may need to monitor your memory usage, memory cache, and swap to ensure calculations run smoothly.

bash run_dnafragility_dev.sh
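To keep an eye on memory and swap while the script above runs, one option on Linux is to read /proc/meminfo periodically. This is a minimal sketch: the function name `report_memory` is hypothetical, and it assumes a Linux system where /proc/meminfo reports the MemAvailable and SwapFree fields in kB.

```shell
#!/usr/bin/env bash
# Hypothetical helper (Linux only): prints available memory and free swap
# in GiB, read from a /proc/meminfo-style file.
report_memory() {
    local meminfo="${1:-/proc/meminfo}"
    # Values in /proc/meminfo are in kB; 1048576 kB make up 1 GiB.
    awk '/^MemAvailable:/ {printf "MemAvailable: %.1f GiB\n", $2 / 1048576}
         /^SwapFree:/     {printf "SwapFree: %.1f GiB\n", $2 / 1048576}' "$meminfo"
}
```

Calling `report_memory` in a second terminal, for instance inside a loop with `sleep 60`, gives a rough picture of whether a task is approaching the machine's limits. Standard tools such as `free -h` report similar information.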