Pipeline to simulate vagrant DNA insertion in to nuclear genomes.
- Clone the repository
- Set up a conda environment:
conda create -n vagrantSim -c bioconda -c conda-forge bwa-mem2 "r-base>=4" r-matrix wgsim samtools freebayes scikit-allel zarr
- Activate conda environment:
conda activate vagrantSim
orsource activate vagrantSim
- Run pipeline, for instance:
bash 00pipeline.sh 250M000 250000000 0.00004
. The parameters are: [outname] [nuclear genome size] [insertion rate per year]
The most imortant output is [outname].log
, a three-line text file. Example:
0.001274594 # Observed proportion of vagrant DNA inserts in this simulation run
0.001496896 0.001103315 0.002030592 0.001515834 # intercept estimate, lower 95%-CI, upper 95%-CI, mapping-depth estimate
YES, contained # Whether or not the actual value is within the CI
Multiple of these output files can be summarised doing grep contained *log > grepOut
. Here is an example (also see folder exampleOut
):
001G000.log:YES, contained
001G001.log:YES, contained
001G002.log:YES, contained
001G003.log:YES, contained
001G004.log:YES, contained
001G006.log:YES, contained
001G007.log:YES, contained
001G008.log:NOT contained
These data can then be analysed. An example is given in file 15evaluateSimsExample.R
, see also the graph in section "Summary of results" below.
- In R
- Generates an empty nuclear genome of defined size (sparse array to save memory).
- Genetates a random extranuclear sequence of 16000nt length
- Cylces through 15,000,000 years, checking each year if a substitution happens in the extranuclear DNA and is an insertion happens into the nuclear genome.
- Genomes are written out. The nuclear one is made up randomly while on the fly, filling in vagrant inserts that were recorded during the 15M years.
- Sequencing reads are simulated using wgsim
- Reads are mapped using bwa-mem2
- BAM files are sorted and indexed with samtools. Variants are called using freebayes.
- VCF is converted to allele count tables using scikit-allel in Python
- Numbers of mapped and total basairs in the data are recorded
- Allele count tables are converted into the input data format required for fitting.
- The fit is run.
- Results are stored in
[outname].log
Plot generated with 15evaluateSimsExample.R
The associated coefficient table is:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.6314 2.4267 1.496 0.134543
log(GSinMbp) -1.5034 0.4217 -3.565 0.000364 ***
log(numtDep) 6.7679 0.9332 7.252 4.11e-13 ***