This program calculates pairwise between-population sequence distance (DXY) across genome-wide 4-fold synonymous SNPs.
Follow the instructions in the main documentation to set up the conda environment 'WGS_analysis'.
- Generate vcf and mpileup files of 4-fold synonymous SNPs and analyzed sites from the previously generated vcf and mpileup files: configure file paths and parameters in run_extract_4_syn_sites.sh and run it in bash (this step uses the same code as synonymous FST).
- Calculate DXY from the generated files, using the same scripts as for genome-wide DXY: configure file paths and parameters in run_calc_syn_dxy.sh and run it in bash.
In most cases, you do not have to modify the core Python scripts, which count the total number of analyzed sites and calculate DXY within each chromosome/contig, or the parallelizing shell script, which calculates within-contig DXY in parallel and then merges the results into genome-wide synonymous DXY.
Similarly, you do not have to modify the core Python script, which extracts 4-fold synonymous sites, or the parallelizing shell script, which extracts sites within each chromosome/contig in parallel.
Check variables in this shell script to see input and output files.
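The merging step described above can be sketched as follows. This is only an illustration, not the repository's code: it assumes each contig reports a sum of per-site DXY values and a count of analyzed sites, and that the genome-wide value is the total sum divided by the total count. The function name merge_contig_dxy and the input layout are hypothetical.

```python
def merge_contig_dxy(per_contig):
    """Combine per-contig results into one genome-wide DXY value.

    per_contig: iterable of (dxy_sum, n_sites) tuples, one per contig,
    where dxy_sum is the sum of per-site DXY and n_sites is the number
    of analyzed sites (variant and invariant) on that contig.
    This input layout is an assumption made for illustration.
    """
    per_contig = list(per_contig)
    total_sum = sum(dxy_sum for dxy_sum, _ in per_contig)
    total_sites = sum(n_sites for _, n_sites in per_contig)
    if total_sites == 0:
        raise ValueError("no analyzed sites: DXY is undefined")
    # Weight by site count rather than averaging per-contig means,
    # so that long contigs contribute proportionally more.
    return total_sum / total_sites


if __name__ == "__main__":
    # Two hypothetical contigs: (sum of per-site DXY, analyzed sites).
    print(merge_contig_dxy([(1.5, 100), (0.5, 300)]))
```

Dividing by the number of analyzed sites, rather than by the number of SNPs, is what makes the result a per-site divergence.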
- DXY is an absolute measure of population differentiation, and is independent of levels of within-population diversity.
- Though the formula is applicable to tri- or tetra-allelic sites, the calculation is limited to the two most frequent alleles for consistency with previous calculations of nucleotide diversity.
- The two most frequent alleles do not have to be the same among samples, e.g., genotype A/T for sample1 and A/G for sample2, where DXY between sample1 and sample2 should be calculated as AF(1A) x AF(2G) + AF(1T) x AF(2A) + AF(1T) x AF(2G), with AF(1A) denoting the frequency of allele A in sample1.
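
The per-site calculation in the notes above can be sketched in Python. This is a minimal sketch rather than the code in the core scripts: the function names are hypothetical, allele frequencies are passed as plain dictionaries, and renormalizing the two retained alleles so that they sum to 1 is an assumption that the notes do not state.

```python
def two_most_frequent(freqs):
    """Keep the two most frequent alleles of one sample.

    freqs: dict mapping allele -> frequency.
    The retained frequencies are rescaled to sum to 1; this
    renormalization is an assumption made for illustration.
    """
    top = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)[:2]
    total = sum(f for _, f in top)
    if total == 0:
        raise ValueError("all allele frequencies are zero")
    return {allele: f / total for allele, f in top}


def site_dxy(freqs1, freqs2):
    """Per-site DXY: probability that an allele drawn from sample 1
    differs from an allele drawn from sample 2."""
    return sum(
        f1 * f2
        for a1, f1 in freqs1.items()
        for a2, f2 in freqs2.items()
        if a1 != a2
    )


if __name__ == "__main__":
    # Example from the notes: A/T in sample1 and A/G in sample2.
    sample1 = two_most_frequent({"A": 0.5, "T": 0.5})
    sample2 = two_most_frequent({"A": 0.5, "G": 0.5})
    # AF(1A) x AF(2G) + AF(1T) x AF(2A) + AF(1T) x AF(2G)
    print(site_dxy(sample1, sample2))  # 0.75
```

Summing over all pairs of unequal alleles reproduces the three terms in the example and still works when the two samples retain different allele pairs.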