This program calculates pairwise population differentiation of allele frequenccies (FST) across genome-wide SNPs.
Follow instructions in main documentation to set up the conda environmnet 'WGS_analysis'.
Configure file paths and parameters in this shell script and run it in bash.
In most cases, you do not have to modify the core python scripts, which filter for analyzed sites within window and calculates FST within each window, as well as the paralleling shell script, which parallelly calculate within-window FST and then merged results into genome-wide FST.
Run the following command to see input and output files:
python calc_Fst_reynolds_ungapped_efs.py -h
- The calculation is separately performed for autosomes and X-chromosome, to account for different allelic sample sizes, demogpraphic histories, and outcomes of natural selection.
- The formulation being used here is a a modification (also called GST) to Wright's original formulation (Reynolds 1983), also known as Reynold's estimator (theta) of the coancestry coefficient, which takes into account unequal samples sizes among populations, as well as sites with more than two alleles.
- The formulation is applicable to tri- or quadro-allelic sites, but it also raises the issue that two populations might shares no allele but will have Fst < 1 (when HS is not 0, same issue with GST). Nevertheless, to avoid a more serious situation in which heterozygosities of populations with one overlaped allele might be wrongly calculated if we only consider the two most frequent alleles, and at least one of the other alleles doesn't overlap between populations (e.g., P1: A=0.01, G=0.99; P2: A=0.02, C=0.98), we should still calculate heterozygosities through all existing allels (i.e. 1 - sum(square(p))).
- In practical calculations, there might be sites where an allele is fixed or lost in both populations while not in other populations. This could lead to HT = 0, which violates the computation. Thus we will identify such cases and assign 0 to the pairwise FST.
- The start position of the first window of each contig is 1, and the following boundries of each window are midpoint between the current ending SNP and the next starting SNP. Beaware that this program is not strictly 'ungapped', because the most 3' SNP of a contig is not necessarily the end base of that contig, and also because some of the most 3' SNP windows are discarded due to the lack of accumulative heterozygosity.
- The window is defined by accumulative heterozygosity (averaged among all samples analyzed) calculated from SNPs, which should have a summed heterozygosity >= 100 for each window and include ~10k sites.
- The within-window FST was calculated using weighted average across sites.
- We get the across-window FST as a weighted average by number of analyzed sites in each window.
- The sample size used to calculate Fst is effective sample size, which is calculated at each site (instead of effective pool size, which is a component of calculating effective sample size, and is set as the actual pool size due to the lack of replicate population samples) based on Ferretti et al, 2013
- Keep in mind that FST is a relative measure of population differentiation, and is strongly influenced by levels of within-population diversity.
- There're many contigs within which the total number of SNPs is far less than what could accumulate enough heterozygosity to pass the threshold. On the other hand, we don't know the consecutive order among contigs. My solution is to skip these contigs for Fst calculation.
- Similar to the above issue, even if a contig has enough SNPs to pass the heterozygosity threshold, there're likely remaining regions at the end of contigs that are not long enough to qualify for a 'window'. In this case, I skipped these tails for Fst calculation. Does this make sense?
- A questions is whether to set the boundry of each window as the positions of starting and end SNPs? Here, we do not count filtered sites outside our windows (i.e. those before the first 5' window SNP and after the last 3' window SNP).