This program calculates divergence between two species for sites of each annotated category.
Follow the instructions in the main documentation to set up the conda environment 'WGS_analysis'.
Configure file paths in the shell script and run it in bash.
In most cases, you do not have to modify the core Python script, which counts the number of differing reference alleles within each alignment block, or the core shell script, which runs that count in parallel across alignment blocks and then calculates the proportion of differing reference alleles across all blocks for each site category.
Run the following command to see input and output files:
python count_divergent_sites.py -h
- Get the coordinates of an alignment block in the target species (i.e. Drosophila suzukii) from the 'map' file;
- Read the multiple FASTA alignment (.mfa) file of this alignment block, and map bases of the outgroup species (i.e. Drosophila biarmipes) onto the genome of the target species. This generates a BED file of coordinates of matched sites in the target species, and a sequence that marks each matched site as an allele substitution or a consistent allele (labeled s/c);
- Use the bed file to generate a tab-delimited table of site annotations;
- Simultaneously read the site annotation table and the sequence of allele substitution/consistency, counting the allele differences for each annotation category;
- Calculate divergence as the proportion of sites with differing alleles for each annotation category;
- (Optional) Plot divergence for each annotation category.
Counting the steps above in order, steps 1-4 are performed by count_divergent_sites.py for each alignment block, step 5 is implemented by parallel_by_block.sh, and step 6 is implemented by divergence_barplot.R.
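The counting logic in steps 2, 4 and 5 can be illustrated with a minimal Python sketch. This is not the code in count_divergent_sites.py. The function names (label_sites, count_by_category, divergence), the 0-based start coordinate, the gap character '-', and the choice to skip columns containing gaps or non-ACGT bases are all assumptions made for illustration.

```python
from collections import Counter


def label_sites(target_aln, outgroup_aln, chrom, start):
    """Walk two aligned sequences column by column (hypothetical helper).

    Returns BED-like rows for matched sites on the target genome and a
    string of 's' (substitution) / 'c' (consistent) labels, one per row.
    `start` is assumed to be the 0-based coordinate of the first target base.
    """
    bed_rows, labels = [], []
    pos = start
    for t, o in zip(target_aln.upper(), outgroup_aln.upper()):
        if t == "-":
            continue  # gap in the target: no target coordinate is consumed
        if t in "ACGT" and o in "ACGT":
            bed_rows.append((chrom, pos, pos + 1))
            labels.append("c" if t == o else "s")
        pos += 1  # every non-gap target base advances the coordinate
    return bed_rows, "".join(labels)


def count_by_category(categories, labels):
    """Tally (category, label) pairs; categories and labels run in parallel."""
    counts = Counter()
    for cat, lab in zip(categories, labels):
        counts[(cat, lab)] += 1
    return counts


def divergence(counts):
    """Proportion of 's' sites among all labeled sites, per category."""
    result = {}
    for cat in {c for c, _ in counts}:
        s = counts[(cat, "s")]
        total = s + counts[(cat, "c")]
        result[cat] = s / total
    return result


# Toy alignment block (made-up sequences and annotations).
bed_rows, labels = label_sites("ACG-TA", "ATGCT-", "chr1", 100)
categories = ["intron", "intron", "exon", "exon"]  # one per BED row

# Counts from several blocks can be accumulated before dividing.
total = Counter()
total.update(count_by_category(categories, labels))
div = divergence(total)
```

Accumulating the per-block counts in a single Counter before dividing mirrors the description above, where divergence is a proportion across all blocks rather than an average of per-block proportions.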