Population Genetics using R

R has a growing number of libraries that are useful both for population genetic analysis and for learning and teaching population genetics.

Here is a brief list. I am still compiling it, so for the moment it is quite disorganized.

Lists of lists

As always, the CRAN Task Views are a good place to start for an overview of useful libraries, in particular the Task Views for Genetics and for Phylogenetics. These disciplines overlap in some areas (e.g. simulating genealogies), which can be quite important.

NESCent also maintains a list. They also have some tutorials that are useful for practical data analysis here.

Special issue with R packages

There was also a nice special issue of the journal Molecular Ecology Resources on Population Genomics with R. The issue is here. It contains numerous papers describing software tools in R. This introduction summarizes much of it if you want to take a look.

Learning

driftR

shinyPopGen

Where to start

Some libraries do a great many things. PopGenome is one example, and there are a few others as well.

PopGenome: An efficient Swiss army knife for population genomic analyses in R. The vignette is here. A tutorial on whole genome analysis is here. Another tutorial is here. The paper is here.

Data import, cleaning, transformations and export

strataG: An R package for manipulating, summarizing and analysing population genetic data.

GENEPOPEDIT: a simple and flexible tool for manipulating multilocus molecular data in R.

apex: phylogenetics with multiple genes. Toolkit for the analysis of multiple gene data. Apex implements the new S4 classes 'multidna', 'multiphyDat' and associated methods to handle aligned DNA sequences from multiple genes.

seqinr: Exploratory data analysis and data visualization for biological sequence (DNA and protein) data. Includes also utilities for sequence data management under the ACNUC system.

SeqArray: Big Data Management of Genome-Wide Sequencing Variants.

Data visualization

POPHELPER: An R package and web app to analyze and visualize population structure.

minotaur: A platform for the analysis and visualization of multivariate results from genome scans with R Shiny.

Genomic Data

vcfR: A package to manipulate and visualize variant call format data in R.

Detecting Selection and hybridization

pcadapt: an R package to perform genome scans for selection based on principal component analysis. Here is an example tutorial.

parallelnewhybrid: an R package for the parallelization of hybrid detection using newhybrids.

rehh 2.0: a reimplementation of the R package rehh to detect positive selection from haplotype structure. Here is a tutorial. Link to the new [paper](http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12634/abstract).

pegas: Population and Evolutionary Genetics Analysis System. Link to the paper.

OutFLANK: A procedure to find Fst outliers based on an inferred distribution of neutral Fst. See here for a quick tutorial.

Population Differentiation and divergence

hierFstat: Estimation and Tests of Hierarchical F-Statistics.

mmod: Modern Measures of Population Differentiation.
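As a rough illustration of the kind of quantity the differentiation packages above report, here is a minimal base R sketch of Wright's Fst for a single biallelic locus, computed from subpopulation allele frequencies. It assumes equally sized subpopulations and known frequencies, and the function name `fst_from_freqs` is made up for this example. It is not the estimator that hierFstat or mmod implement, which correct for sample size and handle many loci.

```r
# Hypothetical helper (not part of hierFstat or mmod): Wright's Fst for one
# biallelic locus, given the allele frequency in each subpopulation.
# Assumes equally sized subpopulations and frequencies known without error.
fst_from_freqs <- function(p) {
  p_bar <- mean(p)                  # mean allele frequency across subpopulations
  h_t <- 2 * p_bar * (1 - p_bar)    # expected heterozygosity of the pooled population
  h_s <- mean(2 * p * (1 - p))      # mean expected heterozygosity within subpopulations
  if (h_t == 0) return(NA_real_)    # monomorphic overall, so Fst is undefined
  (h_t - h_s) / h_t
}

fst_from_freqs(c(0.5, 0.5))  # identical frequencies, no differentiation
fst_from_freqs(c(0.2, 0.8))  # divergent frequencies
fst_from_freqs(c(0, 1))      # fixed difference between subpopulations
```

For real data you would want one of the packages above, since estimating Fst from finite samples needs bias corrections that this sketch ignores.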

Simulation (both forward and backward in time)

scrm: Simulating the Evolution of Biological Sequences. A coalescent simulator that allows the rapid simulation of biological sequences under neutral models of evolution. Unlike other coalescent-based simulators, it has an optional approximation parameter that allows for high accuracy while maintaining a linear run time cost for long sequences. It is optimized for simulating massive data sets as produced by Next-Generation Sequencing technologies for up to several thousand sequences. Link to the vignette. Link to the paper.

coala: Coalescent simulators can rapidly simulate biological sequences evolving according to a given model of evolution. You can use this package to specify such models, to conduct the simulations and to calculate additional statistics from the results. It relies on existing simulators for doing the simulation, and currently supports the programs 'ms', 'msms' and 'scrm'. It also supports finite-sites mutation models by combining the simulators with the program 'seq-gen'.

phyclust: Phylogenetic clustering (phyloclustering) is an evolutionary Continuous Time Markov Chain model-based approach to identify population structure from molecular data without assuming linkage equilibrium. The package phyclust (Chen 2011) provides a convenient implementation of phyloclustering for DNA and SNP data, capable of clustering individuals into subpopulations and identifying molecular sequences representative of those subpopulations. It can also do ms-like simulations. See the website for more information, including examples.

skelesim: an extensible, general framework for population genetic simulation in R (with shiny interface). I think this is mostly a front end (?)

MetaPopGen: an R package to simulate population genetics in large size metapopulations.

rmetasim: An Individual-Based Population Genetic Simulation Environment.
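To give a feel for what the coalescent simulators listed above (scrm, coala) do under the hood, here is a minimal base R sketch of the standard neutral coalescent for a single non-recombining locus. It only simulates the number of segregating sites, and the function name `sim_segsites` and its arguments are made up for this example. Time is assumed to be in units of 2N generations with theta = 4 * N * mu per locus. The packages above handle recombination, demography and full sequence output, none of which is attempted here.

```r
# Hypothetical helper (not from scrm or coala): number of segregating sites
# under the standard neutral coalescent, infinite-sites mutation, no recombination.
# Time is in units of 2N generations; theta = 4 * N * mu for the whole locus.
sim_segsites <- function(n, theta, reps = 1000) {
  k <- n:2                                       # lineages present during each interval
  replicate(reps, {
    t_k <- rexp(length(k), rate = choose(k, 2))  # waiting times between coalescent events
    tree_length <- sum(k * t_k)                  # total branch length of the genealogy
    rpois(1, theta / 2 * tree_length)            # mutations fall on branches as a Poisson process
  })
}

set.seed(1)
s <- sim_segsites(n = 10, theta = 5, reps = 5000)
mean(s)                  # simulated mean number of segregating sites
5 * sum(1 / (1:9))       # theoretical expectation: theta * sum(1/i) for i = 1..n-1
```

The two printed numbers should be close to each other, which is a quick sanity check that the waiting times and the mutation step are scaled consistently.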

Inference (estimating theta, demographic parameters, etc.)

coalescentMCMC: MCMC Algorithms for the Coalescent. Flexible framework for coalescent analyses in R. It includes a main function running the MCMC algorithm, auxiliary functions for tree rearrangement, and some functions to compute population genetic parameters. Link to vignette.
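As a simple point of comparison for "estimating theta", here is a minimal base R sketch of Watterson's estimator, computed from a 0/1 haplotype matrix (rows are sampled sequences, columns are sites). The function name `watterson_theta` and the toy matrix are made up for this example, and this moment estimator is not what coalescentMCMC does; it is just the classic quick estimate based on the number of segregating sites.

```r
# Hypothetical helper (not from coalescentMCMC): Watterson's estimator of theta.
# haps is a 0/1 matrix with one row per sampled sequence and one column per site.
watterson_theta <- function(haps) {
  n <- nrow(haps)                         # number of sampled sequences
  derived <- colSums(haps)                # count of the "1" allele at each site
  S <- sum(derived > 0 & derived < n)     # sites where both alleles are present
  a_n <- sum(1 / seq_len(n - 1))          # harmonic number 1 + 1/2 + ... + 1/(n-1)
  S / a_n
}

# Toy data: 4 sequences, 4 sites, the last site is monomorphic
haps <- rbind(c(0, 1, 0, 0),
              c(0, 1, 1, 0),
              c(1, 0, 0, 0),
              c(0, 0, 0, 0))
watterson_theta(haps)
```

The result is theta for the whole locus; divide by the number of sites if you want a per-site value.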

Education

Population genetics in R: A primer on applied population genetics.

evobiR: Comparative and Population Genetic Analyses. Interactive simulations for teaching.

Other

ape: Analyses of Phylogenetics and Evolution. Many packages depend on this. Link to the book. Link to data and scripts associated with the book. It provides the DNAbin class for DNA alignments.

adegenet: Exploratory Analysis of Genetic and Genomic Data. Tutorials are here. The old SourceForge page (still has useful bits) is here.

poppr: Genetic Analysis of Populations with Mixed Reproduction

Data to use

rsnps: Get 'SNP' ('Single-Nucleotide' 'Polymorphism') Data on the Web. A programmatic interface to various 'SNP' datasets on the web: OpenSNP, [NCBI's dbSNP database](https://www.ncbi.nlm.nih.gov/projects/SNP), and Broad Institute 'SNP' Annotation and Proxy Search. Functions are included for searching for 'SNPs' for the Broad Institute and 'NCBI'. For 'OpenSNP', functions are included for getting 'SNPs', and data for 'genotypes', 'phenotypes', annotations, and bulk downloads of data by user.

1000 genomes, Drosophila genomes...

Phylogenetics

phangorn

RADami: Phylogenetic Analysis of RADseq Data

phylodyn: an R package for phylodynamic simulation and inference

Not R, but useful

fastsimcoal2

SLiM

angsd: Computes population genetic summary statistics. Assumes diploids, no pooling. Link to the GitHub page is here.

msms

msprime

ms-ld

Will add ms, msms, simcoal2, ...