SSMPG 2017

Repository for the data challenges of Software and Statistical Methods for Population Genetics (SSMPG 2017), held in Aussois, September 11-15, 2017.

Thanks to all participants for attending SSMPG 2017. See you at SSMPG 2019.

Participants to SSMPG 2017

If you want to practice, you can continue to submit lists of candidate SNPs online.

Introductory lecture about Genome Scans and about the Data Challenges

1. Install software

Install R and RStudio

To participate in the challenge, you need to install R on your computer. To make R easier to use, we suggest installing RStudio, which is an integrated development environment (IDE) for R.

Install R packages (LEA/LFMM, OutFLANK, pcadapt, rehh)

To install the R packages that are useful for the data challenge, copy and paste the following code into R:

#Install R packages for SSMPG 2017

#Package to install packages from github
install.packages("devtools")

#Packages to run LEA and LFMM
devtools::install_github("bcm-uga/LEA")

install.packages("RSpectra")
devtools::install_github("bcm-uga/lfmm")

# Additional package 'cate' 
install.packages("cate")

#Package to run OutFLANK
devtools::install_github("whitlock/OutFLANK")

#Package to run pcadapt
devtools::install_github("bcm-uga/pcadapt")
install.packages("bigstatsr")
devtools::install_github("privefl/bigsnpr")

#Package qvalue for controlling FDR (installed from Bioconductor)
#biocLite() has been retired; with R >= 3.5, use BiocManager instead
install.packages("BiocManager")
BiocManager::install("qvalue")

#Package to run rehh
install.packages("rehh")

#Package to plot population trees
install.packages("ape")

Install BAYPASS

Download the archive from http://www1.montpellier.inra.fr/CBGP/software/baypass/ or directly via the following command run on a terminal:

wget http://www1.montpellier.inra.fr/CBGP/software/baypass/files/baypass_2.1.tar.gz

Extract the archive, e.g., from a terminal:

tar -zxvf baypass_2.1.tar.gz

The source files are in the src subdirectory. BayPass is written in Fortran90 and can therefore be compiled on any system with a Fortran90 compiler, using the provided Makefile. This Makefile is designed to work with either the free compiler gfortran (if it is not already installed on your system, binaries are available at https://gcc.gnu.org/wiki/GFortranBinaries and are easy to install for most Windows, Mac and Linux OS versions) or the commercial Intel Fortran compiler ifort.

BayPass also uses OpenMP (http://openmp.org/wp/) to implement multithreading, which allows parallel calculation on computer systems that have multiple CPUs or CPUs with multiple cores. Make sure that the corresponding libraries are installed (this is usually the case on Linux, or after installing the compiler as described above).

The following commands, run within the src subdirectory, compile the code and produce a binary:

  • using the gfortran free compiler (the command should automatically produce an executable called g_baypass):
make clean all FC=gfortran
  • using the Intel Fortran compiler ifort (the command should automatically produce an executable called i_baypass):
make clean all FC=ifort

Note: Under Linux (or MacOS), before the first use, make sure to give appropriate execution rights to the program. For instance, you may run: chmod +x g_baypass (or chmod +x i_baypass if you compiled with ifort)

Install hapFLK

hapflk is available as a Python package. It has been tested on Linux and MacOSX. Before installing hapflk, you need to install Python 2.7, numpy and scipy. You also need a C compiler (e.g. gcc), which is probably already installed. Once this is done, hapflk can be installed using pip (copy and paste the following in a terminal):

sudo pip install hapflk

Later on, hapflk can be upgraded using:

sudo pip install hapflk --upgrade

Check out the hapflk webpage for documentation and companion scripts.

Install SelEstim

Download the archive from http://www1.montpellier.inra.fr/CBGP/software/selestim/, or using the following command line from a terminal:

wget http://www1.montpellier.inra.fr/CBGP/software/selestim/files/SelEstim_1.1.7.zip

Extract the archive, e.g., from a terminal:

unzip SelEstim_1.1.7.zip

The source files are in the src/ subdirectory of that archive. SelEstim is written in C and can therefore be compiled on any system supported by gcc. Windows users may need to install gcc first, e.g. through MinGW, mingw-64, or Cygwin. To compile the code and get the selestim binary, use the provided Makefile in the src/ subdirectory:

make clean all

Note: with Linux (or Mac OS), before the first use, make sure to give appropriate execution rights to the program. For instance you may run: chmod +x selestim

SelEstim uses OpenMP to implement multithreading, which allows parallel calculation on computer systems that have multiple CPUs or CPUs with multiple cores. Make sure that the corresponding libraries are installed, which is typically the case on Linux, Mac OS and Windows (provided the above recommendations for installing gcc have been followed).

Note: The gcc version included with OS X may generate executable code that fails with a runtime error (Abort trap: 6) when more than one thread is used. In that case, first install a recent version of gcc, following the instructions at http://hpc.sourceforge.net/. Then recompile SelEstim using the following command: make clean all CC=/usr/local/bin/gcc (assuming gcc has been installed in the /usr/local/ subdirectory).

Install SweeD

SweeD is hosted on the GitHub account of Nikolaos Alachiotis (https://github.com/alachins/sweed). It is written in C, so you need to download the source code and compile it on your machine. First, run:

git clone https://github.com/alachins/sweed.git

This downloads the SweeD source code from the GitHub repository and creates a folder called sweed.

  • Change into the sweed directory:
cd sweed
  • There are several versions of the code. The simplest one uses only a single thread. To compile it you should type:
make -f Makefile.gcc
  • If you want to compile the pthreads version, to make use of multiple threads on your PC, type:
make clean
make -f ./Makefile.PTHREADS.gcc

You need to remove the *.o files before compiling, which is the purpose of make clean above. Otherwise, leftover *.o files from another version of the code may prevent the executable from being built.

Now you should have a working build of SweeD in the sweed directory. To run SweeD from any folder on your PC, either (i) add the sweed directory to your PATH, or (ii) copy the SweeD executable into a folder that is already in your PATH. There is no need to do both.

For (i): open your .bashrc file with your favorite editor, e.g. emacs, and add the lines:

PATH=$PATH:<SWEED FOLDER>
export PATH

Do not leave out $PATH on the right-hand side, or your PATH will be overwritten and many Linux commands will no longer be found.

For (ii): assuming that $HOME/bin/ is in your PATH, simply run:

cp SweeD $HOME/bin/

To run SweeD, assuming that the input file is called input.vcf, type:

SweeD -input input.vcf -grid 1000 -name MYRUN
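Once the command-line tools have been compiled and placed in your PATH, a quick check can confirm that the shell actually finds them. The snippet below is a minimal sketch, not part of the original instructions: check_tools is a hypothetical helper, and the executable names in the usage comment (g_baypass, hapflk, selestim, SweeD) are taken from the sections above and are assumed to match your own build.

```shell
# Hypothetical helper: report, for each tool name given as an argument,
# whether it is reachable through PATH.
# Returns a non-zero status if at least one tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool -> $(command -v "$tool")"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call (names assumed from the sections above; adjust to your build):
# check_tools g_baypass hapflk selestim SweeD
```

A tool reported as missing usually means that its folder was not added to PATH, or that the terminal was not restarted after editing .bashrc.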

Install OmegaPlus (NOT FOR THE CONTEST)

OmegaPlus detects the LD patterns around the target of a beneficial mutation. To install:

git clone https://github.com/alachins/omegaplus.git
cd omegaplus
make clean
make

Install RAiSD (NOT FOR THE CONTEST)

In contrast to SweeD, RAiSD uses all 3 signatures of a selective sweep, namely the reduction of polymorphism levels, the shift of the SFS, and the special patterns of LD. Instead of the full information for SFS and LD, it uses approximations based on SNP vectors.

RAiSD is hosted on GitHub at https://github.com/alachins/raisd.

The following commands can be used to download and compile the source code.

mkdir RAiSD
cd RAiSD
wget https://github.com/alachins/raisd/archive/master.zip
unzip master.zip
cd raisd-master
make

The executable is placed in the path RAiSD/raisd-master/bin/release. A link to the executable is placed in the installation folder, i.e., raisd-master.

2. Download datasets

The first challenge is the Dahu challenge, in which teams are asked to analyze simulated data.

Download the training and the validation datasets for the Dahu challenge.

The second challenge is the Cichlid challenge, in which teams are asked to analyze real data.

Download the data for the Cichlid challenge.

3. Form a team

To participate in the challenge, you should form teams. A team can be composed of 1, 2, or 3 participants. Once you have chosen a team, click on "create teams" on the submission website.

4. Submit candidate markers

The objective of the two data challenges is to find markers that are involved in adaptation.

To submit a list of markers involved in adaptation, use the submission website. The submitted file should be a .txt file with a single column containing the indices of the candidate SNPs. The file mysubmission.txt is an example of such a submission file.

Here is a piece of R code that shows how to make a submission file:

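training<-readRDS("Presentations/pcadapt/sim1a.rds")
#Toy statistic: mean of each column (SNP) of the genotype matrix G
stat<-apply(training$G,FUN=mean,MARGIN=2)
#Indices of the 100 SNPs with the smallest statistic, written one per line
ou<-order(stat,decreasing = FALSE)[1:100]
write(ou,"ridicule_never_killed_anyone.txt",ncolumns=1)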
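Before uploading, it can be worth checking from a terminal that the file really has the expected one-column layout. The following is a minimal sketch under stated assumptions: check_submission is a hypothetical helper (it is not part of the submission website), and it assumes that indices are positive integers starting at 1, as produced by R's order(), and that a duplicated index is a mistake.

```shell
# Hypothetical helper: check that a submission file holds exactly one
# positive integer per line, with no duplicated index.
# Prints "ok" on success; prints a diagnostic on stderr and returns 1 otherwise.
check_submission() {
    file="$1"
    [ -s "$file" ] || { echo "empty or missing file: $file" >&2; return 1; }
    # First line that is not a bare positive integer, reported with its line number.
    bad=$(awk '!/^[1-9][0-9]*$/ { print NR ": " $0; exit }' "$file")
    if [ -n "$bad" ]; then
        echo "invalid line $bad" >&2
        return 1
    fi
    # First index that appears more than once (assumed to be unintended).
    dup=$(sort -n "$file" | uniq -d | head -n 1)
    if [ -n "$dup" ]; then
        echo "duplicated index: $dup" >&2
        return 1
    fi
    echo "ok"
}

# Example call on the file written by the R code above:
# check_submission ridicule_never_killed_anyone.txt
```

Files saved with Windows line endings will be reported as invalid by this check, since each line then ends with an extra carriage return.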

5. Evaluation

Dahu challenge

Submissions will be evaluated by comparing the submitted list to the list of causal adaptive SNPs.

The ranking of the teams will be based on the G score. The G score varies from 0 (minimum score) to 1 (maximum score).

The G score depends on the false discovery rate (FDR), which is the percentage of false positive markers (or regions) in the submitted list, and on the power, which is the percentage of markers (or regions) involved in adaptation that are found in the submitted list. Two G scores will be computed for each submission: a marker-based score evaluates whether the submitted markers are correct, and a region-based score evaluates whether the submitted markers fall in the correct 100 bp regions.
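To make the two ingredients concrete, here is a minimal sketch of how the marker-based FDR and power could be computed from two one-column files of SNP indices. It is an illustration only, not the scoring code used by the submission website: fdr_power is a hypothetical helper, the file of causal SNPs is assumed to be available (as it would be for your own simulations), both files are assumed to be non-empty and free of duplicates, and the values are printed as proportions between 0 and 1. The sketch stops at FDR and power and does not compute the G score itself.

```shell
# Hypothetical helper: fdr_power SUBMITTED TRUTH
# Both arguments are files with one SNP index per line.
# FDR   = submitted markers that are not causal / number of submitted markers
# power = causal markers that were submitted    / number of causal markers
fdr_power() {
    LC_ALL=C awk '
        NF == 0   { next }                              # skip blank lines
        NR == FNR { truth[$1] = 1; n_truth++; next }    # first file read: causal SNPs
        { n_sub++; if ($1 in truth) tp++ }              # second file read: submitted SNPs
        END {
            fdr   = (n_sub   > 0) ? (n_sub - tp) / n_sub : 0
            power = (n_truth > 0) ? tp / n_truth : 0
            printf "FDR=%.3f power=%.3f\n", fdr, power
        }
    ' "$2" "$1"
}

# Example call (truth.txt is an assumed file name for the list of causal SNPs):
# fdr_power mysubmission.txt truth.txt
```

For instance, a submission of 4 markers of which 2 are among 5 causal SNPs has an FDR of 0.5 and a power of 0.4. A longer list tends to raise power at the cost of a higher FDR, which is why the two quantities are combined into a single score.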

The mathematical definition of the G score is

For the training dataset, G scores will be publicly available (public leaderboard). Participants should use the training dataset for training.

The final ranking of the teams will be based on the validation dataset for which G scores are not public.

A prize will be awarded to the best team according to the marker-based score, and another prize will be awarded to the best team according to the region-based score.

Cichlid challenge

Submissions will be evaluated subjectively by the instructors, based on the team presentations.

6. Presentations

During the SSMPG prize ceremony, each team will be asked to present 2-3 slides for each challenge.

7. Simulation results

A comparison of results for the first QTL region. Vertical lines indicate quantitative trait nucleotides that have an effect on the trait under selection.

Quick look Google doc

A full Markdown document