Skip to content

Latest commit

 

History

History
355 lines (212 loc) · 12.7 KB

02-Project-02_sum.md

File metadata and controls

355 lines (212 loc) · 12.7 KB
Error in user YAML: (<unknown>): mapping values are not allowed in this context at line 1 column 40
---
course: NGS for evolutionary biologists: from basic scripting to variant calling
title: Project 01 - Bayesian skyline plot
requires: knowledge of basic shell commands, basic concept of Phylogenetics
author: Enza Colonna, Chiara Batini, Pille Hallast  
time: 
---

Learning Objectives


#Summary

############################################

Analyse sex-specific demographic changes in a European population

The biological question

We have sequenced the whole mitochondrial genome and 2.2Mb of the Y chromosome in 20 individuals from the CEU population included in the HapMap project. We want to compare how the male and female effective population sizes (Ne) changed through time and understand the role of sex-biased processes in the evolution of the European population.

We will use a Bayesian method, Bayesian Skyline Plot (BSP), that co-estimates the genealogy, the demographic history (through the Ne) and the substitution-model parameters from aligned sequences. Take some time to read this review about the methods.

Take some time to read about Bayesian Inference, and its application in phylogeny.

############################################

The software required

############################

From raw data to variants

As the course will cover this part in detail, we will skip it here and talk only about more specific software. In summary you will have to follow the pipeline we have applied during the practicals.

############################

Task specific

Bayesian Evolutionary Analysis Sampling Trees (BEAST)

BEAST is a program for Bayesian analysis of molecular sequences based on the construction of phylogenetic trees. BEAST uses Markov chain Monte Carlo (MCMC) to average over the tree space. This allows to test evolutionary hypothesis by weighing each tree on its posterior probability and avoids fixing the tree topology.

Take some time to read one of the BEAST tutorials. We recommend you read the Divergence Dating (Primates) v1.2a.zip (BEAST v1.6.x), but any other will be good as well. Download the tutorial and take time to go through the files.

We will use BEAST to produce BSP like this:

bsp

This plot represents a Bayesian Skyline Plot of Eastern Pygmies. The y axis represents the female effective population size (Ne) in a log scale and the x axis shows time in thousands of years ago (Kya). The thicker colored lines are the median for Ne and the thinner grey lines represent 95% higher posterior density intervals. What you can observe here is a constant effective population size between 140 ad 20 kya followed by a decline until 4kya and a subsequent expansion.

############################################

Project tasks

For this project we will use a graphical interface, therefore remember to ssh using the -X (capital X!) option that will enable the use of the graphical interface

1. Copy the fastq files in your directory

The Fastq files we will use here were extracted from a custom enrichment experiment. Agilent SureSelect was used to capture 26Mb of the human genome, and paired-end libraries were run on a HiSeq 2000 sequencer with 100bp read length (Hallast et al, 2015).

.fastq files are in this folder: /pico/scratch/userexternal/vcolonna/project_2/fastq

The .bam files are in this folder: /pico/scratch/userexternal/vcolonna/project_2/bamfiles

Copy in your personal data directory the ones you will work with using the shell cp command

############################

2. Process the NGS data

You will align the reads to the reference genome, refine the BAM and perform QC, do the variant calling and filtering. Note that:

  • you need to combine all the individual's vcf to do this analysis.
  • make sure your vcf is bgzipped and indexed

3. Prepare input files for BEAST

BEAST takes as input a file composed of two parts: demographic parameters and aligned DNA sequences. The BEAST's routine BEAUti is used to generate input files for BEAST. In turn, BEAUti takes as input files aligned sequences either in .fasta or .nex NEXUS file formats.

Take some time to familiarize with the file formats. In our case we will go through the .fasta

3.1 Convert .vcf in .fasta files

Our first step is to convert the .vcf in fasta. As usual there are several ways to do this, among which the useful PDGSpider.

You should install PDGSpider on your workspace in few steps. First download the file using the shell command wget

 wget http://www.cmpg.unibe.ch/software/PGDSpider/PGDSpider_2.0.9.0.zip

The zipped file you just downloaded contains the executable file needed. Unzip the folder using the shell command unzip.

unzip PGDSpider_2.0.9.0.zip

Move to the unzipped folder and list its content. You will find several files among which PGDSpider2.sh that is the file we want to launch to open the graphical interface.

cd PGDSpider_2.0.9.0.zip
ls -l

drwxrwxr-x 3 usr usr    4096 set 25 17:35 ./
drwxrwxr-x 3 usr usr    4096 nov 24 15:04 ../
-rw-rw-r-- 1 usr usr    3436 set 25 17:21 changelog.txt
drwxrwxr-x 2 usr usr    4096 set 25 17:29 examples/
-rw-rw-r-- 1 usr usr    1549 gen 12  2015 LICENSE.txt
-rw-rw-r-- 1 usr usr    5079 giu 20  2009 log4j.dtd
-rw-rw-r-- 1 usr usr   10646 ott 12  2011 PGD_schema.xsd
-rw-rw-r-- 1 usr usr   33792 giu 20  2011 PGDSpider2-cli.exe
-rw-rw-r-- 1 usr usr 9199806 set 25 17:31 PGDSpider2-cli.jar
-rw-rw-r-- 1 usr usr   36864 giu 20  2011 PGDSpider2.exe
-rw-rw-r-- 1 usr usr 9474801 set 25 17:31 PGDSpider2.jar
-rw-rw-r-- 1 usr usr      58 giu 21  2011 PGDSpider2.sh
-rw-rw-r-- 1 usr usr 7525387 set 25 17:23 PGDSpider manual_vers 2-0-9-0.pdf
-rw-rw-r-- 1 usr usr    6946 set 25 17:21 README.txt
-rw-rw-r-- 1 usr usr    9662 ott 30  2008 spider48.ico
-rw-rw-r-- 1 usr usr   18918 lug  7  2011 stylesheet_PGD.xsl

To make sure that PGDSpider2.sh is executable let's change its permissions using the unix chmod command

chmod +x PGDSpider2.sh

Now we are finally ready to launch the program:

./PGDSpider2.sh

if all worked, you should see this:

pdg

Follow the instructions to upload the .vcf and convert it. If all worked, this operation will generate a .fasta file in your work directory

ls

-rw-rw-r-- 1 user user        676 19 nov 12:17 myfile.fasta
3.2 Generate .xml file (BEAUti)

To generate the input .xml file for BEAST we will use a routine of BEAST called BEAUti. We will use the BEAUti graphical interface although we will launch it on the PICO cluster. To launch BEAUti we should first load BEAST, as BEAUti is part of it.

module load profile/advanced
module load autoload beast
beauti

Note that the the module load instructions apply specifically to this cluster. Other machines might have a different way to launch BEAST/BEAUti

At this time a graphical interface should appear.

beauti

You will go through a number of steps all guided by the interface. BEAUti does the job of specifying all the demographic model options. Use default ones unless you are familiar with them and you feel you should make changes.

a) Import the FASTA file

b) Set the substitution model

c) Set the ‘molecular clock’ model

d) Set the tree prior

e) Set up the priors

f) Set up the operators

g) Set the MCMC options

At the end you should have generated an .xml file in your working directory

-rw-rw-r-- 1 user user        676 19 nov 12:17 myfile.xml

.xml: this is an equivalent of the NEXUS format written in the Extensible Markup Language (XML) style. While the .xml file format might seem complicated at first sight, it can be easily read by considering that it is subdivided in blocks containing the sequences and the information for BEAST about (a) the parameters of the demographic model, and (b) the parameters of the MCMC.

Take some time to explore the .xml file and try to understand its structure.

3.4 Run BEAST

We are now ready to run BEAST. The syntax to run BEAST is:

beast <options> example.xml

We will use the -working options to have the output

beast -working  myfile.xml
Submit a job to job scheduler

If we are using a very small file, the command line described above can be very fast and run interactively. However in reality files are large and we might want to submit jobs instead.

If we are using a machine with a PBS job scheduler we might want to embed the command line in a PBS script as described in the instructions to run jobs with PBS.

The PBS script will look like:


#!/bin/bash
#PBS -q workq
#PBS -N admix2
#PBS -l nodes=1:ppn=1
#PBS -o /absolutepath/outerr/runbeast.out
#PBS -e /absolutepath/outerr/runbeast.err

cd /myworkingdir
module load autoload beast

beast -working  myfile.xml

This PBS will produce two files:

ls

-rw-rw-r-- 1 user user     27054 19 nov 14:06 myfile.log
-rw-rw-r-- 1 user user     27054 19 nov 14:06 myfile.trees
  • .log This file contains information from the MCMC every n steps, where n is set in BEAUti. The content of the file is described in the header, take some time to explore it.
  • .trees file (NEXUS format): trees (either phylogenies or geneologies) sampled at the same time during the MCMC

3.5 Run Tracer

To analyze the results of BEAST we are going to use Tracer, a program for analysing the trace files generated by Bayesian MCMC runs. It is also distributed with BEAST here. Tracer provides summary statistics of the MCMC runs but it also produces an estimates of the marginal posterior distributions of parameters of our model.

Tracer produces graphical outputs but you can also decide to export Trace output to a file and make your own graphs using R

5. Interpret the output and prepare a report

Here is where we want you to be creative... Prepare two slides:

  • explain the NGS pipeline
  • show the results

brain

Recap

To summarize, below is a list of all the task-specific softwares that we need to use. We will run some of them and use only file formats from others.

Pipeline

References