-
Notifications
You must be signed in to change notification settings - Fork 0
/
pairwise_snp_differences.Rmd
297 lines (226 loc) · 11.2 KB
/
pairwise_snp_differences.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
---
title: "Pairwise SNP difference summary"
author: "Anders Goncalves da Silva"
date: "26 August 2015"
output: pdf_document
---
# The goal
There might come a time when it would be interesting to measure just how different clades are, or how distinct are different classification schemes (e.g., MLST categories). We can use the output from `nullabor`, in particular, the alignment of the SNPs across isolates, to quantify the mean distance (and other quantiles of interest), in terms of distinct SNPs across groups of isolates.
<!--
# Download the script
The script can be downloaded [here](summarise_snp_differences.R). The script can be run both interactively, within an `R` session and from the command line. Some instructions are provided in the script. This script has **NOT** been run on a Windows machine. Please let me know if you have problems.
-->
# How to do it
We can calculate summary measures of differentiation using `R`. Here, I am providing
a script that can be run within `R` or from a command line interface.
## Obtaining the script
Please clone the `git-hub` repository.
## Running within `R`
To run within `R`, make sure the `ape` library is installed:
```{r, install_lib, eval = F}
install.packages("ape")
```
Once that is done, open the script `pairwise_snp_differences.R` in a text editor. If
within `RStudio`, just double click on the script within the `File` pane (usually on
the lower right). Then, edit the following lines:
################################################################################
# If not running off the command line, change these parameters to point to the
# approriate files, e.g.:
# cat = '/home/user/cat.csv'
#
# To test the script substitute below as follows:
#
# cat_file = 'test/woodm_grouping.csv'
# seq_file = 'test/woodm.fa'
cat_file = NULL
# Only one of these files needs to be specified. If both are specified, the
# diff file will have precedence
seq_file = NULL
diff_file = NULL
# Options:
# Change the following options to set output
#
out_basename = 'snp_diff'
tab_fmt = "csv" # options are "csv" or "md"
tab_type = "pretty" # options "pretty" or "raw" --- "pretty" formats numbers in
# scientific format (e.g., 1.05e-9), while "raw" gives
# raw value outputs
fig_fmt = "png" # options are "png" or "pdf"
exclude_ids = NULL # a string to a path to a file with one sequence ID per
# line. these sequences will be excluded from the
# analysis.
################################################################################
Then, select the whole script, and click `run`.
## From the command-line
If running the script from the command-line, just type:
Rscript pairwise_snp_differences.R --help
Or, if in Windows:
Rscript.exe pairwise_snp_differences.R --help
That will give you a run down of the different parameters.
# Building the script
Below, I outline how I got to the script. This will allow interested people to
ammend, and expand depending on their specific needs.
## Load necessary libraries
```{r, load_libs, message=FALSE}
require(ape) # a basic phylogenetic package used to load sequence data and calculate distances
require(spider) # a DNA barcoding package with some functionality of interest
require(geiger) # a package for macroevolutionary simulation and estimating parameters related to diversification from comparative phylogenetic data.
```
If you don't have them, just use the `install.packages()` command to install them:
```{r, install_pack_example, eval=FALSE}
install.packages("ape")
```
**To run the script, you only need the `ape` package. The others are only necessary
if you wish to run this tutorial**
## Load the sequence data
Here, I am using some example data provided with the package `ape`. The data consists of 15 mitochondrial cytochrome `b` sequences from the woodmouse.
```{r, data_loading}
woodm <- read.FASTA(file = "test/woodm.fa") # woodm.fa is a FASTA file containing multiple aligned sequences, as outputted from nullabor
```
It is easy to perform some basic tree-building within `R`. For instance:
```{r, basic_nj}
#calculate distance
woodm_raw_dist <- dist.dna(x = woodm, model = "raw")
#use distance to build a Neighbour-Joining tree
woodm_nj_tree <- nj(X = woodm_raw_dist)
#plot the tree
plot(woodm_nj_tree)
```
## Load metadata
To calculate the statistics of interest, we need some metadata that groups the sequences/isolates into categories. These might be MLST categories, or any other grouping of interest (e.g., isolates could be grouped by geography or year). The grouping categories should be in a `CSV` or `tab` delimited file, with the one column containing the sequence IDs, as in the FASTA file above, and one or more columns corresponding to each of the classification schemes of interest.
For this example, we will calculate summary statistics for the pairwise differences among the three clades identified in the NJ tree above (defined by the three basal branches). I'll first show you how this metadata could be generated from a tree within `R`. But, most likely you will have this information ready from other sources.
```{r, meta_data}
# first, we will figure out the basal node for each of the three clades
woodm_nj_tree$node.label<-((length(woodm_nj_tree$tip)+1):((length(woodm_nj_tree$tip)*2)-1))
plot(woodm_nj_tree, show.node.label = T) # as we can see, these are 17, 20, and 22
# we can then pull out the tips associated with each of basal nodes
clade_a <- tips(phy = woodm_nj_tree, node = 17)
names(clade_a) <- rep("clade_a", length(clade_a))
clade_b <- tips(phy = woodm_nj_tree, node = 20)
names(clade_b) <- rep("clade_b", length(clade_b))
clade_c <- tips(phy = woodm_nj_tree, node = 22)
names(clade_c) <- rep("clade_c", length(clade_c))
#metadata object
metadf <- do.call('rbind', lapply(list(clade_a, clade_b, clade_c), function(group) data.frame(seq_id = group, clade = names(group))))
print(metadf)
```
In case that your data is already saved in a `CSV` or `tab` delimited file, you can load it with the following:
```{r, laod_metadata}
# I have saved the data.frame to a CSV file as an examle: woodm_grouping.csv
# To load a CSV file do the following
metadf <- read.table(file = "test/woodm_grouping.csv", sep = ",", header = T)
# if tab-delimited file change sep = ',' to sep = '\t'
```
Now, we can write a function that will take the distance object we calculated above, and return some summary statistics of interest.
```{r, summary_snp_info}
summ_distances <- function(categories, dist_obj){
#dist_obj is a distance object produced by using the dist.dna() function of ape
#categories is a data.frame with two columns:
# - seq_id: that matches the sequence ids in dist_obj
# - groups: that assigns the individual seq_ids to a group
# some sanity checks
if(class(dist_obj) != 'dist' & class(dist_obj) != 'matrix') {
stop("dist_obj is not an object of type dist or matrix!
Please use dist.dna() to create a distance object first OR
input a CSV file with count of differences produced by
nullabor")
}
if(!is.data.frame(categories)){
stop("categories must be a data.frame!
Please create a data.frame with the metadata first.")
}
if(ncol(categories) > 2) {
warning("Number of colums in categories is >2,
taking the first two columns only")
categories <- categories[,c(1,2)]
}
if(!all(sort(names(categories)) == sort(c("seq_id", "groups")))) {
warning("The columns of categories do not have names this function
recognizes. It will assume that the first column contains seq_ids,
and the second column the relevant categories.")
names(categories) <- c("seq_id", "groups")
}
# calculations
dat <- as.matrix(dist_obj)
taxa <- unique(as.character(categories[,'groups']))
n_taxa <- length(taxa)
total_comp <- (n_taxa^2 + n_taxa)/2
out <- data.frame(grp1 = character(total_comp),
grp2 = character(total_comp),
comp = character(total_comp),
N = numeric(total_comp),
type = rep("inter-group", total_comp),
mu = numeric(total_comp),
sd = numeric(total_comp),
min_dist = numeric(total_comp),
max_dist = numeric(total_comp), stringsAsFactors = F)
n_comp = 1
for(i in 1:n_taxa){
g1 <- taxa[i]
seq_1 <- as.character(categories[categories$groups == g1, 'seq_id'])
for(j in i:n_taxa){
g2 <- taxa[j]
seq_2 <- as.character(categories[categories$groups == g2, 'seq_id'])
tmp_dat <- dat[seq_1, seq_2]
out[n_comp, "grp1"] <- g1
out[n_comp, "grp2"] <- g2
out[n_comp, "N"] <- length(tmp_dat)
out[n_comp, "comp"] <- paste(g1, g2, sep='_')
if(i == j) {
if(length(tmp_dat) > 1){
#if length is one, this results in a empty set.
#so, added this condition to fix the problem
tmp_dat <- tmp_dat[lower.tri(tmp_dat)]
}
out[n_comp, 'type'] <- 'intra-group'
out[n_comp, "comp"] <- g1
}
if (length(tmp_dat) > 1 & max(tmp_dat) > min(tmp_dat)) {
out[n_comp, "mu"] <- mean(tmp_dat)
out[n_comp, "sd"] <- sd(tmp_dat)
out[n_comp, "min_dist"] <- min(tmp_dat)
out[n_comp, "max_dist"] <- max(tmp_dat)
} else {
out[n_comp, "mu"] <- mean(tmp_dat)
out[n_comp, "sd"] <- 0
out[n_comp, "min_dist"] <- min(tmp_dat)
out[n_comp, "max_dist"] <- max(tmp_dat)
}
n_comp = n_comp + 1
}
}
out$type <- factor(out$type, levels = c("intra-group", "inter-group"))
out$comp <- factor(out$comp, levels = out$comp[order(out$type, out$comp)])
return(out)
}
```
```{r, test_func}
names(metadf) <- c("seq_id", "groups")
results <- summ_distances(dist_obj = woodm_raw_dist, categories = metadf)
```
```{r, results_table, results='asis', echo=FALSE, message=FALSE}
require(pander)
pander(subset(results, select = -c(comp)), caption = "Table of summary pairwise SNP differences among groups of woodmouse cytb sequences.", split.tables = Inf)
```
We can then plot the results.
```{r, plot_diff, message=FALSE}
require(ggplot2)
ggplot(results, aes(x = comp, y = mu, colour = type)) +
geom_point(size = 4) +
geom_errorbar(aes(ymax = mu + sd, ymin = mu - sd, width = 0.05)) +
geom_point(aes(x = comp, min_dist), size = 3) +
geom_point(aes(x = comp, max_dist), size = 3) +
xlab("Pairwise comparisons") +
ylab("Mean proportional SNP differences\n(errorbars: sd; points: min/max)") +
scale_colour_discrete(name = "Comparison type") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
# Additional links of potential interest
A more in-depth tutorial on using `R` to do phylogenetic-type analyses can be found [here](http://www.r-phylo.org/wiki/Main_Page).
A more in-depth tutorial on loading and manipulating DNA sequences in `R` can be found [here](http://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/index.html).
# Session info
```{r, session_info, echo = F}
sessionInfo()
```
# Contact info
Anders Goncalves da Silva (andersgs at gmail dot com).