-
Notifications
You must be signed in to change notification settings - Fork 0
/
01_data_comparison.Rmd
454 lines (349 loc) · 26.6 KB
/
01_data_comparison.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
---
title: "Training datasets used for AMP prediction"
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
```
```{r, echo = FALSE}
library(ampir)
library(tidyverse)
library(patchwork)
library(ComplexUpset)
library(stringdist)
set.seed(3)
```
# Introduction
The effectiveness of supervised learning methods for predictive modeling depends on the degree to which the training data is a good match for the query data for which prediction is sought. For antimicrobial peptide (AMP) prediction, training typically occurs on a dataset composed of sequences with known antimicrobial activity (positive AMPs) and a background or negative set comprised of sequences that presumably do not have antimicrobial activity (non-AMPs). In this document we will explore the composition of both positive and negative datasets used by AMP predictors. We find that although there is much overlap in their use of positive AMPs there are substantial differences in (a) the use of precursor vs mature peptides, (b) the extent to which positive and negative data have length-matched distributions, and (c) the ratio of positive and negative cases (balance).
Our focus is on the use of AMP predictors as part of an 'omics-based novel AMP discovery workflow. In this context the input/query data is usually the complete set of predicted proteins for an organism, which might be derived from translations of gene models or from translated transcriptome sequences. Here we refer to such a dataset as a proteome.
# Reference proteomes
As representative examples of the type of data that would be used as input in 'omics scanning applications we used the complete proteome sets of *Arabidopsis thaliana* (a plant) and *Homo sapiens* (human). Both organisms have been intensively studied and as a consequence, their reference proteomes are likely to include sequences for the vast majority of protein-coding genes. Functional information for the proteins, including AMPs in these species is among the most complete available, but even for these organisms it is highly likely that some known AMPs have not been identified. Our assumption here is that these proteomes are completely classified for AMP activity (i.e. every AMP correctly identified).
```{r}
reference_proteomes <- read_tsv("data/proteomes/uniprot-proteome UP000005640.tab", show_col_types = FALSE) %>%
rbind(read_tsv("data/proteomes/uniprot-proteome UP000006548.tab", show_col_types = FALSE)) %>%
mutate(class = case_when(str_detect(`Keyword ID`, "KW-0929") ~ "AMP", TRUE ~ "non-AMP")) %>%
mutate(name = case_when(str_detect(Organism, "Homo") ~ "Homo sapiens", TRUE ~ "Arabidopsis thaliana")) %>%
filter(Length >10) %>%
filter(Length <3000) %>%
filter(grepl(Sequence, pattern='^[ARNDCEQGHILKMFPSTWYV]+$')) %>%
add_column(dataset=NA) %>%
select(seq_name=`Entry name`,seq_aa=Sequence, class,dataset,length=Length,name)
```
# AMP predictor data
The training and test sets of nine AMP predictors (**Table 1.1**) were examined to assess the degree to which they varied between predictors, and how well they matched the proteomes of *Arabidopsis thaliana* and *Homo sapiens*.
**Table 1.1:** Summary table of the number of positive and negative sequences present in the training and test set in six AMP predictors
| AMP predictor | Train - AMPs | Train - non-AMPs | Test - AMPs | Test - non-AMPs |
| :--- | :---: | :---: | :---: | :---: |
| iAMP-2L | 897 | 2,405 | 920 | 920 |
| amPEP | 3,268 | 166,791 | *NS* | *NS* |
| Deep-ampEP30 | 1,529 | 1,529 | 94 | 94 |
| amPEPpy | 3,268 | 3,268 | *NS* | *NS* |
| AMP Scanner v2 | 1,066 | 1,066 | 712 | 712
| AMPlify | 3,338 | 3,338 | 835 | 835
| AmpGram | 2,216 | 2,216 | 247 | 247
| ampir_precursor v1 | 1,187 | 11,864 | 296 | 2,966
| ampir_mature v1 | 2,586 | 2,657 | 646 | 664
*NS: not specified
**Table 1.2:** AMP Predictors with their publications and accessibility
| AMP predictor name | Reference | Availability |
| ------------------------ | --------------- | -------------- |
| amPEP | [Bhadra et al. 2018](https://doi.org/10.1038/s41598-018-19752-w) | [MATLAB source code](https://sourceforge.net/projects/axpep/files/AmPEP_MATLAB_code/)
| Deep-amPEP30 | [Yan et al. 2020](https://doi.org/10.1016/j.omtn.2020.05.006) | [Web Server](https://cbbio.online/AxPEP/)
| amPEPpy | [Lawrence et al 2020](http://dx.doi.org/10.1093/bioinformatics/btaa917) | [Python script](https://github.com/tlawrence3/amPEPpy)
| AMP scanner v2 | [Veltri et al. 2018](https://doi.org/10.1093/bioinformatics/bty179) | [amp scanner webserver](https://www.dveltri.com/ascan/v2/ascan.html)
| AMPlify | [Li et al. 2020](https://www.biorxiv.org/content/10.1101/2020.06.16.155705v1.full) | [Github repository](https://github.com/bcgsc/AMPlify)
| AmpGram | [Burdukiewicz et al 2020](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7352166/) | [Github repository](https://github.com/michbur/AmpGram)
| ampir | [Fingerhut et al 2020](https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btaa653) | [Github repository](https://github.com/legana/ampir)
**iAMP-2L Data**
The benchmark data provided by [Xiao et al. 2013](https://doi.org/10.1016/j.ab.2013.01.019) used for [iAMP-2L](http://www.jci-bioinfo.cn/iAMP/data.html) has been used in several studies to provide a somewhat independent estimate of prediction accuracy. Their training data, or benchmark dataset as they termed it, comprises 897 AMPs and 2,405 non-AMPs. Their test or independent dataset comprises of 920 AMPs and 920 non-AMPs.
```{r}
iAMP2L_train <- read_faa("data/amp_predictors/iAMP-2L/xiao_benchmark.fasta") %>%
mutate(class = ifelse(grepl(seq_name,pattern = "^AP"), "AMP", "non-AMP")) %>%
distinct(seq_name, .keep_all = TRUE) %>%
add_column(dataset="Train")
iAMP2L_test <- read_faa("data/amp_predictors/iAMP-2L/xiao_independent.fasta") %>%
mutate(class = ifelse(grepl(seq_name,pattern = "^AP"), "AMP", "non-AMP")) %>%
add_column(dataset="Test")
iAMP2L_data <- rbind(iAMP2L_train, iAMP2L_test) %>%
mutate(length = nchar(seq_aa)) %>% add_column(name="iAMP-2L")
```
**AmPEP Training Data**
The AmPEP 2018 AMP predictor provides its training data available directly for download from [https://cbbio.cis.um.edu.mo/software/AmPEP/](https://cbbio.cis.um.edu.mo/software/AmPEP/). The final training dataset used by amPEP is a large dataset of 166,791 non-AMP sequences and 3,268 AMPs. amPEP used the Xiao et al. 2013 dataset from the iAMP-2L predictor as a test set (see above).
```{r}
ampep_data <- read_faa("data/amp_predictors/amPEP/M_model_train_nonAMP_sequence.fasta") %>% add_column(class="non-AMP") %>%
rbind(read_faa("data/amp_predictors/amPEP/M_model_train_AMP_sequence.fasta") %>% add_column(class="AMP")) %>%
add_column(dataset = "Train") %>%
mutate(length = nchar(seq_aa)) %>% add_column(name="AmPEP") %>%
mutate(seq_name = paste0("amPEP_trainset_neg", 1:n()))
```
AmPEP was redesigned in 2020 as Deep-AmPEP30 to focus on short AMPs ( < 30 amino acids) by [Yan et al](https://doi.org/10.1016/j.omtn.2020.05.006) and its training and test data is available [here](https://cbbio.online/AxPEP/?action=dataset). Deep-AmPEP30's training set consists of 1,529 AMPs and non-AMPs and their test set consists of 94 AMPs and non-AMPs.
```{r}
deep_ampep_data <- read_faa("data/amp_predictors/deepamPEP30/train_ne.fasta") %>%
rbind(read_faa("data/amp_predictors/deepamPEP30/test_ne.fasta")) %>%
add_column(class = "non-AMP") %>%
rbind(read_faa("data/amp_predictors/deepamPEP30/train_po.fasta") %>%
rbind(read_faa("data/amp_predictors/deepamPEP30/test_po.fasta")) %>%
add_column(class = "AMP")) %>%
mutate(dataset = case_when(
str_detect(seq_name, "^test") ~ "Test",
str_detect(seq_name, "^uni") ~ "Test",
TRUE ~ "Train")) %>%
mutate(length = nchar(seq_aa)) %>%
add_column(name="deep_AmPEP")
```
AmPEP was additionally created as a python application, amPEPpy, by [Lawrence et al. 2020](https://doi.org/10.1093/bioinformatics/btaa917). amPEPpy's training data originated from amPEP but were modified to ensure that positive and negative sets had the same length distribution. The data were obtained via amPEPpy's [GitHub page](https://github.com/tlawrence3/amPEPpy). No specific test data were used for this predictor. The authors used OOB error to calculate elements of the confusion matrix meaning that the composition of test data was effectively identical to the training set.
```{r}
ampeppy_data <- read_faa("data/amp_predictors/amPEPpy/M_model_train_nonAMP_sequence.numbered.proplen.subsample.fasta") %>%
add_column(class="non-AMP") %>%
rbind(read_faa("data/amp_predictors/amPEPpy/M_model_train_AMP_sequence.numbered.fasta") %>%
add_column(class="AMP")) %>%
add_column(dataset = "Train") %>%
mutate(length = nchar(seq_aa)) %>%
add_column(name="AmPEPpy")
```
**AMP Scanner v2 Data**
AMP Scanner ([Veltri et al. 2018](https://doi.org/10.1093/bioinformatics/bty179])) data used for training, testing and evaluation are available directly for download from [https://www.dveltri.com/ascan/v2/about.html](https://www.dveltri.com/ascan/v2/about.html). AMP Scanner's training data consisted of 1,066 AMP and non-AMP sequences. Their testing data consisted of 712 AMP and non-AMP sequences.
```{r}
ampscan_train_data <- read_faa("data/amp_predictors/AMP_Scan2_OrigPaper_Dataset/AMP.tr.fa") %>%
rbind(read_faa("data/amp_predictors/AMP_Scan2_OrigPaper_Dataset/AMP.eval.fa")) %>%
rbind(read_faa("data/amp_predictors/AMP_Scan2_OrigPaper_Dataset/DECOY.tr.fa")) %>%
rbind(read_faa("data/amp_predictors/AMP_Scan2_OrigPaper_Dataset/DECOY.eval.fa")) %>%
add_column(dataset="Train")
ampscan_test_data <- rbind(read_faa("data/amp_predictors/AMP_Scan2_OrigPaper_Dataset/AMP.te.fa")) %>%
rbind(read_faa("data/amp_predictors/AMP_Scan2_OrigPaper_Dataset/DECOY.te.fa")) %>%
add_column(dataset="Test")
ampscan_data <- rbind(ampscan_train_data, ampscan_test_data) %>%
mutate(class = case_when(str_detect(seq_name, "^Uni") ~ "non-AMP", TRUE ~ "AMP")) %>%
mutate(length = nchar(seq_aa)) %>%
add_column(name="AMP Scanner v2") %>%
relocate(class, .before = dataset)
```
**AMPlify data**
AMPlify used multiple attention mechanisms and ensemble deep learning to create an AMP prediction model ([Li et al. 2020](https://doi.org/10.1101/2020.06.16.155705)). Similar to most AMP predictors, a proportion of their AMP dataset originated from the general [Antimicrobial Peptide Database](http://aps.unmc.edu/AP). However, they also used the [Database of Anuran Defense Peptides](http://split4.pmfst.hr/dadp/0) which focuses on AMPs from frogs and toads. Their negative dataset originated from Swiss-Prot and, like most AMP predictors, they excluded any proteins that had annotations which referred to potential antimicrobial activity. However, like ampir, they retain the secreted proteins. Their data is available from the [AMPlify's software GitHub page](https://github.com/bcgsc/AMPlify). Their training set consists of 3,338 AMPs and 3,338 non-AMPs and their test set consists of 835 AMPs and 835 non-AMPs.
```{r}
amplify_data <- read_faa("data/amp_predictors/AMPlify/AMP_train_20190414.fa") %>%
rbind(read_faa("data/amp_predictors/AMPlify/non_AMP_train_20190414.fa")) %>%
add_column(dataset = "Train") %>%
rbind(read_faa("data/amp_predictors/AMPlify/AMP_test_20190414.fa") %>%
rbind(read_faa("data/amp_predictors/AMPlify/non_AMP_test_20190414.fa")) %>%
add_column(dataset = "Test")) %>%
mutate(class = case_when(str_detect(seq_name, "^trAMP|^teAMP") ~ "AMP", TRUE ~ "non-AMP")) %>%
mutate(length = nchar(seq_aa)) %>% add_column(name="AMPlify") %>% relocate(class, .before = dataset)
```
**AmpGram data**
AmpGram was created in 2020 and in addition to the standard AMPs, it also focuses on predicting longer proteins that contain antimicrobial activity, such as the milk protein, lactoferrin, and on non-AMP proteins which produce antimicrobial proteolysis products, such as human thrombin. AmpGram uses amino acid motifs and the random forest model to classify AMPs [Burdukiewicz et al. 2020](https://doi.org/10.3390/ijms21124310). AmpGram's analysis details and data can be obtained from [AmpGram's analysis repository](https://github.com/michbur/AmpGram-analysis). Their training data is not specified as a file but their benchmark data can be found in the `benchmark.fasta` file. This test set contains 247 AMP and non-AMP sequences. AmpGram also used the dataset from [Gabere and Noble 2017](https://doi.org/10.1093/bioinformatics/btx081) which used AMPs from the [DAMPD](https://dx.doi.org/10.1093%2Fnar%2Fgkr1063) and [APD3](https://doi.org/10.1093/nar/gkv1278) AMP databases and non-AMPs from UniProt.
```{r}
ampgram_test_data <- read_faa("data/amp_predictors/AmpGram/benchmark.fasta") %>%
mutate(class = case_when(str_detect(seq_name, "^dbAMP") ~ "AMP", TRUE ~ "non-AMP")) %>%
add_column(dataset = "Test") %>%
mutate(length = nchar(seq_aa)) %>%
add_column(name = "AmpGram")
```
**ampir**
ampir was divided in two different models, precursor, which focuses on longer sequences (between 60-300) and mature, which only contains short sequences (between 10-50)
```{r}
ampir_prec_feats_train <- readRDS("data/ampir_v1/featuresTrain_precursor_imbal.rds") %>% mutate(dataset = "Train")
ampir_prec_feats_test <- readRDS("data/ampir_v1/featuresTest_precursor_imbal.rds") %>% mutate(dataset = "Test")
ampir_prec_feats <- rbind(ampir_prec_feats_train, ampir_prec_feats_test) %>% mutate(name = "ampir_precursor")
ampir_mat_feats_train <- readRDS("data/ampir_v1/featuresTrain_mature.rds") %>% mutate(dataset = "Train")
ampir_mat_feats_test <- readRDS("data/ampir_v1/featuresTest_mature.rds") %>% mutate(dataset = "Test")
ampir_mat_feats <- rbind(ampir_mat_feats_train, ampir_mat_feats_test) %>% mutate(name = "ampir_mature")
ampir_feats <- rbind(ampir_prec_feats, ampir_mat_feats) %>% mutate(class = ifelse(Label == "Tg", "AMP","non-AMP"))
ampir_data <- ampir_feats %>% select(seq_name, seq_aa, class, dataset, length, name)
write_tsv(ampir_data,"data/clustering/ampir_data.tsv")
```
```{r}
all_predictor_data <- rbind(iAMP2L_data, ampep_data, deep_ampep_data, ampeppy_data, ampscan_data, amplify_data, ampgram_test_data, ampir_data)
all_predictor_data <- all_predictor_data %>% filter(length >10)
```
First we compare the test and training data of each predictor. For all predictors other than AmPEP the length distributions of test and training were almost identical reflecting the fact the in most cases the test set was a randomly held-back portion of the same data used to generate the training set. AmPEP is the exception to this and has a different distribution for the test data, this being derived from the Xiao et al benchmark set.
```{r}
# For these two predictors the test datasets are already in the data.
# We just need to copy some rows to add them.
ampep_test <- all_predictor_data %>%
filter(name=="iAMP-2L") %>%
filter(dataset=="Test") %>%
mutate(name="AmPEP")
test_vs_train_data <- rbind(all_predictor_data, ampep_test) %>%
mutate(name = factor(name, levels = c("iAMP-2L", "AMP Scanner v2", "AmPEP", "AmPEPpy", "deep_AmPEP", "AMPlify", "AmpGram", "ampir_precursor", "ampir_mature")))
```
```{r, fig.width=8, fig.height=10}
model_seqlength_tr <- test_vs_train_data %>%
filter(dataset %in% c("Train")) %>%
filter(!(name %in% c("Homo sapiens", "Arabidopsis thaliana"))) %>%
ggplot() +
geom_line(aes(x = length, colour = class), stat = "density") +
facet_wrap(~name, ncol=1,scales = "free_y", drop = FALSE) +
labs(x = "Sequence length", y = "Density", colour = "") +
xlim(0,300) +
theme(legend.position = "bottom",
strip.text.y.right = element_text(angle = 0, hjust = 0),
strip.background = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major = element_blank(),
panel.background = element_rect(fill = "white", colour = "grey"),
legend.key = element_rect(fill = "white"),
strip.text = element_text(face = "italic")) +
scale_colour_manual(values = c("AMP" = "blueviolet", "non-AMP" = "forestgreen"))
model_seqlength_te <- test_vs_train_data %>%
filter(dataset %in% c("Test")) %>%
filter(!(name %in% c("Homo sapiens", "Arabidopsis thaliana"))) %>%
ggplot() +
geom_line(aes(x = length, colour = class), stat = "density") +
facet_wrap(~name, ncol=1,scales = "free_y", drop = FALSE) +
labs(x = "Sequence length", y = "Density", colour = "") +
xlim(0,300) +
theme(legend.position = "bottom",
strip.text.y.right = element_text(angle = 0, hjust = 0),
strip.background = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major = element_blank(),
panel.background = element_rect(fill = "white", colour = "grey"),
legend.key = element_rect(fill = "white"),
strip.text = element_text(face = "italic")) +
scale_colour_manual(values = c("AMP" = "blueviolet", "non-AMP" = "forestgreen"))
proteomes <- ggplot(reference_proteomes) +
geom_line(aes(x = length, colour = class), stat = "density") +
facet_wrap(~name, ncol=1,scales = "free_y", drop = FALSE) +
labs(x = "Sequence length", y = "Density", colour = "") +
xlim(0,300) +
theme(legend.position = "bottom",
strip.text.y.right = element_text(angle = 0, hjust = 0),
strip.background = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major = element_blank(),
panel.background = element_rect(fill = "white", colour = "grey"),
legend.key = element_rect(fill = "white"),
strip.text = element_text(face = "italic")) +
scale_colour_manual(values = c("AMP" = "blueviolet", "non-AMP" = "forestgreen"))
(model_seqlength_tr + model_seqlength_te + proteomes) + plot_annotation(tag_levels = "A") + plot_layout(guides = "collect") & theme(legend.position = "bottom")
ggsave("figures/train_vs_test_vs_proteomes.png", width = 9, height = 9)
```
**Figure 1.1:** Comparison of sequence length distributions for positive (AMP; purple) and negative (non-AMP; green) fractions in training (A) and test data (B) of nine AMP predictors, and for the proteomes of *A. thaliana* and *H. sapiens* (C).
# Overlap in positive AMP data between predictors
There are several databases that list AMPs with confirmed activity, including [APD 3](http://aps.unmc.edu/AP/), [DRAMP 2.0](http://dramp.cpu-bioinfor.org/), [dbAMP](http://140.138.77.240/~dbamp/index.php) and [UniProt](https://www.uniprot.org/uniprot/?query=keyword%3A%22Antimicrobial+%5BKW-0929%5D%22&sort=score) and most predictors use one or more of these as the basis for constructing their positive AMP dataset.
Here we explore the degree of overlap between AMP predictor positive training data that exists as a result of these shared origins, and how this interacts with the length distribution of AMPs and degree to which mature peptides vs precursor proteins are included.
To examine sequence overlap between databases we used the stringdist package to calculate the [Jaro distance](https://en.wikipedia.org/wiki/Jaro–Winkler_distance) between all pairs of positive AMP sequences across all predictor training/test datasets as well as reference proteomes for Arabidopsis and Human. The Jaro distance was chosen because it is normalised for the length of both sequences and produces a value between 0 (exact match) and 1 (completely dissimilar). Highly similar sequences (Jaro distance <0.2) were considered to be the same as these are likely strong homologs and manual inspection revealed that in many cases these matches occurred between mostly identical sequences with minor differences, likely due to reporting conventions and/or minor discrepancies between databases.
```{r}
raw_train_amp_data <- list.files("data/clustering/","*train.amp.fasta",full.names = T) %>%
map_dfr(read_faa)
raw_test_amp_data <- list.files("data/clustering/","*test.amp.fasta",full.names = T) %>%
map_dfr(read_faa)
signalp_train_amp <- read_table("data/clustering/train.amp.signalp", skip = 2,col_names = FALSE)[,c(1,10)]
signalp_test_amp <- read_table("data/clustering/test.amp.signalp", skip = 2,col_names = FALSE)[,c(1,10)] %>%
filter(!str_detect(X1, "^xu")) # Exclude due to uncertain provenance
# Check that row order is identical
if ( length(which(signalp_train_amp$X1!=raw_train_amp_data$seq_name))>0 | length(which(signalp_test_amp$X1!=raw_test_amp_data$seq_name))>0){
stop()
}
amp_data <- rbind(raw_train_amp_data,raw_test_amp_data) %>%
add_column(signalp = c(signalp_train_amp$X10=="Y",signalp_test_amp$X10=="Y")) %>%
tidyr::extract(seq_name,into=c("predictor","id"),regex= "([^_]+)_(.*)_t[a-z]+_amp")
```
```{r}
amp_data_wide <- amp_data %>%
select(-id) %>%
distinct() %>%
add_column(included=TRUE) %>%
pivot_wider(id_cols = c(seq_aa, signalp), names_from = predictor, values_from = included, values_fill = FALSE)
# Calculation of Jaro distance. This is time consuming (~ 1 minute)
if ( file.exists("cache/amp_data_jw.rds")){
seqdm <- read_rds("cache/amp_data_jw.rds")
} else {
seqdm <- stringdistmatrix(amp_data_wide$seq_aa, method = "jw")
write_rds(seqdm,file="cache/amp_data_jw.rds")
}
seq_clust <- hclust(seqdm)
seq_groups <- cutree(seq_clust,h=0.2)
```
```{r}
set_data <- amp_data_wide %>%
add_column(group = seq_groups) %>%
group_by(group) %>%
summarise(seq_aa = first(seq_aa), signalp = sum(signalp)>0, across(ampep:homosapiens, ~ sum(.x)>0)) %>%
select(-group) %>%
mutate(seq_len = str_length(seq_aa)) %>%
column_to_rownames(var = "seq_aa") %>%
as.data.frame() %>%
mutate(signalp = ifelse(signalp, "Signal Peptide", "No Signal Peptide"))
preds <- c("ampep","ampeppy","ampirmature","ampirprecursor","amplify","ampscan2","deepampep","iamp2l","homosapiens","arabidopsisthaliana")
```
```{r,fig.width=12,fig.height=10}
amps_upset <- upset(set_data, preds,
annotations = list(
"Sequence length" = list(
aes=aes(x = intersection, y= seq_len),
geom=geom_violin()
)
),
queries=list(
upset_query(
intersect = c("ampep","ampeppy"),
color="black",
only_components = c("intersections_matrix")
),
upset_query(
intersect = c("ampirprecursor"),
color="black",
only_components = c("intersections_matrix")
),
upset_query(
intersect=c("ampep","ampeppy","ampirprecursor"),
color="black",
only_components = c("intersections_matrix")
)
),
base_annotations=list(
"Intersection size" = intersection_size(
counts = FALSE,
mapping = aes(fill=signalp)
) + scale_fill_manual(values = c("Signal Peptide" = "blueviolet", "No Signal Peptide" = "forestgreen"))
+ theme(legend.position = c(-0.15, 0.9), legend.title = element_blank())
),
name = NULL,
width_ratio = 0.1,
min_size=50
)
amps_upset
ggsave(filename="figures/amp_data_overlaps.png", width = 10, height = 8)
```
**Figure 1.3:** UpSet plot showing overlap between positive training data used for eight AMP predictors, and known AMPs within the reference proteomes of *Arabidopsis thaliana* and *Homo sapiens*.
The UpSet plot shown highlights some key patterns of overlap between datasets and how this relates to their length and precursor protein status (indicated by signal peptide). One key trend is the high degree of uniqueness of sequences used by `ampir` (first and second bars). Also note that there are several large groups of AMPs shared by many predictors (columns 3 and 4) which appear to be almost exclusively comprised of mature sequences (red bars, short length distributions). These mature peptides comprise only a very small fraction of the A. thaliana and H. sapiens proteomes.
# Typical Sequence Structure of AMP Precursor Proteins
As shown in Figures 1.2 and 1.3 there is a significant divide between the composition of AMP predictor training sets based on the length distribution of included sequences. This is significant because many AMPs are first secreted as a longer precursor sequence that then undergoes cleavage to produce a mature (shorter) peptide. Some databases only list this shorter peptide sequence as it is the active peptide, however, when working with raw proteome input data the sequences given will be translated coding sequences from gene models and therefore correspond to the precursor sequence with no reliable way to accurately deduce the mature sequence. A survey of precursor sequences for AMPs listed in Uniprot can be used to reveal the typical sequence structure of precursors. Here we take advantage of the fact that for many well characterised AMPs in SwissProt the positions of the signal peptide, mature peptide and c-terminal sequence are given.
```{r}
uniprot <- readxl::read_excel("data/uniprot-keyword-Antimicrobial+[KW-0929]-reviewed-April2021.xlsx")
uniprot_precursor <- uniprot %>%
mutate(pep_start = as.integer(str_match(Peptide, "PEPTIDE.*?([0-9]+)")[,2])) %>%
mutate(pep_end = as.integer(str_match(Peptide, "PEPTIDE.*?([0-9]+)..>?([0-9]+)")[,3])) %>%
mutate(sig_start = as.integer(str_match(`Signal peptide`, "SIGNAL.*?([0-9]+)")[,2])) %>%
mutate(sig_end = as.integer(str_match(`Signal peptide`, "SIGNAL.*?([0-9]+)..([0-9]+)")[,3])) %>%
mutate(pep_length = pep_end - pep_start + 1) %>%
mutate(is_mature_peptide = abs(pep_length - Length)<3) %>%
mutate(signal = !is.na(`Signal peptide`)) %>%
mutate(maturity = ifelse(is.na(is_mature_peptide),"Unknown",is_mature_peptide)) %>%
mutate(maturity = ifelse(maturity==FALSE,"Precursor",maturity)) %>%
mutate(maturity = ifelse(maturity==TRUE,"Mature",maturity)) %>%
filter(Length>50) %>% filter(!is.na(pep_start))
# filter(maturity=="Precursor")
pos_s <- uniprot_precursor %>%
select(sig_start,sig_end) %>% na.omit() %>%
pmap_dfr( ~ data.frame(pos=seq(.x,.y),value="s"))
pos_p <- uniprot_precursor %>%
select(pep_start,pep_end) %>% na.omit() %>%
pmap_dfr( ~ data.frame(pos=seq(.x,.y),value="p"))
pos_c <- uniprot_precursor %>%
select(pep_end,Length) %>% na.omit() %>%
pmap_dfr( ~ data.frame(pos=seq(.x+1,.y),value="c"))
rbind(pos_s, pos_p,pos_c) %>%
ggplot() + geom_line(aes(x = pos, colour = value), stat = "density", size = 1) +
scale_colour_manual(label = c("c" = "C-terminal", "p" = "Mature AMP", "s" = "Signal Peptide"),
values = c("#CC79A7", "#009E73", "#D55E00")) +
labs(x = "Amino Acid Position in Precursor Sequence", y = "Density", colour = "") +
theme_minimal() +
theme(legend.position = "bottom",
panel.grid.major = element_blank(),
plot.background = element_rect(fill = "white", colour = "white"))
ggsave("figures/precursor_sequence.png", width = 7, height = 6)
```
**Figure 1.4:** Sequence composition of 831 AMP sequences with length > 50 (likely precursors) in SwissProt showing the relative locations of Signal Peptide, Mature AMP and C-terminal sequences.