-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathiso_quant_phospho.Rmd
272 lines (204 loc) · 9.45 KB
/
iso_quant_phospho.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
# Isobaric Quantification: Phosphoproteomics {#iso-phospho}
This pipeline shows how to process phosphoproteomics TMT data with `PlexedPiper`, though it can be used for any type of post-translational modification (PTM) TMT data. We will use data package 3626, which is "PlexedPiperTestData phospho". In addition to PlexedPiper, we will also need MSnID (the basis for PlexedPiper), PNNL.DMS.utils to interface with PNNL's DMS, and Biostrings to create an `AAStringSet` object from a FASTA file. Since a lot of these steps are the same as in Section \@ref(iso-global), a lot of the details will be omitted.
```{r include=FALSE}
# Global chunk options
knitr::opts_chunk$set(message=FALSE, warning=FALSE,
fig.align='center', fig.asp=0.65, out.width='75%')
```
```{r iso-lab-setup-2}
## Install missing packages
if (!require("remotes", quietly = T)) install.packages("remotes")
git_packages <- c("MSnID@pnnl-master", "PlexedPiper", "PNNL.DMS.utils")
for (pkg_i in git_packages) {
if (!require(sub("@.*", "", pkg_i), quietly = T, character.only = T))
remotes::install_github(file.path("PNNL-Comp-Mass-Spec", pkg_i))
}
if (!requireNamespace("BiocManager", quietly = T))
install.packages("BiocManager")
if (!require("Biostrings", quietly = T))
BiocManager::install("Biostrings")
## ------------------------
library(MSnID)
library(PlexedPiper)
library(PNNL.DMS.utils)
library(Biostrings)
```
```{r include=FALSE}
library(knitr) # embed images
library(kableExtra)
library(dplyr)
```
## Prepare MS/MS Identifications
### Read MS-GF+ Data
```{r results='hide'}
# Read MS-GF+ data
data_package_num <- 3626 # phospho
msnid <- read_msgf_data_from_DMS(data_package_num)
```
```{r}
show(msnid)
```
### Correct Isotope Selection Error
```{r}
# Correct for isotope selection error
msnid <- correct_peak_selection(msnid)
```
### Remove Unmodified Peptides {#remove-unmodified-peptides}
Generally, we will remove unmodified peptides before any sort of filtering steps; however, unmodified peptides will be removed automatically in Section \@ref(map-mod-sites), so this step can be skipped if we need to tally the number of modified and unmodified peptides toward the end of processing.
In this case, the phosphorylation of an amino acid is marked by a `*` appearing next in the sequence. We can filter out peptides that do not contain this symbol with `apply_filter`. In regular expressions, the `*` is a special character called a metacharacter that must be escaped with backslashes, and the backslashes must also be escaped, since they are enclosed within a nested string (`"''"`). For non-metacharacters, it is not necessary to include the backslashes.
```{r}
# Remove non-phosphorylated peptides
# (peptides that do not contain a *)
msnid <- apply_filter(msnid, "grepl('\\\\*', peptide)")
show(msnid)
```
### Remove Contaminants
```{r}
# Remove contaminants
msnid <- apply_filter(msnid, "!grepl('Contaminant', accession)")
show(msnid)
```
### Improve Phosphosite Localization
Phospho datasets involve Ascore jobs for improving phosphosite localization. There should be one AScore job per data package. If the Ascore job does not exist, see <a href="https://prismwiki.pnl.gov/wiki/AScore_Job_Creation">AScore Job Creation</a> for how to set it up. The fetched object is a data.frame that links datasets, scans and original PTM localization to newly suggested locations. Importantly, it contains `AScore` column that "measures the probability of correct phosphorylation site localization" [@beausoleil_probability-based_2006]. AScore > 17 is considered confident.
```{r results='hide'}
# Filter PTMs by Ascore - only for phospho data
ascore <- get_AScore_results(data_package_num)
msnid <- best_PTM_location_by_ascore(msnid, ascore)
```
### MS/MS ID Filter: Peptide Level
```{r}
# 1% FDR filter at the peptide level
msnid <- filter_msgf_data(msnid, level = "peptide", fdr.max = 0.01)
show(msnid)
```
### MS/MS ID Filter: Protein Level
This step is unnecessary for PTM data, since the cross-tab is not created at the protein level, so it is skipped.
### Inference of Parsimonious Protein Set
If a protein was detected in the global proteomics results, we may be more confident that it will appear in the PTM results. We can perform prioritized inference of the protein set to ensure that, if a protein is reported in the global cross-tab, and it is present in the PTM MSnID after filtering, it will be included in the final PTM MSnID. We set the proteins from the global cross-tab as the prior. By default, peptides are allowed to match multiple proteins in the prior. If duplicates are not allowed, we can set the `refine_prior` argument to `TRUE`.
```{r}
# Proteins from global proteomics cross-tab
load("./data/3442_global_proteins.RData")
# Prioritized inference of parsimonious protein set
msnid <- infer_parsimonious_accessions(msnid, unique_only = FALSE,
prior = global_proteins,
refine_prior = FALSE)
show(msnid)
```
</br>
### Map Sites to Protein Sequences {#map-mod-sites}
`MSnID::map_mod_sites` creates a number of columns describing mapping of the modification sites onto the protein sequences. The most important for the user is `SiteID`. `names(fst)` must match `accessions(msnid)`; usually, we will have to modify names to remove everything after the first word.
```{r}
# Create AAStringSet
path_to_FASTA <- path_to_FASTA_used_by_DMS(data_package_num)
fst <- readAAStringSet(path_to_FASTA)
# Remove contaminants
fst <- fst[!grepl("Contaminant", names(fst)), ]
# First 6 names
head(names(fst))
```
```{r}
# Modify names to match accessions(msnid)
# Remove any space followed by any number of characters
names(fst) <- sub(" .*", "", names(fst))
# First 6 names
head(names(fst))
```
The names are in the proper format, so we can continue with the main mapping call. This will also remove any unmodified peptides, if Section \@ref(remove-unmodified-peptides) was skipped.
```{r}
# Main mapping call
msnid <- map_mod_sites(object = msnid,
fasta = fst,
accession_col = "accession",
peptide_mod_col = "peptide",
mod_char = "*", # asterisk for phosphorylation
site_delimiter = ";") # semicolon between multiple sites
```
Table \@ref(tab:phospho-msnid-table) shows the first 6 rows of the processed MS-GF+ output.
```{r phospho-msnid-table, echo=FALSE}
fix_phos <- function(x) {
gsub("\\*", "\\\\*", x)
}
x <- head(psms(msnid), 6) %>%
mutate_if(is.character, fix_phos)
kable(x, digits = 3, row.names = FALSE,
caption = "<left>First 6 rows of the processed MS-GF+ results.</left>",
escape = FALSE, format = "html") %>%
kable_styling(full_width = FALSE, font_size = 12,
bootstrap_options = c("hover", "condensed")) %>%
scroll_box(height = "90%") # restrict width
```
</br>
### Remove Decoy PSMs
```{r}
# Remove Decoy PSMs
msnid <- apply_filter(msnid, "!isDecoy")
show(msnid)
```
## Prepare Reporter Ion Intensities
### Read MASIC Output
```{r results='hide'}
# Read MASIC data
masic_data <- read_masic_data_from_DMS(data_package_num,
interference_score = TRUE)
```
### Filter MASIC Data
```{r}
# Filter MASIC data
masic_data <- filter_masic_data(masic_data,
interference_score_threshold = 0.5,
s2n_threshold = 0)
```
## Create Study Design Tables
Aside from the fractions table, the other study design tables can be the same as those created for the global proteomics data. This is because the datasets are different. The study design tables have been added to the data package (end of Section \@ref(global-references)), so we can use `get_study_design_by_dataset_package`.
```{r}
# Create fractions table
datasets <- unique(intersect(msnid$Dataset, masic_data$Dataset))
fractions <- data.frame(Dataset = datasets) %>%
mutate(PlexID = gsub(".*_P_(S\\d{1})_.*", "\\1", Dataset))
# Use global samples and references tables
study_design <- get_study_design_by_dataset_package(3442)
samples <- study_design$samples
references <- study_design$references
```
```{r}
# Save phospho fractions table
write.table(fractions, file = "data/3626_fractions.txt",
sep = "\t", quote = FALSE, row.names = FALSE)
```
## Create Quantitative Cross-tab
```{r}
# Create cross-tab - aggregate to SiteID level
crosstab <- create_crosstab(msnid = msnid,
reporter_intensities = masic_data,
aggregation_level = "SiteID",
fractions = fractions,
samples = samples,
references = references)
```
```{r echo=FALSE}
x <- head(crosstab)
# rownames(x) <- gsub("\\*", "\\\\*", rownames(x))
# rownames(x) <- gsub("\\@", "\\@\\\\hphantom{}", rownames(x))
kable(x, row.names = TRUE,
caption = "<left>First 6 rows of the phospho quantitative cross-tab.</left>", format = "html",
escape = FALSE) %>%
kable_styling(font_size = 12,
bootstrap_options = c("hover", "condensed")) %>%
scroll_box(height = "90%") # restrict width
```
</br>
```{r}
# Save cross-tab
write.table(crosstab, file = "data/3662_phospho_crosstab.txt",
sep = "\t", quote = FALSE, row.names = TRUE)
```
## Create MSnSet {#phospho-msnset}
```{r}
# Create MSnSet
m <- create_msnset(crosstab = crosstab, samples = samples)
m
```
```{r}
# Save phospho MSnSet
save(m, file = "data/phospho_msnset.RData", compress = TRUE)
```