-
Notifications
You must be signed in to change notification settings - Fork 5
/
Copy pathisobaric_quantification.Rmd
810 lines (607 loc) · 34.1 KB
/
isobaric_quantification.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
# Isobaric Quantification Pipelines {#isobaric-quantification-pipelines}
<!---
TODO:
* Overview of isobaric labeling
--->
<!---
Keep in mind that there are many other functions available in `PlexedPiper` and `MSnID` that will not be covered that can be used to add additional columns or further filter the results. This section just showcases the foundations of the pipelines.
--->
## Global Proteomics Data {#global-proteomics-data}
This pipeline shows how to process TMT data that is processed outside of PNNL's DMS. Section \@ref(phosphoproteomics-data) shows how to process data from the DMS. For convenience, the results of MS-GF+ and MASIC processing are provided in a companion `PlexedPiperTestData` package. In addition, we will need `PlexedPiper` for isobaric quantification, `dplyr` to manipulate data frames, and `MSnbase` to create MSnSet objects.
```{r include=FALSE}
# Global chunk options
knitr::opts_chunk$set(message = FALSE, warning = FALSE, fig.align='center')
```
```{r iso-lab-setup}
# Setup
library(PlexedPiper)
library(PlexedPiperTestData)
library(dplyr)
```
```{r include=FALSE}
library(ggplot2) # plotting
library(knitr) # embed images
library(kableExtra)
```
The pipeline can be broken up into 4 major chunks: prepare MS/MS identifications, prepare reporter ion intensities, create a quantitative cross-tab, and create an MSnSet object.
### Read MS-GF+ Data
The first step in the preparation of the MS/MS identifications is to fetch the data. In this case, the data exists in the PlexedPiperTestData package in a local folder, so we use `system.file` to get the file path and `read_msgf_data` to read the MS-GF+ output.
```{r read-msgf, results='hide'}
# Get file path
path_to_MSGF_results <- system.file("extdata/global/msgf_output",
package = "PlexedPiperTestData")
# Read MS-GF+ data from path
msnid <- read_msgf_data(path_to_MSGF_results)
```
Normally, this would display a progress bar in the console as the data is being fetched. However, the output was suppressed to save space. We can view a summary of the MSnID object with the `show()` function.
```{r}
show(msnid)
```
This summary tells us that `msnid` consists of 4 spectrum files (datasets), and contains a total of 1,156,754 peptide-spectrum-matches (PSMs), 511,617 total peptides, and 128,378 total accessions (proteins). The reported FDR is the empirical **false-discovery rate**, which is calculated as the ratio of the number of false (decoy) PSMs, peptides, or accessions to their true (non-decoy) counterparts. Calculation of these counts and their FDRs is shown below.
```{r eval=FALSE}
# How to calculate the counts and FDRs from the show() output
## Spectrum Files
# Count
psms(msnid) %>%
distinct(Dataset) %>%
nrow() # 48
## PSMs
# Count
psms(msnid) %>%
distinct(Dataset, Scan, peptide, isDecoy) %>%
# Assign intermediate to variable
assign("x_psm", ., envir = globalenv()) %>%
nrow() # 1156754
# FDR
nrow(x_psm[x_psm$isDecoy == TRUE, ]) /
nrow(x_psm[x_psm$isDecoy == FALSE, ])
# 0.3127463 = 31%
## peptides
# Count
psms(msnid) %>%
distinct(peptide, isDecoy) %>%
assign("x_peptide", ., envir = globalenv()) %>%
nrow() # 511617
# FDR
nrow(x_peptide[x_peptide$isDecoy == TRUE, ]) /
nrow(x_peptide[x_peptide$isDecoy == FALSE, ])
# 0.611245 = 61%
## accessions
# Count
length(accessions(msnid)) # or
psms(msnid) %>%
distinct(accession, isDecoy) %>%
assign("x_acc", ., envir = globalenv()) %>%
nrow() # 128378
# FDR
nrow(x_acc[x_acc$isDecoy == TRUE, ]) /
nrow(x_acc[x_acc$isDecoy == FALSE, ])
# 0.9827024 = 98%
```
Now that we have an `MSnID` object, we need to process it. We begin by correcting for the isotope selection error.
### Correct Isotope Selection Error
Carbon has two stable isotopes: $^{12}\text{C}$ and $^{13}\text{C}$, with natural abundances of 98.93% and 1.07%, respectively. That is, we expect that about 1 out of every 100 carbon atoms is naturally going to be a $^{13}\text{C}$, while the rest are $^{12}\text{C}$. In larger peptides with many carbon atoms, it is more likely that at least one atom will be a $^{13}\text{C}$ than all atoms will be $^{12}\text{C}$. In cases such as these, a non-monoisotopic ion will be selected by the instrument for fragmentation.
```{r MS1_peak, echo=FALSE, fig.cap="MS1 spectra with peak at non-monoisotopic precursor ion."}
include_graphics("images/MS1_non_monoisotopic.PNG")
```
In Figure \@ref(fig:MS1_peak), the monoisotopic ion (m/z of 1427.29) is not the most abundant, so it is not selected as the precursor. Instead, the ion with a $^{13}\text{C}$ in place of a $^{12}\text{C}$ is selected for fragmentation. We calculate the mass difference between these two ions as the difference between the mass-to-charge ratios multiplied by the ion charge. In this case, the mass difference is 1 Dalton, or about the difference between $^{13}\text{C}$ and $^{12}\text{C}$. (More accurately, the difference between these isotopes is 1.0033548378 Da.) While MS-GF+ is still capable of correctly identifying these peptides, the downstream calculations of mass measurement error need to be fixed because they are used for filtering later on (Section \@ref(global-peptide-filter)). The `correct_peak_selection` function corrects these mass measurement errors, and Figure \@ref(fig:mass-to-charge-diff) shows the distribution of the mass measurement errors before and after correction.
```{r echo=FALSE}
# m/z difference
delta_mz_pre <- msnid$experimentalMassToCharge - msnid$calculatedMassToCharge
# Mass difference
delta_mass_pre <- delta_mz_pre * msnid$chargeState # m/z * z = m
```
```{r}
# Correct for isotope selection error
msnid <- correct_peak_selection(msnid)
```
```{r mass-to-charge-diff, fig.cap="Histogram of mass measurement errors before and after correction.", fig.asp=0.5, echo=FALSE}
# m/z difference
delta_mz_post <- msnid$experimentalMassToCharge - msnid$calculatedMassToCharge
# Mass difference
delta_mass_post <- delta_mz_post * msnid$chargeState # m/z * z = m
delta_mass_df <- data.frame(delta_mass = c(delta_mass_pre, delta_mass_post),
group = rep(c("pre", "post"),
each = length(delta_mass_post))) %>%
mutate(group = factor(group, levels = c("pre", "post")))
facet_labs <- c("Before Correction", "After Correction")
names(facet_labs) <- c("pre", "post")
# # Plot mass measurement error
ggplot(delta_mass_df) +
geom_histogram(aes(x = delta_mass), binwidth = 0.01) +
facet_grid(cols = vars(group), labeller = labeller(group = facet_labs),
scales = "free_x") +
scale_y_continuous(name = "Count", labels = scales::label_scientific(),
expand = expansion(mult = c(0, 0.05))) +
labs(x = "Mass Measurement Error") +
theme_bw(base_size = 12) +
theme(strip.background = element_blank(),
strip.text = element_text(size = 14, face = "bold"))
```
### Remove Contaminants
Now, we will remove contaminants such as the trypsin that was used for protein digestion. We can use `grepl` to search for all accessions that contain the string `"Contaminant"`. Displaying these contaminants is not necessary during processing. This is just for demonstration purposes to see what will be removed.
```{r}
# All unique contaminants
accessions(msnid)[grepl("Contaminant", accessions(msnid))]
```
To remove contaminants, we use `apply_filter` with an appropriate character string that tells the function what rows to keep. In this case, we keep rows where the accession does not contain "Contaminant". We will use `show` to see how the counts change.
```{r}
# Remove contaminants
msnid <- apply_filter(msnid, "!grepl('Contaminant', accession)")
show(msnid)
```
We can see that the number of PSMs decreased by about 1300, peptides by ~400, and proteins by 25 (the 25 contaminants that were displayed).
### MS/MS ID Filter: Peptide Level {#global-peptide-filter}
The next step is to filter the MS/MS identifications such that the empirical peptide-level FDR is less than some threshold and the number of MS/MS IDs is maximized. We will use the $-log_{10}$ of the `PepQValue` column as one of our filtering criteria and assign it to a new column in `psms(msnid)` called `msmsScore`. The `PepQValue` column is the MS-GF+ Spectrum E-value, which reflects how well the theoretical and experimental fragmentation spectra match; therefore, high values of `msmsScore` indicate a good match (see Figure \@ref(fig:plot-msmsScore)).
```{r plot-msmsScore, fig.cap="Density plot of msmsScore.", fig.asp=0.5, echo=FALSE}
msmsScore <- -log10(msnid$PepQValue) # > 2
absParentMassErrorPPM <- abs(mass_measurement_error(msnid)) # < 10
isDecoy <- msnid$isDecoy
# Plot msmsScore
ggplot() +
geom_density(aes(x = msmsScore, color = isDecoy),
fill = NA, na.rm = TRUE, size = 1.4) +
scale_y_continuous("Density", expand = expansion(mult = c(0, 0.05))) +
scale_x_continuous(expand = expansion(0)) +
scale_color_manual(values = c("orange", "skyblue"), breaks = c(TRUE, FALSE)) +
theme_bw(base_size = 12)
```
</br>
The other filtering criteria is the absolute deviation of the mass measurement error of the precursor ions in parts-per-million (ppm), which is assigned to the `absParentMassErrorPPM` column in `psms(msnid)` (see Figure \@ref(fig:plot-mass-error)).
```{r plot-mass-error, fig.cap="Density plot of absParentMassErrorPPM.", echo=FALSE, fig.asp=0.5}
ggplot() +
geom_density(aes(x = absParentMassErrorPPM, color = isDecoy),
fill = NA, na.rm = TRUE, size = 1.4) +
scale_y_continuous("Density", expand = expansion(mult = c(0, 0.05))) +
scale_x_continuous(expand = expansion(0)) +
scale_color_manual(values = c("orange", "skyblue"), breaks = c(TRUE, FALSE)) +
theme_minimal(base_size = 12)
```
</br>
Now, we will filter the PSMs.
```{r global-fdr-filter-peptide}
# 1% FDR filter at the peptide level
msnid <- filter_msgf_data(msnid,
level = "peptide",
fdr.max = 0.01)
show(msnid)
```
We can see that filtering drastically reduces the number of PSMs, and the empirical peptide-level FDR is now 1%. However, notice that the empirical protein-level FDR is still fairly high.
### MS/MS ID Filter: Protein Level
Now, we need to filter proteins so that the FDR is at most 1%. A while ago, the proteomics field established the hard-and-fast two-peptides-per-protein rule. That is, the confident identification of a protein requires the confident identification of at least 2 peptides. This rule penalizes short proteins and doesn't consider that there are some very long proteins (e.g. Titin 3.8 MDa) that easily have more then two matching peptides even in the reversed sequence. Thus, we propose to normalize the number of peptides per protein length and use that as a filtering criterion (Figure \@ref(fig:plot-num-pep)).
In order to get the protein lengths, we need the FASTA (pronounced FAST-AYE) file that contains the protein sequences used in the database search. The first three entries of the FASTA file are shown in Figure \@ref(fig:fasta-ex).
```{r fasta-ex, echo=FALSE, fig.cap="First three entries of the FASTA file."}
include_graphics("images/FASTA_example_MoTrPAC.PNG")
```
</br>
For each protein, we divide the number of associated peptides by the length of that protein and multiply this value by 1000. This new `peptides_per_1000aa` column is used as the filter criteria.
```{r message=FALSE, warning=FALSE}
# Get path to FASTA file
path_to_FASTA <- system.file(
"extdata/Rattus_norvegicus_NCBI_RefSeq_2018-04-10.fasta.gz",
package = "PlexedPiperTestData"
)
# Compute number of peptides per 1000 amino acids
msnid <- compute_num_peptides_per_1000aa(msnid, path_to_FASTA)
```
```{r plot-num-pep, fig.cap="Density plot of peptides_per_1000aa. The plot area has been zoomed in.", echo=FALSE, fig.asp=0.5}
peptides_per_1000aa <- msnid$peptides_per_1000aa
isDecoy <- msnid$isDecoy
ggplot() +
geom_density(aes(x = peptides_per_1000aa, color = isDecoy),
fill = NA, na.rm = TRUE, size = 1.4) +
scale_y_continuous("Density", expand = expansion(mult = c(0, 0.05))) +
scale_x_continuous("peptides_per_1000aa", expand = expansion(0)) +
scale_color_manual(values = c("orange", "skyblue"), breaks = c(TRUE, FALSE)) +
theme_minimal(base_size = 12) +
coord_cartesian(xlim = c(0, 250)) +
guides(color = guide_legend(title = "isDecoy"))
```
</br>
Now, we filter the proteins to 1% FDR.
```{r message=FALSE, warning=FALSE}
# 1% FDR filter at the protein level
msnid <- filter_msgf_data(msnid,
level = "accession",
fdr.max = 0.01)
show(msnid)
```
### Inference of Parsimonious Protein Set
The situation when a certain peptide sequence matches multiple proteins adds complication to the downstream quantitative analysis, as it is not clear which protein this peptide is originating from. There are common ways for dealing with this. One is to simply retain uniquely matching peptides and discard shared peptides (`unique_only = TRUE`). Alternatively, assign the shared peptides to the proteins with the larger number of uniquely mapping peptides (`unique_only = FALSE`). If there is a choice between multiple proteins with equal numbers of uniquely mapping peptides, the shared peptides are assigned to the first protein according to alphanumeric order (Figure \@ref(fig:parsimony)).
<!---
This step could be done prior to filtering at the accession level, but if peptides are assigned to a low-confidence protein, and that protein is removed during filtering, those peptides will be lost. Instead, it is better to filter to the set of confidently-identified proteins and then determine the parsimonious set.
--->
```{r parsimony, echo=FALSE, fig.cap="Visual explanation of the inference of the parsimonious protein set."}
include_graphics("images/parsimonious-protein-set-inference.PNG")
```
</br>
```{r}
# Inference of parsimonious protein set
msnid <- infer_parsimonious_accessions(msnid, unique_only = FALSE)
show(msnid)
```
Notice that the protein-level FDR increased slightly above the 1% threshold. In this case, the difference isn't significant, so we can ignore it.
*Note:*
If the peptide or accession-level FDR increases significantly above 1% after inference of the parsimonious protein set, consider lowering the FDR cutoff (for example, to 0.9%) and redoing the previous processing steps. Filtering at the peptide and accession level should each be done a single time.
### Remove Decoy PSMs
The final step in preparing the MS/MS identifications is to remove the decoy PSMs. We use the `apply_filter` function again and only keep entries where `isDecoy` is `FALSE`.
```{r}
# Remove Decoy PSMs
msnid <- apply_filter(msnid, "!isDecoy")
show(msnid)
```
After processing, we are left with 444,345 PSMs, 90,232 peptides, and 5,196 proteins. Table \@ref(tab:global-msnid-table) shows the first 6 rows of the processed MS-GF+ output.
```{r global-msnid-table, echo=FALSE}
kable(head(psms(msnid), 6), digits = 3, row.names = FALSE,
caption = "<left>First 6 rows of the processed MS-GF+ results.</left>",
escape = FALSE, format = "html") %>%
kable_styling(full_width = FALSE, font_size = 12,
bootstrap_options = c("hover", "condensed"))
```
</br>
### Read MASIC Output
MASIC is a tool for extracting ion intensities. With proper parameter settings, it can be used for extracting TMT (or iTRAQ) reporter ion intensities. In addition, it reports a number of other helpful metrics. Notably, the interference score at the precursor ion level and the signal-to-noise ratio (S/N) at the reporter ion level (computed by Thermo software). The interference score reflects the proportion of the ion population that was isolated for fragmentation that is due to the targeted ion. In other words, `1 - InterferenceScore` is due to co-isolated species that have similar elution time and precursor ion m/z. The first step in the preparation of the reporter ion intensity data is to read the MASIC results. We use a local file path and the `read_masic_data` function. By default, the interference score is not included, so we need to set that argument to `TRUE` in order to filter the results.
```{r results='hide'}
# Path to MASIC data
path_to_MASIC_results <- system.file("extdata/global/masic_output",
package = "PlexedPiperTestData")
# Read MASIC data
masic_data <- read_masic_data(path_to_MASIC_results, interference_score = TRUE)
```
Normally, this would display progress bars in the console as the data is being fetched. However, the output was suppressed to save space.
```{r global-masic-table, echo=FALSE}
kable(head(masic_data, 6), digits = 3, row.names = FALSE,
caption = "<left>First 6 rows of the MASIC data.</left>",
escape = FALSE, format = "html") %>%
kable_styling(full_width = FALSE, font_size = 12,
bootstrap_options = c("hover", "condensed"))
```
</br>
Table \@ref(tab:global-masic-table) shows the first 6 rows of `masic_data`.
### Filter MASIC Data
The only other step in reporter ion intensity data preparation is to filter the results. Currently, we recommend keeping entries where at least 50% of the ion population is due to the targeted ion (interference score $\geq$ 0.5) and not filtering by S/N.
```{r}
# Filter MASIC data
masic_data <- filter_masic_data(masic_data,
interference_score_threshold = 0.5,
s2n_threshold = 0)
```
### Create Study Design Tables {#fetch-study-design-tables}
To convert from PSMs and reporter ion intensities to meaningful quantitative data, it is necessary to know what are the samples in the reporter channels and what is the intended reference channel (or combination of channels). The entire study design is captured by three tables - fractions, samples, references. With newly processed data, these typically do not exist, and must be created. The next sections show how to create these tables.
#### Fractions
The fractions table consists of two columns: `Dataset` and `PlexID`. The `Dataset` column contains all of the unique datasets from `msnid$Dataset` or `masic_data$Dataset`. The `PlexID` column contains the plex ID associated with each dataset, and is typically an "S" followed by a number ("S1", "S2", etc.). We can extract the plex ID from the datasets. In this case, the plex ID always comes after "_W_", so we can use a regular expression (regex) to capture it (the first argument of `gsub`). The regex below says to capture an "S" followed by a single digit that appears after "_W_" and before an underscore.
```{r}
# Create fractions table
fractions <- data.frame(Dataset = unique(masic_data$Dataset)) %>%
mutate(PlexID = gsub(".*_W_(S\\d{1})_.*", "\\1", Dataset))
```
```{r fractions-table, echo=FALSE}
kable(head(fractions, 10), row.names = FALSE,
caption = "<left>First 10 rows of the fractions table.</left>",
escape = FALSE, format = "html") %>%
kable_styling(full_width = FALSE, font_size = 12,
bootstrap_options = c("hover", "condensed"))
```
</br>
Table \@ref(tab:fractions-table) shows the first 10 rows of `fractions`.
#### Samples
The samples table contains columns `PlexID`, `QuantBlock`, `ReporterName`, `ReporterAlias`, and `MeasurementName`. The plex ID must be the same as the plex ID in the `fractions` table. `ReporterName` is the reporter ion name ("126", "127N", "127C", etc.). `ReporterAlias` is the intermediate between `ReporterName` and `MeasurementName` and is used for defining the reference. `MeasurementName` determines the column names for the final cross-tab, and must be unique and begin with a letter. Finally, `QuantBlock` can be thought of as a way of defining sub-plex. In a typical TMT experiment, `QuantBlock` is always 1. In case of 5 pairwise comparisons within TMT10, there will be 5 QuantBlocks (1-5) with a reference for each `QuantBlock`.
For this experiment, channel 131 will serve as the reference, so we set `MeasurementName` to `NA` when `ReporterName` is `"131"`. This will make the reference channel absent from the quantitative cross-tab. In cases where reporter ion intensities are not normalized by a reference channel (reference = 1) or they are normalized by the average of select channels, do not set any `MeasurementName` to `NA`.
```{r}
# Create samples table
samples <- read.delim("data/MoTrPAC_pilot_TMT_labeling.txt") %>%
dplyr::rename(ReporterName = TMT10_channel,
ReporterAlias = sample_ID) %>%
mutate(QuantBlock = 1,
MeasurementName = ifelse(ReporterAlias == "ref", NA, ReporterAlias))
```
```{r samples-table, echo=FALSE}
kable(head(samples, 10), row.names = FALSE,
caption = "<left>First 10 rows of the samples table.</left>",
escape = FALSE, format = "html") %>%
kable_styling(full_width = FALSE, font_size = 12,
bootstrap_options = c("hover", "condensed"))
```
</br>
Table \@ref(tab:samples-table) shows the first 10 rows of `samples`.
#### References
Reference can be a certain channel, average of multiple channels, or 1. The general form is an expression with `ReporterAlias` names as variables. It is evaluated for each `PlexID`/`QuantBlock` combination and applied to divide reporter ion intensities within corresponding `PlexID`/`QuantBlock`.
```{r}
# Create references table
references <- samples %>%
# Filter to reference channel (ReporterName == "131", ReporterAlias == "ref")
filter(ReporterName == "131") %>%
# Select required columns and rename ReporterAlias to Reference
select(PlexID, Reference = ReporterAlias, QuantBlock)
```
```{r references-table, echo=FALSE}
kable(references, row.names = FALSE,
caption = "<left>References table.</left>",
escape = FALSE, format = "html") %>%
kable_styling(full_width = FALSE, font_size = 12,
bootstrap_options = c("hover", "condensed"))
```
</br>
Table \@ref(tab:references-table) shows the first 10 rows of `references`. The code to use the geometric average instead of a single channel as the reference is shown below. The geometric average is the product of the reporter ion channels to the power of (1/number of channels). For each `PlexID` group, collapse the vector of reporter ion names with `*`, surround them in parentheses, and raise to the power of (1/number of channels).
```{r eval=FALSE}
# Use geometric average as reference
references <- samples %>%
group_by(PlexID, QuantBlock) %>%
summarise(Reference = sprintf("(%s)^(1/%d)",
paste(ReporterAlias, collapse = "*"), n()))
# Do not normalize by reference channel (use 1 as the reference)
references <- samples %>%
distinct(PlexID, QuantBlock) %>%
mutate(Reference = 1)
```
Now that we have the three study design tables, we should save them.
<!---
TODO:
Explain how to add the study design tables to the DMS.
--->
```{r}
# Save study design tables
write.table(fractions, file = "data/fractions.txt",
sep = "\t", quote = FALSE, row.names = FALSE)
write.table(samples, file = "data/samples.txt",
sep = "\t", quote = FALSE, row.names = FALSE)
write.table(references, file = "data/references.txt",
sep = "\t", quote = FALSE, row.names = FALSE)
```
### Create Quantitative Cross-tab {#global-quant-crosstab}
This is the step where MS/MS IDs and reporter ions are linked together and aggregated to the peptide or accession (i.e. protein) level. To retain protein IDs while aggregating to peptide level, set `aggregation_level <- c("accession","peptide")`. The abundances are converted to relative abundances by dividing by the reference and then $log_2$-transformed.
```{r}
# Set aggregation level
aggregation_level <- c("accession")
# Create cross-tab
crosstab <- create_crosstab(msnid, masic_data,
aggregation_level = aggregation_level,
fractions, samples, references)
```
```{r echo=FALSE}
kable(head(crosstab, 6), row.names = TRUE,
caption = "<left>First 6 rows of the global quantitative cross-tab.</left>",
escape = FALSE, format = "html") %>%
kable_styling(full_width = FALSE, font_size = 12,
bootstrap_options = c("hover", "condensed"))
```
</br>
Now that we have the cross-tab, we should save it.
```{r}
# Save cross-tab
write.table(crosstab, file = "data/global_quant_crosstab.txt",
sep = "\t", quote = FALSE, row.names = TRUE)
```
We will also save the proteins (row names) of this cross-tab in order to demonstrate prioritized inference in Section \@ref(phosphoproteomics-data).
```{r}
# Save global proteins
global_proteins <- rownames(crosstab)
save(global_proteins, file = "data/3442_global_proteins.RData")
```
### Create MSnSet Object
The final step is to create an MSnSet object, which is necessary for downstream data analysis tools such as the `plot_pca_*` or `limma_*` functions in MSnSet.utils. An MSnSet combines three tables into one object: `exprs`, `fData`, and `pData.` `exprs` is a matrix of protein abundance data with proteins as rows and samples as columns. `fData` is used to convert between different feature IDs such as RefSeq, UniProt accession, Entrez gene ID, gene symbol, etc. It is optional, but the rownames of `fData` must be the same as the row names of `exprs`. Lastly, `pData` contains the metadata with samples as rows, and the row names must be the same as the column names of `exprs` (same order as well). For this example, we don't have an actual metadata table. We just know group assignment from the sample names. Samples with an "R" are part of the exeRcised group, while samples with an "S" are part of the Sedentary group. We can make a metadata table with a single column called `group`.
```{r}
metadata <- data.frame(group = gsub("([RS]).*", "\\1", colnames(crosstab))) %>%
mutate(group = ifelse(group == "R", "exercised", "sedentary"))
rownames(metadata) <- colnames(crosstab)
```
```{r echo=FALSE}
kable(metadata, row.names = TRUE,
caption = "<left>Metadata for MSnSet.</left>",
escape = FALSE, format = "html") %>%
kable_styling(full_width = FALSE, font_size = 12,
bootstrap_options = c("hover", "condensed"))
```
Since we used the sample data to create the metadata, everything is in the right order; however, we usually need to reorder the rows of the metadata. We can do this with
```{r eval = FALSE}
# Reorder rows of metadata to match sample order in crosstab
# Need to add drop = FALSE if the metadata is a single column
metadata <- metadata[colnames(crosstab), , drop = FALSE]
```
We have the metadata and protein abundance data, so we can now create the MSnSet. We will call it `m1` and we should save it as an .RData file.
```{r}
# Create MSnSet
m1 <- MSnbase::MSnSet(exprs = as.matrix(crosstab), pData = metadata)
validObject(m1)
```
If we type the name of the MSnSet in the console, a summary will be displayed. This will show information about each of the tables, and the MSnSet will keep track of all processing done such as filtering at the bottom under "Processing information".
```{r}
# Display summary
m1
```
```{r, eval = FALSE}
# Save unprocessed MSnSet
save(m1, file = "data/global_msnset.RData")
```
## Phosphoproteomics Data {#phosphoproteomics-data}
This pipeline shows how to process phosphoproteomics data from the DMS.
For this section, we will use data package number `3626`. We will need the `PlexedPiper` package for isobaric quantification and `PNNL.DMS.utils` to interface with the DMS. Also, details will be omitted if they were already provided in Section \@ref(global-proteomics-data).
```{r message=FALSE}
# Setup
library(PNNL.DMS.utils)
library(PlexedPiper)
library(Biostrings)
library(dplyr) # %>%
library(MSnbase)
```
### Read MS-GF+ Output
```{r results='hide'}
# Read MS-GF+ data
data_package_num <- 3626
msnid <- read_msgf_data_from_DMS(data_package_num)
```
```{r}
show(msnid)
```
### Correct Isotope Selection Error
```{r}
# Correct for isotope selection error
msnid <- correct_peak_selection(msnid)
```
### Remove Non-Phosphorylated Peptides
In this case, the phosphorylation of an amino acid is marked by a `*` inserted into the sequence after said amino acid. We will not consider unmodified peptides, so we can filter out peptides that do not contain this symbol with `apply_filter`. The `*` is a special character that must be escaped with backslashes, and the backslashes must also be escaped, since they are enclosed within a nested string (`"''"`).
```{r}
# Remove non-phosphorylated peptides
# (peptides that do not contain a *)
msnid <- apply_filter(msnid, "grepl('\\\\*', peptide)")
show(msnid)
```
### Remove Contaminants
```{r}
# Remove contaminants
msnid <- apply_filter(msnid, "!grepl('Contaminant', accession)")
show(msnid)
```
### Improve Phosphosite Localization
Phospho datasets involve AScore jobs for improving phosphosite localization. There should be one AScore job per data package. If the AScore job does not exist, see <a href="https://prismwiki.pnl.gov/wiki/AScore_Job_Creation">AScore Job Creation</a> for how to set it up. The fetched object is a data.frame that links datasets, scans and original PTM localization to newly suggested locations. Importantly, it contains `AScore` column that signifies the confidence of PTM assignment. AScore > 17 is considered confident.
```{r results='hide'}
# Filter PTMs by Ascore
ascore <- get_AScore_results(data_package_num)
msnid <- best_PTM_location_by_ascore(msnid, ascore)
```
<!---
Why are some of the original sequences in ascore not phosphorylated? (AScore = -1)
--->
```{r}
show(msnid)
```
### MS/MS ID Filter: Peptide Level
```{r}
# 1% FDR filter at the peptide level
msnid <- filter_msgf_data(msnid,
level = "peptide",
fdr.max = 0.01)
show(msnid)
```
### MS/MS ID Filter: Protein Level
```{r message=FALSE, warning=FALSE}
# Get path to FASTA file
path_to_FASTA <- path_to_FASTA_used_by_DMS(data_package_num)
# Compute number of peptides per 1000 amino acids
msnid <- compute_num_peptides_per_1000aa(msnid, path_to_FASTA)
# 1% FDR filter at the protein level
msnid <- filter_msgf_data(msnid,
level = "accession",
fdr.max = 0.01)
show(msnid)
```
### Inference of Parsimonious Protein Set
<!---
TODO:
* Talk about using prior information from global cross-tab to improve inference of the parsimonious protein set for phospho data.
--->
```{r}
# Load proteins from global crosstab
global_proteins <- readRDS("data/3442_global_protein_names.rds")
# Inference of parsimonious protein set
msnid <- infer_parsimonious_accessions(msnid, unique_only = FALSE,
prior = global_proteins)
show(msnid)
```
### Remove Decoy PSMs
```{r}
# Remove Decoy PSMs
msnid <- apply_filter(msnid, "!isDecoy")
show(msnid)
```
### Map Sites to Protein Sequences
Prepare FASTA to make sure entry names in FASTA file match MSnID accessions. The plan is to make this conversion automatic. `map_mod_sites` creates a number of columns describing mapping of the sites onto the protein sequences. The most important for the user is `SiteID`.
```{r}
# Create AAStringSet
fst <- readAAStringSet(path_to_FASTA)
# Remove contaminants
fst <- fst[!grepl("Contaminant", names(fst)), ]
# First 6 names
head(names(fst))
```
```{r}
# Modify names to match accessions(msnid)
names(fst) <- strsplit(names(fst), split = " ") %>%
# Select text before first space
lapply(function(x) x[1]) %>%
unlist()
# First 6 names
head(names(fst))
```
```{r}
# Main mapping call
msnid <- map_mod_sites(object = msnid, fasta = fst,
accession_col = "accession",
peptide_mod_col = "peptide",
mod_char = "*",
site_delimiter = "lower")
```
```{r phospho-msnid-table, echo=FALSE}
fix_phos <- function(x) {
gsub("\\*", "\\\\*", x)
}
x <- head(psms(msnid), 6) %>%
mutate_if(is.character, fix_phos)
kable(x, digits = 3, row.names = FALSE,
caption = "<left>First 6 rows of the processed MS-GF+ results.</left>",
escape = FALSE, format = "html") %>%
kable_styling(full_width = FALSE, font_size = 12,
bootstrap_options = c("hover", "condensed"))
```
</br>
Table \@ref(tab:phospho-msnid-table) shows the first 6 rows of the processed MS-GF+ output.
### Read MASIC Output
```{r results='hide'}
# Read MASIC data
masic_data <- read_masic_data_from_DMS(data_package_num,
interference_score = TRUE)
```
### Filter MASIC Data
```{r}
# Filter MASIC data
masic_data <- filter_masic_data(masic_data,
interference_score_threshold = 0.5,
s2n_threshold = 0)
```
### Create Study Design Tables
If study design tables are on the DMS, they can be accessed in the following way.
```{r eval=FALSE}
# Read study design tables from DMS
study_design <- read_study_design_from_DMS(data_package_num)
fractions <- study_design$fractions
samples <- study_design$samples
references <- study_design$references
```
While the study design tables are not on the DMS, we already created them in Section \@ref(fetch-study-design-tables). We just need to recreate the fractions table because the dataset names are different.
```{r}
# Read tables from folder
samples <- read.delim("data/samples.txt")
references <- read.delim("data/references.txt")
fractions <- data.frame(Dataset = unique(masic_data$Dataset)) %>%
mutate(PlexID = gsub(".*_P_(S\\d{1})_.*", "\\1", Dataset))
```
### Create Quantitative Cross-tab {#phospho-create-crosstab}
```{r}
# Set aggregation level
aggregation_level <- c("SiteID")
# Create cross-tab
crosstab <- create_crosstab(msnid, masic_data,
aggregation_level = aggregation_level,
fractions, samples, references)
```
```{r echo=FALSE}
x <- head(crosstab)
# rownames(x) <- gsub("\\*", "\\\\*", rownames(x))
# rownames(x) <- gsub("\\@", "\\@\\\\hphantom{}", rownames(x))
kable(x, row.names = TRUE,
caption = "<left>First 6 rows of the phospho quantitative cross-tab.</left>", format = "html",
escape = FALSE) %>%
kable_styling(font_size = 12,
bootstrap_options = c("hover", "condensed"))
```
</br>
We will save the cross-tab for later sections.
```{r}
# Modify cross-tab for saving
crosstab <- crosstab %>%
as.data.frame() %>%
tibble::rownames_to_column("SiteID")
# Save cross-tab
write.table(crosstab, file = "data/phosphosite_quant_crosstab.txt",
sep = "\t", quote = FALSE, row.names = FALSE)
```