-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathbioc-workflow-resubmit.Rnw
2051 lines (1751 loc) · 90.5 KB
/
bioc-workflow-resubmit.Rnw
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<<env, echo=FALSE, warning=FALSE>>=
library("BiocStyle")
suppressPackageStartupMessages(library("knitr"))
suppressPackageStartupMessages(library("xtable"))
suppressPackageStartupMessages(library("MSnbase"))
suppressPackageStartupMessages(library("pRoloc"))
suppressPackageStartupMessages(library("pRolocdata"))
cache <- FALSE
@
\section*{Introduction}
Quantitative mass spectrometry-based spatial proteomics involves
elaborate, expensive and time consuming experimental protocols and
considerable effort is invested in the generation of such data.
Multiple research groups have described a variety of approaches to
establish high quality proteome-wide datasets (see for example
\cite{Gatto:2010} for a review, and
\cite{hyper,Itzhak:2016,Jean_Beltran:2016,Itzhak:2017,Hirst:2018} for
recent examples). However, data analysis is as critical as data
production for reliable and insightful biological
interpretation. Here, we walk the reader through a typical pipeline
for the analysis of such data using several
Bioconductor~\cite{Huber:2015} packages for the R statistical
programming environment.
The main package to analyse protein localisation data is
\Biocpkg{pRoloc}, which offers a set of dedicated functions for the
analysis of such data. \Biocpkg{pRoloc} itself relies on
\Biocpkg{MSnbase} to manipulate and process quantitative proteomics
data. Many other packages are used by \Biocpkg{pRoloc} for clustering,
classification and visualisation. Support for interactive
visualisation is offered by the \Biocpkg{pRolocGUI} package.
In this workflow, we will describe how to prepare the spatial
proteomics data starting from a spreadsheet containing quantitative
mass spectrometry data, through to some essential data processing
steps, and finish with different applications of machine learning
(Figure \ref{fig:overview}). We focus on a recent pluripotent mouse
embryonic stem cells experiment \cite{hyper}. These data, as well as
additional annotated and pre-formatted datasets from various species
are readily available in the \Biocexptpkg{pRolocdata} package.
Installation of Bioconductor packages is documented in detail on the
\href{http://bioconductor.org/install/#install-bioconductor-packages}{Bioconductor
installation help page}. Below, we show how to install the four main
packages used in this workflow:
<<install, eval=FALSE>>=
source("https://bioconductor.org/biocLite.R")
biocLite(c("MSnase", "pRoloc", "pRolocdata", "pRolocGUI"))
@
This procedure is also applicable to any packages, from
\href{https://cran.r-project.org/}{CRAN} as well as GitHub. Once a
package has been installed, it needs to be loaded for its
functionality to become available in the R session; this is done with
the \texttt{library} function e.g. to load the \Biocpkg{pRoloc} package one
would type \texttt{library("pRoloc")} after installation.
\bigskip
If you have questions about this workflow in particular, or about
other Bioconductor packages in general, they are best asked on the
\href{https://support.bioconductor.org/}{Bioconductor support site}
following the
\href{http://www.bioconductor.org/help/support/posting-guide/}{posting
guidelines}. Questions can be tagged with specific package names or
keywords. For more general information about mass spectrometry and
proteomics, the readers are invited to read the
\Biocexptpkg{RforProteomics} package vignettes and associated
papers \cite{Gatto:2014,Gatto:2015}.
\begin{figure}[!ht]
\centering
\includegraphics[width=.5\textwidth]{./Figures/overview.pdf}
\caption{Schematic overview of the pRoloc pipeline from data import, through to data processing, machine learning and data export.}
\label{fig:overview}
\end{figure}
\section*{Reading and processing spatial proteomics data}
\subsection*{The use-case: predicting sub-cellular localisation in pluripotent embryonic mouse stem cells}
As a use-case, we analyse a recent high-throughput spatial proteomics
dataset from pluripotent mouse embryonic stem cells (E14TG2a)
\cite{hyper}. The data was generated using hyperplexed LOPIT
(hyperLOPIT), a state-of-the-art method relying on improved
sub-cellular fractionation and more accurate quantitation, leading to
more reliable classification of protein localisation across the whole
sub-cellular space. The method uses an elaborate sub-cellular
fractionation scheme, enabled by the use of Tandem Mass Tag (TMT)
\cite{Thompson:2003} 10-plex and application of the MS data
acquisition technique named synchronous precursor selection MS$^3$
(SPS-MS$^3$) \cite{McAlister:2014}, for TMT quantification with high
accuracy and precision. Three biological replicates were generated
from the E14TG2a experiment, the first was to target low density
fractions and the second and third were to emphasis separation of the
denser organelles. The intersect of replicates 1 and 2 was treated as
a 20-plex dataset for the analysis. As discussed in the publication
\cite{hyper}, it has been shown that combining replicates from
different gradients can increase spatial resolution
\cite{Trotter:2010}. The combination of replicates resulted in 5032
proteins common to both experiments.
These, as well as many other data are directly available as properly
structured and annotated datasets from the \Biocexptpkg{pRolocdata}
experiment package. In this workflow, we will start with a description
of how to generate these ad hoc data objects starting from an
arbitrary spreadsheet, as produced by many popular third-party
applications.
While we focus here on a LOPIT-type dataset, these analyses are
relevent for any quantitative spatial proteomics data,
irrespective of the fractionation (i.e. density gradient or
differential centrifugation \cite{Itzhak:2016}) or quantitation
(i.e. labelled or label-free) methods.
\subsection*{The infrastructure: \Biocpkg{pRoloc} and \Biocpkg{MSnbase} packages}
To make use of the full functionality of the \Biocpkg{pRoloc} software,
users need to import their data into R and prepare them as an
\texttt{MSnSet}. The \texttt{MSnSet} is a dedicated data structure for
the efficient manipulation and processing of mass spectrometry and
proteomics data in R. Figure \ref{fig:msnset} illustrates a simplified view of the
\texttt{MSnSet} structure; there exists 3 key sub-parts (termed slots)
to such a data object: (1) the \texttt{exprs} (short for
\textit{expression} data) slot for storing the quantitation data, (2)
the \texttt{fData} slot (short for \textit{feature}-metadata) for
storing the feature meta-data, and finally (3) the \texttt{pData} slot
(short for \textit{pheno}-metadata, i.e. sample phenotypic data) for
storing the sample meta-data.
\begin{figure}[!ht]
\centering
\includegraphics[width=.5\textwidth]{./Figures/msnset.png}
\caption{Simplified representation of the \texttt{MSnSet} data
structure (reproduced with permission from the \Biocpkg{MSnbase}
vignette)}
\label{fig:msnset}
\end{figure}
Feature metadata typically contains general annotation about the
proteins (accession numbers, description, \ldots), information related
to the identification search (confidence scores, number of peptides,
\ldots) as well as annotation about known sub-cellular location (see in
particular the \textit{Markers} section) and results from data
analysis. The sample metadata would, for example, record what stable
isotope labels were used for the respective fraction (when labelled
quantitation is used), replicate number, fraction number along the
gradient and pooling information.
Another slot of interest is \texttt{processingData}, that logs the
processing \texttt{MSnSet} objects undergo. The processing log can be
accessed with the \texttt{processingData} function and is displayed
under \textit{Processing information} in the textual object summary
when an \texttt{MSnSet}'s name is typed in the R console (see example
below).
\subsection*{Importing data}
There are a number of ways to import quantitation data and create an
\texttt{MSnSet} instance. All methods are described in the
\Biocpkg{MSnbase}
\href{http://bioconductor.org/packages/release/bioc/vignettes/MSnbase/inst/doc/MSnbase-io.pdf}{input/output
capabilities vignette}. The simplest method is to use the
function \texttt{readMSnSet2}. The function takes a single spreadsheet
file name as input and extracts the columns containing the
quantitation data, as identified by the argument \texttt{ecol}, to
create the expression data, while the other columns in the spreadsheet
are appended to the feature meta-data slot. By example, in the code
chunk below we read in the \texttt{csv} spreadsheet containing the
quantitation data from the intersect of replicates 1 and 2 of the
mouse map \cite{hyper}, using the \texttt{readMSnSet2} function. The
data is as available online with the manuscript (see tab 2 of the
\texttt{xlsx} supplementary data set 1 in \cite{hyper}, which should
be exported as a text-based spreadsheet). It is also available as a
\texttt{csv} in the Bioconductor \Biocexptpkg{pRolocdata} data
package, which we make use of below.
To use the \texttt{readMSnSet2} function, as a minimum one must
specify the file path to the data and which columns of the spreadsheet
contain quantitation data. In the code chunk below, we start by
identifying the file that we want to use. The \texttt{system.file}
function is used to return the path to the \texttt{extdata} directory
from the \Biocexptpkg{pRolocdata} package, which is where our file of
interest resides. We then use the \texttt{dir} function to list the
content of that directory and store the path that matches the file
name of interest in the \texttt{csvfile} variable. Note that these two lines
are only needed here to locate a file in a package; in a more general
use case, the user would define the \texttt{csvfile} variable
containing the file name of interest directly.
A common pitfall here is to provide only the file name, rather than
full path to the file (which is what is shown below with
\texttt{basename}; we don't print the full path, as it will vary from
computer to computer). Note that only specifying the name of the file
is sufficient when it exists in the working directory (i.e. the
directory in which R is running, which can be queried and changed with
the \texttt{getwd} and \texttt{setwd} functions respectively).
<<getFilename, cache = cache>>=
extdatadir <- system.file("extdata", package = "pRolocdata")
csvfile <- dir(extdatadir, full.names = TRUE,
pattern = "hyperLOPIT-SIData-ms3-rep12-intersect.csv")
basename(csvfile)
@
Note that the file is compressed (as indicated by the \texttt{gz}, for
\texttt{gzip}, extension), and will be decompressed on-the-fly when
read into R.
Next, we need to identify which columns in the spreadsheet contain the
quantitation data. This can be done using the \texttt{getEcols}
function inside R. The spreadsheet deposited by the authors contains
two headers, with the second header containing information about where
the quantitation data is stored.
\begin{figure}[!ht]
\centering
\includegraphics[width=.85\textwidth]{./Figures/spreadsheet-screenshot.png}
\caption{A screenshot of the data in the spreadsheet.}
\label{fig:spreadsheet}
\end{figure}
We can display the names of the second header by calling the
\texttt{getEcols} function with the argument \texttt{n = 2} (the
default value is \texttt{n = 1}), to specify that we wish to display
the column names of the second line. We also specify the name of the
spreadsheet file (defined as \texttt{csvfile} above) and the
separator that splits cells.
<<getEcols>>=
getEcols(csvfile, split = ",", n = 2)
@
It is now easy for one to identify that the quantitation data,
corresponding to the 10 TMT isobaric tags, is located in columns 8
to 27. We now have the two mandatory arguments to \texttt{readMSnSet2},
namely the file name (stored in the \texttt{csvfile} variable) and the
quantitation column indices. In addition to these, it is also possible
to pass the optional argument \texttt{fnames} to indicate which column to use
as the labels by which to identify each protein in the sample. Here,
we use \texttt{fnames = 1} to use the UniProt identifiers contained in the
first (unnamed) column of the spreadsheet. We also need to specify to
skip the first line of the file (for the same reason that we used
\texttt{n = 2} in \texttt{getEcols} above) to read the \texttt{csv} data and convert it to an
\texttt{MSnSet} object, named \texttt{hl} (for hyperLOPIT).
<<readMSnSet2>>=
hl <- readMSnSet2(csvfile, ecol = 8:27, fnames = 1, skip = 1)
@
Below, we display a short summary of the data. The data contains
\Sexpr{nrow(hl)} proteins/features common across the 2 biological replicates
for the respective 2 x 10-plex reporter tags (\Sexpr{ncol(hl)}
columns or samples), along with associated feature meta-data such as
protein markers, protein description, number of quantified peptides
etc (see below).
<<showhl>>=
hl
@
Below, we examine the quantitative information along the whole
gradient for first 5 proteins. It is also possible to access specific
rows and columns by naming the proteins and TMT tag channels of
interest.
<<assesshl>>=
exprs(hl)[1:5, ]
exprs(hl)[c("Q9ERU9", "Q9Z2R6"), c("X126", "X131.1")]
@
The feature meta-data is stored in the \texttt{fData} slot and can be
accessed by \texttt{fData(hl)}. When using \texttt{readMSnSet2}
everything that is not defined as quantitation data by \texttt{ecol}
is deposited to the \texttt{fData} slot.
We see the \texttt{fData} contains 25 columns describing information such as
the number of peptides, associated markers, machine learning results
etc. To identify the feature variable names we can use the function
\texttt{fvarLabels}. We see that the first 6 feature variable names contain
non-discriminatory label names, so we relabel them to help us identify
what feature data information is stored in the associated columns.
<<removefData>>=
fvarLabels(hl)
fvarLabels(hl)[1:3] <- c("uniprot.accession", "uniprot.id", "description")
fvarLabels(hl)[4:6] <- paste0("peptides.expt", 1:3)
## feature vars 1, 2, and 4 to 6
fData(hl)[1:4, c(1:2, 4:6)]
@
Note that when using the simple \texttt{readMSnSet2} procedure, the
\texttt{pData} slot which is used to store information about the
samples/channels is kept empty. As illustrated below, one can use the
\texttt{\$} operator to access (or create) individual columns in the
metadata slot. It is advised to annotate the channels as well. Below,
we annotate the replicate from which the profiles originate and the
TMT tag (extracted from the sample/channel names). To do so, we use
the sample names that were assigned automatically using the
quantiation column names and remove leading \texttt{X} and trailing
\texttt{.1} using the \texttt{sub} function.
<<pdata>>=
pData(hl)$Replicate <- rep(1:2, each = 10)
pData(hl)$Tag <- sub("\\.1$", "", sub("^X", "", sampleNames(hl)))
pData(hl)
@
Throughout this workflow we refer to the different columns that
are found in the \texttt{exprs} (expression data) slot as channels (short for TMT channels).
In the frame of LOPIT and hyperLOPIT these channels constitute the
relative abundance of each protein (along the rows) in the channel of
interest. Each TMT channel originates from fractions collected from
the density gradient, or a set of pooled fractions or may be a sample
originating from an alternative preparation e.g. such as from the
chromatin enrichment performed in Christoforou et al \cite{hyper}.
Information about which gradient fractions were used for which tag
should also be stored in the sample meta-data \texttt{pData} slot.
The sample meta-data that is distributed with the
\Biocexptpkg{pRolocdata} package for Christoforou's
hyperLOPIT experiment and (as above) the quantitation data file, are
located in the \texttt{extdata} in the \Biocexptpkg{pRolocdata}
package on the hard drive.
In the code chunk below we again use the
\texttt{dir} function to locate the filepath to the meta-data \texttt{csv} file and
then read it into R using \texttt{read.csv}. We then append the meta-data to
the \texttt{pData} slot. Information about the gradient fractions used and
the associated subcellular fraction densities in % w/v Iodixanol for
each replicate are stored here.
<<morepdata>>=
expinfo <- dir(extdatadir, full.names = TRUE,
pattern = "hyperLOPIT-SIData-fraction-info.csv")
fracinfo <- read.csv(expinfo, row.names=1, skip = 2,
header = FALSE, stringsAsFactors = FALSE)
pData(hl)$Gradient.Fraction <- c(fracinfo[, 1], fracinfo[, 2])
pData(hl)$Iodixonal.Density <- c(fracinfo[, 4], fracinfo[, 5])
pData(hl)
@
\subsection*{Data processing}
\subsubsection*{Normalisation}
There are two aspects related to data normalisation that are relevent
to spatial proteomics data processing. The first one focuses on
reducing purely technical variation between channels without affecting
biological variability (i.e. the shape of the quantitatives
profiles). This normalisation will depend on the underlying
quantitative technology and the experimental design, and will not be
addressed in this workflow. The second aspect, and more specific to
spatial proteomics data, is scaling all the organelle-specific
profiles into the same intensity interval (typically 0 and 1) by, for
example, dividing each intensity by the sum of the intensities for
that quantitative feature. This is not necessary in this example as
the intensities for each replicate have already been re-scaled to 1 in
Proteome Discoverer v1.4 Thermo Fisher. However, if the data require
normalisation, the user can execute the \texttt{normalise} function as
demonstrated in the below code chunk.
<<normsum, eval = FALSE>>=
hl <- normalise(hl, method = "sum")
@
This transformation of the data assures cancellation of the effect of
the absolute intensities of the quantitative features along the rows,
and focus subsequent analyses on the relative profiles along the
sub-cellular channels.
The same \texttt{normalise} function (or \texttt{normalize}, both
spellings are supported) can also be applied in the first case
described above. Different normalisation methods, such as mean or
median scaling, variance stabilisation or quantile normalisation, to
cite a few, can be applied to accomodate different needs (see
\texttt{?normalise} for available options).
As previously mentioned, before combination, the two replicates in the
\texttt{hl} data that we read into R were separately normalised by sum (i.e.
to 1) across the 10 channels for each replicate respectively. We can
verify this by summing each rows for each replicate:
<<normcheck>>=
summary(rowSums(exprs(hl[, hl$Replicate == 1])))
summary(rowSums(exprs(hl[, hl$Replicate == 2])))
@
We see that some features do not add up exactly to 1 due to rounding
errors after exporting to intermediate files. These small deviations
do not bear any consequences here.
\subsubsection*{Combining acquisitions}
The spreadsheet that was used to create the \texttt{hl}
\texttt{MSnSet} included the two replicates within one .csv file. We
also provide individual replicates in the \Biocexptpkg{pRolocdata}
package. Below, we show how to combine \texttt{MSnSet} objects and,
subsequently, how to filter and handle missing values. We start by
loading the \Biocexptpkg{pRolocdata} package and the equivalent
replicates using the \texttt{data} function.
<<reps4combine>>=
library("pRolocdata")
data(hyperLOPIT2015ms3r1)
data(hyperLOPIT2015ms3r2)
@
At the R prompt, typing
<<pkgdata, eval=FALSE>>=
pRolocdata()
@
will list the \Sexpr{nrow(pRolocdata()$results)} datasets that are
available in \Biocexptpkg{pRolocdata}.
%% $
Combining data is performed with the \texttt{combine} function. This
function will inspect the feature and sample names to identify how to
combine the data. As we want our replicates to be combined along the
columns (same proteins, different sets of channels), we need to assure
that the respective sample names differ so they can be identified from
one another. The function \texttt{updateSampleNames} can be used do
this.
<<samplenames>>=
identical(sampleNames(hyperLOPIT2015ms3r1), sampleNames(hyperLOPIT2015ms3r2))
hyperLOPIT2015ms3r1 <- updateSampleNames(hyperLOPIT2015ms3r1, 1)
hyperLOPIT2015ms3r2 <- updateSampleNames(hyperLOPIT2015ms3r2, 2)
sampleNames(hyperLOPIT2015ms3r1)
sampleNames(hyperLOPIT2015ms3r2)
@
In addition to matching names, the content of the feature metadata for
identical feature annotations must match exactly across the data to be
combined. In particular for these data, we expect the same proteins in
each replicate to be annotated with the same UniProt entry names and
descriptions, but not with the same coverage of number of peptides or
peptide-spectrum matches (PSMs).
<<fvarnames>>=
fvarLabels(hyperLOPIT2015ms3r1)
fvarLabels(hyperLOPIT2015ms3r2)
@
Below, we update the replicate specific feature variable names and
remove the shared annotation. In the first line, we update only the
feature variable names 3 to 5 (by appending a \texttt{1}) of the first
replicate and in the second line, we apply the
\texttt{updateFvarLabels} function to update all feature variable
names (by appending a \texttt{2}) of the second replicate. In lines 3
and 4, we retain the first 5 feature variables for the first replicate
and the relevant third to fourth variables for the second replicate.
<<fvarnames2>>=
fvarLabels(hyperLOPIT2015ms3r1)[3:5] <- paste0(fvarLabels(hyperLOPIT2015ms3r1)[3:5], 1)
hyperLOPIT2015ms3r2 <- updateFvarLabels(hyperLOPIT2015ms3r2, "2", sep = "")
fData(hyperLOPIT2015ms3r1) <- fData(hyperLOPIT2015ms3r1)[1:5]
fData(hyperLOPIT2015ms3r2) <- fData(hyperLOPIT2015ms3r2)[3:5]
fvarLabels(hyperLOPIT2015ms3r1)
fvarLabels(hyperLOPIT2015ms3r2)
@
We can now combine the two experiments into a single \texttt{MSnSet}:
<<combine, cache = cache>>=
combined <- combine(hyperLOPIT2015ms3r1, hyperLOPIT2015ms3r2)
combined
@
More details about combining data are given in the dedicated
\textit{Combining MSnSet instances} section of the \Biocpkg{MSnbase}
\href{http://bioconductor.org/packages/release/bioc/vignettes/MSnbase/inst/doc/MSnbase-demo.pdf}{tutorial
vignette}.
\subsubsection*{Missing data}
Missing data are a recurrent issue in mass spectrometry applications,
and should be addressed independently of this workflow
\cite{Webb-Robertson:2015,Lazar:2016}. In \cite{Gatto:2014b}, we have
described how a high content in missing values in spatial proteomics
data and their inappropriate handling leads to a reduction of
sub-cellular resolution. We can impute missing data using
\Biocpkg{MSnbase}'s \texttt{impute} function. The method underlying
the imputation method is then determined by a \texttt{methods}
parameter (see \texttt{?impute} for available options). To impute
missing values using nearest neighbour imputation, one would
<<impute, eval=FALSE>>=
hl <- impute(hl, method = "knn")
@
In our particular case, missing values are indicative of protein
groups that were not acquired in both replicates (Figure
\ref{fig:namap}, produced with the \texttt{image2} function).
\begin{figure}[!ht]
\centering
<<namap, out.width=".65\\textwidth">>=
image2(is.na(combined), col = c("black", "white"),
main = "Missing values (white cells) after combining replicates")
@
\caption{Heatmap of missing values. Note that the features are
re-ordered to highlight clusters of proteins with similar numbers of
missing values.}
\label{fig:namap}
\end{figure}
We prefer to remove proteins that were not assayed in both replicated
experiments. This is done with the \texttt{filterNA} function that
removes features (proteins) that contain more than a certain
proportion (default is 0) missing values. The \textit{Processing
information} section summarises how combining and filtering missing
values (subsetting) changed the dimensions of the data.
<<filterNA>>=
combined <- filterNA(combined)
combined
@
When more than 2 datasets are to be combined and too many proteins
have not been consistently assayed, leading to too many proteins being
filtered out, we suggest to implement an ensemble of classifiers
voting on protein-sub-cellular niche membership over the output of
several experiments (see section \textit{Supervised machine learning}
for the description of sub-cellular assignments).
\section*{Quality Control}
Data quality is routinely examined through visualisation to verify
that sub-cellular niches have been separated along the gradient. Based
on De Duve's principle \cite{DeDuve:1981} proteins that co-localise in
a cell, exhibit similar quantitation profiles across the gradient
fractions employed. One approach that has been widely used to
visualise and inspect high throughput mass spectrometry-based
proteomics data is principal components analysis (PCA). PCA is one of
many dimensionality reduction methods, that allows one to effectively
summarise multi-dimensional data in to 2 or 3 dimensions to enable
visualisation. Very generally, the original continuous
multi-dimensional data is transformed into a set of orthogonal
components ordered according to the amount of variability that they
describe. The \texttt{plot2D} and \texttt{plot3D} functions in
\Biocpkg{pRoloc} allows one to plot the principal components (PCs) of
a dataset against one another. By default, the first two components are
plotted on the x- and y-axis for the \texttt{plot2D} function, and
first three components are loaded for the \texttt{plot3D} function,
respectively (the \texttt{dims} argument can be used to plot other
PCs). If distinct clusters are observed, we assume that there is
organellar separation present in the data. Although, representing the
multi-dimensional data along a limited set of PCs does not give us a
hard quantitative measure of separation, it is extremely useful
summarising complex experimental information in one figure, to get a
simplified overview of the data.
In the code chunk below we produce a 2-dimensional PCA plot of the
mouse stem cell dataset (Figure \ref{fig:pcahl}). Each point on the
plot represents one protein. We can indeed see several distinct
protein clusters. We specify \texttt{fcol = NULL} to ignore feature
metadata columns and not annotate any feature (protein) with a
colour. We will see later how to use this argument to annotate the PCA
plot with prior information about sub-cellular localisation.
\begin{figure}[!ht]
\centering
<<qcplot, fig.show = "hold", message=FALSE, warning=FALSE, out.width=".45\\textwidth">>=
library("pRoloc")
plot2D(hl, fcol = NULL, col = "black")
plot2D(hl, method = "hexbin")
@
\caption{PCA plot of the mouse stem cell data \texttt{hl}. Each dot
represents a single protein, and cluster of proteins represent
proteins residing in the same sub-cellular niche. The figure on the
right bins proteins and represent the bins density to highlight the
presence of protein clusters. }
\label{fig:pcahl}
\end{figure}
In the first instance we advise one to visualise their data without
any annotation (i.e. with \texttt{fcol = NULL}), before proceeding with data
annotation. The identification of well resolved clusters in the data,
constitutes an unbiased assessment of the data structure,
demonstrating the successful separation of sub-cellular clusters.
It is also useful to visualise the relative intensities along the
gradient to identify channels displaying particularly low yield. This
can be done using the \texttt{plotDist} and \texttt{boxplot}
functions, that plot the protein profiles occupancy along the gradient
(we also display the mean channel intensities below) and a
\texttt{boxplot} of the column intensities. In the two plots
displayed on figure \ref{fig:qcbx}, we re-order the TMT channels to
pair corresponding channels in the two replicates (rather than
ordering the channels by replicate).
\begin{figure}[!ht]
\centering
<<qcbx, out.width="\\textwidth", fig.asp=1/2>>=
par(mfrow = c(1, 2), ## creates a two-panel figure
las = 2, ## axis labels orientation
cex.axis = .7) ## axis label size
o <- order(hl$Iodixonal.Density)
plotDist(hl[, o], pcol = "#00000010", xlab = "")
lines(colMeans(exprs(hl[, o])), col = "red", type = "b")
boxplot(exprs(hl[, o]))
@
\caption{Protein profiles and distribution of channel intensities. The
red dots represent the mean relative intensity for each channel. }
\label{fig:qcbx}
\end{figure}
\pagebreak
\section*{Markers}
In the context of spatial proteomics, a marker protein is defined as a
well-known resident of a specific sub-cellular niche in a species
\textit{and} condition of interest. Applying this to machine learning
(ML), and specifically supervised learning for the task of protein
localisation prediction, these markers constitute the labelled
training data to use as input to a classification analyses. Defining
well-known residents, and obtaining labelled training data for ML
analyses can be time consuming, but it is important to define markers
that are representative of the multivariate data space and on which a
classifier will be trained and generated. \Biocpkg{pRoloc} provides a
convenience function, \texttt{addMarkers}, to directly add markers to
an \texttt{MSnSet} object, as demonstrated in the code chunk
below. These marker sets can be accessed using the
\texttt{pRolocmarkers()} function. Marker sets are stored as a simple
named vector in R, and originate from in-house user-defined sets of
markers or from previous published studies \cite{Gatto:2014b}, which
are continuosly updated and integrated.
<<markers>>=
## List available marker sets
pRolocmarkers()
@
These markers can then be mapped to an \texttt{MSnSet}'s
\texttt{featureNames}. The mouse dataset used here has Uniprot IDs
stored as the \texttt{featureNames} (see
\texttt{head(featureNames(hl))}) and the names of the vector of the
mouse markers stored in \Biocpkg{pRoloc} (\texttt{mmus} markers) are
also Uniprot IDs (see \texttt{head(mrk)} in the code chunk below, that
displays the 6 first markers), so it is straightforward to match names
between the markers and the \texttt{MSnSet} instance using the
\texttt{addMarkers} function.
<<addmarkers>>=
## Use mouse markers
mrk <- pRolocmarkers(species = "mmus")
head(mrk)
## Add mouse markers
hl <- addMarkers(hl, mrk)
@
We recommend at least 13 markers per sub-cellular class (see the
\textit{Optimisation} section for details about the algorithmic
motivation of this number). Markers should be chosen to confidently
represent the distribution of genuine residents of a sub-cellular
niche. We generally recommend a conservative approach in defining
markers to avoid false assignments when assigning sub-cellular
localisation of proteins of unknown localisation. A more relaxed
definition of markers, i.e. one that broadly or over-confidently
defines markers, risks the erroneous assignment of proteins to a
single location, when, in reality, they reside in multiple locations
(including the assumed unique location). One can not expect to
identify exact boundaries between sub-cellular classes through marker
annotation alone; the definition of these boundaries is better handled
algorithmically, i.e. after application of the supervised learning
algorithm, using the prediction scores (as described in the
\textit{Classification} section, in particular Figure
\ref{fig:plotSVM}).
If the protein naming between the marker sets and the \texttt{MSnSet}
dataset are different e.g. the markers are labelled by Uniprot
accession numbers and the dataset entries are labelled by Uniprot
entry names, one will have to convert and match the proteins according
to the appropriate identifier. Sometimes, we find the equivalent entry
name, Uniprot ID or accession number is stored in the feature
metadata, which makes conversion between identifers relatively
straightforward. If this is not the case however, conversion can be
performed using \Biocpkg{biomaRt}, the Bioconductor
\href{http://bioconductor.org/help/workflows/annotation/Annotation_Resources/}{annotation
resources} or any conversion softwares available online.
\subsection*{Adding user-defined marker lists}
It is also possible for users to use their own marker list with the
\texttt{addMarkers} function. The user needs to create a named vector
of marker localisation, or a create a csv file with two columns (one
for the protein names, one for the corresponding sub-cellular marker
annotation) and pass the vector or file name respectively to the
function. As previously mentioned, the protein names of these markers
must match some (but not necessarily all) of the \texttt{MSnSet}'s
feature names. See \texttt{?addMarkers} for more details.
In general, the Gene Ontology (GO) \cite{Ashburner:2000}, and in
particular the cellular compartment (CC) namespace are a good starting
point for protein annotation and marker definition. It is important to
note however that automatic retrieval of sub-cellular localisation
information, from \Biocpkg{pRoloc} or elsewhere, is only the
beginning in defining a marker set for downstream analyses. Expert
curation is vital to check that any annotation added is in the correct
context for the biological question under investigation.
\subsection*{Visualising markers}
Having added the mouse markers to our \texttt{fData} from the
\texttt{pRolocmarkers}, we can now visualise these annotations on
the PCA plot using the \texttt{plot2D} function and then use the
\texttt{addLegend} function to map the marker classes to the
pre-defined colours. As previously mentioned, PCA transforms the
original high dimensional data into a set of linearly uncorrelated
principal components (PCs) such that the first accounts for as much
variability in the data as possible and each succeeding component in
turn has the highest variance possible under the constraint that it
be orthogonal to the preceding components. We saw in the previous
section how visualisation of the PCs is useful for quality control and checking
organelle seperation. Adding marker definiton allows one to quickly
see if known residents appear in defined clusters. One must be
careful though as different organelles may be resolved in different
dimensions. For example, we can display the data along the first and
seventh PCs using the \texttt{dims} argument. Note that in these
calls to the \texttt{plot2D} function, we have omitted the
\texttt{fcol} argument and so by default the feature variable named
\texttt{"markers"} is used to annotate the plot. We choose to display PCs 1
and 7 to illustrate that while upper principal components explain
much less variability in the data (2.23\% for PC7, as opposed to
48.41\% for PC1), we see that the mitochondrial (purple) and
peroxisome (dark blue) clusters can be differentiated, despite the
apparent overlap in the visualisation of the two first PCs (Figure
\ref{fig:plotmarkers}).
\begin{figure}[!ht]
\centering
<<plotmarkers, out.width="\\textwidth", fig.width=12, fig.asp=1/2>>=
par(mfrow = c(1, 2))
plot2D(hl, main = "pRolocmarkers for mouse")
addLegend(hl, cex = .6)
plot2D(hl, dims = c(1, 7), main = "Marker resolution along PC 1 and 7")
@
\caption{Annotated PCA plots of the \texttt{hl} dataset, as produced
with \texttt{plot2D}.}
\label{fig:plotmarkers}
\end{figure}
This is further highlighted if we plot the
profiles of these two clusters using the \texttt{plotDist} function
(Figure \ref{fig:plotDist2}). The \texttt{plotDist} function is
another useful visualisation that relies on marker annotation. It
allows one to represent the protein profiles occupancy along the
gradient. While the PCA
plot enables efficient visualisation of the complete dataset and
assessment the relative separation of different sub-cellular niches,
comparing profiles of a few marker clusters is useful to assess how
exactly they differ (in terms of peak channels, for example). On
figure \ref{fig:plotDist2}, we plot the profiles of the
mitochondrial and peroxisome markers to highlight the differences
in the channels labelled with tag 129C, also represented above
along the 7th PC on the PCA plot on figure \ref{fig:plotmarkers}.
\begin{figure}[!ht]
\centering
<<plotDist, out.width=".6\\textwidth", fig.asp=3/5>>=
hlo <- hl[, order(hl$Iodixonal.Density)]
plotDist(hlo[fData(hlo)$markers == "Mitochondrion", ],
pcol = "purple", fractions = "Tag")
title(main = "Marker occupancy profiles along the gradient")
matlines(t(exprs(hlo[fData(hlo)$markers == "Peroxisome", ])),
lty = 1, col = "darkblue", type = "l")
legend("topleft", c("Mitochondrion", "Peroxisome"),
lty = 1, col = c("purple", "blue"), bty = "n")
@
\caption{Mitochondrion and peroxisome protein profiles.}
\label{fig:plotDist2}
\end{figure}
The data can also be visualised along three PCs using the
\texttt{plot3D} function (Figure \ref{fig:plotmarkers3d}). When
produced interactively, the plot can be rotated and zoomed using the
mouse.
\begin{figure}[htb]
\centering
<<plotmarkers3d, eval=FALSE>>=
plot3D(hl, dims = c(1, 2, 7))
@
\includegraphics[width=.55\textwidth]{./Figures/plot3d.pdf}
\caption{Using the \texttt{plot3D} function to visualise the
\texttt{hl} dataset along PCs 1, 2 and 7. }
\label{fig:plotmarkers3d}
\end{figure}
The default colours for plotting have been defined so as to enable the
differentiation of up to 30 classes. If more are provided, different
character symbols (circles, squares, ... and empty and solid symbols)
are used. The colours and the default plotting characters (solid dots
for the markers and empty circles for the features of unknown
localisation) can of course be changed, as described in the
\texttt{setStockcol} manual page.
As demonstrated in \cite{hyper} and illustrated in the PCA plot
(Figure \ref{fig:plotmarkers}), the Golgi apparatus proteins (dark
brown) display a dynamic pattern, noting sets of Golgi marker proteins
that are distributed amongst other subcellular structures, an
observation supported by microscopy. As such, we are going to reset
the annotation of Golgi markers to unknown using the
\texttt{fDataToUnknown} function. It is often used to replace empty
strings (\texttt{""}) or missing values in the markers definition to a
common definition of \textit{unknown} localisation.
<<removeGA>>=
hl <- fDataToUnknown(hl, from = "Golgi apparatus", to = "unknown")
getMarkers(hl)
@
\pagebreak
\subsection*{Features of interest}
In addition to adding annotation using the \texttt{addMarkers}
function, one can store specific sets of proteins by using the
\textit{Features of interest} infrastructure from the
\Biocpkg{MSnbase} package. If users have specific subsets of proteins
they wish to highlight in their data (possibly across multiple
experiments) they would first create a \texttt{FeaturesOfInterest}
object and then use the \textit{highlightOnPlot} function to
visualise these. For example, if we wanted to highlight
proteins with the accession numbers Q8CG48, Q8CG47, Q8K2Z4, and
Q8C156, which are some of the proteins known to form part of the
13S condensin complex, we would call the code displayed
on Figure \ref{fig:foi}. Users can also create several sets of \texttt{FeaturesOfInterest}
object and store them in a \texttt{FoICollection}.
\begin{figure}[!ht]
\centering
<<foi, fig.width=10, out.width=".8\\textwidth">>=
prots <- c("Q8CG48", "Q8CG47", "Q8K2Z4", "Q8C156")
foi13s <- FeaturesOfInterest(description = "13S consensin proteins",
fnames = prots,
object = hl)
foi13s
plot2D(hl)
addLegend(hl, cex = .6)
highlightOnPlot(hl, foi13s)
@
\caption{Highlighting protein features of interest.}
\label{fig:foi}
\end{figure}
\bigskip
It is also worthy of note that it is possible to search for a
specific protein of interest by \texttt{featureNames} or using any
identifying information found in the \texttt{fData} columns by using
the search box on the \texttt{pRolocVis} application part of the
\Biocpkg{pRolocGUI} package (see section on interactive
visualisation). This can be handy for quickly searching and
highlighting proteins on the fly, the disavanatge here is that
proteins can only be searched for a one-by-one basis.
\pagebreak
\section*{Replication}
With the aim of maximising the sub-cellular resolution and,
consequently, the reliability in protein sub-cellular assignments, we
follow the advice in \cite{Trotter:2010} and combine replicated spatial
proteomics experiments as described above. Indeed, Trotter \textit{et
al.} have shown a significant improvement in protein–organelle
association upon direct combination of single experiments, in
particular when these resolve different subcellular niches.
Direct comparisons of individual channels in replicated experiments
do not provide an adequate, goal-driven assessment of different
experiments. Indeed, due to the nature of the experiment and gradient
fraction collection, the quantitative channels do not correspond to
identical selected fractions along the gradient. For example, in the
table below (taken from \texttt{hl}'s \texttt{pData}) TMT channels
127C (among others) in both replicates originate from different sets
of gradient fractions (gradient fractions 7 - 9 and 8 - 9 for each
replicate, respectively). Different sets of gradient fractions are
often pooled to obtain enough material and optimise acurate
quantitation.
<<pdtab, echo=FALSE, eval = TRUE, results = 'asis'>>=
i <- c(3, 13)
xtable(pData(hl)[i, ],
caption = "Differences in gradient fraction pooling.",
label = "tab:pdtab")
@
\pagebreak
The more relevent comparison unit is not a single channel, but rather
the complete protein occupancy profiles, which are best visualised
experiment-wide on a PCA plot. As such, we prefer to focus on the
direct, qualitative comparison of individual replicate PCA plots,
assuring that each displays acceptable sub-cellular resolution. Note
that in the code chunk below, we mirror the x-axis to represent the
two figures with the same orientation. The interactive "compare"
application part of the \Biocpkg{pRolocGUI} package is also useful for
examining replicate experiments (see the next section
\textit{Interactive visualisation} for details).
\begin{figure}[!ht]
\centering
<<plot2Drep, fig.width=12, out.width="\\textwidth", fig.asp=1/2>>=
par(mfrow = c(1, 2))
plot2D(hl[, hl$Replicate == 1], main = "Replicate 1")
plot2D(hl[, hl$Replicate == 2], main = "Replicate 2", mirrorX = TRUE)
@
\caption{PCA plots of replicates 1 and 2}
\label{fig:plot2Drep}
\end{figure}
In addition, the reproducibility can be assessed by performing
independent classification analyses on each replicate (see the section
on \textit{Supervised machine learning} below) and comparing the the
results. Even when the gradient conditions different (for unexpected
technical or voluntary reasons, to maximise resolution when combining
experiments \cite{Trotter:2010}), one expects agreement in the most
confident organelle assignments.
\pagebreak
\section*{Interactive visualisation}
Visualisation and data exploration is an important aspect of data
analyses allowing one to shed light on data structure and patterns of
interest. Using the \Biocpkg{pRolocGUI} package, we can interactively
visualise, explore and interrogate quantitative spatial proteomics
data. The \Biocpkg{pRolocGUI} package relies on the \texttt{shiny}
framework for reactivity and interactivity. It distributes 3 different
GUI's (\textit{main} (default), \textit{compare} or \textit{classify})
which are wrapped and launched by the \texttt{pRolocVis} function.
\subsubsection*{The main application}
In the below code chunk we lauch the main
app (note, we do not need to specify the argument, \texttt{app =
"main"} as it is the default).
<<mainapp, eval = FALSE, echo = TRUE>>=
library("pRolocGUI")
pRolocVis(hl)
@
\begin{figure}[!ht]
\centering
\includegraphics[width=\textwidth]{./Figures/mainapp.png}
\caption{A screen shot of clickable interface and zoomable PCA plot
of the main app in the \Biocpkg{pRolocGUI} package. }
\label{fig:app}
\end{figure}
As diplayed in the screenshot in figure \ref{fig:app}, the
\textit{main} application is designed for exploratory data analysis
and is divided into 3 tabs: (1) PCA, (2) Profiles and (3) Table
selection. The default view upon loading is the \textit{PCA} tab,
which features a clickable interface and zoomable PCA plot with an
interactive data table for displaying the quantitation information.
Particular proteins of interest can be highlighted using the text
search box. There is also a \textit{Profiles} tab for visualisation of
the protein profiles, which can be used to examine the patterns of
proteins of interest. The \textit{Table selection} tab provides an
interface to control data table column selection. A short
animation\footnote{\url{https://github.com/lmsimp/bioc-pRoloc-hyperLOPIT-workflow/blob/master/Figures/pRolocVis\_pca.gif}}
illustrating the interface is available in the manuscript
repository~\cite{ghrepo}.
\subsubsection*{The compare application}
The \textit{compare} application is useful for examining two replicate
experiments, or two experiments from different conditions, treatments
etc. The compare application is called by default if the input object
to \texttt{pRolocVis} is an \texttt{MSnSetList} of 2 \texttt{MSnSets},
but it can also be specified by calling the argument \texttt{app =
"compare"}. For example, in the code chunk below we first create an
\texttt{MSnSetList} of replicates 1 and 2 of the hyperLOPIT data, this
is then passed to \texttt{pRolocVis}.
<<compareapp, eval = FALSE, echo = TRUE>>=
data(hyperLOPIT2015ms3r1)
data(hyperLOPIT2015ms3r2)
hllst <- MSnSetList(list(hyperLOPIT2015ms3r1, hyperLOPIT2015ms3r2))