isobaric_quantification.Rmd

# Isobaric Quantification Pipelines {#isobaric-quantification-pipelines}

<!---
TODO:
* Overview of isobaric labeling
--->
<!---
Keep in mind that there are many other functions available in `PlexedPiper` and `MSnID` that will not be covered that can be used to add additional columns or further filter the results. This section just showcases the foundations of the pipelines.
--->


## Global Proteomics Data {#global-proteomics-data}

This pipeline shows how to process TMT data that is processed outside of PNNL's DMS. Section \@ref(phosphoproteomics-data) shows how to process data from the DMS. For convenience, the results of MS-GF+ and MASIC processing are provided in a companion `PlexedPiperTestData` package. In addition, we will need `PlexedPiper` for isobaric quantification, `dplyr` to manipulate data frames, and `MSnbase` to create MSnSet objects.

```{r include=FALSE}
# Global chunk options
knitr::opts_chunk$set(message = FALSE, warning = FALSE, fig.align='center')
```

```{r iso-lab-setup}
# Setup
library(PlexedPiper)
library(PlexedPiperTestData)
library(dplyr)
```

```{r include=FALSE}
library(ggplot2) # plotting
library(knitr) # embed images
library(kableExtra)
```

The pipeline can be broken up into 4 major chunks: prepare MS/MS identifications, prepare reporter ion intensities, create a quantitative cross-tab, and create an MSnSet object.

### Read MS-GF+ Data

The first step in the preparation of the MS/MS identifications is to fetch the data. In this case, the data exists in the PlexedPiperTestData package in a local folder, so we use `system.file` to get the file path and `read_msgf_data` to read the MS-GF+ output.

```{r read-msgf, results='hide'}
# Get file path
path_to_MSGF_results <- system.file("extdata/global/msgf_output", 
                                    package = "PlexedPiperTestData")
# Read MS-GF+ data from path
msnid <- read_msgf_data(path_to_MSGF_results)
```

Normally, this would display a progress bar in the console as the data is being fetched. However, the output was suppressed to save space. We can view a summary of the MSnID object with the `show()` function.

```{r}
show(msnid)
```

This summary tells us that `msnid` consists of 4 spectrum files (datasets), and contains a total of 1,156,754 peptide-spectrum-matches (PSMs), 511,617 total peptides, and 128,378 total accessions (proteins). The reported FDR is the empirical **false-discovery rate**, which is calculated as the ratio of the number of false (decoy) PSMs, peptides, or accessions to their true (non-decoy) counterparts. Calculation of these counts and their FDRs is shown below.

```{r eval=FALSE}
# How to calculate the counts and FDRs from the show() output
## Spectrum Files
# Count
psms(msnid) %>% 
  distinct(Dataset) %>%
  nrow() # 48

## PSMs
# Count
psms(msnid) %>% 
  distinct(Dataset, Scan, peptide, isDecoy) %>%
  # Assign intermediate to variable
  assign("x_psm", ., envir = globalenv()) %>% 
  nrow() # 1156754
# FDR
nrow(x_psm[x_psm$isDecoy == TRUE, ]) / 
  nrow(x_psm[x_psm$isDecoy == FALSE, ]) 
# 0.3127463 = 31%

## peptides
# Count
psms(msnid) %>% 
  distinct(peptide, isDecoy) %>% 
  assign("x_peptide", ., envir = globalenv()) %>%
  nrow() # 511617
# FDR
nrow(x_peptide[x_peptide$isDecoy == TRUE, ]) / 
  nrow(x_peptide[x_peptide$isDecoy == FALSE, ])
# 0.611245 = 61%

## accessions
# Count
length(accessions(msnid)) # or
psms(msnid) %>% 
  distinct(accession, isDecoy) %>% 
  assign("x_acc", ., envir = globalenv()) %>%
  nrow() # 128378
# FDR
nrow(x_acc[x_acc$isDecoy == TRUE, ]) / 
  nrow(x_acc[x_acc$isDecoy == FALSE, ])
# 0.9827024 = 98%
```

Now that we have an `MSnID` object, we need to process it. We begin by correcting for the isotope selection error.

### Correct Isotope Selection Error 

Carbon has two stable isotopes: $^{12}\text{C}$ and $^{13}\text{C}$, with natural abundances of 98.93% and 1.07%, respectively. That is, we expect that about 1 out of every 100 carbon atoms is naturally going to be a $^{13}\text{C}$, while the rest are $^{12}\text{C}$. In larger peptides with many carbon atoms, it is more likely that at least one atom will be a $^{13}\text{C}$ than all atoms will be $^{12}\text{C}$. In cases such as these, a non-monoisotopic ion will be selected by the instrument for fragmentation.

```{r MS1_peak, echo=FALSE, fig.cap="MS1 spectra with peak at non-monoisotopic precursor ion."}
include_graphics("images/MS1_non_monoisotopic.PNG")
```

In Figure \@ref(fig:MS1_peak), the monoisotopic ion (m/z of 1427.29) is not the most abundant, so it is not selected as the precursor. Instead, the ion with a $^{13}\text{C}$ in place of a $^{12}\text{C}$ is selected for fragmentation. We calculate the mass difference between these two ions as the difference between the mass-to-charge ratios multiplied by the ion charge. In this case, the mass difference is 1 Dalton, or about the difference between $^{13}\text{C}$ and $^{12}\text{C}$. (More accurately, the difference between these isotopes is 1.0033548378 Da.) While MS-GF+ is still capable of correctly identifying these peptides, the downstream calculations of mass measurement error need to be fixed because they are used for filtering later on (Section \@ref(global-peptide-filter)). The `correct_peak_selection` function corrects these mass measurement errors, and Figure \@ref(fig:mass-to-charge-diff) shows the distribution of the mass measurement errors before and after correction.

```{r echo=FALSE}
# m/z difference
delta_mz_pre <- msnid$experimentalMassToCharge - msnid$calculatedMassToCharge
# Mass difference
delta_mass_pre <- delta_mz_pre * msnid$chargeState # m/z * z = m
```

```{r}
# Correct for isotope selection error
msnid <- correct_peak_selection(msnid)
```

```{r mass-to-charge-diff, fig.cap="Histogram of mass measurement errors before and after correction.", fig.asp=0.5, echo=FALSE}
# m/z difference
delta_mz_post <- msnid$experimentalMassToCharge - msnid$calculatedMassToCharge
# Mass difference
delta_mass_post <- delta_mz_post * msnid$chargeState # m/z * z = m

delta_mass_df <- data.frame(delta_mass = c(delta_mass_pre, delta_mass_post),
                            group = rep(c("pre", "post"), 
                                        each = length(delta_mass_post))) %>% 
  mutate(group = factor(group, levels = c("pre", "post")))

facet_labs <- c("Before Correction", "After Correction")
names(facet_labs) <- c("pre", "post")

# # Plot mass measurement error
ggplot(delta_mass_df) +
  geom_histogram(aes(x = delta_mass), binwidth = 0.01) +
  facet_grid(cols = vars(group), labeller = labeller(group = facet_labs),
             scales = "free_x") +
  scale_y_continuous(name = "Count", labels = scales::label_scientific(),
                     expand = expansion(mult = c(0, 0.05))) +
  labs(x = "Mass Measurement Error") +
  theme_bw(base_size = 12) +
  theme(strip.background = element_blank(),
        strip.text = element_text(size = 14, face = "bold"))
```


### Remove Contaminants 

Now, we will remove contaminants such as the trypsin that was used for protein digestion. We can use `grepl` to search for all accessions that contain the string `"Contaminant"`. Displaying these contaminants is not necessary during processing. This is just for demonstration purposes to see what will be removed.

```{r}
# All unique contaminants
accessions(msnid)[grepl("Contaminant", accessions(msnid))]
```

To remove contaminants, we use `apply_filter` with an appropriate character string that tells the function what rows to keep. In this case, we keep rows where the accession does not contain "Contaminant". We will use `show` to see how the counts change.

```{r}
# Remove contaminants
msnid <- apply_filter(msnid, "!grepl('Contaminant', accession)")
show(msnid)
```

We can see that the number of PSMs decreased by about 1300, peptides by ~400, and proteins by 25 (the 25 contaminants that were displayed).

### MS/MS ID Filter: Peptide Level {#global-peptide-filter}

The next step is to filter the MS/MS identifications such that the empirical peptide-level FDR is less than some threshold and the number of MS/MS IDs is maximized. We will use the $-log_{10}$ of the `PepQValue` column as one of our filtering criteria and assign it to a new column in `psms(msnid)` called `msmsScore`. The `PepQValue` column is the MS-GF+ Spectrum E-value, which reflects how well the theoretical and experimental fragmentation spectra match; therefore, high values of `msmsScore` indicate a good match (see Figure \@ref(fig:plot-msmsScore)).

```{r plot-msmsScore, fig.cap="Density plot of msmsScore.", fig.asp=0.5, echo=FALSE}
msmsScore <- -log10(msnid$PepQValue) # > 2
absParentMassErrorPPM <- abs(mass_measurement_error(msnid)) # < 10
isDecoy <- msnid$isDecoy

# Plot msmsScore
ggplot() +
  geom_density(aes(x = msmsScore, color = isDecoy), 
               fill = NA, na.rm = TRUE, size = 1.4) +
  scale_y_continuous("Density", expand = expansion(mult = c(0, 0.05))) +
  scale_x_continuous(expand = expansion(0)) +
  scale_color_manual(values = c("orange", "skyblue"), breaks = c(TRUE, FALSE)) +
  theme_bw(base_size = 12)
```
</br>

The other filtering criteria is the absolute deviation of the mass measurement error of the precursor ions in parts-per-million (ppm), which is assigned to the `absParentMassErrorPPM` column in `psms(msnid)` (see Figure \@ref(fig:plot-mass-error)).

```{r plot-mass-error, fig.cap="Density plot of absParentMassErrorPPM.", echo=FALSE, fig.asp=0.5}
ggplot() +
  geom_density(aes(x = absParentMassErrorPPM, color = isDecoy), 
               fill = NA, na.rm = TRUE, size = 1.4) +
  scale_y_continuous("Density", expand = expansion(mult = c(0, 0.05))) +
  scale_x_continuous(expand = expansion(0)) +
  scale_color_manual(values = c("orange", "skyblue"), breaks = c(TRUE, FALSE)) +
  theme_minimal(base_size = 12)
```

</br>

Now, we will filter the PSMs.

```{r global-fdr-filter-peptide}
# 1% FDR filter at the peptide level
msnid <- filter_msgf_data(msnid,
                          level = "peptide",
                          fdr.max = 0.01)
show(msnid)
```

We can see that filtering drastically reduces the number of PSMs, and the empirical peptide-level FDR is now 1%. However, notice that the empirical protein-level FDR is still fairly high.

### MS/MS ID Filter: Protein Level 

Now, we need to filter proteins so that the FDR is at most 1%. A while ago, the proteomics field established the hard-and-fast two-peptides-per-protein rule. That is, the confident identification of a protein requires the confident identification of at least 2 peptides. This rule penalizes short proteins and doesn't consider that there are some very long proteins (e.g. Titin 3.8 MDa) that easily have more then two matching peptides even in the reversed sequence. Thus, we propose to normalize the number of peptides per protein length and use that as a filtering criterion (Figure \@ref(fig:plot-num-pep)).

In order to get the protein lengths, we need the FASTA (pronounced FAST-AYE) file that contains the protein sequences used in the database search. The first three entries of the FASTA file are shown in Figure \@ref(fig:fasta-ex).

```{r fasta-ex, echo=FALSE, fig.cap="First three entries of the FASTA file."}
include_graphics("images/FASTA_example_MoTrPAC.PNG")
```

</br>

For each protein, we divide the number of associated peptides by the length of that protein and multiply this value by 1000. This new `peptides_per_1000aa` column is used as the filter criteria.

```{r message=FALSE, warning=FALSE}
# Get path to FASTA file
path_to_FASTA <- system.file(
  "extdata/Rattus_norvegicus_NCBI_RefSeq_2018-04-10.fasta.gz", 
  package = "PlexedPiperTestData"
)

# Compute number of peptides per 1000 amino acids
msnid <- compute_num_peptides_per_1000aa(msnid, path_to_FASTA)
```

```{r plot-num-pep, fig.cap="Density plot of peptides_per_1000aa. The plot area has been zoomed in.", echo=FALSE, fig.asp=0.5}
peptides_per_1000aa <- msnid$peptides_per_1000aa
isDecoy <- msnid$isDecoy

ggplot() +
  geom_density(aes(x = peptides_per_1000aa, color = isDecoy), 
               fill = NA, na.rm = TRUE, size = 1.4) +
  scale_y_continuous("Density", expand = expansion(mult = c(0, 0.05))) +
  scale_x_continuous("peptides_per_1000aa", expand = expansion(0)) +
  scale_color_manual(values = c("orange", "skyblue"), breaks = c(TRUE, FALSE)) +
  theme_minimal(base_size = 12) +
  coord_cartesian(xlim = c(0, 250)) +
  guides(color = guide_legend(title = "isDecoy"))
```
</br>

Now, we filter the proteins to 1% FDR.

```{r message=FALSE, warning=FALSE}
# 1% FDR filter at the protein level
msnid <- filter_msgf_data(msnid,
                          level = "accession",
                          fdr.max = 0.01)
show(msnid)
```


### Inference of Parsimonious Protein Set 

The situation when a certain peptide sequence matches multiple proteins adds complication to the downstream quantitative analysis, as it is not clear which protein this peptide is originating from. There are common ways for dealing with this. One is to simply retain uniquely matching peptides and discard shared peptides (`unique_only = TRUE`). Alternatively, assign the shared peptides to the proteins with the larger number of uniquely mapping peptides (`unique_only = FALSE`). If there is a choice between multiple proteins with equal numbers of uniquely mapping peptides, the shared peptides are assigned to the first protein according to alphanumeric order (Figure \@ref(fig:parsimony)). 

<!---
This step could be done prior to filtering at the accession level, but if peptides are assigned to a low-confidence protein, and that protein is removed during filtering, those peptides will be lost. Instead, it is better to filter to the set of confidently-identified proteins and then determine the parsimonious set.
--->

```{r parsimony, echo=FALSE, fig.cap="Visual explanation of the inference of the parsimonious protein set."}
include_graphics("images/parsimonious-protein-set-inference.PNG")
```

</br>

```{r}
# Inference of parsimonious protein set
msnid <- infer_parsimonious_accessions(msnid, unique_only = FALSE)
show(msnid)
```

Notice that the protein-level FDR increased slightly above the 1% threshold. In this case, the difference isn't significant, so we can ignore it. 

*Note:*

If the peptide or accession-level FDR increases significantly above 1% after inference of the parsimonious protein set, consider lowering the FDR cutoff (for example, to 0.9%) and redoing the previous processing steps. Filtering at the peptide and accession level should each be done a single time.


### Remove Decoy PSMs

The final step in preparing the MS/MS identifications is to remove the decoy PSMs. We use the `apply_filter` function again and only keep entries where `isDecoy` is `FALSE`.

```{r}
# Remove Decoy PSMs
msnid <- apply_filter(msnid, "!isDecoy")
show(msnid)
```

After processing, we are left with 444,345 PSMs, 90,232 peptides, and 5,196 proteins. Table \@ref(tab:global-msnid-table) shows the first 6 rows of the processed MS-GF+ output.

```{r global-msnid-table, echo=FALSE}
kable(head(psms(msnid), 6), digits = 3, row.names = FALSE, 
      caption = "<left>First 6 rows of the processed MS-GF+ results.</left>",
      escape = FALSE, format = "html") %>% 
  kable_styling(full_width = FALSE, font_size = 12,
                bootstrap_options = c("hover", "condensed"))
```

</br>


### Read MASIC Output 

MASIC is a tool for extracting ion intensities. With proper parameter settings, it can be used for extracting TMT (or iTRAQ) reporter ion intensities. In addition, it reports a number of other helpful metrics. Notably, the interference score at the precursor ion level and the signal-to-noise ratio (S/N) at the reporter ion level (computed by Thermo software). The interference score reflects the proportion of the ion population that was isolated for fragmentation that is due to the targeted ion. In other words, `1 - InterferenceScore` is due to co-isolated species that have similar elution time and precursor ion m/z. The first step in the preparation of the reporter ion intensity data is to read the MASIC results. We use a local file path and the `read_masic_data` function. By default, the interference score is not included, so we need to set that argument to `TRUE` in order to filter the results.

```{r results='hide'}
# Path to MASIC data
path_to_MASIC_results <- system.file("extdata/global/masic_output", 
                                     package = "PlexedPiperTestData")

# Read MASIC data
masic_data <- read_masic_data(path_to_MASIC_results, interference_score = TRUE)
```

Normally, this would display progress bars in the console as the data is being fetched. However, the output was suppressed to save space.

```{r global-masic-table, echo=FALSE}
kable(head(masic_data, 6), digits = 3, row.names = FALSE, 
      caption = "<left>First 6 rows of the MASIC data.</left>",
      escape = FALSE, format = "html") %>% 
  kable_styling(full_width = FALSE, font_size = 12,
                bootstrap_options = c("hover", "condensed"))
```

</br>

Table \@ref(tab:global-masic-table) shows the first 6 rows of `masic_data`.


### Filter MASIC Data 

The only other step in reporter ion intensity data preparation is to filter the results. Currently, we recommend keeping entries where at least 50% of the ion population is due to the targeted ion (interference score $\geq$ 0.5) and not filtering by S/N.

```{r}
# Filter MASIC data
masic_data <- filter_masic_data(masic_data, 
                                interference_score_threshold = 0.5,
                                s2n_threshold = 0)
```

### Create Study Design Tables {#fetch-study-design-tables}

To convert from PSMs and reporter ion intensities to meaningful quantitative data, it is necessary to know what are the samples in the reporter channels and what is the intended reference channel (or combination of channels). The entire study design is captured by three tables - fractions, samples, references. With newly processed data, these typically do not exist, and must be created. The next sections show how to create these tables.

#### Fractions 

The fractions table consists of two columns: `Dataset` and `PlexID`. The `Dataset` column contains all of the unique datasets from `msnid$Dataset` or `masic_data$Dataset`. The `PlexID` column contains the plex ID associated with each dataset, and is typically an "S" followed by a number ("S1", "S2", etc.). We can extract the plex ID from the datasets. In this case, the plex ID always comes after "_W_", so we can use a regular expression (regex) to capture it (the first argument of `gsub`). The regex below says to capture an "S" followed by a single digit that appears after "_W_" and before an underscore.

```{r}
# Create fractions table
fractions <- data.frame(Dataset = unique(masic_data$Dataset)) %>% 
  mutate(PlexID = gsub(".*_W_(S\\d{1})_.*", "\\1", Dataset))
```

```{r fractions-table, echo=FALSE}
kable(head(fractions, 10), row.names = FALSE, 
      caption = "<left>First 10 rows of the fractions table.</left>",
      escape = FALSE, format = "html") %>% 
  kable_styling(full_width = FALSE, font_size = 12,
                bootstrap_options = c("hover", "condensed"))
```

</br>

Table \@ref(tab:fractions-table) shows the first 10 rows of `fractions`.


#### Samples 

The samples table contains columns `PlexID`, `QuantBlock`, `ReporterName`, `ReporterAlias`, and `MeasurementName`. The plex ID must be the same as the plex ID in the `fractions` table. `ReporterName` is the reporter ion name ("126", "127N", "127C", etc.). `ReporterAlias` is the intermediate between `ReporterName` and `MeasurementName` and is used for defining the reference. `MeasurementName` determines the column names for the final cross-tab, and must be unique and begin with a letter. Finally, `QuantBlock` can be thought of as a way of defining sub-plex. In a typical TMT experiment, `QuantBlock` is always 1. In case of 5 pairwise comparisons within TMT10, there will be 5 QuantBlocks (1-5) with a reference for each `QuantBlock`.

For this experiment, channel 131 will serve as the reference, so we set `MeasurementName` to `NA` when `ReporterName` is `"131"`. This will make the reference channel absent from the quantitative cross-tab. In cases where reporter ion intensities are not normalized by a reference channel (reference = 1) or they are normalized by the average of select channels, do not set any `MeasurementName` to `NA`.

```{r}
# Create samples table
samples <- read.delim("data/MoTrPAC_pilot_TMT_labeling.txt") %>% 
  dplyr::rename(ReporterName = TMT10_channel, 
                ReporterAlias = sample_ID) %>% 
  mutate(QuantBlock = 1,
         MeasurementName = ifelse(ReporterAlias == "ref", NA, ReporterAlias))
```

```{r samples-table, echo=FALSE}
kable(head(samples, 10), row.names = FALSE, 
      caption = "<left>First 10 rows of the samples table.</left>",
      escape = FALSE, format = "html") %>% 
  kable_styling(full_width = FALSE, font_size = 12,
                bootstrap_options = c("hover", "condensed"))
```

</br>

Table \@ref(tab:samples-table) shows the first 10 rows of `samples`.


#### References 

Reference can be a certain channel, average of multiple channels, or 1. The general form is an expression with `ReporterAlias` names as variables. It is evaluated for each `PlexID`/`QuantBlock` combination and applied to divide reporter ion intensities within corresponding `PlexID`/`QuantBlock`.

```{r}
# Create references table
references <- samples %>% 
  # Filter to reference channel (ReporterName == "131", ReporterAlias == "ref")
  filter(ReporterName == "131") %>% 
  # Select required columns and rename ReporterAlias to Reference
  select(PlexID, Reference = ReporterAlias, QuantBlock)
```

```{r references-table, echo=FALSE}
kable(references, row.names = FALSE, 
      caption = "<left>References table.</left>",
      escape = FALSE, format = "html") %>% 
  kable_styling(full_width = FALSE, font_size = 12,
                bootstrap_options = c("hover", "condensed"))
```

</br>

Table \@ref(tab:references-table) shows the first 10 rows of `references`. The code to use the geometric average instead of a single channel as the reference is shown below. The geometric average is the product of the reporter ion channels to the power of (1/number of channels). For each `PlexID` group, collapse the vector of reporter ion names with `*`, surround them in parentheses, and raise to the power of (1/number of channels).

```{r eval=FALSE}
# Use geometric average as reference
references <- samples %>%
  group_by(PlexID, QuantBlock) %>%
  summarise(Reference = sprintf("(%s)^(1/%d)", 
                                paste(ReporterAlias, collapse = "*"), n()))

# Do not normalize by reference channel (use 1 as the reference)
references <- samples %>% 
  distinct(PlexID, QuantBlock) %>% 
  mutate(Reference = 1)
```

Now that we have the three study design tables, we should save them.

<!---
TODO:
Explain how to add the study design tables to the DMS.
--->

```{r}
# Save study design tables
write.table(fractions, file = "data/fractions.txt",
            sep = "\t", quote = FALSE, row.names = FALSE)
write.table(samples, file = "data/samples.txt",
            sep = "\t", quote = FALSE, row.names = FALSE)
write.table(references, file = "data/references.txt",
            sep = "\t", quote = FALSE, row.names = FALSE)
```


### Create Quantitative Cross-tab {#global-quant-crosstab}

This is the step where MS/MS IDs and reporter ions are linked together and aggregated to the peptide or accession (i.e. protein) level. To retain protein IDs while aggregating to peptide level, set `aggregation_level <- c("accession","peptide")`. The abundances are converted to relative abundances by dividing by the reference and then $log_2$-transformed.

```{r}
# Set aggregation level
aggregation_level <- c("accession")
# Create cross-tab
crosstab <- create_crosstab(msnid, masic_data,
                            aggregation_level = aggregation_level,
                            fractions, samples, references)
```

```{r echo=FALSE}
kable(head(crosstab, 6), row.names = TRUE, 
      caption = "<left>First 6 rows of the global quantitative cross-tab.</left>",
      escape = FALSE, format = "html") %>% 
  kable_styling(full_width = FALSE, font_size = 12,
                bootstrap_options = c("hover", "condensed"))
```

</br>

Now that we have the cross-tab, we should save it.

```{r}
# Save cross-tab
write.table(crosstab, file = "data/global_quant_crosstab.txt",
            sep = "\t", quote = FALSE, row.names = TRUE)
```

We will also save the proteins (row names) of this cross-tab in order to demonstrate prioritized inference in Section \@ref(phosphoproteomics-data).

```{r}
# Save global proteins
global_proteins <- rownames(crosstab)
save(global_proteins, file = "data/3442_global_proteins.RData")
```

### Create MSnSet Object

The final step is to create an MSnSet object, which is necessary for downstream data analysis tools such as the `plot_pca_*` or `limma_*` functions in MSnSet.utils. An MSnSet combines three tables into one object: `exprs`, `fData`, and `pData.` `exprs` is a matrix of protein abundance data with proteins as rows and samples as columns. `fData` is used to convert between different feature IDs such as RefSeq, UniProt accession, Entrez gene ID, gene symbol, etc. It is optional, but the rownames of `fData` must be the same as the row names of `exprs`. Lastly, `pData` contains the metadata with samples as rows, and the row names must be the same as the column names of `exprs` (same order as well). For this example, we don't have an actual metadata table. We just know group assignment from the sample names. Samples with an "R" are part of the exeRcised group, while samples with an "S" are part of the Sedentary group. We can make a metadata table with a single column called `group`.

```{r}
metadata <- data.frame(group = gsub("([RS]).*", "\\1", colnames(crosstab))) %>% 
  mutate(group = ifelse(group == "R", "exercised", "sedentary"))
rownames(metadata) <- colnames(crosstab)
```

```{r echo=FALSE}
kable(metadata, row.names = TRUE, 
      caption = "<left>Metadata for MSnSet.</left>",
      escape = FALSE, format = "html") %>% 
  kable_styling(full_width = FALSE, font_size = 12,
                bootstrap_options = c("hover", "condensed"))
```

Since we used the sample data to create the metadata, everything is in the right order; however, we usually need to reorder the rows of the metadata. We can do this with

```{r eval = FALSE}
# Reorder rows of metadata to match sample order in crosstab
# Need to add drop = FALSE if the metadata is a single column
metadata <- metadata[colnames(crosstab), , drop = FALSE]
```

We have the metadata and protein abundance data, so we can now create the MSnSet. We will call it `m1` and we should save it as an .RData file.

```{r}
# Create MSnSet
m1 <- MSnbase::MSnSet(exprs = as.matrix(crosstab), pData = metadata)
validObject(m1)
```

If we type the name of the MSnSet in the console, a summary will be displayed. This will show information about each of the tables, and the MSnSet will keep track of all processing done such as filtering at the bottom under "Processing information".

```{r}
# Display summary
m1
```


```{r, eval = FALSE}
# Save unprocessed MSnSet
save(m1, file = "data/global_msnset.RData")
```


## Phosphoproteomics Data {#phosphoproteomics-data}

This pipeline shows how to process phosphoproteomics data from the DMS.

For this section, we will use data package number `3626`. We will need the `PlexedPiper` package for isobaric quantification and `PNNL.DMS.utils` to interface with the DMS. Also, details will be omitted if they were already provided in Section \@ref(global-proteomics-data).

```{r message=FALSE}
# Setup
library(PNNL.DMS.utils)
library(PlexedPiper)
library(Biostrings)
library(dplyr) # %>%
library(MSnbase)
```


### Read MS-GF+ Output 

```{r results='hide'}
# Read MS-GF+ data
data_package_num <- 3626
msnid <- read_msgf_data_from_DMS(data_package_num)
```

```{r}
show(msnid)
```


### Correct Isotope Selection Error 

```{r}
# Correct for isotope selection error
msnid <- correct_peak_selection(msnid)
```


### Remove Non-Phosphorylated Peptides 

In this case, the phosphorylation of an amino acid is marked by a `*` inserted into the sequence after said amino acid. We will not consider unmodified peptides, so we can filter out peptides that do not contain this symbol with `apply_filter`. The `*` is a special character that must be escaped with backslashes, and the backslashes must also be escaped, since they are enclosed within a nested string (`"''"`).

```{r}
# Remove non-phosphorylated peptides
# (peptides that do not contain a *)
msnid <- apply_filter(msnid, "grepl('\\\\*', peptide)")
show(msnid)
```


### Remove Contaminants 

```{r}
# Remove contaminants
msnid <- apply_filter(msnid, "!grepl('Contaminant', accession)")
show(msnid)
```


### Improve Phosphosite Localization

Phospho datasets involve AScore jobs for improving phosphosite localization. There should be one AScore job per data package. If the AScore job does not exist, see <a href="https://prismwiki.pnl.gov/wiki/AScore_Job_Creation">AScore Job Creation</a> for how to set it up. The fetched object is a data.frame that links datasets, scans and original PTM localization to newly suggested locations. Importantly, it contains `AScore` column that signifies the confidence of PTM assignment. AScore > 17 is considered confident.

```{r results='hide'}
# Filter PTMs by Ascore
ascore <- get_AScore_results(data_package_num)
msnid <- best_PTM_location_by_ascore(msnid, ascore)
```

<!---
Why are some of the original sequences in ascore not phosphorylated? (AScore = -1)
--->

```{r}
show(msnid)
```


### MS/MS ID Filter: Peptide Level 

```{r}
# 1% FDR filter at the peptide level
msnid <- filter_msgf_data(msnid,
                          level = "peptide",
                          fdr.max = 0.01)
show(msnid)
```

### MS/MS ID Filter: Protein Level 

```{r message=FALSE, warning=FALSE}
# Get path to FASTA file
path_to_FASTA <- path_to_FASTA_used_by_DMS(data_package_num)

# Compute number of peptides per 1000 amino acids
msnid <- compute_num_peptides_per_1000aa(msnid, path_to_FASTA)

# 1% FDR filter at the protein level
msnid <- filter_msgf_data(msnid,
                          level = "accession",
                          fdr.max = 0.01)
show(msnid)
```

### Inference of Parsimonious Protein Set 

<!---
TODO:
* Talk about using prior information from global cross-tab to improve inference of the parsimonious protein set for phospho data.
--->

```{r}
# Load proteins from global crosstab
global_proteins <- readRDS("data/3442_global_protein_names.rds")
# Inference of parsimonious protein set
msnid <- infer_parsimonious_accessions(msnid, unique_only = FALSE,
                                       prior = global_proteins)
show(msnid)
```

### Remove Decoy PSMs

```{r}
# Remove Decoy PSMs
msnid <- apply_filter(msnid, "!isDecoy")
show(msnid)
```

### Map Sites to Protein Sequences 

Prepare FASTA to make sure entry names in FASTA file match MSnID accessions. The plan is to make this conversion automatic. `map_mod_sites` creates a number of columns describing mapping of the sites onto the protein sequences. The most important for the user is `SiteID`.

```{r}
# Create AAStringSet
fst <- readAAStringSet(path_to_FASTA)
# Remove contaminants
fst <- fst[!grepl("Contaminant", names(fst)), ]
# First 6 names
head(names(fst))
```

```{r}
# Modify names to match accessions(msnid)
names(fst) <- strsplit(names(fst), split = " ") %>% 
  # Select text before first space
  lapply(function(x) x[1]) %>% 
  unlist()
# First 6 names
head(names(fst))
```

```{r}
# Main mapping call
msnid <- map_mod_sites(object = msnid, fasta = fst, 
                       accession_col = "accession", 
                       peptide_mod_col = "peptide", 
                       mod_char = "*",
                       site_delimiter = "lower")
```

```{r phospho-msnid-table, echo=FALSE}
fix_phos <- function(x) {
  gsub("\\*", "\\\\*", x)
}
x <- head(psms(msnid), 6) %>% 
  mutate_if(is.character, fix_phos)
kable(x, digits = 3, row.names = FALSE, 
      caption = "<left>First 6 rows of the processed MS-GF+ results.</left>",
      escape = FALSE, format = "html") %>% 
  kable_styling(full_width = FALSE, font_size = 12,
                bootstrap_options = c("hover", "condensed"))
```

</br>

Table \@ref(tab:phospho-msnid-table) shows the first 6 rows of the processed MS-GF+ output.


### Read MASIC Output 

```{r results='hide'}
# Read MASIC data
masic_data <- read_masic_data_from_DMS(data_package_num, 
                                       interference_score = TRUE)
```

### Filter MASIC Data 

```{r}
# Filter MASIC data
masic_data <- filter_masic_data(masic_data, 
                                interference_score_threshold = 0.5,
                                s2n_threshold = 0)
```


### Create Study Design Tables

If study design tables are on the DMS, they can be accessed in the following way.

```{r eval=FALSE}
# Read study design tables from DMS
study_design <- read_study_design_from_DMS(data_package_num)
fractions <- study_design$fractions
samples <- study_design$samples
references <- study_design$references
```

While the study design tables are not on the DMS, we already created them in Section \@ref(fetch-study-design-tables). We just need to recreate the fractions table because the dataset names are different.

```{r}
# Read tables from folder
samples <- read.delim("data/samples.txt")
references <- read.delim("data/references.txt")
fractions <- data.frame(Dataset = unique(masic_data$Dataset)) %>% 
  mutate(PlexID = gsub(".*_P_(S\\d{1})_.*", "\\1", Dataset))
```


### Create Quantitative Cross-tab {#phospho-create-crosstab}

```{r}
# Set aggregation level
aggregation_level <- c("SiteID")
# Create cross-tab
crosstab <- create_crosstab(msnid, masic_data,
                            aggregation_level = aggregation_level,
                            fractions, samples, references)
```

```{r echo=FALSE}
x <- head(crosstab)
# rownames(x) <- gsub("\\*", "\\\\*", rownames(x))
# rownames(x) <- gsub("\\@", "\\@\\\\hphantom{}", rownames(x))
kable(x, row.names = TRUE, 
      caption = "<left>First 6 rows of the phospho quantitative cross-tab.</left>", format = "html",
      escape = FALSE) %>% 
  kable_styling(font_size = 12,
                bootstrap_options = c("hover", "condensed"))
```

</br>

We will save the cross-tab for later sections.

```{r}
# Modify cross-tab for saving
crosstab <- crosstab %>% 
  as.data.frame() %>% 
  tibble::rownames_to_column("SiteID")

# Save cross-tab
write.table(crosstab, file = "data/phosphosite_quant_crosstab.txt",
            sep = "\t", quote = FALSE, row.names = FALSE)
```