Commit

Update "Get started" vignette. Also update data schema figure and add packages to the Suggests section in the DESCRIPTION file.
sdgamboa committed Mar 11, 2024
1 parent add554b commit 9cd2015
Showing 3 changed files with 114 additions and 23 deletions.
7 changes: 6 additions & 1 deletion DESCRIPTION
@@ -43,6 +43,11 @@ Suggests:
knitr,
rmarkdown,
sessioninfo,
testthat
testthat,
EnrichmentBrowser,
MicrobiomeBenchmarkData,
mia,
stats,
limma
URL: https://github.com/waldronlab/bugphyzz
BugReports: https://github.com/waldronlab/bugphyzz/issues
130 changes: 108 additions & 22 deletions vignettes/bugphyzz.Rmd
@@ -21,10 +21,9 @@ knitr::opts_chunk$set(
## Introduction

[Bugphyzz](https://github.com/waldronlab/bugphyzzExports)
is an electronic resource that provides harmonized microbial annotations from
different sources. These annotations can be used to create microbial signatures
based on shared attributes and taxonomy. This R package, which shares the same
name, can be used to access such resource directly in R.
is an electronic resource of harmonized microbial annotations from
different sources. These annotations can be used to create signatures of
microbes sharing attributes, and used for bug set enrichment analysis.

## Data schema

@@ -37,29 +36,25 @@ and an attribute (see the data schema and description below).

</center>

Below, the description of the elements in the data schema.

**Taxon-related**

Taxonomic data was harmonized based on the NCBI taxonomy:
Taxonomic data was harmonized according to the NCBI taxonomy:

1. _NCBI ID_. An integer. The NCBI taxonomy ID (taxid) associated with the
1. _NCBI ID_. An integer. The NCBI taxonomy ID (taxid) associated with a
taxon.
2. _Rank_. A character string describing the taxonomy rank. Valid values:
superkingdom, kingdom, phylum, class, order, family, genus, species, strain.
3. _Taxon name_. A character string describing the scientific name of the taxon.

**Attribute-related**

Attribute data was harmonized using a controlled vocabulary,
which was based on available ontology terms.
Attributes, ontology terms, and ontology libraries used can be found [here]().
Attribute data was harmonized with a controlled vocabulary based on
available ontology terms. Attributes, ontology terms, and ontology libraries
can be found [here]().

4. _Attribute_. A character string describing the name of a trait that can be
observed or measured.
5. _Attribute value_. The values that an attribute could take. Either a
character string, a boolean, or a number.
6. _Attribute type_. A character string describing the data type.
5. _Attribute type_. A character string describing the data type.
* numeric. Attributes that can take numeric values. For example, attribute:
growth temperature; attribute value: 25 C.
* binary. Attributes that can take booleans. For example,
@@ -69,6 +64,8 @@ character string, a boolean, or a number.
* multistate-union. Attribute that can take three or more values. These
values are always character strings. For example, attribute: aerophilicity;
attribute values: aerobic, anaerobic, or facultatively anaerobic.
6. _Attribute value_. The values that an attribute could take. Either a
character string, a boolean, or a number.

**Attribute value-related**

@@ -83,7 +80,14 @@ Metadata associated with the attribute values:
* IBD = inferred from biological aspect of descendant.
* ASR = ancestral state reconstruction.

9. _Frequency and Score_.
9. _Support values_.
* Frequency and Score. Confidence that a given taxon exhibits a trait based
on the curator’s knowledge or results of ASR or IBD.
* Validation. Score of the 10-fold cross-validation analysis.
Matthews correlation coefficient (MCC) for discrete attributes and
R-squared for numeric attributes. Default value is 0.5 and above.
* NSTI. Nearest sequence taxon index as described in PICRUSt2 or the
castor package. Relevant for numeric values only.
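
For readers unfamiliar with the Matthews correlation coefficient mentioned above, the following is a minimal sketch of how it can be computed from a 2x2 confusion matrix in base R. The helper name `mcc` and the toy observed/predicted labels are hypothetical and are not part of bugphyzz; this is not the validation code used to produce the scores.

```r
# Hypothetical helper (not part of bugphyzz): MCC from two logical vectors.
mcc <- function(observed, predicted) {
    # Counts of the 2x2 confusion matrix, as doubles to avoid integer overflow.
    tp <- as.numeric(sum(observed & predicted))
    tn <- as.numeric(sum(!observed & !predicted))
    fp <- as.numeric(sum(!observed & predicted))
    fn <- as.numeric(sum(observed & !predicted))
    denom <- sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any margin of the confusion matrix is zero.
    if (denom == 0) {
        return(NA_real_)
    }
    (tp * tn - fp * fn) / denom
}

# Toy labels: does each of eight taxa exhibit a trait (observed vs. predicted)?
observed <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
predicted <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE)
mcc_value <- mcc(observed, predicted)

# Check the toy score against the 0.5 value mentioned in the schema.
mcc_value >= 0.5
```

MCC ranges from -1 to 1, with 0 corresponding to predictions no better than chance.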

**Attribute source-related**

@@ -117,7 +121,8 @@ library(purrr)

bugphyzz is imported with the `importBugphyzz` function as a list of
tidy data.frames, each of them corresponding to an attribute
(or group of attributes in the case of multistate-union).
or group of related attributes in the case of the multistate-union type
(check the data schema description above).

Import bugphyzz and explore available attributes with `names`:

@@ -136,16 +141,27 @@ Compare the column names with the data schema described above.

## Creating signatures

Create signatures of taxids at the genus level for aerophilicity
After the attributes have been imported, we can use the `makeSignatures`
function to create a list of signatures. `makeSignatures` accepts a few
arguments for filtering such as evidence, frequency, and minimum and maximum
values for numeric attributes. If a more refined filtering is required,
a user could use regular data manipulation functions on the data.frame of
interest (e.g., `dplyr::filter`).
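
As a rough illustration of the manual filtering route, the sketch below filters a toy data.frame and then builds signatures by hand with base R. Everything here is an assumption made for the example: the columns `Attribute_value`, `Evidence`, and `Frequency`, the evidence code `EXP`, the frequency labels, and the annotation values are illustrative stand-ins, not actual bugphyzz output. Base `subset` is used so the chunk has no dependencies; `dplyr::filter` would accept the same condition.

```r
# Toy annotations (illustrative values; column names are assumptions).
toy <- data.frame(
    NCBI_ID = c("562", "817", "287", "1502"),
    Taxon_name = c(
        "Escherichia coli", "Bacteroides fragilis",
        "Pseudomonas aeruginosa", "Clostridium perfringens"
    ),
    Attribute_value = c(
        "facultatively anaerobic", "anaerobic", "aerobic", "anaerobic"
    ),
    Evidence = c("EXP", "ASR", "EXP", "IBD"),
    Frequency = c("always", "sometimes", "usually", "always")
)

# Drop annotations obtained through ASR and keep the more frequent ones.
kept <- subset(
    toy, Evidence != "ASR" & Frequency %in% c("always", "usually")
)

# One signature (character vector of taxon names) per attribute value.
toy_sigs <- split(kept$Taxon_name, kept$Attribute_value)
toy_sigs
```

The result is a named list of character vectors, one element per attribute value that survives the filter.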

Some examples:

+ Create signatures of taxon names at the genus level for the aerophilicity
attribute (discrete):

```{r}
aer_sigs_g <- makeSignatures(
dat = bp[["aerophilicity"]], tax_id_type = "NCBI_ID", tax_level = "genus"
dat = bp[["aerophilicity"]], tax_id_type = "Taxon_name", tax_level = "genus"
)
map(aer_sigs_g, head)
```

Create signatures of taxa names at the species level for growth temperature
+ Create signatures of taxon names at the species level for the growth
temperature attribute (numeric):

```{r}
gt_sigs_sp <- makeSignatures(
@@ -155,7 +171,8 @@ gt_sigs_sp <- makeSignatures(
map(gt_sigs_sp, head)
```

Create signatures with custom threshold for numeric attributes
+ Create signatures with a custom threshold for the growth temperature
attribute (numeric):

```{r}
gt_sigs_mix <- makeSignatures(
@@ -165,7 +182,7 @@ gt_sigs_mix <- makeSignatures(
map(gt_sigs_mix, head)
```

Create signatures for a binary attribute
+ Create signatures for the animal pathogen attribute (boolean):

```{r}
ap_sigs_mix <- makeSignatures(
@@ -175,7 +192,7 @@ ap_sigs_mix <- makeSignatures(
map(ap_sigs_mix, head)
```

Make signatures for all datasets with a single function call
+ Make signatures for all of the data.frames:

```{r}
sigs <- map(bp, makeSignatures) |>
@@ -189,6 +206,75 @@ head(map(sigs, head))

## Run an enrichment analysis

Bugphyzz signatures can be used for running enrichment analysis with
existing tools developed in R, for example, EnrichmentBrowser.

Here is an example of how to run an enrichment analysis using GSEA and
a benchmark dataset.

Load packages:

```{r, message=FALSE}
library(EnrichmentBrowser)
library(MicrobiomeBenchmarkData)
library(mia)
```

Load benchmark data:

```{r, warning=FALSE}
dat_name <- 'HMP_2012_16S_gingival_V35'
tse <- getBenchmarkData(dat_name, dryrun = FALSE)[[1]]
tse_genus <- splitByRanks(tse)$genus
min_n_samples <- round(ncol(tse_genus) * 0.2)
tse_subset <- tse_genus[rowSums(assay(tse_genus) >= 1) >= min_n_samples,]
tse_subset
```
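
The subsetting step in the chunk above is a prevalence filter: it keeps genera with at least one count in at least 20% of the samples. A minimal sketch of the same logic on a plain matrix, assuming toy counts and made-up row and column names, might look like this:

```r
# Toy count matrix: 4 genera (rows) by 5 samples (columns).
counts <- matrix(
    c(
        5, 0, 0, 0, 0,
        3, 2, 0, 1, 0,
        0, 0, 0, 0, 0,
        9, 8, 7, 6, 5
    ),
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("genus_", 1:4), paste0("sample_", 1:5))
)

# Minimum number of samples in which a genus must be detected (20%).
min_n_samples <- round(ncol(counts) * 0.2)

# Number of samples with at least one count, per genus.
prevalence <- rowSums(counts >= 1)

# Keep only the genera that reach the prevalence threshold.
counts_subset <- counts[prevalence >= min_n_samples, , drop = FALSE]
rownames(counts_subset)
```

Here `genus_3` is never detected and is dropped, while the other three genera are kept.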

Differential abundance (DA) analysis:

```{r}
tse_subset$GROUP <- ifelse(
tse_subset$body_subsite == 'subgingival_plaque', 0, 1
)
edger <- deAna(
expr = tse_subset, de.method = 'edgeR', padj.method = 'fdr',
filter.by.expr = FALSE
)
dat <- data.frame(colData(edger))
design <- stats::model.matrix(~ GROUP, data = dat)
assay(edger) <- limma::voom(
counts = assay(edger), design = design, plot = FALSE
)$E
```

Enrichment analysis using GSEA:

```{r, message=FALSE}
gsea <- sbea(
method = 'gsea', se = edger, gs = aer_sigs_g, perm = 1000,
alpha = 0.1
)
gsea_tbl <- as.data.frame(gsea$res.tbl) |>
mutate(
GENE.SET = ifelse(PVAL < 0.05, paste0(GENE.SET, ' *'), GENE.SET),
PVAL = round(PVAL, 3),
) |>
dplyr::rename(BUG.SET = GENE.SET)
knitr::kable(gsea_tbl)
```
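
To make the idea of bug set enrichment more concrete, here is a deliberately simpler alternative to the GSEA run above: an over-representation test based on the one-sided hypergeometric distribution. This is not what `sbea(method = 'gsea')` computes. The genus names, the bug sets, and the set of differentially abundant genera are all toy values invented for this sketch.

```r
# Toy background of tested genera and those called differentially abundant.
background <- paste0("genus_", 1:20)
da_genera <- paste0("genus_", 1:6)

# Toy bug sets (signatures), as a named list of character vectors.
bug_sets <- list(
    aerobic = paste0("genus_", c(1, 2, 3, 4, 5, 12)),
    anaerobic = paste0("genus_", c(6, 13, 14, 15, 16, 17))
)

ora_pval <- vapply(bug_sets, function(bug_set) {
    # Only members present in the background can contribute.
    bug_set <- intersect(bug_set, background)
    overlap <- length(intersect(bug_set, da_genera))
    # P(X >= overlap) when drawing length(da_genera) genera at random.
    phyper(
        overlap - 1,
        m = length(bug_set),
        n = length(background) - length(bug_set),
        k = length(da_genera),
        lower.tail = FALSE
    )
}, numeric(1))

round(ora_pval, 3)
```

In this toy setup, five of the six differentially abundant genera fall in the `aerobic` set, so that set gets a small p-value while `anaerobic` does not.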

## Get taxon signatures

Finally, a user could get the names of all the signatures to which a given
taxon belongs. Only taxids should be used.

An example using _Escherichia coli_ (taxid: 562):

```{r}
getTaxonSignatures(tax = "562", bp = bp)
```
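
Conceptually, this kind of lookup is a membership test across a list of signatures. The sketch below shows the idea on a toy list in base R; it is not the implementation of `getTaxonSignatures`, and the signature names and their members are hypothetical.

```r
# Toy signatures of taxids (names and contents are made up for illustration).
toy_sigs <- list(
    `aerophilicity|facultatively anaerobic` = c("562", "1280"),
    `animal pathogen|TRUE` = c("562", "287"),
    `aerophilicity|aerobic` = c("287")
)

tax <- "562"

# TRUE for every signature that contains the taxid.
has_tax <- vapply(toy_sigs, function(sig) tax %in% sig, logical(1))

# Names of the signatures to which the taxon belongs.
taxon_sigs <- names(toy_sigs)[has_tax]
taxon_sigs
```
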
## Session information:

```{r}
Binary file modified vignettes/bugphyzz_data_schema.png