Releases: mikemc/speedyseq
speedyseq v0.2.0
Breaking changes
- The default ordering of new taxa output by
tax_glom()
is different from
previous versions and fromphyloseq::tax_glom()
in phyloseq objects that do
not have phylogenetic trees. See "Minor improvements and fixes" for more
information.
New features
New general-purpose vectorized taxa-merging function
The new merge_taxa_vec()
function provides a vectorized version of
phyloseq::merge_taxa()
that can quickly merge arbitrary groups of taxa and
now forms the basis of all other merging functions. phyloseq::merge_taxa()
takes a phyloseq object or component object x
and a set of taxa eqtaxa
and
merges them into a single taxon. In place of the eqtaxa
argument, speedyseq's
merge_taxa_vec()
takes a vector group
of length ntaxa(physeq)
that
defines how all the taxa in x
should be merged into multiple new groups. Its
syntax and behavior is patterned after that of base::rowsum()
, which it uses
to do the merging in the OTU table. When aiming to merge a large number of taxa
into a smaller but still large number of groups, it is much faster to do all
the merging with one call to merge_taxa_vec()
than to loop through many calls
to merge_taxa()
.
A practical example is clustering amplicon sequence variants (ASVs) into OTUs
defined by a given similarity threshold. Suppose we have a phyloseq object ps
that has the ASV sequences stored in its refseq
slot. We can cluster the ASV
sequences into 97% OTUs using the DECIPHER package with
dna <- refseq(ps)
nproc <- 1 # Increase to use multiple processors
aln <- DECIPHER::AlignSeqs(dna, processors = nproc)
d <- DECIPHER::DistanceMatrix(aln, processors = nproc)
clusters <- DECIPHER::IdClusters(
d,
method = "complete",
cutoff = 0.03, # corresponds to 97% OTUs
processors = nproc
)
Next we merge_taxa_vec()
to get the merged phyloseq object,
ps0 <- merge_taxa_vec(
ps,
group = clusters$cluster,
tax_adjust = 2
)
The names of the new taxa are set to the name of the most abundant taxon within
each group (the same behavior as the tax_glom()
and tip_glom()
functions).
Future versions will likely have a names
argument to control the naming
behavior.
The tax_adjust
argument controls how NAs and within-group
disagreements in taxonomy are handled to determine the taxonomy of the new taxa
(see help(merge_taxa_vec)
for details). An example of the difference between
tax_adjust = 1
(phyloseq::merge_taxa()
behavior) and tax_adjust = 2
can
be seen in the following example from the new tip_glom()
documentation,
data(GlobalPatterns)
set.seed(20190421)
ps <- prune_taxa(sample(taxa_names(GlobalPatterns), 2e2), GlobalPatterns)
ps1 <- tip_glom(ps, 0.1, tax_adjust = 1)
ps2 <- tip_glom(ps, 0.1, tax_adjust = 2)
tax_table(ps1)[c(108, 136, 45),]
#> Taxonomy Table: [3 taxa by 7 taxonomic ranks]:
#> Kingdom Phylum Class Order
#> 578831 "Bacteria" "Bacteroidetes" "Sphingobacteria" "Sphingobacteriales"
#> 2801 "Bacteria" "Planctomycetes" "Planctomycea" "Pirellulales"
#> 185581 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Oceanospirillales"
#> Family Genus Species
#> 578831 NA "Niabella" NA
#> 2801 NA "Rhodopirellula" NA
#> 185581 "OM60" NA "marinegammaproteobacteriumHTCC2080"
tax_table(ps2)[c(108, 136, 45),]
#> Taxonomy Table: [3 taxa by 7 taxonomic ranks]:
#> Kingdom Phylum Class Order
#> 578831 "Bacteria" "Bacteroidetes" "Sphingobacteria" "Sphingobacteriales"
#> 2801 "Bacteria" "Planctomycetes" "Planctomycea" "Pirellulales"
#> 185581 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Oceanospirillales"
#> Family Genus Species
#> 578831 NA NA NA
#> 2801 NA NA NA
#> 185581 "OM60" NA NA
Faster and lower-memory implementation of phyloseq::tip_glom()
The new tip_glom()
function provides a speedy version of
phyloseq::tip_glom()
. This function performs a form of indirect phylogenetic
merging of taxa using the phylogenetic tree in phy_tree(physeq)
by 1) using
the tree to create a distance matrix, 2) performing hierarchical clustering on
the distance matrix, and 3) defining new taxonomic groups by cutting the
dendrogram at the height specified by the h
parameter. Speedyseq's
tip_glom()
provides a faster and less memory-intensive alternative to
phyloseq::tip_glom()
through the use of vectorized merging (via
merge_taxa_vec()
) and faster and lower-memory phylogenetic-distance
computation (via get_all_pairwise_distances()
from the
castor package).
Speedyseq's tip_glom()
also has the new tax_adjust
argument, which is
passed on to merge_taxa_vec()
. It is set to 1
by default for phyloseq
compatibility and should give identical results to phyloseq in this case.
For phyloseq compatibility, the default clustering function is left as
cluster::agnes()
. However, equivalent but faster results can be obtained by
using the hclust
function from base R with the method == "average"
option.
Direct phylogenetic merging with tree_glom()
The new tree_glom()
function performs direct phylogenetic merging of taxa.
This function is much faster and arguably more intuitive than tip_glom()
.
Advantages of direct merging over the indirect merging of tip_glom()
are
- A merged group of taxa correspond to a clade in the original tree being
collapsed to a single taxon. - The
resolution
parameter that controls the degree of merging has
units in terms of the tree's branch lengths, making it potentially more
biologically meaningful than theh
parameter intip_glom()
. - The distance-matrix computation and hierarchical clustering in
tip_glom()
can be skipped, makingtree_glom()
much faster and less memory intensive
thantip_glom()
when the number of taxa is large.
tree_glom()
uses functions from the
castor package to
determine which clades are to be merged using one of three criteria. The
default behavior is to merge a clade if the maximum distance from a node to
each of its tips is less than the distance resolution
.
data(GlobalPatterns)
ps1 <- subset_taxa(GlobalPatterns, Phylum == "Chlamydiae")
ntaxa(ps1)
ps2 <- tree_glom(ps1, 0.05)
ntaxa(ps2)
library(dplyr)
library(ggtree)
library(cowplot)
plot1 <- phy_tree(ps1) %>%
ggtree +
geom_tiplab() +
geom_label(aes(x = branch, label = round(branch.length, 4)))
plot2 <- phy_tree(ps2) %>%
ggtree +
geom_tiplab() +
geom_label(aes(x = branch, label = round(branch.length, 4)))
plot_grid(plot1, plot2)
Minor improvements and fixes
-
Fixed bug that applied to taxonomic merge functions when an object named
new_tax_mat
exists outside the function environment; described in
Issue #31 -
Merging functions now maintain the original order of new taxa by default,
except in phyloseq objects with phylogenetic trees (for which order is and
has always been determined by how archetypes are ordered in
phy_tree(ps)$tip.label
). This behavior can lead to different taxa orders
from past speedyseq versions and fromphyloseq::tax_glom()
function.
However, it makes the resulting taxa order more predictable. New taxa can be
be reordered according togroup
or taxonomy inmerge_taxa_vec()
and
tax_glom()
by settingreorder = TRUE
. -
Merging/glom functions now work on relevant phyloseq components as well as
phyloseq objects