Using intersectRows when different names are used for the same entity #228

Open
llrs opened this issue Dec 4, 2017 · 8 comments

llrs commented Dec 4, 2017

I have one dataset of 16S sequencing from intestinal biopsies and another from stools, and the two end up with different OTUs. I can find which taxon each OTU belongs to. In phylogenetic analysis the taxonomy is usually merged into a single object (phyloseq, metagenomeSeq), extending the rowData (I assume), or it could be stored in rowData, because the names of the OTUs (I have OTU_1, OTU_2, ...) aren't really meaningful. What is meaningful is the taxonomy, which I have as a matrix inside those objects (phylo-class, MRexperiment-class).

See example output:

MR_i  ## And MR_s is a similar object
## MRexperiment (storageMode: environment)
## assayData: 499 features, 103 samples 
##   element names: counts 
## protocolData: none
## phenoData
##   sampleNames: 5.B009 4.B008 ... 103.B104 (103 total)
##   varLabels: Sample_Code Patient_ID ... ID (12 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: OTU_1 OTU_10 ... OTU_998 (499 total)
##   fvarLabels: Domain Phylum ... Species (7 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  

(MAE  <- MultiAssayExperiment(experiments = list("intestinal" = MR_i, "stools" = MR_s), colData = meta))
## A MultiAssayExperiment object of 2 listed
##  experiments with user-defined names and respective classes. 
##  Containing an ExperimentList class object of length 2: 
##  [1] intestinal: MRexperiment with 499 rows and 103 columns 
##  [2] stools: MRexperiment with 535 rows and 103 columns 
## Features: 
##  experiments() - obtain the ExperimentList instance 
##  colData() - the primary/phenotype DataFrame 
##  sampleMap() - the sample availability DataFrame 
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
##  *Format() - convert into a long or wide DataFrame 
##  assays() - convert ExperimentList to a SimpleList of matrices

When I build a MAE object from them and use intersectRows, I end up with the rows that share the same name but have different taxonomic classifications.

(b <- intersectRows(MAE))
## A MultiAssayExperiment object of 2 listed
##  experiments with user-defined names and respective classes. 
##  Containing an ExperimentList class object of length 2: 
##  [1] intestinal: MRexperiment with 235 rows and 103 columns 
##  [2] stools: MRexperiment with 235 rows and 103 columns 
## Features: 
##  experiments() - obtain the ExperimentList instance 
##  colData() - the primary/phenotype DataFrame 
##  sampleMap() - the sample availability DataFrame 
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
##  *Format() - convert into a long or wide DataFrame 
##  assays() - convert ExperimentList to a SimpleList of matrices
c(head(rownames(b)[[1]]), tail(rownames(b)[[1]]))
## [1] "OTU_1"   "OTU_10"  "OTU_100" "OTU_101" "OTU_102" "OTU_103" "OTU_94"  "OTU_95"  "OTU_96"  "OTU_97"  "OTU_98"  "OTU_99" 

Instead, OTU_1073 from the intestinal assay and OTU_1037 from the stools assay are the same species.

Could intersectRows use the rowData (or fvarLabels) of each experiment if available to reorder(?) and select the rows of the experiment?

Also, if I have metagenomics and RNA-seq assays in the same object, I would like to tell intersectRows which experiments to subset by row. For example, I might be interested in just one phylum and want to relate it to the other assays in the object.

The package looks great, thanks for the effort!

LiNk-NY self-assigned this Dec 4, 2017
Collaborator

LiNk-NY commented Dec 7, 2017

Hi Lluís, @llrs
Thank you for the report.
The assumption here is that all the objects in the ExperimentList support a rowData method.
It would be good to make use of this data; perhaps we could add a byRowData argument.
Regards,
Marcel

Author

llrs commented Dec 7, 2017

I tried building another object (SummarizedExperiment) with the same data:

(mae <- MultiAssayExperiment(list("intestinal" = SE_i, "stools" = SE_s)))
## A MultiAssayExperiment object of 2 listed
##  experiments with user-defined names and respective classes. 
##  Containing an ExperimentList class object of length 2: 
##  [1] intestinal: SummarizedExperiment with 532 rows and 178 columns 
##  [2] stools: SummarizedExperiment with 568 rows and 152 columns 
## Features: 
##  experiments() - obtain the ExperimentList instance 
##  colData() - the primary/phenotype DataFrame 
##  sampleMap() - the sample availability DataFrame 
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
##  *Format() - convert into a long or wide DataFrame 
##  assays() - convert ExperimentList to a SimpleList of matrices
colData(mae)
## DataFrame with 330 rows and 0 columns

But then my problem is how to encode the colData; see this question on the support site.

It might be material for another enhancement, but using each SummarizedExperiment's colData to create a common colData would simplify the creation of MAE objects. It would have many caveats, but looking for common columns and creating a column for the row names of each sample in the SummarizedExperiment might work.
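
A rough sketch of that idea in base R, using plain data frames in place of each experiment's colData. The helper name `merge_col_data` and the toy tables are made up for illustration; this is not part of MultiAssayExperiment, and it deliberately ignores the caveats (conflicting values, several samples per patient).

```r
# Hypothetical helper: stack per-experiment sample tables, keeping only the
# columns they all share and recording where each sample came from.
merge_col_data <- function(col_data_list) {
  common <- Reduce(intersect, lapply(col_data_list, names))
  pieces <- lapply(names(col_data_list), function(nm) {
    cd <- col_data_list[[nm]]
    data.frame(
      assay = nm,                 # which experiment the sample belongs to
      colname = rownames(cd),     # the sample's name within that experiment
      cd[, common, drop = FALSE], # columns present in every experiment
      row.names = NULL,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pieces)
}

# Toy sample tables (invented data)
cd_i <- data.frame(Patient_ID = c("p1", "p2"), Age = c(30, 41),
                   Site = c("ileum", "colon"),
                   row.names = c("i1", "i2"), stringsAsFactors = FALSE)
cd_s <- data.frame(Patient_ID = c("p2", "p1"), Age = c(41, 30),
                   row.names = c("s1", "s2"), stringsAsFactors = FALSE)

merged <- merge_col_data(list(intestinal = cd_i, stools = cd_s))
```

The result has one row per sample rather than one row per patient, so it would still need collapsing by something like `Patient_ID` before it could serve as a primary colData.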

Member

lwaldron commented Dec 8, 2017

@LiNk-NY I wonder if the enhancement should be more general than byRowData - how about function signatures for subsetByRow and subsetByColumn, where the function is something that will be applied to each list element? Something like:

setMethod("subsetByRow", c("ExperimentList", "function"), function(x, y) {
   sublist <- lapply(x, y)
   x <- subsetByRow(x, sublist)
   x
})

This could be used for subsetting by rowData (although with more complicated user syntax than a more specific subsetByRowData), but also for filtering by row means, variance, etc.
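
To make the idea concrete without depending on the package, here is a minimal base R sketch that applies a user function to each element of a plain named list of matrices and subsets the rows with the result. `subset_rows_with` and the toy matrices are hypothetical stand-ins, not the proposed method itself.

```r
# Hypothetical stand-in for the proposed method, written for a plain named
# list of matrices rather than an ExperimentList.
subset_rows_with <- function(experiments, f) {
  # f() receives one experiment and returns a row index
  # (a logical, integer, or character vector)
  keep <- lapply(experiments, f)
  Map(function(m, idx) m[idx, , drop = FALSE], experiments, keep)
}

# Toy data (invented)
exps <- list(
  a = matrix(c(0, 0, 5, 7, 1, 1), nrow = 3, byrow = TRUE,
             dimnames = list(c("r1", "r2", "r3"), c("s1", "s2"))),
  b = matrix(c(9, 9, 0, 1), nrow = 2, byrow = TRUE,
             dimnames = list(c("r1", "r2"), c("s1", "s2")))
)

# Keep the rows whose mean count exceeds 1 in each experiment
filtered <- subset_rows_with(exps, function(m) rowMeans(m) > 1)
```

The same pattern would cover a rowData filter if the function looked at the element's row annotation instead of its values.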

Collaborator

LiNk-NY commented Apr 18, 2018

I think Martin @mtmorgan would say, you want to define a method for a class rather than a function.
And the desired functionality should either conform to the MultiAssayExperiment API or
extend the class.

(Martin, feel free to chime in)


stale bot commented Jan 2, 2019

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the outdated label Jan 2, 2019
Author

llrs commented Jan 2, 2019

It's been a while; are there any updates?

I'm commenting to prevent the bot from closing the issue.

stale bot removed the outdated label Jan 2, 2019
Collaborator

LiNk-NY commented Jan 2, 2019

Hi Lluís, @llrs

What you describe seems to require a row map structure where subsets can be done
based on a third variable.
We don't have something like that planned in the immediate future although it is
an important problem to tackle. FWIW, we do have helper functions to homogenize rows
across experiments in TCGAutils (see symbolsToRanges and mirToRanges).
Perhaps you can write a function that will do this for you in terms of matching
and re-ordering OTU rows across experiments using a map. You could then use
a list or List of row names to subset.
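
A minimal base R sketch of such a matching function, assuming the taxonomy of each experiment is available as a data frame whose row names are the OTU names. The helper `match_rows_by_taxonomy` and the toy tables are hypothetical, and labels that are missing or repeated within a table are simply dropped, which is only one of several possible choices.

```r
# Hypothetical helper: pair up rows of two experiments through a shared
# taxonomy column instead of their (arbitrary) OTU names.
match_rows_by_taxonomy <- function(tax_a, tax_b, by = "Species") {
  # Keep only unambiguous labels: non-missing and unique within a table
  usable <- function(x) !is.na(x) & !(x %in% x[duplicated(x)])
  a <- tax_a[usable(tax_a[[by]]), , drop = FALSE]
  b <- tax_b[usable(tax_b[[by]]), , drop = FALSE]
  shared <- intersect(a[[by]], b[[by]])
  data.frame(
    taxon = shared,
    row_a = rownames(a)[match(shared, a[[by]])],
    row_b = rownames(b)[match(shared, b[[by]])],
    stringsAsFactors = FALSE
  )
}

# Toy taxonomy tables (invented data, not taken from the real datasets)
tax_i <- data.frame(Species = c("sp1", "sp2", "sp3", NA),
                    row.names = c("OTU_1", "OTU_2", "OTU_1073", "OTU_5"),
                    stringsAsFactors = FALSE)
tax_s <- data.frame(Species = c("sp3", "sp9", "sp1"),
                    row.names = c("OTU_1037", "OTU_2", "OTU_7"),
                    stringsAsFactors = FALSE)

row_map <- match_rows_by_taxonomy(tax_i, tax_s)

# One vector of row names per experiment, in matching order
rows <- list(intestinal = row_map$row_a, stools = row_map$row_b)
```

Here `OTU_2` appears in both tables but is not paired, because its species differs between them; the pairing follows the taxonomy column only.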

If you are working with a consistent number of samples ('colnames') and rows,
it may also be worthwhile to look into data structures that make use of a
row graph representation such as LoomExperiment.

Best regards,
Marcel

Member

lwaldron commented Jan 4, 2019

Just discussed this with @LiNk-NY. This should provide a workable solution with minimal change:

  • the subsetByRow() function should provide an i argument that allows you to specify which experiments will be subset, with the default being all.

Other helper functions subsetByRowData() and intersectByRowData() would also be useful. These would provide an additional argument for the column name of the rowData to use instead of row names. They would silently do nothing for any experiments that either 1) don't have rowData, or 2) don't have the specified colname in their rowData.
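
The behaviour described above could look roughly like the following base R sketch. It is only an illustration under stated assumptions: each experiment is modelled as a plain list with an `assay` matrix and an optional `rowData` data frame, `intersect_by_row_data` is a hypothetical name, and giving it an `i` argument is an extrapolation from the subsetByRow() proposal.

```r
# Hypothetical sketch: each experiment is a plain list with an `assay` matrix
# and an optional `rowData` data.frame whose rows parallel the assay rows.
intersect_by_row_data <- function(experiments, column, i = names(experiments)) {
  has_column <- function(e) !is.null(e$rowData) && column %in% names(e$rowData)
  # Only experiments named in `i` that carry the column take part;
  # all others are returned untouched, without warning
  eligible <- i[vapply(experiments[i], has_column, logical(1))]
  if (length(eligible) < 2L) return(experiments)
  shared <- Reduce(intersect,
                   lapply(experiments[eligible], function(e) e$rowData[[column]]))
  shared <- shared[!is.na(shared)]
  for (nm in eligible) {
    e <- experiments[[nm]]
    idx <- match(shared, e$rowData[[column]])  # first match, in shared order
    e$assay <- e$assay[idx, , drop = FALSE]
    e$rowData <- e$rowData[idx, , drop = FALSE]
    experiments[[nm]] <- e
  }
  experiments
}

# Toy experiments (invented data)
exps <- list(
  intestinal = list(
    assay = matrix(1:6, nrow = 3,
                   dimnames = list(c("OTU_1", "OTU_2", "OTU_3"), c("s1", "s2"))),
    rowData = data.frame(Species = c("x", "y", "z"),
                         row.names = c("OTU_1", "OTU_2", "OTU_3"),
                         stringsAsFactors = FALSE)
  ),
  stools = list(
    assay = matrix(1:4, nrow = 2,
                   dimnames = list(c("OTU_1", "OTU_9"), c("s1", "s2"))),
    rowData = data.frame(Species = c("z", "x"),
                         row.names = c("OTU_1", "OTU_9"),
                         stringsAsFactors = FALSE)
  ),
  rnaseq = list(
    assay = matrix(1:4, nrow = 2,
                   dimnames = list(c("g1", "g2"), c("s1", "s2"))),
    rowData = NULL
  )
)

res <- intersect_by_row_data(exps, "Species")
```

In this toy case the two OTU experiments are reduced to the shared species in the same order, while the RNA-seq element, which has no rowData, passes through unchanged.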
