Re-visit implementation decisions around Cell object #768
Comments
Under option 2, does the ADC |
Some more options: Option 3 would appear to accomplish what @scharch is asking for. Option 4 seems super messy to me, but is perhaps the state of different implementations already. |
I think that I prefer option 1, though it does conflict with our normal way of coding 1:N relationships. |
This looks suspiciously like this discussion Are you sure you want to re-open this can of worms... 8-) |
Well, I took the time to try and dig out some history.
|
@javh says:
@scharch says:
I need to understand what you mean by "container". Is this suggesting having all of the expression data stored as part of the object as in: ![]() With the addition of an array of There would be no external Or do you mean that the
We did discuss all of these options extensively (see above), so it is unclear to me what the issue is with the current solution. |
This one, to my understanding |
I think Option 4 is closest to what we have now, hence the confusion. |
This would be option 1. Which would also imply that |
I really don't see this as complicated at all. In fact it is about as simple as it can get.
|
What you've described would be option 3, which would involve paring down |
You realize that this is what a Cell will look like, right - 1361 IDs to expression records for this one Cell. This is what you get when you have an N:1 relationship with large "N" stored in the "1"
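A minimal sketch of that concern, assuming hypothetical record shapes (the field names `expression_id`, `value` and `expression` are illustrative and not taken from the schema; only `cell_id` and the count of 1361 come from the thread). It contrasts keeping the link in the "1" (the Cell lists every expression ID) with keeping it in the "N" (each expression record points back via `cell_id`):

```python
from collections import defaultdict

# Hypothetical flat expression records for one cell; field names are illustrative.
expression_records = [
    {"expression_id": f"expr-{i}", "cell_id": "cell-1", "value": i % 7}
    for i in range(1361)
]

# Link stored in the "1": the Cell object carries one ID per expression record.
cell_with_ids = {
    "cell_id": "cell-1",
    "expression": [r["expression_id"] for r in expression_records],
}

# Link stored in the "N": the Cell stays small and lookups go through cell_id.
cell_minimal = {"cell_id": "cell-1"}
by_cell = defaultdict(list)
for record in expression_records:
    by_cell[record["cell_id"]].append(record)

print(len(cell_with_ids["expression"]))            # size of the ID array inside the Cell
print(len(by_cell[cell_minimal["cell_id"]]))       # same records, found via cell_id
```

Both forms answer the same question; they differ only in where the large array lives.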
|
Really??
This is not currently the case, as I read the schema
|
I don't object to leaving the rearrangement and receptor arrays. They are small N. I do object to storing an N:1 large N relationship in the 1 |
@bcorrie what study is this? That seems to me like using |
Yes, the current spec does not have the ability to store an array of Expression or Reactivity objects in Rearrangement and Receptor were kept in I don't object to removing the Rearrangement and Receptor arrays in Cell to be consistent. |
Yeah, not in terms of exactly which fields are present. I meant in terms of the logic. I don't see the schema as currently aligned with any option fully, hence the thread... I think the easiest thing to do (and my personal preference) is to remove the bi-directional fields from |
The airr-standards/specs/airr-schema.yaml Line 4565 in 3726047
The airr-standards/specs/airr-schema.yaml Line 4647 in 3726047
So you have the opportunity to do both. CellExpression is the measured genes and their expression levels from the study. Your average Cell is going to have many 100s of these. So your array of |
We store these in the repository (for a specific cell):
Each record looks like this:
|
I don't run a repository, so I guess it's no sweat off my back, but that's a turrrrible way to store RNAseq data and not at all what I thought Beyond that, seems like we have consensus on Option 3, plus/minus bi-directional |
I'd like to hear @bussec weigh in. The central problem here is that we're not all operating on the same set of assumptions of what the model should be. Once we all get on the same page w.r.t. assumptions, then it should be easy to figure out what to do. |
I don't object to this, we actually don't store them to avoid consistency problems:
|
The data isn't necessarily stored in that way in the repository, that is how the /expression endpoint gives it to you if you ask for a JSON response (which is the default). We store it as a big flat mongo document that is very fast to query if indexed correctly (which we are pretty good at). Maybe we run into challenges when we have 5B expression values, but for now it is not a challenge at all... Although we have none of these at the moment, the API could trivially produce TSV files, matrix files, h5ad, etc. We discussed that here: #409 The data is there, the data is queryable, but we don't have great use cases, and our discussions with 10X suggested that implementing any specific h5ad type of file would be a pain to support and by definition niche. Maybe that is changing but that is where we are at today. |
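As a rough illustration of how flat per-value records could be turned into a TSV matrix, here is a sketch using only the Python standard library. The record fields (`cell_id`, `gene`, `value`), the helper name `records_to_tsv`, and the choice to fill missing values with 0 are all assumptions for this example; it is not the repository's or the API's actual implementation.

```python
import csv
import io


def records_to_tsv(records):
    """Pivot flat (cell_id, gene, value) records into a cell-by-gene TSV string."""
    genes = sorted({r["gene"] for r in records})
    cells = sorted({r["cell_id"] for r in records})
    values = {(r["cell_id"], r["gene"]): r["value"] for r in records}

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["cell_id", *genes])
    for cell in cells:
        # Assumption: a gene with no record for this cell is written as 0.
        writer.writerow([cell, *(values.get((cell, g), 0) for g in genes)])
    return out.getvalue()


# Hypothetical records, one per measured (cell, gene) pair.
records = [
    {"cell_id": "c1", "gene": "CD3E", "value": 5},
    {"cell_id": "c1", "gene": "CD8A", "value": 2},
    {"cell_id": "c2", "gene": "CD3E", "value": 1},
]
tsv = records_to_tsv(records)
print(tsv)
```

The same pivot would be the first step toward any matrix-style export; a dense matrix like this only makes sense for small examples, since real expression data is sparse.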
Perfect illustration of the problem. Nulling the field when you could populate the data implies the schema is flawed. You're probably making the correct choice, and it is technically legal to null those required fields, but it's also bypassing the model implied by the |
With scirpy's support of the AIRR TSV format, it is likely that this might be the best way to return GEX data: https://scirpy.scverse.org/en/latest/glossary.html#term-AIRR But the API can't do that yet. We actually have an AIRR JSON Expression to h5ad converter (yes it's scary and slow and hacky) 8-) So if you download Cell/Expression data from the Gateway, you can use this to convert the data to h5ad files and then use those as input to tools like Conga and CellTypist (these are Apps we have built into the Gateway). |
I still don't understand how option 3 is different from what we currently have.... |
Mostly because we currently have bi-directional links in |
The +/- fields are what would change. I don't think I made that clear initially. |
A discussion around the necessity and the architecture of the Cell object and its associated objects recently arose in #705. Indeed, the current relations and links between Cell, CellExpression, CellReactivity and other objects (incl. Receptor and Rearrangement) appear overly complicated. Various solutions have been proposed, including:
1. Cell should become a container for other Cell* objects, which would then no longer exist as independent top-level objects.
2. Cell should be completely removed and cells would only be represented in a data set via a cell_id in various other objects, but not as an object on its own.
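To make the two proposals concrete, here is a minimal Python sketch of each data shape. The class names `CellContainer` and `ExpressionRecord`, the helper `cells_in`, and the fields `gene` and `value` are hypothetical and chosen only for illustration; `cell_id` is the only identifier taken from the text, and none of this reflects the actual AIRR schema definitions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExpressionRecord:
    """Hypothetical stand-in for a CellExpression-like record."""
    cell_id: str
    gene: str
    value: float


# First proposal: Cell is a container and the other Cell* records live inside it.
@dataclass
class CellContainer:
    cell_id: str
    expression: List[ExpressionRecord] = field(default_factory=list)


# Second proposal: no Cell object at all; a cell exists only as a cell_id
# that appears on other records.
def cells_in(records):
    """Return the distinct cell IDs implied by a flat list of records."""
    return sorted({r.cell_id for r in records})


records = [
    ExpressionRecord("cell-1", "CD3E", 5.0),
    ExpressionRecord("cell-1", "CD8A", 2.0),
    ExpressionRecord("cell-2", "CD3E", 1.0),
]

container = CellContainer(
    cell_id="cell-1",
    expression=[r for r in records if r.cell_id == "cell-1"],
)
implied_cells = cells_in(records)
```

In the container form, cell-level metadata has an obvious home but every nested array grows with the data; in the `cell_id`-only form, the records stay flat and there is nowhere to attach properties of the cell itself.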