Clone and tree schema feedback / debrief #333

Closed
eharkins opened this issue Feb 18, 2020 · 10 comments

@eharkins
Contributor

@javh asked:

A debrief would be nice at some point, but it doesn't have to be today. Ie, what worked, what didn't, why you had to keep unique_seqs_count and ident, anything that was rough and should be changed, etc.

  1. unique_seqs_count we kept because the concept of an individual sequence as a rearrangement isn't well defined in Olmsted and, given that context, calling them sequences seemed like it might be more intuitive to new users. We, as partis users, might also confuse a rearrangement with an entire clonal family, since events in partis are described as:

list of annotations for each rearrangement event (i.e. group of clonally-related sequences)

  2. ident we kept since we don't otherwise enforce unique ids
  3. id fields in the AIRR context look like <entity>.<entity>_id, which I think (though I can't remember for sure) is for DB querying reasons. This doesn't make as much sense in our context, where we don't need to do any querying and might care more about being able to use code that takes the id field of any object, which requires the id field to have the same key across all objects. At the end of the day this doesn't seem like a very big deal.
  4. I think we did a good job of eventually asking "what is the minimum viable product?" for a schema, which allowed us to move on from some detailed discussions that were maybe best saved for later versions of the schema
  5. in general I found it hard to follow and track the various pieces of the schema being discussed all in one issue / branch, because the only documentation we had of them was that single thread. Ways I can think of doing this differently (both of which allow us to mark pieces of the schema as resolved):
  • starting a GitHub code review and commenting on pieces of the schema there, so that all discussion related to each piece takes place in the review thread attached to the corresponding line(s) in the code
  • using GitHub Projects boards to track many separate issues, one for each part of the schema
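
To illustrate the point above about wanting the same id key across objects, here is a minimal sketch. The records and the helper name `index_by_key` are hypothetical and are not taken from the Olmsted code; the only thing borrowed from the schema discussion is the `ident` key and the `<entity>_id` naming pattern.

```python
def index_by_key(objects, key="ident"):
    """Build a lookup table from a list of dicts that share one id key."""
    return {obj[key]: obj for obj in objects}


# Hypothetical records: each carries a uniform "ident" plus an entity-specific id.
clones = [{"ident": "c-1", "clone_id": "clone-A"}]
trees = [{"ident": "t-1", "tree_id": "tree-A"}]

# With a uniform key, the same call works for every object type.
lookup = {**index_by_key(clones), **index_by_key(trees)}

# With entity-specific keys, the caller has to know which key each type uses.
clones_by_id = index_by_key(clones, key="clone_id")
trees_by_id = index_by_key(trees, key="tree_id")
```

The trade-off is small either way: the entity-specific version needs one extra argument per object type, which matches the "not a very big deal" assessment.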

Thanks so much to everyone who participated in helping define a schema for Clones and Trees! This will help Olmsted be more widely useful and will hopefully be helpful for other tools and contexts as well.

tagging some folks from our team to be sure they get to include thoughts if they have them @matsen @psathyrella

@javh
Contributor

javh commented Feb 19, 2020

Thanks, @eharkins.

1. I think we should find a way to resolve this so that there is a field for this in the schema, rather than relying on a custom field. I suspect the count of (unique) sequences in the clone is going to be a very common field, so we should reserve a name/definition for it. (Related: #161)

2-3. rearrangement_id is supposed to be a universally unique identifier. Could you use this field instead?

5. Yeah, that's always a problem with these big PR threads. We usually just use labels to organize issues, but I'm sure we could start using the Projects board. It would give us more granularity without the accompanying mess of having a bunch of extra labels.

The problem with in-line comments is that when people commit changes/fixes they lose context. They are great for small and quick changes, but for larger discussions the code and comments get out of sync (we saw this a few times when working on the Clones/Trees.)

@eharkins
Contributor Author

  1. Sounds good to me, let me know how I can help with that or otherwise let me know when you have settled on a good name.
  2. Every object in our schema gets an ident. How would rearrangement_id work with objects higher up in the hierarchy needing uuids in our schema like dataset and subject and sample which contain many rearrangements?
  3. Makes sense. I hear you about the in-line comments and so recommend the project board for big projects with multiple issues like this schema. Here is an example of an all-encompassing one that we use for developing a phylogenetic pipeline.

@javh
Contributor

javh commented Feb 19, 2020

Every object in our schema gets an ident. How would rearrangement_id work with objects higher up in the hierarchy needing uuids in our schema like dataset and subject and sample which contain many rearrangements?

In that case, can you use the respective id fields for each object? Eg, study_id, subject_id, sample_id. These are defined as being unique within some context, rather than universally unique, but making them universally unique in Olmsted doesn't break that requirement (it's just more strict).

I hear you about the in-line comments and so recommend the project board for big projects with multiple issues like this schema.

Let's give it a try? I made a lineage project and added a few issues to it. We'll figure out if it's more burden than value by using it...

@eharkins
Contributor Author

In that case, can you use the respective id fields for each object? Eg, study_id, subject_id, sample_id. These are defined as being unique within some context, rather than universally unique, but making them universally unique in Olmsted doesn't break that requirement (it's just more strict).

I think @metasoarous and I added ident because we felt

making them universally unique in Olmsted

was too strict given our own or others' use of potentially non-unique ids in _id fields.

@javh
Contributor

javh commented Feb 20, 2020

#246 is relevant (long).

@schristley
Member

Every object in our schema gets an ident. How would rearrangement_id work with objects higher up in the hierarchy needing uuids in our schema like dataset and subject and sample which contain many rearrangements?

@eharkins Can you explain what "dataset" is in Olmsted terms? Is it the same as a single study? Is it orthogonal to a study and just a set of repertoires?

For metadata like the study, subject, sample processing, etc., the repertoire_id provides the singular unique link to that information. Subject and sample identifiers in the AIRR world are only unique within a study (primarily because they are user defined), but repertoire_id can be used in combination with them (like a compound key) to create uniqueness. The likely issue is that Olmsted has its own internal representation for this metadata versus using the AIRR repertoire metadata schema. That's reasonable as the schema isn't really a published standard yet.
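
As a rough illustration of the compound-key idea, here is a minimal sketch. The records are made up, and they are flattened for brevity (the field names `repertoire_id`, `subject_id` and `sample_id` come from this discussion, but the flat layout and the helper `compound_key` are assumptions, not the actual AIRR repertoire structure).

```python
from collections import Counter


def compound_key(record):
    """Combine repertoire_id with the user-defined ids to form a unique key."""
    return (record["repertoire_id"], record["subject_id"], record["sample_id"])


# Hypothetical records: two studies that happened to pick the same
# user-defined subject and sample ids.
records = [
    {"repertoire_id": "rep-1", "subject_id": "S1", "sample_id": "A"},
    {"repertoire_id": "rep-2", "subject_id": "S1", "sample_id": "A"},
]

# Count how often each key occurs under the two keying schemes.
user_defined_counts = Counter((r["subject_id"], r["sample_id"]) for r in records)
compound_counts = Counter(compound_key(r) for r in records)

# The user-defined ids collide, while the compound keys do not.
has_collision = max(user_defined_counts.values()) > 1
compound_is_unique = max(compound_counts.values()) == 1
```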

You could however, if you were interested, allow Olmsted to utilize AIRR repertoire metadata as an option in place of your internal schema, i.e. support both. There are quite a few studies (published data, not examples) loaded up in the data repositories that you could use. Alternatively, we could convert one of your datasets into the AIRR format, which might be more useful as you would know the expected output and functionality.

@javh
Contributor

javh commented Mar 11, 2020

I brought this up elsewhere, but what about having an optional alias called id in every object that points to the primary identifier (ie, id is a $ref to <entity>_id)?
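
A minimal sketch of what such an alias could look like on the consuming side, in Python rather than in the schema itself. The helper `with_id_alias` and the sample objects are hypothetical; the only assumption carried over from the proposal is that each object's primary identifier is stored under `<entity>_id`.

```python
def with_id_alias(obj, entity):
    """Return a copy of obj with an 'id' key mirroring its <entity>_id field."""
    primary_key = f"{entity}_id"
    aliased = dict(obj)  # shallow copy so the original object is not modified
    aliased["id"] = obj[primary_key]
    return aliased


# Hypothetical objects of two different entity types.
clone = with_id_alias({"clone_id": "clone-A"}, "clone")
sample = with_id_alias({"sample_id": "sample-7"}, "sample")

# Generic code can now read "id" without knowing the entity type.
ids = [obj["id"] for obj in (clone, sample)]
```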

@eharkins
Contributor Author

@schristley Olmsted dataset is not strictly defined in terms of a study. It can contain many subjects and samples so I would say

orthogonal to a study and just a set of repertoires

is probably accurate.

Subject and sample identifiers in the AIRR world are only unique within a study (primarily because they are user defined)

This is the reason why we kept ident (which we create using python uuids upon validation) since all other ids in our schema are user defined and we don't enforce their uniqueness at any level.
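
For anyone curious, the idea is roughly the following. This is a simplified sketch and not the actual Olmsted validation code: the function name `add_idents` and the nested dataset are hypothetical, and only the `ident` key and the use of Python's `uuid` module reflect what we do.

```python
import uuid


def add_idents(obj):
    """Recursively attach a random UUID under 'ident' to every dict that lacks one."""
    if isinstance(obj, dict):
        obj.setdefault("ident", str(uuid.uuid4()))
        for value in list(obj.values()):
            add_idents(value)
    elif isinstance(obj, list):
        for item in obj:
            add_idents(item)
    return obj


# Hypothetical dataset where two subjects share the same user-defined id.
dataset = add_idents({
    "dataset_id": "my-dataset",
    "subjects": [{"subject_id": "S1"}, {"subject_id": "S1"}],
})

first, second = dataset["subjects"]
```

Here `first` and `second` share a `subject_id` but get different `ident` values, so nothing downstream has to rely on the user-defined ids being unique.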

The likely issue is that Olmsted has its own internal representation for this metadata versus using the AIRR repertoire metadata schema.

Can you clarify what metadata you are referring to?

@schristley
Member

Can you clarify what metadata you are referring to?

I'm not familiar enough with Olmsted, but as you are mentioning subjects, samples and so forth, I'm assuming that Olmsted is storing information about them? For example in AIRR, the subject has an id, but it also has a species taxonomy code, a biological sex, an age, etc. That's the subject metadata, and there's metadata for samples and so on.

@eharkins
Contributor Author

Yes, we have metadata about those entities as well but, as you said, we don't adhere to AIRR standards in those cases. We could certainly aim to do this in the future!

It wasn't as much of a priority for us as Clones and Trees since we're usually dealing with samples from a single study using a single sample processing setup.


4 participants