Clone and tree schema feedback / debrief #333

eharkins · 2020-02-18T19:17:13Z

A debrief would be nice at some point, but it doesn't have to be today. I.e., what worked, what didn't, why you had to keep unique_seqs_count and ident, anything that was rough and should be changed, etc.

1. We kept unique_seqs_count because the concept of an individual sequence as a rearrangement isn't well defined in Olmsted, and given that context, "sequences" seemed like more intuitive wording for new users. As partis users, we might also confuse a rearrangement with an entire clonal family, since events in partis are described as:

   list of annotations for each rearrangement event (i.e. group of clonally-related sequences)

2. We kept ident since we don't otherwise enforce unique ids.
3. id fields in the AIRR context look like <entity>.<entity>_id, which I think (but can't remember for sure) is for DB querying reasons. This doesn't make as much sense in our context, where we don't need to do any querying and might care more about being able to use code that takes the id field of any object, which requires all the id fields to use the same key across objects. At the end of the day this seems like not a very big deal.
4. I think we did a good job of eventually asking "what is the minimum viable product?" for a schema, which allowed us to move on from some detailed discussions that were maybe best saved for later versions of the schema.
5. In general I found it hard to follow and track the various pieces of the schema being discussed all in one issue / branch, because the only documentation we had of them was that single thread. Ways I can think of doing this differently (both of which allow us to mark pieces of the schema as resolved):

   - commenting on pieces of the schema by starting a GitHub code review, so that all discussion related to each piece of the schema takes place in the review threads corresponding to the line(s) in the code
   - using GitHub Projects boards to track many separate issues for each part of the schema

Thanks so much everyone who participated in helping define a schema for Clones and Trees! This will help Olmsted be more widely useful and will hopefully be helpful for other tools and contexts as well.

tagging some folks from our team to be sure they get to include thoughts if they have them @matsen @psathyrella


javh · 2020-02-19T17:12:41Z

Thanks, @eharkins.

1. I think we should find a way to resolve this so that there is a field for this in the schema, rather than relying on a custom field. I suspect the count of (unique) sequences in the clone is going to be a very common field, so we should reserve a name/definition for it. (Related: #161)

2-3. rearrangement_id is supposed to be a universally unique identifier. Could you use this field instead?

5. Yeah, that's always a problem with these big PR threads. We usually just use labels to organize issues, but I'm sure we could start using the Projects board. It would give us more granularity without the accompanying mess of having a bunch of extra labels.

The problem with in-line comments is that when people commit changes/fixes they lose context. They are great for small and quick changes, but for larger discussions the code and comments get out of sync (we saw this a few times when working on the Clones/Trees.)

eharkins · 2020-02-19T17:50:53Z

Sounds good to me, let me know how I can help with that or otherwise let me know when you have settled on a good name.
Every object in our schema gets an ident. How would rearrangement_id work with objects higher up in the hierarchy needing uuids in our schema like dataset and subject and sample which contain many rearrangements?
Makes sense. I hear you about the in-line comments and so recommend the project board for big projects with multiple issues like this schema. Here is an example of an all-encompassing one that we use for developing a phylogenetic pipeline.

javh · 2020-02-19T20:37:54Z

Every object in our schema gets an ident. How would rearrangement_id work with objects higher up in the hierarchy needing uuids in our schema like dataset and subject and sample which contain many rearrangements?

In that case, can you use the respective id fields for each object? E.g., study_id, subject_id, sample_id. These are defined as being unique within some context, rather than universally unique, but making them universally unique in Olmsted doesn't break that requirement (it's just more strict).

I hear you about the in-line comments and so recommend the project board for big projects with multiple issues like this schema.

Let's give it a try? I made a lineage project and added a few issues to it. We'll figure out if it's more burden than value by using it...

eharkins · 2020-02-20T01:32:03Z

In that case, can you use the respective id fields for each object? E.g., study_id, subject_id, sample_id. These are defined as being unique within some context, rather than universally unique, but making them universally unique in Olmsted doesn't break that requirement (it's just more strict).

I think @metasoarous and I added ident because we felt

making them universally unique in Olmsted

was too strict given our own or others' use of potentially non-unique ids in _id fields.

javh · 2020-02-20T19:06:31Z

#246 is relevant (long).

schristley · 2020-03-11T20:52:06Z

Every object in our schema gets an ident. How would rearrangement_id work with objects higher up in the hierarchy needing uuids in our schema like dataset and subject and sample which contain many rearrangements?

@eharkins Can you explain what "dataset" is in Olmsted terms? Is it the same as a single study? Is it orthogonal to a study and just a set of repertoires?

For metadata like the study, subject, sample processing, etc., the repertoire_id provides the singular unique link to that information. Subject and sample identifiers in the AIRR world are only unique within a study (primarily because they are user defined), but repertoire_id can be used in combination with them (like a compound key) to create uniqueness. The likely issue is that Olmsted has its own internal representation for this metadata versus using the AIRR repertoire metadata schema. That's reasonable as the schema isn't really a published standard yet.
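The compound-key idea above can be sketched in a few lines of Python. This is only an illustration: the records, their values, and the `compound_key` helper are hypothetical, and only the field names repertoire_id, subject_id, and sample_id come from the discussion.

```python
from typing import Dict, Tuple

# Hypothetical records: the user-defined subject_id and sample_id collide,
# because they are only unique within their own study.
records = [
    {"repertoire_id": "rep-A", "subject_id": "S1", "sample_id": "T0"},
    {"repertoire_id": "rep-B", "subject_id": "S1", "sample_id": "T0"},
]


def compound_key(record: Dict[str, str]) -> Tuple[str, str, str]:
    """Combine repertoire_id with the user-defined ids into one key."""
    return (record["repertoire_id"], record["subject_id"], record["sample_id"])


# Indexing by the compound key keeps both records, even though
# subject_id and sample_id alone would not distinguish them.
index = {compound_key(r): r for r in records}
```

Indexing by subject_id alone would silently collapse the two records into one, which is the collision the compound key avoids.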

You could, however, if you were interested, allow Olmsted to utilize AIRR repertoire metadata as an option in place of your internal schema, i.e. support both. There are quite a few studies (published data, not examples) loaded up in the data repositories that you could use. Alternatively, we could convert one of your datasets into the AIRR format, which might be more useful as you would know the expected output and functionality.

javh · 2020-03-11T20:58:38Z

I brought this up elsewhere, but what about having an optional alias called id in every object that points to the primary identifier (i.e., id is a $ref to <entity>_id)?
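One way to picture the proposed alias is the rough Python sketch below. It is not part of the AIRR schema or any AIRR library: the `with_id_alias` helper, the entity names, and the example values are all assumptions made for illustration.

```python
from typing import Any, Dict


def with_id_alias(entity: str, obj: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of obj with an 'id' key mirroring '<entity>_id'.

    Hypothetical helper: it mimics an alias at the data level, so generic
    code can read obj["id"] without knowing the entity type.
    """
    aliased = dict(obj)
    aliased["id"] = obj[f"{entity}_id"]
    return aliased


# Hypothetical objects of two different entity types.
clone = with_id_alias("clone", {"clone_id": "c1"})
sample = with_id_alias("sample", {"sample_id": "s7"})

# Generic code can now use the same key across object types.
ids = [o["id"] for o in (clone, sample)]
```

The entity-specific key stays in place, so code written against <entity>_id keeps working alongside code that only reads id.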

eharkins · 2020-03-11T21:49:09Z

@schristley An Olmsted dataset is not strictly defined in terms of a study. It can contain many subjects and samples, so I would say

orthogonal to a study and just a set of repertoires

is probably accurate.

Subject and sample identifiers in the AIRR world are only unique within a study (primarily because they are user defined)

This is the reason why we kept ident (which we create using Python UUIDs upon validation), since all other ids in our schema are user defined and we don't enforce their uniqueness at any level.
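A minimal sketch of that approach, assuming a hypothetical `add_ident` step run during validation. This is not Olmsted's actual code; only the ident field name and the use of Python's standard `uuid` module are taken from the comment above.

```python
import uuid
from typing import Any, Dict


def add_ident(obj: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of obj with a generated 'ident' if it has none.

    Hypothetical validation step: user-defined ids are left untouched
    and are never checked for uniqueness.
    """
    stamped = dict(obj)
    if "ident" not in stamped:
        stamped["ident"] = str(uuid.uuid4())  # random (version 4) UUID
    return stamped


# Two objects sharing the same user-defined clone_id still get
# distinct ident values.
a = add_ident({"clone_id": "c1"})
b = add_ident({"clone_id": "c1"})
```

Because ident is generated rather than user supplied, it can serve as a unique key even when clone_id, sample_id, and similar fields repeat.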

The likely issue is that Olmsted has its own internal representation for this metadata versus using the AIRR repertoire metadata schema.

Can you clarify what metadata you are referring to?

schristley · 2020-03-11T22:12:12Z

Can you clarify what metadata you are referring to?

I'm not familiar enough with Olmsted, but as you are mentioning subjects, samples and so forth, I'm assuming that Olmsted is storing information about them? For example, in AIRR the subject has an id, but it also has a species taxonomy code, a biological sex, an age, etc. That's the subject metadata, and there's metadata for samples and so on.

eharkins · 2020-03-11T22:20:35Z

Yes, we have metadata about those entities as well, but as you said, we don't adhere to AIRR standards in those cases. We could certainly aim to do this in the future!

It wasn't as much of a priority for us as Clones and Trees since we're usually dealing with samples from a single study using a single sample processing setup.

scharch self-assigned this Jul 10, 2023

scharch added this to the AIRR v2.0.0 milestone Jul 10, 2023

bcorrie mentioned this issue Feb 8, 2024

Add support for lineage schema to the python and R libraries #378

Open

scharch mentioned this issue Mar 21, 2024

Clone-schema-updates #778

Draft

6 tasks

javh closed this as completed Sep 9, 2024
