First pass at SpecimenProcessing class #12

Merged: 22 commits, Dec 2, 2024

jamesaoverton (Collaborator) commented Nov 6, 2024

First pass at a SpecimenProcessing class to address airr-knowledge/issues#58.

This adds a SpecimenProcessing class and a multivalued specimen_processing slot to Assay. The SpecimenProcessing class should have more slots, as required.
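A minimal LinkML sketch of the shape described above, for orientation only. It is not the schema text from this PR: the schema id, the `akc_id` identifier slot, and the `specimen` slot on `Assay` are assumptions made so the fragment stands on its own.

```yaml
# Hypothetical sketch, not the schema from this PR.
id: https://example.org/akc-sketch        # placeholder schema id
name: akc_sketch
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
default_range: string

classes:
  Specimen:
    slots:
      - akc_id
  SpecimenProcessing:
    description: One processing step applied to a single specimen.
    slots:
      - akc_id
      - specimen
  Assay:
    slots:
      - akc_id
      - specimen                # assumed link from the assay to its specimen
      - specimen_processing

slots:
  akc_id:                       # assumed name for the identifier slot
    identifier: true
  specimen:
    range: Specimen
  specimen_processing:
    description: Processing steps applied to the specimen before the assay.
    range: SpecimenProcessing
    multivalued: true           # an Assay can list several processing steps
```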

github-actions bot commented Nov 6, 2024

PR Preview Action v1.4.8
🚀 Deployed preview to https://airr-knowledge.github.io/ak-schema/pr-preview/pr-12/
on branch gh-pages at 2024-11-29 21:15 UTC

bcorrie (Collaborator) commented Nov 6, 2024

Trying to wrap my head around this in relation to the ADC.

Is the intent of this to say that a specific specimen was used in a specific Assay after a specific chain of specimen_processing was applied? So there is one (and only one) Specimen (e.g. a blood sample) and if the blood sample was split into four different aliquots that are processed differently (and eventually Assayed separately) then that splitting would be captured through different chains of SpecimenProcessing steps.

If we are primarily capturing a chain of SpecimenProcessing steps to capture what happens to a Specimen, then does order matter? I would presume so?

bcorrie (Collaborator) commented Nov 6, 2024

The current model associates a SpecimenProcessing with a single Specimen, which means that we can't apply the same SpecimenProcessing to multiple Specimens. That is, we can't say that the same SpecimenProcessing was applied to all Specimens from all Participants in an Investigation, if I understand this correctly.

jamesaoverton (Collaborator, Author) commented

Yes, the modelling on this PR would use many individual instances of the new SpecimenProcessing class, just like we have many individual instances of the Specimen and Assay classes, and the specimen_processing list is ordered. This is useful if the specimen processing varies between specimens and assays.
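As an illustration of "many individual instances" with an ordered list, the instance data might look roughly like the sketch below. All identifiers, the `AKC:` prefix, the top-level container keys, and the `akc_id` and `specimen` slot names are hypothetical. The only points taken from the discussion are that each SpecimenProcessing belongs to one Specimen and that the `specimen_processing` list on an Assay is read in order.

```yaml
# Hypothetical instance data; names and identifiers are illustrative only.
specimens:
  - akc_id: AKC:specimen-1            # one blood sample

specimen_processings:
  - akc_id: AKC:processing-1a
    specimen: AKC:specimen-1
  - akc_id: AKC:processing-1b
    specimen: AKC:specimen-1
  - akc_id: AKC:processing-2a
    specimen: AKC:specimen-1

assays:
  - akc_id: AKC:assay-1
    specimen: AKC:specimen-1
    specimen_processing:              # list order is the order of the steps
      - AKC:processing-1a
      - AKC:processing-1b
  - akc_id: AKC:assay-2
    specimen: AKC:specimen-1
    specimen_processing:              # a different chain for a second aliquot
      - AKC:processing-2a
```

Two aliquots of the same specimen are distinguished only by the chain of processing instances attached to each assay, which is why identical steps are repeated per specimen rather than shared.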

On the other hand, if you always want to describe a common specimen processing protocol that's applied to all specimens from all participants in an investigation, then we should model it in a different way. My two initial suggestions would be defining subtypes of Assay (which is what we do in IEDB), or attaching that information to the Investigation or maybe StudyArm classes.

I don't know which approach is the right one for your data, but I'm sure we can find something that works.

schristley (Contributor) commented

> Yes, the modelling on this PR would use many individual instances of the new SpecimenProcessing class, just like we have many individual instances of the Specimen and Assay classes, and the specimen_processing list is ordered. This is useful if the specimen processing varies between specimens and assays.

This one. This is also how the data is organized in AIRR. There can be commonality across specimens, but implementing that introduces complexity that we don't really need. Semantically things don't change, it's just an optimization to save space and reduce data duplication.

bcorrie (Collaborator) commented Nov 7, 2024

@jamesaoverton any objections to me jumping in and making changes on this branch to Assay to try and incorporate the AIRRSequencingAssay as discussed in airr-knowledge/issues#64?

Or maybe I should work through pull request #10, since it is changing Assay?

Alternatively, I can wait for #10 to be merged and then create a new branch/pull request.

I think we probably want to generalize Assay such that it doesn't have a value and unit and then specialize it for TCellReceptorEpitopeBindingAssay (which does need a value and unit) and AIRRSequencingAssay which does not.
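A rough LinkML sketch of that generalization, assuming slot names `value` and `unit` as mentioned above. The schema id, the `akc_id` slot, and the choice to mark `Assay` as abstract are assumptions for illustration, not decisions recorded in this thread.

```yaml
# Hypothetical sketch of a generic Assay with two specializations.
id: https://example.org/akc-assay-sketch  # placeholder schema id
name: akc_assay_sketch
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
default_range: string

classes:
  Assay:
    abstract: true              # assumption: only subclasses are instantiated
    slots:
      - akc_id
  TCellReceptorEpitopeBindingAssay:
    is_a: Assay
    slots:                      # the measured result lives on this subclass only
      - value
      - unit
  AIRRSequencingAssay:
    is_a: Assay                 # inherits akc_id, has no value or unit

slots:
  akc_id:                       # assumed name for the identifier slot
    identifier: true
  value:
    range: float
  unit:
    range: string
```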

bcorrie (Collaborator) commented Nov 7, 2024

It looks like #10 is close to merging, so maybe it is best to wait for @schristley to approve and merge it, and then create a separate branch/pull request for the Assay changes.

jamesaoverton (Collaborator, Author) commented

> Alternatively, I can wait for #10 to be merged and then create a new branch/pull request.

#10 is ready to merge. I'm hoping @schristley can review and hit the 'Merge' button.

> I think we probably want to generalize Assay such that it doesn't have a value and unit and then specialize it for TCellReceptorEpitopeBindingAssay (which does need a value and unit) and AIRRSequencingAssay which does not.

That's fine with me.

bcorrie (Collaborator) commented Nov 18, 2024

We need to think a bit about the granularity of sample processing, like we did with sequencing assays. I see that there are:

The challenge for these is that I think the AIRR Standard has different fields that are relevant to different processing steps. So it would seem to me that we might want individual AKC classes for these??? Flow cytometry is the best example I suspect. And I think that many of the fields that are assigned to the AIRR-seq Assay are actually fields that should be assigned to something like a NucleicAcidProcessing AKC class rather than the assay itself.

Recall the AIRR Standard has NucleicAcidProcessing, CellProcessing among others that might be relevant...

bcorrie (Collaborator) commented Nov 18, 2024

For example, I would be tempted to create:

  FlowCytometryProcessing:
    is_a: SpecimenProcessing
    class_uri: OBI:00009160 # flow cytometry processing
    slots:
      - cell_subset # From ADC CellProcessing
      - cell_phenotype
      - cell_species

bcorrie (Collaborator) commented Nov 18, 2024

> class_uri: OBI:00009160 # flow cytometry processing

Actually, this class should probably be an Isolation of Cell Population: https://ontobee.org/ontology/OBI?iri=http://purl.obolibrary.org/obo/OBI_0000512

  CellIsolationProcessing:
    is_a: SpecimenProcessing
    class_uri: OBI:0000512 # cell isolation processing
    slots:
      - cell_subset # From ADC CellProcessing
      - cell_phenotype
      - cell_species

This might be done with Flow Cytometry, but possibly not...

bcorrie (Collaborator) commented Nov 18, 2024

I merged in main and resolved the conflicts, I think I did it correctly 8-)

schristley (Contributor) commented

> The challenge for these is that I think the AIRR Standard has different fields that are relevant to different processing steps. So it would seem to me that we might want individual AKC classes for these???

At first glance, that seems reasonable to me. The slots/fields are the most important pieces. Then we can organize them into classes that either are similar to AIRR, based upon ontology terms, or some combination of both.

I want to avoid creating too big of a scope though. Before we go off creating lots of classes, I'd like some feedback about the granularity of the specimen processing coming from ImmPort and/or HIPC. We can certainly focus on what's needed for AIRR-seq, which might require just minimal definitions for a few classes. Specimen processing is also where we expect to have a lot of overlap with Christian's NF4DImmuno where they would have a much richer set of classes.

A first pass at classes driven by the ADC info.

bcorrie (Collaborator) commented Nov 18, 2024

I have added three sample processing classes as described above: CellIsolationProcessing, NucleicAcidProcessing, and LibraryPreparationProcessing.

These map pretty naturally to both the AIRR objects as well as OBI material processing entities.

In my recent looking at both ImmPort and ImmuneSpace, I wasn't able to find this type of information in their study metadata, but I may have been looking in the wrong place. At least through the web portals for both, it is very hard to find either material processing entities or assays that are AIRR-seq related...

bcorrie (Collaborator) commented Nov 18, 2024

I also moved some of the fields out of the ReceptorRepertoireSequencingAssay and into the above processing classes, where I think they reasonably belong.

Not being an expert in either running Assays or performing SpecimenProcessing, I am not sure I have this correct.

It is easy to undo this if we decide this is not a good way to go.

bcorrie (Collaborator) commented Nov 20, 2024

@schristley @jamesaoverton conversion of these classes is working from ADC -> AKC. Do you see any issues or require anything more from IEDB side?

Any comments on the above? Is this "good enough" for now? It captures pretty well all of the AIRR fields. Example output from the ADC to AKC conversion is in the Google Drive Study Analysis folder.

Should we merge and close this as a working first implementation?

bcorrie (Collaborator) commented Nov 20, 2024

@jamesaoverton as @schristley suggested, I suppose we should look at the HIPC/ImmPort/ImmuneSpace models for specimen processing as well. Are those available somewhere?

bcorrie (Collaborator) commented Nov 20, 2024

OK, here is a concrete example: https://immunespace.org/query/study/SDY888 - one of Bjoern's papers. 8-)

This study has T-cell repertoire sequencing, but there is no information about any of this in the metadata on ImmuneSpace for this study as far as I can tell. There doesn't appear to be a repertoire sequencing assay, and there is no information about any of the sample processing that was carried out. There is a detailed cell sort in this study (which would go into CellIsolationProcessing) and information that we would typically store in our AIRR-seq specific Assay and other Processing classes.

So ImmuneSpace certainly has studies that did AIRR-seq, but it does not seem to be capturing any of that information in the study metadata. Is this intentional or am I missing something?

schristley (Contributor) commented

Here is another that James referenced in another issue.

bcorrie (Collaborator) commented Nov 20, 2024

> There have been discussions about getting a pipeline in place where we can detect if an ImmPort study has AIRR-seq (hard)

Indeed... I have been trying to use both ImmPort and ImmuneSpace to find AIRR-seq data sets, and it is almost impossible. You have to essentially search study metadata for keywords and hope the authors used them somewhere in their abstract or title to get a hit. I have actually confirmed this with ImmPort tech support 8-(

So it is intentional that ImmuneSpace doesn't support AIRR-seq studies and AIRR-seq data (at least at this time)?

schristley (Contributor) commented

> There have been discussions about getting a pipeline in place where we can detect if an ImmPort study has AIRR-seq (hard)
>
> Indeed... I have been trying to use both ImmPort and ImmuneSpace to find AIRR-seq data sets, and it is almost impossible.

HIPC has made a modest change with their single cell template. As you can see in this example, there is a "characteristic" called "library type" that has the value "scBCR-seq". There are other values for TCR and bulk. The issue is that you cannot search those characteristics using NCBI's API. You have to download the SOFT or XML format and search within that.

bcorrie (Collaborator) commented Nov 20, 2024

On ImmPort there are 4 assays that one might find relevant:

  • Assay Methods: B cell receptor repertoire sequencing assay
  • Assay Methods: IgH Sequencing
  • Assay Methods: scRNA-seq
  • Assay Methods: T cell receptor repertoire sequencing assay

Which results in 8 studies.

bcorrie (Collaborator) commented Nov 20, 2024

> @jamesaoverton as @schristley suggested, I suppose we should look at the HIPC/ImmPort/ImmuneSpace models for specimen processing as well. Are those available somewhere?

So this implies to me (as @jamesaoverton suggested in #12 (comment)) that HIPC/ImmPort/ImmuneSpace won't be adding much to our SpecimenProcessing, so maybe we can merge and close this?

bcorrie marked this pull request as ready for review on November 20, 2024, 22:14

jamesaoverton (Collaborator, Author) commented Nov 27, 2024

As we just discussed on the call, ImmuneSpace is planning to include some specimen processing information, but we don't have any yet. (So what I said above wasn't correct concerning ImmuneSpace.) I don't know exactly what we'll need, but I'm happy for you guys to push the design with your immediate needs.

The only question I have right now is: how is the order of the specimen processing steps being captured? I don't see it in the LinkML here, but I might be missing something. Or maybe you don't want to capture that order?

bcorrie (Collaborator) commented Nov 27, 2024

@schristley @jamesaoverton @bpeters42 it sounds to me like we are fine to use the AIRR standard sample processing as the starting point for the HIPC data model in this case. Currently, there is no such processing in the HIPC model.

We have added CellIsolationProcessing, NucleicAcidProcessing, and LibraryPreparationProcessing to the AKC model.

If these get added to the HIPC model, then the HIPC model will have a sample processing data model that can be directly mapped back to the ADC, which I think is what is desirable, if I understood the discussion correctly.

So - I think we can merge?? 8-)

bcorrie (Collaborator) commented Nov 27, 2024

@jamesaoverton in the AIRR Standard there is no order other than that implied in the order they appear in the spec. Not sure if that is intentional or not, I would need someone who has actually done some sample processing to comment 8-)

How do you show order in a list in LinkML?

A merge of the ADC and IEDB examples.

bcorrie (Collaborator) commented Nov 27, 2024

I just pushed the combined JSON file, which is a merged AIRRKnowledgeCommons object across all 11 ADC studies and the example IEDB data. So we have 12 investigations along with all of the other data.

jamesaoverton (Collaborator, Author) commented

> @jamesaoverton in the AIRR Standard there is no order other than that implied in the order they appear in the spec. Not sure if that is intentional or not, I would need someone who has actually done some sample processing to comment 8-)
>
> How do you show order in a list in LinkML?

I expect we will care about order inside ImmuneSpace, but I'll discuss that with Bjoern, and that shouldn't block this PR.

LinkML represents multivalued fields as JSON arrays, so the simplest way to capture order might be a specimen_processings slot on Specimen. Our LifeEvents model times relative to T0, but I doubt we will have that information for specimen processing.
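One way this could be written in LinkML is sketched below, assuming the `specimen_processings` slot name from the comment above. The schema id and the `akc_id` slot are placeholders. `list_elements_ordered` is a LinkML metamodel flag for stating that element order is meaningful; whether it is wanted here is an open question, and serializing to a JSON array already keeps the written order.

```yaml
# Hypothetical sketch of an ordered processing list on Specimen.
id: https://example.org/akc-order-sketch  # placeholder schema id
name: akc_order_sketch
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
default_range: string

classes:
  SpecimenProcessing:
    slots:
      - akc_id
  Specimen:
    slots:
      - akc_id
      - specimen_processings

slots:
  akc_id:                          # assumed name for the identifier slot
    identifier: true
  specimen_processings:
    range: SpecimenProcessing
    multivalued: true
    inlined_as_list: true          # serialize as a JSON array of objects
    list_elements_ordered: true    # declare that array order carries meaning
```

With this shape, the first element of the array would be read as the first processing step, with no timestamps or T0 offsets required.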

bcorrie (Collaborator) commented Nov 29, 2024

> So - I think we can merge?? 8-)

@schristley I think this is ready to merge. I am slowly adding more things on this branch, but they are not really related to specimen processing. Each AKC object instance now has a source_uris slot that points back to the list of ADC_REPERTOIRE:XXX URIs from which the object was created (see airr-knowledge/issues#55 and airr-knowledge/issues#56).

See https://github.com/airr-knowledge/ak-schema/tree/specimen-processing/examples/adc for examples.
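For readers without the repository at hand, a `source_uris` slot of this kind could be declared roughly as follows. This is a sketch under assumptions, not the merged schema: the `AKCObject` base class, the `akc_id` slot, and the schema id are hypothetical names, and only `source_uris` comes from the comment above.

```yaml
# Hypothetical sketch of a provenance slot shared by all objects.
id: https://example.org/akc-provenance-sketch  # placeholder schema id
name: akc_provenance_sketch
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
default_range: string

classes:
  AKCObject:                     # hypothetical common base class
    abstract: true
    slots:
      - akc_id
      - source_uris
  Specimen:
    is_a: AKCObject              # inherits source_uris from the base class

slots:
  akc_id:                        # assumed name for the identifier slot
    identifier: true
  source_uris:
    description: URIs of the source records from which this object was created.
    range: uriorcurie
    multivalued: true            # one object may derive from several source records
```

An instance would then carry a list such as `source_uris: [ADC_REPERTOIRE:XXX]`, where `XXX` stands for a repertoire identifier as in the comment above.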

schristley merged commit 0ea5d91 into main on Dec 2, 2024 (3 checks passed)
bcorrie mentioned this pull request on Dec 14, 2024
Linked issue that merging this pull request may close: How are we going to represent AIRR sample processing in the AKC
3 participants