Replies: 3 comments
-
Let me know what you think?
-
I think looking at things from a storytelling perspective might be helpful. It's much easier to look at ISA JSON's process sequence and see the story of the study than it is to look at the EDS entities and see the story. Below are 2 samples from one of the CESB lipid FTMS experiments. Even after having worked with this same data multiple times, I still get confused about things if I try to follow the protocols. The "*-lipid" sample has a "lipid_extraction" protocol. From the sample name I would think that this is the extracted lipids we are talking about, which means the parent is the entity that actually had its lipids extracted. Also, you would never know that the "*-lipid" sample is the one that goes on to be measured by FTMS.
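To make the confusion concrete, here is a minimal, hypothetical sketch of EDS-style entity records (all names invented for illustration; this is not the actual CESB data). Each entity carries a parent and a list of protocols that were applied, and nothing points forward to a measurement:

```python
# Hypothetical EDS-style entity records (names invented for illustration).
# Each entity has a parent_id and the list of protocols applied to it.
entities = {
    "sampleA": {
        "parent_id": "subject1",
        "protocol": ["tissue_quench", "frozen_tissue_grind"],
    },
    "sampleA-lipid": {
        "parent_id": "sampleA",
        "protocol": ["lipid_extraction"],
    },
}

# From these records alone there is no hint that "sampleA-lipid" goes on
# to be measured; no protocol in the lists mentions a measurement.
measured_somewhere = any(
    "measurement" in p for e in entities.values() for p in e["protocol"]
)
```

The point of the sketch is that you have to already know the conventions (does "lipid_extraction" on "sampleA-lipid" mean it was extracted, or that it is the extract?) to read the story out of these records.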
Now compare this to adding "input_of" and "output_of" to the same entities. It's clearer what they were inputs of and outputs of, but it is still a little confusing in my opinion because of the list of inputs. Once you get to "lipid_extraction", it's hard to keep in mind that this thing has already been frozen and ground up before going into lipid_extraction. I can see why ISA doesn't allow multiple protocols like this. You can also see that the "*-lipid" sample is an input to "DI-FTMS1", so you know it is measured.
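The same hypothetical entities with "input_of"/"output_of" fields might look like this (a sketch, names invented except "DI-FTMS1" from the discussion above):

```python
# Hypothetical entity records with explicit "input_of"/"output_of" fields
# (names invented for illustration).
entities = {
    "sampleA": {
        "parent_id": "subject1",
        "input_of": ["tissue_quench", "frozen_tissue_grind", "lipid_extraction"],
        "output_of": ["tissue_quench", "frozen_tissue_grind"],
    },
    "sampleA-lipid": {
        "parent_id": "sampleA",
        "input_of": ["DI-FTMS1"],   # now you can see the extract is measured
        "output_of": ["lipid_extraction"],
    },
}
```

The forward link to the measurement is now visible, but the long "input_of" list on "sampleA" is exactly the part that stays confusing: the list alone doesn't say the sample was already quenched and ground before going into lipid_extraction.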
Now let's compare to looking at things from a protocol point of view. Note I am using ellipses instead of listing all of the samples. I also have not included next and previous protocols, but they are listed in order. It's quite easy to see what steps were done and which samples were involved at each step. It is a little weird that tissue_quench and frozen_tissue_grind have the same inputs and outputs, but we have to do something like that or create a new entity for each protocol. I do see one issue with the measurement protocol. Listing every measurement in the measurement table for outputs seems excessive, and won't be possible for ISA submissions. We could let this be file names, or leave it blank with the understanding that the outputs are the entire measurement table.
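The protocol-centric view might be sketched like this (hypothetical records in execution order; names invented except "DI-FTMS1"):

```python
# Hypothetical protocol-centric records, listed in execution order
# (names invented for illustration).
protocols = [
    {"name": "tissue_quench",       "inputs": ["sampleA"],       "outputs": ["sampleA"]},
    {"name": "frozen_tissue_grind", "inputs": ["sampleA"],       "outputs": ["sampleA"]},
    {"name": "lipid_extraction",    "inputs": ["sampleA"],       "outputs": ["sampleA-lipid"]},
    # Measurement outputs left blank, with the understanding that the
    # outputs are the entire measurement table.
    {"name": "DI-FTMS1",            "inputs": ["sampleA-lipid"], "outputs": []},
]
```

This makes the step order and the samples involved at each step easy to read, at the cost of tissue_quench and frozen_tissue_grind sharing identical inputs and outputs.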
Should we meet to discuss?
-
We met and discussed this. A summary of what was decided:
We would likely need to meet and discuss 2 and 3 again. It wasn't until near the end of the meeting that I understood that the protocol%sequence Hunter was talking about would essentially be a protocol on 1 entity, not the sequence of the whole study/assay like the ISA JSON processSequence. In another meeting we would need to address these ideas separately. Even with the clarification of the "linked" field, I still think one of the 2 options I demonstrated above is a better solution. Hunter did recommend changing "input_of" and "output_of" to better names like "to_protocol" and "from_protocol" or something similar. Since the "linked" field is not actually malleable (what a protocol is linked to is defined by its type), this would only allow the user to make a mistake and would require more validation. It's also not as easy to see and understand without reading some documentation. Seeing "inputs" and "outputs" on a protocol can be implicitly understood, but a "linked" field cannot.
-
Introduction
I have not liked how the EDS stores entity-protocol relationships from the beginning. It has a few too many ambiguities and special rules for my liking. I also don't like ISA JSON in general, and there are issues with how it does entity-protocol relationships as well, but there are parts of it I like much better, and I think we might be able to come up with something better by comparing the 2.
What has really driven me to write this is starting to implement some of the decisions we have discussed and made previously. Specifically, we discussed creating a "storage%measurement" protocol type that we could use to connect entities in the entity table to a measurement protocol, because when creating ISA JSON there will be no measurements table. While this works, it feels like a patch, and I think if we adopt some of the things I am going to discuss we may not need to do it.
ISA JSON Process
ISA JSON is a little tough to describe because it is more complicated than the EDS. I'm not sure whether the complexity is really necessary. If the experiment is more complicated than what we are used to (metabolomics experiments from CESB), or you really want to capture more information than we typically do, then the complexity is nice and maybe even necessary. But for what we are used to and what information we capture, it seems like more trouble than it is worth.
The main object in ISA JSON for entity-protocol relationships is a processSequence object, but I am just going to refer to it as a process because the object isn't actually a sequence of processes; it is a single process. The study object has a processSequence attribute that is an array of these objects, so that is actually a sequence of processes, but each object itself is just a process. I'm not sure why they use that confusing naming convention. An ISA JSON process is similar to an EDS protocol and has a "protocol" attribute that is an ISA JSON protocol, but the process is more than its protocol. An ISA JSON protocol is a little more abstract than an EDS protocol. It is disconnected from any particular samples or measurements, and from time. It merely says: this is the set of steps, components, parameters, etc. that describe this procedure. The process is then used to connect the protocol to a specific time, performer, sample, parameter value(s), component(s), etc. The process encapsulates or wraps around the protocol.
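A minimal sketch of that split, with the abstract protocol defined once and a process wrapping it with the run-specific details (field names follow my reading of the ISA-JSON schema, e.g. "executesProtocol"; all values are invented):

```python
# Abstract protocol: steps/parameters only, no samples, no time.
protocol = {
    "@id": "#protocol/lipid_extraction",
    "name": "lipid_extraction",
    "description": "Abstract procedure; not tied to any sample or date.",
}

# Process: wraps the protocol and pins it to a performer, date, and samples.
process = {
    "@id": "#process/lipid_extraction_run1",
    "executesProtocol": {"@id": "#protocol/lipid_extraction"},
    "performer": "someone",
    "date": "2024-01-01",
    "inputs": [{"@id": "#sample/sampleA"}],
    "outputs": [{"@id": "#material/sampleA-lipid"}],
}
```

The "@id" reference is how the process points back at the protocol it executes; everything run-specific lives on the process, everything reusable on the protocol.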
This encapsulation is part of the added complexity of ISA JSON when compared to the EDS. If you do not need to capture the same protocol being run at different times, or by different performers, or with different components and parameters, then you don't need 2 separate objects; you can simply put all of that information on the protocol directly like we do in the EDS. Most protocols are going to have a 1-to-1 relationship with processes, that is, only 1 process will use the protocol. The big exception is treatment protocols, where you do the same procedure but with, for example, a different concentration of treatment. In the EDS we simply create a separate protocol for each treatment variation, but in ISA JSON it would be 1 protocol used in multiple processes, 1 process for each different treatment. I'm not sure whether treatment protocols alone would be enough to justify the additional nestedness of protocols within a process. You have to weigh having "extra" protocols or repeated information vs the overhead of an additional process object. For us, the number of non-treatment protocols heavily outweighs the treatment protocols, so I think the additional overhead for those is worse than repeating a treatment protocol multiple times.
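The treatment case is the one place the split pays off; a sketch of 1 protocol reused by 2 processes that differ only in a parameter value (all names and values invented):

```python
# One abstract treatment protocol...
treatment_protocol = {"@id": "#protocol/treatment", "name": "treatment"}

# ...reused by two processes, each with its own parameter value.
processes = [
    {
        "executesProtocol": {"@id": "#protocol/treatment"},
        "parameterValues": [{"category": "concentration", "value": 10, "unit": "uM"}],
    },
    {
        "executesProtocol": {"@id": "#protocol/treatment"},
        "parameterValues": [{"category": "concentration", "value": 50, "unit": "uM"}],
    },
]
```

The EDS equivalent would be two nearly identical protocols differing only in the concentration field, which is the repetition-vs-overhead trade-off described above.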
Inputs and Outputs
The parts I like about the process object are the parts that are entirely absent from the EDS. Namely, the process object has attributes for "inputs", "outputs", "nextProcess", and "previousProcess". "inputs" and "outputs" are the subjects, samples, and files that the process/protocol used and created, respectively. In ISA JSON, "inputs" is an array of 4 possible object types: "sources" (similar to subjects), "samples" (similar to samples), "dataFiles" (basically just a filename), and "otherMaterials" (basically a sample). "otherMaterials" is the strangest one. I don't see it as uniquely different from a sample, and the only difference in the object is that it has a "type" attribute that must be either "Extract" or "Labeled Extract". It just seems like a more specific sample object that could have been handled using the "characteristics" attribute in the sample object. "outputs" is the same as "inputs", but it cannot have any "sources".
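The 4 input shapes side by side, as I read them (a sketch with invented values, not a schema-exact rendering):

```python
# The four kinds of objects an ISA JSON "inputs" array can hold
# (invented values; shapes simplified for illustration).
inputs = [
    {"@id": "#source/subject1"},                        # source: similar to a subject
    {"@id": "#sample/sampleA"},                         # sample
    {"@id": "#data/raw_scan.mzML"},                     # dataFile: basically a filename
    {"@id": "#material/extract1", "type": "Extract"},   # otherMaterial: "Extract" or "Labeled Extract"
]

# "outputs" takes the same shapes, minus sources.
outputs = [
    {"@id": "#sample/sampleB"},
    {"@id": "#data/processed.mzML"},
]
```

The otherMaterial's "type" attribute is the only thing separating it from a plain sample, which is why it reads like a sample with one hard-coded characteristic.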
I like "inputs" and "outputs" being on the process/protocol a lot. In the EDS the protocols aren't connected to entities directly (except for the measurement type protocol). Instead, the protocol is put on the entity/measurement object, but not in a consistent way: the protocol is sometimes placed on the input entity and sometimes on the output entity, varying by protocol type. I have created a table to visualize:
This is the part that I never liked. It's confusing and seems somewhat arbitrary. The idea of explicitly listing inputs and outputs makes everything consistent, so I like it. One inherent problem with inputs and outputs, though, is that it implies 1 protocol per entity. If you have a protocol that takes a sample as input and then produces another as output, it doesn't make sense to have another protocol that takes the same input and produces the same output. This also requires you to check for consistency in this regard: no 2 protocols can have the same input and different outputs. I don't like this aspect because I do like that in the EDS we can have multiple sample prep protocols and put them all on 1 entity. This makes it so you don't have to have 1 giant sample prep protocol with many details and steps, or create a "new" sample each time you apply each smaller sample prep protocol. A compromise might be to allow protocols with the exact same inputs and outputs, or protocols with the same set of sample names for its inputs and outputs, with the understanding that this is just to condense the sample ancestry so it isn't polluted with names unnecessarily.
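A sketch of what that consistency check might look like, using the rule as stated (hypothetical function and data; `conflicting_protocols` is invented for illustration):

```python
def conflicting_protocols(protocols):
    """Return pairs of protocol names that share an input but differ in outputs."""
    conflicts = []
    for i, p in enumerate(protocols):
        for q in protocols[i + 1:]:
            shared = set(p["inputs"]) & set(q["inputs"])
            if shared and set(p["outputs"]) != set(q["outputs"]):
                conflicts.append((p["name"], q["name"]))
    return conflicts

protocols = [
    # Same inputs AND same outputs: allowed under the compromise above.
    {"name": "tissue_quench",       "inputs": ["s1"], "outputs": ["s1"]},
    {"name": "frozen_tissue_grind", "inputs": ["s1"], "outputs": ["s1"]},
    # Same input but a different output: flagged by the strict rule.
    {"name": "lipid_extraction",    "inputs": ["s1"], "outputs": ["s1-lipid"]},
]

conflicts = conflicting_protocols(protocols)
```

Note that the strict rule flags lipid_extraction against both prep protocols even though that chain is legitimate, which is exactly the tension the compromise (allowing identical inputs and outputs) is trying to resolve.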
Next and Previous
"nextProcess" and "previousProcess" are technically process objects, so you could nest the next and previous processes inside a process, but the examples and the converter tools that ISA provides just put an "@id" field that acts as a pointer/reference to the next and previous processes that are defined elsewhere. This possibility of nesting an object or pointing to it is everywhere in ISA JSON and is my biggest issue with it. There should be 1 central place for storing objects and then use pointers/references to them when needed. If we just take "nextProcess" and "previousProcess" as pointers to other processes/protocols then I like this idea a lot. In the EDS to figure out protocol order you have to first figure out the entity ancestry and then grab the protocols from the entities. But there is still ambiguity because entities can have a list of protocols, so there is no way of knowing the order in which the protocols in the list were executed. (We have previously discussed this and talked about adding a protocol%order field as well as defaulting to the order in the list being the order of the protocols, but that is not currently implemented so I am mentioning it here.) Pointers to the next and previous process/protocol remove ambiguity and make it easier to determine protocol order.
ISA JSON also has a "derivesFrom" attribute on its "samples" and "otherMaterials" objects, which is basically the same thing as the EDS "parent_id", so it is just as easy to determine entity ancestry in ISA JSON.
Drawbacks
The downside to having all of these fields in ISA JSON is that there is not one source of truth, so you have to have extra validation to make sure the JSON is consistent with itself. For example, if you have sample2 that indicates it derives from sample1, but then have a process that has sample2 as the input and sample1 as the output, this is a contradiction, because you cannot create sample1 from sample2 in a process and also say sample2 derives from sample1. This contradiction is impossible in the EDS since we only have 1 mechanism to identify entity-protocol relationships. Perhaps an even bigger issue beyond the validation is what attributes will be required. In the previous example both the entities and protocols had values for their fields that specified entity-protocol relationships, but what if they didn't? Let's say that sample1 did not indicate that it derived from sample2, but you can infer from the inputs and outputs of the protocols that sample1 does derive from sample2. Do you require that entities indicate what they derive from directly or not? This gets a little more complicated if you try to combine both the EDS approach and the ISA JSON approach, specifically if you adopted ISA JSON's fields but also still allowed a "protocol" field on the entities. In this scenario a user could mix and match different ways to indicate entity-protocol relationships within the same JSON, and it becomes a more significant task to determine sample ancestry, protocol order, and validate for consistency.
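A sketch of that cross-check (hypothetical function and data; `contradictions` and the record shapes are invented for illustration): a process says its outputs derive from its inputs, so any sample claiming the reverse is flagged.

```python
# Hypothetical records: sample2's "derivesFrom" contradicts the process below.
samples = {
    "sample1": {"derivesFrom": ["sample2"]},   # agrees with the process
    "sample2": {"derivesFrom": ["sample1"]},   # contradiction
}
processes = [
    {"inputs": ["sample2"], "outputs": ["sample1"]},  # creates sample1 from sample2
]

def contradictions(samples, processes):
    """Flag (parent, child) pairs where the stated parent claims to derive
    from the child that the processes say was created from it."""
    bad = []
    for proc in processes:
        for out in proc["outputs"]:
            for inp in proc["inputs"]:
                # The process implies `out` derives from `inp`; the sample
                # claiming `inp` derives from `out` is the contradiction.
                if out in samples.get(inp, {}).get("derivesFrom", []):
                    bad.append((inp, out))
    return bad
```

This is exactly the extra validation burden that having two sources of truth creates: neither field is wrong on its own, only their combination.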
Possible Improvements
I think I have described everything, but now is it possible to take all of this and come up with something better? I see a few options:
All of these have some issues and would require a significant amount of coding changes. Option 1 is not directly backward compatible with our current implementation, so that would have to be addressed. Option 2 would be complicated, and I can't really foresee the issues until I actually try to code it. Option 3 is similarly complicated, and it's hard to see all of the ramifications without starting to implement it, as well as not being backward compatible. I don't really have a preference for any of these.
It may be apparent from my phrasing in the paragraphs above, but I want to state it explicitly here: in ISA JSON there are no specific "measurements" as they are known in the EDS. In ISA JSON all measurements are relegated to files, which are then simply "outputs". There are no measurement objects like there are in the EDS.
Summary
I think ISA JSON has a better way of handling entity-protocol relationships that we can borrow from to improve the EDS. Specifically, being able to indicate directly what the inputs and outputs of a protocol are, and what the next and previous protocols are, clears up a lot of ambiguity. Unfortunately, there is no easy way to transition the EDS to borrow from ISA JSON, as any changes will require significant work and backward compatibility will be hard to achieve.
I think some questions we need to answer are: