Replies: 3 comments
-
Let me know what you think?
-
I think looking at things from a storytelling perspective might be helpful. It's much easier to look at ISA JSON's process sequence and see the story of the study than it is to look at the EDS entities and see the story. Below are 2 samples from one of the CESB lipid FTMS experiments. Even after having worked with this same data multiple times, I still get confused about things if I try to follow the protocols. The "*-lipid" sample has a "lipid_extraction" protocol. From the sample name I would think that this is the extracted lipids we are talking about, which means the parent is the entity that actually had its lipids extracted. Also, you would never know that the "*-lipid" sample is the one that goes on to be measured by FTMS.
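To make the confusion concrete, here is a minimal, hypothetical sketch of EDS-style entity records (all names invented for illustration; this is not the actual CESB data). Each entity carries a parent and a list of protocols that were applied, and nothing points forward to a measurement:

```python
# Hypothetical EDS-style entity records (names invented for illustration).
# Each entity has a parent_id and the list of protocols applied to it.
entities = {
    "sampleA": {
        "parent_id": "subject1",
        "protocol": ["tissue_quench", "frozen_tissue_grind"],
    },
    "sampleA-lipid": {
        "parent_id": "sampleA",
        "protocol": ["lipid_extraction"],
    },
}

# From these records alone there is no hint that "sampleA-lipid" goes on
# to be measured; no protocol in the lists mentions a measurement.
measured_somewhere = any(
    "measurement" in p for e in entities.values() for p in e["protocol"]
)
```

The point of the sketch is that you have to already know the conventions (does "lipid_extraction" on "sampleA-lipid" mean it was extracted, or that it is the extract?) to read the story out of these records.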
Now compare this to adding "input_of" and "output_of" to the same entities. It's clearer what they were inputs of and outputs of, but it is still a little confusing in my opinion because of the list of inputs. Once you get to "lipid_extraction", it's hard to keep in mind that this thing has already been frozen and ground up before going into lipid_extraction. I can see why ISA doesn't allow multiple protocols like this. You can also see that the "*-lipid" sample is an input to "DI-FTMS1", so you know it is measured.
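The same hypothetical entities with "input_of"/"output_of" fields might look like this (a sketch, names invented except "DI-FTMS1" from the discussion above):

```python
# Hypothetical entity records with explicit "input_of"/"output_of" fields
# (names invented for illustration).
entities = {
    "sampleA": {
        "parent_id": "subject1",
        "input_of": ["tissue_quench", "frozen_tissue_grind", "lipid_extraction"],
        "output_of": ["tissue_quench", "frozen_tissue_grind"],
    },
    "sampleA-lipid": {
        "parent_id": "sampleA",
        "input_of": ["DI-FTMS1"],   # now you can see the extract is measured
        "output_of": ["lipid_extraction"],
    },
}
```

The forward link to the measurement is now visible, but the long "input_of" list on "sampleA" is exactly the part that stays confusing: the list alone doesn't say the sample was already quenched and ground before going into lipid_extraction.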
Now let's compare to looking at things from a protocol point of view. Note I am using ellipses instead of listing all of the samples. I also have not included next and previous protocols, but they are listed in order. It's quite easy to see what steps were done and which samples were involved at each step. It is a little weird that tissue_quench and frozen_tissue_grind have the same inputs and outputs, but we have to do something like that or create a new entity for each protocol. I do see one issue with the measurement protocol. Listing every measurement in the measurement table for outputs seems excessive, and won't be possible for ISA submissions. We could let this be file names, or leave it blank with the understanding that the outputs are the entire measurement table.
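The protocol-centric view might be sketched like this (hypothetical records in execution order; names invented except "DI-FTMS1"):

```python
# Hypothetical protocol-centric records, listed in execution order
# (names invented for illustration).
protocols = [
    {"name": "tissue_quench",       "inputs": ["sampleA"],       "outputs": ["sampleA"]},
    {"name": "frozen_tissue_grind", "inputs": ["sampleA"],       "outputs": ["sampleA"]},
    {"name": "lipid_extraction",    "inputs": ["sampleA"],       "outputs": ["sampleA-lipid"]},
    # Measurement outputs left blank, with the understanding that the
    # outputs are the entire measurement table.
    {"name": "DI-FTMS1",            "inputs": ["sampleA-lipid"], "outputs": []},
]
```

This makes the step order and the samples involved at each step easy to read, at the cost of tissue_quench and frozen_tissue_grind sharing identical inputs and outputs.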
Should we meet to discuss?
-
We met and discussed this. A summary of what was decided:
We would likely need to meet and discuss 2 and 3 again. It wasn't until near the end of the meeting that I understood that the protocol%sequence Hunter was talking about would essentially be a protocol on 1 entity, not the sequence of the whole study/assay like the ISA JSON processSequence. In another meeting we would need to address these ideas separately. Even with the clarification of the "linked" field, I still think one of the 2 options I demonstrated above is a better solution. Hunter did recommend changing "input_of" and "output_of" to better names like "to_protocol" and "from_protocol" or something similar. Since the "linked" field is not actually malleable (what a protocol is linked to is defined by its type), this would only allow the user to make a mistake and would require more validation. It's also not as easy to see and understand without reading some documentation. Seeing "inputs" and "outputs" on a protocol can be implicitly understood, but a "linked" field cannot.
-
Introduction
I have not liked how the EDS stores entity-protocol relationships from the beginning. It has a few too many ambiguities and special rules for my liking. I also don't like ISA JSON in general, and there are issues with how it does entity-protocol relationships as well, but there are parts of it I like much better, and I think we might be able to come up with something better by comparing the 2.
What has really driven me to write this is starting to implement some of the decisions we have discussed and made previously. Specifically, we discussed creating a "storage%measurement" protocol type that we could use to connect entities in the entity table to a measurement protocol, because when creating ISA JSON there will be no measurements table. While this works, it feels like a patch, and I think if we adopt some of the things I am going to discuss we may not need to do it.
ISA JSON Process
ISA JSON is a little tough to describe because it is more complicated than the EDS. I'm not sure whether the complexity is really necessary. If the experiment is more complicated than what we are used to (metabolomics experiments from CESB), or you really want to capture more information than we typically do, then the complexity is nice and maybe even necessary. But for what we are used to and what information we capture, it seems like more trouble than it is worth.
The main object in ISA JSON for entity-protocol relationships is a processSequence object, but I am just going to refer to it as a process because the object isn't actually a sequence of processes; it is a single process. The study object has a processSequence attribute that is an array of these objects, so that is actually a sequence of processes, but each object itself is just a process. I'm not sure why they use that confusing naming convention. An ISA JSON process is similar to an EDS protocol and has a "protocol" attribute that is an ISA JSON protocol, but the process is more than its protocol. An ISA JSON protocol is a little more abstract than an EDS protocol. It is disconnected from any particular samples or measurements, and from time. It merely says: this is the set of steps, components, parameters, etc. that describe this procedure. The process is then used to connect the protocol to a specific time, performer, sample, parameter value(s), component(s), etc. The process encapsulates or wraps around the protocol.
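A minimal sketch of that split, with the abstract protocol defined once and a process wrapping it with the run-specific details (field names follow my reading of the ISA-JSON schema, e.g. "executesProtocol"; all values are invented):

```python
# Abstract protocol: steps/parameters only, no samples, no time.
protocol = {
    "@id": "#protocol/lipid_extraction",
    "name": "lipid_extraction",
    "description": "Abstract procedure; not tied to any sample or date.",
}

# Process: wraps the protocol and pins it to a performer, date, and samples.
process = {
    "@id": "#process/lipid_extraction_run1",
    "executesProtocol": {"@id": "#protocol/lipid_extraction"},
    "performer": "someone",
    "date": "2024-01-01",
    "inputs": [{"@id": "#sample/sampleA"}],
    "outputs": [{"@id": "#material/sampleA-lipid"}],
}
```

The "@id" reference is how the process points back at the protocol it executes; everything run-specific lives on the process, everything reusable on the protocol.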
This encapsulation is part of the added complexity of ISA JSON when compared to the EDS. If you do not need to capture the same protocol being run at different times, or by different performers, or with different components and parameters, then you don't need 2 separate objects; you can simply put all of that information on the protocol directly like we do in the EDS. Most protocols are going to have a 1-to-1 relationship with processes, that is, only 1 process will use the protocol. The big exception is treatment protocols, where you do the same procedure but with, for example, a different concentration of treatment. In the EDS we simply create a separate protocol for each treatment variation, but in ISA JSON it would be 1 protocol used in multiple processes, 1 process for each different treatment. I'm not sure whether treatment protocols alone would be enough to justify the additional nestedness of protocols within a process. You have to weigh having "extra" protocols or repeated information vs the overhead of an additional process object. For us, the number of non-treatment protocols heavily outweighs the treatment protocols, so I think the additional overhead for those is worse than repeating a treatment protocol multiple times.
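The treatment case is the one place the split pays off; a sketch of 1 protocol reused by 2 processes that differ only in a parameter value (all names and values invented):

```python
# One abstract treatment protocol...
treatment_protocol = {"@id": "#protocol/treatment", "name": "treatment"}

# ...reused by two processes, each with its own parameter value.
processes = [
    {
        "executesProtocol": {"@id": "#protocol/treatment"},
        "parameterValues": [{"category": "concentration", "value": 10, "unit": "uM"}],
    },
    {
        "executesProtocol": {"@id": "#protocol/treatment"},
        "parameterValues": [{"category": "concentration", "value": 50, "unit": "uM"}],
    },
]
```

The EDS equivalent would be two nearly identical protocols differing only in the concentration field, which is the repetition-vs-overhead trade-off described above.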
Inputs and Outputs
The parts I like about the process object are the parts that are entirely absent from the EDS. Namely, the process object has attributes for "inputs", "outputs", "nextProcess", and "previousProcess". "inputs" and "outputs" are the subjects, samples, and files that the process/protocol used and created, respectively. In ISA JSON, "inputs" is an array of 4 possible object types: "sources" (similar to subjects), "samples" (similar to samples), "dataFiles" (basically just a filename), and "otherMaterials" (basically a sample). "otherMaterials" is the strangest one. I don't see it as uniquely different from a sample, and the only difference in the object is that it has a "type" attribute that must be either "Extract" or "Labeled Extract". It just seems like a more specific sample object that could have been handled using the "characteristics" attribute in the sample object. "outputs" is the same as "inputs", but it cannot have any "sources".
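The 4 input shapes side by side, as I read them (a sketch with invented values, not a schema-exact rendering):

```python
# The four kinds of objects an ISA JSON "inputs" array can hold
# (invented values; shapes simplified for illustration).
inputs = [
    {"@id": "#source/subject1"},                        # source: similar to a subject
    {"@id": "#sample/sampleA"},                         # sample
    {"@id": "#data/raw_scan.mzML"},                     # dataFile: basically a filename
    {"@id": "#material/extract1", "type": "Extract"},   # otherMaterial: "Extract" or "Labeled Extract"
]

# "outputs" takes the same shapes, minus sources.
outputs = [
    {"@id": "#sample/sampleB"},
    {"@id": "#data/processed.mzML"},
]
```

The otherMaterial's "type" attribute is the only thing separating it from a plain sample, which is why it reads like a sample with one hard-coded characteristic.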
I like "inputs" and "outputs" being on the process/protocol a lot. In the EDS the protocols aren't connected to entities directly (except for the measurement type protocol). Instead, the protocol is put on the entity/measurement object, but not in a consistent way: the protocol is sometimes placed on the input entity and sometimes on the output entity, varying by protocol type. I have created a table to visualize:
This is the part that I never liked. It's confusing and seems somewhat arbitrary. The idea of explicitly listing inputs and outputs makes everything consistent, so I like it. One inherent problem with inputs and outputs, though, is that it implies 1 protocol per entity. If you have a protocol that takes a sample as input and then produces another as output, it doesn't make sense to have another protocol that takes the same input and produces the same output. This also requires you to check for consistency in this regard: no 2 protocols can have the same input and different outputs. I don't like this aspect because I do like that in the EDS we can have multiple sample prep protocols and put them all on 1 entity. This makes it so you don't have to have 1 giant sample prep protocol with many details and steps, or create a "new" sample each time you apply each smaller sample prep protocol. A compromise might be to allow protocols with the exact same inputs and outputs, or protocols with the same set of sample names for its inputs and outputs, with the understanding that this is just to condense the sample ancestry so it isn't polluted with names unnecessarily.
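A sketch of what that consistency check might look like, using the rule as stated (hypothetical function and data; `conflicting_protocols` is invented for illustration):

```python
def conflicting_protocols(protocols):
    """Return pairs of protocol names that share an input but differ in outputs."""
    conflicts = []
    for i, p in enumerate(protocols):
        for q in protocols[i + 1:]:
            shared = set(p["inputs"]) & set(q["inputs"])
            if shared and set(p["outputs"]) != set(q["outputs"]):
                conflicts.append((p["name"], q["name"]))
    return conflicts

protocols = [
    # Same inputs AND same outputs: allowed under the compromise above.
    {"name": "tissue_quench",       "inputs": ["s1"], "outputs": ["s1"]},
    {"name": "frozen_tissue_grind", "inputs": ["s1"], "outputs": ["s1"]},
    # Same input but a different output: flagged by the strict rule.
    {"name": "lipid_extraction",    "inputs": ["s1"], "outputs": ["s1-lipid"]},
]

conflicts = conflicting_protocols(protocols)
```

Note that the strict rule flags lipid_extraction against both prep protocols even though that chain is legitimate, which is exactly the tension the compromise (allowing identical inputs and outputs) is trying to resolve.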
Next and Previous
"nextProcess" and "previousProcess" are technically process objects, so you could nest the next and previous processes inside a process, but the examples and the converter tools that ISA provides just put an "@id" field that acts as a pointer/reference to the next and previous processes that are defined elsewhere. This possibility of nesting an object or pointing to it is everywhere in ISA JSON and is my biggest issue with it. There should be 1 central place for storing objects and then use pointers/references to them when needed. If we just take "nextProcess" and "previousProcess" as pointers to other processes/protocols then I like this idea a lot. In the EDS to figure out protocol order you have to first figure out the entity ancestry and then grab the protocols from the entities. But there is still ambiguity because entities can have a list of protocols, so there is no way of knowing the order in which the protocols in the list were executed. (We have previously discussed this and talked about adding a protocol%order field as well as defaulting to the order in the list being the order of the protocols, but that is not currently implemented so I am mentioning it here.) Pointers to the next and previous process/protocol remove ambiguity and make it easier to determine protocol order.
ISA JSON also has a "derivesFrom" attribute on its "samples" and "otherMaterials" objects, which is basically the same thing as the EDS "parent_id", so it is just as easy to determine entity ancestry in ISA JSON.
Drawbacks
The downside to having all of these fields in ISA JSON is that there is not one source of truth, so you have to have extra validation to make sure the JSON is consistent with itself. For example, if you have sample2 that indicates it derives from sample1, but then have a process that has sample2 as the input and sample1 as the output, this is a contradiction, because you cannot create sample1 from sample2 in a process and also say sample2 derives from sample1. This contradiction is impossible in the EDS since we only have 1 mechanism to identify entity-protocol relationships. Perhaps an even bigger issue beyond the validation is what attributes will be required. In the previous example both the entities and protocols had values for their fields that specified entity-protocol relationships, but what if they didn't? Let's say that sample1 did not indicate that it derived from sample2, but you can infer from the inputs and outputs of the protocols that sample1 does derive from sample2. Do you require that entities indicate what they derive from directly or not? This gets a little more complicated if you try to combine both the EDS approach and the ISA JSON approach, specifically if you adopted ISA JSON's fields but also still allowed a "protocol" field on the entities. In this scenario a user could mix and match different ways to indicate entity-protocol relationships within the same JSON, and it becomes a more significant task to determine sample ancestry, protocol order, and validate for consistency.
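A sketch of that cross-check (hypothetical function and data; `contradictions` and the record shapes are invented for illustration): a process says its outputs derive from its inputs, so any sample claiming the reverse is flagged.

```python
# Hypothetical records: sample2's "derivesFrom" contradicts the process below.
samples = {
    "sample1": {"derivesFrom": ["sample2"]},   # agrees with the process
    "sample2": {"derivesFrom": ["sample1"]},   # contradiction
}
processes = [
    {"inputs": ["sample2"], "outputs": ["sample1"]},  # creates sample1 from sample2
]

def contradictions(samples, processes):
    """Flag (parent, child) pairs where the stated parent claims to derive
    from the child that the processes say was created from it."""
    bad = []
    for proc in processes:
        for out in proc["outputs"]:
            for inp in proc["inputs"]:
                # The process implies `out` derives from `inp`; the sample
                # claiming `inp` derives from `out` is the contradiction.
                if out in samples.get(inp, {}).get("derivesFrom", []):
                    bad.append((inp, out))
    return bad
```

This is exactly the extra validation burden that having two sources of truth creates: neither field is wrong on its own, only their combination.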
Possible Improvements
I think I have described everything, but now is it possible to take all of this and come up with something better? I see a few options:
All of these have some issues and would require a significant amount of coding changes. Option 1 is not directly backward compatible with our current implementation, so that would have to be addressed. Option 2 would be complicated, and I can't really foresee the issues until I actually try to code it. Option 3 is similarly complicated, and it's hard to see all of the ramifications without starting to implement it, as well as not being backward compatible. I don't really have a preference for any of these.
It may be apparent from my phrasing in the paragraphs above, but I want to state it explicitly here: in ISA JSON there are no specific "measurements" as they are known in the EDS. In ISA JSON all measurements are relegated to files, which are then simply "outputs". There are no measurement objects like there are in the EDS.
Summary
I think ISA JSON has a better way of handling entity-protocol relationships that we can borrow from to improve the EDS. Specifically, being able to indicate directly what the inputs and outputs of a protocol are, and what the next and previous protocols are, clears up a lot of ambiguity. Unfortunately, there is no easy way to transition the EDS to borrow from ISA JSON, as any changes will require significant work and backward compatibility will be hard to achieve.
I think some questions we need to answer are: