What is the reasoning for embedding json quoted inside json in strategus specs? #80
Comments
What is the benefit of quoting json inside json? |
This is something I did back at the origins of Atlas/WebAPI, all the way back when CIRCE was a stand-alone application. We did not want to make any assumptions or restrictions about the object model that was embedded in the definition. So, we encode the expression as a String and leave it to the consumers to extract the content. If you look at the CohortDefinition type in WebAPI, you can see there are two elements to it: the ExpressionType and the CohortDefinitionDetails (which simply wraps the expression as a String; this was split for database concerns). So, the ExpressionType would inform what is in the String field, and we planned on three options:
So, the first two (simple_expression or custom_sql) would have been placed into the 'cohort expression' field as a string; in one case it's interpreted as json, in the other it would be interpreted as sql (based on EXPRESSION_TYPE). In the 2015-2016 time frame, there was a push to establish a standard way to define cohort definitions, meaning that we'd develop CIRCE, provide necessary features (ie: improve the object model, 'print friendly', etc.) and encourage people to use the tools to define cohorts so that they could follow standard cohort-building semantics (as opposed to everyone writing their own custom sql). So, that led to basically never using the other ExpressionTypes, but the object model that persisted the json expression as a string lives on. It also gives us some flexibility: we can store any json structure we want in it, and it can change over time without needing to have an associated object model defined for each one of those things. Of course, this has trade-offs which I'm sure you'll point out, but it has served its purpose pretty well up to this point. The question I would ask you is: what is the problem you'd try to solve if you made the object model of a cohort definition contain a fully-qualified object graph as the expression vs. storing it as a String? |
Thanks Chris! I think I followed what you're saying but I'm probably missing some of the context as I'm still a new developer, so please feel free to correct me. My understanding is that the quoting of the cohort definitions gives you extra flexibility in that you can pass in circe json or SQL code for custom cohorts. I can see the benefit of that. I guess I'm wondering if that is the plan with Strategus. Do we expect that custom sql should be supported in the cohort definition fields?
I'm not sure I fully understand the question but I'll try to answer. The problem I want to solve is to parse and validate strategus json files. By this I mean I want to parse the file into an object and check that it conforms to some specification. By doing this I should have greater confidence that the json file I create will indeed work and be supported by the Strategus program/software. So I'm pushing for a clear, unambiguous definition of the strategus json interface that supports programmatic validation.

By allowing any character string in the cohort definition fields we kind of create the need for a second level of parsing and validation, because now I need to not only parse and check the study json but also parse and validate each cohort string. Something about this quoted json in json seems off to me. If you want the option to provide SQL, can't you just have a SQL field in the json nested under the cohort definition, which would make the intention of the character string more clear to the receiver (i.e. strategus)? (There's also the problem that we currently don't have a way to programmatically validate OHDSI-SQL, so if Strategus json did include sql we wouldn't know if it is correct syntax until we actually try to run it on all dbms.)

A compiler is useful in part because it can catch a lot of errors before runtime. Similarly, I think a program that validates strategus json files would be useful in catching errors before sending the analysis specification across a network of data nodes for execution. (OHDSI network study runtime is a fairly expensive type of runtime, at least more than your typical program execution.) Also it will help us identify bugs in strategus: if a json passes validation but fails in execution then there is a bug in either strategus or the validator. |
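To make the "second level of parsing" concrete, here is a rough sketch of what a validator ends up doing when the cohort expression travels as a quoted string (the field names are made up for illustration, not taken from the actual Strategus spec):

```r
library(jsonlite)

# Hypothetical spec fragment: the cohort expression is itself json, but stored
# as an escaped string inside the outer json document.
spec <- '{
  "cohortDefinitions": [
    {"cohortId": 1, "cohortDefinition": "{\\"ConceptSets\\":[],\\"PrimaryCriteria\\":{}}"}
  ]
}'

# Pass 1: parse and check the outer Strategus-level structure.
parsed <- fromJSON(spec, simplifyVector = FALSE)

# Pass 2: each cohortDefinition field is still a string and needs its own
# parse (and, ideally, its own validation against a circe schema).
expressions <- lapply(parsed$cohortDefinitions,
                      function(x) fromJSON(x$cohortDefinition, simplifyVector = FALSE))
```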
Right, that's the trade-off I mentioned earlier that I predicted you would point out :). You can validate the current Strategus json; it's just that the expression field is a string, not a complex object. Yes, you need to take the additional step of extracting that string and parsing it again to do a second-order validation, but this raises the question: against which version of an expression schema are you going to validate? Storing it as a string means anything can fit in there, so you don't have to change that schema each time the cohort expression schema changes. However, it means that you need to do this second-level parsing against the correct version of the schema you are validating against. The alternative is that you put the json object directly encoded into the root json (ie: not a string), but then that sub-graph has its own schema that you need to validate, and as that schema evolves, you will have to account for the different versions. So, getting back to your concern: sure, you could take it out of a string and put it into a fully-defined object model. Then the question becomes how do you manage the different versions of the object model over time? You sorta need to solve that problem if you're going to encode the actual object model of the cohort expression into the JSON object. The reason why I originally proposed serializing as a string was to deal with this particular problem (or avoid it, if you prefer). It puts the onus on the client to extract the content and parse/validate it themselves against the intended target deserialization, but it removes a lot of complexity from the object model. |
Well, since a major purpose of Strategus is to fix the versions of all R package dependencies, wouldn't it make sense that Strategus would also fix the version of the expression schema? (This is in Circe, right?) By specifying an exact version of CirceR, isn't Strategus also fixing the expression schema version as well?
But "anything" could include junk that has no hope of executing successfully. With a Strategus json validation function these errors could be caught before the code goes out for execution. Failed executions do have a cost of data node fatigue particularly for volunteers who don't have a ton of time to devote. My strong opinion is that Strategus needs a validation function. This function takes a json file as input and output TRUE if the json should work Strategus and FALSE otherwise. Very simple but quite useful. Without this function we have no idea if a study specification will work before actually running it. Worse, it may work on some dbms but not others. I think the benefits of input validation significantly outweigh the benefits of flexibility your bringing up. But I realize this is a tradeoff and there is no true correct answer here. What about versions of strategus json and the validation function (you might ask)? Well Strategus is an R package that has releases. The validation function would be in that package and it would return TRUE if the json conforms to the specification for that version of Strategus. - It's very simple, just validate the program inputs before execution/runtime and give study designers the ability to validate their json before sending it to Strategus for execution. Specifically check the json against a schema. The quoted strings that are actually objects that have an implicit structure seem to me to run counter to this goal. Where do others fall on this design tradeoff? |
@chrisknoll, how much has the expression schema changed over the years? Have there been breaking changes where, if we encode the most recent version of the expression schema, it will not work with older cohort expressions? |
I might have missed it but I'm not sure where we lost the ability to validate the JSON specification... albeit in 2 passes (a strategus level and then a cohort expression level). I thought you mentioned that it would have to be done in 2 passes (first to make sure it's a strategus spec, and second to make sure that the current version of the cohort expression processor will work with the embedded expression). Can't this be done before execution?
I think you're raising a different issue that goes beyond simple strategus json validation?
It has changed a lot over the course of 7 years: once upon a time, there was no notion of exit strategy, collapse days, cohort censor events, etc. But I've been very careful that new things do not break backwards compatibility; you will find that other aspects of OHDSI have not been this careful. So maybe it's an outlier to point at circe and say this would never be an issue because we've been very good about backwards compatibility. But eventually circe will be moving to a 2.x semver, and there will be stuff dropped that is no longer supported (or new structures introduced that are non-compatible), and so there will be a day when it will break. But when that happens, and the expressions are in string form, strategus doesn't need to keep its schema version in sync with whatever changes are going on in circe, plp, ple, cohort method, incidence rate analysis, etc. (or maybe it does, for those other elements if they are not string encoded). This is probably a better convo to have over on a HADES meeting, as I'm thinking it would be more effective. |
Great, thanks Chris. Yes, let's please discuss in the next Hades meeting before this approach is considered "finalized". In the meantime (assuming I have time and it isn't too hard) I'll try to implement the validation function that I want to exist. I guess it's also tricky because each module should probably define its own schema for inputs. I'll see how it goes. I get your point that we can still validate by parsing the json inside the json, but then it can't be done with a generic, language-agnostic json schema validator, since you need to have a loop inside the validator to parse and validate each cohort field. And I guess it would be helpful to decide whether we think that the cohort field will contain SQL or something other than circe expressions. |
@ablack3 @chrisknoll thanks for all of the discussion on this topic. I'd like to focus on this point that Adam raised:
The settings that go into a Strategus analysis specification are driven by the modules that are in the specification. Leaving a link to the Creating Analysis Specifications Vignette for more details for readers that may be interested. So in my mind, any type of input validation would require some code in the module that Strategus could call to verify that the module could accept the settings.

Also want to highlight another important point that Adam raised:
The failed executions absolutely cause fatigue - in my mind, this is where we need to apply more unit tests to both Strategus and the modules. Strategus modules currently have a unit test that creates a sample […].

As for Strategus, we need more tests for its own […].

Happy to discuss on the next HADES call and happy to continue to share thoughts here. |
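One concrete shape such a test could take is a simple json round-trip check; this is only a minimal sketch with a placeholder settings string rather than a real module's settings:

```r
testthat::test_that("module settings survive a json round trip", {
  # Placeholder settings string standing in for a module's real settings object.
  settings <- '{"analysisId":1,"covariateIds":[1001,1002]}'
  roundTrip <- jsonlite::toJSON(jsonlite::fromJSON(settings, simplifyVector = FALSE),
                                auto_unbox = TRUE)
  testthat::expect_equal(as.character(roundTrip), settings)
})
```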
Thanks for the comments and links @anthonysena.
So modules have the responsibility to be able to validate their input. Maybe there is a generic validate function that dispatches to each module. So the strategus validation function is something like:
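(A rough sketch only; the function names and field names here are hypothetical, not existing Strategus code.)

```r
# Each module contributes its own validator; the top-level function just
# dispatches to the right one and combines the results.
validateSpecification <- function(analysisSpecification, moduleValidators) {
  ok <- vapply(analysisSpecification$moduleSpecifications, function(moduleSpec) {
    validator <- moduleValidators[[moduleSpec$module]]
    isTRUE(validator(moduleSpec$settings))
  }, logical(1))
  all(ok)
}

# Hypothetical usage, where each module ships its own validator function:
# validateSpecification(spec, list(CohortGeneratorModule = validateCohortGeneratorSettings))
```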
One thought... Wouldn't the cohort definitions be considered inputs to the cohort generator module and treated like any other module inputs? |
Another reason for encoding as a string: you avoid bad serialization problems that you get using jsonlite. |
I noticed this "quoted json inside json" pattern in WebAPI as well: a json document serialized as an escaped string inside another json document, instead of the json object simply being nested directly. The latter seems nicer to me since you avoid all the quote escaping, but I don't really know. |
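Roughly, the two shapes being compared (with made-up field names, not the actual WebAPI payload):

```r
# Quoted json inside json: the inner expression is an escaped string.
cat('{"id": 1, "expression": "{\\"ConceptSets\\":[],\\"PrimaryCriteria\\":{}}"}')

# Nested json: the inner expression is part of the object graph, no escaping needed.
cat('{"id": 1, "expression": {"ConceptSets": [], "PrimaryCriteria": {}}}')
```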
But if you use jsonlite to serialize you still have to deal with any issues in the unquoted fields of the json object right? I'm not sure how quoting one input (cohort definitions or treemap data) really gets around the serialization issues. |
Right, if you're only using R, it's inescapable (pun!) to run into those issues. However, the context I was thinking of was that the json is parsed and serialized in the Java space and it just "moves between" the R space as a string. |
I also don't understand the problem with jsonlite... seems like you can get the input to match the output with the right arguments:

```r
input <- "{\"x\":1,\"y\":[1]}"
x0 <- jsonlite::fromJSON(input, simplifyVector = F)
output <- jsonlite::toJSON(x0, auto_unbox = T)
cat(input)
#> {"x":1,"y":[1]}
cat(output)
#> {"x":1,"y":[1]}
input == output
#> [1] TRUE
```

```r
input <- "{\"x\":1,\"y\":[1,2],\"z\":[[3],[4]]}"
x0 <- jsonlite::fromJSON(input, simplifyVector = F)
output <- jsonlite::toJSON(x0, auto_unbox = T)
cat(input)
#> {"x":1,"y":[1,2],"z":[[3],[4]]}
cat(output)
#> {"x":1,"y":[1,2],"z":[[3],[4]]}
input == output
#> [1] TRUE
```

Created on 2023-09-13 with reprex v2.0.2 |
Yes, so I thought in the other thread referenced (where you replied with examples) there was a case where it wasn't working, but it might have been a misuse on my part of simplifyVector and auto_unbox. But, to get the correct output, I think it's worth trying a branch on ParallelLogger where, when we save settings to json, we use auto_unbox=TRUE. And I did a test where I didn't set simplifyVector, and it still worked. So... strange. But I definitely saw the issue in serializing the strategus design: integer fields that are arrays were being converted to just integers. But I think your examples show that it can be made to work. |
Here is another change that jsonlite makes when serializing and deserializing. Floats are rounded. Could be a problem when passing Strategus parameters.

```r
(input <- '{"fileTreeProcessingTime":310.4735880413999999999}')
#> [1] "{\"fileTreeProcessingTime\":310.4735880413999999999}"
(output <- jsonlite::fromJSON(input, simplifyVector = F, digits = 9) |>
  jsonlite::toJSON(auto_unbox = T, digits = NA))
#> {"fileTreeProcessingTime":310.4735880414}
input == output
#> [1] FALSE
```

digits = NA implies using max precision, which is not a problem if the data is quoted:

```r
(input <- '{"fileTreeProcessingTime":"310.4735880413999999999"}')
#> [1] "{\"fileTreeProcessingTime\":\"310.4735880413999999999\"}"
(output <- jsonlite::fromJSON(input, simplifyVector = F) |>
  jsonlite::toJSON(auto_unbox = T))
#> {"fileTreeProcessingTime":"310.4735880413999999999"}
input == output
#> [1] TRUE
```

Created on 2023-09-15 with reprex v2.0.2 |
But if the data is quoted then the datatype isn't float or double, it is string, which would be a problem if you loaded it into a strongly typed language. I'm not sure rounding of floats is a problem, but losing precision on ints would be a problem (such as large concept IDs or covariate ids). |
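As a point of reference, jsonlite parses json numbers into doubles, so whole numbers round-trip exactly up to 2^53 and silently lose precision above that. A quick illustration with a hypothetical covariateId just past that limit:

```r
# 2^53 + 1 cannot be stored exactly as a double, so the last digit is lost on parse.
parsed <- jsonlite::fromJSON('{"covariateId": 9007199254740993}')
sprintf("%.0f", parsed$covariateId)
#> [1] "9007199254740992"
```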
To piggyback on this topic, my preference would be to support multiple json documents by reference, rather than embedded strings or sub-documents. The reference could be relative to a location on disk, and a json schema (if we get to formally defining one) could specify that a given field can be either a json object of a given type or a reference to a file with a path relative to the project root. The reason for this is that we potentially have the use case where the cohort definitions can go into the hundreds or thousands, and storing all of this in a single document makes debugging and maintenance challenging. |
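Something like the following shape is what I have in mind (the field names are only illustrative, not a schema proposal):

```r
# Hypothetical spec fragment: cohort definitions referenced by path relative to the project root.
spec <- '{
  "cohortDefinitions": [
    {"cohortId": 1, "cohortFile": "cohorts/target.json"},
    {"cohortId": 2, "cohortFile": "cohorts/outcome.json"}
  ]
}'
refs <- jsonlite::fromJSON(spec, simplifyVector = FALSE)$cohortDefinitions

# Each referenced file could then be read (and validated) on its own, e.g.:
# expressions <- lapply(refs, function(r) jsonlite::read_json(file.path(projectRoot, r$cohortFile)))
```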
I agree with @azimov. Preference would be for the […].

Upon further thought, I think I want to go back on something I said on the HADES call. The […]. A master validator function is nice but I would not suggest that as a priority considering the complexity of this package and its modules. |
I'm wondering why the choice was made to embed quoted json inside json. Wouldn't it be better to simply nest the json without creating a quoted context? My thinking is that when you include quoted json inside json, a json parser needs to parse, unquote, and parse again. If you nest the json then the parser just needs to parse the whole thing once.
What's the reasoning here?