
What is the reasoning for embedding json quoted inside json in strategus specs? #80

Open
ablack3 opened this issue Jun 14, 2023 · 21 comments



ablack3 commented Jun 14, 2023

I'm wondering why the choice was made to embed quoted json inside json. Wouldn't it be better to simply nest the json without creating a quoted context? My thinking is that when you include quoted json inside json, a json parser needs to parse, unquote, and parse again. If you nest the json, the parser just needs to parse the whole thing once.

What's the reasoning here?
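To make the distinction concrete, here are the two styles side by side (the field names are just illustrative):

```json
{
  "quotedStyle": { "expression": "{\"ConceptSets\":[]}" },
  "nestedStyle": { "expression": { "ConceptSets": [] } }
}
```

In the quoted style, a consumer has to extract the string, unescape it, and parse it a second time; in the nested style, a single parse yields the whole object graph.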


ablack3 commented Aug 18, 2023

@anthonysena @chrisknoll

What is the benefit of quoting json inside json?


chrisknoll commented Aug 19, 2023

This is something I did going back to the origins of Atlas/WebAPI, all the way back when CIRCE was a stand-alone application. We did not want to make any assumptions or restrictions about the object model that was embedded in the definition, so we encode the expression as a String and leave it to the consumers to extract the content. If you look at the CohortDefinition type in WebAPI, you can see there are two elements to it: the ExpressionType and the CohortDefinitionDetails (which simply wraps the expression as a String; this was split for database concerns).

So, the ExpressionType would inform what is in the String field, and we planned on 3 options:

  • SIMPLE_EXPRESSION: this is basically CIRCE json, but CIRCE wasn't really well formed at that point, so if I had to do it over, I would have made this CIRCE_EXPRESSION
  • CUSTOM_SQL: would let you write your own SQL that would build your cohort! Neat! (but it never made it as an option in the UI)
  • EXTERNAL_SOURCED: this would be the case where you populate your results.cohort table yourself but want a cohort ID from WebAPI, so this type of cohort was intended as a 'stub' of some externally generated cohort that could still be used in other analytics in Atlas like characterization, pathways, or incidence rates. Again, never really used in Atlas.

So, the first two (SIMPLE_EXPRESSION or CUSTOM_SQL) would have been placed into the 'cohort expression' field as a string; in one case it's interpreted as json, in the other as sql (based on EXPRESSION_TYPE). In the 2015-2016 time frame, there was a push to establish a standard way to define cohort definitions, meaning that we'd develop CIRCE, provide the necessary features (ie: improve the object model, 'print friendly' output, etc.), and encourage people to use the tools to define cohorts so that they could follow standard cohort-building semantics (as opposed to everyone writing their own custom sql). So that led to basically never using the two other expression types, but the object model that persisted the json expression as a string lives on.

It also gives us some flexibility: we can store any json structure we want in it, and it can change over time without needing an associated object model defined for each one of those things. Of course, this has trade-offs which I'm sure you'll point out, but it has served its purpose pretty well up to this point.

The question I would ask you is: what problem are you trying to solve by making the object model of a cohort definition contain a fully-qualified object graph as the expression vs. storing it as a String?


ablack3 commented Aug 19, 2023

Thanks Chris!

I think I followed what you're saying, but I'm probably missing some of the context as I'm still a new developer, so please feel free to correct me. My understanding is that quoting the cohort definitions gives you extra flexibility in that you can pass in circe json or SQL code for custom cohorts. I can see the benefit of that.

I guess I'm wondering if that is the plan with Strategus. Do we expect that custom sql should be supported in the cohort definition fields?

The question I would ask you is: what problem are you trying to solve by making the object model of a cohort definition contain a fully-qualified object graph as the expression vs. storing it as a String?

I'm not sure I fully understand the question but I'll try to answer. The problem I want to solve is to parse and validate strategus json files. By this I mean I want to parse the file into an object and check that it conforms to some specification. By doing this I should have greater confidence that the json file I create will indeed work and be supported by the Strategus program/software.

So I'm pushing for a clear unambiguous definition of the strategus json interface that supports programmatic validation.

By allowing any character string in the cohort definition fields we kind of create the need for a second level of parsing and validation. Because now I need to not only parse and check the study json but also then parse and validate each cohort string.

Something about this quoted json in json seems off to me. If you want the option to provide SQL, couldn't you just have a SQL field in the json nested under the cohort definition, which would make the intention of the character string clearer to the receiver (i.e. strategus)?

(There's also the problem that we currently don't have a way to programmatically validate OHDSI-SQL, so if Strategus json did include sql, we wouldn't know whether its syntax is correct until we actually try to run it on all dbms.)

A compiler is useful in part because it can catch a lot of errors before runtime. Similarly, I think a program that validates strategus json files would be useful in catching errors before sending the analysis specification across a network of data nodes for execution. (OHDSI network study runtime is a fairly expensive kind of runtime, at least more than your typical program execution.) It will also help us identify bugs in strategus: if a json passes validation but fails in execution, then there is a bug in either strategus or the validator.


chrisknoll commented Aug 21, 2023

I'm not sure I fully understand the question but I'll try to answer. The problem I want to solve is to parse and validate strategus json files. By this I mean I want to parse the file into an object and check that it conforms to some specification. By doing this I should have greater confidence that the json file I create will indeed work and be supported by the Strategus program/software.

Right, that's the trade-off I mentioned earlier that I predicted you would point out :). You can validate the current Strategus json; it's just that the expression field is a string, not a complex object. Yes, you need to take the additional step of extracting that string and parsing it again to do a second-order validation, but this raises the question: against what version of an expression schema are you going to validate? Storing it as a string means anything can fit in there, meaning you don't change that schema on each cohort expression change. However, it means that you need to do this second-level parsing against the correct version of the schema you are validating against. The alternative is to put the json object directly encoded into the root json (ie: not a string), but then that sub-graph has its own schema that you need to validate, and as that schema evolves, you will have to account for the different versions.

So, getting back to your concern: sure, you could take it out of a string and put it into a fully-defined object model. Then the question becomes how you manage the different versions of the object model over time. You sort of need to solve that problem if you're going to encode the actual object model of the cohort expression into the JSON object. The reason I originally proposed serializing as a string was to deal with this particular problem (or avoid it, if you prefer). It puts the onus on the client to extract the content and parse/validate it themselves against the intended target deserialization, but it removes a lot of complexity from the object model.
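As a concrete sketch of the two-pass approach described above (the field names here are illustrative, not the exact WebAPI ones):

```r
library(jsonlite)

spec <- '{"expressionType": "SIMPLE_EXPRESSION",
          "expression": "{\\"ConceptSets\\":[]}"}'

# pass 1: parse the outer document; "expression" comes back as a string
outer <- fromJSON(spec, simplifyVector = FALSE)

# pass 2: re-parse the embedded string against whatever schema
# version the consumer expects
inner <- fromJSON(outer$expression, simplifyVector = FALSE)
str(inner$ConceptSets)  # an empty list
```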


ablack3 commented Aug 21, 2023

against what version of an expression schema are you going to validate with?

Well since a major purpose of Strategus is to fix the versions of all R package dependencies wouldn't it make sense that Strategus would also fix the version of the expression schema? (This is in Circe right?)

By specifying an exact version of CirceR isn't Strategus also fixing the expression schema version as well?

Storing as a string means anything can fit in there meaning you don't change that schema on each version of a cohort expression change.

But "anything" could include junk that has no hope of executing successfully. With a Strategus json validation function these errors could be caught before the code goes out for execution. Failed executions do have a cost of data node fatigue particularly for volunteers who don't have a ton of time to devote.

My strong opinion is that Strategus needs a validation function. This function takes a json file as input and outputs TRUE if the json should work with Strategus and FALSE otherwise. Very simple but quite useful. Without this function we have no idea if a study specification will work before actually running it. Worse, it may work on some dbms but not others. I think the benefits of input validation significantly outweigh the benefits of the flexibility you're bringing up. But I realize this is a tradeoff and there is no true correct answer here.

What about versions of strategus json and the validation function (you might ask)? Well, Strategus is an R package that has releases. The validation function would be in that package, and it would return TRUE if the json conforms to the specification for that version of Strategus. It's very simple: validate the program inputs before execution/runtime and give study designers the ability to validate their json before sending it to Strategus for execution, specifically by checking the json against a schema. The quoted strings that are actually objects with an implicit structure seem to me to run counter to this goal.
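A minimal sketch of the kind of function I have in mind, using the jsonvalidate R package; the schema here is a toy stand-in, not an actual Strategus schema:

```r
library(jsonvalidate)

# toy schema: a spec must be an object with a moduleSpecifications array
schema <- '{
  "type": "object",
  "required": ["moduleSpecifications"],
  "properties": {
    "moduleSpecifications": { "type": "array" }
  }
}'

validateSpec <- function(json) {
  jsonvalidate::json_validate(json, schema, engine = "ajv")
}

validateSpec('{"moduleSpecifications": []}')  # TRUE
validateSpec('{"sharedResources": []}')       # FALSE (required field missing)
```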

Where do others fall on this design tradeoff?
@anthonysena @schuemie @jreps ?


ablack3 commented Aug 21, 2023

@chrisknoll, how much has the expression schema changed over the years? Have there been breaking changes where, if we encode the most recent version of the expression schema, it will not work with older cohort expressions?
My impression is that there have been few or no breaking changes to the cohort expression json schema, but I'm not sure.

@chrisknoll

But "anything" could include junk that has no hope of executing successfully. With a Strategus json validation function these errors could be caught before the code goes out for execution. Failed executions do have a cost of data node fatigue particularly for volunteers who don't have a ton of time to devote.

I might have missed it, but I'm not sure where we lost the ability to validate the JSON specification, albeit in 2 passes (a strategus level and then a cohort expression level). I thought you mentioned that it would have to be done in 2 passes (first to make sure it's a strategus spec, and second to make sure that the current version of the cohort expression processor will work with the embedded expression). Can't this be done before execution?

Without this function we have no idea if a study specification will work before actually running it. Worse, it may work on some dbms but not others.

I think you're raising a different issue that goes beyond simple strategus json validation?

@chrisknoll, how much has the expression schema changed over the years? Have there been breaking changes where, if we encode the most recent version of the expression schema, it will not work with older cohort expressions?
My impression is that there have been few or no breaking changes to the cohort expression json schema, but I'm not sure.

It has changed a lot over the course of 7 years: once upon a time, there was no notion of exit strategy, collapse days, cohort censor events, etc. But I've been very careful that new things do not break backwards compatibility, though you will find that other aspects of OHDSI have not been this careful. So maybe circe is an outlier, and pointing at it to say this would never be an issue only works because we've been very good about backwards compatibility. Eventually circe will move to a 2.x semver, and there will be stuff dropped that is no longer supported (or new structures introduced that are not compatible), so there will be a day when it breaks. But when that happens, and the expressions are in string form, strategus doesn't need to keep its schema version in sync with whatever changes are going on in circe, plp, ple, cohort method, incidence rate analysis, etc. (or maybe it does, for those elements that are not string encoded).

This is probably a better convo to have over on a HADES meeting, as I'm thinking it would be more effective.


ablack3 commented Aug 21, 2023

Great, thanks Chris. Yes, let's please discuss at the next HADES meeting before this approach is considered "finalized".

In the meantime (assuming I have time and it isn't too hard) I'll try to implement the validation function that I want to exist. I guess it's also tricky because each module should probably define its own schema for inputs. I'll see how it goes. I get your point that we can still validate by parsing the json inside the json, but then it can't be done with a generic, language-agnostic json schema validator, since you need a loop inside the validator to parse and validate each cohort field. And I guess it would be helpful to decide whether we think the cohort field will contain SQL or anything other than circe expressions.

@anthonysena

@ablack3 @chrisknoll thanks for all of the discussion on this topic. I'd like to focus on this point that Adam raised:

My strong opinion is that Strategus needs a validation function. This function takes a json file as input and output TRUE if the json should work Strategus and FALSE otherwise. Very simple but quite useful. Without this function we have no idea if a study specification will work before actually running it. Worse, it may work on some dbms but not others. I think the benefits of input validation significantly outweigh the benefits of flexibility your bringing up. But I realize this is a tradeoff and there is no true correct answer here.

The settings that go into a Strategus analysis specification are driven by the modules that are in the specification. I'm leaving a link to the Creating Analysis Specifications vignette for readers that may be interested in more details. So in my mind, any type of input validation would require some code in the module that Strategus could call to verify that the module can accept the jobContext that we will pass into the execute function. I support this type of extension to the modules, and I think it links back to #9.

Also want to highlight another important point that Adam raised:

But "anything" could include junk that has no hope of executing successfully. With a Strategus json validation function these errors could be caught before the code goes out for execution. Failed executions do have a cost of data node fatigue particularly for volunteers who don't have a ton of time to devote.

The failed executions absolutely cause fatigue - in my mind, this is where we need to apply more unit tests to both Strategus and the modules. Strategus modules currently have a unit test that creates a sample jobContext and aims to test the execute method, so we know that the JSON specification will work with the module. Here is an example unit test for CohortGenerator for reference. Additionally, the GitHub Action workflow for the modules tests the renv restore process to ensure that R can restore the dependencies in the renv.lock file.

These have fallen into some disrepair, and I'm actively working to resolve the issues. As an example, the CohortGenerator module has a pre-release that is now using the HADES-wide lock file (ref). My feeling is that these are good safeguards that the module should work for JSON settings created with that version of the module. If we add a validation function, we could add unit tests for it as well, to ensure that we have a mechanism to test settings for a module and also that it will run when calling its execute function.

Currently, the mechanism for adding the circe JSON expression to the Strategus analysis specification is here. The assumption made is that we're embedding a circe JSON expression, since that's largely what we're using in OHDSI. I'm pointing this out to note that you could validate the circe expression, since you'd expect that the version of the module defined how to construct it. Currently this is JSON in a string, and we could revise it to remove it from the string and just have the JSON block. In reading this thread, though, we'd probably want to revise how we're defining that expression to allow flexibility for defining custom SQL (as an example).

As for Strategus, we need more tests for its own execute function. The current unit test is also in disrepair, and I'm working to update it so that we can run it across all of the RDBMS that we currently support via GitHub Actions. I've been debating whether Strategus should test all HADES modules in a unit test, but in my mind this feels a bit redundant since the HADES modules run their own tests (as described above). That said, Strategus does have its own elements that require testing, in terms of obtaining CDM metadata, etc. To test this, I'm working on a very simple test module for unit tests that connects to the CDM and issues a simple query, to ensure that there are no problems with the way we're passing the connectionDetails (for example). This could be expanded as needed if we need to test other core functions of Strategus and the way it passes information to the modules.

Happy to discuss on the next HADES call and happy to continue to share thoughts here.


ablack3 commented Aug 28, 2023

Thanks for the comments and links @anthonysena.

any type of input validation would require some code in the module that Strategus could call to verify that the module could accept the jobContext that we will pass into the execute function.

So modules have the responsibility to be able to validate their input. Maybe there is a generic validate function that dispatches to each module, so the strategus validation function is something like

validate <- function(specification) {
    # validate the overall strategus structure first
    validateStructure(specification)
    # then, for each module, hand off checking to that module's own validator
    for (module in specification$moduleSpecifications) {
        validateModule(module)  # hypothetical per-module dispatch
    }
}

One thought... Wouldn't the cohort definitions be considered inputs to the cohort generator module and treated like any other module inputs?

@chrisknoll

Another reason for encoding as a string: you avoid the bad serialization problems that you get using jsonlite.


ablack3 commented Sep 13, 2023

I noticed this "quoted json inside json" pattern in WebAPI as well.
This endpoint http://webapidoc.ohdsi.org/index.html#-549285132 returns

{
  "summary": {
    "baseCount": 5234,
    "finalCount": 3181,
    "lostCount": 0,
    "percentMatched": "60.78%"
  },
  "inclusionRuleStats": [
    {
      "id": 0,
      "name": "age",
      "percentExcluded": "39.22%",
      "percentSatisfying": "60.78%",
      "countSatisfying": 3181
    }
  ],
  "treemapData": "{\"name\" : \"Everyone\", \"children\" : [{\"name\" : \"Group 1\", \"children\" : [{\"name\": \"1\", \"size\": 3181},{\"name\" : \"Group 0\", \"children\" : [{\"name\": \"0\", \"size\": 2053}]}]}]}"
}

instead of

{
  "summary": {
    "baseCount": 5234,
    "finalCount": 3181,
    "lostCount": 0,
    "percentMatched": "60.78%"
  },
  "inclusionRuleStats": [
    {
      "id": 0,
      "name": "age",
      "percentExcluded": "39.22%",
      "percentSatisfying": "60.78%",
      "countSatisfying": 3181
    }
  ],
  "treemapData": {
    "name": "Everyone",
    "children": [
      {
        "name": "Group 1",
        "children": [
          {
            "name": "1",
            "size": 3181,
            "children": {}
          },
          {
            "name": "Group 0",
            "children": [
              {
                "name": "0",
                "size": 2053
              }
            ]
          }
        ]
      }
    ]
  }
}


The latter seems nicer to me since you avoid all the quote escaping, but I don't really know.


ablack3 commented Sep 13, 2023

Another reason that enociding as a string: you avoid bad serialization problems that you get using jsonlite.

But if you use jsonlite to serialize you still have to deal with any issues in the unquoted fields of the json object right? I'm not sure how quoting one input (cohort definitions or treemap data) really gets around the serialization issues.

@chrisknoll

Right, if you're only using R, it's inescapable (pun!) that you'll run into those issues. However, the context I was thinking of is where the json is parsed and serialized in the Java space, and it just "moves through" the R space as a string.


ablack3 commented Sep 14, 2023

I also don't understand the problem with jsonlite... it seems like you can get the input to match the output with the simplifyVector and auto_unbox arguments, as in the example below. I might be missing something though. Is there some json for which the example below would not return the original json unchanged (ignoring whitespace differences)? Also, I see that several OHDSI/HADES packages use RJSONIO (https://github.com/search?q=org%3AOHDSI+RJSONIO&type=code).

input <- "{\"x\":1,\"y\":[1]}"
x0 <- jsonlite::fromJSON(input, simplifyVector = F)
output <- jsonlite::toJSON(x0, auto_unbox = T)

cat(input)
#> {"x":1,"y":[1]}
cat(output)
#> {"x":1,"y":[1]}

input == output
#> [1] TRUE
input <- "{\"x\":1,\"y\":[1,2],\"z\":[[3],[4]]}"
x0 <- jsonlite::fromJSON(input, simplifyVector = F)
output <- jsonlite::toJSON(x0, auto_unbox = T)

cat(input)
#> {"x":1,"y":[1,2],"z":[[3],[4]]}
cat(output)
#> {"x":1,"y":[1,2],"z":[[3],[4]]}

input == output
#> [1] TRUE

Created on 2023-09-13 with reprex v2.0.2


chrisknoll commented Sep 15, 2023

Yes, I thought there was a case in the other thread referenced (where you replied with examples) where it wasn't working, but it might have been a misuse of simplifyVector and auto_unbox on my part. To get the correct output, I think it's worth trying a branch of ParallelLogger where, when we saveSettingsToJson, we set auto_unbox = T. And I did a test where I didn't set simplifyVector, and it still worked. So... strange.

But I definitely saw the issue when serializing the strategus design: integer fields that are arrays were being converted to plain integers.

But I think your examples show that it can be made to work.


ablack3 commented Sep 15, 2023

Here is another change that jsonlite makes when serializing and deserializing: floats are rounded. This could be a problem when passing Strategus parameters.

(input <- '{"fileTreeProcessingTime":310.4735880413999999999}')
#> [1] "{\"fileTreeProcessingTime\":310.4735880413999999999}"
(output <- jsonlite::fromJSON(input, simplifyVector = F, digits = 9) |> 
  jsonlite::toJSON(auto_unbox = T, digits = NA))
#> {"fileTreeProcessingTime":310.4735880414}

input == output
#> [1] FALSE

digits = NA implies using max precision

which is not a problem if the data is quoted..

(input <- '{"fileTreeProcessingTime":"310.4735880413999999999"}')
#> [1] "{\"fileTreeProcessingTime\":\"310.4735880413999999999\"}"
(output <- jsonlite::fromJSON(input, simplifyVector = F) |> 
  jsonlite::toJSON(auto_unbox = T))
#> {"fileTreeProcessingTime":"310.4735880413999999999"}

input == output
#> [1] TRUE

Created on 2023-09-15 with reprex v2.0.2

@chrisknoll

But if the data is quoted, then the datatype isn't float or double, it is string, which would be a problem if you loaded it into a strongly typed language.

I'm not sure rounding of floats is a problem, but losing precision on ints would be a problem (such as large concept IDs or covariate IDs).
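The integer case is easy to reproduce with jsonlite, which parses JSON numbers as doubles (exact only up to 2^53); the field name here is just for illustration:

```r
# 2^53 + 1 cannot be represented exactly as a double
input <- '{"conceptId": 9007199254740993}'
cat(sprintf("%.0f", jsonlite::fromJSON(input)$conceptId))
#> 9007199254740992

# quoting preserves the digits, but the type becomes string
cat(jsonlite::fromJSON('{"conceptId": "9007199254740993"}')$conceptId)
#> 9007199254740993
```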


azimov commented Sep 19, 2023

To piggyback on this topic, my preference would be to support multiple json documents by reference rather than embedded strings or sub-documents. The reference could be relative to the location on disk, and a json schema (if we get to formally defining one) could specify that a field can be either a json object of a given type or a reference to a file with a path relative to the project root.

The reason for this is that we potentially have the use case where the cohort definitions can go into the hundreds or thousands and storing this in a single document makes debugging and maintenance challenging.

@mdlavallee92

I agree with @azimov. My preference would be for the cohortDefinition slot to be a reference to a file, either from the directory root, GitHub, or ATLAS. An OHDSI study must have a trace to a circe cohort, so I think it is good that strategus is imposing this slot as a circe expression at a minimum.

Upon further thought, I want to go back on something I said on the HADES call. The analysisSpecification.json file is probably not the place to make things legible (i.e., splitting settings files or improving readability). To me, as long as there are complements to this massive execution file within the study repository that explain its contents to an implementer of the study, it is good. A cohortsToCreate folder storing all circe json in use for the study, and the specification builder file, would be examples. The auditability of the study directory remains very important imo, but we can solve this outside a single file that is going to be hard to read anyway.

A master validator function is nice but I would not suggest that as a priority considering the complexity of this package and its modules.

@anthonysena anthonysena added this to the v0.2.0 milestone Dec 4, 2023
@anthonysena anthonysena modified the milestones: v1.0.0, Backlog Aug 19, 2024