-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Study Analysis Specification JSON Schema #114
base: develop
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## develop #114 +/- ##
========================================
Coverage 69.19% 69.19%
========================================
Files 8 8
Lines 922 922
========================================
Hits 638 638
Misses 284 284 ☔ View full report in Codecov by Sentry. |
Here is a chatgipty made example document that conforms to this schema:
|
The reason we made negate control concept sets (NCCS) different from a regular concept set (CS) is that they work differently; In a NCCS, each listed concept will result in a cohort, and you can separately specify whether you want to include all descendants of a concept when creating the cohort for that concept (and importantly: there will be no cohorts for each descendant). In a CS, you define logic defining a set of concepts, for example by listing concepts that should be included or excluded from the overall set, with or without their descendants. I therefore propose to keep the NCCS definition as it was. |
Noted and I have updated the spec to include this however, it would be possible to include these as optional fields in a concept set definition that are validated differently. I have also aded a |
This PR is designed to create an initial json schema for study designs that learns from the lessons of the past year as well as creating a formal spec for the overall project.
A lot of what I write here is opinion so up for debate, please add to the discussion if you are confused by anything or oppose any of the decisions made here.
Some changes were made from the current study specs (which could easily be converted to this definition with a small script).
Added an API version - this is currently missing and will create obvious headaches for us going forward if we need to retire old versions
The cohortDefinition set object uses JSON HAL see #98 references to refer to cohort definitions rather than baking them in to the json document itself. JSON HAL spec still supports these resources being included in the final document as "_embeded" objects but this makes references more explicit.
We also have no json schema for cohorts and that would be too time consuming to include here. This is also really more of a Circe thing so outside of scope.
Think of a HAL resource as being "collected" when the document is "processed". These resources are then cached. The link is also preserved so if the resource is updated we can find that out. E.g. if a phenotype library definition has changed.
(on a side note, I also don't really see a good reason to have a massive json document as a preference over a tarball of a directory that you can navigate. This is how jar files work, for example).
The use of json strings inside a json document is also bad practice and removed because its unneccesary if with consider them as resources that can be remote URLs or on disks.
The negative control concept set is now just concept sets that can optionally be negative controls, concepts can also include a cohortDefinitionId as this is there currently
Shared resources can also be references to other remote json schemas (so you can extend this to whatver objects you like, and how that is handled is internal to the packages)
renv is now a single study wide file that can be optionally overridden in a module. This is referenced as a hal resource (so either some remote URL or some path local to this file)
modules are now an array of module types. I didn't include the remote repo because I had a long think and I really think there are few cases where we need a separate git repository for a module that is not just an R package. So the approach taken here is as follows - a "module" just has settings which are defined in a schemaRef which is a remote resource. Naturally this URL should relate to the specific package version and branch. It should be up to the module/package maintainer to keep the json schemas for these up to date (which should be easy to achieve if we have github actions checking them on every PR)
Martijn has already mentioned a utility to make this easier for packages so that should go some of the way there for adding validators for that.
Things I missed out:
Also a note on types - "integer" in json schema does not describe the size of data types so assume its 64 bit storage for big ints as needed for cohort ids and concept ids.