Study Analysis Specification JSON Schema #114

azimov · 2024-01-15T19:31:58Z

This PR is designed to create an initial json schema for study designs that learns from the lessons of the past year as well as creating a formal spec for the overall project.

A lot of what I write here is opinion so up for debate, please add to the discussion if you are confused by anything or oppose any of the decisions made here.

Some changes were made from the current study specs (which could easily be converted to this definition with a small script).

Added an API version - this is currently missing and will create obvious headaches for us going forward if we need to retire old versions
The cohortDefinition set object uses JSON HAL see #98 references to refer to cohort definitions rather than baking them in to the json document itself. JSON HAL spec still supports these resources being included in the final document as "_embeded" objects but this makes references more explicit.
We also have no json schema for cohorts and that would be too time consuming to include here. This is also really more of a Circe thing so outside of scope.
Think of a HAL resource as being "collected" when the document is "processed". These resources are then cached. The link is also preserved so if the resource is updated we can find that out. E.g. if a phenotype library definition has changed.
(on a side note, I also don't really see a good reason to have a massive json document as a preference over a tarball of a directory that you can navigate. This is how jar files work, for example).
The use of json strings inside a json document is also bad practice and removed because its unneccesary if with consider them as resources that can be remote URLs or on disks.
The negative control concept set is now just concept sets that can optionally be negative controls, concepts can also include a cohortDefinitionId as this is there currently
Shared resources can also be references to other remote json schemas (so you can extend this to whatver objects you like, and how that is handled is internal to the packages)
renv is now a single study wide file that can be optionally overridden in a module. This is referenced as a hal resource (so either some remote URL or some path local to this file)
modules are now an array of module types. I didn't include the remote repo because I had a long think and I really think there are few cases where we need a separate git repository for a module that is not just an R package. So the approach taken here is as follows - a "module" just has settings which are defined in a schemaRef which is a remote resource. Naturally this URL should relate to the specific package version and branch. It should be up to the module/package maintainer to keep the json schemas for these up to date (which should be easy to achieve if we have github actions checking them on every PR)

Martijn has already mentioned a utility to make this easier for packages so that should go some of the way there for adding validators for that.

Things I missed out:

description fields for these properties - this should be added but they're pretty self explanatory

Also a note on types - "integer" in json schema does not describe the size of data types so assume its 64 bit storage for big ints as needed for cohort ids and concept ids.

codecov · 2024-01-15T19:38:38Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (13d020e) 69.19% compared to head (a31486f) 69.19%.
Report is 1 commits behind head on develop.

Additional details and impacted files

@@           Coverage Diff            @@
##           develop     #114   +/-   ##
========================================
  Coverage    69.19%   69.19%           
========================================
  Files            8        8           
  Lines          922      922           
========================================
  Hits           638      638           
  Misses         284      284

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

azimov · 2024-01-15T19:46:58Z

Here is a chatgipty made example document that conforms to this schema:

{
  "apiVersion": "1.0",
  "studyName": "MyOHDSIStudy",
  "studyDescription": "A sample OHDSI study definition",
  "sharedResources": [
    {
      "cohortDefinitionSet": {
        "cohortDefinitions": [
          {
            "_links": {"self": {"href": "/cohorts/123"}},
            "cohortName": "DrugX",
            "cohortDefinitionId": 123
          },
          {
            "_links": {"self": {"href": "/cohorts/456"}},
            "cohortName": "ConditionY",
            "cohortDefinitionId": 456
          },
          {
            "cohortDefinitionId": 789,
            "targetCohortDefinitionId": 101,
            "subsetDefinitionId": 202,
            "cohortName": "CombinedCohort"
          }
        ],
        "subsetDefinitions": [
          {
            "_links": {"self": {"href": "/subsets/101"}},
            "subsetDefinitionName": "SubsetX",
            "subsetDefinitionId": 101
          },
          {
            "_links": {"self": {"href": "/subsets/202"}},
            "subsetDefinitionName": "SubsetY",
            "subsetDefinitionId": 202
          }
        ]
      }
    },
    {
      "conceptSetDefinition": {
        "id": 123,
        "conceptSetName": "DrugX Concept Set",
        "isNegativeControlConceptSet": false,
        "concepts": [
          {
            "conceptId": 789,
            "conceptName": "DrugX Concept A",
            "standardConcept": "S",
            "standardConceptCaption": "Standard Concept",
            "invalidReason": "Invalid Reason",
            "invalidReasonCaption": "Invalid Reason Caption",
            "conceptCode": "ABC123",
            "domainId": 1,
            "vocabularyId": 2,
            "conceptClassId": 3,
            "validStartDate": "2022-01-01",
            "validEndDate": "2022-12-31",
            "cohortDefinitionId": 123
          },
          {
            "conceptId": 456,
            "conceptName": "DrugX Concept B",
            "standardConcept": "NS",
            "standardConceptCaption": "Non-Standard Concept",
            "invalidReason": "No Invalid Reason",
            "invalidReasonCaption": "No Invalid Reason Caption",
            "conceptCode": "XYZ789",
            "domainId": 4,
            "vocabularyId": 5,
            "conceptClassId": 6,
            "validStartDate": "2022-02-01",
            "validEndDate": "2022-11-30",
            "cohortDefinitionId": 123
          }
        ]
      }
    },
    {
      "sharedResource": {
        "name": "MySharedResource",
        "schemaRef": "http://example.com/shared-resource-schema"
      }
    }
  ],
  "renv": {
    "_links": {"self": {"href": "/renv-lock/789"}}
  },
  "modules": [
    {
      "module": "ModuleX",
      "version": "1.2.3",
      "schemaRef": "http://example.com/module-x-schema",
      "settings": {"key1": "value1", "key2": "value2"},
      "renv": {
        "_links": {"self": {"href": "/module-x-renv/789"}}
      }
    },
    {
      "module": "ModuleY",
      "version": "4.5.6",
      "schemaRef": "http://example.com/module-y-schema",
      "settings": {"key3": "value3", "key4": "value4"},
      "renv": {
        "_links": {"self": {"href": "/module-y-renv/789"}}
      }
    }
  ]
}

schuemie · 2024-01-16T05:50:22Z

The reason we made negate control concept sets (NCCS) different from a regular concept set (CS) is that they work differently; In a NCCS, each listed concept will result in a cohort, and you can separately specify whether you want to include all descendants of a concept when creating the cohort for that concept (and importantly: there will be no cohorts for each descendant). In a CS, you define logic defining a set of concepts, for example by listing concepts that should be included or excluded from the overall set, with or without their descendants.

I therefore propose to keep the NCCS definition as it was.

azimov · 2024-01-17T18:27:50Z

The reason we made negate control concept sets (NCCS) different from a regular concept set (CS) is that they work differently; In a NCCS, each listed concept will result in a cohort, and you can separately specify whether you want to include all descendants of a concept when creating the cohort for that concept (and importantly: there will be no cohorts for each descendant). In a CS, you define logic defining a set of concepts, for example by listing concepts that should be included or excluded from the overall set, with or without their descendants.

I therefore propose to keep the NCCS definition as it was.

Noted and I have updated the spec to include this however, it would be possible to include these as optional fields in a concept set definition that are validated differently.

I have also aded a resourceName field to top level resources which can be referenced in the modules as requiredResource array as a runtime check.

Created first draft of study json schema object

f09d41b

azimov requested review from anthonysena, chrisknoll, msuchard and schuemie January 15, 2024 19:32

azimov changed the base branch from main to develop January 15, 2024 19:32

Changes to allow modules to have internal renvs

5df2bb9

Name and description fields and study name and study description fields

f6ddaf3

azimov added 3 commits January 15, 2024 12:59

formatting

8794d1a

fixed broken end tag

5610469

Added an optional calling function argument to modules

90afc6a

anthonysena mentioned this pull request Jan 16, 2024

Proposed new input specification format #112

Closed

Added resource names for run time validation of module requirements

3e4f2cd

anthonysena linked an issue Jan 17, 2024 that may be closed by this pull request

Add linkability to design documents to implement JSON Hypertext Application Language (HAL) #98

Open

Removed renv from modules and calling function to simplify design

a31486f

anthonysena mentioned this pull request Feb 7, 2024

Sync Strategus modules and lock file OHDSI/Ulysses#17

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Study Analysis Specification JSON Schema #114

Study Analysis Specification JSON Schema #114

azimov commented Jan 15, 2024 •

edited

Loading

codecov bot commented Jan 15, 2024 •

edited

Loading

azimov commented Jan 15, 2024

schuemie commented Jan 16, 2024

azimov commented Jan 17, 2024

Study Analysis Specification JSON Schema #114

Are you sure you want to change the base?

Study Analysis Specification JSON Schema #114

Conversation

azimov commented Jan 15, 2024 • edited Loading

codecov bot commented Jan 15, 2024 • edited Loading

Codecov Report

azimov commented Jan 15, 2024

schuemie commented Jan 16, 2024

azimov commented Jan 17, 2024

azimov commented Jan 15, 2024 •

edited

Loading

codecov bot commented Jan 15, 2024 •

edited

Loading