Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Study Analysis Specification JSON Schema #114

Open
wants to merge 8 commits into
base: develop
Choose a base branch
from

Conversation

azimov
Copy link
Collaborator

@azimov azimov commented Jan 15, 2024

This PR is designed to create an initial json schema for study designs that learns from the lessons of the past year as well as creating a formal spec for the overall project.

A lot of what I write here is opinion so up for debate, please add to the discussion if you are confused by anything or oppose any of the decisions made here.

Some changes were made from the current study specs (which could easily be converted to this definition with a small script).

  • Added an API version - this is currently missing and will create obvious headaches for us going forward if we need to retire old versions

  • The cohortDefinition set object uses JSON HAL see #98 references to refer to cohort definitions rather than baking them in to the json document itself. JSON HAL spec still supports these resources being included in the final document as "_embeded" objects but this makes references more explicit.
    We also have no json schema for cohorts and that would be too time consuming to include here. This is also really more of a Circe thing so outside of scope.
    Think of a HAL resource as being "collected" when the document is "processed". These resources are then cached. The link is also preserved so if the resource is updated we can find that out. E.g. if a phenotype library definition has changed.
    (on a side note, I also don't really see a good reason to have a massive json document as a preference over a tarball of a directory that you can navigate. This is how jar files work, for example).

  • The use of json strings inside a json document is also bad practice and removed because its unneccesary if with consider them as resources that can be remote URLs or on disks.

  • The negative control concept set is now just concept sets that can optionally be negative controls, concepts can also include a cohortDefinitionId as this is there currently

  • Shared resources can also be references to other remote json schemas (so you can extend this to whatver objects you like, and how that is handled is internal to the packages)

  • renv is now a single study wide file that can be optionally overridden in a module. This is referenced as a hal resource (so either some remote URL or some path local to this file)

  • modules are now an array of module types. I didn't include the remote repo because I had a long think and I really think there are few cases where we need a separate git repository for a module that is not just an R package. So the approach taken here is as follows - a "module" just has settings which are defined in a schemaRef which is a remote resource. Naturally this URL should relate to the specific package version and branch. It should be up to the module/package maintainer to keep the json schemas for these up to date (which should be easy to achieve if we have github actions checking them on every PR)

Martijn has already mentioned a utility to make this easier for packages so that should go some of the way there for adding validators for that.

Things I missed out:

  • description fields for these properties - this should be added but they're pretty self explanatory

Also a note on types - "integer" in json schema does not describe the size of data types so assume its 64 bit storage for big ints as needed for cohort ids and concept ids.

@azimov azimov changed the base branch from main to develop January 15, 2024 19:32
Copy link

codecov bot commented Jan 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (13d020e) 69.19% compared to head (a31486f) 69.19%.
Report is 1 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop     #114   +/-   ##
========================================
  Coverage    69.19%   69.19%           
========================================
  Files            8        8           
  Lines          922      922           
========================================
  Hits           638      638           
  Misses         284      284           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@azimov
Copy link
Collaborator Author

azimov commented Jan 15, 2024

Here is a chatgipty made example document that conforms to this schema:

{
  "apiVersion": "1.0",
  "studyName": "MyOHDSIStudy",
  "studyDescription": "A sample OHDSI study definition",
  "sharedResources": [
    {
      "cohortDefinitionSet": {
        "cohortDefinitions": [
          {
            "_links": {"self": {"href": "/cohorts/123"}},
            "cohortName": "DrugX",
            "cohortDefinitionId": 123
          },
          {
            "_links": {"self": {"href": "/cohorts/456"}},
            "cohortName": "ConditionY",
            "cohortDefinitionId": 456
          },
          {
            "cohortDefinitionId": 789,
            "targetCohortDefinitionId": 101,
            "subsetDefinitionId": 202,
            "cohortName": "CombinedCohort"
          }
        ],
        "subsetDefinitions": [
          {
            "_links": {"self": {"href": "/subsets/101"}},
            "subsetDefinitionName": "SubsetX",
            "subsetDefinitionId": 101
          },
          {
            "_links": {"self": {"href": "/subsets/202"}},
            "subsetDefinitionName": "SubsetY",
            "subsetDefinitionId": 202
          }
        ]
      }
    },
    {
      "conceptSetDefinition": {
        "id": 123,
        "conceptSetName": "DrugX Concept Set",
        "isNegativeControlConceptSet": false,
        "concepts": [
          {
            "conceptId": 789,
            "conceptName": "DrugX Concept A",
            "standardConcept": "S",
            "standardConceptCaption": "Standard Concept",
            "invalidReason": "Invalid Reason",
            "invalidReasonCaption": "Invalid Reason Caption",
            "conceptCode": "ABC123",
            "domainId": 1,
            "vocabularyId": 2,
            "conceptClassId": 3,
            "validStartDate": "2022-01-01",
            "validEndDate": "2022-12-31",
            "cohortDefinitionId": 123
          },
          {
            "conceptId": 456,
            "conceptName": "DrugX Concept B",
            "standardConcept": "NS",
            "standardConceptCaption": "Non-Standard Concept",
            "invalidReason": "No Invalid Reason",
            "invalidReasonCaption": "No Invalid Reason Caption",
            "conceptCode": "XYZ789",
            "domainId": 4,
            "vocabularyId": 5,
            "conceptClassId": 6,
            "validStartDate": "2022-02-01",
            "validEndDate": "2022-11-30",
            "cohortDefinitionId": 123
          }
        ]
      }
    },
    {
      "sharedResource": {
        "name": "MySharedResource",
        "schemaRef": "http://example.com/shared-resource-schema"
      }
    }
  ],
  "renv": {
    "_links": {"self": {"href": "/renv-lock/789"}}
  },
  "modules": [
    {
      "module": "ModuleX",
      "version": "1.2.3",
      "schemaRef": "http://example.com/module-x-schema",
      "settings": {"key1": "value1", "key2": "value2"},
      "renv": {
        "_links": {"self": {"href": "/module-x-renv/789"}}
      }
    },
    {
      "module": "ModuleY",
      "version": "4.5.6",
      "schemaRef": "http://example.com/module-y-schema",
      "settings": {"key3": "value3", "key4": "value4"},
      "renv": {
        "_links": {"self": {"href": "/module-y-renv/789"}}
      }
    }
  ]
}

@schuemie
Copy link
Member

The reason we made negate control concept sets (NCCS) different from a regular concept set (CS) is that they work differently; In a NCCS, each listed concept will result in a cohort, and you can separately specify whether you want to include all descendants of a concept when creating the cohort for that concept (and importantly: there will be no cohorts for each descendant). In a CS, you define logic defining a set of concepts, for example by listing concepts that should be included or excluded from the overall set, with or without their descendants.

I therefore propose to keep the NCCS definition as it was.

@azimov
Copy link
Collaborator Author

azimov commented Jan 17, 2024

The reason we made negate control concept sets (NCCS) different from a regular concept set (CS) is that they work differently; In a NCCS, each listed concept will result in a cohort, and you can separately specify whether you want to include all descendants of a concept when creating the cohort for that concept (and importantly: there will be no cohorts for each descendant). In a CS, you define logic defining a set of concepts, for example by listing concepts that should be included or excluded from the overall set, with or without their descendants.

I therefore propose to keep the NCCS definition as it was.

Noted and I have updated the spec to include this however, it would be possible to include these as optional fields in a concept set definition that are validated differently.

I have also aded a resourceName field to top level resources which can be referenced in the modules as requiredResource array as a runtime check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add linkability to design documents to implement JSON Hypertext Application Language (HAL)
2 participants