Flattened FHIR defined by a subset of FHIRPath #69

rbrush · 2023-04-20T23:07:07Z

rbrush
Apr 20, 2023
Maintainer

This post explores some options for defining flattened FHIR tables via a subset of FHIRPath expressions, and is intended to start a discussion.

Possible requirements

A few possible requirements for this system:

A portable, unambiguous specification

Any good standard is unambiguous and portable between technology stacks, and this is no exception.

Build on an improved SQL-on-FHIR “level 0”, when available

This proposal builds on a proposed “level 0” representation of FHIR data in SQL databases that improves on the approach previously documented at https://github.com/FHIR/sql-on-fhir/blob/master/sql-on-fhir.md. For example, the “level 0” may hash ids, add raw id fields to FHIR reference types, or expand “date” types into start date/end date pairs. Users may query level 0 directly if needed, or use tabular views proposed here on top of it.

Ability to select from repeated structures based on field values

Flattening repeated structures in FHIR requires checking the content of their fields. For example, creating a table of patient home addresses requires checking that the address.use field is ‘home’. Similarly, a table with columns for systolic and diastolic blood pressures needs to check the Observation.component.code fields to select them properly.

Ability to filter based on value sets when creating a view.

Many useful FHIR queries rely heavily on value sets to identify needed resources. For instance, users may be interested in a table of statin meds for analysis, requiring a value set of statin medication codes to allow such a flattened view of statins. Therefore some form of valueset-based filter should be used to create the needed views.

Leverage existing standards whenever possible

Whenever practical we should avoid creating new standards and use existing approaches to these problems.

Support direct exports from data sources

Some users have limited analytic needs and only need views over a small subset of FHIR data that could be produced by a given system. Ideally a flattened FHIR definition could be interpreted by a FHIR service so only the needed subset of data is produced – whether directly in a tabular form or limited to the FHIR resources needed for the views.

Proposal: columns and filters with a subset of FHIRPath

One approach to the above would be to use a subset of FHIRPath to define flattened columns for use. Here’s a simple example showing a flattened table of home addresses.

The goal with these examples is not to fully define a spec, but rather to illustrate the approach for discussion.

{
  "viewName": "patient_with_home_address",
  "resource": "Patient",
  "columns": {
    "id": "id",
    "gender": "gender",
    "birthdate": "birthDate",
    "street": "address.where(use = 'home').first().line.first()",
    "city": "address.where(use = 'home').first().city",
    "state": "address.where(use = 'home').first().state",
    "zip": "address.where(use = 'home').first().postalCode"
  }
}
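To make the semantics concrete, here is a minimal Python sketch of how a definition like this could be applied to one Patient resource held as a dict. It is an illustration only, not a real FHIRPath engine: the helper names (split_steps, evaluate, flatten) and the sample patient are hypothetical, it understands only dot paths, where(field = 'literal') and first(), and it assumes the home-address filter checks Address.use, as the requirements section describes.

```python
import re

def split_steps(expr):
    """Split a path on dots that sit outside parentheses."""
    steps, depth, current = [], 0, ""
    for ch in expr:
        if ch == "." and depth == 0:
            steps.append(current)
            current = ""
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        current += ch
    steps.append(current)
    return steps

def evaluate(expr, resource):
    """Evaluate dot paths, where(field = 'literal') and first() only."""
    items = [resource]
    for step in split_steps(expr):
        if step == "first()":
            items = items[:1]
        elif step.startswith("where("):
            field, literal = re.fullmatch(
                r"(\w+)\s*=\s*'([^']*)'", step[6:-1]).groups()
            items = [i for i in items
                     if isinstance(i, dict) and i.get(field) == literal]
        else:
            found = []
            for item in items:
                value = item.get(step) if isinstance(item, dict) else None
                if value is not None:
                    # Repeated fields are flattened into the working collection.
                    found.extend(value if isinstance(value, list) else [value])
            items = found
    return items

def flatten(view, resource):
    """Build one row; a column whose expression matches nothing becomes None."""
    row = {}
    for column, expr in view["columns"].items():
        result = evaluate(expr, resource)
        row[column] = result[0] if result else None
    return row

view = {
    "viewName": "patient_with_home_address",
    "resource": "Patient",
    "columns": {
        "id": "id",
        "street": "address.where(use = 'home').first().line.first()",
        "zip": "address.where(use = 'home').first().postalCode",
    },
}
patient = {  # made-up sample data
    "resourceType": "Patient",
    "id": "p1",
    "address": [
        {"use": "work", "line": ["1 Office Park"], "postalCode": "73301"},
        {"use": "home", "line": ["12 Oak St", "Apt 3"], "postalCode": "53703"},
    ],
}
print(flatten(view, patient))
```

A real implementation would translate the expressions to SQL rather than walk resources in memory; the sketch only shows what each column is expected to contain.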

Here’s another example that creates a simple table of LDL values. Notice this adds a “constraints” section that checks for a specific value set.

{
  "viewName": "ldl_values",
  "resource": "Observation",
  "columns": {
    "subject": "subject.idFor('Patient')",
    "value": "value.ofType(Quantity).value",
    "unit": "value.ofType(Quantity).unit",
    "display": "code.coding.display.first()",
    "effectiveTime": "effective.ofType(dateTime)"
  },
  "constraints": [
    "code.memberOf('http://example/ldl/valueset/')"
  ]
}

Subset of FHIRPath

Here is a working list of the subset of FHIRPath we might support for this:

Simple dot-delimited path expressions to select fields
The where function to select items in arrays (like the home address)
The equals operator, primarily for use in the where function above.
The ofType function to select the desired value type, as seen above.
The first function. This is easily implemented as getting the first item in an array, and simplifies the output when users are looking for only a single value for a scalar column.
The memberOf function to allow checking for value sets.

We may consider additional FHIRPath expressions over time, but the above starting point will be able to handle many analytic workloads.
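One way an implementation might enforce such a subset is to reject expressions that call functions outside the list. The sketch below is a rough illustration under that assumption: the ALLOWED_FUNCTIONS set and the unsupported_functions helper are hypothetical names, and the regular expression is a crude scan rather than a FHIRPath parser (it would misread a parenthesis inside a quoted string).

```python
import re

# Functions from the working list above; everything else is reported.
ALLOWED_FUNCTIONS = {"where", "ofType", "first", "memberOf"}

def unsupported_functions(expression):
    """Return the function names in the expression that are outside the subset."""
    called = re.findall(r"([A-Za-z_]\w*)\s*\(", expression)
    return sorted(set(called) - ALLOWED_FUNCTIONS)

print(unsupported_functions("address.where(use = 'home').first().city"))
print(unsupported_functions("name.given.join(', ')"))
```

Note that the non-standard idFor function used in the LDL example would also be reported by this check unless it is added to the allowed set.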

Cross-resource support explicitly out of scope

The possibility of including cross-resource joins as part of the flattened FHIR specification was mentioned, but tentatively ruled out. Once there are flattened FHIR tables, plain SQL joins on top of them are flexible and reasonably portable between systems.
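As a small illustration of that point, the sketch below joins two flattened tables with ordinary SQL, using Python's built-in sqlite3 module. The table and column names follow the two example views above; the rows are made up, and the tables are created by hand here rather than generated from view definitions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient_with_home_address (id TEXT, gender TEXT, city TEXT);
CREATE TABLE ldl_values (subject TEXT, value REAL, unit TEXT);
INSERT INTO patient_with_home_address VALUES
  ('p1', 'female', 'Madison'),
  ('p2', 'male', 'Austin');
INSERT INTO ldl_values VALUES
  ('p1', 130.0, 'mg/dL'),
  ('p1', 110.0, 'mg/dL');
""")

# The cross-resource relationship is expressed in SQL, not in the view definition.
rows = conn.execute("""
    SELECT p.id, p.city, l.value, l.unit
    FROM patient_with_home_address AS p
    JOIN ldl_values AS l ON l.subject = p.id
    ORDER BY l.value
""").fetchall()
print(rows)
```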

Translating FHIRPath to SQL

The ability to translate FHIRPath into SQL is central to this proposal. There are a couple of systems that have shown this is viable: Pathling and an open source Python library from Google.

A deeper exploration of FHIRPath to SQL translation can follow, but the above projects do so by recursively walking the FHIRPath parse tree and converting each sub-expression into an SQL expression – then assembling the full SQL expression as the logic unwinds that recursion.

I won’t fully detail the logic here, but consider this example of selecting the home postal code from a patient’s address:

address.where(use = 'home').postalCode

The logic will find that postalCode comes from the address, and produce an SQL sub-expression like this:

SELECT address_element_.postalCode 
FROM UNNEST(address) AS address_element_

From there we work back up the tree recursively, and encounter the where clause, so we translate that to SQL and add it to the UNNEST:

SELECT address_element_.postalCode
FROM UNNEST(address) AS address_element_
WHERE address_element_.use = 'home'
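The recursion described above can be sketched in a few lines of Python. This is not the Pathling or Google implementation; it is a minimal illustration that assumes a hand-built parse tree with two hypothetical node types (Field and Where), a hard-coded set of repeated fields, a filter on Address.use, and no escaping of literals.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Field:
    """A path step: parent.name (parent is None at the resource root)."""
    parent: Optional["Node"]
    name: str

@dataclass
class Where:
    """source.where(field = 'literal')"""
    source: "Node"
    field: str
    literal: str

Node = Union[Field, Where]

@dataclass
class Query:
    select: str                   # SQL expression for the current node
    source: Optional[str] = None  # FROM clause, if an UNNEST was needed
    filters: tuple = ()           # accumulated WHERE conditions

REPEATED = {"address"}  # assumption: known from FHIR cardinality metadata

def translate(node):
    """Translate children first, then wrap the result for this node."""
    if isinstance(node, Field):
        if node.parent is None:
            if node.name in REPEATED:
                alias = f"{node.name}_element_"
                return Query(alias, f"UNNEST({node.name}) AS {alias}")
            return Query(node.name)
        parent = translate(node.parent)
        return Query(f"{parent.select}.{node.name}", parent.source, parent.filters)
    if isinstance(node, Where):
        src = translate(node.source)
        condition = f"{src.select}.{node.field} = '{node.literal}'"
        return Query(src.select, src.source, src.filters + (condition,))
    raise TypeError(f"unsupported node: {node!r}")

def render(query):
    sql = f"SELECT {query.select}"
    if query.source:
        sql += f"\nFROM {query.source}"
    if query.filters:
        sql += "\nWHERE " + " AND ".join(query.filters)
    return sql

# address.where(use = 'home').postalCode
tree = Field(Where(Field(None, "address"), "use", "home"), "postalCode")
print(render(translate(tree)))
```

Each call returns the SQL fragment for its sub-expression, and the full statement is assembled as the recursion unwinds.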

JSON-backed implementation options

The above example focuses on a FHIR encoding that uses first-class columns in databases, but some users may prefer to keep FHIR in JSON form in the database. We won’t go into that deeply here, but translation to JSONPath supported by some databases could be straightforward. For instance, this FHIRPath:

address.where(use = 'home').postalCode

would translate to this JSONPath with a relatively straightforward traversal of the parse tree:

$.address[?(@.use == "home")].postalCode
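A rough sketch of that translation for the same tiny subset is below. It assumes only plain field steps and where(field = 'literal') filters, uses a regular expression instead of a real parse tree, and the function name fhirpath_to_jsonpath is hypothetical. Filter syntax also varies between JSONPath implementations, so the output format is an assumption too.

```python
import re

def fhirpath_to_jsonpath(expr):
    """Translate dot paths and where(field = 'literal') into a JSONPath string."""
    out = "$"
    # Tokenize into function calls (name plus parenthesized argument) or bare names.
    for step in re.findall(r"\w+\([^)]*\)|\w+", expr):
        match = re.fullmatch(r"where\((\w+)\s*=\s*'([^']*)'\)", step)
        if match:
            out += f'[?(@.{match.group(1)} == "{match.group(2)}")]'
        elif re.fullmatch(r"\w+", step):
            out += f".{step}"
        else:
            raise ValueError(f"unsupported step: {step}")
    return out

print(fhirpath_to_jsonpath("address.where(use = 'home').postalCode"))
```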

Challenges

There are a few challenges in translating FHIRPath to SQL that should be considered going forward:

FHIRPath by itself does not offer a straightforward way to get the raw ids, or handle any id hashing or translation that could happen on “level 0”. The above examples cheat with an idFor expression to get raw IDs – making that the only case where we stray from strict FHIRPath.
FHIRPath to SQL translation requires knowing the cardinality of nested fields to generate proper SQL. For instance, Patient.address requires an UNNEST expression to access, but Patient.gender is a single field and does not. This adds some complexity to the SQL generator, since it must keep track of the FHIR cardinality while traversing expressions.
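To illustrate the cardinality bookkeeping, here is a small Python sketch that walks a path and reports which segments would need an UNNEST. The MAX_CARDINALITY table is a hand-written assumption covering only a few Patient elements; a real generator would load this from FHIR type metadata, and the function name is hypothetical.

```python
# Assumption: a tiny, hand-written table of maximum cardinalities.
MAX_CARDINALITY = {
    "Patient.gender": "1",
    "Patient.address": "*",
    "Patient.address.line": "*",
    "Patient.address.postalCode": "1",
}

def segments_needing_unnest(resource_type, path):
    """Return the full paths of the repeated segments along a dotted path."""
    needs_unnest = []
    current = resource_type
    for segment in path.split("."):
        current = f"{current}.{segment}"
        # A KeyError here means the table does not know the element.
        if MAX_CARDINALITY[current] == "*":
            needs_unnest.append(current)
    return needs_unnest

print(segments_needing_unnest("Patient", "address.postalCode"))
print(segments_needing_unnest("Patient", "gender"))
```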

Alternatives

There are a few alternatives we can consider, listed below. The notes below are incomplete, but we can follow up with more detailed analyses of the pros and cons of these if they seem promising.

Transpile a standard SQL dialect

Tools like sqlglot can take standard ANSI SQL and translate it into a variety of SQL dialects, allowing for a portable specification. However, based on some limited exploration, there may not be a transpiler capable of handling the full nested and repeated structures seen in FHIR. For instance, UNNEST and LATERAL VIEW EXPLODE don’t seem fully supported, and the fact that not all databases support correlated subqueries to tap into nested data can lead to significantly different data structures. That said, deeper analysis could be justified here if there is interest in this approach.

Collections of DBT-like macros

It may also be possible to mitigate the variance in SQL dialects with an approach like DBT macros, with a distinct implementation of each macro. This may not be desirable since it ties the approach to DBT tooling rather than a portable standard.

niquola · 2023-04-21T10:14:02Z

niquola
Apr 21, 2023
Maintainer

@rbrush What restrictions do you see inside where expressions? Only equality? Can where expressions be nested?

1 reply

rbrush Apr 21, 2023
Maintainer Author

The use cases I've played with don't need nested where expressions or anything more than simple equality checks within them. There may be others that pop up that do, but I'd be open to constraining the nesting here if it makes implementations simpler.

niquola · 2023-04-21T15:05:44Z

niquola
Apr 21, 2023
Maintainer

Here is a draft of minimized FHIRPath grammar https://gist.github.com/niquola/61703a2a83d6f851e4cb2a6c021f08ba

0 replies

rbrush · 2023-04-21T16:10:53Z

rbrush
Apr 21, 2023
Maintainer Author

Some open questions and additional needs from today's working group meeting that we should expand on. Please feel free to add others as well.

I'll highlight one as the biggest open question, and then list others:

We should offer a mechanism to expand nested structures into their own, flattened rows with some form of unnest or cross-join semantics. For example, rather than getting just the home address, we should be able to create separate rows for each address with its own type column, plus the top-level patient id as a column. A similar use case could apply to Observation components.

Here are other open questions, but perhaps not as complicated as the first:

We may add the String join() function to the list so users can concatenate address lines (rather than just getting the first line).
Need to determine what to do with expressions that return repeated arrays or structs. (e.g., do we allow them in the output views even though they are not fully tabular, or treat them as error conditions?)
We can consider whether to use a StructureDefinition or some other resource to share such flattened FHIR content, but for the purposes of this conversation we use minimal JSON structures like those above while we work out the semantics.
We will need database type semantics, e.g. some function that defines the database type based on the FHIRPath expression.

2 replies

chrisgrenz Apr 25, 2023

For #1, we could choose an arbitrary "root" for the view and use the existing FHIRPath conventions for selectors:

{
  "viewName": "obs_codes",
  "viewRoot": "Observation.code.coding",
  "columns": {
    "obs_id": "%resource.id",
    "system":"system",
    "code":"code",
    "display":"display"
  },
  "constraints": [
    "%resource.code.memberOf('http://example/ldl/valueset/')"
  ]
}

Or even:

{
  "viewName": "obs_codes",
  "viewRoot": "Observation.where(code.memberOf('http://example/ldl/valueset/')).code.coding",
  "columns": {
    "obs_id": "%resource.id",
    "system":"system",
    "code":"code",
    "display":"display"
  }
}
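A rough Python sketch of how the "viewRoot" idea could behave on a single resource is below: one output row per element selected by the root, with %resource pointing back at the containing resource. The helper names (collect, unnest_view) and the sample Observation are hypothetical, only plain dot paths are handled, and constraints such as memberOf are ignored here.

```python
def collect(node, segments):
    """Follow plain field names, flattening repeated fields along the way."""
    items = [node]
    for segment in segments:
        found = []
        for item in items:
            value = item.get(segment) if isinstance(item, dict) else None
            if value is not None:
                found.extend(value if isinstance(value, list) else [value])
        items = found
    return items

def unnest_view(view, resource):
    """Emit one row per element under viewRoot."""
    segments = view["viewRoot"].split(".")
    if segments[0] == resource["resourceType"]:
        segments = segments[1:]  # drop the leading resource type
    rows = []
    for element in collect(resource, segments):
        row = {}
        for name, expr in view["columns"].items():
            if expr.startswith("%resource."):
                values = collect(resource, expr[len("%resource."):].split("."))
            else:
                values = collect(element, expr.split("."))
            row[name] = values[0] if values else None
        rows.append(row)
    return rows

view = {
    "viewName": "obs_codes",
    "viewRoot": "Observation.code.coding",
    "columns": {"obs_id": "%resource.id", "system": "system", "code": "code"},
}
observation = {  # made-up sample data
    "resourceType": "Observation",
    "id": "o1",
    "code": {"coding": [
        {"system": "http://example.org/codes", "code": "ldl-a"},
        {"system": "http://example.org/local", "code": "ldl-b"},
    ]},
}
for row in unnest_view(view, observation):
    print(row)
```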

rbrush Apr 29, 2023
Maintainer Author

Somehow I missed this comment earlier, but quite like this "view root" approach -- it's semantically equivalent to having a single "unnest" operation for a collection within the resource, but seems easier to understand and explain. I'll bring this up in the working group and on Zulip as well.

gotdan · 2023-04-28T16:40:20Z

gotdan
Apr 28, 2023
Maintainer

Notes from our working group meeting on 4/28

Revised BP flattening definition example:

{
    "view": "blood_pressure_with_dar",
    "resource": "Observation",
    "filters": [
        "Observation.code.coding.exists(system='http://loinc.org' and code='85354-9')"
    ],
    "vars": {
        "component_sbp": "Observation.component.where(code.coding.exists(system='http://loinc.org' and code='8480-6')).first()",
        "component_dbp": "Observation.component.where(code.coding.exists(system='http://loinc.org' and code='8462-4')).first()"
    },
    "fhirVersion": ["4.0.1", "5.0.0"], //optional
    "columns": {
        "id": "Observation.id",
        "patient_id": "Observation.subject.getId()",
        "effective_date_time": "Observation.effective.ofType(dateTime)",
        "sbp_quantity_system": "%component_sbp.value.ofType(Quantity).system",
        "sbp_quantity_code": "%component_sbp.value.ofType(Quantity).code",
        "sbp_quantity_display": "%component_sbp.value.ofType(Quantity).unit",
        "sbp_quantity_value": "%component_sbp.value.ofType(Quantity).value",
        "sbp_has_dar": "%component_sbp.dataAbsentReason.exists()",
        "dbp_quantity_system": "%component_dbp.value.ofType(Quantity).system",
        "dbp_quantity_code": "%component_dbp.value.ofType(Quantity).code",
        "dbp_quantity_display": "%component_dbp.value.ofType(Quantity).unit",
        "dbp_quantity_value": "%component_dbp.value.ofType(Quantity).value",
        "dbp_has_dar": "%component_dbp.dataAbsentReason.exists()"
    }
}

Discussion and open questions:

Is there a way to indicate which segments in a FP expression are arrays, or will a transpiler always need to do a lookup into FHIR StructureDefinition data?
Add exists, and and or to our minimal FHIRPath subset - are any other parts of FHIRPath critical? New items should balance value vs. complexity of implementation.
Add optional FHIR version parameter as array in flattening definition json structure to guide implementations.
Think about namespacing for flattening definitions - maybe these could translate into db schemas? Could be a registry layer on top of the definitions.
Should we use objects for the column definitions to allow for metadata - e.g., {expression, outputType, description}?
Spec should describe how cross joins expand multiple hierarchies in a single view (e.g., address and contact in a Patient resource). Also, what happens if one of the unnested elements doesn't exist in a resource?
Should we have a way to store constants (e.g., extension urls) as variables to make expressions more readable (right now all the vars are collections, so this may add complexity unless we create another property)?

2 replies

rbrush Apr 29, 2023
Maintainer Author

Thanks for posting the notes, Dan. A few thoughts on these that we can discuss here or at the connectathon:

Is there a way to indicate which segments in a FP expression are arrays, or will a transpiler always need to do a lookup into FHIR StructureDefinition data?

Ideally we wouldn't require FHIRPath authors to annotate arrays, so I'd like to explore ways to make that easier to implement in a transpiler. For instance, I think Nikolai suggested pre-processing StructureDefinitions into a minimal JSON document with only the needed type information in a standalone program, and share the resulting JSON across implementations of this proposal. That way it would be small and light, and lift the burden of dealing with the full StructureDefinitions in implementations that need that info.
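A minimal sketch of that kind of pre-processing is below. It reads the snapshot elements of a StructureDefinition (path, max, and type codes are standard ElementDefinition fields) and keeps only whether each element repeats and which types it allows. The output shape and the function name minimal_type_info are assumptions for illustration, and the input here is a hand-trimmed fragment rather than a real published definition.

```python
import json

def minimal_type_info(structure_definition):
    """Reduce a StructureDefinition snapshot to per-path cardinality and types."""
    info = {}
    for element in structure_definition["snapshot"]["element"]:
        info[element["path"]] = {
            # Anything other than a max of "0" or "1" is treated as repeated.
            "repeated": element.get("max", "1") not in ("0", "1"),
            "types": [t["code"] for t in element.get("type", [])],
        }
    return info

# Hand-trimmed fragment for illustration only.
patient_definition = {
    "resourceType": "StructureDefinition",
    "type": "Patient",
    "snapshot": {"element": [
        {"path": "Patient.gender", "min": 0, "max": "1",
         "type": [{"code": "code"}]},
        {"path": "Patient.address", "min": 0, "max": "*",
         "type": [{"code": "Address"}]},
    ]},
}
print(json.dumps(minimal_type_info(patient_definition), indent=2))
```

The resulting JSON document is what would be shared across implementations, so that a transpiler never has to read the full StructureDefinitions.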

Should we use objects for the column definitions to allow for metadata - e.g., {expression, outputType, description}?

+1 to some per-column metadata, like descriptions and output type. Both could be quite useful when creating a database view. (For instance, the column descriptions could be provided to the DB data dictionary when creating the view, providing valuable inline documentation for database users. And output types could allow those views to use db-friendly types, like first-class dates rather than strings).

Spec should describe how cross joins expand multiple hierarchies in a single view (e.g., address and contact in a Patient resource). Also, what happens if one of the unnested elements doesn't exist in a resource?

Chris Grenz had a nice idea above (that I completely missed until now) that I think nicely complements the vars you have in this example: we could create a "view root" expression when we want to unnest something in particular (like Patient.address), but still be able to reference the parent resource as outlined in Chris's comment above.

In this case, we would keep the vars concept for things like getting diastolic blood pressure...but that would basically be a shorthand to avoid having to repeat an expression inline. The "view root" would be the only thing that does a full unnest (and implicit cross join with the parent). The constraint is that we would only unnest/cross join one item -- which seems sufficient for most use cases and easy to understand, but we should pressure test that against expected uses.
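If vars really are just shorthand, one possible reading is plain textual substitution before any translation happens. The Python sketch below shows that reading under stated assumptions: the function name expand_vars is hypothetical, each %name is replaced by its parenthesized expression, unknown names such as %resource are left alone, and a percent sign inside a quoted string would be mishandled.

```python
import re

def expand_vars(view):
    """Return the columns with each %name replaced by its vars expression."""
    variables = view.get("vars", {})

    def substitute(expr):
        def replace(match):
            name = match.group(1)
            # Leave unknown names such as %resource untouched.
            return f"({variables[name]})" if name in variables else match.group(0)
        return re.sub(r"%(\w+)", replace, expr)

    return {column: substitute(expr) for column, expr in view["columns"].items()}

view = {
    "vars": {"component_sbp": "Observation.component.first()"},
    "columns": {
        "sbp_has_dar": "%component_sbp.dataAbsentReason.exists()",
        "obs_id": "%resource.id",
    },
}
print(expand_vars(view))
```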

gotdan May 1, 2023
Maintainer

Regarding pre-processing the structure definitions, I wrote some quick code a few years ago to do that for another project which may be worth dusting off: https://github.com/sync-for-science/data-census/blob/master/builder/src/build-definitions.js . I'm somewhat resigned to that approach unless we have another reason to break with FHIRPath syntax, since the trade offs seem to favor being compatible. That said, it does mean that the tool chain has to be kept up to date with new FHIR versions and it will be much more difficult to, for example, support the SQL generation via a user defined function in the DB.

EvanMachusak · 2023-05-01T12:22:50Z

EvanMachusak
May 1, 2023

I spent some time playing with this technology this weekend, and came up with some thoughts.

We should consider replacing fhirVersion with something like this:

"models": [
    {
        "url": "http://hl7.org/fhir", // canonical; tooling knows to use level 0 storage schema & support FHIRPath; in our IG?
        "version": "4.0.1",
        "name": "fhir401" // local name used in this mapping
    }
  ]

I am imagining scenarios where the software that interprets this model and issues CREATE VIEW statements could be configured to understand non-FHIR models, with the assumption that non-FHIR models would not support columns expressed via non-trivial FHIRPath.

The resource key is perhaps too prescriptive. There are many structures in FHIR that derive from Element such as Address that could be their own View. We've discussed that normalizing every component of every resource into its own table would create thousands of tables and wouldn't be ideal but in many cases it is by far the path of least resistance, e.g. Claim.item. Claims are essentially a collection of tables and normalizing seems the right approach for my use case.
These examples are flattening by simply ignoring every data point except the first, e.g. "street": "address.where(use = 'home').first().line.first()". This is surely fine in a lot of cases but in others could be quite dangerous as we're truncating data by schema and my fear is that decision might burn anyone who goes down that path in the long run.
How do I express that I don't want to truncate, and instead I actually do want to normalize a collection into another View? Can we consider another element? Something like this:

{
  "view": "Patient_Address",
  "models": [
    {
      "url": "http://hl7.org/fhir",
      "version": "4.0.1",
      "name": "fhir401"
    }
  ],
  "$this":  // using Chris's idea instead of "resource"; clearer than "viewRoot" imo but not sure
  {
      "model": "fhir401",
      "type": "Address",
      "path": "$this" // maybe?
  },
  "columns": [
    {
        "name": "patient",
        "type": "fhir401.id" // could use the LIKE SQL constraint for some FHIR regex's
    },
    {
      "name": "use",
      "path": "$this.use",
      "type": "fhir401.code"
    },
    {
      "name": "type",
      "path": "$this.type",
      "type": "fhir401.code"
    },
    {
      "name": "line",
      "path": "$this.line",
      "type": "fhir401.string"
    },
    {
      "name": "postalCode",
      "path": "$this.postalCode",
      "type": "fhir401.string"
    },
    {
      "name": "period",
      "path": "$this.period",
      "type": "fhir401.period"
    }
  ],
  "relationships": [
    {
      "name": "FK_Patient",
      "path": "$this.patient",
      "target": "Patient", // another View
      "cardinality": "1..1" // 0..1, 1..1, 0..*, 1..*
    }
  ]
}

For columns defined with FHIRPath path properties, tooling should be able to infer the correct column type, but for columns added to the view either based on another model or created to be the foreign keys, we'd need to express the desired type.
FHIR has many data types which are just strings constrained by regular expressions e.g. id. Views could copy forward these regular expressions into LIKE constraints for platforms that support this mechanism.
The tool that translates this into CREATE VIEW statements is going to be a fairly complicated piece of software. This isn't an "afternoon" thing. If we are going to collaborate on a reference implementation, shall we pick a language?

As an interesting side note - I created these JSON examples in VSCode running Copilot, and Copilot auto-generated the columns and relationships elements with those structures. The future is now.

That's all I have for now. Looking forward to seeing you all at the connectathon.

1 reply

rbrush May 4, 2023
Maintainer Author

It could be interesting to see if we can create views over non-FHIR models....assuming we can keep the more common FHIR-specific use case straightforward, and not require additional steps for those users.

For instance, many views (e.g., a table of patient addresses or blood pressure results) could be run against multiple versions of FHIR, so ideally users wouldn't need to include version-specific type information in multiple places. So perhaps "FHIR by default, with optional metadata for non-FHIR sources" could be a workable model?

I'd also like to better understand the use case. Are you considering views over CQL structures, JSON in general, or perhaps other data types entirely?
