Parquet (and Avro) Schema for FHIR Resources

In this document we review the mapping (a.k.a. projection) from FHIR resources to Parquet schema. A few high-level reminders:

If you want to look at examples before reading the details, see Patient_no-extension.schema for the projection of the base Patient resource. Patient_US-Core.schema provides an example for the US Core Patient profile. To see the intermediate Avro schema for this resource, see us-core-patient-schema.json.

Type mapping rules

Note: In the following subsections we cover the rules for mapping a FHIR type to a Parquet schema. As mentioned above, this involves the intermediate Avro types which are covered as well. In all cases, the real Avro type is a union because all fields are nullable. So, for example, when we say the FHIR code type is mapped to Avro string, it is really the ["null", "string"] union type. This is not reiterated below but that is also the reason all Parquet fields are optional. This is even true for fields whose cardinality is exactly one like Observation.status.
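As a minimal sketch of the nullable-union convention described in the note above, the following Python snippet builds an Avro field definition as a plain dictionary. The helper names `nullable` and `avro_field` are hypothetical and are not taken from the pipeline's code. The `null` default is the usual Avro convention for optional fields and is an assumption about the generated schema.

```python
import json


def nullable(avro_type):
    """Wrap an Avro type in a union with "null" (hypothetical helper)."""
    return ["null", avro_type]


def avro_field(name, avro_type):
    """Build an Avro field whose type is the nullable union of avro_type."""
    # Assumption: a null default, following the usual Avro convention for
    # optional fields. "null" must be the first branch for that default to be valid.
    return {"name": name, "type": nullable(avro_type), "default": None}


# Observation.status has cardinality 1..1 in FHIR but is still nullable here.
status_field = avro_field("status", "string")
print(json.dumps(status_field))
```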

Primitive types

The FHIR primitive types are mapped according to this table (code reference):

| FHIR type | Avro type | Parquet type |
|---|---|---|
| base64Binary | string | STRING |
| boolean | boolean | boolean |
| canonical | string | STRING |
| code | string | STRING |
| date | string | STRING |
| dateTime | string | STRING |
| decimal | double | double* |
| id | string | STRING |
| instant | string | STRING |
| integer | int | int32 |
| markdown | string | STRING |
| oid | string | STRING |
| positiveInt | int | int32 |
| string | string | STRING |
| time | string | STRING |
| unsignedInt | int | int32 |
| xhtml | string | STRING |
| uri | string | STRING |
| url | string | STRING |
| uuid | string | STRING |

* Bunsen originally used the Avro decimal type to represent FHIR decimal, but this was changed because of precision issues, as described in Issue #156.
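To connect the table to the schema snippets used in the rest of this document, here is a small sketch that turns a FHIR primitive type into a Parquet field declaration. The function `parquet_field` and the two lookup structures are hypothetical names introduced for illustration. They only transcribe the table and are not the pipeline's actual code.

```python
# FHIR primitives that the table maps to a Parquet STRING, which is written
# as a binary field with the STRING annotation in the schema text.
STRING_TYPES = {
    "base64Binary", "canonical", "code", "date", "dateTime", "id", "instant",
    "markdown", "oid", "string", "time", "xhtml", "uri", "url", "uuid",
}

# The remaining FHIR primitives and their Parquet physical types.
OTHER_TYPES = {
    "boolean": "boolean",
    "decimal": "double",
    "integer": "int32",
    "positiveInt": "int32",
    "unsignedInt": "int32",
}


def parquet_field(name, fhir_type):
    """Return the Parquet declaration for a primitive field (always optional)."""
    if fhir_type in STRING_TYPES:
        return f"optional binary {name} (STRING);"
    # Raises KeyError for anything that is not a FHIR primitive in the table.
    return f"optional {OTHER_TYPES[fhir_type]} {name};"


print(parquet_field("birthDate", "date"))
print(parquet_field("active", "boolean"))
```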

Records

A FHIR record type, i.e., a complex type that has one or more fields, is mapped to an Avro record, which in turn is mapped to a Parquet group. FHIR examples include any Complex Type, BackboneElement, and Resource.

For example, a period field with the FHIR Period type is mapped to the following group in Parquet:

optional group period {
  optional binary start (STRING);
  optional binary end (STRING);
}

Lists

Many FHIR record types have fields that can be repeated. Each element with max cardinality higher than 1 is mapped to an Avro array which in turn is mapped to a Parquet LIST. As an example, here is the schema for the address field of a Patient resource:

optional group address (LIST) {
  repeated group array {
    optional binary use (STRING);
    optional binary type (STRING);
    optional binary text (STRING);
    optional group line (LIST) {
      repeated binary array (STRING);
    }
    optional binary city (STRING);
    optional binary district (STRING);
    optional binary state (STRING);
    optional binary postalCode (STRING);
    optional binary country (STRING);
    optional group period {
      optional binary start (STRING);
      optional binary end (STRING);
    }
  }
}

Choice types

A FHIR "choice type", i.e., a field ending with [x] that can take multiple types, is modeled as a record. The fields of the record are named after the possible types. For example, Patient.deceased[x] can be a boolean or a dateTime; hence it is modeled with the following Parquet schema:

optional group deceased {
  optional boolean boolean;
  optional binary dateTime (STRING);
}
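In FHIR JSON, a choice field appears under a type-suffixed key such as deceasedBoolean or deceasedDateTime. The sketch below shows one way such a value could be folded into the record shape above. The function `choice_record` is hypothetical, and returning `None` when no variant is present is an assumption rather than documented pipeline behavior.

```python
def choice_record(resource, base_name, type_names):
    """Collect the variants of a choice field into a single record."""
    record = {}
    for type_name in type_names:
        # FHIR JSON key: base name plus capitalized type, e.g. deceasedDateTime.
        json_key = base_name + type_name[0].upper() + type_name[1:]
        record[type_name] = resource.get(json_key)
    # Compare against None explicitly so that a boolean False is kept.
    if all(value is None for value in record.values()):
        return None
    return record


patient = {"resourceType": "Patient", "deceasedDateTime": "2020-01-01"}
deceased = choice_record(patient, "deceased", ["boolean", "dateTime"])
print(deceased)
```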

References

FHIR references are also records, but because they frequently participate in JOIN queries between different resource tables, they have some extra special fields. There is one such field for each resource type that the reference can refer to, which makes it easier to write JOIN queries. For example, Patient.generalPractitioner can be a reference to an Organization, a Practitioner, or a PractitionerRole. Therefore, it is mapped to the following Parquet schema (only special fields are shown; note there might be multiple generalPractitioner entries, hence the LIST):

optional group generalPractitioner (LIST) {
  repeated group array {
    optional binary organizationId (STRING);
    optional binary practitionerId (STRING);
    optional binary practitionerRoleId (STRING);
    ... [rest of the usual fields]
  }
}
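The sketch below illustrates how the special ID fields could be filled from a reference string such as "Practitioner/123". How the pipeline actually extracts these IDs is not described here, so the function `reference_id_fields` and its parsing rule are assumptions. In particular, versioned references (ending in /_history/N) are not handled.

```python
def id_field_name(resource_type):
    """Practitioner -> practitionerId, matching the field names above."""
    return resource_type[0].lower() + resource_type[1:] + "Id"


def reference_id_fields(reference, target_types):
    """Return one ID field per allowed target type; at most one is non-null."""
    fields = {id_field_name(t): None for t in target_types}
    # Works for relative ("Practitioner/123") and absolute
    # ("http://example.org/fhir/Practitioner/123") references.
    parts = reference.split("/")
    if len(parts) >= 2 and parts[-2] in target_types:
        fields[id_field_name(parts[-2])] = parts[-1]
    return fields


targets = ["Organization", "Practitioner", "PractitionerRole"]
gp_fields = reference_id_fields("Practitioner/123", targets)
print(gp_fields)
```

With fields like these, a JOIN between the Patient and Practitioner tables can compare practitionerId directly with the Practitioner id column instead of parsing the reference string in the query.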

Recursion

When mapping FHIR types to Parquet schema, we sometimes need to break recursive structures. For example, a FHIR Reference has an identifier field, whose Identifier type has an assigner field that is itself a Reference. Therefore, there is a recursiveDepth configuration parameter that controls how many times a recursive type should be traversed in the same branch.
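The following sketch shows one possible reading of this rule: a type is expanded only while it has appeared fewer than recursive_depth times on the current branch, and the field is dropped otherwise. The exact pruning semantics of recursiveDepth in the pipeline may differ. The function `expand` and the abbreviated type definitions are assumptions for illustration.

```python
# Abbreviated definitions: a Reference contains an Identifier, whose
# assigner is again a Reference.
TYPES = {
    "Reference": {"reference": "string", "display": "string", "identifier": "Identifier"},
    "Identifier": {"system": "string", "value": "string", "assigner": "Reference"},
}


def expand(type_name, recursive_depth, branch=()):
    """Expand a type into nested dicts, pruning repeats on the same branch."""
    if type_name not in TYPES:
        return type_name  # primitive, nothing to expand
    if branch.count(type_name) >= recursive_depth:
        return None  # this type was already traversed often enough: prune
    expanded = {}
    for field_name, field_type in TYPES[type_name].items():
        child = expand(field_type, recursive_depth, branch + (type_name,))
        if child is not None:
            expanded[field_name] = child
    return expanded


depth_one = expand("Reference", recursive_depth=1)
depth_two = expand("Reference", recursive_depth=2)
print(depth_one)
```

In this sketch, a depth of 1 keeps identifier but drops identifier.assigner, while a depth of 2 keeps identifier.assigner and drops the next level of nesting.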

Extensions

To make it easier to query extension fields, top-level fields are created for them. For example, in the US-Core Patient profile there is an extension for birthsex whose type is code; therefore, we get the following field at the topmost level in the Patient Parquet schema:

optional binary birthsex (STRING);

The above example is a "simple" extension. For "complex" extensions, i.e., extensions that have nested extensions (and have no value), the same structure is repeated in the generated schema as well. For example, the US-Core Patient profile has a complex race extension which has a list of ombCategory values, a list of detailed values, and a text. Therefore, the corresponding Parquet schema is:

optional group race {
  optional group ombCategory (LIST) {
    repeated group array {
      optional binary system (STRING);
      optional binary version (STRING);
      optional binary code (STRING);
      optional binary display (STRING);
      optional boolean userSelected;
    }
  }
  optional group detailed (LIST) {
    repeated group array {
      optional binary system (STRING);
      optional binary version (STRING);
      optional binary code (STRING);
      optional binary display (STRING);
      optional boolean userSelected;
    }
  }
  optional binary text (STRING);
}

As mentioned above, this race would be a top-level field, i.e., Patient.race.
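To make the simple and complex cases concrete, here is a sketch that lifts extensions from a FHIR JSON resource into top-level fields. The function `lift_extensions` and the way field names are supplied (explicit URL-to-name mappings passed by the caller) are assumptions for this example. The pipeline derives this information from the profile's StructureDefinitions, which is not shown here.

```python
US_CORE = "http://hl7.org/fhir/us/core/StructureDefinition/"


def ext_value(ext):
    """Return the value[x] entry of an extension, or None if it has none."""
    for key, value in ext.items():
        if key.startswith("value"):
            return value
    return None


def lift_extensions(resource, simple, complex_):
    """Copy the resource, replacing its extension list with top-level fields.

    simple:   {extension URL: field name}
    complex_: {extension URL: (field name, set of repeated sub-extension URLs)}
    """
    row = {k: v for k, v in resource.items() if k != "extension"}
    for ext in resource.get("extension", []):
        url = ext["url"]
        if url in simple:
            row[simple[url]] = ext_value(ext)
        elif url in complex_:
            name, repeated = complex_[url]
            record = {}
            for sub in ext.get("extension", []):
                if sub["url"] in repeated:
                    record.setdefault(sub["url"], []).append(ext_value(sub))
                else:
                    record[sub["url"]] = ext_value(sub)
            row[name] = record
    return row


patient = {
    "resourceType": "Patient",
    "id": "example",
    "extension": [
        {"url": US_CORE + "us-core-birthsex", "valueCode": "F"},
        {"url": US_CORE + "us-core-race", "extension": [
            {"url": "ombCategory", "valueCoding": {"code": "2106-3", "display": "White"}},
            {"url": "text", "valueString": "White"},
        ]},
    ],
}
row = lift_extensions(
    patient,
    simple={US_CORE + "us-core-birthsex": "birthsex"},
    complex_={US_CORE + "us-core-race": ("race", {"ombCategory", "detailed"})},
)
print(row["birthsex"], row["race"])
```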

Resource types with multiple extensions

In a profile, a single resource type may have multiple extension files, each with its own StructureDefinition. As long as these extensions are compatible (which is expected in a single profile), all of them are merged into a single schema. For example, if one extension adds a new field X on resource type R and another extension adds Y, the generated Parquet schema of R has both fields X and Y.
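The merge can be pictured with the following sketch, in which each extension contributes a set of named fields. The function `merge_extension_fields` is hypothetical. Treating two definitions as compatible only when a shared field name has the same type is an assumption about what "compatible" means here.

```python
def merge_extension_fields(*field_sets):
    """Merge the fields contributed by several extensions of one resource type."""
    merged = {}
    for fields in field_sets:
        for name, field_type in fields.items():
            # Assumed compatibility rule: a shared name must have the same type.
            if name in merged and merged[name] != field_type:
                raise ValueError(f"incompatible definitions for field {name!r}")
            merged[name] = field_type
    return merged


# One extension adds X to resource type R, another adds Y.
merged = merge_extension_fields({"X": "string"}, {"Y": "boolean"})
print(merged)
```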