FHIR export file to flat (#21)
* Update the README file

* Create a fhir_file_to_flat function
Enables direct conversion from FHIR .ndjson export to FHIRflat .parquet
pipliggins authored May 3, 2024
1 parent 42d4cb2 commit 5b65ced
Showing 4 changed files with 238 additions and 11 deletions.
170 changes: 169 additions & 1 deletion README.md
@@ -20,6 +20,13 @@ data= {
patient = Patient(**data)
```

or in bulk from a FHIR export as an .ndjson file.
```
from fhir.resources.patient import Patient
patients = Patient.fhir_bulk_import("patient_export.ndjson")
```
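
For reference, an .ndjson export holds one JSON resource per line. The sketch below reads such a file with the standard library alone. `read_ndjson` is a hypothetical helper used only for illustration; it is not part of FHIRflat and performs none of the validation that the resource classes provide.

```python
import json
import tempfile
from pathlib import Path


def read_ndjson(path):
    """Return one dict per non-empty line of an .ndjson file (hypothetical helper)."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Write a tiny two-patient export to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    export = Path(tmp) / "patient_export.ndjson"
    export.write_text(
        '{"resourceType": "Patient", "id": "f001", "gender": "male"}\n'
        '{"resourceType": "Patient", "id": "f002", "gender": "female"}\n',
        encoding="utf-8",
    )
    patients = read_ndjson(export)

print([p["id"] for p in patients])
```

Each dictionary could then be passed to a resource class in the same way as `Patient(**data)`.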

### To FHIRflat
Once initialised, FHIR resources can be transformed to FHIRflat files using the `to_flat()` function like this
```
@@ -30,9 +37,170 @@ which will produce a [parquet file](https://towardsdatascience.com/demystifying-
|--------------|------|--------|------------|-----------------|
| Patient | f001 | male | 1996-05-30 | False |

or a FHIRflat file can be generated directly from a FHIR .ndjson export file.
```
from fhir.resources.patient import Patient
Patient.fhir_file_to_flat("patient_export.ndjson")
```
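will create a FHIRflat parquet file. Note that `fhir_file_to_flat` as added in this commit names the file after the resource type (`f"{cls.resource_type}.parquet"`) unless `output_name` is passed, rather than after the source file.
This first initialises a Patient data class for each row to make use of the Pydantic
data validation, then creates a FHIRflat file.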
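
The validate-then-convert order described above can be sketched in plain Python. This is only an outline under stated assumptions: `ndjson_to_rows` and `EXCLUDED_FIELDS` are hypothetical names, the resource-type check is a stand-in for the Pydantic validation, and no parquet file is written.

```python
import json

# Hypothetical stand-in for a resource's flat_exclusions.
EXCLUDED_FIELDS = {"meta", "text"}


def ndjson_to_rows(lines, resource_type):
    """Check each exported resource, then drop excluded fields (illustrative only)."""
    rows = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        resource = json.loads(line)
        # Stand-in for Pydantic validation: reject rows of the wrong resource type.
        if resource.get("resourceType") != resource_type:
            raise ValueError(f"line {number}: expected a {resource_type} resource")
        rows.append({k: v for k, v in resource.items() if k not in EXCLUDED_FIELDS})
    return rows


export_lines = [
    '{"resourceType": "Patient", "id": "f001", "gender": "male", "meta": {"versionId": "1"}}',
    '{"resourceType": "Patient", "id": "f002", "gender": "female"}',
]
rows = ndjson_to_rows(export_lines, "Patient")
print(rows)
```

Validating before flattening means a malformed row fails early, before any output is produced.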

### From FHIRflat
FHIR resources can also be created directly from FHIRflat files
```
Patient.from_flat("patient_flat.parquet")
```
which will return either a single Patient resource, or a list of Patient resources if
the Parquet file contains multiple rows of data.

### Specification

The FHIRflat structure closely follows that of FHIR, and simply flattens nested columns
in a manner similar to `pd.json_normalize()`. Some fields are excluded, either because they are used only for convenience within a FHIR server, because they contain information not relevant to ISARIC clinical data, or because they would contain personally identifiable information (PII). These fields can be accessed and edited for each resource using the `flat_exclusions` property. A few FHIRflat conventions differ from simply normalising a FHIR structure; they are noted below.
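
As a rough illustration of the basic flattening, the sketch below joins nested keys with a dot, which is also the default separator of `pd.json_normalize()`. The `flatten` helper and the sample record are hypothetical and are not taken from the FHIRflat code base; none of the special rules listed below are applied.

```python
def flatten(record, parent=""):
    """Flatten nested dicts into one dict with dot-separated keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value  # scalars and lists are kept as they are
    return flat


# Hypothetical, heavily simplified observation record.
observation = {
    "resourceType": "Observation",
    "id": "obs1",
    "subject": {"reference": "Patient/f001"},
    "valueQuantity": {"value": 70, "unit": "kg"},
}
flat_observation = flatten(observation)
print(flat_observation)
```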

1. **codeableConcepts**

CodeableConcepts are converted into two lists, one of codes and one of the corresponding text. Each coding is compressed into a single string with the format `system|code`. The `|` symbol was chosen because it is the standard way to query codes in FHIR servers [(example)](https://www.hl7.org/fhir/search.html#3.2.1.5.5.1.3). Thus a JSON snippet containing a codeableConcept:
```
"code": {
    "coding": [
        {
            "system": "http://loinc.org",
            "code": "3141-9",
            "display": "Body weight Measured"
        },
        {
            "system": "http://snomed.info/sct",
            "code": "27113001",
            "display": "Body weight"
        }
    ]
}
```
is coded as two fields
| code.code | code.text |
|------------------------------------------------------------------|-----------------------------------------|
| ["http://loinc.org\|3141-9", "http://snomed.info/sct\|27113001"] | ["Body weight Measured", "Body weight"] |

Note that the external `coding` label is removed.
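
The conversion can be sketched as follows. `flatten_codeable_concept` is a hypothetical helper written for this illustration, not the function FHIRflat uses, and it always returns lists regardless of how many codings are present.

```python
def flatten_codeable_concept(concept):
    """Split a codeableConcept into a list of 'system|code' strings and a list of display texts."""
    codings = concept.get("coding", [])
    codes = [f"{c['system']}|{c['code']}" for c in codings]
    text = [c.get("display") for c in codings]
    return codes, text


body_weight = {
    "coding": [
        {"system": "http://loinc.org", "code": "3141-9", "display": "Body weight Measured"},
        {"system": "http://snomed.info/sct", "code": "27113001", "display": "Body weight"},
    ]
}
codes, text = flatten_codeable_concept(body_weight)
row = {"code.code": codes, "code.text": text}
print(row)
```

Keeping the two lists in the same order lets each code be matched to its display text by position.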

2. **References**

References are stored as a single string containing the resource type and the ID, separated by a forward slash.
```
"subject": {
"reference": "Patient/f001",
"display": "Donald Duck"
}
```
becomes
| subject.reference |
|-------------------|
|"Patient/f001" |

The display text will not be converted due to the risk of revealing identifying information (e.g., a patient's name).
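
A minimal sketch of this rule, assuming a hypothetical `flatten_reference` helper that is not part of FHIRflat: only the `reference` string is kept, and the resource type and ID can later be recovered by splitting on the slash.

```python
def flatten_reference(field_name, reference):
    """Keep only the 'reference' string; 'display' is dropped on purpose."""
    return {f"{field_name}.reference": reference["reference"]}


subject = {"reference": "Patient/f001", "display": "Donald Duck"}
row = flatten_reference("subject", subject)

# The resource type and ID can be recovered from the stored string.
resource_type, resource_id = row["subject.reference"].split("/", 1)
print(row, resource_type, resource_id)
```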

3. **Extensions**

The base FHIR schema can be extended to meet the needs of individual implementations using extension fields. FHIRflat displays these with the extension `url` as part of the column name. For example

```
"extension": [
    {
        "url": "timingPhase",
        "valueCodeableConcept": {
            "coding": [
                {
                    "system": "http://snomed.info/sct",
                    "code": "278307001",
                    "display": "on admission"
                }
            ]
        }
    },
    {
        "url": "relativePeriod",
        "extension": [
            {"url": "relativeStart", "valueInteger": 2},
            {"url": "relativeEnd", "valueInteger": 5}
        ]
    }
]
```
becomes
| extension.timingPhase.code | extension.timingPhase.text | extension.relativePeriod.relativeStart | extension.relativePeriod.relativeEnd |
|-------------------------------------|----------------------------|----------------------------------------|--------------------------------------|
| "http://snomed.info/sct\|278307001" | "on admission" | 2 | 5 |

Complex (nested) extensions such as relativePeriod also omit the internal `extension` labels.
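
The naming scheme can be sketched with a small recursive function. `flatten_extensions` is a hypothetical helper, not FHIRflat's implementation. Storing a single coding as a plain string rather than a one-item list is an assumption read off the table above.

```python
def flatten_extensions(extensions, prefix="extension"):
    """Build column names from extension urls, recursing into nested extensions."""
    flat = {}
    for ext in extensions:
        name = f"{prefix}.{ext['url']}"
        if "extension" in ext:
            # Complex extension: recurse, omitting the inner 'extension' label.
            flat.update(flatten_extensions(ext["extension"], name))
        elif "valueCodeableConcept" in ext:
            codings = ext["valueCodeableConcept"].get("coding", [])
            codes = [f"{c['system']}|{c['code']}" for c in codings]
            text = [c.get("display") for c in codings]
            # Assumption: a single coding is stored as a scalar, as in the table.
            flat[f"{name}.code"] = codes[0] if len(codes) == 1 else codes
            flat[f"{name}.text"] = text[0] if len(text) == 1 else text
        else:
            # Simple extension: copy whichever value[x] key is present.
            for key, value in ext.items():
                if key.startswith("value"):
                    flat[name] = value
    return flat


extensions = [
    {
        "url": "timingPhase",
        "valueCodeableConcept": {
            "coding": [
                {
                    "system": "http://snomed.info/sct",
                    "code": "278307001",
                    "display": "on admission",
                }
            ]
        },
    },
    {
        "url": "relativePeriod",
        "extension": [
            {"url": "relativeStart", "valueInteger": 2},
            {"url": "relativeEnd", "valueInteger": 5},
        ],
    },
]
flat_extensions = flatten_extensions(extensions)
print(flat_extensions)
```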


4. **0..\* cardinality fields**

Fields which can contain an unspecified number of repeated entries are handled according to the number of entries present. Lists of length 1 are expanded out as above, while any longer lists are kept in a single column with the data in its original nested structure and `_dense` appended to the end of the field name. These fields are not expected to be queried regularly in standard analyses.

For example, the `diagnosis` field of the [Encounter](https://hl7.org/fhir/encounter.html) resource has 0..* cardinality. If a single diagnosis is present, the field is expanded out:
```
"diagnosis": [
    {
        "condition": [{"reference": {"reference": "Condition/stroke"}}],
        "use": [
            {
                "coding": [
                    {
                        "system": "http://terminology.hl7.org/CodeSystem/diagnosis-role",
                        "code": "AD",
                        "display": "Admission diagnosis"
                    }
                ]
            }
        ]
    }
]
```
becomes
| diagnosis.condition.reference | diagnosis.use.code | diagnosis.use.text |
|-------------------------------|------------------------------------------------------------|---------------------|
| Condition/stroke | "http://terminology.hl7.org/CodeSystem/diagnosis-role\|AD" | Admission diagnosis |

whereas if 2 different diagnoses are present
```
"diagnosis": [
    {
        "condition": [{"reference": {"reference": "Condition/stroke"}}],
        "use": [
            {
                "coding": [
                    {
                        "system": "http://terminology.hl7.org/CodeSystem/diagnosis-role",
                        "code": "AD",
                        "display": "Admission diagnosis"
                    }
                ]
            }
        ]
    },
    {
        "condition": [{"reference": {"reference": "Condition/f201"}}],
        "use": [
            {
                "coding": [
                    {
                        "system": "http://terminology.hl7.org/CodeSystem/diagnosis-role",
                        "code": "DD",
                        "display": "Discharge diagnosis"
                    }
                ]
            }
        ]
    }
]
```
becomes
| diagnosis_dense                      |
|--------------------------------------|
|"[{"condition": [{"reference"...}]}]" |
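
The length-based rule can be sketched as below. `flatten_field` is a hypothetical helper, and the diagnosis entries are simplified stand-ins for the Encounter snippets above; the codeableConcept and reference rules are deliberately left out.

```python
def flatten_field(name, value):
    """Flatten dicts by key, expand one-item lists, keep longer lists as '<name>_dense'."""
    if isinstance(value, dict):
        flat = {}
        for key, child in value.items():
            flat.update(flatten_field(f"{name}.{key}", child))
        return flat
    if isinstance(value, list):
        if len(value) == 1:
            return flatten_field(name, value[0])  # expand the single entry
        if len(value) > 1:
            return {f"{name}_dense": value}  # keep the nested structure untouched
        return {}  # an empty list produces no columns
    return {name: value}


# Simplified, hypothetical diagnosis entries.
one_diagnosis = [{"condition": {"reference": "Condition/stroke"}}]
two_diagnoses = [
    {"condition": {"reference": "Condition/stroke"}},
    {"condition": {"reference": "Condition/f201"}},
]

single = flatten_field("diagnosis", one_diagnosis)
dense = flatten_field("diagnosis", two_diagnoses)
print(single)
print(dense)
```

Because the column name depends on how many entries a row has, code that reads a FHIRflat file may need to look for both the expanded columns and the `_dense` column.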
42 changes: 35 additions & 7 deletions fhirflat/resources/base.py
@@ -98,6 +98,40 @@ def fhir_bulk_import(cls, file: str) -> list[FHIRFlatBase]:
        else:
            return resources

    @classmethod
    def fhir_file_to_flat(cls, source_file: str, output_name: str | None = None):
        """
        Converts a .ndjson file of exported FHIR resources to a FHIRflat parquet file.

        Parameters
        ----------
        source_file: str
            Path to the FHIR resource file.
        output_name: str, optional
            Name of the parquet file to be generated. Defaults to
            "<resource_type>.parquet".

        Returns
        -------
        parquet file
            FHIRflat file containing the flattened resource data
        """

        if not output_name:
            output_name = f"{cls.resource_type}.parquet"

        # identify attributes that are lists of FHIR types and not excluded
        list_resources = [x for x in cls.attr_lists() if x not in cls.flat_exclusions]

        # load (and validate) every resource in the export
        fhir_data = cls.fhir_bulk_import(source_file)

        flat_rows = []
        for resource in fhir_data:
            # clear data from attributes not used in FHIRflat
            for field in cls.flat_exclusions:
                setattr(resource, field, None)
            flat_rows.append(fhir2flat(resource, lists=list_resources))

        # combine the flattened rows and write them to a single parquet file
        df = pd.concat(flat_rows)
        return df.to_parquet(output_name)

    def to_flat(self, filename: str) -> None:
        """
        Generates a FHIRflat parquet file from the resource.
@@ -111,17 +145,11 @@ def to_flat(self, filename: str) -> None:
             FHIRflat file containing condition data
         """

-        # TODO: add support for lists of fhir resources, most likely from a fhir bundle
-        # or single file json output.
-        # Most likely the input format from FHIR bulk export or for import into FHIR
-        # server will be ndjson as referenced in
-        # https://build.fhir.org/ig/HL7/bulk-data/export.html.
-
         # identify attributes that are lists of FHIR types
         list_resources = self.attr_lists()

         # clear data from attributes not used in FHIRflat
-        for field in [x for x in self.elements_sequence() if x in self.flat_exclusions]:
+        for field in self.flat_exclusions:
             setattr(self, field, None)
             list_resources.remove(field) if field in list_resources else None
