FHIR export file to flat (#21)

* Update the README file
* Create a fhir_file_to_flat function

Enables direct conversion from FHIR .ndjson export to FHIRflat .parquet
globaldothealth · May 3, 2024 · 5b65ced
1 parent 42d4cb2
commit 5b65ced
Showing 4 changed files with 238 additions and 11 deletions.
diff --git a/README.md b/README.md
@@ -20,6 +20,13 @@ data= {
 patient = Patient(**data)
 ```
 
+or in bulk from a FHIR export as an .ndjson file.
+```
+from fhir.resources.patient import Patient
+
+patients = Patient.fhir_bulk_import("patient_export.ndjson")
+```
+
 ### To FHIRflat
 Once initialised, FHIR resources can be transformed to FHIRflat files using the `to_flat()` function like this
 ```
@@ -30,9 +37,170 @@ which will produce a [parquet file](https://towardsdatascience.com/demystifying-
 |--------------|------|--------|------------|-----------------|
 | Patient      | f001 | male   | 1996-05-30 | False           |
 
+or a FHIRflat file can be generated directly from a FHIR .ndjson export file.
+```
+from fhir.resources.patient import Patient
+
+Patient.fhir_file_to_flat("patient_export.ndjson")
+```
+will create a FHIRflat parquet file, named after the resource type unless an `output_name` is supplied.
+This first initialises a Patient data class for each row to make use of the Pydantic 
+data validation, then creates a FHIRflat file.
+
 ### From FHIRflat
 FHIR resources can also be created directly from FHIRflat files
 ```
 Patient.from_flat("patient_flat.parquet")
 ```
-which will return either a single Patient resource, or a list of Patient resources.
+which will return either a single Patient resource, or a list of Patient resources if 
+the Parquet file contains multiple rows of data.
+
+### Specification
+
+The FHIRflat structure closely follows that of FHIR, and simply flattens nested columns
+in a manner similar to `pd.json_normalize()`. Some fields are excluded, either because they are only used for convenience within a FHIR server, because they contain information not relevant within ISARIC clinical data, or because they would contain personally identifiable information (PII). These fields can be accessed and edited for each resource using the `flat_exclusions` property. FHIRflat differs from a plain normalisation of a FHIR structure in a few specific ways, noted below.
+
+1. **codeableConcepts**
+
+    CodeableConcepts are converted into two lists, one of codes and one of the corresponding text. Each coding is compressed into a single string with the format `system|code`. The `|` symbol was chosen as it is the standard way to query codes in FHIR servers [(example)](https://www.hl7.org/fhir/search.html#3.2.1.5.5.1.3). Thus a JSON snippet containing a codeableConcept:
+    ```
+        "code": {
+            "coding": [
+                {
+                    "system": "http://loinc.org",
+                    "code": "3141-9",
+                    "display": "Body weight Measured"
+                },
+                {
+                    "system": "http://snomed.info/sct",
+                    "code": "27113001",
+                    "display": "Body weight"
+                }
+            ]
+        }
+    ```
+    is coded as two fields
+    | code.code                                                        | code.text                               |
+    |------------------------------------------------------------------|-----------------------------------------|
+    | ["http://loinc.org\|3141-9", "http://snomed.info/sct\|27113001"] | ["Body weight Measured", "Body weight"] |
+
+    Note that the external `coding` label is removed.
+
+2. **References**
+
+    References are stored as a string containing the name of the resource and its ID, separated by a forward slash.
+    ```
+    "subject": {
+        "reference": "Patient/f001",
+        "display": "Donald Duck"
+        }
+    ```
+    becomes 
+    | subject.reference |
+    |-------------------|
+    |"Patient/f001"     |
+
+    The display text will not be converted due to the risk of revealing identifying information (e.g., a patient's name).
+
+3. **Extensions**
+
+    The base FHIR schema can be extended to meet the needs of individual implementations using extension fields. FHIRflat displays these with the extension `url` as part of the column name. For example
+
+    ```
+    "extension": [
+        {
+            "url": "timingPhase",
+            "valueCodeableConcept": {
+                "coding": [
+                    {
+                        "system": "http://snomed.info/sct",
+                        "code": "278307001",
+                        "display": "on admission",
+                    }
+                ]
+            },
+        },
+        {
+            "url": "relativePeriod",
+            "extension": [
+                {"url": "relativeStart", "valueInteger": 2},
+                {"url": "relativeEnd", "valueInteger": 5},
+            ],
+        },
+    ]
+    ```
+    becomes
+    | extension.timingPhase.code          | extension.timingPhase.text | extension.relativePeriod.relativeStart | extension.relativePeriod.relativeEnd |
+    |-------------------------------------|----------------------------|----------------------------------------|--------------------------------------|
+    | "http://snomed.info/sct\|278307001" | "on admission"             | 2                                      | 5                                    |
+
+    Complex (nested) extensions such as relativePeriod also omit the internal `extension` labels.
+
+
+4. **0..\* cardinality fields**
+
+    Fields which can contain an unspecified number of repeated entries are dealt with according to the number of entries present. Lists of length 1 are expanded out as above, while any longer lists are kept in a single column with the data in its original nested structure and `_dense` appended to the end of the field name. These fields are not expected to be queried regularly in standard analyses.
+
+    For example, the `diagnosis` field of the [Encounter](https://hl7.org/fhir/encounter.html) resource has 0..* cardinality. If a single diagnosis is present, the field is expanded out:
+    ```
+    "diagnosis": [
+        {
+            "condition": [{"reference": {"reference": "Condition/stroke"}}],
+            "use": [
+                {
+                    "coding": [
+                        {
+                            "system": "http://terminology.hl7.org/CodeSystem/diagnosis-role",
+                            "code": "AD",
+                            "display": "Admission diagnosis",
+                        }
+                    ]
+                }
+            ],
+        }
+    ]
+    ```
+    becomes
+    | diagnosis.condition.reference | diagnosis.use.code                                         | diagnosis.use.text  |
+    |-------------------------------|------------------------------------------------------------|---------------------|
+    | Condition/stroke              | "http://terminology.hl7.org/CodeSystem/diagnosis-role\|AD" | Admission diagnosis |
+
+    whereas if 2 different diagnoses are present
+    ```
+    "diagnosis": [
+        {
+            "condition": [{"reference": {"reference": "Condition/stroke"}}],
+            "use": [
+                {
+                    "coding": [
+                        {
+                            "system": "http://terminology.hl7.org/CodeSystem/diagnosis-role",
+                            "code": "AD",
+                            "display": "Admission diagnosis",
+                        }
+                    ]
+                }
+            ],
+        },
+        {
+            "condition": [{"reference": {"reference": "Condition/f201"}}],
+            "use": [
+                {
+                    "coding": [
+                        {
+                            "system": "http://terminology.hl7.org/CodeSystem/diagnosis-role",
+                            "code": "DD",
+                            "display": "Discharge diagnosis",
+                        }
+                    ]
+                }
+            ],
+        },
+    ]
+    ```
+    becomes 
+    | encounter.diagnosis_dense            |
+    |--------------------------------------|
+    |"[{"condition": [{"reference"...}]}]" |
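Note on the README specification above: the codeableConcept, reference and 0..\* rules can be sketched in a few lines of plain Python. This is only an illustration of the rules as the README describes them, not the `fhir2flat` implementation from this repository. The helper names (`flatten_codeable_concept`, `flatten_reference`, `flatten_repeating`) are hypothetical, and keeping the dense value as the untouched Python list is an assumption about how the "original nested structure" is stored.

```python
def flatten_codeable_concept(prefix: str, concept: dict) -> dict:
    """Collapse a codeableConcept into parallel lists of 'system|code' and text."""
    codings = concept.get("coding", [])
    return {
        # the outer `coding` label is dropped from the column names
        f"{prefix}.code": [f"{c['system']}|{c['code']}" for c in codings],
        f"{prefix}.text": [c.get("display") for c in codings],
    }


def flatten_reference(prefix: str, reference: dict) -> dict:
    """Keep only the 'Resource/id' string; the display text is not carried over."""
    return {f"{prefix}.reference": reference["reference"]}


def flatten_repeating(prefix: str, entries: list) -> dict:
    """0..* rule: one entry is handed on for normal expansion,
    longer lists are kept nested under '<prefix>_dense'."""
    if not entries:
        return {}
    if len(entries) == 1:
        return {prefix: entries[0]}
    return {f"{prefix}_dense": entries}  # assumption: stored as-is


weight = {
    "coding": [
        {"system": "http://loinc.org", "code": "3141-9",
         "display": "Body weight Measured"},
        {"system": "http://snomed.info/sct", "code": "27113001",
         "display": "Body weight"},
    ]
}
subject = {"reference": "Patient/f001", "display": "Donald Duck"}

row = {}
row.update(flatten_codeable_concept("code", weight))
row.update(flatten_reference("subject", subject))
print(row)
```

The resulting `row` has the `code.code`, `code.text` and `subject.reference` keys shown in the tables above, and no key derived from the reference's `display` value.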
diff --git a/fhirflat/resources/base.py b/fhirflat/resources/base.py
@@ -98,6 +98,40 @@ def fhir_bulk_import(cls, file: str) -> list[FHIRFlatBase]:
         else:
             return resources
 
+    @classmethod
+    def fhir_file_to_flat(cls, source_file: str, output_name: str | None = None):
+        """
+        Converts a .ndjson file of exported FHIR resources to a FHIRflat parquet file.
+
+        source_file: str
+            Path to the FHIR resource file.
+
+        output_name: str | None
+            Name of the parquet file to be generated. Defaults to
+            "<resource_type>.parquet" if not provided.
+
+        Returns
+        -------
+        parquet file
+            FHIRflat file containing the flattened resource data
+        """
+
+        if not output_name:
+            output_name = f"{cls.resource_type}.parquet"
+
+        # identify attributes that are lists of FHIR types and not excluded
+        list_resources = [x for x in cls.attr_lists() if x not in cls.flat_exclusions]
+
+        fhir_data = cls.fhir_bulk_import(source_file)
+
+        flat_rows = []
+        for resource in fhir_data:
+            for field in cls.flat_exclusions:
+                setattr(resource, field, None)
+            flat_rows.append(fhir2flat(resource, lists=list_resources))
+
+        df = pd.concat(flat_rows)
+        return df.to_parquet(output_name)
+
     def to_flat(self, filename: str) -> None:
         """
         Generates a FHIRflat parquet file from the resource.
@@ -111,17 +145,11 @@ def to_flat(self, filename: str) -> None:
             FHIRflat file containing condition data
         """
 
-        # TODO: add support for lists of fhir resources, most likely from a fhir bundle
-        # or single file json output.
-        # Most likely the input format from FHIR bulk export or for import into FHIR
-        # server will be ndjson as referenced in
-        # https://build.fhir.org/ig/HL7/bulk-data/export.html.
-
         # identify attributes that are lists of FHIR types
         list_resources = self.attr_lists()
 
         # clear data from attributes not used in FHIRflat
-        for field in [x for x in self.elements_sequence() if x in self.flat_exclusions]:
+        for field in self.flat_exclusions:
             setattr(self, field, None)
             list_resources.remove(field) if field in list_resources else None
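
Note on `fhir_file_to_flat`: the steps before flattening (read one resource per line, blank out excluded fields, pick a default output name) can be illustrated with the standard library alone. This is a rough sketch working on plain dicts rather than the Pydantic resource classes used in the diff. `read_ndjson`, `clear_excluded` and `default_output_name` are hypothetical helpers, and `"text"` is used as an example exclusion only; the actual contents of `flat_exclusions` are not shown in this commit.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator, Optional


def read_ndjson(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of an .ndjson file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)


def clear_excluded(resource: dict, exclusions: list) -> dict:
    """Dict analogue of `setattr(resource, field, None)` for each excluded field."""
    return {k: (None if k in exclusions else v) for k, v in resource.items()}


def default_output_name(resource_type: str, output_name: Optional[str] = None) -> str:
    """Mirror the fallback in the diff: '<resource_type>.parquet' when no name is given."""
    return output_name or f"{resource_type}.parquet"


# Usage: write a two-line export to a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    export = Path(tmp) / "patient_export.ndjson"
    export.write_text(
        '{"resourceType": "Patient", "id": "f001", "gender": "male", '
        '"text": {"status": "generated"}}\n'
        '{"resourceType": "Patient", "id": "f002", "gender": "female", '
        '"text": {"status": "generated"}}\n',
        encoding="utf-8",
    )
    rows = [clear_excluded(r, ["text"]) for r in read_ndjson(str(export))]

print(rows)
print(default_output_name("Patient"))
```

In the committed method each cleaned resource is then passed to `fhir2flat`, and the per-resource frames are combined with `pd.concat` before being written to parquet.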