Dataframer/Pivot with Facet Management #23

Open
bwalsh opened this issue Jul 10, 2024 · 4 comments

Comments

@bwalsh
Collaborator

bwalsh commented Jul 10, 2024

Use Cases:

  • As a data engineer, I need to transform FHIR Observations to a flattened DataFrame with Observation Codes as columns for analysis and reporting purposes.
  • As a UI developer, I need to fetch and display patient-specific observations from a FHIR server in a tabular format with observation codes as columns. Given there are many observations, I need an efficient way to group these facets into categories.

Example

image

References

Goals/Forces/Motivation:

The primary reason for using the pivot function is to reshape the data. It transforms data from long to wide format, which helps when comparing different variables more effectively. This reshaping is fundamental in preparing datasets for analysis or visualization as it allows for a more structured and readable form of data representation.

Data pivoting is the process of transforming data from a long format (rows) to a wide format (columns), typically to make it more understandable or suitable for analysis.
Common scenarios for data pivoting include summarizing data, creating cross-tabulations, or presenting data in a more structured format for reporting purposes.
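
As a minimal illustration of the long-to-wide reshaping described above, the sketch below uses pandas `DataFrame.pivot`. The patient identifiers, codes, and values are made up for the example and do not come from any real dataset.

```python
import pandas as pd

# Long format: one row per (patient, code) measurement. All values are invented.
long_df = pd.DataFrame(
    [
        {"patient": "p1", "code": "heart_rate", "value": 72},
        {"patient": "p1", "code": "body_temperature", "value": 37.1},
        {"patient": "p2", "code": "heart_rate", "value": 80},
    ]
)

# Wide format: one row per patient, one column per code.
# Cells with no matching measurement are left as NaN.
wide_df = long_df.pivot(index="patient", columns="code", values="value")
print(wide_df)
```

Note that `pivot` raises an error if the same (index, column) pair occurs more than once; `pivot_table` with an aggregation function is the usual alternative in that case.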

See R's tidyverse

Tidy data refers to ‘rectangular’ data. These are the data we typically see in spreadsheet software like Google Sheets or Microsoft Excel, or in a relational database like MySQL, PostgreSQL, or Microsoft Access. The three principles for tidy data are:

  • Variables make up the columns
  • Observations (or cases) go in the rows
  • Values are in cells

Put them together, and these three statements make up the contents in a tidy data frame or tibble. While these principles might seem obvious at first, many of the data arrangements we encounter in real life don’t adhere to this guidance.

In the context of FHIR, the Observation is the "tidy data" (aka “indexed”) and has a defined, fixed schema.
The dataframe is the "wide data" or "Cartesian" data (see ggplot2); its schema is not fixed, as the "variables" are defined by Observation.code.

image image

Workflow

  • Create dataframe and maintain ES dataframe mapping (g3t meta dataframe)
  • Based on newly submitted Observations, maintain a hierarchy of facets for the dataframe - driven by the Observation.category and Observation.code
  • Automatically update the explorer configuration while reading submitted data
  • Signal the guppy service to restart after updating the ES mapping
  • Signal the portal to read the new explorer configuration on change

Data Frame Creation:

Create an initial pandas DataFrame with columns for each unique observation code extracted from the observations.
Each row represents a set of observations with its associated attributes (subject, encounter, value, effectiveDateTime, etc.) for a given subject, specimen, focus at a given time.
Use observation codes as columns to ensure each code has its dedicated column in the DataFrame.
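
A minimal sketch of this step, assuming the Observations have already been reduced to simple dicts. The keys `subject`, `code`, `value`, and `age` are hypothetical simplifications for illustration, not the FHIR schema and not the existing g3t implementation.

```python
import pandas as pd

# Hypothetical, heavily simplified observations (not complete FHIR resources).
observations = [
    {"subject": "Patient/example", "code": "fever", "value": True, "age": 601},
    {"subject": "Patient/example", "code": "gleason_score", "value": "Grade Group 3", "age": 600},
]

# One column per unique observation code.
codes = sorted({obs["code"] for obs in observations})

# One row per (subject, age); each observation fills in its own code column.
rows = {}
for obs in observations:
    key = (obs["subject"], obs["age"])
    row = rows.setdefault(key, {"subject": obs["subject"], "age": obs["age"]})
    row[obs["code"]] = obs["value"]

# Codes that a row never observed are left as NaN.
df = pd.DataFrame(list(rows.values()), columns=["subject", "age", *codes])
print(df)
```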

Denormalization by Observation.subject:

  • Identify the subject resourceType attribute within each Observation resource, which represents the entity (e.g., patient, device, location) to which the observation applies.
  • Normalize the value[x] attribute in the Observation resource, extracting the value based on the data type specified in the value[x] field (e.g., valueQuantity, valueCodeableConcept, valueString).
  • Temporal Data Extraction:
    • Extract the effectiveDateTime attribute from the Observation resource, representing the date and time of the observation.
    • If the Observation has no effectiveDateTime, fall back to the onsetAge or the effectiveDateTime attribute of the specimen or focus resource, with the temporal data represented as an age. If no temporal data is available, the field should be null.
    • In any case, ensure that a new dataframe row is created for each distinct temporal value.
  • Retrieve the dataframe row from the working set, or create it
    • If we need to create the row:
      • retrieve the subject, specimen and focus entities referenced in the Observation resource, expanding the DataFrame to include their scalar attributes and codings; prefix each attribute with the lower-case resourceType, e.g. patient_*, specimen_*
      • create an identifier for the row based on the subject, specimen, focus, and temporal data (age), ensuring uniqueness.
  • Add a new column to the dataframe
  • Verify that the new column exists in explorer_config, or add it if it should be there
  • After processing all Observations, write dataframe to ES, maintaining the ES index mapping
    • If the explorerConfig is updated:
      • upsert the document in ES.
      • signal the guppy & portal to read the new configuration
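
The value[x] normalization and row-identifier steps above could be sketched roughly as follows. The helper names `extract_value` and `row_id` are hypothetical, and the sketch assumes each Observation is a plain dict parsed from FHIR JSON; it does not implement the temporal fallback rules or the ES write.

```python
from typing import Any, Optional


def extract_value(observation: dict) -> Any:
    """Return a simple value for whichever value[x] element is present, else None."""
    for key, value in observation.items():
        if not key.startswith("value"):
            continue
        if key == "valueQuantity":
            # Keep only the numeric part; the unit is dropped in this sketch.
            return value.get("value")
        if key == "valueCodeableConcept":
            codings = value.get("coding", [])
            # Prefer the concept text, then the first coding's display or code.
            return value.get("text") or next(
                (c.get("display") or c.get("code") for c in codings), None
            )
        # Primitive types (valueString, valueBoolean, ...) are returned as-is;
        # other complex types are returned unchanged in this sketch.
        return value
    return None


def row_id(subject: str, specimen: Optional[str], focus: Optional[str],
           age: Optional[float]) -> str:
    """Build a deterministic row identifier from subject, specimen, focus and age."""
    parts = [subject, specimen or "", focus or "", "" if age is None else str(age)]
    return "|".join(parts)


print(extract_value({"valueString": "Blue"}))
print(row_id("Patient/example", "Specimen/example-specimen", None, 600))
```

A joined string keeps the identifier readable while debugging; hashing it would be an equally reasonable choice if a fixed-length ES document id is preferred.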

Facet Management Category and Coding Extraction:

While creating the "dataframe" we also need to discover and maintain a hierarchy of facets.

  • Before reading incoming Observations, read the new explorer_config ES index to retrieve current explorer configuration.
    • This index will contain a single document, that will contain the explorerConfig needed by the front end
  • Specifically, maintain a "has changed" flag on that document.
  • Changes to dataframer:
    • In the case of multiple code values, select the well known code (e.g., LOINC, SNOMED) for normalization.
    • Apply "string to variable name conversion" to normalize 'code' to column name.
    • Maintain a "gitops" explorer fragment:
explorerConfig:
  {{ for all unique subject.resourceType }}
  - tabTitle: <resourceType>  # e.g., Patient, Specimen, etc.
    charts: []  # manually add charts
    filters:
      {{for all categories}}  # e.g., 'Vital Signs', 'Laboratory', etc.
        tabs:
          - title: category
            fields: <codes from category>  # e.g., 'heart_rate', 'blood_pressure', etc.
                       <scalars and extensions>
    table:
        enabled: true
        detailsConfig: # manually add detailsConfig
        fields:
          # all flattened references
          # all codes

Notes: Out of scope or static elements in explorerConfig.

  • we expect that the tabTitle corresponds 1:1 with a dataframer; as such, we expect that the entry in explorerConfig will be initialized as the dataframer is developed. In other words, if the tabTitle does not yet exist, the dataframer will initialize it with [detailsConfig, charts], etc.

Example: from Prostate_Microenvironments

image

Summary view of dataframe e.g. patient-centric

image

Changes to guppy config

# Guppy configuration
guppy:
  enabled: true
  dbRestore: false
  indices:
  - index: observation
    type: observation
  - index: file
    type: file
# added to support facet management
  - index: explorer_config
    type: explorerConfig

  configIndex: gen3.aced.io_array-config

End to End

image

Guppy PR link: uc-cdis/guppy#273

FF issue: ACED-IDP/gen3-frontend-framework#12

Existing work

 g3t meta dataframe --help
Usage: g3t meta dataframe [OPTIONS] [DIRECTORY_PATH] [OUTPUT_PATH]

  Render a metadata dataframe.

  directory_path: The directory path to the metadata.
  output_path: The output path for the dataframe. default [meta.csv]

Options:
  --dtale                         Open the graph in a browser using the dtale
                                  package for interactive data exploration.
  --data_type [Patient|Specimen|Observation|DocumentReference]
                                  Create a data frame for a specific data
                                  type.  [required]
  --debug

E.G.

g3t meta dataframe --data_type Observation META/ --dtale

image

Discussion points

  • Independent of submission ("push") method
  • Dependency on guppy PR (guppy/_restart)
  • Responsibility for facet category delegated to Observation.category/.code. AKA - etl driven facet hierarchy
  • Does guppy's graphql support fetching a nested document (explorerConfig)?
  • Does FEF support reading explorerConfig dynamically via graphql? (cf. packages/sampleCommons/config/aced/explorer.json)
    • If guppy/graphql does not support nested documents, can the FEF read the document dynamically via a URL (public bucket)?
@bwalsh
Collaborator Author

bwalsh commented Jul 31, 2024

A worked example

Consider this bundle (see fsh editor for shorthand.)

image
Alias: $sct = http://snomed.info/sct
Alias: $condition-category = http://terminology.hl7.org/CodeSystem/condition-category
Alias: $observation-category = http://terminology.hl7.org/CodeSystem/observation-category
Alias: $loinc = http://loinc.org
Alias: $mylab = http://mylab.org

Instance: undefined
InstanceOf: Bundle
Usage: #example
* type = #bundle
* entry[0].resource = example
* entry[+].resource = example-specimen
* entry[+].resource = example-cancer
* entry[+].resource = example-common-cold
* entry[+].resource = example-fever
* entry[+].resource = example-gleason-score
* entry[+].resource = example-favorite-color

Instance: example
InstanceOf: Patient
Usage: #inline
* birthSex.coding.system = "http://terminology.hl7.org/CodeSystem/v3-AdministrativeGender"
* birthSex.coding.code = "M"

Instance: example-specimen
InstanceOf: Specimen
Usage: #inline
* subject = Reference(example)
* type = $sct#122555 "Biopsy"
* collection.bodySite.coding.system = "http://snomed.info/sct"
* collection.bodySite.coding.code = "122456"
* collection.bodySite.coding.display = "Prostate"
* processing.method = $sct#787376009 "Preparation of formalin fixed paraffin embedded tissue specimen"

Instance: example-cancer
InstanceOf: Condition
Usage: #inline
* subject = Reference(example)
* category = $condition-category#encounter-diagnosis
* code = $sct#123456 "Cancer"
* onsetAge = 600 'm' "months"
* evidence.reference = "Observation/example-gleason-score"

Instance: example-common-cold
InstanceOf: Condition
Usage: #inline
* subject = Reference(example)
* category = $condition-category#encounter-diagnosis
* code = $sct#7890 "Common Cold"
* onsetAge = 601 'm' "months"
* evidence.reference = "Observation/example-fever"

Instance: example-fever
InstanceOf: Observation
Usage: #inline
* subject = Reference(example)
* focus = Reference(example)
* category = $observation-category#vital-signs
* code = $loinc#45701-0 "Fever"
* valueBoolean = true
* effectiveAge.value = 601
* effectiveAge.code = "m"
* effectiveAge.system = "http://unitsofmeasure.org"
* effectiveAge.unit = "months"

Instance: example-gleason-score
InstanceOf: Observation
Usage: #inline
* subject = Reference(example)
* focus = Reference(example-specimen)
* category = $observation-category#laboratory
* code = $loinc#94734-1 "Gleason score"
* valueCodeableConcept = $loinc#LA30796-9 "ISUP Grade (Grade Group) 3 (Gleason score 4+3=7)"
* effectiveAge.value = 600
* effectiveAge.code = "m"
* effectiveAge.system = "http://unitsofmeasure.org"
* effectiveAge.unit = "months"

Instance: example-favorite-color
InstanceOf: Observation
Usage: #inline
* subject = Reference(example)
* focus = Reference(example)
* category = $observation-category#survey
* code = $mylab#favorite-color "Favorite color"
* valueString = "Blue"

Resulting dataframe

(Note that onsetAge, a temporal field, was used to prompt a new row.)

| patient | birthSex | favorite_color | condition_code | onsetAge | gleason_score | fever | specimen | specimen_type | specimen_collection_body_site | specimen_processing_method |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| example | M | Blue | Cancer | 600 | ISUP Grade (Grade Group) 3 (Gleason score 4+3=7) | | example-specimen | Biopsy | Prostate | Preparation of formalin fixed paraffin embedded tissue specimen |
| example | M | Blue | Common Cold | 601 | | TRUE | | | | |

Resulting Facet Hierarchy

(source resource included for completeness)

| category | facet | resource |
| --- | --- | --- |
| patient | patient | Patient |
| patient | birthSex | Patient |
| survey | favorite_color | Observation |
| condition | condition_code | Condition |
| condition | onsetAge | Condition |
| laboratory | gleason_score | Observation |
| vital-signs | fever | Observation |
| specimen | specimen | Specimen |
| specimen | specimen_type | Specimen |
| specimen | specimen_collection_body_site | Specimen |
| specimen | specimen_processing_method | Specimen |
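
As a rough sketch of how the facet hierarchy above could be grouped into the `tabs` entries of the explorerConfig filters, the snippet below collects facets by category. The function name `build_filter_tabs` and the exact output shape (`title`/`fields` dicts) are assumptions based on the explorerConfig fragment earlier in this issue, not a confirmed schema.

```python
from collections import defaultdict

# (category, facet, resource) triples copied from the facet hierarchy above.
facets = [
    ("patient", "patient", "Patient"),
    ("patient", "birthSex", "Patient"),
    ("survey", "favorite_color", "Observation"),
    ("condition", "condition_code", "Condition"),
    ("condition", "onsetAge", "Condition"),
    ("laboratory", "gleason_score", "Observation"),
    ("vital-signs", "fever", "Observation"),
    ("specimen", "specimen", "Specimen"),
    ("specimen", "specimen_type", "Specimen"),
    ("specimen", "specimen_collection_body_site", "Specimen"),
    ("specimen", "specimen_processing_method", "Specimen"),
]


def build_filter_tabs(facets):
    """Group facet names by category, preserving first-seen order."""
    fields_by_category = defaultdict(list)
    for category, facet, _resource in facets:
        if facet not in fields_by_category[category]:
            fields_by_category[category].append(facet)
    return [
        {"title": category, "fields": fields}
        for category, fields in fields_by_category.items()
    ]


tabs = build_filter_tabs(facets)
for tab in tabs:
    print(tab["title"], tab["fields"])
```

Comparing the generated tabs with the ones already stored in the explorer_config document would be one simple way to drive the "has changed" flag.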

@bwalsh
Collaborator Author

bwalsh commented Aug 12, 2024

@bwalsh bwalsh changed the title Dataframer with Facet Mangement Dataframer/Pivot with Facet Mangement Aug 14, 2024
@bwalsh
Collaborator Author

bwalsh commented Aug 22, 2024

@bwalsh
Collaborator Author

bwalsh commented Sep 11, 2024
