Dataframer/Pivot with Facet Management #23

bwalsh · 2024-07-10T01:14:39Z

Use Cases:

As a data engineer, I need to transform FHIR Observations to a flattened DataFrame with Observation Codes as columns for analysis and reporting purposes.
As a UI developer, I need to fetch and display patient-specific observations from a FHIR server in a tabular format with observation codes as columns. Because there are many observations, I need an efficient way to group these facets into categories.

Example

References

https://www.researchgate.net/publication/241136160_Design_Patterns_for_Relational_Databases

Goals/Forces/Motivation :

The primary reason for using the pivot function is to reshape the data. It transforms data from long to wide format, which helps when comparing different variables more effectively. This reshaping is fundamental in preparing datasets for analysis or visualization as it allows for a more structured and readable form of data representation.

Data pivoting is the process of transforming data from a long format (rows) to a wide format (columns), typically to make it more understandable or suitable for analysis.
Common scenarios for data pivoting include summarizing data, creating cross-tabulations, or presenting data in a more structured format for reporting purposes.
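
A long-to-wide pivot of this kind could be sketched with pandas as follows. The rows and the column names (`subject`, `age_months`, `code`, `value`) are made up for illustration and are not the dataframer's actual schema.

```python
import pandas as pd

# Long format: one row per observation (illustrative data only).
long_df = pd.DataFrame(
    [
        {"subject": "example", "age_months": 600, "code": "gleason_score", "value": "grade group 3"},
        {"subject": "example", "age_months": 601, "code": "fever", "value": "present"},
    ]
)

# Wide format: one row per (subject, age_months), one column per code.
# DataFrame.pivot requires each (index, column) pair to be unique.
wide_df = long_df.pivot(
    index=["subject", "age_months"],
    columns="code",
    values="value",
).reset_index()
wide_df.columns.name = None  # drop the leftover "code" axis label

print(wide_df)
```

Cells with no matching observation are left as missing values, which is the same sparsity that appears in a wide dataframe keyed by Observation.code.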

See R's tidyverse

Tidy data refers to ‘rectangular’ data. These are the data we typically see in spreadsheet software like Google Sheets, Microsoft Excel, or in a relational database like MySQL, PostgreSQL, or Microsoft Access. The three principles for tidy data are:

Variables make up the columns

Observations (or cases) go in the rows

Values are in cells
Put them together, and these three statements make up the contents in a tidy data frame or tibble. While these principles might seem obvious at first, many of the data arrangements we encounter in real life don’t adhere to this guidance.

In the context of FHIR, the Observation is the "tidy data" (also called "indexed") and has a defined, fixed schema.
The dataframe is the "wide" or "Cartesian" data (see ggplot2); its schema is not fixed, because its "variables" are defined by Observation.code.

Workflow

Create dataframe and maintain ES dataframe mapping (g3t meta dataframe)
Based on newly submitted Observations, maintain a hierarchy of facets for the dataframe - driven by the Observation.category and Observation.code
Automatically update the explorer configuration while reading submitted data
Signal the guppy service to restart after updating the ES mapping
Signal the portal to read the new explorer configuration on change

Data Frame Creation:

Create an initial pandas DataFrame with columns for each unique observation code extracted from the observations.
Each row represents a set of observations with their associated attributes (subject, encounter, value, effectiveDateTime, etc.) for a given subject, specimen, and focus at a given time.
Use observation codes as columns to ensure each code has its dedicated column in the DataFrame.

Denormalization by Observation.subject:

Identify the subject resourceType attribute within each Observation resource, which represents the entity (e.g., patient, device, location) to which the observation applies.
Normalize the value[x] attribute in the Observation resource, extracting the value based on the data type specified in the value[x] field (e.g., valueQuantity, valueCodeableConcept, valueString).
Temporal Data Extraction:
- Extract the effectiveDateTime attribute from the Observation resource, representing the date and time of the observation.
- If there is no Observation.effectiveDateTime, include the onsetAge, or use the effectiveDateTime attribute from the specimen or focus resource, as the temporal data, represented as an age. If no temporal data is available, the field should be null.
- In any case, ensure that a new dataframe row is created for each distinct temporal value.
Retrieve from working or create the dataframe row
- If we need to create the row:
  - retrieve the subject, specimen, and focus entities referenced in the Observation resource, expanding the DataFrame to include their scalar attributes and codings; prefix each attribute with the lower-case resourceType, e.g. patient_*, specimen_*
  - create an identifier for the row based on the subject, specimen, focus, and temporal data (age), ensuring uniqueness.
Add a new column to the data frame
Verify new column exists or should be added to explorer_config
After processing all Observations, write dataframe to ES, maintaining the ES index mapping
- If the explorerConfig is updated:
  - upsert the document in ES.
  - signal the guppy & portal to read the new configuration
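
The row-building steps above might be sketched as follows. This is a minimal illustration under stated assumptions, not the dataframer's implementation: `extract_value`, `extract_time`, `row_key`, and `build_rows` are hypothetical names, only a few value[x] types are handled, the column name is taken naively from the first coding's code, and the ES and explorer_config steps are left out.

```python
from typing import Any, Optional


def extract_value(observation: dict) -> Any:
    """Return a scalar for whichever value[x] element is present, else None."""
    if "valueQuantity" in observation:
        return observation["valueQuantity"].get("value")
    if "valueCodeableConcept" in observation:
        concept = observation["valueCodeableConcept"]
        if concept.get("text"):
            return concept["text"]
        codings = concept.get("coding", [])
        if codings:
            return codings[0].get("display") or codings[0].get("code")
        return None
    for key in ("valueString", "valueBoolean", "valueInteger", "valueDateTime"):
        if key in observation:
            return observation[key]
    return None


def extract_time(observation: dict, fallback: Optional[Any] = None) -> Optional[Any]:
    """Prefer effectiveDateTime; otherwise use a caller-supplied fallback
    (for example an age taken from a related resource); else None."""
    return observation.get("effectiveDateTime", fallback)


def row_key(observation: dict, fallback_time: Optional[Any] = None) -> tuple:
    """Identify a row by subject, specimen, focus, and temporal value."""
    subject = observation.get("subject", {}).get("reference")
    specimen = observation.get("specimen", {}).get("reference")
    focus = tuple(ref.get("reference") for ref in observation.get("focus", []))
    return (subject, specimen, focus, extract_time(observation, fallback_time))


def build_rows(observations: list) -> dict:
    """Retrieve or create the row for each Observation, then set its column."""
    rows: dict = {}
    for obs in observations:
        key = row_key(obs)
        row = rows.setdefault(key, {"subject": key[0], "time": key[3]})
        column = obs["code"]["coding"][0]["code"]  # naive column choice
        row[column] = extract_value(obs)
    return rows


# Illustrative input only.
observations = [
    {"subject": {"reference": "Patient/example"}, "effectiveDateTime": "2024-01-01",
     "code": {"coding": [{"code": "heart_rate"}]}, "valueQuantity": {"value": 72}},
    {"subject": {"reference": "Patient/example"}, "effectiveDateTime": "2024-01-01",
     "code": {"coding": [{"code": "fever"}]}, "valueBoolean": True},
    {"subject": {"reference": "Patient/example"}, "effectiveDateTime": "2024-02-01",
     "code": {"coding": [{"code": "heart_rate"}]}, "valueQuantity": {"value": 80}},
]
rows = build_rows(observations)
```

Keying the working rows by a tuple keeps "retrieve or create" to a single dictionary lookup; the resulting dictionaries could then be handed to a DataFrame constructor.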

Facet Management Category and Coding Extraction:

While creating the "dataframe" we also need to discover and maintain a hierarchy of facets.

Before reading incoming Observations, read the new explorer_config ES index to retrieve current explorer configuration.
- This index will contain a single document, which holds the explorerConfig needed by the front end
Specifically, maintain a "has changed" flag on that document.
Changes to dataframer:
- In the case of multiple code values, select the well known code (e.g., LOINC, SNOMED) for normalization.
- Apply "string to variable name conversion" to normalize 'code' to column name.
- Maintain a "gitops" explorer fragment:
  - see gen3 documentation
  - Apply inflection.titleize to normalize 'category' name
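
The naming rules in this list could look roughly like the sketch below. The helper names (`pick_coding`, `to_column_name`, `titleize_category`) and the preferred-system list are assumptions for illustration; `titleize_category` is a standard-library stand-in that only approximates inflection.titleize for simple hyphenated or underscored names.

```python
import re

# Assumed preference order for "well-known" code systems.
PREFERRED_SYSTEMS = ("http://loinc.org", "http://snomed.info/sct")


def pick_coding(codings, preferred=PREFERRED_SYSTEMS):
    """Return the first coding from a preferred system, else the first coding."""
    for system in preferred:
        for coding in codings:
            if coding.get("system") == system:
                return coding
    return codings[0] if codings else None


def to_column_name(text):
    """Convert a display string or code into a variable-style column name."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", text).strip("_").lower()
    if not name or name[0].isdigit():
        name = "_" + name  # avoid empty names and names starting with a digit
    return name


def titleize_category(category):
    """Turn a category code such as 'vital-signs' into a tab title."""
    words = [w for w in re.split(r"[-_\s]+", category) if w]
    return " ".join(w.capitalize() for w in words)


codings = [
    {"system": "http://mylab.org", "code": "gs", "display": "Lab Gleason"},
    {"system": "http://loinc.org", "code": "94734-1", "display": "Gleason score"},
]
chosen = pick_coding(codings)
column = to_column_name(chosen["display"])
title = titleize_category("vital-signs")
```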

explorerConfig:
  {{ for all unique subject.resourceType }}
  - tabTitle: <resourceType>  # e.g., Patient, Specimen, etc.
    charts: []  # manually add charts
    filters:
      {{for all categories}}  # e.g., 'Vital Signs', 'Laboratory', etc.
        tabs:
          - title: category
            fields: <codes from category>  # e.g., 'heart_rate', 'blood_pressure', etc.
                       <scalars and extensions>
    table:
        enabled: true
        detailsConfig: # manually add detailsConfig
        fields:
          # all flattened references
          # all codes
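
One way to render the template above from a discovered facet hierarchy is sketched here. The function name `build_explorer_config` and the input shape (resourceType, then category title, then field names) are assumptions; the output mirrors only the keys shown in the template, and the exact schema expected by the front end is not confirmed by this sketch.

```python
import json


def build_explorer_config(facets_by_resource):
    """Build one explorer tab per resourceType, with one filter tab per category."""
    tabs = []
    for resource_type, categories in facets_by_resource.items():
        all_fields = [field for fields in categories.values() for field in fields]
        tabs.append(
            {
                "tabTitle": resource_type,
                "charts": [],  # charts are added manually
                "filters": {
                    "tabs": [
                        {"title": title, "fields": list(fields)}
                        for title, fields in categories.items()
                    ]
                },
                # detailsConfig is added manually and is omitted here
                "table": {"enabled": True, "fields": all_fields},
            }
        )
    return {"explorerConfig": tabs}


# Illustrative hierarchy only.
facets = {
    "Patient": {
        "Vital Signs": ["fever"],
        "Laboratory": ["gleason_score"],
    }
}
config = build_explorer_config(facets)
print(json.dumps(config, indent=2))
```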

Notes: Out of scope or static elements in explorerConfig.

We expect that the tabTitle corresponds 1:1 with a dataframer; as such, we expect that the entry in the explorer config will be initialized as the dataframer is developed. In other words, if the tabTitle, the dataframer will initialize it with [detailsConfig, charts] etc.

Example: from `Prostate_Microenvironments`

Summary view of dataframe e.g. patient-centric

Changes to guppy config

# Guppy configuration
guppy:
  enabled: true
  dbRestore: false
  indices:
  - index: observation
    type: observation
  - index: file
    type: file
# added to support facet management
  - index: explorer_config
    type: explorerConfig

  configIndex: gen3.aced.io_array-config

End to End

Guppy PR link: uc-cdis/guppy#273

FF issue: ACED-IDP/gen3-frontend-framework#12

Existing work

 g3t meta dataframe --help
Usage: g3t meta dataframe [OPTIONS] [DIRECTORY_PATH] [OUTPUT_PATH]

  Render a metadata dataframe.

  directory_path: The directory path to the metadata.
  output_path: The output path for the dataframe. default [meta.csv]

Options:
  --dtale                         Open the graph in a browser using the dtale
                                  package for interactive data exploration.
  --data_type [Patient|Specimen|Observation|DocumentReference]
                                  Create a data frame for a specific data
                                  type.  [required]
  --debug

E.G.

g3t meta dataframe --data_type Observation META/ --dtale

Discussion points

Independent of submission ("push") method
Dependency on guppy PR (guppy/_restart)
Responsibility for facet category delegated to Observation.category/.code. AKA - etl driven facet hierarchy
Does guppy's graphql support fetching a nested document (explorerConfig)?
Does FEF support reading explorerConfig dynamically via graphql? (See packages/sampleCommons/config/aced/explorer.json.)
- If guppy/graphql does not support a nested document, can the FEF read the document dynamically via a URL (public bucket)?


bwalsh · 2024-07-31T21:03:08Z

A worked example

Consider this bundle (see fsh editor for shorthand.)

Alias: $sct = http://snomed.info/sct
Alias: $condition-category = http://terminology.hl7.org/CodeSystem/condition-category
Alias: $observation-category = http://terminology.hl7.org/CodeSystem/observation-category
Alias: $loinc = http://loinc.org
Alias: $mylab = http://mylab.org

Instance: undefined
InstanceOf: Bundle
Usage: #example
* type = #bundle
* entry[0].resource = example
* entry[+].resource = example-specimen
* entry[+].resource = example-cancer
* entry[+].resource = example-common-cold
* entry[+].resource = example-fever
* entry[+].resource = example-gleason-score
* entry[+].resource = example-favorite-color

Instance: example
InstanceOf: Patient
Usage: #inline
* birthSex.coding.system = "http://terminology.hl7.org/CodeSystem/v3-AdministrativeGender"
* birthSex.coding.code = "M"

Instance: example-specimen
InstanceOf: Specimen
Usage: #inline
* subject = Reference(example)
* type = $sct#122555 "Biopsy"
* collection.bodySite.coding.system = "http://snomed.info/sct"
* collection.bodySite.coding.code = "122456"
* collection.bodySite.coding.display = "Prostate"
* processing.method = $sct#787376009 "Preparation of formalin fixed paraffin embedded tissue specimen"

Instance: example-cancer
InstanceOf: Condition
Usage: #inline
* subject = Reference(example)
* category = $condition-category#encounter-diagnosis
* code = $sct#123456 "Cancer"
* onsetAge = 600 'm' "months"
* evidence.reference = "Observation/example-gleason-score"

Instance: example-common-cold
InstanceOf: Condition
Usage: #inline
* subject = Reference(example)
* category = $condition-category#encounter-diagnosis
* code = $sct#7890 "Common Cold"
* onsetAge = 601 'm' "months"
* evidence.reference = "Observation/example-fever"

Instance: example-fever
InstanceOf: Observation
Usage: #inline
* subject = Reference(example)
* focus = Reference(example)
* category = $observation-category#vital-signs
* code = $loinc#45701-0 "Fever"
* valueBoolean = true
* effectiveAge.value = 601
* effectiveAge.code = "m"
* effectiveAge.system = "http://unitsofmeasure.org"
* effectiveAge.unit = "months"

Instance: example-gleason-score
InstanceOf: Observation
Usage: #inline
* subject = Reference(example)
* focus = Reference(example-specimen)
* category = $observation-category#laboratory
* code = $loinc#94734-1 "Gleason score"
* valueCodeableConcept = $loinc#LA30796-9 "ISUP Grade (Grade Group) 3 (Gleason score 4+3=7)"
* effectiveAge.value = 600
* effectiveAge.code = "m"
* effectiveAge.system = "http://unitsofmeasure.org"
* effectiveAge.unit = "months"

Instance: example-favorite-color
InstanceOf: Observation
Usage: #inline
* subject = Reference(example)
* focus = Reference(example)
* category = $observation-category#survey
* code = $mylab#favorite-color "Favorite color"
* valueString = "Blue"

Resulting dataframe

(Note that onsetAge, a temporal field, was used to prompt a new row.)

patient	birthSex	favorite_color	condition_code	onsetAge	gleason_score	fever	specimen	specimen_type	specimen_collection_body_site	specimen_processing_method
example	M	Blue	Cancer	600	ISUP Grade (Grade Group) 3 (Gleason score 4+3=7)		example-specimen	Biopsy	Prostate	Preparation of formalin fixed paraffin embedded tissue specimen
example	M	Blue	Common Cold	601		TRUE

Resulting Facet Hierarchy

(source resource included for completeness)

category	facet	resource
patient	patient	Patient
patient	birthSex	Patient
survey	favorite_color	Observation
condition	condition_code	Condition
condition	onsetAge	Condition
laboratory	gleason_score	Observation
vital-signs	fever	Observation
specimen	specimen	Specimen
specimen	specimen_type	Specimen
specimen	specimen_collection_body_site	Specimen
specimen	specimen_processing_method	Specimen
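
Maintaining this hierarchy incrementally, together with the "has changed" flag described earlier, might be sketched as below. `merge_facets` is a hypothetical helper; the rows are copied from the table above.

```python
# (category, facet, resource) rows from the facet hierarchy table above.
FACET_ROWS = [
    ("patient", "patient", "Patient"),
    ("patient", "birthSex", "Patient"),
    ("survey", "favorite_color", "Observation"),
    ("condition", "condition_code", "Condition"),
    ("condition", "onsetAge", "Condition"),
    ("laboratory", "gleason_score", "Observation"),
    ("vital-signs", "fever", "Observation"),
    ("specimen", "specimen", "Specimen"),
    ("specimen", "specimen_type", "Specimen"),
    ("specimen", "specimen_collection_body_site", "Specimen"),
    ("specimen", "specimen_processing_method", "Specimen"),
]


def merge_facets(hierarchy, rows):
    """Add unseen facets to hierarchy in place; return True if anything changed."""
    changed = False
    for category, facet, _resource in rows:
        fields = hierarchy.setdefault(category, [])
        if facet not in fields:
            fields.append(facet)
            changed = True
    return changed


hierarchy = {}
first_pass_changed = merge_facets(hierarchy, FACET_ROWS)   # new facets discovered
second_pass_changed = merge_facets(hierarchy, FACET_ROWS)  # nothing new
```

Only a pass that reports a change would need to upsert the configuration document and signal downstream services.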

bwalsh · 2024-08-12T16:14:59Z

Additional gen3 FEF documentation for explorer config. https://github.com/uc-cdis/gen3-frontend-framework/blob/develop/docs/Configuration/Explorer.md#selection-facet

bwalsh · 2024-08-22T15:41:18Z

Prototype here: https://github.com/ACED-IDP/gen3_util/blob/763b1f1a1f3aa3e35b60be910662983e99c6aef9/tests/unit/dataframer/test_dataframer.py#L209

bwalsh · 2024-09-11T20:54:21Z

Production example:

bwalsh mentioned this issue Jul 25, 2024

Bundle Submission #9

Open

bwalsh changed the title ~~Dataframer with Facet Mangement~~ Dataframer/Pivot with Facet Mangement Aug 14, 2024
