
Mark which resources are loaded for which patients (i.e. "completion tracking") #296

Closed
mikix opened this issue Feb 9, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

mikix commented Feb 9, 2024

This comes from a study need:

  • Sometimes loading data from the EHR can take a long time and/or can happen in fits and starts. You may take weeks to fully finish loading a set of cohorts.
  • During that time, studies (including the core__ tables) would probably want to ignore patients that don't have the resources they care about loaded.
  • i.e. studies want to be able to know whether the Conditions table is accurate for patient X or not - and exclude the patient if not.

My initial thoughts on this are to have the ETL keep a metadata table around, marking which resources are "finished" at the Group level. And then which patients belong to which Groups. That way a study could ask if patient X has Conditions yet.

Brainstorming for that approach:

  • UX for ETL:
    • --auto-mark (looks for log file adjacent to input files, or from URL if ETL is doing export)
    • --mark GROUP_ID (alternative option if user wants to override)
    • If no mark detected or given, all this logic below will be skipped
    • Not in love with "mark"... --mark-complete --mark-finished --complete --finish...
  • ETLing a resource group:
    • writes to a table like etl_complete
    • with columns (GROUP_ID, RESOURCE_NAME, ETL_DATE, NEWEST_LAST_UPDATED_DATE, OLDEST_LAST_UPDATED_DATE)
    • ETL will add new row when it successfully finishes a run
    • Open question: should I update a single unique group_id/resource_name row instead of appending to table? Might make querying less messy? But at cost of some potentially useful log records.
  • ETLing ~~patients~~ encounters:
    • When uploading ~~patients~~ encounters, ETL will also write all row IDs to a table like ~~etl_patient_groups~~ etl_encounter_groups
    • with columns (~~PATIENT_ID~~ ENCOUNTER_ID, GROUP_ID)
    • Non-unique, as resources can be in multiple groups
  • Library can tell which resources we have uploaded at least once for a given encounter by doing something like:

    ```sql
    SELECT DISTINCT etl_complete.resource_name
    FROM etl_encounter_groups
    INNER JOIN etl_complete
    ON etl_complete.group_id = etl_encounter_groups.group_id
    WHERE etl_encounter_groups.encounter_id = 'xxx'
    ```
  • Assumptions of this approach:
    • Exports are full / not-sliced - i.e. if fed a pile of Conditions, the ETL can assume that's all the conditions available at the time (i.e. not sliced by something like severity=mild). Slicing by date is not really supported either, but if we include the bounds of meta.lastUpdated as suggested above, slicing by that field would be fine.
    • Users would be able to keep the logs around or manually provide the group name.
    • Group exports are a suitable way to cluster users. (I guess that's just a restatement of the above two assumptions)
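The proposed tables and lookup query above can be sketched end to end. Everything concrete here is illustrative: the group/encounter IDs and dates are made up, and SQLite stands in for whatever database actually backs the tables.

```python
import sqlite3

# Sketch of the proposed completion-tracking tables, using the column
# names from the brainstorm above. All IDs and dates are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE etl_complete (
    group_id TEXT, resource_name TEXT, etl_date TEXT,
    newest_last_updated_date TEXT, oldest_last_updated_date TEXT);
CREATE TABLE etl_encounter_groups (encounter_id TEXT, group_id TEXT);
""")
con.executemany("INSERT INTO etl_complete VALUES (?, ?, ?, ?, ?)", [
    ("group-a", "Encounter", "2024-02-01", "2024-01-31", "2020-01-01"),
    ("group-a", "Condition", "2024-02-02", "2024-01-31", "2020-01-01"),
])
# Non-unique on purpose: enc-1 could also appear in another group.
con.execute("INSERT INTO etl_encounter_groups VALUES (?, ?)", ("enc-1", "group-a"))

# The query from above: which resources have been loaded at least once
# for a given encounter?
rows = con.execute("""
    SELECT DISTINCT etl_complete.resource_name
    FROM etl_encounter_groups
    INNER JOIN etl_complete
    ON etl_complete.group_id = etl_encounter_groups.group_id
    WHERE etl_encounter_groups.encounter_id = ?
""", ("enc-1",)).fetchall()
print(sorted(r[0] for r in rows))  # ['Condition', 'Encounter']
```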
@mikix mikix added the enhancement New feature or request label Feb 9, 2024
mikix commented Mar 19, 2024

Further thinking about this.

tl;dr: Let's track Encounters as our primary completeness kernel-of-truth (rather than patients, as the description above talks about). Studies can ignore any and all Encounter-linked data if it's not loaded yet.

There are two use cases I can think of:

  1. A researcher running a study wants to know "If I run the study now, is it going to be the most complete view of the data we have available to us?" - i.e. "Is now the best time to run the study, or should I wait for an ongoing data ingestion process to finish?"
  2. An engineer doing data ingestion does not want to cause studies to create misleading data while the ingestion is in-flight. Flipped around: a researcher running a study wants to feel confident that an ongoing ingestion process will not provide misleading data in the meantime. (e.g. no Conditions for an Encounter because they haven't been loaded in yet)

Those are closely related, but a little different. The first is about incompleteness at a broad scale leading to a diminished ability to do meaningful analysis. The second is about incompleteness at a small scale leading to inaccurate analysis.

Solving for the first (incompleteness aka "Is there an ongoing ingestion process?"):

  • It's hard to solve programmatically unless we added a global "dirty" flag. Which... might be reasonable, but really only useful for this one use case and adds complexity (how should it be managed? and it will definitely get stale)
  • You could answer this question with a little bit of manual effort with the proposed etl_complete table above in the description - query for the resources you want and a date range and see if you've got all the Conditions loaded up to June of this year, for example.
  • Or you could just ask the engineers doing the ingestion: "you done yet?" - honestly, this seems the easiest and most natural answer to this problem

Solving for the second (inaccuracy aka "How to stop the data from lying to me during an ingestion process?"):

  • We want to solve not just for the initial patient load, but also updates of that data (like, now we're loading in just the last month's worth of updates)
  • For rows that are "unlinked" (think: Device), those are just loose piles of data and any updates (read: a new bucket of data being poured on top) can just come in when they come in. If you don't have all the latest updates in the pile yet, that's an incompleteness issue ("do we have it all yet?"), but not an accuracy issue.
  • But for rows that are "linked" (think: Condition.encounter) we want to avoid considering either side of the link until both are available. Or we risk an inaccurate view of the data.
  • We could mark a patient/group combo as incomplete once we start an ingestion. But:
    • that requires some tooling knowledge of ingestion like "ok I'm starting a data update" and "ok I'm done"
    • it's disruptive to remove a patient from all consideration until all their data is updated again
  • Instead, if we identified the clusters of data (is it just Encounter clusters?), we could track those.
    • That is, instead of (or in addition to?) the proposed etl_patient_groups table from the description above, where we link patients to groups... maybe we link encounters to groups with an etl_encounter_groups table.
    • As Condition groups come in, they get marked as complete and then you can programmatically know that your study (which cares about Conditions in this example) can now use that group's Encounters.

mikix commented Mar 19, 2024

Problem scenarios, tricky to get right even with the above Encounter-oriented thinking:

  • I'm updating already-ingested group A with a fresh batch of data from this month. New Conditions and Encounters. I load Encounters in. The ETL marks that Encounter X is part of group A. How do we denote that the new Conditions in group A aren't actually loaded yet? Maybe we need some date-based timestamping when we say that X is a part of group A since date Z. But how do we get that date correct, given that ETL runs can happen in either order (Conditions then Encounters, or flipped) and the records' own date fields may be incomplete or inaccurate?
  • I export Conditions from the EHR first and then Encounters a day later. I will end up loading a day's worth of Encounters that don't have connected Conditions yet.
  • I export Encounters first and then Conditions a day later. This one doesn't matter so much.

Some of the above is probably helped a lot by doing resource exports at the same time. And then we could probably try to use transactionTime from the bulk export response as a timestamp. That way, our data is guaranteed to have a comprehensive view at least.

So:

  • Ask folks to keep the log for the export around to pull a timestamp from (and/or allow the user to enter a timestamp themselves?)
  • Now every ETL job would need two extra bits of info: the group & the export timestamp for the resources being loaded.
  • When we mark completion info, we add the timestamp - a study will want to see a newer-or-equal Condition/group update timestamp compared to the time the Encounter first appeared in the group. If the Condition/group timestamp is older, that encounter is not viable.
  • Update our documentation to encourage exporting resources at the same time, if possible. Or at least, export Encounters first.

This would also let us catch probable-mistakes like loading old data on top of newer data by looking at the export timestamp you are providing. (important in the Cerner context, which doesn't have meta.lastUpdated)
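The newer-or-equal timestamp rule above can be sketched as a small check. The function and variable names are invented for illustration, and ISO-8601 timestamp strings are an assumption; in practice the timestamps would come from the bulk export's transactionTime.

```python
from datetime import datetime

def encounter_is_viable(encounter_first_seen, completion_times, needed_resources):
    """True only if every needed resource (e.g. Condition) has a group
    completion timestamp newer than or equal to when this encounter
    first appeared in the group."""
    seen = datetime.fromisoformat(encounter_first_seen)
    for resource in needed_resources:
        done = completion_times.get(resource)
        if done is None or datetime.fromisoformat(done) < seen:
            # Resource never marked complete, or marked complete before
            # this encounter arrived: its linked data may be missing.
            return False
    return True

# Conditions were exported a day before this encounter showed up: not viable yet.
print(encounter_is_viable("2024-03-02T00:00:00",
                          {"Condition": "2024-03-01T00:00:00"},
                          ["Condition"]))  # False
```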

@mikix mikix changed the title Mark which resources are loaded for which patients Mark which resources are loaded for which patients (i.e. "completion tracking") Jul 2, 2024
mikix commented Jul 2, 2024

Current status

This mostly works! 🎉 ... But you have to opt-in.

You can manually enable this feature on the ETL side and the Library will automatically respect the tracking:

  • Pass --write-completion to the ETL to turn this feature on.
  • If your input ndjson folder does not also include a log.ndjson from a Bulk Export (from which the ETL can grab a group name and export timestamp), you will need to also pass in --export-group and --export-timestamp.
  • You have to be lightly careful about export ordering - you'll want to export your encounters first, before other data.

What does completion tracking actually do again?

The Library core study will ignore Encounters that both:

  • Have completion info for themselves
    • This offers backwards compatibility - any Encounters that aren't registered with completion tracking data will be included in core (i.e. all legacy Encounters will be included, because they won't have tracking data until you re-export their group)
  • AND do not have completion info for AllergyIntolerance, Conditions, DocumentReferences, MedicationRequests, and Observations loaded for the Encounter's group at timestamps later than or equal to the Encounter's timestamp.
    • This indicates an incomplete / in-progress ETL ingestion.
    • We look at those resources, because those are the resources that the Library examines - if it started looking at Procedure, we'd probably add Procedure to that list.

See code.
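A rough sketch of that two-part rule (in Python, not the Library's actual SQL): legacy encounters with no completion info at all are kept for backwards compatibility, and tracked encounters are kept only when every listed resource is complete at a later-or-equal timestamp. Comparing ISO-8601 strings lexicographically is an assumption that all timestamps share one format.

```python
from typing import Optional

# The resources the Library examines, per the list above.
TRACKED = ["AllergyIntolerance", "Condition", "DocumentReference",
           "MedicationRequest", "Observation"]

def include_encounter(encounter_ts: Optional[str], completion_ts: dict) -> bool:
    """Should the core study include this encounter?"""
    if encounter_ts is None:
        # No completion info for the encounter itself: legacy data, include it.
        return True
    # Otherwise require every tracked resource to be complete at a
    # timestamp later than or equal to the encounter's.
    return all(completion_ts.get(r, "") >= encounter_ts for r in TRACKED)

print(include_encounter(None, {}))  # True (legacy encounters stay included)
```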

Remaining work

Ideally completion tracking would be enabled by default. But before flipping that switch, this is the remaining work to be done:

  1. Consider doing something smart for the "empty input set" case - you exported group A and got zero Procedures. Ideally we'd still mark Procedures as complete for that group. How do we detect that case (vs not having exported Procedures in the first place)? (See below for more discussion of how to solve this.)
  2. Require the group name & timestamp from somewhere (from log or user) and drop the --write-completion flag
  3. Update user docs to mention this feature, and caveats around it (like exporting encounters first)
  4. (Optional) If the user provides a bulk export URL that chops down a group (like a URL that includes _typeFilter), we should probably require an explicit --export-group name instead of auto-detecting the group name from the URL.
  5. (Optional) Prevent overwriting newer group data with older data -- we do this for meta.lastUpdated, but this feature would give us the ability to look at the export timestamp and do the same kind of check. Which would help with Epic, which does not provide meta.lastUpdated. This doesn't have to happen before turning this feature on by default, it can happen whenever. Just mentioning it since it's a related feature and would be handy.

Empty input set thoughts

  • Granted, this might not be very common. But it could happen, so we should try to handle it.
  • Right now, the ETL wouldn't write completion tracking info for an empty input set - it can't distinguish between "no export was attempted" and "export happened but we got no data".
  • The Bulk FHIR export spec curiously discourages servers from indicating the difference by saying that if there is no data for a resource, servers SHOULD NOT return an empty file / output element for the resource.
  • One solution: The ETL could try to distinguish these cases by looking at the export log and parsing _type from the export URL.
    • But what if no _type was provided? (The user exported everything the server had...) Could warn the user in that case and ignore the problem, hoping that there were no zero counts...?
    • What if the log file isn't present? Offer a CLI flag to say "no really, mark this group complete"? Or do same warn-and-pray strategy for that case.
  • Alternative fix: stop running all ETL core tasks by default, which could allow us to assume that if the user passed us --task=procedure, the folder has all the Procedure data for this export, zero or not.
    • But that reduces the convenience of the CLI for everyone, just to cater to this edge case.
    • And users might not appreciate that we're doing this behind the scenes - I could imagine them copying and pasting a big line with all the tasks named.

mikix commented Oct 22, 2024

Update on the empty input set problem: I've gone with a solution that assumes it's rare you actually want to do it - the ETL refuses to upload an empty set unless you pass an --allow-missing-resources flag. This felt less flimsy to me than trying to infer what's going on by looking at the log (which might be copied around and inaccurate to the ndjson in the folder). PR #351

Once that's landed, I think the only remaining tasks blocking enabling this by default are:

  • make group name and export timestamp required (either from log or CLI)
  • drop the --write-completion flag
  • add user docs for completion tracking, with caveats and advice

mikix commented Oct 23, 2024

There is one other new thought: look into whether we can make the completion table non-unique for group/resource (i.e. record every time we push to the table, not just the latest time).

This would give us more provenance information, but would maybe require changes to the completion code in Library & ETL - have to confirm what would be necessary there.

mikix commented Oct 29, 2024

Now that completion tracking is enabled by default, I'll close this ticket. I spun a remaining useful item into its own ticket: #356

@mikix mikix closed this as completed Oct 29, 2024