Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip tasks or error out when resources are missing #351

Merged
merged 1 commit into from
Oct 23, 2024

Conversation

mikix
Copy link
Contributor

@mikix mikix commented Oct 21, 2024

  • If tasks are selected with --task or --task-filter and a needed resource is missing, we now error out.
  • If tasks are not specifically selected, we now only run tasks for which resources are available (and error out if no resources exist at all)
  • If --allow-missing-resources is passed, we skip both above checks and do historical behavior of running any given task with zero input rows (which pushes up a schema, cleans up the delta lake, and stamps the resource as complete in the completion tracking tables).

This addresses a lingering issue from #296, where we didn't adequately protect ourselves from incorrectly handling "empty set" inputs and wrongly marking them as complete. I've decided to go the simplest route and refuse to handle empty-sets unless specifically told to. I think this is appropriate given how unlikely empty sets are in the real world. (We see them with Cerner for some resources like ServiceRequest, but it's the exception not the rule.)

(An alternate solution that I did consider was looking at the bulk export log if it was present, and seeing what resources were exported - then continuing if the export log says it exported Patients, even if no patient files where present. But it seemed more complicated code-wise and more brittle if users are moving files around. The approach I went with requires a human in the loop to agree to allow empty sets. Or... someone just adds the --allow flag to their scripts and we lose this protection anyway - but that's at least not on me. 😄)

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added

Copy link

github-actions bot commented Oct 21, 2024

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines Covered Coverage Threshold Status
3542 3481 98% 98% 🟢

New Files

No new covered files...

Modified Files

File Coverage Status
cumulus_etl/errors.py 100% 🟢
cumulus_etl/etl/cli.py 100% 🟢
cumulus_etl/etl/tasks/base.py 100% 🟢
cumulus_etl/loaders/base.py 100% 🟢
cumulus_etl/loaders/fhir/bulk_export.py 100% 🟢
cumulus_etl/loaders/fhir/ndjson_loader.py 100% 🟢
cumulus_etl/loaders/i2b2/loader.py 100% 🟢
cumulus_etl/upload_notes/downloader.py 100% 🟢
TOTAL 100% 🟢

updated for commit: 96442bc by action🐍

Comment on lines 43 to 44
if self.root.protocol in {"http", "https"}:
return None # we haven't done the export yet!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - with this and the i2b2 tcp thing below, if i was coming into this codebase fresh, a little more context as to what this means might be helpful (i.e. it's we've got the URL of the place to get the data, not the data itself)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I added some more context:

            # We haven't done the export yet, so there are no files to inspect yet.
            # Returning None means "dunno" (i.e. "just accept whatever you eventually get").
            return None

- If tasks are selected with --task or --task-filter and a needed
  resource is missing, we now error out.
- If tasks are not specifically selected, we now only run tasks for
  which resources are available (and error out if no resources exist
  at all)
- If --allow-missing-resources is passed, we skip both above checks
  and do historical behavior of running any given task with zero input
  rows (which pushes up a schema, cleans up the delta lake, and stamps
  the resource as complete in the completion tracking tables).
@mikix mikix force-pushed the mikix/skip-missing-resources branch from 6262610 to 96442bc Compare October 23, 2024 19:34
@mikix mikix merged commit 98e2b8c into main Oct 23, 2024
3 checks passed
@mikix mikix deleted the mikix/skip-missing-resources branch October 23, 2024 21:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants