
Add ETL timestamp to completion table, make it non-unique #353

Merged
merged 1 commit into main from mikix/extra-completion-info on Oct 28, 2024

Conversation

@mikix (Contributor) commented Oct 25, 2024

This allows us to add an entry to the completion table for every ETL run - which might help debugging.

I've looked at and tested the Library-side code that reads these tables - it'll still work if these rows are non-unique.

This is another brick in the wall of #296

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added

@mikix mikix marked this pull request as ready for review October 25, 2024 18:28

github-actions bot commented Oct 25, 2024

☂️ Python Coverage

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
|-------|---------|----------|-----------|--------|
| 3543  | 3482    | 98%      | 98%       | 🟢     |

New Files

No new covered files...

Modified Files

| File | Coverage | Status |
|------|----------|--------|
| cumulus_etl/completion/schema.py | 100% | 🟢 |
| cumulus_etl/etl/config.py | 100% | 🟢 |
| cumulus_etl/etl/tasks/base.py | 100% | 🟢 |
| TOTAL | 100% | 🟢 |

updated for commit: 5a43957 by action🐍

@mikix mikix force-pushed the mikix/extra-completion-info branch from b4a974a to ed84719 Compare October 28, 2024 12:04
Comment on lines -16 to +20
"uniqueness_fields": {"table_name", "group_name"},
# These fields altogether basically guarantee that we never collide.
# (i.e. that every 'merge' is really an 'insert')
# That's intentional - we want this table to be a bit of a historical log.
# (We couldn't have no uniqueness fields -- delta lake doesn't like that.)
"uniqueness_fields": {"table_name", "group_name", "export_time", "etl_time"},
@mikix (Contributor, Author) commented Oct 28, 2024

This is the change I was most interested in making. Some considerations (there's a sketch of the resulting rows after this list):

  • Growth of this table - I don't think it's too bad. This adds one entry per ETL'd FHIR resource per run. At BCH, we have about 120 groups across 12 resources (1440 entries, assuming we never ETL'd something twice, which I'm sure we did a bit). The point is, this table will always stay quite small and definitely smaller than the source tables. I think the debugging history is worth it. (We already kept some of this info in JSON log files in S3, but it's now much easier to inspect.)
    • In fact, I just checked and our Cerner output folder has 1632 JobConfig entries (which is how many times the ETL has been run) - but usually for a single resource, so that's probably close to how big this table would be.
  • Can the Library side handle this change already? Yes - I looked at the code and did a quick test. Its existing completion-tracking code handles these rows not being table-group unique.
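A minimal sketch of the new behavior, with hypothetical row values (the field names come from the diff above): because export_time and etl_time are now part of the uniqueness fields, repeated runs over the same table and group accumulate as separate rows.

```python
# Hypothetical completion-table rows after two ETL runs of the same
# group & resource (values invented for illustration).
rows = [
    {
        "table_name": "patient",
        "group_name": "Cohort1",
        "export_time": "2024-10-01T00:00:00+00:00",
        "etl_time": "2024-10-02T12:00:00+00:00",
    },
    {
        "table_name": "patient",  # same table...
        "group_name": "Cohort1",  # ...and same group as above,
        "export_time": "2024-10-20T00:00:00+00:00",  # but a distinct run
        "etl_time": "2024-10-21T12:00:00+00:00",
    },
]

# With uniqueness_fields = {"table_name", "group_name", "export_time",
# "etl_time"}, a merge keyed on all four fields matches neither existing
# row, so every 'merge' is effectively an 'insert' and the table becomes
# a historical log of runs.
key_fields = ("table_name", "group_name", "export_time", "etl_time")
keys = {tuple(row[field] for field in key_fields) for row in rows}
assert len(keys) == len(rows)  # no collisions: both rows are kept
```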

Contributor commented:

just confirming - for everything before this change, we basically have one big group?

@mikix (Contributor, Author) replied:

Naw, more like we only kept a record of each group & resource combo once. Like if I ETL patients from group Cohort1 yesterday, then I do a new export and ETL Cohort1 again with new data, we'd only keep track of the latest one. (There's a sketch of the old collision below.)
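A minimal sketch of that old behavior, with hypothetical values: under the previous uniqueness_fields = {"table_name", "group_name"}, the re-run's row matched the existing one, so the merge replaced it instead of adding a new entry.

```python
# Hypothetical rows from two ETL runs of the same group & resource
# (values invented for illustration).
yesterday = {
    "table_name": "patient",
    "group_name": "Cohort1",
    "export_time": "2024-10-01T00:00:00+00:00",
}
today = {
    "table_name": "patient",
    "group_name": "Cohort1",
    "export_time": "2024-10-20T00:00:00+00:00",
}

def old_key(row):
    """The pre-change uniqueness key: table & group only."""
    return (row["table_name"], row["group_name"])

# The two rows collide under the old key, so the merge overwrote
# yesterday's entry and only the latest run was remembered.
assert old_key(yesterday) == old_key(today)
```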

@@ -272,6 +272,7 @@ def _update_completion_table(self) -> None:
"export_time": self.task_config.export_datetime.isoformat(),
"export_url": self.task_config.export_url,
"etl_version": cumulus_etl.__version__,
"etl_time": self.task_config.timestamp.isoformat(),
@mikix (Contributor, Author) commented:
What time to use here was a small choice. I initially had "whatever time it is right now as we're writing this completion entry", but then I switched it to the global ETL timestamp, which gets set early in the ETL run. This makes the timestamp a little less accurate to the exact moment the entries get uploaded to Athena, but we gain two things: we can check which resources have the same timestamp (i.e. were in the same run), and we can go look up the log files in S3, because they use the same global timestamp in their filename. There's a sketch of this below.
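A minimal sketch of that trade-off, assuming hypothetical names (the real code reads self.task_config.timestamp, per the diff above; the log-path format shown is an assumption, not the ETL's actual layout): one timestamp, captured at startup, ties completion rows to their run's log files.

```python
from datetime import datetime, timezone

# Captured once, early in the ETL run, and shared by every task.
etl_timestamp = datetime.now(timezone.utc)

# Each task's completion entry records the run-wide timestamp rather
# than the moment the entry itself is written.
completion_entry = {"etl_time": etl_timestamp.isoformat()}

# The run's log files embed the same timestamp in their path (this
# path format is a hypothetical stand-in), so a completion row can be
# traced back to the matching logs in S3.
log_path = f"JobConfig/{etl_timestamp.isoformat()}/log.ndjson"
```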

@mikix mikix force-pushed the mikix/extra-completion-info branch from ed84719 to 5a43957 Compare October 28, 2024 12:20
@mikix mikix merged commit ab6caa3 into main Oct 28, 2024
3 checks passed
@mikix mikix deleted the mikix/extra-completion-info branch October 28, 2024 13:38