
Add ETL timestamp to completion table, make it non-unique #353

Merged
merged 1 commit into main from mikix/extra-completion-info on Oct 28, 2024

Conversation

@mikix (Contributor) commented Oct 25, 2024

This allows us to add an entry to the completion table for every ETL run - which might help debugging.

I've looked at and tested the Library-side code that reads these tables - it'll still work if these rows are non-unique.

This is another brick in the wall of #296

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added

@mikix mikix marked this pull request as ready for review October 25, 2024 18:28

github-actions bot commented Oct 25, 2024

☂️ Python Coverage

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
|-------|---------|----------|-----------|--------|
| 3543  | 3482    | 98%      | 98%       | 🟢     |

New Files

No new covered files...

Modified Files

| File | Coverage | Status |
|------|----------|--------|
| cumulus_etl/completion/schema.py | 100% | 🟢 |
| cumulus_etl/etl/config.py | 100% | 🟢 |
| cumulus_etl/etl/tasks/base.py | 100% | 🟢 |
| TOTAL | 100% | 🟢 |

updated for commit: 5a43957 by action🐍

@mikix mikix force-pushed the mikix/extra-completion-info branch from b4a974a to ed84719 Compare October 28, 2024 12:04
Comment on lines -16 to +20
"uniqueness_fields": {"table_name", "group_name"},
# These fields altogether basically guarantee that we never collide.
# (i.e. that every 'merge' is really an 'insert')
# That's intentional - we want this table to be a bit of a historical log.
# (We couldn't have no uniqueness fields -- delta lake doesn't like that.)
"uniqueness_fields": {"table_name", "group_name", "export_time", "etl_time"},
@mikix (Contributor, Author) commented Oct 28, 2024

This is the change I was most interested in making. Some considerations (there's a sketch of the resulting rows after this list):

  • Growth of this table - I don't think it's too bad. This adds one entry per ETL'd FHIR resource per run. At BCH, we have about 120 groups across 12 resources (1440 entries, assuming we never ETL'd something twice, which I'm sure we did a bit). The point is, this table will always stay quite small and definitely smaller than the source tables. I think the debugging history is worth it. (We already kept some of this info in JSON log files in S3, but it's now much easier to inspect.)
    • In fact, I just checked and our Cerner output folder has 1632 JobConfig entries (which is how many times the ETL has been run) - but usually for a single resource, so that's probably close to how big this table would be.
  • Can the Library side handle this change already? Yes - I looked at the code and did a quick test. Its existing completion-tracking code handles these rows not being table-group unique.
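A minimal sketch of the new behavior, with hypothetical row values (the field names come from the diff above): because export_time and etl_time are now part of the uniqueness fields, repeated runs over the same table and group accumulate as separate rows.

```python
# Hypothetical completion-table rows after two ETL runs of the same
# group & resource (values invented for illustration).
rows = [
    {
        "table_name": "patient",
        "group_name": "Cohort1",
        "export_time": "2024-10-01T00:00:00+00:00",
        "etl_time": "2024-10-02T12:00:00+00:00",
    },
    {
        "table_name": "patient",  # same table...
        "group_name": "Cohort1",  # ...and same group as above,
        "export_time": "2024-10-20T00:00:00+00:00",  # but a distinct run
        "etl_time": "2024-10-21T12:00:00+00:00",
    },
]

# With uniqueness_fields = {"table_name", "group_name", "export_time",
# "etl_time"}, a merge keyed on all four fields matches neither existing
# row, so every 'merge' is effectively an 'insert' and the table becomes
# a historical log of runs.
key_fields = ("table_name", "group_name", "export_time", "etl_time")
keys = {tuple(row[field] for field in key_fields) for row in rows}
assert len(keys) == len(rows)  # no collisions: both rows are kept
```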

Contributor commented:

just confirming - for everything before this change, we basically have one big group?

@mikix (Contributor, Author) replied:

Naw, more like we only kept a record of each group & resource combo once. Like if I ETL patients from group Cohort1 yesterday, then I do a new export and ETL Cohort1 again with new data, we'd only keep track of the latest one. (There's a sketch of the old collision below.)
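A minimal sketch of that old behavior, with hypothetical values: under the previous uniqueness_fields = {"table_name", "group_name"}, the re-run's row matched the existing one, so the merge replaced it instead of adding a new entry.

```python
# Hypothetical rows from two ETL runs of the same group & resource
# (values invented for illustration).
yesterday = {
    "table_name": "patient",
    "group_name": "Cohort1",
    "export_time": "2024-10-01T00:00:00+00:00",
}
today = {
    "table_name": "patient",
    "group_name": "Cohort1",
    "export_time": "2024-10-20T00:00:00+00:00",
}

def old_key(row):
    """The pre-change uniqueness key: table & group only."""
    return (row["table_name"], row["group_name"])

# The two rows collide under the old key, so the merge overwrote
# yesterday's entry and only the latest run was remembered.
assert old_key(yesterday) == old_key(today)
```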

@@ -272,6 +272,7 @@ def _update_completion_table(self) -> None:
"export_time": self.task_config.export_datetime.isoformat(),
"export_url": self.task_config.export_url,
"etl_version": cumulus_etl.__version__,
"etl_time": self.task_config.timestamp.isoformat(),
@mikix (Contributor, Author) commented:
What time to use here was a small choice. I initially had "whatever time it is right now as we're writing this completion entry", but then I switched it to the global ETL timestamp, which gets set early in the ETL run. This makes the timestamp a little less accurate to the exact moment the entries get uploaded to Athena, but we gain two things: we can check which resources have the same timestamp (i.e. were in the same run), and we can go look up the log files in S3, because they use the same global timestamp in their filename. There's a sketch of this below.
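A minimal sketch of that trade-off, assuming hypothetical names (the real code reads self.task_config.timestamp, per the diff above; the log-path format shown is an assumption, not the ETL's actual layout): one timestamp, captured at startup, ties completion rows to their run's log files.

```python
from datetime import datetime, timezone

# Captured once, early in the ETL run, and shared by every task.
etl_timestamp = datetime.now(timezone.utc)

# Each task's completion entry records the run-wide timestamp rather
# than the moment the entry itself is written.
completion_entry = {"etl_time": etl_timestamp.isoformat()}

# The run's log files embed the same timestamp in their path (this
# path format is a hypothetical stand-in), so a completion row can be
# traced back to the matching logs in S3.
log_path = f"JobConfig/{etl_timestamp.isoformat()}/log.ndjson"
```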

@mikix mikix force-pushed the mikix/extra-completion-info branch from ed84719 to 5a43957 Compare October 28, 2024 12:20
@mikix mikix merged commit ab6caa3 into main Oct 28, 2024
3 checks passed
@mikix mikix deleted the mikix/extra-completion-info branch October 28, 2024 13:38