Add ETL timestamp to completion table, make it non-unique #353
Conversation
Force-pushed from b4a974a to ed84719
"uniqueness_fields": {"table_name", "group_name"}, | ||
# These fields altogether basically guarantee that we never collide. | ||
# (i.e. that every 'merge' is really an 'insert') | ||
# That's intentional - we want this table to be a bit of a historical log. | ||
# (We couldn't have no uniqueness fields -- delta lake doesn't like that.) | ||
"uniqueness_fields": {"table_name", "group_name", "export_time", "etl_time"}, |
This is the change I was most interested in making. Some considerations:
- Growth of this table: I don't think it's too bad. This will be one entry per FHIR resource ETL'd, every run. At BCH, we have about 120 groups across 12 resources (1440 entries, assuming we never ETL'd something twice, which I'm sure we did a bit). But the point is, this table will always be quite small and definitely smaller than the source tables. I think the debugging history is worth it. (We already kept some of this info in JSON log files in S3, but this is now much easier to inspect.)
- In fact, I just checked and our Cerner output folder has 1632 JobConfig entries (which is how many times the ETL has been run), but usually for a single resource, so that's probably a close count to how big this table would be.
- Can the Library side handle this change already? Yes - I looked at the code and did a quick test. Its existing completion tracking code handles these rows not being table-group unique (sketched below).
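For that Library-side point, here is a minimal sketch of how a consumer could tolerate the non-unique rows, assuming it only wants the latest export per table/group pair (the helper below is hypothetical, not the Library's actual code):

def latest_completions(rows):
    """Collapse historical completion rows down to the newest export per (table, group)."""
    latest = {}
    for row in rows:
        key = (row["table_name"], row["group_name"])
        current = latest.get(key)
        # ISO-8601 timestamps compare correctly as plain strings.
        if current is None or row["export_time"] > current["export_time"]:
            latest[key] = row
    return latest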
just confirming - for everything before this change, we basically have one big group?
Naw, more like we only kept a record of each group & resource combo once. Like, if I ETL patients from group Cohort1 yesterday, then do a new export and ETL Cohort1 again with new data, we'd only keep track of the latest one.
@@ -272,6 +272,7 @@ def _update_completion_table(self) -> None:
             "export_time": self.task_config.export_datetime.isoformat(),
             "export_url": self.task_config.export_url,
             "etl_version": cumulus_etl.__version__,
+            "etl_time": self.task_config.timestamp.isoformat(),
What time to use here was a small choice. I initially had "whatever time it is right now, as we're writing this completion entry" - but then I switched it to the global ETL timestamp, which gets set early in the ETL run. This means the timestamp is a little less accurate to the exact time the entries get uploaded to Athena, but we gain the ability to check which resources have the same timestamp (i.e. were in the same run), and we gain the ability to go look up the log files in S3, because they use the same global timestamp in their filename.
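To illustrate that trade-off, a hedged sketch (the field names match the diff above, but the S3 log-folder naming is an assumption about how the JobConfig paths embed the timestamp):

from datetime import datetime, timezone

# Set once, early in the run, and reused everywhere.
run_timestamp = datetime.now(timezone.utc)

completion_row = {
    "table_name": "patient",
    "group_name": "Cohort1",
    "export_time": "2024-06-01T00:00:00+00:00",  # hypothetical export time
    "etl_time": run_timestamp.isoformat(),  # same value for every resource in this run
}

# Hypothetical: if the S3 JobConfig log folder embeds the same timestamp,
# a completion row's etl_time points straight at the matching log files.
log_folder = f"s3://bucket/JobConfig/{run_timestamp.isoformat()}/"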
Force-pushed from ed84719 to 5a43957
"uniqueness_fields": {"table_name", "group_name"}, | ||
# These fields altogether basically guarantee that we never collide. | ||
# (i.e. that every 'merge' is really an 'insert') | ||
# That's intentional - we want this table to be a bit of a historical log. | ||
# (We couldn't have no uniqueness fields -- delta lake doesn't like that.) | ||
"uniqueness_fields": {"table_name", "group_name", "export_time", "etl_time"}, |
This allows us to add an entry to the completion table for every ETL run - which might help debugging.
I've looked at and tested the Library side code that looks at these tables - it'll still work if these rows are non-unique.
This is another brick in the wall of #296
Checklist
- Consider if documentation (in docs/) needs to be updated