Prevent the "data blip" that happens when a run is updating the output files #76

mikix · 2022-11-16T14:35:04Z

When we write output files, we write them all from scratch each time. The decision was made (by Mike) to delete all existing files before writing any new ones, even though this causes a gap (or "data blip" or whatever you want to call it) during which Athena queries will not work.

Ideally Athena or S3 would be able to atomically update the data set in some fashion.

Other ways of updating:

Having blue/green deploy folders and having the Glue crawler point Athena at each in turn. I couldn't get this to work in my investigation, though I think there might be something there. (My issue: you'd have to update the crawler or have two crawlers. And two crawlers create different tables; AWS doesn't seem to like two crawlers editing the same table.)

Solutions that aren't atomic:

Writing a new directory and atomically replacing it. S3 doesn't support atomic directory moves (it doesn't really DO directories).
Overwrite files as we go, delete any extras once done. This could still cause data issues if the incoming data is in a different order, since a query mid-update would see a mix of old and new files.
Write files adjacent to existing ones, then delete existing ones. Similar data issues, but now too many entries instead of not enough.
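To make the ordering problem with "overwrite as we go" concrete, here is a minimal sketch. It uses a plain dict as a hypothetical stand-in for an S3 prefix (key to list of rows); `overwrite_then_prune` and `rows` are names made up for this illustration, not anything from the project.

```python
# Hypothetical in-memory stand-in for an S3 prefix: key -> rows in that file.
def overwrite_then_prune(store, new_files, on_step=None):
    """Overwrite or add each new file, then delete keys the new run did not write."""
    for key, body in new_files.items():
        store[key] = body
        if on_step:
            on_step(dict(store))  # snapshot of what a reader could see right now
    for key in [k for k in store if k not in new_files]:
        del store[key]
        if on_step:
            on_step(dict(store))


def rows(snapshot):
    """All rows a reader would get by scanning every file in the snapshot."""
    return sorted(row for body in snapshot.values() for row in body)


# Old run: four rows split across two files.
store = {"part-0": ["a", "b"], "part-1": ["c", "d"]}
# New run: the same rows, but arriving in a different order.
new_files = {"part-0": ["c", "d"], "part-1": ["a", "b"]}

seen = []
overwrite_then_prune(store, new_files, on_step=lambda snap: seen.append(rows(snap)))

print(seen[0])   # mid-update view: "c" and "d" twice, "a" and "b" missing
print(seen[-1])  # final view: all four rows exactly once
```

The end state is correct, but a query that lands between the two overwrites sees duplicated and missing rows at the same time, which is arguably worse than the blip.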

mikix · 2022-11-16T14:35:16Z

Note that #75 (run-to-run deltas) would solve this for us because we would not be replacing files anymore (unless and until we do a fresh reset).

mikix · 2022-12-28T17:32:38Z

This was essentially fixed by #89 and using Delta Lake as the output format. It's not the recommended path yet because we're missing some other fancy features like CloudFormation support. But it can be done and that's how we'll fix this in future. Leaving open for now while that support is tested and finalized.

mikix · 2023-01-09T15:44:26Z

Delta lakes are now the default, with atomic updates using a symlink manifest. This feels closed.
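For anyone landing here later, the general idea behind a manifest-based atomic update can be sketched as a toy model. This is not the project's code or the Delta Lake API; the dict-backed `store`, `publish`, and `read_table` are all hypothetical. The assumption is that readers only look at files named in a single manifest object, so replacing that one object is the whole "commit".

```python
import uuid


def read_table(store, manifest_key="manifest"):
    """Read only the files named in the manifest; ignore everything else."""
    return sorted(row for key in store.get(manifest_key, []) for row in store[key])


def publish(store, new_files, manifest_key="manifest"):
    """Write new data files under fresh names, then swap the manifest in one write."""
    run_id = uuid.uuid4().hex
    new_keys = []
    for name, body in new_files.items():
        key = f"{run_id}/{name}"
        store[key] = body  # not visible to readers until the manifest lists it
        new_keys.append(key)
    old_keys = store.get(manifest_key, [])
    store[manifest_key] = new_keys  # the single commit step
    return old_keys  # previous files are left in place; clean up later if wanted


store = {}
publish(store, {"part-0": ["a", "b"]})
before = read_table(store)

old_keys = publish(store, {"part-0": ["a"], "part-1": ["b", "c"]})
after = read_table(store)

print(before)  # rows from the first run only
print(after)   # rows from the second run only
```

Because the old files are never deleted or overwritten during the run, a reader sees either the complete old set or the complete new set, never a mix.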

mikix added the enhancement (New feature or request) label Nov 16, 2022

mikix mentioned this issue Dec 21, 2022

feat: add Delta Lake support #89

Merged

2 tasks

mikix closed this as completed Jan 9, 2023
