When we write output files, we write them all from scratch each time. The decision was made (by Mike) to delete all existing files before writing any new ones, even though this causes a gap (a "data blip") during which Athena queries will not work.
Ideally Athena or S3 would be able to atomically update the data set in some fashion.
Other ways of updating:
Having blue/green deploy folders and having the Glue crawler point Athena at each in turn. I couldn't make this work in my investigation, but I think there might be something there. (My issue: you'd have to update the crawler or have two crawlers, and two crawlers create different tables. AWS doesn't seem to like two crawlers editing the same table.)
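For what it's worth, here is a rough sketch of the blue/green flip that sidesteps the crawler. This is not something I got working here, just an illustration: it assumes an unpartitioned table whose location can be switched with Athena's `ALTER TABLE ... SET LOCATION` DDL, and the bucket, prefix, and table names are made up.

```python
def inactive_slot(active_location: str, base: str) -> str:
    """Return the blue/green prefix that is NOT currently serving queries."""
    blue, green = f"{base}/blue/", f"{base}/green/"
    return green if active_location == blue else blue


def flip_statement(table: str, new_location: str) -> str:
    """Build the DDL that repoints the table at the freshly written slot."""
    return f"ALTER TABLE {table} SET LOCATION '{new_location}'"


base = "s3://example-bucket/output"  # hypothetical bucket and prefix
active = f"{base}/blue/"             # slot currently serving queries
target = inactive_slot(active, base)

# ... write all new files under `target` here, while `active` keeps serving ...

statement = flip_statement("example_table", target)  # hypothetical table name
print(statement)
```

The idea is that queries keep reading the old slot until the single DDL statement runs, so the window is one metadata update rather than the whole rewrite.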
Solutions that aren't atomic:
Writing a new directory and atomically replacing it. S3 doesn't support atomic directory moves (it doesn't really have directories, only key prefixes).
Overwrite files as we go, then delete any extras once done. This could still cause data issues if the incoming data is in a different order.
Write files adjacent to existing ones, then delete the existing ones. Similar data issues, but with too many entries instead of too few.
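To make the data issues with the last two options concrete, here is a toy simulation. It uses a plain dict in place of S3 and made-up file names and records, so it only illustrates what a query could see mid-update, not how our writer actually behaves.

```python
# Same four records before and after, but the new run splits them differently.
old = {"part-0": ["a", "b"], "part-1": ["c", "d"]}
new = {"part-0": ["c", "a"], "part-1": ["b", "d"]}


def visible(store):
    """Records a query would see if it scanned every file right now."""
    return sorted(r for key in sorted(store) for r in store[key])


# Overwrite in place: only the first file has been replaced so far.
store = dict(old)
store["part-0"] = new["part-0"]
mid_overwrite = visible(store)  # "c" appears twice and "b" is missing

# Write adjacent, then delete: new files exist, old ones not yet removed.
store = dict(old)
store.update({f"{key}.new": records for key, records in new.items()})
mid_adjacent = visible(store)  # every record appears twice

print(mid_overwrite)
print(mid_adjacent)
```

Either way a query that lands in the middle gets wrong results instead of an error, which is arguably worse than the current gap.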
This was essentially fixed by #89 and using Delta Lake as the output format. It's not the recommended path yet because we're missing some other fancy features like CloudFormation support. But it can be done, and that's how we'll fix this in the future. Leaving open for now while that support is tested and finalized.