Prevent the "data blip" that happens when a run is updating the output files #76

Closed
mikix opened this issue Nov 16, 2022 · 3 comments
Labels
enhancement New feature or request

Comments


mikix commented Nov 16, 2022

When we write output files, we write them all from scratch each time. The decision was made (by Mike) to delete all existing files before writing any new ones, even though this leaves a gap (a "data blip", or whatever you want to call it) during which Athena queries will not work.

Ideally Athena or S3 would be able to atomically update the data set in some fashion.

Other ways of updating:

  • Having blue/green deploy folders and having the Glue crawler point Athena at each in turn. I couldn't make this work in my investigation, though I think there might be something there. (My issue: you'd have to either update the crawler or have two crawlers, and two crawlers create different tables. AWS doesn't seem to like two crawlers editing the same table.)

Solutions that aren't atomic:

  • Writing a new directory and then swapping it in for the old one. S3 doesn't support atomic directory moves (it doesn't really DO directories), so the swap itself isn't atomic.
  • Overwriting files as we go, then deleting any extras once done. This could still cause data issues if the incoming data arrives in a different order.
  • Writing files adjacent to the existing ones, then deleting the existing ones. Similar data issues, but now with too many entries instead of not enough.
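
To make the failure modes of these three approaches concrete, here is a minimal Python sketch that simulates an output folder as a dict of file name to rows and records what a query would see after each step. Everything in it (the function names, the `new-` prefix, the sample file names and rows) is made up for illustration; it is not the ETL's actual writer code and it does not talk to S3 or Athena.

```python
def visible_rows(bucket):
    """Rows a query would see if it scanned every file in the folder right now."""
    return sorted(row for rows in bucket.values() for row in rows)


def delete_then_write(bucket, new_files):
    """Delete everything first, then write the new files one by one."""
    bucket.clear()
    yield visible_rows(bucket)  # the "data blip": nothing to query
    for name, rows in new_files.items():
        bucket[name] = rows
        yield visible_rows(bucket)


def overwrite_in_place(bucket, new_files):
    """Overwrite files as we go, then delete any leftover old files."""
    for name, rows in new_files.items():
        bucket[name] = rows
        yield visible_rows(bucket)
    for name in set(bucket) - set(new_files):
        del bucket[name]
    yield visible_rows(bucket)


def write_adjacent(bucket, new_files):
    """Write new files next to the old ones, then delete the old ones."""
    old_names = list(bucket)
    for name, rows in new_files.items():
        bucket["new-" + name] = rows  # hypothetical naming scheme
        yield visible_rows(bucket)
    for name in old_names:
        del bucket[name]
    yield visible_rows(bucket)


# Same four rows in both runs, but they land in the files in a different order.
OLD_FILES = {"0.ndjson": ["a", "b"], "1.ndjson": ["c", "d"]}
NEW_FILES = {"0.ndjson": ["c", "d"], "1.ndjson": ["a", "b"]}

for strategy in (delete_then_write, overwrite_in_place, write_adjacent):
    print(strategy.__name__)
    for step in strategy(dict(OLD_FILES), NEW_FILES):
        print("   ", step)
```

With this sample data, delete-then-write starts from an empty result, overwrite-in-place briefly shows `c` and `d` twice while `a` and `b` are missing, and write-adjacent briefly shows every row twice. All three end up with the correct four rows, but none of them is correct at every intermediate step.
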
@mikix mikix added the enhancement New feature or request label Nov 16, 2022

mikix commented Nov 16, 2022

Note that #75 (run-to-run deltas) would solve this for us because we would not be replacing files anymore (unless and until we do a fresh reset).


mikix commented Dec 28, 2022

This was essentially fixed by #89 and by using Delta Lake as the output format. It's not the recommended path yet because we're still missing some other fancy features, like CloudFormation support. But it can be done, and that's how we'll fix this in the future. Leaving this open for now while that support is tested and finalized.


mikix commented Jan 9, 2023

Delta Lake is now the default output format, with atomic updates using a symlink manifest. This feels closed.
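
The manifest idea can be sketched without Delta Lake at all: readers only ever open the files listed in a manifest, so a writer can add new data files at its leisure and then switch the manifest in a single step. The sketch below does this on a local filesystem, where `os.replace` is an atomic rename on POSIX; on S3 the analogous step would presumably be overwriting the single manifest object. The names `publish` and `read_table` and the `manifest` / `part-N.ndjson` layout are assumptions for illustration, not Delta Lake's real on-disk format or this project's code.

```python
import os
import tempfile
from pathlib import Path


def publish(table_dir, version, rows):
    """Write a new data file, then point the manifest at it in one step."""
    data_file = table_dir / f"part-{version}.ndjson"
    data_file.write_text("".join(row + "\n" for row in rows))

    # Build the new manifest off to the side, then swap it in atomically.
    # Readers see either the old manifest or the new one, never a mix.
    tmp_manifest = table_dir / "manifest.tmp"
    tmp_manifest.write_text(data_file.name + "\n")
    os.replace(tmp_manifest, table_dir / "manifest")


def read_table(table_dir):
    """Read only the data files that the current manifest lists."""
    names = (table_dir / "manifest").read_text().split()
    rows = []
    for name in names:
        rows.extend((table_dir / name).read_text().splitlines())
    return rows


with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp)
    publish(table, 1, ["a", "b"])
    first = read_table(table)
    publish(table, 2, ["c"])
    second = read_table(table)
    # The old data file is still on disk; it is just no longer referenced.
    old_file_still_there = (table / "part-1.ndjson").exists()
```

Note that this sketch never cleans up unreferenced data files. That is deliberate here: a reader that opened the old manifest just before the swap can still finish reading the old files.
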

@mikix mikix closed this as completed Jan 9, 2023