Support run-to-run deltas #75
OK here's my stab at a comparison between the big-three ACID data lake formats: Delta Lake, Hudi, and Iceberg.

Features

Our needs are fairly modest. We want to occasionally atomically upsert (update/insert, plus we'll also need delete support) into a data lake. We likely write less often than we read, and our reads are complicated.

All three ACID lakes have extremely similar features and are converging on each other. See this comparison article with some history in it -- but honestly, just google around for acid/incremental/upsert data lake or the product names and you'll come to many comparison articles, all talking about fairly minor differences (to my naive eyes). They all offer copy-on-write (slow writes, but fast reads) vs merge-on-read (offered by Hudi), so that's fine. We want our Athena reads to be performant, since they happen a lot.

They are all open-source projects, growing quickly, managed by either the Apache or Linux Foundation. Delta Lake does however tend to lag behind the proprietary Delta Engine commercial solution offered by Databricks.

The most interesting feature for us that I saw was that Iceberg does some automatic partitioning for you. So: better performance without having to manually determine partitions.

Python

Cumulus ETL is Python based. All these ACID lakes are Spark (Java) based. Hudi doesn't look like it has any native Python bindings coming down the pipeline at all. But both Delta Lake and Iceberg have up-and-coming (but not feature-complete) native Python implementations (meaning no Java):

- Delta Lake: 1st-party Rust implementation with a Python wrapper (sketched at the end of this comment). Currently missing merges, which is a critical feature for us. It also has a Spark-based wrapper that adds a tiny amount of syntactic sugar over bare Spark.
- Iceberg: 1st-party pure Python implementation. Currently missing merge support too.

So I'm much more interested in either Delta Lake or Iceberg, all else being equal, because we could eventually drop the behind-the-scenes Java piece of the pipeline, which while not an active problem is certainly odd and makes things hard to debug when there are problems.

Cloud Support

We'd like to avoid being too dependent on one cloud platform (so no managed solutions), while also being mindful of not making future expansion too difficult as we add more platforms to the fold, beyond our current target of AWS.

- AWS: It seems to have support for all three (as well as a managed Lake Formation). It has much more documentation for Iceberg, which makes me think it is better supported, but Delta Lake and Hudi get mentions too.
- Azure: They seem to be completely a Delta Lake shop, with their big data lake support even being called Azure Databricks. The most documentation I could find about Iceberg was how to convert it into Delta Lake.

Conclusion

Disclaimer: I myself am not an expert in this domain, and I may have missed important facts. But assuming they all have comparable features, I'm inclined to disregard Hudi for its lack of a native Python solution and favor Delta Lake, to make eventually supporting Azure even easier. Thoughts from folks more knowledgeable?

(Also, nothing says we couldn't grow support for multiple data lakes in the future. I'm just focused on what to build first and primarily support.)
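For concreteness, here's roughly what the native Delta Lake Python bindings (the `deltalake` package, backed by the Rust implementation) look like for the parts that already work. This is a minimal sketch; the table path and sample data are made up, and the missing piece is exactly the merge/upsert API noted above.

```python
# Minimal sketch of the delta-rs native Python bindings ("deltalake" on PyPI).
# The table path and sample data here are hypothetical.
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

# Some sample rows to land in the lake.
patients = pa.table({
    "id": ["p1", "p2"],
    "gender": ["female", "male"],
})

# Atomic write; mode can be "append" or "overwrite". S3 paths work too,
# given the appropriate storage options.
write_deltalake("/tmp/lake/patients", patients, mode="overwrite")

# Read the committed snapshot back as an Arrow table. Spark and Athena can
# read the same files, so nothing here locks out the Java ecosystem.
table = DeltaTable("/tmp/lake/patients")
print(table.to_pyarrow_table().num_rows)
```

What's missing is a native equivalent of Spark's `MERGE INTO`, so true upserts would still require the Spark-based path for now.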
Regarding ACID lakes, we did end up going with Delta Lake, and that part of this story is resolved. The remaining bits of this issue are about the run-to-run deltas themselves, discussed below.
Currently, each time we are run, we read all data and rewrite all output data. This increases the run time and puts unnecessary stress on bulk export servers.

Presumably, we could use a `_since` parameter for bulk exports to capture everything since the last run (see the kick-off sketch after this list). Considerations when fixing this:

- `_since=` is the obvious answer for bulk exports, but I hear some EHR systems (like Epic?) don't record a last-modified date (if so, do they even support `_since`?).
- For non-bulk-exports (csv or ndjson files just sitting there), maybe we don't worry about this?
- `Meta.versionId` could be used to detect changed resources (but I hear that's not reliably supported in EHRs).
- (Currently, the ETL just assumes whichever input data you give it is the latest, and that might continue to work for us.)
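As a concrete reference for the `_since` idea, here's a rough sketch of the bulk-export kick-off request as defined by the FHIR Bulk Data Access spec. The server URL, resource types, and how the last-run timestamp gets persisted are all assumptions for illustration, not anything the ETL does today.

```python
# Sketch of a FHIR bulk-export kick-off using _since. The base URL and the
# way the last-run timestamp is stored are hypothetical.
import requests

FHIR_BASE = "https://ehr.example.com/fhir"  # hypothetical server
last_run = "2022-01-01T00:00:00Z"  # would be loaded from state saved by the previous run

response = requests.get(
    f"{FHIR_BASE}/$export",
    params={"_type": "Patient,Condition", "_since": last_run},
    headers={
        "Accept": "application/fhir+json",
        "Prefer": "respond-async",  # bulk export kick-off is asynchronous
    },
)

# Per the spec, a successful kick-off returns 202 Accepted plus a
# Content-Location header pointing at a status endpoint to poll for the
# finished export files.
status_url = response.headers.get("Content-Location")
print(response.status_code, status_url)
```

If a server doesn't track last-modified dates (the Epic concern above), it may reject the kick-off or return everything anyway, so the ETL would still need a fallback to full exports.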