enable merge on ICEBERG via Athena #633
Comments
I'm interested in having merge support for the Athena destination. For now I'll have to post-process a merge within dbt or an Airflow Athena operator, so in essence I will use dlt to build Athena staging tables (with the replace write disposition). I'd be happy to collaborate to help build this out.
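A minimal sketch of that workaround, assuming a hypothetical `events` resource: dlt loads full snapshots into staging tables with the replace disposition, and the actual merge is left to dbt or Airflow downstream.

```python
import dlt

# Full snapshot lands in a staging table; a downstream dbt model or
# Airflow task performs the merge into the final table.
@dlt.resource(write_disposition="replace")
def events():
    yield [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]

pipeline = dlt.pipeline(
    pipeline_name="athena_staging",  # hypothetical name
    destination="athena",
    dataset_name="staging",
)
print(pipeline.run(events()))
```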
@n0012 we had preliminary merge support working; @sh-rp was testing it. You can join our Slack and ping me or Dave - maybe we can find out how you could help us...
(2) is the tricky part. We use just INSERT + DELETE to do merges, but each destination has small syntax differences that need to be handled.
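For illustration, a hedged sketch of that DELETE + INSERT pattern as it could look on Athena Iceberg; the table and key names are hypothetical, and this is not dlt's actual implementation:

```python
def render_merge_sql(staging: str, dest: str, key: str) -> list[str]:
    """Emulate a merge as DELETE of matching keys followed by INSERT.

    Iceberg tables on Athena support row-level DELETE, which is what
    makes this two-statement approach possible there.
    """
    return [
        f"DELETE FROM {dest} WHERE {key} IN (SELECT {key} FROM {staging})",
        f"INSERT INTO {dest} SELECT * FROM {staging}",
    ]

# e.g. render_merge_sql("staging.events", "analytics.events", "id")
```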
@rudolfix, what sort of timeline would you be looking at to add this merge support? I don't need a hard deadline or anything, just a vague idea of whether it's in the vicinity of weeks, 2-3 months, 6+ months, etc.
Adding my use cases here.
Worth mentioning: I don't need to use Iceberg here, because awswrangler overwrites the data for a partition by performing a delete directly on S3 and then just appending again. With Iceberg tables it will be possible to do delete/insert instead.
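A sketch of that awswrangler partition-overwrite behaviour, with illustrative bucket, database, and table names:

```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"event_date": ["2023-10-01"], "user_id": [1], "value": [10.0]})

# mode="overwrite_partitions" deletes the S3 objects belonging to each
# partition value present in df, then writes the new files - a plain
# delete-and-append on S3, no table format required.
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/events/",  # hypothetical location
    dataset=True,
    mode="overwrite_partitions",
    partition_cols=["event_date"],
    database="analytics",           # hypothetical Glue database
    table="events",
)
```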
Once the merge disposition is implemented for Athena Iceberg tables I could consider using dlt instead. Also a deal breaker for me is that I need to be able to define partitions on my table to reduce data scans, but that is another issue.
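For reference, dlt already expresses merge at the resource level on destinations that support it; once Athena Iceberg gains support, usage could plausibly look like this (resource and key names are hypothetical):

```python
import dlt

# Merge disposition: rows whose primary key already exists in the
# destination are replaced, new keys are inserted.
@dlt.resource(write_disposition="merge", primary_key="user_id")
def users():
    yield [{"user_id": 1, "status": "active"}]

pipeline = dlt.pipeline(
    pipeline_name="athena_merge",  # hypothetical name
    destination="athena",
    dataset_name="analytics",
)
pipeline.run(users())
```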
Background
See the rationale below for why Iceberg/an open table format. We can support it easily via existing destinations: Athena and Snowflake.
Why not Spark:
Requirements
Implementation Notes
Rationale
Just a few points on why Iceberg is a game changer for data lakes:
- Data engineers often design partitioning around a few use cases that change over time. Iceberg lets you simply change those partition definitions, thanks to hidden partitioning, without rewriting the whole dataset, as you would have to with plain Parquet.
- Are you aware of the small-file problem in data lakes? Iceberg has a built-in function to compact "objects" and improve performance and cost (see the sketch after this list). Just a note on that: I have written quite a few compaction procedures in my life, and I believe Iceberg standardises that way of compacting data (same as Delta Lake).
- This point comes from leading a GDPR project that had to remove sensitive data from a data lake - again a game changer that makes data engineers' lives much easier.
- Last but not least: Iceberg is becoming the de facto open table format in the data landscape.
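As promised above, a hedged sketch of the compaction and GDPR-delete points expressed as Athena statements against an Iceberg table, submitted via boto3; all names (region, bucket, database, table, columns) are illustrative assumptions:

```python
import boto3

athena = boto3.client("athena", region_name="eu-west-1")  # assumed region

def run(sql: str) -> str:
    """Submit a statement to Athena; returns the query execution id."""
    resp = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )
    return resp["QueryExecutionId"]

# GDPR-style row-level delete on an Iceberg table.
run("DELETE FROM analytics.events WHERE user_id = 42")

# Built-in small-file compaction (bin packing) for Athena Iceberg tables.
run("OPTIMIZE analytics.events REWRITE DATA USING BIN_PACK "
    "WHERE event_date >= DATE '2023-10-01'")
```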