-
I have a source table, and I have a derived table with a transformed version of each of the records in the source table. I want to update the derived table on a schedule, and I want to do this in an incremental way, i.e. to avoid dropping and re-creating the entire table each time I do this. I also don't want to use partitions. Instead, I have a way of finding out which source records have been added / removed / changed since the last run. I want to update only these records in the derived table. Question adapted from this Slack message: https://dagster.slack.com/archives/C01U954MEER/p1686258558582379 |
Beta Was this translation helpful? Give feedback.
Replies: 1 comment 1 reply
-
Basically, you can handle this with a vanilla Dagster asset. You're responsible for writing the logic that finds out what records in the source table have changed and updates the relevant records in the target table. You generally shouldn't use IO managers when using this pattern. The built-in IO managers assume that every time you materialize an asset, you're overwriting the full asset or a full partition of the asset. from dagster import asset
@asset
def my_derived_table(context) -> None:
added_records, deleted_records, updated_records = find_changed_records()
transformed_added_records = transform_records(added_records)
transformed_updated_records = transform_records(updated_records)
update_table(
add_records=transformed_added_records
update_records=transformed_updated_records,
delete_records=deleted_records,
) To make it easy / possible to find out what records have changed, it can be useful to store metadata each time you materialize your asset. E.g. you might store the timestamp of the last record you observed, or some sort of cursor into an update stream. #14338 includes information on how to retrieve metadata from prior materializations of the same asset. |
Beta Was this translation helpful? Give feedback.
Basically, you can handle this with a vanilla Dagster asset. You're responsible for writing the logic that finds out what records in the source table have changed and updates the relevant records in the target table.
You generally shouldn't use IO managers when using this pattern. The built-in IO managers assume that every time you materialize an asset, you're overwriting the full asset or a full partition of the asset.