load_cdf return an empty dataframe when a version is out of range #3035

pblocz · 2024-11-28T23:57:52Z

Description

In Spark Delta tables you can enable an option to handle out-of-range versions or timestamps: https://docs.delta.io/latest/delta-change-data-feed.html#read-changes-in-streaming-queries

Right now the behaviour of load_cdf is inconsistent. If you provide an out-of-range version, you get an error:

But with a timestamp out of range, you get an empty dataset:

It would be useful for incremental pipelines to have a way to manage this behaviour and make it consistent.
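To illustrate the incremental-pipeline case, here is a minimal sketch of a guard that avoids requesting an out-of-range starting version in the first place. The helper name `next_cdf_range` and the idea of storing a "last processed version" checkpoint are assumptions made for illustration, not part of the library; the commented usage assumes `DeltaTable.version()` and the `starting_version` / `ending_version` arguments of `load_cdf`, and a hypothetical table path.

```python
from typing import Optional, Tuple


def next_cdf_range(last_processed: int, latest: int) -> Optional[Tuple[int, int]]:
    """Return the (start, end) versions still to be read, or None if caught up.

    `last_processed` is the last table version the pipeline has already
    consumed (a hypothetical checkpoint value); `latest` is the current
    table version.
    """
    start = last_processed + 1
    if start > latest:
        # The next version to read does not exist yet: nothing to load.
        return None
    return (start, latest)


# Hypothetical usage with deltalake (not executed here):
#
#   from deltalake import DeltaTable
#   dt = DeltaTable("path/to/table")          # hypothetical path
#   rng = next_cdf_range(last_processed, dt.version())
#   if rng is not None:
#       batch = dt.load_cdf(starting_version=rng[0], ending_version=rng[1])
```

This is only a workaround on the caller's side. It does not cover the timestamp case, which is why a consistent option in load_cdf itself would still be useful.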

ion-elgreco · 2024-11-29T06:27:22Z

If you know some Rust, it's probably a simple fix

pblocz · 2024-11-29T18:34:48Z

If you know some rust it's probably a simple fix

@ion-elgreco I have never used Rust and it has been a while since I have done anything that needs to be compiled, but I can give it a go

pblocz added the enhancement New feature or request label Nov 28, 2024

pblocz mentioned this issue Dec 1, 2024

feat: add out_of_range flag to load_cdf #3040

Merged

ion-elgreco closed this as completed in #3040 Dec 7, 2024