
[Versioning]: Explore Kedro + Delta Lake for versioning #4240

Open
ElenaKhaustova opened this issue Oct 17, 2024 · 11 comments

@ElenaKhaustova
Contributor

ElenaKhaustova commented Oct 17, 2024

Description

At the current stage, by versioning we mean mapping a single version number to the corresponding versions of parameters, I/O data, and code, so that one is able to retrieve the full project state, including data, at any point in time.

The goal is to check if we can use Delta Lake to map a single version number to code, parameters, and I/O data within Kedro and how it aligns with Kedro’s workflow.

As a result, we expect a working example of a Kedro project that uses Delta Lake for versioning, along with findings on:

  • whether it solves the main task and what the constraints are;
  • how easy it is to set up;
  • what the workflow looks like;
  • whether any changes are required on the Kedro side;
  • what data formats are supported;
  • how easy it is to work with local/remote storage;
  • how demanding it is in terms of dependencies.

Context

#4199

@astrojuanlu:

Kedro + Delta Lake is not only possible, but works really well. This is the demo I showed at the August 22nd Coffee Chat: https://github.com/astrojuanlu/kedro-deltalake-demo

Market research

@lrcouto
Contributor

lrcouto commented Nov 19, 2024

Still exploring the implementation, but here are some insights from what I've found so far.

For the setup: if you're already using Spark in your project, you don't need to add it as a dependency for Delta Lake. If you're going to add it just for versioning, the Java setup can be a bit of a headache and the dependencies are pretty chunky in file size.

For data formats, DeltaLake uses the parquet format under the hood so it'll work better with data that follows a similar structure. Making it work with unstructured data might be more work than it's worth. Looking at the PolarsDeltaDataset class that @astrojuanlu made for his demo, I'm wondering if it would be possible to make a generic class to make several different formats work with the DeltaLake tables.
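For context, reading the same Delta table into different dataframe libraries is already a one-liner each with the delta-rs bindings and Polars' built-in reader; a minimal sketch (the table path here is hypothetical), which is roughly the duplication a generic class would try to avoid:

import polars as pl
from deltalake import DeltaTable

df_pandas = DeltaTable("data/delta/companies").to_pandas()  # delta-rs Python bindings
df_polars = pl.read_delta("data/delta/companies")           # Polars reads Delta tables natively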

@astrojuanlu
Member

I'm wondering if it would be possible to make a generic class to make several different formats work with the DeltaLake tables.

Good question... I guess this is equivalent to avoiding having both a PandasCSVDataset and a PolarsCSVDataset.

@lrcouto
Contributor

lrcouto commented Nov 25, 2024

Here are some further impressions after more exploration:

Starting a project with the intention of using Delta Lake for versioning from the outset is likely viable. What I've been trying to do is gauge how difficult it'd be to adapt an already existing project. From what I'm seeing here, this would not be trivial and would require some specific knowledge of how Delta Lake works.

My impression is that, if we were to offer a Delta Lake versioning option, we'd at the very least have to create some specific dataset classes that work with Delta Lake's format constraints. It would not be a "generic" kind of versioning; it would be something built specifically around Delta Lake's features.

@merelcht merelcht assigned noklam and ElenaKhaustova and unassigned ankatiyar Nov 25, 2024
@lrcouto
Contributor

lrcouto commented Nov 26, 2024

Here I've tried to integrate some DeltaLake functionality into a spaceflights-pandas starter project: https://github.com/lrcouto/spaceflights-delta.

I've created two dataset classes to take data from the .csv and .xlsx files (https://github.com/lrcouto/spaceflights-delta/tree/main/src/spaceflights_delta/extras), and the _delta_log files are created under /data/delta. I believe it would be possible to have a single class handle multiple file types, but that would end up requiring a lot of conditional logic and it would be kind of ugly. I also had to initialize the delta tables with a new node. Other than that, the nodes themselves did not need any changes, so someone who wanted to implement this feature in an already existing project wouldn't have to make major alterations to their pipelines.
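For illustration, this is roughly the shape such a dataset class takes; a minimal sketch, not the classes in the repo above, and the name PandasDeltaDataset plus its arguments are assumptions:

from typing import Any, Optional

import pandas as pd
from deltalake import DeltaTable, write_deltalake
from kedro.io import AbstractDataset


class PandasDeltaDataset(AbstractDataset[pd.DataFrame, pd.DataFrame]):
    """Hypothetical dataset that stores a pandas DataFrame as a Delta table."""

    def __init__(self, filepath: str, version: Optional[int] = None, save_args: Optional[dict] = None):
        self._filepath = filepath
        self._version = version  # Delta table version for time travel
        self._save_args = {"mode": "overwrite", **(save_args or {})}

    def _load(self) -> pd.DataFrame:
        # DeltaTable reads the _delta_log and materialises the requested version
        return DeltaTable(self._filepath, version=self._version).to_pandas()

    def _save(self, data: pd.DataFrame) -> None:
        # write_deltalake creates the table (and its _delta_log) on first write
        write_deltalake(self._filepath, data, **self._save_args)

    def _describe(self) -> dict:
        return {"filepath": self._filepath, "version": self._version}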

If we wished to spend some time implementing interfaces to access these _delta_log files, they could be used for queries on version number, date of creation, etc. It would be either some work for us or some work for the user. The way I'm seeing it, this would be attractive to users who are interested in specific Delta Lake features as well as in the possibility of versioning datasets.

This is using the DeltaLake Rust implementation that @astrojuanlu suggested, and it's significantly easier to use than the Spark implementation. It's interesting to point out that going to delta.io and then checking the documentation will lead you straight to the Spark setup, and if I ended up doing that, other users might do too. =p

@astrojuanlu
Member

I also had to initialize the delta tables with a new node.

👀 https://github.com/lrcouto/spaceflights-delta/blob/34530b6/src/spaceflights_delta/pipelines/data_processing/nodes.py#L5-L17, https://github.com/lrcouto/spaceflights-delta/blob/34530b6/conf/base/parameters_data_processing.yml

That's significant. Was that because otherwise the nodes would fail? Is it a one-off thing you have to do (some sort of "cold start" problem?)

At some point I had to use an undocumented deltalake.writer.try_get_deltatable (see discussion at delta-io/delta-rs#2662)

I think it's quite ugly if we have to tell users to initialize the tables like this themselves.

@astrojuanlu
Member

If we wished to spend some time implementing interfaces to access these _delta_log files, they could be used to for queries or version number, date of creation, etc.

I think that's already possible without manually interfacing with _delta_log? For example https://delta-io.github.io/delta-rs/usage/loading-table/#time-travel

>>> dt = DeltaTable("../rust/tests/data/simple_table", version=2)
>>> dt.load_version(1)
>>> dt.load_with_datetime("2021-11-04 00:05:23.283+00:00")
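And for the "version number, date of creation" part, the commit log is also exposed directly on DeltaTable (the table path here is hypothetical):

from deltalake import DeltaTable

dt = DeltaTable("data/delta/companies")  # hypothetical table path
dt.version()                             # current table version
dt.history(limit=5)                      # commit metadata: timestamp, operation, version, ...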

@astrojuanlu
Member

It's interesting to point out that going to delta.io and then checking the documentation will lead you straight to the Spark setup, and if I ended up doing that, other users might do too. =p

Indeed: delta-io/delta#2225

@lrcouto
Contributor

lrcouto commented Nov 26, 2024

I also had to initialize the delta tables with a new node.

👀 https://github.com/lrcouto/spaceflights-delta/blob/34530b6/src/spaceflights_delta/pipelines/data_processing/nodes.py#L5-L17, https://github.com/lrcouto/spaceflights-delta/blob/34530b6/conf/base/parameters_data_processing.yml

That's significant. Was that because otherwise the nodes would fail? Is it a one-off thing you have to do (some sort of "cold start" problem?)

At some point I had to use an undocumented deltalake.writer.try_get_deltatable (see discussion at delta-io/delta-rs#2662)

I think it's quite ugly if we have to tell users to initialize the tables like this themselves.

Yeah, without initializing it I kept getting a "DatasetError: Failed while loading data from dataset" error. I'm sure there's a more elegant way to solve this; I tried to do it in a hook but found a node easier. Either way, I agree that asking users to initialize the tables themselves would not be ideal.
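For reference, a rough sketch of what the hook route could look like, assuming the deltalake bindings and Kedro's before_pipeline_run hook; the table paths and schemas here are made up:

import pyarrow as pa
from deltalake import DeltaTable, write_deltalake
from deltalake.exceptions import TableNotFoundError
from kedro.framework.hooks import hook_impl

# Hypothetical mapping of Delta table paths to their schemas
DELTA_TABLES = {
    "data/delta/companies": pa.schema([("id", pa.string()), ("company_rating", pa.string())]),
}


class DeltaInitHooks:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        for path, schema in DELTA_TABLES.items():
            try:
                DeltaTable(path)  # raises if no _delta_log exists at the path yet
            except TableNotFoundError:
                # write an empty table so the first load doesn't fail on cold start
                write_deltalake(path, pa.Table.from_pylist([], schema=schema))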

@astrojuanlu astrojuanlu moved this from In Progress to In Review in Kedro Framework Nov 27, 2024
@astrojuanlu
Member

To note (this probably applies to Iceberg as well), the 2021 Lakehouse paper (Armbrust et al "Lakehouse: a new generation of open platforms that unify data warehousing and advanced analytics" https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf) does mention unstructured data:

[Screenshot of the relevant passage from the Lakehouse paper]

From a Solutions Architect working at Databricks in 2021:

Parquet supports unstructured/semi-structured types like:

  • StructType
  • MapType
  • ArrayType
  • BinaryType

https://groups.google.com/g/delta-users/c/HlwnoHs9hBQ/m/h6SDb4PEDgAJ

There are limitations though. For example, for images,

For large image files (average image size greater than 100 MB), Databricks recommends using the Delta table only to manage the metadata (list of file names) and loading the images from the object store using their paths when needed.

https://docs.databricks.com/en/machine-learning/reference-solutions/images-etl-inference.html#limitations-image-file-sizes

@noklam
Contributor

noklam commented Nov 28, 2024

@astrojuanlu I found that the HuggingFace documentation advises people to upload audio as Parquet files. I find this weird, as Parquet was designed for tables, i.e. row groups; its compression doesn't make much sense for audio.

On the other hand, there are zero references to image/audio in the Parquet documentation:

https://huggingface.co/docs/hub/en/datasets-audio
Parquet format
Instead of uploading the audio files and metadata as individual files, you can embed everything inside a Parquet file. This is useful if you have a large number of audio files, if you want to embed multiple audio columns, or if you want to store additional information about the audio in the same file. Parquet is also useful for storing data such as raw bytes, which is not supported by JSON/CSV.

It makes some sense to me for this use case (though it feels weird), especially since it's quite convenient to have one file where you can store metadata alongside the binary, i.e. the schema could look like "label", "filename", "height", "width", "category", "Image(binary)".
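As a small illustration (not taken from the HuggingFace docs; the folder and labels are made up), embedding raw bytes next to their metadata in a single Parquet file with pyarrow looks like this:

from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("filename", pa.string()), ("label", pa.string()), ("image", pa.binary())])
rows = [
    {"filename": p.name, "label": "cat", "image": p.read_bytes()}  # raw bytes go into a binary column
    for p in Path("images").glob("*.png")  # hypothetical local folder
]
pq.write_table(pa.Table.from_pylist(rows, schema=schema), "images.parquet")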

Parquet usually does not compress raw bytes (images) efficiently, but we still use it for images to be consistent with the other data types. https://discuss.huggingface.co/t/parquet-compression-for-image-dataset/64514

So in theory, it's possible to keep source code as a Parquet file. However, my question is, is it better than just saving it as a .zip file that simply has the same version name?

@astrojuanlu
Member

It makes some sense to me for this use case (though it feels weird), especially since it's quite convenient to have one file where you can store metadata alongside the binary, i.e. the schema could look like "label", "filename", "height", "width", "category", "Image(binary)".

Good point! I'm not used to storing data this way either but I see how it can be useful. I'm guessing many folks just use filepaths. Although for data exchange it does make sense to store the data inside the file.

So in theory, it's possible to keep source code as a Parquet file. However, my question is, is it better than just saving it as a .zip file that simply has the same version name?

I guess you're talking about the "code versioning" use case. I'm thinking more about storing the git sha somewhere, although that of course assumes the user uses git.
