
[Versioning]: Explore Kedro + Delta Lake for versioning #4240

Open
ElenaKhaustova opened this issue Oct 17, 2024 · 11 comments

@ElenaKhaustova
Contributor

ElenaKhaustova commented Oct 17, 2024

Description

At the current stage, by versioning we mean mapping a single version number to the corresponding versions of parameters, I/O data, and code, so that one is able to retrieve the full project state, including data, at any point in time.

The goal is to check if we can use Delta Lake to map a single version number to code, parameters, and I/O data within Kedro and how it aligns with Kedro’s workflow.

As a result, we expect a working example of a Kedro project that uses Delta Lake for versioning, along with findings on:

  • whether it solves the main task and what the constraints are;
  • how easy it is to set up;
  • what the workflow looks like;
  • whether any changes are required on the Kedro side;
  • what data formats are supported;
  • how easy it is to work with local/remote storage;
  • how demanding it is in terms of dependencies.

Context

#4199

@astrojuanlu:

Kedro + Delta Lake is not only possible, but works really well. This is the demo I showed at the August 22nd Coffee Chat: https://github.com/astrojuanlu/kedro-deltalake-demo

Market research

@lrcouto
Contributor

lrcouto commented Nov 19, 2024

Still exploring the implementation, but here are some insights from what I've found so far.

For the setup: if you're already using Spark in your project, you don't need to add it as a dependency for Delta Lake. If you're going to add it just for versioning, the Java setup can be a bit of a headache and the dependencies are pretty chunky in file size.

For data formats, DeltaLake uses the parquet format under the hood so it'll work better with data that follows a similar structure. Making it work with unstructured data might be more work than it's worth. Looking at the PolarsDeltaDataset class that @astrojuanlu made for his demo, I'm wondering if it would be possible to make a generic class to make several different formats work with the DeltaLake tables.
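For context, reading the same Delta table into different dataframe libraries is already a one-liner each with the delta-rs bindings and Polars' built-in reader; a minimal sketch (the table path here is hypothetical), which is roughly the duplication a generic class would try to avoid:

import polars as pl
from deltalake import DeltaTable

df_pandas = DeltaTable("data/delta/companies").to_pandas()  # delta-rs Python bindings
df_polars = pl.read_delta("data/delta/companies")           # Polars reads Delta tables natively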

@astrojuanlu
Member

I'm wondering if it would be possible to make a generic class to make several different formats work with the DeltaLake tables.

Good question... I guess this is equivalent to avoiding having both a PandasCSVDataset and a PolarsCSVDataset.

@lrcouto
Contributor

lrcouto commented Nov 25, 2024

Here are some further impressions after more exploration:

Starting a project with the intention of using Delta Lake for versioning from the outset is likely viable. What I've been trying to do is gauge how difficult it'd be to adapt an already existing project. From what I'm seeing here, this would not be trivial and would require some specific knowledge of how Delta Lake works.

My impression is that, if we were to offer a Delta Lake versioning option, we'd at the very least have to create some specific dataset classes that work with Delta Lake's format constraints. It would not be a "generic" kind of versioning; it would be something built specifically around Delta Lake's features.

@merelcht merelcht assigned noklam and ElenaKhaustova and unassigned ankatiyar Nov 25, 2024
@lrcouto
Contributor

lrcouto commented Nov 26, 2024

Here I've tried to integrate some DeltaLake functionality into a spaceflights-pandas starter project: https://github.com/lrcouto/spaceflights-delta.

I've created two dataset classes to take data from the .csv and .xlsx files (https://github.com/lrcouto/spaceflights-delta/tree/main/src/spaceflights_delta/extras), and the _delta_log files are created under /data/delta. I believe it would be possible to have a single class handle multiple file types, but that would end up requiring a lot of conditional logic and it would be kind of ugly. I also had to initialize the delta tables with a new node. Other than that, the nodes themselves did not need any changes, so someone who wanted to implement this feature in an already existing project wouldn't have to make major alterations to their pipelines.
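For illustration, this is roughly the shape such a dataset class takes; a minimal sketch, not the classes in the repo above, and the name PandasDeltaDataset plus its arguments are assumptions:

from typing import Any, Optional

import pandas as pd
from deltalake import DeltaTable, write_deltalake
from kedro.io import AbstractDataset


class PandasDeltaDataset(AbstractDataset[pd.DataFrame, pd.DataFrame]):
    """Hypothetical dataset that stores a pandas DataFrame as a Delta table."""

    def __init__(self, filepath: str, version: Optional[int] = None, save_args: Optional[dict] = None):
        self._filepath = filepath
        self._version = version  # Delta table version for time travel
        self._save_args = {"mode": "overwrite", **(save_args or {})}

    def _load(self) -> pd.DataFrame:
        # DeltaTable reads the _delta_log and materialises the requested version
        return DeltaTable(self._filepath, version=self._version).to_pandas()

    def _save(self, data: pd.DataFrame) -> None:
        # write_deltalake creates the table (and its _delta_log) on first write
        write_deltalake(self._filepath, data, **self._save_args)

    def _describe(self) -> dict:
        return {"filepath": self._filepath, "version": self._version}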

If we wished to spend some time implementing interfaces to access these _delta_log files, they could be used for queries on version number, date of creation, etc. It would be either some work for us or some work for the user. The way I'm seeing it, this would be attractive to users who are interested in specific Delta Lake features as well as in the possibility of versioning datasets.

This is using the DeltaLake Rust implementation that @astrojuanlu suggested, and it's significantly easier to use than the Spark implementation. It's interesting to point out that going to delta.io and then checking the documentation will lead you straight to the Spark setup, and if I ended up doing that, other users might do too. =p

@astrojuanlu
Member

I also had to initialize the delta tables with a new node.

👀 https://github.com/lrcouto/spaceflights-delta/blob/34530b6/src/spaceflights_delta/pipelines/data_processing/nodes.py#L5-L17, https://github.com/lrcouto/spaceflights-delta/blob/34530b6/conf/base/parameters_data_processing.yml

That's significant. Was that because otherwise the nodes would fail? Is it a one-off thing you have to do (some sort of "cold start" problem?)

At some point I had to use an undocumented deltalake.writer.try_get_deltatable (see discussion at delta-io/delta-rs#2662)

I think it's quite ugly if we have to tell users to initialize the tables like this themselves.

@astrojuanlu
Member

If we wished to spend some time implementing interfaces to access these _delta_log files, they could be used to for queries or version number, date of creation, etc.

I think that's already possible without manually interfacing with _delta_log? For example https://delta-io.github.io/delta-rs/usage/loading-table/#time-travel

>>> dt = DeltaTable("../rust/tests/data/simple_table", version=2)
>>> dt.load_version(1)
>>> dt.load_with_datetime("2021-11-04 00:05:23.283+00:00")
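And for the "version number, date of creation" part, the commit log is also exposed directly on DeltaTable (the table path here is hypothetical):

from deltalake import DeltaTable

dt = DeltaTable("data/delta/companies")  # hypothetical table path
dt.version()                             # current table version
dt.history(limit=5)                      # commit metadata: timestamp, operation, version, ...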

@astrojuanlu
Member

It's interesting to point out that going to delta.io and then checking the documentation will lead you straight to the Spark setup, and if I ended up doing that, other users might do too. =p

Indeed: delta-io/delta#2225

@lrcouto
Contributor

lrcouto commented Nov 26, 2024

I also had to initialize the delta tables with a new node.

👀 https://github.com/lrcouto/spaceflights-delta/blob/34530b6/src/spaceflights_delta/pipelines/data_processing/nodes.py#L5-L17, https://github.com/lrcouto/spaceflights-delta/blob/34530b6/conf/base/parameters_data_processing.yml

That's significant. Was that because otherwise the nodes would fail? Is it a one-off thing you have to do (some sort of "cold start" problem?)

At some point I had to use an undocumented deltalake.writer.try_get_deltatable (see discussion at delta-io/delta-rs#2662)

I think it's quite ugly if we have to tell users to initialize the tables like this themselves.

Yeah, without initializing it I kept getting a "DatasetError: Failed while loading data from dataset" error. I'm sure there's a more elegant way to solve this; I tried to do it in a hook but found a node easier. Either way, I agree that asking users to initialize the tables themselves would not be ideal.
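For reference, a rough sketch of what the hook route could look like, assuming the deltalake bindings and Kedro's before_pipeline_run hook; the table paths and schemas here are made up:

import pyarrow as pa
from deltalake import DeltaTable, write_deltalake
from deltalake.exceptions import TableNotFoundError
from kedro.framework.hooks import hook_impl

# Hypothetical mapping of Delta table paths to their schemas
DELTA_TABLES = {
    "data/delta/companies": pa.schema([("id", pa.string()), ("company_rating", pa.string())]),
}


class DeltaInitHooks:
    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        for path, schema in DELTA_TABLES.items():
            try:
                DeltaTable(path)  # raises if no _delta_log exists at the path yet
            except TableNotFoundError:
                # write an empty table so the first load doesn't fail on cold start
                write_deltalake(path, pa.Table.from_pylist([], schema=schema))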

@astrojuanlu astrojuanlu moved this from In Progress to In Review in Kedro Framework Nov 27, 2024
@astrojuanlu
Member

To note (this probably applies to Iceberg as well), the 2021 Lakehouse paper (Armbrust et al "Lakehouse: a new generation of open platforms that unify data warehousing and advanced analytics" https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf) does mention unstructured data:

[Screenshot of the relevant passage from the Lakehouse paper]

From a Solutions Architect working at Databricks in 2021:

Parquet supports unstructured/semi-structured types like:

  • StructType
  • MapType
  • ArrayType
  • BinaryType

https://groups.google.com/g/delta-users/c/HlwnoHs9hBQ/m/h6SDb4PEDgAJ

There are limitations though. For example, for images,

For large image files (average image size greater than 100 MB), Databricks recommends using the Delta table only to manage the metadata (list of file names) and loading the images from the object store using their paths when needed.

https://docs.databricks.com/en/machine-learning/reference-solutions/images-etl-inference.html#limitations-image-file-sizes

@noklam
Contributor

noklam commented Nov 28, 2024

@astrojuanlu I found that the HuggingFace documentation advises people to upload audio as Parquet files. I find this weird, as Parquet was designed for tables, i.e. row groups; its compression doesn't make much sense for audio.

On the other hand, there are zero references to image/audio in the Parquet documentation:

https://huggingface.co/docs/hub/en/datasets-audio
Parquet format
Instead of uploading the audio files and metadata as individual files, you can embed everything inside a Parquet file. This is useful if you have a large number of audio files, if you want to embed multiple audio columns, or if you want to store additional information about the audio in the same file. Parquet is also useful for storing data such as raw bytes, which is not supported by JSON/CSV.

It makes some sense to me for this use case (though it feels weird), especially since it's quite convenient to have one file where you can store metadata alongside the binary, i.e. the schema could look like "label", "filename", "height", "width", "category", "Image(binary)".
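As a small illustration (not taken from the HuggingFace docs; the folder and labels are made up), embedding raw bytes next to their metadata in a single Parquet file with pyarrow looks like this:

from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("filename", pa.string()), ("label", pa.string()), ("image", pa.binary())])
rows = [
    {"filename": p.name, "label": "cat", "image": p.read_bytes()}  # raw bytes go into a binary column
    for p in Path("images").glob("*.png")  # hypothetical local folder
]
pq.write_table(pa.Table.from_pylist(rows, schema=schema), "images.parquet")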

Parquet usually does not compress raw bytes (images) efficiently, but we still use it for images to be consistent with the other data types. https://discuss.huggingface.co/t/parquet-compression-for-image-dataset/64514

So in theory, it's possible to keep source code as a Parquet file. However, my question is, is it better than just saving it as a .zip file that simply has the same version name?

@astrojuanlu
Member

It makes some sense to me for this use case (though it feels weird), especially since it's quite convenient to have one file where you can store metadata alongside the binary, i.e. the schema could look like "label", "filename", "height", "width", "category", "Image(binary)".

Good point! I'm not used to storing data this way either but I see how it can be useful. I'm guessing many folks just use filepaths. Although for data exchange it does make sense to store the data inside the file.

So in theory, it's possible to keep source code as a Parquet file. However, my question is, is it better than just saving it as a .zip file that simply has the same version name?

I guess you're talking about the "code versioning" use case. I'm thinking more about storing the git sha somewhere, although that of course assumes the user uses git.
