[Versioning]: Explore Kedro + Delta Lake for versioning #4240
Comments
Still exploring the implementation, but here are some insights from what I've found. For the setup: if you're already using Spark on your project, you don't need to add it as a dependency for Delta Lake. If you're adding Spark just for versioning, the Java setup can be a bit of a headache and the dependencies are pretty chunky in file size. For data formats, Delta Lake uses Parquet under the hood, so it'll work better with data that follows a similar structure; making it work with unstructured data might be more work than it's worth. Looking at the PolarsDeltaDataset class that @astrojuanlu made for his demo, I'm wondering if it would be possible to make a generic class to make several different formats work with the Delta Lake tables.
Good question... I guess this is equivalent to avoiding making a PandasCSVDataset and a PolarsCSVDataset.
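For illustration, here's a minimal sketch of what such a generic class could look like, assuming the `deltalake` Rust bindings and Kedro's `AbstractDataset` API. `GenericDeltaDataset` is a hypothetical name, not an existing dataset:

```python
from typing import Optional

import pyarrow as pa
from deltalake import DeltaTable, write_deltalake
from kedro.io import AbstractDataset


class GenericDeltaDataset(AbstractDataset[pa.Table, pa.Table]):
    """Load/save a Delta table as a pyarrow.Table."""

    def __init__(self, filepath: str, version: Optional[int] = None):
        self._filepath = filepath
        self._version = version

    def _load(self) -> pa.Table:
        # Passing an explicit version gives Delta's "time travel".
        dt = DeltaTable(self._filepath, version=self._version)
        return dt.to_pyarrow_table()

    def _save(self, data: pa.Table) -> None:
        # Every write records a new version in the _delta_log.
        write_deltalake(self._filepath, data, mode="overwrite")

    def _describe(self) -> dict:
        return {"filepath": self._filepath, "version": self._version}
```

Since pandas, Polars, and other dataframe libraries can all convert to and from Arrow tables, a single class like this could serve several formats without a per-library split.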
Here are some further impressions after more exploration: starting a project with the intention of using Delta Lake for versioning from the outset is likely viable. What I've been trying to do is gauge how difficult it'd be to retrofit an already existing project. From what I'm seeing here, this would not be trivial and would require some specific knowledge of how Delta Lake works. My impression is that, if we were to offer a Delta Lake versioning option, we'd have at the very least to create some specific dataset classes that work with Delta Lake's format constraints. It would not be a "generic" kind of versioning; it would be something built specifically around Delta Lake's features.
Here I've tried to integrate some Delta Lake functionality into a spaceflights-pandas starter project: https://github.com/lrcouto/spaceflights-delta. I've created two dataset classes to take data from the .csv and .xlsx files (https://github.com/lrcouto/spaceflights-delta/tree/main/src/spaceflights_delta/extras), and the _delta_log files are created under /data/delta. I believe it would be possible to have a single class handle multiple file types, but that would end up requiring a lot of conditional logic and it would be kind of ugly. I also had to initialize the Delta tables with a new node (see the sketch below). Other than that, the nodes themselves did not need any changes, so someone who wanted to implement this feature in an already existing project wouldn't have to make major alterations to their pipelines.

If we wished to spend some time implementing interfaces to access these _delta_log files, they could be used for queries on version number, date of creation, etc. It would be either some work from us, or some work from the user. The way I'm seeing it, this would be attractive to users who are interested in specific Delta Lake features as well as in the possibility of versioning datasets. This is using the Delta Lake Rust implementation that @astrojuanlu suggested, and it's significantly easier to use than the Spark implementation. It's interesting to point out that going to delta.io and then checking the documentation will lead you straight to the Spark setup, and if I ended up doing that, other users might too. =p
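A minimal sketch of what such an initialization node might look like (not the exact code from the repo; the function name and target path are illustrative):

```python
import pandas as pd
from deltalake import write_deltalake


def init_delta_table(companies: pd.DataFrame) -> pd.DataFrame:
    # A first write creates data/delta/companies/_delta_log (version 0),
    # after which loads of the corresponding Delta dataset succeed.
    write_deltalake("data/delta/companies", companies, mode="overwrite")
    return companies
```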
👀 https://github.com/lrcouto/spaceflights-delta/blob/34530b6/src/spaceflights_delta/pipelines/data_processing/nodes.py#L5-L17, https://github.com/lrcouto/spaceflights-delta/blob/34530b6/conf/base/parameters_data_processing.yml

That's significant. Was that because otherwise the nodes would fail? Is it a one-off thing you have to do (some sort of "cold start" problem)? At some point I had to use an undocumented … I think it's quite ugly if we have to tell users to initialize the tables like this themselves.
I think that's already possible without manually interfacing with the `_delta_log` files:

```python
>>> dt = DeltaTable("../rust/tests/data/simple_table", version=2)
>>> dt.load_version(1)
>>> dt.load_with_datetime("2021-11-04 00:05:23.283+00:00")
```
Indeed: delta-io/delta#2225
Yeah, without initializing it I kept getting a "DatasetError: Failed while loading data from dataset" error. I'm sure there's a more elegant way to solve this; I tried to do it in a hook but found a node easier. Either way, I agree that asking users to initialize the tables themselves would not be ideal.
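For reference, a rough sketch of what the hook-based alternative could look like. `before_pipeline_run` is a real Kedro hook spec, but the table mapping and paths here are hypothetical:

```python
import pandas as pd
from deltalake import write_deltalake
from kedro.framework.hooks import hook_impl


class DeltaInitHooks:
    # Hypothetical mapping of source CSVs to Delta table locations.
    TABLES = {"data/01_raw/companies.csv": "data/delta/companies"}

    @hook_impl
    def before_pipeline_run(self, run_params, pipeline, catalog):
        for csv_path, delta_path in self.TABLES.items():
            # mode="ignore" skips tables that already exist, so the hook
            # is safe to run on every pipeline execution.
            write_deltalake(delta_path, pd.read_csv(csv_path), mode="ignore")
```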
To note (this probably applies to Iceberg as well), the 2021 Lakehouse paper (Armbrust et al., "Lakehouse: a new generation of open platforms that unify data warehousing and advanced analytics", https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf) does mention unstructured data. And from a Solutions Architect working at Databricks in 2021 (https://groups.google.com/g/delta-users/c/HlwnoHs9hBQ/m/h6SDb4PEDgAJ): there are limitations though. For example, for images, …
@astrojuanlu I found that the HuggingFace documentation advises people to upload audio as Parquet files. I find this weird, as Parquet was designed for tabular data, i.e. row groups; compression doesn't make much sense for audio. On the other hand, there are zero references to image/audio in the Parquet documentation.
It makes some sense to me for this use case (though it feels weird), especially since it's quite convenient to have one file where you can store metadata alongside the binary, i.e. the schema could look like "label", "filename", "height", "width", "category", "Image(binary)".
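For illustration, a minimal sketch of that schema with pyarrow (the values and paths are placeholders):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# One row per image: metadata columns plus the raw bytes in a single file.
table = pa.table(
    {
        "label": ["cat"],
        "filename": ["cat_001.png"],
        "height": [256],
        "width": [256],
        "category": ["pets"],
        "image": [b"\x89PNG..."],  # stand-in for the actual file contents
    }
)
pq.write_table(table, "images.parquet")
```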
So in theory, it's possible to keep source code as a Parquet file. However, my question is: is it better than just saving it as a …
Good point! I'm not used to storing data this way either, but I see how it can be useful. I'm guessing many folks just use filepaths, although for data exchange it does make sense to store the data inside the file.
I guess you're talking about the "code versioning" use case. I'm thinking more about storing the git sha somewhere, although that of course assumes the user uses git.
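A minimal sketch of that idea, assuming the project is a git repository (the output location is illustrative):

```python
import subprocess

# Record the commit that produced this run next to the versioned data.
sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
with open("data/delta/GIT_SHA", "w") as f:
    f.write(sha)
```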
Description
At the current stage, by versioning we mean mapping a single version number to the corresponding versions of parameters, I/O data, and code, so that one is able to retrieve a full project state, including data, at any point in time.
The goal is to check whether we can use Delta Lake to map a single version number to code, parameters, and I/O data within Kedro, and how it aligns with Kedro’s workflow.
As a result, we expect a working example of a Kedro project used with Delta Lake for versioning, and some assumptions on:
Context
#4199
@astrojuanlu:
Market research