[DRAFT] Separate file format from processing engine in datasets

Context

There is currently no clear contract for what a dataset does: it loads (or, in cases like Spark, connects to) data in some format, and you then need to make sure the node consuming the dataset matches the loaded format. This means you can never truly separate a node from its dataset and swap one out without changing the other, unless the two happen to be "compatible" (by unenforceable rules).
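As a minimal sketch (not from the issue; dataset names are illustrative), this is what that unenforceable compatibility looks like in practice: the node body hard-codes one engine's API, so only catalog entries that happen to return that engine's object can sit behind it.

```python
import pandas as pd


def aggregate_sales(sales: pd.DataFrame) -> pd.DataFrame:
    # This node silently assumes the upstream dataset loads a pandas DataFrame.
    return sales.groupby("region", as_index=False)["amount"].sum()


# If the catalog entry behind `sales` is swapped from a pandas-based dataset
# (e.g. a Parquet dataset returning pandas) to a Spark-based one, the node now
# receives a pyspark.sql.DataFrame and breaks, even though the files on disk
# are identical.
```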
If you want to support a new file format (e.g. Delta), you need to write a connector for each engine. In many cases this makes sense (perhaps there's nothing to reuse between the way Spark loads Delta and the way pandas loads Delta). In other cases, perhaps it shouldn't be necessary to define the loader in each place? Especially with things like dataframe interchange protocols coming into the picture.
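One hedged way to picture that (a sketch of the idea, not a proposed kedro-datasets API; the `deltalake` calls and library versions are assumptions worth double-checking): the file format gets a single loader that produces an engine-neutral table, and each engine becomes a thin conversion on top, e.g. via Arrow or the dataframe interchange protocol.

```python
import pandas as pd
from deltalake import DeltaTable  # delta-rs: reads the Delta format without Spark


def load_delta(path: str):
    """Format-level load: one reader for Delta, returning a pyarrow.Table."""
    return DeltaTable(path).to_pyarrow_table()


table = load_delta("data/01_raw/sales_delta")  # hypothetical path

# Engines become conversions rather than separate connectors:
pandas_df = pd.api.interchange.from_dataframe(table)  # pandas >= 1.5; pyarrow.Table exposes __dataframe__
# polars_df = polars.from_arrow(table)                # another engine, same loaded table
```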
The current design of kedro-datasets makes datasets entirely independent, so you can't reuse logic from one dataset in another. This is great in many ways (separation of dependencies), but also makes it impossible (I think?) to share loading code.
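Concretely (a simplified, hypothetical sketch; these are not the actual kedro-datasets classes), two engine-specific datasets for the same format each carry their own loading code, with nothing shared between them:

```python
class PandasCSVDataset:
    """Loads CSV with pandas; owns its own I/O plumbing."""

    def __init__(self, filepath: str):
        self._filepath = filepath

    def _load(self):
        import pandas as pd
        return pd.read_csv(self._filepath)


class PolarsCSVDataset:
    """Loads CSV with polars; duplicates the same plumbing independently."""

    def __init__(self, filepath: str):
        self._filepath = filepath

    def _load(self):
        import polars as pl
        return pl.read_csv(self._filepath)
```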
Inspired by:
> By adding this into kedro-datasets, there will be 3 possible ways of handling delta table:
Agree; although DeltaTable is often associated with Spark, it's actually just a file format, and you can read it via pandas or maybe other libraries later.
I think the current space is still dominated by Parquet for data processing; Delta files are usually larger due to the compression and version history. IMO the versioning features are quite important, and Delta deserves wider adoption outside of the Spark ecosystem.
I have no idea how compaction and re-partitioning would work with a non-Spark implementation? This feels like the responsibility of some kind of DB or data processing engine; it's probably too much for the Dataset abstraction. WDYT?
Originally posted by @noklam in #243 (comment)
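To make the quoted point concrete: reading a Delta table, including an older version of it, works without any Spark session. A hedged example using the `deltalake` (delta-rs) package; the path is made up and the exact API is worth double-checking.

```python
from deltalake import DeltaTable

# Latest snapshot, straight into pandas:
df = DeltaTable("data/01_raw/sales_delta").to_pandas()

# The version history that makes Delta heavier than plain Parquet also enables
# time travel, again with no Spark involved:
df_v0 = DeltaTable("data/01_raw/sales_delta", version=0).to_pandas()
```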
Possible Implementation

The purpose of this issue (thus far) is to raise some potential issues, but I don't have a good solution in mind. I'm also not 100% sure this is solvable, or that Kedro wants to solve this problem.
One half-baked thought is to make the "engine" on datasets a parameter of load/save. Then it is the dataset's responsibility to decide when to materialize the data more concretely.
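A half-baked sketch of that half-baked thought (everything here, the class name, the engine choices, the signatures, is hypothetical rather than an agreed design): the dataset owns the file format, and the engine is just an argument at load time.

```python
from deltalake import DeltaTable


class DeltaDataset:
    """Hypothetical format-first dataset: one per file format, engine chosen per call."""

    def __init__(self, filepath: str):
        self._filepath = filepath

    def load(self, engine: str = "pandas"):
        # Format-level read is engine-agnostic; Arrow acts as the common currency.
        table = DeltaTable(self._filepath).to_pyarrow_table()
        if engine == "pandas":
            return table.to_pandas()
        if engine == "polars":
            import polars as pl
            return pl.from_arrow(table)
        raise ValueError(f"Unsupported engine: {engine!r}")
```

The open question above still applies: when to materialize (e.g. keeping the Arrow table lazy versus converting eagerly) would become the dataset's call rather than the node's.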