[DRAFT] Separate file format from processing engine in datasets

Context

There is currently no clear contract for what a dataset does: it loads (or, in cases like Spark, connects to) data in some format, and you then need to make sure the node consuming the dataset matches the loaded format. This means you can never truly separate a node from its dataset and swap one out without changing the other, unless the two happen to be "compatible" (by unenforceable rules).
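As a minimal sketch (not from the issue; dataset names are illustrative), this is what that unenforceable compatibility looks like in practice: the node body hard-codes one engine's API, so only catalog entries that happen to return that engine's object can sit behind it.

```python
import pandas as pd


def aggregate_sales(sales: pd.DataFrame) -> pd.DataFrame:
    # This node silently assumes the upstream dataset loads a pandas DataFrame.
    return sales.groupby("region", as_index=False)["amount"].sum()


# If the catalog entry behind `sales` is swapped from a pandas-based dataset
# (e.g. a Parquet dataset returning pandas) to a Spark-based one, the node now
# receives a pyspark.sql.DataFrame and breaks, even though the files on disk
# are identical.
```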
If you want to support a new file format (e.g. Delta), you need to write a connector for each engine. In many cases this makes sense (perhaps there's nothing to reuse between the way Spark loads Delta and the way pandas loads Delta). In other cases, perhaps it shouldn't be necessary to define the loader in each place? Especially with things like dataframe interchange protocols coming into the picture.
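One hedged way to picture that (a sketch of the idea, not a proposed kedro-datasets API; the `deltalake` calls and library versions are assumptions worth double-checking): the file format gets a single loader that produces an engine-neutral table, and each engine becomes a thin conversion on top, e.g. via Arrow or the dataframe interchange protocol.

```python
import pandas as pd
from deltalake import DeltaTable  # delta-rs: reads the Delta format without Spark


def load_delta(path: str):
    """Format-level load: one reader for Delta, returning a pyarrow.Table."""
    return DeltaTable(path).to_pyarrow_table()


table = load_delta("data/01_raw/sales_delta")  # hypothetical path

# Engines become conversions rather than separate connectors:
pandas_df = pd.api.interchange.from_dataframe(table)  # pandas >= 1.5; pyarrow.Table exposes __dataframe__
# polars_df = polars.from_arrow(table)                # another engine, same loaded table
```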
The current design of kedro-datasets makes datasets entirely independent, so you can't reuse logic from one dataset in another. This is great in many ways (separation of dependencies), but also makes it impossible (I think?) to share loading code.
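Concretely (a simplified, hypothetical sketch; these are not the actual kedro-datasets classes), two engine-specific datasets for the same format each carry their own loading code, with nothing shared between them:

```python
class PandasCSVDataset:
    """Loads CSV with pandas; owns its own I/O plumbing."""

    def __init__(self, filepath: str):
        self._filepath = filepath

    def _load(self):
        import pandas as pd
        return pd.read_csv(self._filepath)


class PolarsCSVDataset:
    """Loads CSV with polars; duplicates the same plumbing independently."""

    def __init__(self, filepath: str):
        self._filepath = filepath

    def _load(self):
        import polars as pl
        return pl.read_csv(self._filepath)
```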
Inspired by:
> By adding this into kedro-datasets, there will be 3 possible ways of handling delta table:
Agree; although DeltaTable is often associated with Spark, it's actually just a file format, and you can read it via pandas or maybe other libraries later.
I think the current space is still dominated by Parquet for data processing; Delta files are usually larger due to the compression and version history. IMO the versioning features are quite important, and Delta deserves wider adoption outside of the Spark ecosystem.
I have no idea how compaction and re-partitioning would work with a non-Spark implementation? This feels like the responsibility of some kind of DB or data processing engine; it's probably too much for the Dataset abstraction. WDYT?
Originally posted by @noklam in #243 (comment)
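To make the quoted point concrete: reading a Delta table, including an older version of it, works without any Spark session. A hedged example using the `deltalake` (delta-rs) package; the path is made up and the exact API is worth double-checking.

```python
from deltalake import DeltaTable

# Latest snapshot, straight into pandas:
df = DeltaTable("data/01_raw/sales_delta").to_pandas()

# The version history that makes Delta heavier than plain Parquet also enables
# time travel, again with no Spark involved:
df_v0 = DeltaTable("data/01_raw/sales_delta", version=0).to_pandas()
```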
Possible Implementation

The purpose of this issue (thus far) is to raise some potential issues, but I don't have a good solution in mind. I'm also not 100% sure this is solvable, or that Kedro wants to solve this problem.
One half-baked thought is to make the "engine" on datasets a parameter of load/save. Then it is the dataset's responsibility to decide when to materialize the data more concretely.
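A half-baked sketch of that half-baked thought (everything here, the class name, the engine choices, the signatures, is hypothetical rather than an agreed design): the dataset owns the file format, and the engine is just an argument at load time.

```python
from deltalake import DeltaTable


class DeltaDataset:
    """Hypothetical format-first dataset: one per file format, engine chosen per call."""

    def __init__(self, filepath: str):
        self._filepath = filepath

    def load(self, engine: str = "pandas"):
        # Format-level read is engine-agnostic; Arrow acts as the common currency.
        table = DeltaTable(self._filepath).to_pyarrow_table()
        if engine == "pandas":
            return table.to_pandas()
        if engine == "polars":
            import polars as pl
            return pl.from_arrow(table)
        raise ValueError(f"Unsupported engine: {engine!r}")
```

The open question above still applies: when to materialize (e.g. keeping the Arrow table lazy versus converting eagerly) would become the dataset's call rather than the node's.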