Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move AbstractDataSet to Kedro-Plugins #2409

Closed
bpmeek opened this issue Mar 9, 2023 · 21 comments
Closed

Move AbstractDataSet to Kedro-Plugins #2409

bpmeek opened this issue Mar 9, 2023 · 21 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@bpmeek
Copy link

bpmeek commented Mar 9, 2023

Description

I wouldd like to be able to share implementations of AbstractDataSet with non-Kedro users.

Context

I have a library with several implementations of AbstractDataSet that I use to access proprietary data connectors at my employer, I would like to share this library with coworkers that are not using Kedro for use through the code API without the overhead of a full Kedro install.

Possible Implementation

Move the contents of kedro.io.core.py to a standalone plugin within the Kedro-Plugins repo and migrate imports away from core.py to the new plugin

Possible Alternatives

An alternative solution would be to use setuptools to build a second module with only the AbstractDataSet

@bpmeek bpmeek added the Issue: Feature Request New feature or improvement to existing feature label Mar 9, 2023
@deepyaman
Copy link
Member

Thanks for the feature request! I personally think it's an interesting enabler, because a lot of other tools/workflows lack this nice data abstraction (something I've heard from other MLEs, as well as personally experienced in deploying to and looking at the examples for Azure ML Pipelines, Kubeflow Pipelines, Argo Workflows, etc.).

I do recall that there was some discussion around whether AbstractDataSet, etc. should move to kedro-datasets, too, and I remember some arguments that it was very core to how Kedro functioned. However, that's not to say that it should be impossible to componentize Kedro more, and to enable use of standalone components.

Move the contents of kedro.io.core.py to a standalone plugin within the Kedro-Plugins repo and migrate imports away from core.py to the new plugin

Would it not make sense to just move it all to the existing (relatively new) kedro-datasets plugin then?

@bpmeek
Copy link
Author

bpmeek commented Mar 9, 2023

I initially thought of that as well, but kedro-datasets requires kedro so for this specific use-case it wouldn't provide any benefit, so I'm not sure if it would be better to move it to Kedro-Datasets and then optimize the imports to only require a Kedro install where actually necessary or to move it to a standalone plugin.

@deepyaman
Copy link
Member

deepyaman commented Mar 9, 2023

I initially thought of that as well, but kedro-datasets requires kedro so for this specific use-case it wouldn't provide any benefit, so I'm not sure if it would be better to move it to Kedro-Datasets and then optimize the imports to only require a Kedro install where actually necessary or to move it to a standalone plugin.

I think it only requires stuff from kedro.io.core, which is what you wanted to move out in the first place. (Could be wrong, just checked one dataset.)

@bpmeek
Copy link
Author

bpmeek commented Mar 9, 2023

I initially thought of that as well, but kedro-datasets requires kedro so for this specific use-case it wouldn't provide any benefit, so I'm not sure if it would be better to move it to Kedro-Datasets and then optimize the imports to only require a Kedro install where actually necessary or to move it to a standalone plugin.

I think it only requires stuff from kedro.io.core, which is what you wanted to move out in the first place. (Could be wrong, just checked one dataset.)

I believe you're correct, and if the team decides that is the better path then I'm all for it.

@noklam
Copy link
Contributor

noklam commented Mar 9, 2023

#1758 (comment)

This is the relevant discussion we had.

@bpmeek
Copy link
Author

bpmeek commented Mar 9, 2023

Thanks for the link, it seems there was a lot of discussion on the topic which I understand completely, but it does seem like a lot of users were advocating for AbstractDataSet to be moved to kedro-datasets and I obviously concur with their points. I think this could also be framed as an inheritance vs composition question.

I've submitted a PR with the changes made, please let me know if there's anything else I can do to help with.

I suppose another option could be to create an entirely different repository, something like miniconda vs anaconda.

@merelcht
Copy link
Member

I'm very interested in your statement: "It does seem like a lot of users were advocating for AbstractDataSet", from the discussions we had previously we only had 1 or two users commenting, but not such a significant number that it convinced us we should move. Have you noticed other people, outside your company, talking about the need for AbstractDataSet to be outside of core Kedro?

@merelcht merelcht added the Community Issue/PR opened by the open-source community label Mar 13, 2023
@bpmeek
Copy link
Author

bpmeek commented Mar 13, 2023

I'm very interested in your statement: "It does seem like a lot of users were advocating for AbstractDataSet", from the discussions we had previously we only had 1 or two users commenting, but not such a significant number that it convinced us we should move. Have you noticed other people, outside your company, talking about the need for AbstractDataSet to be outside of core Kedro?

Sorry, that was probably an inaccurate phrasing on my part, and some assumptions.

But to answer your question, since I've joined the conversation, I've only seen a couple people advocate for the move. That's also not completely surprising, most Kedro users likely won't care where AbstractDataSet lives because they'll have to have both Kedro and Kedro-datasets installed, my specific use case would be for the benefit of non-Kedro users. But there were some really good points made around what Kedro being a dependency for Kedro-datasets means for production upgrades.

@antonymilne
Copy link
Contributor

antonymilne commented Mar 13, 2023

@bpmeek in case you didn't already come across it, in #1776 I originally proposed something very similar, which was decided against (for good reasons I think).

In my mind, many of the advantages posted there still hold true and it would make more sense for AbstractDataSet etc. to live in kedro-datasets rather than in core. However, it was decided before that using the kedro dataset abstraction outside kedro was just an edge case and not many people would want to do it, but you are a good example of exactly that, and actually I have a similar case to you. But still I'm not sure there are enough people to want to do this that we should make this change unless you can also think of a way around the problems that were mentioned in #1776 (comment)?

In practice, what is the actual disadvantage of the current system that you find? As far as I understand, the only real difference for you is that you are required to pip install kedro + all its dependencies while you'd (understandably) like to just do pip install kedro-datasets[...]. Is this what you mean by "the overhead of a full Kedro install" or is it something else? kedro + dependencies doesn't feel that big to me, though I understand it's heavier than what the minimum that you might like to be necessary.

@Galileo-Galilei
Copy link
Member

Just to give my two cents (and also because I was involved in the initial conversation):

  • I have users complaining that the current kedro-datasets is conflicting with kedro<=0.18.3. This is really a loss because it is almost portable "as is" to old kedro versions, including some legacy 0.16.X I would like to migrate step by step. This is one of the concern we anticipated in the old discussion
  • As the OP, I have a bucnh of use cases where I introduced kedro to data analysts who are not familiar with python. I introduced Kedro to them though the catalog and a notebook (with extra facilities to do SQL inside). They really enjoy the abstraction and the ability to use data from very differetn sources (old Access databases, files from s3 storage, parquet table from internal datalake, SAS for old datawarehouse... and save there results in excel). This avoids copying data and therefore increase security / data management / development speed. Kedro is a bit scary to them and I'd like to introduce the abstraction separately. Not installing kedro but only kedro-datasets also limits the risk of versions conflicts by decreasing the potential conflicts area with "exotic" datasets.
  • When experimenting on Allow new attributes to be added to DataSets #1076 I found that I would likely need to modify the AbstractDataSet class, hence this will force kedro-datasets to pin a given high version of kedro, which is something we want to avoid. If they lived in the same repo, it would avoid breaking changes (because kedro-datasets would be "internally consistent") and kedro API to kedro-datasets (i.e. the load and save methods) would remain unchanged.

So my original requests still holds :)

To be honest I don't really understand the original argument:

(AntonyMilneQB commented on Sep 1, 2022) [@idanov] thinks all implementations of AbstractDataSet should be treated like they are written by a third party and split out. If we change AbstractDataSet then all third parties (us + anyone else implementing a dataset) will need to modify to match the new interface. MemoryDataSet and the like are "core" dataset implementations and so remain in kedro

I understand you want to separate "team made" versus "third party" datasets, but it seems that separating the AbstractDataSet from the library where datasets lives has the opposite effect. Whatever happen, if Abstract DataSet is modified, all dependent datasets will need to be modified. This seems easier to maintain "intra-library" consistency rather than consistency between several librairies which have different release lifecycle and pinned dependencies.

@bpmeek
Copy link
Author

bpmeek commented Mar 14, 2023

But still I'm not sure there are enough people to want to do this that we should make this change unless you can also think of a way around the problems that were mentioned in #1776 (comment)?

Correct me if I'm wrong, but if the primary concern is ensuring backwards compatibility with third-party developed tools, Could we not replace the current contents of kedro.io.core with something along the lines of (assuming a merge with my current PR):

from kedro_datasets.io.core import AbstractDataSet, ...
__all__ = [AbstractDataSet, ...]

That way no current integrations are broken, and a deprecation warning could even be thrown if the decision is made to completely remove it at a later date.

Also, with a great amount of respect to @idanov, while the AbstractDataSet and MemoryDataSet are essential to Kedro Runners, looking at the FAQ on Kedro's Documentation:

We focus on a different problem, which is the process of authoring pipelines, as opposed to running

So while these classes are core to the Runner, I don't view the Runner as being core to Kedro, and there may come a day that users can decide between multiple runners.

In practice, what is the actual disadvantage of the current system that you find? As far as I understand, the only real difference for you is that you are required to pip install kedro + all its dependencies while you'd (understandably) like to just do pip install kedro-datasets[...]. Is this what you mean by "the overhead of a full Kedro install" or is it something else? kedro + dependencies doesn't feel that big to me, though I understand it's heavier than what the minimum that you might like to be necessary.

This is the only disadvantage I personally have encountered, but I could easily envision the issues that @Galileo-Galilei mentioned in his earlier comments.

@idanov
Copy link
Member

idanov commented Mar 22, 2023

I am really confused by this conversation, not sure I understand the pain point here. kedro is the smaller of the two, kedro-datasets is the one that has huge amount of dependencies. If you are installing kedro-datasets, why is kedro being an extra dependency to that in any way different than, let's say pandas? Why are the Data Scientists liking the DataSets abstraction so bothered by having kedro installed alongside? No one requires them to use kedro, if they just want the datasets, they simply need to depend on it, as they will depend on pandas or another package.

Please could you provide some specific reasoning why this is necessary at all @bpmeek?

Introducing dependency on kedro-datasets to kedro defeats the purpose of separating those. kedro in principle should and can work without kedro-datasets, and people can create their own separate packages as they see fit. Making kedro depend on kedro-datasets will make it impossible for kedro users to use newer versions of kedro-datasets, since there will be version conflicts on installation. Moreover, a breaking change in kedro-datasets will immediately trigger a breaking change in kedro, which is exactly why we split them.

AbstractDataSet is just the interface, in all frameworks the base interface belongs to the framework, because this is the abstraction the whole framework is programmed against. kedro-datasets is just an extension to that framework, which provides a package of useful datasets, but nothing prevents people of creating a set of community datasets that can be used and (who knows) maybe someday become more popular than the original set of datasets. However, in order to ensure this is possible and people can have kedro-datasets, kedro-community-dataset, kedro-my-company-datasets independently from each other, they need to share an interface, which kedro understands and that interface is AbstractDataSet.

I have users complaining that the current kedro-datasets is conflicting with kedro<=0.18.3. This is really a loss because it is almost portable "as is" to old kedro versions, including some legacy 0.16.X I would like to migrate step by step. This is one of the concern we anticipated in the old discussion

To me this is the real problem here, because for unknown reasons, the kedro-datasets package is restricted to use:
kedro~=0.18.3, where the AbstractDataSet, which all of kedro-datasets use, hasn't changed since prehistoric times. So I see no reason not to relax this to kedro > 0.x.y, where 0.x.y is the minimal version where AbstractDataSet changed. This pinning is unnecessary and actually very much against what this split was meant to do. So the only thing we need to to in order to solve all of the problems listed in this thread is to relax the pinning of kedro-datasets, so any Kedro version can use it, be it old or new.

Just as a reminder for the goal of the split:

  • kedro and kedro-datasets develop at a different pace
  • kedro-dataset only depend on one thing from Kedro, namely AbstractDataSet
  • breaking changes in kedro shouldn't influence kedro-datasets, unless they are in AbstractDataSet
  • breaking changes in kedro-datasets shouldn't influence kedro in absolutely any way, because kedro shouldn't even know that kedro-datasets exists
  • mixing and matching kedro versions with kedro-dataset versions should be possible in the widest range

These are stated and implicit goals. But the one true north star goal that sparked this split is to decouple kedro from its datasets, which were holding us back from releasing upgrades to either.

Actions:

  • Relax the requirement for kedro in kedro-datasets to the earliest version possible as minimal and open-ended versions

@bpmeek
Copy link
Author

bpmeek commented Mar 22, 2023

@idanov I think I may be approaching this from a different angle, in what I'm suggesting an install of kedro-datasets would come with only what's currently in kedro.io.core and kedro.utils this only requires cachetools which has no dependencies itself, all of the other datasets would fall under "extras" as I believe they currently do. Installing Kedro to a fresh environment installed 47 packages including itself. While this personally has not been an issue for me it seemed like unnecessary overhead when I opened the ticket.

Just as a reminder for the goal of the split:

  • kedro and kedro-datasets develop at a different pace
  • kedro-dataset only depend on one thing from Kedro, namely AbstractDataSet
  • breaking changes in kedro shouldn't influence kedro-datasets, unless they are in AbstractDataSet
  • breaking changes in kedro-datasets shouldn't influence kedro in absolutely any way, because kedro shouldn't even know that kedro-datasets exists
  • mixing and matching kedro versions with kedro-dataset versions should be possible in the widest range

I think the proposed change wouldn't impact any of these goals, only reverse the direction, i.e. kedro only depend on one thing from kedro-datasets, namely AbstractDataSet

But this is also in part why the original request recommends moving it to its own plugin.

Moreover, a breaking change in kedro-datasets will immediately trigger a breaking change in kedro, which is exactly why we split them.

Can you help me understand this a little more? Currently it seems that kedro-datasets is just a library of different implementations of AbstractDataSet and doesn't provide any core functionality itself, so it would seem that a breaking change to AbstractDataSet would impact both Kedro and kedro-datasets regardless of which package it lives in.

I do definitely understand wanting to keep the interface part of the framework, and if it's too much of a risk to separate then I understand.

@deepyaman
Copy link
Member

Installing Kedro to a fresh environment installed 47 packages including itself. While this personally has not been an issue for me it seemed like unnecessary overhead when I opened the ticket.

+1 that Kedro is not that light. https://github.com/kedro-org/kedro/blob/main/dependency/requirements.txt includes a lot of requirements, and they're not even that loosely defined.

To provide a concrete (but simple) example, I wanted to abstract data loading from execution logic in something similar to https://github.com/deepyaman/exedra-examples/blob/main/nyc_taxi_data_regression/deploy/kfp/pipeline.py (the automated translation process is less straightforward, but not the focus). As a result, I wanted a Dataset interface where the dataset defines load/save function, etc., too. However, installing Kedro on each container is unnecessarily heavy.

@astrojuanlu
Copy link
Member

Crazy idea: stop requiring inheritance from AbstractDataSet and relying on derived datasets to implement a protocol instead (in other words, good old Duck Typing™️), which were introduced in Python 3.8 https://peps.python.org/pep-0544/

@noklam
Copy link
Contributor

noklam commented Mar 30, 2023

Interesting idea, I don't know how does it works in practice but happy to explore.

@antonymilne
Copy link
Contributor

antonymilne commented Mar 30, 2023

Unfortunately this isn't really a possibility without drastic changes because AbstractDataSet does a lot more than just provide the interface - it has functionality wrapped up in its methods, e.g. load wraps the subclass's _load method. This is the template method pattern and there are good arguments against it, many of which apply to AbstractDataSet and its implementations IMO. I alluded to this and other issues in #1778 but didn't write much about AbstractDataSet yet. Either way, I don't think this is likely to change any time soon, so I don't think a protocol will help us now.

@astrojuanlu
Copy link
Member

astrojuanlu commented Jun 19, 2023

There have been some very reasonable points raised in this thread:

On the other hand, I feel that some arguments are not core to the discussion:

  • Users who find Kedro "scary" ❓ As @idanov said, having Kedro being pulled as a dependency is not forcing anybody to use it. In addition, if you're using anything in kedro-datasets, chances are that you're pulling tensorflow, pyspark... those are the really scary ones IMHO.
  • Needing to modify all datasets after adding a metadata attribute ❓ In the end, Add metadata attribute to kedro.io datasets #2537 and feat: Add metadata attribute to datasets kedro-plugins#189 happened without touching AbstractDataSet and in a fully backwards compatibility way (the role of layer in Kedro Viz is a separate story), so I don't see how a base class in kedro-datasets would have made things easier.

The discussion of what direction should the dependency link have is not settled, but I sympathise with @idanov arguments to keep the interface as part of the framework. I don't have articulate arguments for this, but definitely there are precedents of plugin ecosystems depending on the central package: for example, I picked 4 pytest plugins at random and they all depend on pytest, and the same happened after picking 4 random Django packages. I don't see why we should do things differently in Kedro. Moreover, this would create a weird assymmetry in which all Kedro plugins depend on kedro except for kedro-datasets.

Given the lack of consensus, I'm inclined to say that we should close this as "Won't fix", and instead focus our efforts on (1) improving the current design of Kedro datasets (#1936 (comment), #1778) and (2) continue working on making Kedro more modular (#2388).

@bpmeek bpmeek closed this as completed Jun 21, 2023
@bpmeek
Copy link
Author

bpmeek commented Jun 21, 2023

@astrojuanlu you bring up some really good points, and it seems like the core Issues I was wanting to address have been addressed. Thank you team for the work and conversation.

@bpmeek bpmeek closed this as not planned Won't fix, can't repro, duplicate, stale Jun 21, 2023
@yetudada
Copy link
Contributor

yetudada commented Aug 7, 2023

Thank you everyone! I'll also drop in the conversation that prompted this too: https://linen-slack.kedro.org/t/9724969/hey-all-if-i-wanted-to-request-that-a-part-of-kedro-be-moved

@idanov
Copy link
Member

idanov commented Oct 17, 2023

Thank you @bpmeek for raising this, because without the discussion here we wouldn't have found out about kedro-org/kedro-plugins#140

Meanwhile after we drop Python 3.7 we maybe should look into using Protocols in as many places as possible as @astrojuanlu suggested, because I have the suspicion that this might make things a lot more decoupled and give the freedom to users in more similar situations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature
Projects
None yet
Development

No branches or pull requests

9 participants