Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spike : Polishing API Surface #4473

Open
rashidakanchwala opened this issue Feb 10, 2025 · 13 comments
Open

Spike : Polishing API Surface #4473

rashidakanchwala opened this issue Feb 10, 2025 · 13 comments
Assignees
Labels
Issue: Feature Request New feature or improvement to existing feature Type: Parent Issue

Comments

@rashidakanchwala
Copy link
Contributor

Description

This is a spike ticket to review how Kedro-Viz and Kedro plugins interact with Kedro APIs and identify potential improvements for Kedro 1.0, particularly regarding the use of private methods. Additionally, we will assess how general users engage with Kedro APIs to explore further enhancements.

@rashidakanchwala rashidakanchwala added the Issue: Feature Request New feature or improvement to existing feature label Feb 10, 2025
@rashidakanchwala rashidakanchwala moved this to To Do in Kedro 🔶 Feb 10, 2025
@rashidakanchwala rashidakanchwala moved this from To Do to In Progress in Kedro 🔶 Feb 11, 2025
@astrojuanlu
Copy link
Member

Can we make _find_kedro_project public? See #4440 (comment) cc @merelcht

Also, should we have a broader consultation with the community, or at least our TSC, about this instead of just researching on our own? I know @Galileo-Galilei has expressed some desires for the public API in recent times.

@rashidakanchwala
Copy link
Contributor Author

rashidakanchwala commented Feb 18, 2025

In Kedro-Viz, we were initially calling _find_kedro_project from Kedro. Since these were just utility methods, we copy-pasted them into the Kedro-Viz repository to avoid doing that. However, given that other plugins also rely on these methods, it makes sense to make them part of Kedro’s public API so that we don't duplicate them.

@astrojuanlu
Copy link
Member

Also xref kedro-org/kedro-plugins#139 ? Or should we leave that outside of 1.0?

@astrojuanlu
Copy link
Member

About _find_kedro_project, see #4497

@merelcht
Copy link
Member

Also xref kedro-org/kedro-plugins#139 ? Or should we leave that outside of 1.0?

AFAICT that doesn't require changes in the core Kedro framework, so no need to tackle it as part of 1.0.0. Could do it if we have time, but I'd put it on the low priority list for now.

@ravi-kumar-pilla
Copy link
Contributor

ravi-kumar-pilla commented Feb 22, 2025

Hi Team,

For reference -

Private methods/variables found in plugins

Kedro-mlflow 

  1. _is_project
  2. _find_kedro_project
  3. __version__
  4. _get_project_metadata
  5. _create_hook_manager

Kedro-airflow

  1. _split_params

Kedro-datasets
No private methods/variables used

Kedro-docker

  1. _VERBOSE

Kedro-telemetry

  1. __version__

Kedro-boot

  1. _NullPluginManager
  2. _is_project
  3. _find_kedro_project
  4. _add_src_to_path (in conftest)

Thank you

@Galileo-Galilei
Copy link
Member

Galileo-Galilei commented Feb 22, 2025

Here after are some API inconstencies I'd like to discuss. I indicate if they are breaking changes (💥) or not (✅).

✅ Private methods in kedro-mlflow

Thanks @ravi-kumar-pilla. Some remarks:

💥 Runtime parameters

Runtime parameters are, at the very least, confusing #2169 (comment). They are named runtime_params in the CLI, runtime_params in the resolver, extra_params in the Session, _extra_params in the context, which gives these kind of confusing lines of code :

runtime_params=extra_params,

def _register_runtime_params_resolver(self) -> None:
OmegaConf.register_new_resolver(
"runtime_params",
self._get_runtime_value,
replace=True,
)

_extra_params: dict[str, Any] | None = field(
init=True, default=None, converter=deepcopy
)

not to mention weird documentation references and different recommandation when using runtime parameters inside a notebook, deployment docs, or the CLI:

conf_source=conf_source, env=env, runtime_params=runtime_params

> To **override not just parameters but other configurations**, such as catalog entries or file paths, or to specify upfront that certain parameters must be set at runtime, use `$runtime_params` with the `OmegaConfigLoader`. Introduced in Kedro `0.18.14`, this feature allows dynamic overrides of various configuration types using the `--params` CLI option. It’s particularly useful for scenarios like switching data sources or fine-tuning runtime settings. [Learn more about `$runtime_params`.](advanced_configuration.md#how-to-override-configuration-with-runtime-parameters-with-the-omegaconfigloader)

- `extra_params`: Optional dictionary containing extra project parameters

extra_params: dict[str, Any] | None = None,

with KedroSession.create(env=env, extra_params=params) as session:

We should rename everything runtime params, and make the code and documentation consistent. This is a breaking change though.

Datasets

✅ Public access to (versioned) filepath

Currently, a lot of plugins or private datasets are using _get_load_path method and _filepath attribute. Actually this is a long time known issue, see #1778 for reference. It said (3 years ago!):

In theory we're free to do as we please with any function starting with _, i.e. we can remove, rename, changes arguments at will. In practice, however, some of these might be very commonly used (e.g. self._get_load_path() used in lots of datasets) so should not be regarded as non-breaking.

It was discussed, but never properly tackled in the DataCatalog refactoring (unless I missed something, maybe @ElenaKhaustova or @astrojuanlu can chime in?). I've found this discussion #3753, and I think it's definitely a method / function that should go public before 1.0. I understand that this needs more engineering effort than 'let's just make _get_load_path and _filepath public".

💥 io vs data catalog

Kedro used to use a lot "io" as a synomym for datacatalog, and it was replaced 1 year ago: #3924.
I'd be in favor of renaming the "io" folder (which used to be io.core and io.data_catalog (#1778): https://github.com/kedro-org/kedro/tree/main/kedro/io
in favor of something more meaningful, but it may be too big a breaking change for little value.

💥 Rename MatplotlibWriter :

This one has been on my list for a very long time because it is the only dataset with inconsistent naming, but that's a kedro-dataset issue: kedro-org/kedro-plugins#353 . Note that we once had a user complaining on slack that ChatGPT renamed MatplotlibWriter to MatplotlibDataset in its code 😅

💥 Deprecate public "parameters"

It is currently possible to pass all parameters in a node witht the keyword parameters insteatd of param:my_param. I've been advocating against it for a while and even want to deprecate this possibility. I remember a couple discussions about this on slack. No one is really complaining about it so I may be just wrong, but I'd like we make it a conscious decision, that's why I mention it here.

@astrojuanlu
Copy link
Member

Amazing writeup, thanks a lot @Galileo-Galilei !

1 quick Saturday afternoon reaction from me: I agree we should generally move away from __version__ and use importlib.metadata instead (but please let's not remove __version__ from kedro and plugins just yet!)

@ravi-kumar-pilla
Copy link
Contributor

Thank you @Galileo-Galilei for the writeup. Here is what I think -

Private methods in kedro-mlflow

  1. _is_project and _find_kedro_project - should definitely be public. - I agree with this
  2. version - I agree it is a convention in most python packages. We can retain this but also may be push on using below via docs.
from importlib.metadata import version
version(‘kedro’)
  1. _get_project_metadata - No strong opinion on this but I think this can be made public. I do not see any issue in making this public unless the team think so.

  2. _create_hook_manager - This is used internally and also mentioned in docs multiple places. I feel this to be an internal method registering several specs. At the same time, I feel the hook manager might be of use for plugin developers. Not sure why hook_manager is not made as public attribute within Session. Based on my understanding, hooks come into play within a KedroSession which makes this part of KedroSession. So why can’t we make hook_manager a public attribute. (Currently we have _hook_manager as an attribute in a KedroSession)

Run time parameter naming
Inconsistent naming:

runtime_params - CLI, resolver
extra_params - Session
_extra_params - Context

We should rename everything runtime_params, and make the code and documentation consistent. This is a breaking change though. (Agree !)

Rename MatplotlibWriter

Based on the discussions in kedro-org/kedro-plugins#353 I think we can rename the dataset to MatplotlibWriterDataset which would include the fact that it is a writer and does not have load function, but also is consistent with Dataset naming

Deprecate public "parameters"

I am not sure why passing all parameters using parameters is a bad thing, though I agree it passes unnecessary params if not needed and is less explicit for debugging. I would like to know your use case or issue with this

  • Is it because of not explicitly stating each parameter name ?
  • Is there any issue with the parameter names clashing with dataset names ?

Thank you

@rashidakanchwala
Copy link
Contributor Author

@Galileo-Galilei Regarding _get_project_metadata, if the goal is only to retrieve package_name, in Kedro-Viz, we directly import it after bootstrapping the project:
https://github.com/kedro-org/kedro-viz/blob/main/package/kedro_viz/launchers/cli/run.py#L107.

Would it make sense to do the same for kedro-mlflow?

@datajoely
Copy link
Contributor

Some thoughts -

  • _find_kedro_project could have a better name maybe find_kedro_project_path?
  • Can we do this analysis for kedro-boot IIRC there were lots of hacks needed to make the session work in the desired way?

@Galileo-Galilei
Copy link
Member

(@ravi-kumar-pilla) Based on the discussions in kedro-org/kedro-plugins#353 I think we can rename the dataset to MatplotlibWriterDataset which would include the fact that it is a writer and does not have load function, but also is consistent with Dataset naming

I tend to disagree, there are several other datasets without load or save method (SQLQueryDataset, APIDataset, PlotlyDataset...), and it is not visible in their name. It is much more consistent with PlotlyDataset to name it MatplotlibDataset without the "Writer" suffix.

@ravi-kumar-pilla I am not sure why passing all parameters using parameters is a bad thing, though I agree it passes unnecessary params if not needed and is less explicit for debugging. I would like to know your use case or issue with this

Actually it is because it messes up kedro-mlflow automatic parameter tracking with parameters unrelated to the pipeline, but I am not sure this is really an issue. If some people want to do this, maybe we should allow them ; on the same time we should promote getting parameters from a dict unpacking with a syntax like "params:hyperparameters.temperature`` (I think something like this is already possible?).

@rashidakanchwala Regarding _get_project_metadata, if the goal is only to retrieve package_name, in Kedro-Viz, we directly import it after bootstrapping the project:
https://github.com/kedro-org/kedro-viz/blob/main/package/kedro_viz/launchers/cli/run.py#L107.

😮 I did not know the package name was accessible like this. I should try to check if there is any issue with kedro-mlflow (especially if bootstrapping is necessary first), but it should do the trick, thanks!

Can we do this analysis for kedro-boot IIRC there were lots of hacks needed to make the session work in the desired way?

I tag @takikadiri for reference. Hopefully we can make most of the hack go away if reinjecting data in the session become possible in the core framework, and I hope this will become a reality soon!

@datajoely
Copy link
Contributor

I'd love to have a broader conversation of dict unpacking

Actually it is because it messes up kedro-mlflow automatic parameter tracking with parameters unrelated to the pipeline, but I am not sure this is really an issue. If some people want to do this, maybe we should allow them ; on the same time we should promote getting parameters from a dict unpacking with a syntax like "params:hyperparameters.temperature`` (I think something like this is already possible?).

It would be cool to support * and ** somehow

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Issue: Feature Request New feature or improvement to existing feature Type: Parent Issue
Projects
Status: In Progress
Development

No branches or pull requests

6 participants