[DataCatalog]: Spike - Lazy dataset loading #3935
Comments
Kedro Viz workflow
kedro_viz -> integrations -> kedro -> data_loader -> _load_data_helper (https://github.com/kedro-org/kedro-viz/blob/main/package/kedro_viz/integrations/kedro/data_loader.py#L58): this creates all required Kedro objects, such as Session, Context, DataCatalog and Pipelines. For datasets specifically, we need the DataCatalog object.
Thank you, @ravi-kumar-pilla
Kedro Viz workflow - lite mode
When running Kedro-Viz in lite mode
Summary on Kedro Viz workflow
Problem
Based on the context above, we can conclude that data loading is already done lazily, and the scope of the problem is limited to lazy dependency resolution:
Solution proposed
The solution proposed avoids initializing datasets that are not part of the run and ensures that execution does not fail because of missing imports. On the Viz side, we can materialize a dataset only when the user expands the node, so installing dataset dependencies is not required for graph preview. The solution proposed will also solve the
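To illustrate the idea of deferring dataset initialization, here is a minimal sketch. It is not Kedro's API and not the implementation from #4270: `LazyCatalog`, `names`, `get`, `materialized` and `broken_factory` are hypothetical names chosen for this example. It assumes each catalog entry can be represented by a zero-argument factory that builds the dataset on demand.

```python
from typing import Any, Callable, Dict, List


class LazyCatalog:
    """Hypothetical catalog: stores dataset factories and builds a dataset on first access."""

    def __init__(self, factories: Dict[str, Callable[[], Any]]) -> None:
        self._factories = factories
        self._datasets: Dict[str, Any] = {}

    def names(self) -> List[str]:
        # Enough for a graph preview: no dataset is constructed here.
        return sorted(self._factories)

    def get(self, name: str) -> Any:
        # Build the dataset only when it is requested, then cache it.
        if name not in self._datasets:
            self._datasets[name] = self._factories[name]()
        return self._datasets[name]

    def materialized(self) -> List[str]:
        return sorted(self._datasets)


def broken_factory() -> Any:
    # Stands in for a dataset whose optional dependency is not installed.
    raise ImportError("optional dependency is not installed")


catalog = LazyCatalog({
    "cars": lambda: {"kind": "csv"},
    "reports": broken_factory,
})

preview = catalog.names()   # lists both entries without building either
cars = catalog.get("cars")  # only "cars" is built; "reports" is never touched
```

In this sketch the broken entry only fails if something actually requests it, which mirrors the goal of not failing a run because of datasets it does not use.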
Draft implementation and issues identified
We implemented a draft of the solution proposed above: #4270. While testing the implementation, we found that there are different aspects of
After discussing this with @idanov, we decided to address problems 1 and 3 separately and proceed with the suggested solution.
Solved in #4270
Description
Users are required to install all dependencies even for unused datasets, leading to unnecessary complexity and confusion.
We propose implementing a lazy dataset loading feature to allow users to load only the datasets they need without causing pipeline failures.
Relates to #2829
Context
DataCatalog to load a csv which does not need Excel?"
Spike task
Investigate how to actually do the lazy loading: 1) the actual lazy loading of datasets, only import the datasets at the time we load the data 2) understand what part of the pipeline needs to be run, and only import what's required for that run.
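The two points above can be sketched roughly as follows. This is an illustration under stated assumptions, not Kedro's implementation: `LazyDataset`, `materialize`, the `catalog` dictionary and the `required` set are hypothetical, and a standard-library class stands in for a real dataset class. The dataset class is referenced by a dotted path and imported with `importlib` only when the dataset is first materialized.

```python
import importlib
from typing import Any, Optional


class LazyDataset:
    """Hypothetical wrapper: keeps a dataset's config and imports its class on first use."""

    def __init__(self, type_path: str, **kwargs: Any) -> None:
        self.type_path = type_path  # dotted path, e.g. "collections.Counter"
        self.kwargs = kwargs
        self._instance: Optional[Any] = None

    def materialize(self) -> Any:
        # The import happens here, not when the catalog is defined.
        if self._instance is None:
            module_name, _, class_name = self.type_path.rpartition(".")
            module = importlib.import_module(module_name)
            self._instance = getattr(module, class_name)(**self.kwargs)
        return self._instance


catalog = {
    # Stand-in for a dataset whose dependencies are installed.
    "used": LazyDataset("collections.Counter"),
    # Stand-in for a dataset whose package is missing from the environment.
    "unused": LazyDataset("not_installed_package.ExcelLikeDataset"),
}

# Point 2: only the datasets required for this run are materialized.
required = {"used"}
materialized = {name: catalog[name].materialize() for name in required}
```

Defining the "unused" entry does not raise, because nothing is imported until `materialize()` is called for it.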