
[DataCatalog]: Spike - Lazy dataset loading #3935

Closed
ElenaKhaustova opened this issue Jun 6, 2024 · 9 comments

Labels: Issue: Feature Request (New feature or improvement to existing feature)

ElenaKhaustova commented Jun 6, 2024

Description

Users are required to install all dependencies even for unused datasets, leading to unnecessary complexity and confusion.

We propose implementing a lazy dataset loading feature to allow users to load only the datasets they need without causing pipeline failures.

Relates to #2829

Context

  • "You need to install all dependencies even for unused datasets (in case you want to run pipeline partially or do not load some dataset when standalone catalog usage)."
  • "We have a lot of data entries and different dependencies and when we just want to rerun an anaysis partially, we are frustrated because we need to install all the packages to just load one data source. Why would I need to install excel dependencies to instantiate the DataCatalog to load a csv which does not need Excel?"
  • The error users currently get when dependencies are missing is unclear ([DataCatalog]: Error message is confusing when kedro-dataset is not installed #3911):

    DatasetError: An exception occurred when parsing config for dataset 'companies':
    No module named 'pandas'. Please see the documentation on how to install relevant dependencies for kedro_datasets.pandas.CSVDataset:

Spike task

Investigate how to actually do the lazy loading:

  1. The actual lazy loading of datasets: only import a dataset at the time we load its data.
  2. Understand what part of the pipeline needs to be run, and only import what is required for that run.

ElenaKhaustova added the Issue: Feature Request label on Jun 6, 2024
merelcht changed the title from "[DataCatalog]: Lazy dataset loading" to "[DataCatalog]: Spike - Lazy dataset loading" on Oct 21, 2024
ElenaKhaustova commented:

Kedro Viz workflow

kedro_viz -> integrations -> kedro -> data_loader -> _load_data_helper (https://github.com/kedro-org/kedro-viz/blob/main/package/kedro_viz/integrations/kedro/data_loader.py#L58) - this creates all the required Kedro objects such as Session, Context, DataCatalog and Pipelines.

Specifically, for datasets we need the DataCatalog object.

Thank you, @ravi-kumar-pilla

ElenaKhaustova commented:

Kedro Viz workflow - lite mode

When running Kedro-Viz in lite mode, AbstractDatasetLite - a custom implementation of Kedro's AbstractDataset - is used. It provides an UnavailableDataset instance by overriding from_config of AbstractDataset, which allows initializing the catalog without the required dataset dependencies installed.

https://github.com/kedro-org/kedro-viz/blob/9996c9950f60810cdaeb7c439614597572354a71/package/kedro_viz/integrations/kedro/data_loader.py#L97

https://github.com/kedro-org/kedro-viz/blob/9996c9950f60810cdaeb7c439614597572354a71/package/kedro_viz/integrations/kedro/abstract_dataset_lite.py#L15
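
In essence, the pattern is roughly the following (a simplified sketch, not the exact Kedro-Viz code; the fallback logic and class bodies here are illustrative):

```python
from __future__ import annotations

from typing import Any

from kedro.io import AbstractDataset, MemoryDataset
from kedro.io.core import DatasetError


class UnavailableDataset(MemoryDataset):
    """Placeholder used when a dataset's dependencies are not installed."""


class AbstractDatasetLite(AbstractDataset):
    """Builds the real dataset if possible, otherwise falls back to a placeholder."""

    @classmethod
    def from_config(
        cls,
        name: str,
        config: dict[str, Any],
        load_version: str | None = None,
        save_version: str | None = None,
    ) -> AbstractDataset:
        try:
            return AbstractDataset.from_config(name, config, load_version, save_version)
        except DatasetError:
            # Missing imports surface as a DatasetError while parsing the config,
            # so the catalog can still be constructed without the dependency.
            return UnavailableDataset()
```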

ElenaKhaustova commented:

Summary on Kedro Viz workflow

  1. Both Kedro Viz modes (default and lite) use lazy data loading;
  2. The default mode requires all the datasets to be installed because it creates a session with the catalog inside (currently catalog init also initializes the datasets), even though the dataset object is not needed for the pipeline preview; it is only needed when the user clicks on a node and the actual dataset.load() happens;
  3. The lite mode doesn't require all the datasets to be installed; it handles missing-import errors with AbstractDatasetLite;
  4. They consider making lite mode the default behaviour in the future.


ElenaKhaustova commented Oct 23, 2024

Problem

Based on the context above, we can conclude that data loading is already done lazily and the scope of the problem is bounded by lazy dependency resolution:

  • Users struggle to run partial pipelines (several nodes, slices) without installing all datasets required for the full pipeline run
  • Kedro Viz doesn't need datasets installed for the graph preview until the user expands the node.

Solution proposed

  1. Introduce NonInitializedDataset storing the configuration needed to initialize the actual dataset (not inherited from AbstractDataset).
  2. In the catalog constructor, initialize only NonInitializedDataset instances instead of the actual datasets and store them in a separate dictionary.
  3. Materialize actual datasets when someone gets a dataset from the catalog (get(), __iter__(), keys(), values(), items()) and add them to the _datasets dictionary.
  4. Catalog and dataset printing should not materialize datasets and should print them as they are, so __repr__ should be implemented for NonInitializedDataset.
  5. To avoid cases where pipeline execution breaks because of missing dependencies, we can do a warm-up in the runner, specifically in the AbstractRunner.run() method before calling _run(). For that, we need to have the filtered pipeline and materialize its datasets, so that the warm-up covers only the datasets required for the run (see the sketches after this list):
    self._run(pipeline, catalog, hook_or_null_manager, session_id) # type: ignore[arg-type]
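
A rough sketch of steps 1-4 (the class names follow the proposal, but the catalog shown is a stripped-down stand-in for DataCatalog, not the final API):

```python
from __future__ import annotations

from typing import Any

from kedro.io import AbstractDataset


class NonInitializedDataset:
    """Stores a dataset's parsed config without importing its implementation."""

    def __init__(self, name: str, config: dict[str, Any]):
        self._name = name
        self._config = config

    def materialize(self) -> AbstractDataset:
        # The dataset class (and its dependencies) is imported only here.
        return AbstractDataset.from_config(self._name, self._config)

    def __repr__(self) -> str:
        # Step 4: printing the catalog must not trigger materialization.
        return f"NonInitializedDataset({self._name!r}, type={self._config.get('type')!r})"


class LazyDataCatalog:
    """Stand-in catalog: keeps lazy entries separately and materializes on access."""

    def __init__(self, config: dict[str, dict[str, Any]]):
        # Step 2: only NonInitializedDataset instances are created in the constructor.
        self._lazy_datasets = {
            name: NonInitializedDataset(name, ds_config) for name, ds_config in config.items()
        }
        self._datasets: dict[str, AbstractDataset] = {}

    def get(self, name: str) -> AbstractDataset:
        # Step 3: materialize on first access and cache the result in _datasets.
        if name not in self._datasets and name in self._lazy_datasets:
            self._datasets[name] = self._lazy_datasets.pop(name).materialize()
        return self._datasets[name]
```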

The solution proposed avoids dataset initializations that are not part of the run and ensures that execution will not fail because of missing imports. On the Viz side, we can materialize a dataset only when the user expands a node, so dataset installation is not required for the graph preview.

The solution proposed will also solve the ThreadRunner problem (#4250) as the warm-up will be done for all the runners in the common AbstractRunner.run(). But first, we suggest solving #4250 by moving the existing warm-up to the AbstractRunner.run().
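
The warm-up itself could be as small as materialising every dataset the filtered pipeline touches before _run() starts (a sketch assuming the LazyDataCatalog stand-in above; Pipeline.datasets() returns the names of all datasets a pipeline uses):

```python
from kedro.pipeline import Pipeline


def _warm_up_datasets(pipeline: Pipeline, catalog: "LazyDataCatalog") -> None:
    """Materialize only the datasets required by this (possibly filtered) run.

    Any missing dependency is raised here, in AbstractRunner.run() before
    self._run(...) is called, instead of interrupting a half-finished run.
    """
    for dataset_name in pipeline.datasets():
        # Names not declared in the catalog would default to MemoryDataset in the
        # real implementation; the sketch only touches declared entries.
        if dataset_name in catalog._lazy_datasets or dataset_name in catalog._datasets:
            catalog.get(dataset_name)
```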


ElenaKhaustova commented:

Draft implementation and issues identified

We implemented a draft for the solution proposed above: #4270

When testing the implementation we found that there are different aspects of the lazy loading problem related to dependencies:

  1. dependencies used in the pipeline itself - they are loaded by the find_pipelines function;
  2. dependencies required for dataset init - they are loaded when a dataset is initialized and are defined in its implementation;
  3. dependencies required for dataset save/load - they are loaded when calling the load/save methods and are defined in the dataset's requirements (the sketch right after this list illustrates the difference between 2 and 3);
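
To make the difference between (2) and (3) concrete, here is a hypothetical dataset (the class and its imports are made up for illustration): the module-level pandas import is needed just to initialize it, while fsspec is only imported when data is actually read.

```python
from __future__ import annotations

from typing import Any

import pandas as pd  # (2) init-time dependency: importing the dataset module already requires it

from kedro.io import AbstractDataset


class ExampleCSVDataset(AbstractDataset[pd.DataFrame, pd.DataFrame]):
    def __init__(self, filepath: str):
        self._filepath = filepath

    def _load(self) -> pd.DataFrame:
        import fsspec  # (3) load-time dependency: only imported when load() is called

        with fsspec.open(self._filepath) as f:
            return pd.read_csv(f)

    def _save(self, data: pd.DataFrame) -> None:
        data.to_csv(self._filepath, index=False)

    def _describe(self) -> dict[str, Any]:
        return {"filepath": self._filepath}
```

Materializing (initializing) such a dataset during a warm-up would catch a missing pandas, but not a missing fsspec, which only surfaces at load time.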

The solution proposed addresses problem 2 above, with several caveats:

  1. We cannot solve problem 1 without changing the implementation of the find_pipelines function (a contrived example of this failure mode follows after this list).
    Under the hood, the find_pipelines() function traverses the src/<package_name>/pipelines/ directory and returns a mapping from pipeline directory name to Pipeline object by:
  • Importing the <package_name>.pipelines.<pipeline_name> module;
  • Calling the create_pipeline() function exposed by the <package_name>.pipelines.<pipeline_name> module;
  • Validating that the constructed object is a Pipeline.
    By default, if any of these steps fail, find_pipelines() raises an appropriate warning and skips the current pipeline but continues traversal. During development, this enables you to run your project with some pipelines, even if other pipelines are broken.
  2. When we install any of the kedro-datasets extras we always install all the dataset classes (as they are part of the package) plus the dependencies for the specific dataset. To initialize a dataset, the dataset class must be importable together with all the dependencies referenced in its implementation. So there are cases where one is able to initialize a dataset without actually installing its extra, because some other dataset (and its dependencies) was installed previously. Thus we cannot guarantee that the warm-up solves the missing-dependencies problem: some dependencies are only imported at save/load time, meaning a pipeline can still fail during the run.
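
And to illustrate problem 1, here is a contrived pipeline module (the path and the plotly dependency are hypothetical). If the library is missing, importing the module fails, find_pipelines() warns and skips the whole pipeline, even when the run does not include it; nothing the catalog does about problems 2 and 3 changes that.

```python
# src/<package_name>/pipelines/reporting/pipeline.py  (hypothetical example)
import plotly.express as px  # (1) pipeline-level dependency: needed just to *discover* the pipeline

from kedro.pipeline import node, pipeline


def make_scatter(model_output):
    return px.scatter(model_output, x="x", y="y")


def create_pipeline(**kwargs):
    return pipeline([node(make_scatter, inputs="model_output", outputs="report_figure")])
```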

Alignment with issues reported by users

Referring back to the user-reported issues, they can all be summarised as "You need to install all dependencies even for unused datasets (in case you want to run the pipeline partially or not load some dataset when using the catalog standalone)".

The proposed solution solves it, but problem 1 still remains. So there might be a case where one wants to run the pipeline partially and the datasets are loaded lazily, but dependencies are still required for the pipeline discovery step, before we even instantiate a catalog; if they are missing, the entire pipeline is excluded from the run.

Next steps

We need to define the desired behaviour for problems 1, 2, and 3 and agree on how/whether we want to cover all these cases and whether we're happy with the original solution proposed and the draft implementation.

ElenaKhaustova commented:

Next steps
We need to define the desired behaviour for problems 1, 2, and 3 and agree on how/whether we want to cover all these cases and whether we're happy with the original solution proposed and the draft implementation.

After discussing this with @idanov, we decided to address problems 1 and 3 separately and proceed with the suggested solution.

ElenaKhaustova commented:

Solved in #4270
