
[DataCatalog]: Lazy dataset loading #4270

Merged: 46 commits from feature/3935-lazy-loading merged into main on Nov 6, 2024

Conversation

@ElenaKhaustova (Contributor) commented Oct 30, 2024

Description

Implementation of #3935 (comment)

This PR is built on top of #4262.

Development notes

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

@ElenaKhaustova changed the title from "[DataCatalog]: Lazy loading draft solution" to "[DataCatalog]: Lazy dataset loading" on Nov 1, 2024
@noklam (Contributor) left a comment

The PR itself looks clean to me and I don't have anything more to add.

I am still not completely clear on what problem this is solving. As I understand it, there are always two conflicting requirements:

  1. Eager failure is good.
  2. Lazy loading (load as few dependencies as possible).

At which point do datasets get materialized? Is it at the Runner?

@ElenaKhaustova (Contributor, Author) replied:

> The PR itself looks clean to me and I don't have anything more to add.
>
> I am still not completely clear on what problem this is solving. As I understand it, there are always two conflicting requirements:
>
>   1. Eager failure is good.
>   2. Lazy loading (load as few dependencies as possible).
>
> At which point do datasets get materialized? Is it at the Runner?

Just in case it was missed:

Long story short: now we initialize datasets just before the pipeline run (or whenever a dataset is retrieved from the catalog), and only the datasets actually used in the run are required. Previously, all datasets in the catalog were initialized when the catalog object was created, whether or not they were used in the run.
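
To make this concrete, below is a minimal, hypothetical sketch of the behaviour described above. _LazySketchDataset, SketchCatalog, and their methods are illustrative stand-ins, not the actual classes added in this PR: only a dataset's raw configuration is stored at catalog-creation time, and the dataset object is instantiated on first access.

```python
from typing import Any


class _LazySketchDataset:
    """Hypothetical stand-in for the catalog's internal lazy wrapper."""

    def __init__(self, name: str, config: dict[str, Any]) -> None:
        self.name = name
        self.config = config  # only the raw config is kept; no dataset object yet

    def materialize(self) -> Any:
        # In Kedro this step would build the dataset from its config;
        # returning a plain dict keeps the sketch self-contained.
        return {"name": self.name, **self.config}


class SketchCatalog:
    """Hypothetical catalog that instantiates datasets only on first access."""

    def __init__(self, config: dict[str, dict[str, Any]]) -> None:
        # Nothing is instantiated at catalog-creation time.
        self._lazy = {
            name: _LazySketchDataset(name, cfg) for name, cfg in config.items()
        }
        self._datasets: dict[str, Any] = {}

    def get(self, name: str) -> Any:
        # Materialize on first access, e.g. when the runner requests the dataset.
        if name in self._lazy:
            self._datasets[name] = self._lazy.pop(name).materialize()
        return self._datasets[name]


catalog = SketchCatalog({"cars": {"type": "pandas.CSVDataset", "filepath": "cars.csv"}})
print(catalog.get("cars"))  # only "cars" is ever instantiated
```

The actual PR follows the same principle: lazily wrapped datasets live in an internal _lazy_datasets mapping and are materialized when first retrieved (see the diff excerpt further down).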

@noklam (Contributor) left a comment

Thanks for explaining. I think the solution to only materialise the needed datasets makes sense.

Excerpt from the diff under review:

```python
ds_config = self._config_resolver.resolve_pattern(key)
if ds_config:
    self._add_from_config(key, ds_config)

non_initialized_dataset = self._lazy_datasets.pop(key, None)
```
@Galileo-Galilei (Member) commented Nov 2, 2024

Am I correct that we also want this PR to pave the way for #3932 by storing the configuration in the catalog for lazy datasets? If so, should we really remove the dataset from _lazy_datasets once it is materialized? Doing so prevents further access to the original raw configuration. (This can still be reassessed later, and I am not sure we want to tackle both problems in this PR, but it's worth thinking about whether it fits well into the broader roadmap.)

@ElenaKhaustova (Contributor, Author) replied:

We think a proper solution for #3932 will require storing configuration at the dataset level, plus dataset.to_yaml() / dataset.from_yaml(), for the following reasons (see the sketch after this list):

  • We allow initializing the catalog either from dataset objects or from their configurations. In the first case, we are currently unable to retrieve the configuration from the dataset object and therefore cannot serialize it.
  • The same problem appears when someone adds a dataset object to the catalog, which is probably the main use case for catalog dump and load.
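
For illustration only, here is a hedged sketch of what storing configuration at the dataset level might enable. The to_config()/from_config() methods and the SketchDataset class are assumptions made for this example, not the existing Kedro API:

```python
from __future__ import annotations

from typing import Any


class SketchDataset:
    """Illustrative dataset that keeps its own raw configuration."""

    def __init__(self, filepath: str, load_args: dict[str, Any] | None = None) -> None:
        self.filepath = filepath
        self.load_args = load_args or {}
        # Keeping the raw config on the instance is what would make a later
        # catalog-level dump possible even for datasets added as ready-made objects.
        self._config: dict[str, Any] = {"filepath": filepath, "load_args": self.load_args}

    def to_config(self) -> dict[str, Any]:
        return dict(self._config)

    @classmethod
    def from_config(cls, config: dict[str, Any]) -> SketchDataset:
        return cls(**config)


ds = SketchDataset("data/cars.csv", load_args={"sep": ";"})
assert SketchDataset.from_config(ds.to_config()).filepath == ds.filepath
```

With something like this in place, the catalog could serialize any dataset, regardless of whether it was created from YAML config or added as a ready-made object.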

@merelcht (Member) left a comment

Very clean solution! 🌟 The explanations also make perfect sense to me, thanks for adding that as context.

@rashidakanchwala (Contributor) left a comment

This looks great! I’m hopeful that lazy loading datasets will also improve the initial load time in Kedro-Viz. Looking forward to seeing this!

Excerpt from the diff under review (the add() signature change):

```diff
-    self, ds_name: str, dataset: AbstractDataset, replace: bool = False
+    self,
+    ds_name: str,
+    dataset: AbstractDataset | _LazyDataset,
```
A reviewer (Member) commented:

Are we sure we'd like to expose _LazyDataset to the users? If yes, then maybe we need to make it a proper LazyDataset that can be instantiated even outside of the catalog. Otherwise, I'd prefer for us to hide it from this method signature.

@ElenaKhaustova (Contributor, Author) replied:

We don't want to expose it; we added it here only while we keep the old add() method for compatibility with the old catalog. It will go away after the breaking change, as a setter will be used instead.
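
For context, here is a hypothetical sketch of how a setter-based API could hide the lazy wrapper from users entirely; the SetterCatalog class and its behaviour are illustrative assumptions, not the final Kedro interface:

```python
from typing import Any


class SetterCatalog:
    """Illustrative catalog where users only ever assign real dataset objects."""

    def __init__(self) -> None:
        self._datasets: dict[str, Any] = {}

    def __setitem__(self, name: str, dataset: Any) -> None:
        # Public path: the caller never sees or constructs a lazy wrapper;
        # the catalog is free to store entries however it likes internally.
        self._datasets[name] = dataset

    def __getitem__(self, name: str) -> Any:
        return self._datasets[name]


catalog = SetterCatalog()
catalog["cars"] = object()  # stands in for an AbstractDataset instance
print(catalog["cars"])
```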

@ElenaKhaustova enabled auto-merge (squash) on November 6, 2024 at 12:39
@ElenaKhaustova merged commit 7b7dad9 into main on Nov 6, 2024
34 checks passed
@ElenaKhaustova deleted the feature/3935-lazy-loading branch on November 6, 2024 at 12:57