[DataCatalog]: Lazy dataset loading #4270
Signed-off-by: Elena Khaustova <[email protected]>
The PR itself looks clean to me and I don't have more to add.
I am still not completely clear on what problem this is solving. As I understand it, there are always two conflicting requirements:
- Eager failure is good.
- Lazy loading (loading as few dependencies as possible).
At which point are datasets materialized? Is it in the runner?
Just in case it was missed:
Long story short: we now initialize datasets before the pipeline run (or whenever a dataset is retrieved from the catalog), so only the datasets used in the run are required. Previously, we initialized every dataset in the catalog when creating the catalog object, whether or not the run used it.
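To make the before/after behaviour concrete, here is a minimal sketch of the idea, not the code from this PR. `SketchCatalog`, `LazyEntry` and `build` are hypothetical names invented for illustration; the real implementation lives in `kedro/io/kedro_data_catalog.py` and also handles pattern resolution, which is left out here.

```python
from __future__ import annotations

from typing import Any, Callable


class LazyEntry:
    """Hypothetical holder: keeps a dataset config and builds the dataset on request."""

    def __init__(self, config: dict[str, Any], factory: Callable[[dict[str, Any]], Any]):
        self.config = config
        self._factory = factory

    def materialize(self) -> Any:
        # The (possibly expensive) import/instantiation only happens here.
        return self._factory(self.config)


class SketchCatalog:
    """Hypothetical catalog: nothing is instantiated when the catalog is created."""

    def __init__(self, configs: dict[str, dict[str, Any]], factory: Callable[[dict[str, Any]], Any]):
        self._lazy = {name: LazyEntry(cfg, factory) for name, cfg in configs.items()}
        self._datasets: dict[str, Any] = {}

    def get(self, name: str) -> Any:
        if name not in self._datasets:
            entry = self._lazy.pop(name, None)
            if entry is None:
                raise KeyError(name)
            # Materialize on first access and cache the result.
            self._datasets[name] = entry.materialize()
        return self._datasets[name]


built: list[str] = []


def build(config: dict[str, Any]) -> dict[str, Any]:
    """Stand-in factory that records which configs were actually instantiated."""
    built.append(config["type"])
    return {"dataset_for": config["type"]}


catalog = SketchCatalog({"cars": {"type": "csv"}, "planes": {"type": "parquet"}}, build)
first = catalog.get("cars")
second = catalog.get("cars")  # served from the cache, no second build
```

In this sketch a broken or missing dependency for `planes` would only surface if a run actually requested `planes`, which is the trade-off against eager failure raised above.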
Thanks for explaining. I think the solution makes sense: only materialise the datasets that are needed.
kedro/io/kedro_data_catalog.py
Outdated
ds_config = self._config_resolver.resolve_pattern(key)
if ds_config:
    self._add_from_config(key, ds_config)

non_initialized_dataset = self._lazy_datasets.pop(key, None)
Am I correct that we also want this PR to pave the way for #3932 by storing the configuration in the catalog for lazy datasets? If so, should we really remove the dataset from _lazy_datasets once it is materialized? Doing so prevents further access to the original raw configuration. (This can still be reassessed later, and I am not sure we want to tackle both problems in this PR, but it's worth thinking about whether it fits well in the broader roadmap.)
We think that a proper solution for #3932 will require storing the configuration at the dataset level, with dataset.to_yaml() and dataset.from_yaml(), for the following reasons:
- We allow initializing the catalog from datasets or from their configurations. In the first case we currently cannot retrieve the configuration from the dataset object and thus cannot serialize it.
- The same problem appears if someone adds a dataset to the catalog, which is probably the main use case for catalog dump and load.
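As a rough illustration of what "configuration at the dataset level" could look like, here is a hedged sketch. It is an assumption, not Kedro's API: `SketchDataset`, `to_config()` and `from_config()` are hypothetical names, and JSON from the standard library stands in for YAML so the example has no third-party dependency.

```python
from __future__ import annotations

import json
from typing import Any


class SketchDataset:
    """Hypothetical dataset that remembers the arguments it was built with."""

    def __init__(self, filepath: str, load_args: dict[str, Any] | None = None):
        self.filepath = filepath
        self.load_args = load_args or {}
        # Keep the constructor arguments so the object can describe itself later,
        # even if it was created directly rather than from a catalog config.
        self._init_config = {"filepath": filepath, "load_args": dict(self.load_args)}

    def to_config(self) -> dict[str, Any]:
        return {"type": type(self).__name__, **self._init_config}

    @classmethod
    def from_config(cls, config: dict[str, Any]) -> "SketchDataset":
        kwargs = {key: value for key, value in config.items() if key != "type"}
        return cls(**kwargs)


# Round trip: a dataset created as an object can still be dumped and reloaded.
original = SketchDataset("data/cars.csv", {"sep": ";"})
dumped = json.dumps(original.to_config())
restored = SketchDataset.from_config(json.loads(dumped))
```

Because the configuration travels with the dataset object, a catalog dump would not depend on whether the dataset came from a config file or was added as an instance, which is the case the two bullets above describe.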
Very clean solution! 🌟 The explanations also make perfect sense to me, thanks for adding that as context.
This looks great! I’m hopeful that lazy loading datasets will also improve the initial load time in Kedro-Viz. Looking forward to seeing this!
self, ds_name: str, dataset: AbstractDataset, replace: bool = False
self,
ds_name: str,
dataset: AbstractDataset | _LazyDataset,
Are we sure we'd like to expose _LazyDataset to the users? If yes, then maybe we need to make it a proper LazyDataset that can be instantiated even outside of the catalog. Otherwise, I'd prefer for us to hide it from this method signature.
We don't want to expose it. We added it here only for as long as we keep using the old add() method for compatibility with the old catalog. It will go away after the breaking change, as the setter will be used instead.
Description
Implementation of #3935 (comment)
The PR is done on top of #4262
Development notes
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.
If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
RELEASE.md file