[DataCatalog]: Lazy dataset loading #4270

ElenaKhaustova · 2024-10-30T13:59:46Z

Description

The PR is done on top of #4262

Development notes

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

Read the contributing guidelines
Signed off each commit with a Developer Certificate of Origin (DCO)
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes
Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Elena Khaustova <[email protected]>

noklam

The PR itself looks clean to me and I don't have more to add.

I am still not completely clear on what problem this is solving. As I understand it, there are always two conflicting requirements:

Eager failure is good.
Lazy loading is good (load as few dependencies as possible).

At which point do datasets get materialized? Is it in the Runner?

ElenaKhaustova · 2024-11-01T15:33:57Z

Just in case it was missed:

The solution is described here: [DataCatalog]: Spike - Lazy dataset loading #3935 (comment)
A summary of what it solves and what remains untouched is here: [DataCatalog]: Spike - Lazy dataset loading #3935 (comment)

Long story short: we now initialize datasets before the pipeline run (or whenever a dataset is retrieved from the catalog), and only the datasets used in the run are required. Previously, we initialized every dataset in the catalog when creating the catalog object, whether or not the run used it.
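To make the before/after difference concrete, here is a minimal sketch in plain Python. It does not use Kedro's actual classes: `MemoryStore`, `LazyEntry`, and `TinyCatalog` are hypothetical names invented for illustration, and the real `KedroDataCatalog` implementation may well differ.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable


class MemoryStore:
    """Hypothetical stand-in for a dataset; counts how many were built."""

    instances_created = 0

    def __init__(self, **kwargs: Any) -> None:
        MemoryStore.instances_created += 1
        self.kwargs = kwargs


@dataclass
class LazyEntry:
    """Holds a factory and its config until the dataset is first requested."""

    factory: Callable[..., Any]
    config: dict[str, Any]

    def materialize(self) -> Any:
        return self.factory(**self.config)


@dataclass
class TinyCatalog:
    """Hypothetical catalog: registering a config does not build the dataset."""

    _lazy: dict[str, LazyEntry] = field(default_factory=dict)
    _datasets: dict[str, Any] = field(default_factory=dict)

    def register(self, name: str, factory: Callable[..., Any], config: dict[str, Any]) -> None:
        self._lazy[name] = LazyEntry(factory, config)

    def get(self, name: str) -> Any:
        # Materialize on first access, then move the entry out of the lazy pool.
        entry = self._lazy.pop(name, None)
        if entry is not None:
            self._datasets[name] = entry.materialize()
        return self._datasets[name]


catalog = TinyCatalog()
catalog.register("cars", MemoryStore, {"filepath": "cars.csv"})
catalog.register("boats", MemoryStore, {"filepath": "boats.csv"})

created_before = MemoryStore.instances_created  # nothing has been built yet
cars = catalog.get("cars")                      # builds only "cars"
created_after = MemoryStore.instances_created
```

In this sketch, a run that only touches `cars` never constructs `boats`, so a missing dependency for `boats` would not surface until something actually asks for it. That is the trade-off against eager failure raised above.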

noklam

Thanks for explaining. I think the solution makes sense: only materialise the needed datasets.

Galileo-Galilei · 2024-11-02T14:36:55Z

kedro/io/kedro_data_catalog.py

            ds_config = self._config_resolver.resolve_pattern(key)
            if ds_config:
                self._add_from_config(key, ds_config)

+        non_initialized_dataset = self._lazy_datasets.pop(key, None)


Am I correct that we also want this PR to pave the way for #3932 by storing the configuration of lazy datasets in the catalog? If so, should we really remove the dataset from _lazy_datasets once it is materialized? Doing so prevents further access to the original raw configuration. (This can still be reassessed later, and I am not sure we want to tackle both problems in this PR, but it's worth thinking about whether it fits well into the broader roadmap.)

ElenaKhaustova · 2024-11-04

We think that a proper solution for #3932 will require storing the configuration at the dataset level, with dataset.to_yaml() and dataset.from_yaml(), for the following reasons:

We allow initializing the catalog from datasets or from their configurations. In the first case, we currently cannot retrieve the configuration from the dataset object and thus cannot serialize it.

The same problem appears if someone adds a dataset to the catalog, which is probably the main use case for catalog dump and load.
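The idea of keeping configuration at the dataset level could look roughly like the sketch below. Everything in it is an assumption made for illustration: `ConfigAwareDataset`, `to_config()`, and `from_config()` are hypothetical, and plain dictionaries stand in for the YAML that the proposed `to_yaml()` / `from_yaml()` methods would produce and consume.

```python
from __future__ import annotations

from typing import Any


class ConfigAwareDataset:
    """Hypothetical dataset that remembers the arguments it was built with."""

    def __init__(self, filepath: str, sep: str = ",") -> None:
        self.filepath = filepath
        self.sep = sep
        # Keep the constructor arguments so the instance can describe itself.
        self._init_config: dict[str, Any] = {"filepath": filepath, "sep": sep}

    def to_config(self) -> dict[str, Any]:
        # Build a fresh dict so callers cannot mutate internal state.
        return {"type": type(self).__name__, **self._init_config}

    @classmethod
    def from_config(cls, config: dict[str, Any]) -> "ConfigAwareDataset":
        # Drop the "type" marker; the rest maps onto constructor arguments.
        params = {key: value for key, value in config.items() if key != "type"}
        return cls(**params)


# A catalog built from instances (no raw config available) can still be dumped.
datasets = {"cars": ConfigAwareDataset("cars.csv", sep=";")}
dumped = {name: ds.to_config() for name, ds in datasets.items()}
restored = {name: ConfigAwareDataset.from_config(cfg) for name, cfg in dumped.items()}
```

Because each instance carries its own configuration, serialization no longer depends on whether the catalog kept the raw entry in `_lazy_datasets` after materialization.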


merelcht

Very clean solution! 🌟 The explanations also make perfect sense to me, thanks for adding that as context.

rashidakanchwala

This looks great! I’m hopeful that lazy loading datasets will also improve the initial load time in Kedro-Viz. Looking forward to seeing this!

RELEASE.md

idanov · 2024-11-05T17:37:12Z

kedro/io/kedro_data_catalog.py

-        self, ds_name: str, dataset: AbstractDataset, replace: bool = False
+        self,
+        ds_name: str,
+        dataset: AbstractDataset | _LazyDataset,


Are we sure we'd like to expose _LazyDataset to the users? If yes, then maybe we need to make it a proper LazyDataset that can be instantiated even outside of the catalog. Otherwise, I'd prefer for us to hide it from this method signature.

ElenaKhaustova · 2024-11-06

We don't want to expose it; we added it here only for as long as we use the old add() method for compatibility with the old catalog. It will go away after the breaking change, as a setter will be used instead.


ElenaKhaustova added 28 commits October 28, 2024 16:04

7c794f4 Move warm-up to runner
208a24b Implemented test for running thread runner with patterns
7c6729a Added test for new catalog
9d5b37d Add line separator to file
c3229c0 Replaced writing csv manually to writing with pandas
6c509d9 Merge branch 'main' into fix/4250-move-warm-up-to-runner
bd878c9 Fixed fixture
68010aa Removed new catalog from test
29d373f Made catalog type a parameter
e90cfd7 Removed old catalog from test
3f1dbe0 Removed new catalog from test
892cda4 Removed data creation/loading
e7f2632 Fixed test docstring
429ca13 Removed extra loop
3ffd538 Renamed variable for clarifty
681d3f1 Merge branch 'main' into fix/4250-move-warm-up-to-runner
8177ddc Replaced ald method in the constructor
379f3b4 Implemented dataset materialization
9335578 Added temporal repr
26d0c84 Removed replacing warning when init
dea5630 Improved repr
0da806f Fixed bug in get()
b813408 Merge branch 'fix/4250-move-warm-up-to-runner' into feature/3935-lazy-loading
c63ece6 Moved warm-up to the top
01f9b62 Moved warm-up to the top
069dff4 Moved warm-up to the top
15df0f8 Merge branch 'fix/4250-move-warm-up-to-runner' into feature/3935-lazy-loading
da669f5 Updated eq and repr

ElenaKhaustova mentioned this pull request Oct 30, 2024

[DataCatalog]: Spike - Lazy dataset loading #3935

Closed

3999913 Fixed mypy errors

ElenaKhaustova added 3 commits November 1, 2024 11:44

72c106b Merge branch 'main' into feature/3935-lazy-loading
870b794 Added docstrings
e031f8a Updated release notes

ElenaKhaustova changed the title ~~[DataCatalog]: Lazy loading draft solution~~ [DataCatalog]: Lazy dataset loading Nov 1, 2024

9d0f579 Merge branch 'main' into fix/4250-move-warm-up-to-runner

ElenaKhaustova marked this pull request as ready for review November 1, 2024 14:07

ElenaKhaustova requested a review from merelcht as a code owner November 1, 2024 14:07

ElenaKhaustova requested review from idanov, noklam, astrojuanlu, rashidakanchwala, lrcouto, ravi-kumar-pilla and DimedS November 1, 2024 14:07

ElenaKhaustova added 2 commits November 1, 2024 14:34

5f6ef85 Updated release notes
6d23b4d Merge branch 'fix/4250-move-warm-up-to-runner' into feature/3935-lazy-loading

noklam reviewed Nov 1, 2024

noklam approved these changes Nov 1, 2024

Galileo-Galilei reviewed Nov 2, 2024

c31c5eb Renamed variable for consistency

merelcht approved these changes Nov 4, 2024

rashidakanchwala approved these changes Nov 4, 2024

idanov approved these changes Nov 5, 2024

ElenaKhaustova and others added 3 commits November 5, 2024 22:39

822e206 Updated release notes
1f14ded Merge branch 'main' into feature/3935-lazy-loading
25aebce Merge branch 'main' into feature/3935-lazy-loading

ElenaKhaustova enabled auto-merge (squash) November 6, 2024 12:39

ElenaKhaustova merged commit 7b7dad9 into main Nov 6, 2024
34 checks passed

ElenaKhaustova deleted the feature/3935-lazy-loading branch November 6, 2024 12:57
