Grouping artifacts in the data catalog #4260

Open
namedgraph opened this issue Oct 28, 2024 · 2 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@namedgraph

Description

I tried grouping the artifacts by introducing "namespaces" as the first level of config in YAML while moving the actual artifacts to the second level:

a_group_of_artifacts:
  outputs:
    type: ...

  errors:
    type: ...

and was planning to address the artifacts as a_group_of_artifacts:outputs, a_group_of_artifacts:errors etc.

But it turns out that Kedro does not support this?

DatasetError: An exception occurred when parsing config for dataset 'a_group_of_artifacts':
'type' is missing from dataset catalog configuration

Context

Our pipelines mostly augment the initial inputs, which means we end up with a lot of similarly named artifacts (e.g. final_outputs, processed_outputs and other kinds of _outputs) which gets confusing. It feels that there should be a better way to group/namespace the artifacts.

Possible Implementation

Instead of treating the first-level YAML blocks as artifacts, why not traverse the levels recursively until a block with a type key is encountered, treat that block as an artifact, and treat the enclosing blocks purely as grouping?
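
A minimal sketch of that traversal, assuming the catalog YAML has already been loaded into a plain dict. The helper name `flatten_catalog`, the `:` separator, and the example dataset types and file paths are illustrative assumptions, not part of Kedro's API; the function only produces a flat mapping of the shape Kedro currently expects.

```python
from typing import Any, Dict


def flatten_catalog(
    config: Dict[str, Any], prefix: str = "", sep: str = ":"
) -> Dict[str, Any]:
    """Hypothetical helper: collapse nested groups into flat dataset entries.

    A dict block containing a "type" key is treated as a dataset definition.
    Any other dict block is treated as a group and traversed recursively.
    """
    flat: Dict[str, Any] = {}
    for name, block in config.items():
        full_name = f"{prefix}{sep}{name}" if prefix else name
        if isinstance(block, dict) and "type" not in block:
            # No "type" here, so this is a grouping level: descend into it.
            flat.update(flatten_catalog(block, full_name, sep))
        else:
            # Dataset definition (or a non-dict value): keep it unchanged.
            flat[full_name] = block
    return flat


# Example input mirroring the YAML above; types and paths are placeholders.
nested = {
    "a_group_of_artifacts": {
        "outputs": {"type": "pandas.CSVDataset", "filepath": "data/outputs.csv"},
        "errors": {"type": "pandas.CSVDataset", "filepath": "data/errors.csv"},
    },
    "standalone": {"type": "pandas.CSVDataset", "filepath": "data/standalone.csv"},
}

flat = flatten_catalog(nested)
print(sorted(flat))
```

The flattened keys (`a_group_of_artifacts:outputs`, `a_group_of_artifacts:errors`) match the addressing scheme described in the issue. Whether such names are safe to use as dataset names in an existing project is not something this sketch checks.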

Possible Alternatives

Maybe some other solution I don't know about? Not a Kedro expert...

@namedgraph namedgraph added the Issue: Feature Request New feature or improvement to existing feature label Oct 28, 2024
@lrcouto
Contributor

lrcouto commented Oct 28, 2024

Hey @namedgraph, thank you for your feature proposal. Your idea makes sense, but as of now Kedro does not support grouping artifacts in the manner you describe; it interprets each top-level entry in the catalog as a separate dataset with its own type definition.

For now, you can try using Kedro dataset factories to reduce the number of similar catalog entries in your project.

@namedgraph
Author

@lrcouto it feels inconsistent that one can nest YAML in parameters and use the parent:child syntax, but not in the catalog 🤷‍♂️
