Allow new attributes to be added to DataSets #1076
I'll also highlight the great work dbt does with adding documentation and tests to their sources and models:
I completely missed this issue. I'm excited to see this one happen. We had another use case this week. We are using some versioned datasets and would like to be able to create a plugin to clean up some of the old versions, inspired by sanoid. It would be nice if we could pop some extra config into the catalog to specify a policy of how many to keep, then we can create a separate plugin that can clean these up in a scheduled run. And keep all of our configurations for each dataset together.

```yaml
my_dataset:
  versioned: true
  extras: # whatever place you want to give us to configure plugins/extras
    cleanup_old_versions_plugin:
      frequent: 4
      daily: 7
      year: 1
      month: 1
```
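To make the cleanup idea concrete, here is a hypothetical sketch of the sanoid-style retention logic such a `cleanup_old_versions_plugin` could apply to a dataset's version timestamps. This is plain Python, not an existing Kedro API; the function name and the `daily`/`year` policy keys mirror the config above, but everything else is an assumption:

```python
from datetime import datetime


def versions_to_keep(timestamps, policy):
    """Return the subset of version timestamps to retain.

    For each period in the policy (e.g. {"daily": 7, "year": 1}) keep the
    newest version per calendar bucket, up to the configured count.
    Hypothetical plugin logic, not part of Kedro itself.
    """
    keep = set()
    buckets = {
        "daily": lambda ts: (ts.year, ts.month, ts.day),
        "month": lambda ts: (ts.year, ts.month),
        "year": lambda ts: (ts.year,),
    }
    for period, count in policy.items():
        if period not in buckets:
            continue  # e.g. "frequent" would need sub-daily bucketing
        key = buckets[period]
        seen = []
        for ts in sorted(timestamps, reverse=True):  # newest first
            bucket = key(ts)
            if bucket not in seen:
                seen.append(bucket)
                if len(seen) > count:
                    break  # already have enough buckets for this period
                keep.add(ts)
    return keep


versions = [
    datetime(2022, 1, 1),
    datetime(2022, 1, 2),
    datetime(2022, 1, 3),
    datetime(2021, 6, 1),
]
keep = versions_to_keep(versions, {"daily": 2, "year": 1})
# -> keeps 2022-01-03 and 2022-01-02; everything else may be deleted
```

A scheduled run would then remove any version not in the returned set.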
Several other libraries allow users to add a
DVC allows this through a
So my next question is - then what? The user has provided some arbitrary metadata:
This issue was discussed in a Technical Design session:
Open questions that need to be answered:
```yaml
my_dataset:
  type: ...
  filepath: ...
  metadata: # <- this would be the top-level key
    kedro-viz:
      layer: ...
    kedro-mlflow: ...
    kedro-great-expectations: ...
```

or

```yaml
my_dataset:
  type: ...
  filepath: ...
  kedro-viz:
    layer: ...
  kedro-mlflow: ...
  kedro-great-expectations: ...
```

A top-level key would make it clearer that any entries under it aren't core Kedro, but custom attributes that will be handled by a plugin. However, the top-level key does add extra nesting, and especially for the
Next step: Design an implementation proposal for allowing new attributes to be added to DataSets. Make sure to address the questions above. Discuss this proposal with plugin developers e.g. @WaylonWalker and @Galileo-Galilei
This is exactly what we were looking for today. We were thinking about parametrizing the pandas-profiling package, in a way that allows you to decide which dataset to profile and how. The top-level key looks like a great solution. I would argue that the layer parameter now makes more sense, especially since kedro-viz is in fact a separate plugin.
This is something my team and I would love. Use case: display in kedro viz, possibly by displaying datasets with a certain tag close to each other (hard to do if you choose to allow for nested tags, I believe)
Hi, I'd love to see this feature implemented. My main use-case would be to add a schema (backed by
I can create an extra configuration file where the user declares all these attributes, but it would be a much better user experience to be able to declare this schema in the catalog. Regarding the open questions:
@merelcht Do you want me to draft a PR for this? I don't want to waste my time and yours if you do not have time to review or if you think that the technical design is not mature enough, but if you think we have a chance to include it in the core library in, say, a couple of months, I'd be glad to provide support for this.
Hi @Galileo-Galilei, thanks for responding to this issue. I'd be very happy for you to start drafting an implementation. We agreed with the team that this is functionality we want to support. We simply haven't been able to prioritise this yet. So if we can speed this up by getting your help, I'm all for it! 👍 😄
Hi @Galileo-Galilei, we'll be prioritising this task in our next sprint, so I just wanted to check if you'd made any progress or have any findings we should probably know about? In #2409 you mentioned you might need to change the
Hi @merelcht, as usual I overestimated the time I can allocate for this PR, sorry 😅 I played a bit with the syntax, and I ultimately wanted to add an extra
No worries @Galileo-Galilei and thanks for getting back to my question! I personally think it would make sense to have the attribute on the individual datasets, but not on the
Hiya, just jumping in to add some feedback that arose from a conversation with @datajoely: At the moment I'm looking into adding validation to some datasets, potentially using pandera. Pandera supports yml syntax (like so) for defining these validations, so if I've understood the purpose of this feature correctly, it could be a good match for this.

One potential source of friction I see is that things like validations could change a lot more often than the dataset definition itself, and are also very verbose, so I was wondering whether it would make sense to somehow provide these additional attributes from a different file (e.g. ...).

My perspective was something like this, relying on OmegaConf's subkey merging:
And Joel's suggestion was something like this:
Hi @inigohidalgo, I agree it would be great to do this and indeed OmegaConf should enable us to inject yaml like this. Since we merge together all files matching the
The only catch I can think of here is that we might need something to prevent kedro from trying to process
In a way this is the same concept as yaml anchors, which we already use in our use cases for dataset definitions; we could do this without OmegaConf even. I was wondering if you'd be able to do this merge how I outlined it in my example, leveraging OmegaConf's nested sub-config merge. Laying it out like this should get around the issue of kedro processing it as a dataset (I'm making an assumption about the order of operations in Kedro: that it does all this config merging and interpolating first, and then, with the final catalog dictionary, instantiates the Catalog):
Note: I haven't read the entire issue in depth. I notice you used a sub-key `metadata:` to indicate this additional metadata, which would mean that, to match the structure you wrote, the required `validations.yml` would be:
But the point is the same :)
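For what it's worth, the sub-config merge being discussed can be sketched in a few lines of plain Python. This is a simplified stand-in for OmegaConf's merge behaviour, not Kedro's actual config loader; the file contents and the pandera keys below are made up for illustration:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base: nested dicts are merged
    key-by-key (OmegaConf-style), and scalar values in override win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Parsed contents of a hypothetical catalog.yml: the dataset definition itself
catalog = {
    "my_dataset": {
        "type": "pandas.CSVDataSet",
        "filepath": "data/01_raw/my_dataset.csv",
    }
}

# Parsed contents of a hypothetical validations.yml: only the extra metadata,
# keyed by the same dataset name so the two entries merge
validations = {
    "my_dataset": {
        "metadata": {"pandera": {"columns": {"price": {"dtype": "float64"}}}}
    }
}

final_catalog = deep_merge(catalog, validations)
# final_catalog["my_dataset"] now carries type, filepath and metadata together
```

The point of the sketch is that neither file needs to know about the other's keys; the loader only has to merge matching top-level entries before instantiating the Catalog.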
Oh I see - sorry I misunderstood your previous example. You're right that what I wrote above is just a cross-file extension of what yaml anchors can do. I didn't think of doing it the way you suggested here but it makes perfect sense. Aside from the fact that
Right now the config loader will throw an error about duplicate keys, but if we remove `OmegaConfigLoader`'s call to
So yes, in future, I think this should work exactly like you say 🙂 The only thing you'd need to be careful of is to make sure that your
@WaylonWalker @inigohidalgo @foxale @GabrieleCacchioni We're excited to announce that this feature has been added and will be released soon. #2537 introduced the changes. It works in the following way:
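Going by the `metadata` top-level key agreed on earlier in the thread, a catalog entry under the released feature would look something like this (the dataset name, filepath and plugin sub-keys are illustrative; each plugin is expected to read only its own sub-key):

```yaml
companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
  metadata:
    kedro-viz:
      layer: raw
    my-plugin:          # hypothetical plugin namespace
      anything: goes    # entries under metadata are not interpreted by core Kedro
```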
I'll close this issue but please shout when you use it ❤️ Thank you so much for giving us feedback on this.
Problem
The Data Catalog does not allow users to add custom attributes or metadata to datasets. This means that users have to:
This request has presented itself in different ways, most commonly in a request to add dataset parameters and not use `parameters.yml` for parameters that are directly related to a dataset (link). It has also shown itself in users wanting to add optional tags for datasets and have a way to interact with their user-defined tags (link).

User evidence
Create their own custom datasets
"I have certain attributes to track within my datasets and have created custom DataSets to get around this issue. Now that hooks are out most of my reasons for custom DataSets are gone, and I can achieve the same thing with an after_node_run hook, but I still cannot attach custom attributes to datasets." - Waylon Walker (link)
Use comments in their catalog to help them keep track of associated metadata
"Am I the only one that leaves random notes in the catalog next to at least the raw datasets?
Was thinking if there would be any usage of having a meta or info field." - Andrej Marsic
Connect Parameters and the Data Catalog
"There could easily be other cases where a node might need to know what parameter(s) a dataset was loaded with. You can always just duplicate the list in both yaml files, but it's more ideal to have the parameter specified in only one place, especially it's some parameter you can play around with." - D. Sturgeon (link)
Possible implementation
My suggestion is for the following:
It could look something like this:
or even:
Questions
Should `layers` be one of these metadata fields?