Declare datasets as 'only for debugging' and don't save them in standard kedro run #1160
Comments
Hi @drtorchwood, this is an interesting idea. It's not something that has come up before, so I'd be interested to see if anyone else from the community would see value in adding this natively. Off the top of my head, there are a couple of ways you could do this today:
my_dataset:
  type: DebugMemoryDataSet
  dataset_config:
    type: pandas.CSVDataSet
    filepath: ....
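For context, a minimal sketch of what such a custom dataset could look like, assuming Kedro's AbstractDataSet interface; the class name DebugMemoryDataSet matches the snippet above, while the KEDRO_DEBUG_SAVE environment variable used as the on/off switch is purely illustrative and not an existing Kedro feature:

```python
import os
from typing import Any, Dict

from kedro.io import AbstractDataSet, MemoryDataSet


class DebugMemoryDataSet(AbstractDataSet):
    """Behaves like a MemoryDataSet, but also writes to the wrapped
    dataset when debug persistence is switched on."""

    def __init__(self, dataset_config: Dict[str, Any]):
        self._memory = MemoryDataSet()
        # Build the real (on-disk) dataset from the nested catalog config.
        self._wrapped = AbstractDataSet.from_config("wrapped", dataset_config)
        # Illustrative toggle only: persist when KEDRO_DEBUG_SAVE is set.
        self._persist = bool(os.environ.get("KEDRO_DEBUG_SAVE"))

    def _save(self, data: Any) -> None:
        self._memory.save(data)
        if self._persist:
            self._wrapped.save(data)

    def _load(self) -> Any:
        # Prefer the in-memory copy produced earlier in the same run;
        # fall back to the persisted dataset otherwise.
        if self._memory.exists():
            return self._memory.load()
        return self._wrapped.load()

    def _describe(self) -> Dict[str, Any]:
        return {"persist": self._persist}
```

In the catalog, the type would typically be given as the full dotted path to wherever this class lives in your project's source.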
Actually I think there's an easier way to do this already, using a run environment. If you define the debugging datasets in the catalog of a separate configuration environment rather than in the base catalog, they are only picked up when you run with that environment; in a standard run they simply fall back to MemoryDataSets. Would that achieve what you're looking for?
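Concretely, this might look something like the following (the environment name debug and the dataset entry are illustrative placeholders):

```yaml
# conf/debug/catalog.yml -- only loaded when running with this environment
intermediate_features:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/intermediate_features.csv
```

A plain kedro run then treats intermediate_features as a MemoryDataSet, while kedro run --env=debug layers this catalog on top of conf/base and persists the dataset to disk.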
As always @AntonyMilneQB comes up with a smarter solution - do that instead 🤣 @drtorchwood (nice name btw)
Thanks @AntonyMilneQB for this nice idea. I can confirm that it works. |
Description
I would like to propose the following feature: declare datasets in the catalog as 'only relevant for debugging' and then ignore them during a standard run of the kedro pipeline, i.e. handle them as MemoryDataSets without writing to disk. An argument when starting the pipeline should enable/disable this behavior.
Context
During development and error tracing, I normally store the output of almost all nodes to analyze them and to find errors. For this, I define them in the catalog with a meaningful name. However, storing to disk is quite a time-consuming step, and in the standard deployment of the pipeline these files are not needed. Therefore, I remove them (as comment lines) from the data catalog. It would be much easier if I could flag them as 'only for debugging' and switch them on/off with a flag when starting the pipeline.
In a perfect world, I could also remove all these files (if they exist on disk) with kedro directly.
Possible Implementation
The dataset should have an additional (optional) attribute like

  for_debugging: false/true

(default: false), and the pipeline(s) could be started with something like kedro run --store_debug_files.

Perhaps related to: #1076
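As an illustration, a catalog entry under this proposal might look like the following (for_debugging and --store_debug_files are the names suggested above, not existing Kedro options):

```yaml
my_debug_dataset:
  type: pandas.CSVDataSet
  filepath: data/02_intermediate/my_debug_dataset.csv
  for_debugging: true  # kept in memory unless `kedro run --store_debug_files` is passed
```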