
Declare datasets as 'only for debugging' and don't save them in standard kedro run #1160

Closed
drtorchwood opened this issue Jan 18, 2022 · 4 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@drtorchwood

Description

I would like to propose the following feature: declare datasets in the catalog as 'only relevant for debugging' and ignore them during a standard run of the Kedro pipeline, i.e. handle them as MemoryDataSets without writing to disk. A command-line argument when starting the pipeline should enable/disable this behavior.

Context

During development and error tracing, I normally store the output of almost all nodes so I can analyze them and find errors. For this, I define them in the catalog with a meaningful name. However, storing to disk is quite a time-consuming step, and in the standard deployment of the pipeline these files are not needed. Therefore, I remove them (by commenting them out) from the data catalog. It would be much easier if I could flag them as 'only for debugging' and switch them on/off with a flag when starting the pipeline.
In a perfect world, I could also remove all these files (if they exist on disk) with Kedro directly.

Possible Implementation

The dataset should have an additional (optional) attribute like for_debugging: true/false (default false), and the pipeline(s) could be started with something like kedro run --store_debug_files.
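For illustration, a catalog entry under this proposal might look like the fragment below. The attribute name, dataset name, and filepath are all hypothetical; none of this is implemented in Kedro:

```yaml
# Hypothetical syntax for the proposed feature (not implemented):
model_predictions:
  type: pandas.CSVDataSet
  filepath: data/07_model_output/predictions.csv
  for_debugging: true  # kept in memory unless `kedro run --store_debug_files`
```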

Perhaps related to: #1076

@drtorchwood drtorchwood added the Issue: Feature Request New feature or improvement to existing feature label Jan 18, 2022
@datajoely
Contributor

Hi @drtorchwood, this is an interesting idea. It's not something that has come up before, so I'd be interested to see whether anyone else in the community would see value in adding this natively.

Off the top of my head there are a couple of ways you could do this today:

  1. Implement a custom runner by adapting DryRunner. Here you could implement a runner class that (possibly driven by config) converts certain datasets to MemoryDataSet.
  2. Implement a custom wrapper dataset that looks something like the snippet below (this is the same way some of our built-in datasets, like PartitionedDataSet or CachedDataSet, work).
my_dataset:
    type: DebugMemoryDataSet
    dataset_config:
        type: pandas.CSVDataSet
        filepath: ....
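A minimal, self-contained sketch of that wrapper idea follows. To keep it runnable without Kedro, the stand-in classes below only mimic the dataset interface (save/load); in a real project, DebugMemoryDataSet would subclass kedro.io.AbstractDataSet and build the wrapped dataset via AbstractDataSet.from_config. All names here, and the store_debug_files toggle, are illustrative:

```python
class MemoryDataSet:
    """Stand-in for kedro.io.MemoryDataSet: keeps data in memory only."""

    def __init__(self):
        self._data = None

    def save(self, data):
        self._data = data

    def load(self):
        return self._data


class CSVDataSet:
    """Stand-in for a disk-backed dataset; records saves instead of writing."""

    saved = []  # (filepath, data) pairs, so we can observe "disk" writes

    def __init__(self, filepath):
        self._filepath = filepath

    def save(self, data):
        CSVDataSet.saved.append((self._filepath, data))


class DebugMemoryDataSet:
    """Wrapper: writes through to the real dataset only in debug mode,
    otherwise silently degrades to an in-memory dataset."""

    def __init__(self, dataset_config, store_debug_files=False):
        self._wrapped = (
            CSVDataSet(dataset_config["filepath"])
            if store_debug_files
            else MemoryDataSet()
        )

    def save(self, data):
        self._wrapped.save(data)

    def load(self):
        return self._wrapped.load()
```

The wrapper could then read store_debug_files from a project setting or environment variable, so the catalog stays unchanged between debug and production runs.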

@antonymilne
Contributor

antonymilne commented Jan 28, 2022

Actually I think there's an easier way to do this already using a run environment. If you define datasets in a debug run environment only then they will only be written to disk when using that run environment. If they're not defined explicitly then they will implicitly just be MemoryDataSet, as you want for a normal run.

# conf/base/catalog.yml doesn't define my_dataset

# conf/debug/catalog.yml
my_dataset:
  type: pandas.ParquetDataSet
  filepath: ...

Now if you do kedro run then my_dataset will remain in memory, but if you do kedro run -e debug it will be saved to disk.

Would that achieve what you're looking for?

@datajoely
Contributor

As always @AntonyMilneQB comes up with a smarter solution - do that instead 🤣 @drtorchwood (nice name btw)

@drtorchwood
Author

Thanks @AntonyMilneQB for this nice idea. I can confirm that it works.
Just as a hint for anyone else who wants to use this solution: the datasets in the debug catalog are created in addition to those in the base catalog (and override the base definitions if duplicate names exist).
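To illustrate that merge behaviour (dataset names and filepaths below are made up): with these two files, kedro run -e debug sees both entries, while a plain kedro run sees only the base entry and keeps my_dataset in memory:

```yaml
# conf/base/catalog.yml -- always loaded
companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv

# conf/debug/catalog.yml -- merged on top of base with `kedro run -e debug`;
# a same-named entry here would override the base definition
my_dataset:
  type: pandas.ParquetDataSet
  filepath: data/08_reporting/my_dataset.pq
```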
