
Declare datasets as 'only for debugging' and don't save them in standard kedro run #1160

Closed
drtorchwood opened this issue Jan 18, 2022 · 4 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@drtorchwood

Description

I would like to propose the following feature: declare datasets in the catalog as 'only relevant for debugging' and ignore them during a standard run of the Kedro pipeline, i.e. handle them as MemoryDataSets without writing to disk. A command-line argument when starting the pipeline should enable/disable this behavior.

Context

During development and error tracing, I normally store the output of almost all nodes so I can analyze them and find errors. For this, I define them in the catalog with a meaningful name. However, storing to disk is quite a time-consuming step, and in the standard deployment of the pipeline these files are not needed. Therefore, I remove them (by commenting them out) from the data catalog. It would be much easier if I could flag them as 'only for debugging' and switch them on/off with a flag when starting the pipeline.
In a perfect world, I could also remove all these files (if they exist on disk) with Kedro directly.

Possible Implementation

The dataset should have an additional (optional) attribute like for_debugging: true/false (default false), and the pipeline(s) could be started with something like kedro run --store_debug_files.
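For illustration, a catalog entry under this proposal might look like the fragment below. The attribute name, dataset name, and filepath are all hypothetical; none of this is implemented in Kedro:

```yaml
# Hypothetical syntax for the proposed feature (not implemented):
model_predictions:
  type: pandas.CSVDataSet
  filepath: data/07_model_output/predictions.csv
  for_debugging: true  # kept in memory unless `kedro run --store_debug_files`
```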

Perhaps related to: #1076

@drtorchwood drtorchwood added the Issue: Feature Request New feature or improvement to existing feature label Jan 18, 2022
@datajoely
Contributor

Hi @drtorchwood, this is an interesting idea. It's not something that has come up before, so I'd be interested to see whether anyone else in the community would see value in adding this natively.

Off the top of my head there are a couple of ways you could do this today:

  1. Implement a custom runner by adapting DryRunner. Here you could implement a runner class that (possibly driven by config) converts certain datasets to MemoryDataSet.
  2. Implement a custom wrapper dataset that looks something like the snippet below (this is the same way some of our built-in datasets, like PartitionedDataSet or CachedDataSet, work).
my_dataset:
    type: DebugMemoryDataSet
    dataset_config:
        type: pandas.CSVDataSet
        filepath: ....
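A minimal, self-contained sketch of that wrapper idea follows. To keep it runnable without Kedro, the stand-in classes below only mimic the dataset interface (save/load); in a real project, DebugMemoryDataSet would subclass kedro.io.AbstractDataSet and build the wrapped dataset via AbstractDataSet.from_config. All names here, and the store_debug_files toggle, are illustrative:

```python
class MemoryDataSet:
    """Stand-in for kedro.io.MemoryDataSet: keeps data in memory only."""

    def __init__(self):
        self._data = None

    def save(self, data):
        self._data = data

    def load(self):
        return self._data


class CSVDataSet:
    """Stand-in for a disk-backed dataset; records saves instead of writing."""

    saved = []  # (filepath, data) pairs, so we can observe "disk" writes

    def __init__(self, filepath):
        self._filepath = filepath

    def save(self, data):
        CSVDataSet.saved.append((self._filepath, data))


class DebugMemoryDataSet:
    """Wrapper: writes through to the real dataset only in debug mode,
    otherwise silently degrades to an in-memory dataset."""

    def __init__(self, dataset_config, store_debug_files=False):
        self._wrapped = (
            CSVDataSet(dataset_config["filepath"])
            if store_debug_files
            else MemoryDataSet()
        )

    def save(self, data):
        self._wrapped.save(data)

    def load(self):
        return self._wrapped.load()
```

The wrapper could then read store_debug_files from a project setting or environment variable, so the catalog stays unchanged between debug and production runs.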

@antonymilne
Contributor

antonymilne commented Jan 28, 2022

Actually I think there's an easier way to do this already using a run environment. If you define datasets in a debug run environment only then they will only be written to disk when using that run environment. If they're not defined explicitly then they will implicitly just be MemoryDataSet, as you want for a normal run.

# conf/base/catalog.yml doesn't define my_dataset

# conf/debug/catalog.yml
my_dataset:
  type: pandas.ParquetDataSet
  filepath: ...

Now if you do kedro run then my_dataset will remain in memory, but if you do kedro run -e debug it will be saved to disk.

Would that achieve what you're looking for?

@datajoely
Contributor

As always @AntonyMilneQB comes up with a smarter solution - do that instead 🤣 @drtorchwood (nice name btw)

@drtorchwood
Author

Thanks @AntonyMilneQB for this nice idea. I can confirm that it works.
Just as a hint for anyone else who wants to use this solution: the datasets in the debug catalog are created in addition to those in the base catalog (and override the base definitions if duplicate names exist).
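To illustrate that merge behaviour (dataset names and filepaths below are made up): with these two files, kedro run -e debug sees both entries, while a plain kedro run sees only the base entry and keeps my_dataset in memory:

```yaml
# conf/base/catalog.yml -- always loaded
companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv

# conf/debug/catalog.yml -- merged on top of base with `kedro run -e debug`;
# a same-named entry here would override the base definition
my_dataset:
  type: pandas.ParquetDataSet
  filepath: data/08_reporting/my_dataset.pq
```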
