[FEATURE] Wrapping any estimator for caching at fit, predict, and transform time. #706

Open
antngh opened this issue Oct 25, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@antngh

antngh commented Oct 25, 2024

Please let me know if this is not the correct place or way to start this discussion.

I have some code for a wrapper around an estimator (transformer or predictor) that quickly saves the object and the data to disk. If the wrapped estimator is called with an identical instance (same properties, etc.) and the same input data, it fetches the result from disk rather than rerunning the corresponding fit/predict/transform/(etc.) code. The wrapped estimator behaves exactly as a normal estimator would in all the cases I've tested.
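To give a rough idea, here is a minimal illustrative sketch (made-up names, not the actual code in my repo): cache the output keyed on the estimator's parameters plus a hash of the input data.

```python
import hashlib
from pathlib import Path

import joblib
from sklearn.base import BaseEstimator, clone


class CachedEstimator(BaseEstimator):
    """Illustrative sketch: cache transform results keyed on the wrapped
    estimator's parameters and a hash of the input data."""

    def __init__(self, estimator, cache_dir="cache"):
        self.estimator = estimator
        self.cache_dir = cache_dir

    def _key(self, method, X):
        # Combine the estimator's parameters, the method name and a hash
        # of the input data into a single cache key.
        h = hashlib.sha256(repr(self.estimator.get_params()).encode())
        h.update(method.encode())
        h.update(joblib.hash(X).encode())
        return h.hexdigest()

    def fit(self, X, y=None):
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        path = Path(self.cache_dir) / f"{self._key('transform', X)}.joblib"
        if path.exists():
            return joblib.load(path)  # cache hit: load the stored result
        out = self.estimator_.transform(X)  # cache miss: compute and store
        path.parent.mkdir(parents=True, exist_ok=True)
        joblib.dump(out, path)
        return out
```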

Sklearn provides something similar with the memory arg in the Pipeline class, but it only covers fitting, not inference, and even then it won't apply to the last step in the pipeline.
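For comparison, the built-in behaviour looks roughly like this (only the fits of the intermediate transformers are cached; the final step and the inference calls are not):

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=1_000, n_features=20, random_state=0)

# `memory` caches the fit of the intermediate transformers only; the final
# estimator is always refit, and predict/transform calls are never cached.
pipe = Pipeline(
    [("scale", StandardScaler()), ("pca", PCA(n_components=5)), ("model", Ridge())],
    memory="cache_dir",
)
pipe.fit(X, y)
```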

This is especially useful when an estimator has a slow predict/transform step and you want to rerun the pipeline quickly. It will recompute if needed (if either the estimator or the input data has changed), but otherwise it just loads from disk. The cache also persists across runs: if you restart the kernel or rerun the script, you can pick up where you left off. This isn't intended as a data store, but it can really speed up the development of pipelines that have slow steps.

Please let me know if this functionality, in full or in part (say, the code that checks whether the data is the same), could be useful, and I will look into adding it to this repo. I can't commit to fully maintaining it going forward, but as of now it seems to work well.

@antngh antngh added the enhancement New feature or request label Oct 25, 2024
@antngh antngh changed the title from [FEATURE] to [FEATURE] Wrapping any estimator for caching at fit, predict, and transform time. Oct 25, 2024
@antngh
Author

antngh commented Oct 27, 2024

You can see the code here: https://github.com/antngh/sklearn-estimator-caching

@FBruzzesi
Collaborator

Hey @antngh, thanks for the issue and for already putting the effort into this. I took a sneak peek at the repo, and judging by its size alone it could deserve to be its own project/repo.

I can imagine people having multiple use cases for such a caching mechanism, and therefore different feature requests for it.

I will wait for @koaning to weigh in on this as well.

@koaning
Owner

koaning commented Oct 27, 2024

I have also observed pipelines becoming slower with caching on the sklearn side. If the numpy array going in is huge, the hashing can actually be slower than the pipeline itself. It doesn't happen all the time, but it's worth keeping in the back of your mind.

I wonder: if the final element of a pipeline is skipped, why not add a FunctionTransformer at the end? With no arguments it behaves like an identity function, but it will act as the "final" transformer. Does that not work with the memory flag in a normal pipeline?
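Something along these lines (untested sketch with placeholder step names; PCA just stands in for the slow transformer):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

X, _ = make_classification(n_samples=1_000, n_features=20, random_state=0)

# With no arguments, FunctionTransformer is the identity, so the slow step is
# no longer the last step and should be eligible for caching via `memory`.
pipe = Pipeline(
    [
        ("scale", StandardScaler()),
        ("slow_step", PCA(n_components=5)),   # stand-in for a slow transformer
        ("identity", FunctionTransformer()),  # identity pass-through
    ],
    memory="cache_dir",
)
pipe.fit_transform(X)
```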

Another sensible way to cache an intermediate output is to manually keep a transformed array in memory, or to write it to disk and read it back from there. This only works if you know exactly what needs to be remembered and if it does not change, but it might be easier to reason about compared to the hashing involved in a caching mechanism. I am personally a little hesitant to support it here because I am a bit wary of all the edge cases.
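For example, roughly (placeholder names, with PCA standing in for the expensive transformer):

```python
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=1_000, n_features=20, random_state=0)
slow_transformer = PCA(n_components=5)  # stand-in for an expensive transformer

cache_file = Path("intermediate.joblib")
if cache_file.exists():
    X_intermediate = joblib.load(cache_file)  # reuse the stored output
else:
    X_intermediate = slow_transformer.fit_transform(X)
    joblib.dump(X_intermediate, cache_file)  # store it for the next run
```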

But if you have a compelling benchmark, I'd sure be all ears!

@antngh
Author

antngh commented Oct 28, 2024

Thanks both.

Some timing here: https://github.com/antngh/sklearn-estimator-caching/blob/main/notebooks/caching_timing.ipynb
For 10 million rows and 10 columns, the overhead added by the wrapper is around 4 seconds (on par with or slightly less than using a pipeline with memory). The second call is similar, whereas for the pipeline with memory it is about 1.5 seconds. (Edit: for 100M rows my code takes around 2 minutes on both the first and second call, while sklearn's memory takes about 1 minute on the first call and 20 seconds on the second.)

The overhead for the inference calls is about the same as for the fit calls (the pipeline has no equivalent caching here).

I first created this code specifically because of some very slow custom transformers I was working with. In my case it wasn't a matter of a normal transformer with a huge dataset, but rather a big dataset combined with a very slow transformation step. In that case I see a huge improvement when using this wrapper. You're right that we could manually save/load the data, but that quickly becomes hard to track and manage.

> I am a bit wary of all the edge cases.

I fully understand.
