[FIX] Implement predict() in simple tree and simple RF models #6258

markotoplak · 2022-12-14T14:37:29Z

Issue

This avoids creating intermediate tables in predict(), which then create problems onwards because, currently, creating a table with an array locks the array given. The array remains locked even when the intermediate table is disregarded.

This PR is a fix for some crashes in biolab/orange3-explain#54.

The big issue here, which remains, is how Orange's Tables behave: if arrays were copied in init, only internal arrays would be locked. Anyway, I believe the code is now a bit cleaner.

Includes

Code changes
Tests
Documentation

codecov · 2022-12-14T15:07:11Z

Codecov Report

Merging #6258 (75ff06e) into master (e75f21c) will decrease coverage by 0.00%.
The diff coverage is 92.30%.

❗ Current head 75ff06e differs from pull request most recent head 28c90fb. Consider uploading reports for the commit 28c90fb to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6258      +/-   ##
==========================================
- Coverage   86.70%   86.70%   -0.01%     
==========================================
  Files         316      316              
  Lines       68120    68129       +9     
==========================================
+ Hits        59064    59068       +4     
- Misses       9056     9061       +5

markotoplak · 2022-12-14T15:23:31Z

@VesnaT wondered why was this the only problematic model. Well, because the new deprecation does not break anything we see that it was the only model primarily working through predict_storage.

markotoplak · 2022-12-14T19:25:01Z

I was confused how our models work and thus I wanted to deprecate the default implementation of predict, which creates a new table and runs predict_storage if the latter is undefined. Having it deprecated would force all our prediction logic into predict().

Then I found where predict_storage is actually being used: in StackedModel. There, each sub-model needs to be called with a __call__ (so that preprocessors could run), which requires a data table, and thus predict_storage seems a sensible place for it. On the other hand the predict within StackedModel could just create a table in the same way. That would decrease confusion about what is happening in the general case.

Notice that the tests do not fail even with the deprecation in place. This is no tests in Orange is calling StackedModel.predict() directly (__call__ runs predict_storage, so...), but I guess someone could still call it. Perhaps this would happen with the explanation widgets? @VesnaT, any ideas?

lanzagar

The big issue here, which remains, is how Orange's Tables behave: if arrays were copied in init, only internal arrays would be locked.

I think this is the important observation. If creating tables with existing arrays is a problem, I am sure it happens in many other places, not just predict from this PR.
I like the proposal of copying the arrays in init, maybe by introducing a new copy parameter that is True by default but can be explicitly changed so it is still possible to reuse the same arrays (but only if we can think of cases where this would actually be needed/useful).

In case we change Table's init, we would not need to copy X before using it to construct a Table in predict, which I mentioned in the comments in this PR.

What do you think @markotoplak, @janezd?

lanzagar · 2022-12-16T08:41:05Z

Orange/regression/simple_random_forest.py

+            pt = tree.predict(X)
+            p += pt


I don't see a reason to introduce pt - you could still += in one line

Yes I could. But I kept it for symmetry with the other class (for classification).

lanzagar · 2022-12-16T09:01:49Z

Orange/base.py

+            warnings.warn("Automatic fallback to predict_storage will be removed in 3.36. "
+                          "All models need to implement predict.",
+                          OrangeDeprecationWarning)


What do you expect people to do, when they see this warning?
If the model did not implement a predict before, they could just copy the 4 lines below into their predict which of course would solve nothing?
If I understand the original problem correctly, the Table should be created with a copy of X. If that is the fix, I think it is better to put it here one time and avoid people maybe implementing predict incorrectly in their own forced implementations.

I currently just do not see any good reasons that anyone would directly implement predict_storage. Most likely they should implement predict and I'd guess that in most of cases they would not need to construct a table.

The only exception I can imagine is that they want to call sublearner's __call__.

Waiting for counterexamples.

Main counterexamples are when working with non- numpy/Table data or some additional information not found in X/Y/metas.

SqlTable used fit/predict storage because you can offload processing to the sql server (if you use X, it downloads a sample). Although we are not actively working on that, I like that the architecture supports the option to.

I see that Survival analysis uses fit_storage, because it needs more than X&Y, but not for predicting, where X is enough? @JakaKokosar

You are developing Dask tables, maybe we will want to adjust and optimize some learner in the future that will use some DaskTable methods/caching? (but more likely in this case we could just have an if in standard predict to check the type of X and use dask functions directly).

We have Corpus and Timeseries classes that extend Table and maybe some models will want to use them and the numpy arrays will not be enough

Bottom line, if you are questioning why "anyone would directly implement predict_storage", then the discussion should be if we want to remove that method completely - and as long as everything in Orange is based on Tables and we support its extensions, I would lean towards leaving this support and not rely exclusively on numpy arrays. At least until we switch to pandas, then we should re-evaluate :)

Thanks for a list where this could go wrong. I agree. Let's not deprecate.

markotoplak · 2022-12-19T09:54:50Z

@lanzagar, the goal of the deprecation was to make it clearer to the programmer what is being called where and make it clear that the function to implement is certainly predict.

If we decide we really need a consistent handling of the case when predict does not make sense (we have only 1 (one) such case in core Orange), we could still deprecate it and implement a Model.create_dumb_orange_table (or maybe renamed :) ) helper function for people that would absolutely need Tables within their Models predictors. But I would not...

If this deprecation if too controversial I can remove it from this PR.

This avoid creating intermediate tables.

markotoplak · 2022-12-23T09:27:30Z

I removed the deprectation commit. Let's handle that separately.

markotoplak force-pushed the simpletree-predict branch 2 times, most recently from f48cb7c to fdfba08 Compare December 14, 2022 15:06

markotoplak force-pushed the simpletree-predict branch from fdfba08 to c5a8a1f Compare December 14, 2022 19:11

markotoplak marked this pull request as draft December 14, 2022 19:12

markotoplak changed the title ~~Add predict() to simple tree and simple RF models~~ [RFC] Add predict() to simple tree and simple RF models Dec 14, 2022

markotoplak force-pushed the simpletree-predict branch from c5a8a1f to 75ff06e Compare December 14, 2022 19:26

lanzagar reviewed Dec 16, 2022

View reviewed changes

markotoplak changed the title ~~[RFC] Add predict() to simple tree and simple RF models~~ [FIX] Implement predict() in simple tree and simple RF models Dec 23, 2022

Add predict() to simple tree and simple RF models

28c90fb

This avoid creating intermediate tables.

markotoplak force-pushed the simpletree-predict branch from 75ff06e to 28c90fb Compare December 23, 2022 09:26

markotoplak marked this pull request as ready for review December 23, 2022 09:27

lanzagar merged commit 216264a into biolab:master Dec 23, 2022

markotoplak deleted the simpletree-predict branch January 3, 2023 13:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FIX] Implement predict() in simple tree and simple RF models #6258

[FIX] Implement predict() in simple tree and simple RF models #6258

markotoplak commented Dec 14, 2022

codecov bot commented Dec 14, 2022 •

edited

Loading

markotoplak commented Dec 14, 2022

markotoplak commented Dec 14, 2022 •

edited

Loading

lanzagar left a comment

lanzagar Dec 16, 2022

markotoplak Dec 19, 2022

lanzagar Dec 16, 2022

markotoplak Dec 19, 2022 •

edited

Loading

lanzagar Dec 23, 2022

markotoplak Dec 23, 2022

markotoplak commented Dec 19, 2022

markotoplak commented Dec 23, 2022

[FIX] Implement predict() in simple tree and simple RF models #6258

[FIX] Implement predict() in simple tree and simple RF models #6258

Conversation

markotoplak commented Dec 14, 2022

Issue

Includes

codecov bot commented Dec 14, 2022 • edited Loading

Codecov Report

markotoplak commented Dec 14, 2022

markotoplak commented Dec 14, 2022 • edited Loading

lanzagar left a comment

Choose a reason for hiding this comment

lanzagar Dec 16, 2022

Choose a reason for hiding this comment

markotoplak Dec 19, 2022

Choose a reason for hiding this comment

lanzagar Dec 16, 2022

Choose a reason for hiding this comment

markotoplak Dec 19, 2022 • edited Loading

Choose a reason for hiding this comment

lanzagar Dec 23, 2022

Choose a reason for hiding this comment

markotoplak Dec 23, 2022

Choose a reason for hiding this comment

markotoplak commented Dec 19, 2022

markotoplak commented Dec 23, 2022

codecov bot commented Dec 14, 2022 •

edited

Loading

markotoplak commented Dec 14, 2022 •

edited

Loading

markotoplak Dec 19, 2022 •

edited

Loading