Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] Implement predict() in simple tree and simple RF models #6258

Merged
merged 1 commit into from
Dec 23, 2022

Conversation

markotoplak
Copy link
Member

Issue

This avoids creating intermediate tables in predict(), which then create problems onwards because, currently, creating a table with an array locks the array given. The array remains locked even when the intermediate table is disregarded.

This PR is a fix for some crashes in biolab/orange3-explain#54.

The big issue here, which remains, is how Orange's Tables behave: if arrays were copied in init, only internal arrays would be locked. Anyway, I believe the code is now a bit cleaner.

Includes
  • Code changes
  • Tests
  • Documentation

@markotoplak markotoplak force-pushed the simpletree-predict branch 2 times, most recently from f48cb7c to fdfba08 Compare December 14, 2022 15:06
@codecov
Copy link

codecov bot commented Dec 14, 2022

Codecov Report

Merging #6258 (75ff06e) into master (e75f21c) will decrease coverage by 0.00%.
The diff coverage is 92.30%.

❗ Current head 75ff06e differs from pull request most recent head 28c90fb. Consider uploading reports for the commit 28c90fb to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6258      +/-   ##
==========================================
- Coverage   86.70%   86.70%   -0.01%     
==========================================
  Files         316      316              
  Lines       68120    68129       +9     
==========================================
+ Hits        59064    59068       +4     
- Misses       9056     9061       +5     

@markotoplak
Copy link
Member Author

@VesnaT wondered why was this the only problematic model. Well, because the new deprecation does not break anything we see that it was the only model primarily working through predict_storage.

@markotoplak markotoplak marked this pull request as draft December 14, 2022 19:12
@markotoplak markotoplak changed the title Add predict() to simple tree and simple RF models [RFC] Add predict() to simple tree and simple RF models Dec 14, 2022
@markotoplak
Copy link
Member Author

markotoplak commented Dec 14, 2022

I was confused how our models work and thus I wanted to deprecate the default implementation of predict, which creates a new table and runs predict_storage if the latter is undefined. Having it deprecated would force all our prediction logic into predict().

Then I found where predict_storage is actually being used: in StackedModel. There, each sub-model needs to be called with a __call__ (so that preprocessors could run), which requires a data table, and thus predict_storage seems a sensible place for it. On the other hand the predict within StackedModel could just create a table in the same way. That would decrease confusion about what is happening in the general case.

Notice that the tests do not fail even with the deprecation in place. This is no tests in Orange is calling StackedModel.predict() directly (__call__ runs predict_storage, so...), but I guess someone could still call it. Perhaps this would happen with the explanation widgets? @VesnaT, any ideas?

Copy link
Contributor

@lanzagar lanzagar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The big issue here, which remains, is how Orange's Tables behave: if arrays were copied in init, only internal arrays would be locked.

I think this is the important observation. If creating tables with existing arrays is a problem, I am sure it happens in many other places, not just predict from this PR.
I like the proposal of copying the arrays in init, maybe by introducing a new copy parameter that is True by default but can be explicitly changed so it is still possible to reuse the same arrays (but only if we can think of cases where this would actually be needed/useful).

In case we change Table's init, we would not need to copy X before using it to construct a Table in predict, which I mentioned in the comments in this PR.

What do you think @markotoplak, @janezd?

Comment on lines +72 to +73
pt = tree.predict(X)
p += pt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a reason to introduce pt - you could still += in one line

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I could. But I kept it for symmetry with the other class (for classification).

Orange/base.py Outdated
Comment on lines 232 to 234
warnings.warn("Automatic fallback to predict_storage will be removed in 3.36. "
"All models need to implement predict.",
OrangeDeprecationWarning)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you expect people to do, when they see this warning?
If the model did not implement a predict before, they could just copy the 4 lines below into their predict which of course would solve nothing?
If I understand the original problem correctly, the Table should be created with a copy of X. If that is the fix, I think it is better to put it here one time and avoid people maybe implementing predict incorrectly in their own forced implementations.

Copy link
Member Author

@markotoplak markotoplak Dec 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I currently just do not see any good reasons that anyone would directly implement predict_storage. Most likely they should implement predict and I'd guess that in most of cases they would not need to construct a table.

The only exception I can imagine is that they want to call sublearner's __call__.

Waiting for counterexamples.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main counterexamples are when working with non- numpy/Table data or some additional information not found in X/Y/metas.

  • SqlTable used fit/predict storage because you can offload processing to the sql server (if you use X, it downloads a sample). Although we are not actively working on that, I like that the architecture supports the option to.
  • I see that Survival analysis uses fit_storage, because it needs more than X&Y, but not for predicting, where X is enough? @JakaKokosar
  • You are developing Dask tables, maybe we will want to adjust and optimize some learner in the future that will use some DaskTable methods/caching? (but more likely in this case we could just have an if in standard predict to check the type of X and use dask functions directly).
  • We have Corpus and Timeseries classes that extend Table and maybe some models will want to use them and the numpy arrays will not be enough

Bottom line, if you are questioning why "anyone would directly implement predict_storage", then the discussion should be if we want to remove that method completely - and as long as everything in Orange is based on Tables and we support its extensions, I would lean towards leaving this support and not rely exclusively on numpy arrays. At least until we switch to pandas, then we should re-evaluate :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for a list where this could go wrong. I agree. Let's not deprecate.

@markotoplak
Copy link
Member Author

@lanzagar, the goal of the deprecation was to make it clearer to the programmer what is being called where and make it clear that the function to implement is certainly predict.

If we decide we really need a consistent handling of the case when predict does not make sense (we have only 1 (one) such case in core Orange), we could still deprecate it and implement a Model.create_dumb_orange_table (or maybe renamed :) ) helper function for people that would absolutely need Tables within their Models predictors. But I would not...

If this deprecation if too controversial I can remove it from this PR.

@markotoplak markotoplak changed the title [RFC] Add predict() to simple tree and simple RF models [FIX] Implement predict() in simple tree and simple RF models Dec 23, 2022
This avoid creating intermediate tables.
@markotoplak
Copy link
Member Author

I removed the deprectation commit. Let's handle that separately.

@markotoplak markotoplak marked this pull request as ready for review December 23, 2022 09:27
@lanzagar lanzagar merged commit 216264a into biolab:master Dec 23, 2022
@markotoplak markotoplak deleted the simpletree-predict branch January 3, 2023 13:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants