
PipeOp to try repair predicting with unseen factor levels #71

Closed
berndbischl opened this issue Dec 20, 2018 · 4 comments
Labels
Priority: Medium · Status: Contrib (unprepared) · Status: Needs Design · Type: New PipeOp
Comments

berndbischl commented Dec 20, 2018

Problem: quite often, a learner breaks because somewhere in a larger prediction table it encounters new, unseen factor levels. In such a case the predict of the underlying learner fails completely.

see reprex here:
mlr-org/mlr3#97

This is really annoying, especially as it can happen on only a few observations, yet the complete prediction still fails.

The current option is the mlr3 fallback learner. That does not really help, because it then produces fallback predictions for the complete test set.

Here is MAYBE a better option:

PipeOpUnseenLevels

Before the data goes into the learner, we can, during training, store which levels are present in each factor.

PipeOpUnseenLevels
train:   task --stored levels--> task
predict: task --stored levels--> task

train: simply stores a list, one element per factor feature, with the seen levels.
predict: goes through all observations; for each observation where we see "unseen" levels, we create a random row by sampling from the marginals of the columns.
That is a bit hacky, but should work?
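The two steps above could look roughly like this in plain R (a minimal sketch, not an actual PipeOp subclass; `store_levels` and `repair_unseen` are made-up names for illustration):

```r
# train step: remember, per factor column, which levels were seen
store_levels <- function(train_data) {
  lapply(Filter(is.factor, train_data), levels)
}

# predict step: for rows containing any unseen level, overwrite the factor
# values with draws from the training marginals (the "hacky" random row)
repair_unseen <- function(newdata, stored, train_data) {
  fcols <- names(stored)
  # rows with at least one unseen level in any factor column
  bad <- Reduce(`|`, lapply(fcols, function(col)
    !(as.character(newdata[[col]]) %in% stored[[col]])))
  for (col in fcols) {
    vals <- as.character(newdata[[col]])
    vals[bad] <- as.character(sample(train_data[[col]], sum(bad), replace = TRUE))
    # re-level so the prediction data has exactly the training levels
    newdata[[col]] <- factor(vals, levels = stored[[col]])
  }
  newdata
}

train <- data.frame(f = factor(c("a", "b", "a")), y = 1:3)
stored <- store_levels(train)
test <- data.frame(f = factor(c("a", "c")))  # "c" was never seen in training
repaired <- repair_unseen(test, stored, train)
```

Note the sketch resamples the affected rows in every factor column, which is one possible reading of "random row"; see the question below about which reading is intended.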

@mb706 mb706 added this to the far milestone Jan 30, 2019
@mb706 mb706 removed this from the far range milestone Aug 19, 2019
prockenschaub commented:

By "random row", do you mean that for an observation with an unseen level we sample every variable from the marginals, even those variables for which the observed value was within the training sample, or that we just replace the unseen value itself with a draw from its own column's marginal?

If I were to use this PipeOp, I would personally favour a more deterministic approach: either filter out those observations during prediction that contain a value that wasn't seen during training (ideally with a warning), or prespecify a category that should be used if a new value is seen (the specification could say "marginals", which would cover the second case in my question above).
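The first, deterministic variant might be sketched like this (hypothetical helper name; `stored` is assumed to be a named list with one character vector of training levels per factor column):

```r
# drop, with a warning, any prediction rows that contain a factor level
# not present in the stored training levels
filter_unseen <- function(newdata, stored) {
  bad <- Reduce(`|`, lapply(names(stored), function(col)
    !(as.character(newdata[[col]]) %in% stored[[col]])))
  if (any(bad)) {
    warning(sprintf("dropping %d observation(s) with unseen factor levels",
                    sum(bad)))
  }
  newdata[!bad, , drop = FALSE]
}
```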

@mb706 mb706 added the Status: Needs Design, Status: Contrib (unprepared) and Type: New PipeOp labels Feb 10, 2020
mb706 commented Feb 12, 2020

Note we have the POBackupLearner for something like this: #204

pfistfl commented Mar 30, 2020

This should be solved via fixfactors and imputation.
See e.g.
https://mlr3gallery.mlr-org.com/basics_pipelines_titanic/
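In mlr3pipelines terms that combination might look like the following sketch (`po("fixfactors")` makes prediction-time factor levels match the training levels, so unseen levels become NA; choosing `po("imputesample")` as the downstream imputation is just one option):

```r
library(mlr3)
library(mlr3pipelines)

# fixfactors: unseen levels at predict time become NA
# imputesample: NAs are filled by sampling from the training distribution,
# so the learner never sees an unknown level
graph <- po("fixfactors") %>>%
  po("imputesample") %>>%
  lrn("classif.rpart")

glrn <- GraphLearner$new(graph)
```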

@berndbischl berndbischl self-assigned this Mar 30, 2020
@berndbischl berndbischl added this to the v0.2 milestone Mar 30, 2020
mb706 commented Jun 21, 2020

together with the robustify pipeline this is probably as good as it gets.
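For reference, a sketch of using that pipeline (assuming a current mlr3pipelines where it is available as `ppl("robustify")`):

```r
library(mlr3)
library(mlr3pipelines)

# robustify bundles fixfactors, imputation and related preprocessing in
# front of the learner, so unseen factor levels no longer break predict()
learner <- lrn("classif.rpart")
glrn <- as_learner(ppl("robustify", learner = learner) %>>% learner)
```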
