
Question FunctionTransformer(copy) #581

Closed
saddy001 opened this issue Oct 4, 2017 · 8 comments

saddy001 commented Oct 4, 2017

In my best estimator I see a FunctionTransformer(copy). Is it useful? It just seems to copy the input to the output.


rhiever commented Oct 4, 2017

The FunctionTransformer(copy) object allows a basic form of stacking when a classifier appears in the middle of a pipeline. FunctionTransformer(copy) makes a copy of the entire dataset, and that copy is merged with the predictions of a classifier on the same dataset.
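That mechanism can be sketched with plain sklearn pieces. The `PredictionFeatures` class below is a hypothetical, minimal stand-in for TPOT's actual `StackingEstimator`, used here only to show how a union merges the untouched features with a classifier's predictions:

```python
from copy import copy

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_union
from sklearn.preprocessing import FunctionTransformer


# Hypothetical minimal stand-in for TPOT's StackingEstimator: it exposes a
# classifier's predictions as a feature column so a union can merge them.
class PredictionFeatures(BaseEstimator, TransformerMixin):
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator.fit(X, y)
        return self

    def transform(self, X):
        # One extra column: the inner classifier's predictions.
        return self.estimator.predict(X).reshape(-1, 1)


X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# FunctionTransformer(copy) passes the original features through untouched;
# the union appends the prediction column next to them.
union = make_union(
    FunctionTransformer(copy),
    PredictionFeatures(LogisticRegression()),
)
Xt = union.fit_transform(X, y)
print(Xt.shape)  # (100, 6): 5 original features + 1 prediction column
```

A downstream estimator placed after this union then sees both the raw features and the first classifier's predictions.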


saddy001 commented Oct 4, 2017

Nice, a feedback classifier. Is it mentioned anywhere in the sklearn docs? I couldn't find anything about it.


rhiever commented Oct 5, 2017

I don't think this is mentioned in the sklearn docs. We implemented this feature ourselves within the existing sklearn pipeline framework.

@saddy001 saddy001 closed this as completed Oct 8, 2017
@BenjaminHabert

This can lead to weird pipelines, though. Here is what I got:

# Score on the training set was: 0.333968253968
from copy import copy

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer

exported_pipeline = make_pipeline(
    make_union(
        FunctionTransformer(copy),
        FunctionTransformer(copy)
    ),
    # bootstrap was exported as the string "false" (which is truthy);
    # as a boolean it should be False
    RandomForestClassifier(bootstrap=False, criterion="gini", max_features=0.15,
                           min_samples_leaf=10, min_samples_split=4, n_estimators=100)
)

In this case I doubt that the FunctionTransformer(copy) is useful. I guess adding copies is roughly equivalent to tweaking the max_features parameter of the random forest.
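For illustration, a union of two copy branches just concatenates two identical copies of the input, doubling the column count, which is why it mostly resembles a change to the forest's feature sampling (a quick sketch):

```python
from copy import copy

import numpy as np
from sklearn.pipeline import make_union
from sklearn.preprocessing import FunctionTransformer

# Two identity branches: the union concatenates two identical copies of X.
union = make_union(FunctionTransformer(copy), FunctionTransformer(copy))
X = np.arange(12).reshape(4, 3)
Xt = union.fit_transform(X)
print(Xt.shape)                              # (4, 6): every feature twice
print(np.array_equal(Xt[:, :3], Xt[:, 3:]))  # True
```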

Context:

  • train dataset of ~30,000 samples × 15 features
  • TPOT v0.9.2


rhiever commented Feb 19, 2018

That's interesting. How long did you run TPOT (population & generations) when it gave you this solution?

@BenjaminHabert

I have since deleted that example, but here is another one:

# Score on the training set was: 0.522222222222
from copy import copy

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler, StandardScaler
from sklearn.svm import LinearSVC
from tpot.builtins import StackingEstimator

exported_pipeline = make_pipeline(
    make_union(
        FunctionTransformer(copy),
        FunctionTransformer(copy)
    ),
    StandardScaler(),
    MaxAbsScaler(),
    StackingEstimator(estimator=LinearSVC(C=10.0, dual=False, loss="squared_hinge", penalty="l2", tol=0.01)),
    RandomForestClassifier(bootstrap=True, criterion="gini", max_features=0.5,
                           min_samples_leaf=8, min_samples_split=18, n_estimators=100)
)

Here is the TPOT classifier I configured:

model = tpot.TPOTClassifier(
    cv=LeaveOneGroupOut(),
    scoring=experiment.build_scorer(),
    periodic_checkpoint_folder=files.create_abspath('models/multi_pca_usine_lcdv'),
    max_time_mins=11 * 60,
    max_eval_time_mins=10,
    n_jobs=10,
    verbosity=2
)

So the population is 100. I'm not sure about the number of generations at this point; I guess at least 5, since there are 5 exported pipelines in the output folder before this one.

The optimizer ran for ~6 hours before reaching this intermediate result (better pipelines obtained later in the same run did not contain such artifacts).


rhiever commented Feb 20, 2018

Ah, 5 generations isn't very much time for TPOT to really refine the pipelines; at that point the GA has only gone through 5 rounds of selection. It's good to hear that pipelines from later in the run didn't retain this artifact.

The reason why TPOT doesn't immediately get rid of pipelines like this is because this artifact is potentially useful for building more complex pipelines later in the optimization process. Either of those FunctionTransformers can be replaced with another pipeline operation in subsequent generations, and that could potentially be useful for improving prediction performance. As such, our pipeline regularization process doesn't penalize pipelines that make two copies of the features like this because it technically doesn't "hurt" the pipeline.
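As an illustration of that replacement, swapping one of the copy branches for a real transformer (PCA here, chosen purely for the example) makes the union append new features alongside the untouched originals:

```python
from copy import copy

import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_union
from sklearn.preprocessing import FunctionTransformer

rng = np.random.RandomState(0)
X = rng.rand(10, 4)

# One copy branch swapped for a real transformer: the union now appends
# two PCA components next to the untouched original features.
union = make_union(FunctionTransformer(copy), PCA(n_components=2))
Xt = union.fit_transform(X)
print(Xt.shape)  # (10, 6): 4 original features + 2 PCA components
```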

We've discussed other approaches to pipeline regularization (#207) that would probably weed out pipelines like the ones you showed above, but we haven't gotten around to implementing those ideas yet.

@BenjaminHabert

Interesting, thank you for the explanation. Overall I found TPOT to be very useful, well done!
