
Question FunctionTransformer(copy) #581

Closed
saddy001 opened this issue Oct 4, 2017 · 8 comments

saddy001 commented Oct 4, 2017

In my best estimator I see a FunctionTransformer(copy). Is it useful? It just seems to copy the input to the output.


rhiever commented Oct 4, 2017

The FunctionTransformer(copy) object allows a basic form of stacking when a classifier appears in the middle of a pipeline. FunctionTransformer(copy) makes a copy of the entire dataset, and that copy is merged with the predictions of a classifier on the same dataset.
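That mechanism can be sketched with plain sklearn pieces. The `PredictionFeatures` class below is a hypothetical, minimal stand-in for TPOT's actual `StackingEstimator`, used here only to show how a union merges the untouched features with a classifier's predictions:

```python
from copy import copy

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_union
from sklearn.preprocessing import FunctionTransformer


# Hypothetical minimal stand-in for TPOT's StackingEstimator: it exposes a
# classifier's predictions as a feature column so a union can merge them.
class PredictionFeatures(BaseEstimator, TransformerMixin):
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator.fit(X, y)
        return self

    def transform(self, X):
        # One extra column: the inner classifier's predictions.
        return self.estimator.predict(X).reshape(-1, 1)


X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# FunctionTransformer(copy) passes the original features through untouched;
# the union appends the prediction column next to them.
union = make_union(
    FunctionTransformer(copy),
    PredictionFeatures(LogisticRegression()),
)
Xt = union.fit_transform(X, y)
print(Xt.shape)  # (100, 6): 5 original features + 1 prediction column
```

A downstream estimator placed after this union then sees both the raw features and the first classifier's predictions.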


saddy001 commented Oct 4, 2017

Nice, a feedback classifier. Is it mentioned anywhere in the sklearn docs? I couldn't find anything about it.


rhiever commented Oct 5, 2017

I don't think this is mentioned in the sklearn docs. We implemented this feature ourselves within the existing sklearn pipeline framework.

@saddy001 saddy001 closed this as completed Oct 8, 2017
@BenjaminHabert

This can lead to weird pipelines, though. Here is what I got:

# Score on the training set was: 0.333968253968
from copy import copy

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer

exported_pipeline = make_pipeline(
    make_union(
        FunctionTransformer(copy),
        FunctionTransformer(copy)
    ),
    # bootstrap was exported as the string "false" (which is truthy);
    # as a boolean it should be False
    RandomForestClassifier(bootstrap=False, criterion="gini", max_features=0.15,
                           min_samples_leaf=10, min_samples_split=4, n_estimators=100)
)

In this case I doubt that the FunctionTransformer(copy) is useful. I guess adding copies is roughly equivalent to tweaking the max_features parameter of the random forest.
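For illustration, a union of two copy branches just concatenates two identical copies of the input, doubling the column count, which is why it mostly resembles a change to the forest's feature sampling (a quick sketch):

```python
from copy import copy

import numpy as np
from sklearn.pipeline import make_union
from sklearn.preprocessing import FunctionTransformer

# Two identity branches: the union concatenates two identical copies of X.
union = make_union(FunctionTransformer(copy), FunctionTransformer(copy))
X = np.arange(12).reshape(4, 3)
Xt = union.fit_transform(X)
print(Xt.shape)                              # (4, 6): every feature twice
print(np.array_equal(Xt[:, :3], Xt[:, 3:]))  # True
```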

Context:

  • train dataset of ~30,000 samples × 15 features
  • TPOT v0.9.2


rhiever commented Feb 19, 2018

That's interesting. How long did you run TPOT (population & generations) when it gave you this solution?

@BenjaminHabert

I have since deleted that example, but here is another one:

# Score on the training set was: 0.522222222222
from copy import copy

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler, StandardScaler
from sklearn.svm import LinearSVC
from tpot.builtins import StackingEstimator

exported_pipeline = make_pipeline(
    make_union(
        FunctionTransformer(copy),
        FunctionTransformer(copy)
    ),
    StandardScaler(),
    MaxAbsScaler(),
    StackingEstimator(estimator=LinearSVC(C=10.0, dual=False, loss="squared_hinge", penalty="l2", tol=0.01)),
    RandomForestClassifier(bootstrap=True, criterion="gini", max_features=0.5,
                           min_samples_leaf=8, min_samples_split=18, n_estimators=100)
)

Here is the TPOT classifier I configured:

model = tpot.TPOTClassifier(
    cv=LeaveOneGroupOut(),
    scoring=experiment.build_scorer(),
    periodic_checkpoint_folder=files.create_abspath('models/multi_pca_usine_lcdv'),
    max_time_mins=11 * 60,
    max_eval_time_mins=10,
    n_jobs=10,
    verbosity=2
)

So the population is 100. I'm not sure about the number of generations at this point; I guess at least 5, since there are 5 exported pipelines in the output folder before this one.

The optimizer ran for ~6 hours before reaching this intermediate result (better pipelines obtained later in the same run did not contain such artifacts).


rhiever commented Feb 20, 2018

Ah, 5 generations isn't very much time for TPOT to really refine the pipelines; at that point the GA has only gone through 5 rounds of selection. It's good to hear that pipelines from later in the run didn't retain this artifact.

The reason why TPOT doesn't immediately get rid of pipelines like this is because this artifact is potentially useful for building more complex pipelines later in the optimization process. Either of those FunctionTransformers can be replaced with another pipeline operation in subsequent generations, and that could potentially be useful for improving prediction performance. As such, our pipeline regularization process doesn't penalize pipelines that make two copies of the features like this because it technically doesn't "hurt" the pipeline.
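As an illustration of that replacement, swapping one of the copy branches for a real transformer (PCA here, chosen purely for the example) makes the union append new features alongside the untouched originals:

```python
from copy import copy

import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_union
from sklearn.preprocessing import FunctionTransformer

rng = np.random.RandomState(0)
X = rng.rand(10, 4)

# One copy branch swapped for a real transformer: the union now appends
# two PCA components next to the untouched original features.
union = make_union(FunctionTransformer(copy), PCA(n_components=2))
Xt = union.fit_transform(X)
print(Xt.shape)  # (10, 6): 4 original features + 2 PCA components
```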

We've discussed other approaches to pipeline regularization (#207) that would probably weed out pipelines like the ones you showed above, but we haven't gotten around to implementing those ideas yet.

@BenjaminHabert

Interesting, thank you for the explanation. Overall I found TPOT to be very useful, well done!
