Reproducibility of the export pipeline #1270

Open
Iris7788 opened this issue Sep 18, 2022 · 2 comments

Comments

@Iris7788

Context of the issue

I used TPOT to fit my dataset, and I got a different exported pipeline on each run.

Process to reproduce the issue

These are the steps I used to generate the exported pipeline. The shape of my dataset was (45, 478).

from sklearn.model_selection import train_test_split
from tpot import TPOTRegressor

# X, y: the dataset described above (loading code not shown)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, test_size=0.15)
M1 = TPOTRegressor(generations=10, population_size=40, verbosity=2,
                   random_state=42, n_jobs=-1, cv=5)
M1.fit(X_train, y_train)
M1.export('M1_pipeline.py')

Current result

  1. On the first run, the exported pipeline was a DecisionTreeRegressor:
Generation 1 - Current best internal CV score: -0.6631261058133652
Generation 2 - Current best internal CV score: -0.6631261058133652
Generation 3 - Current best internal CV score: -0.6442071896861652
Generation 4 - Current best internal CV score: -0.5726875496699182
Generation 5 - Current best internal CV score: -0.5726875496699182
Generation 6 - Current best internal CV score: -0.528473933017039
Generation 7 - Current best internal CV score: -0.528473933017039
Generation 8 - Current best internal CV score: -0.528473933017039
Generation 9 - Current best internal CV score: -0.528473933017039
Generation 10 - Current best internal CV score: -0.528473933017039

Best pipeline: DecisionTreeRegressor(Normalizer(input_matrix, norm=max), max_depth=3, min_samples_leaf=10, min_samples_split=9)
  2. On the second run, the exported pipeline was an ExtraTreesRegressor:
Generation 1 - Current best internal CV score: -0.6631261058133652
Generation 2 - Current best internal CV score: -0.6631261058133652
Generation 3 - Current best internal CV score: -0.6593793694494272
Generation 4 - Current best internal CV score: -0.6524528603774085
Generation 5 - Current best internal CV score: -0.636417747633282
Generation 6 - Current best internal CV score: -0.633586381252542
Generation 7 - Current best internal CV score: -0.633586381252542
Generation 8 - Current best internal CV score: -0.633586381252542
Generation 9 - Current best internal CV score: -0.633586381252542
Generation 10 - Current best internal CV score: -0.633586381252542

Best pipeline: ExtraTreesRegressor(LinearSVR(input_matrix, C=1.0, dual=True, epsilon=0.01, loss=epsilon_insensitive, tol=1e-05), bootstrap=False, max_features=0.3, min_samples_leaf=6, min_samples_split=13, n_estimators=100)

Expected result

I would like the exported pipeline to be repeatable and stable across runs. My environment is Python 3.7.12 with TPOT 0.11.7.

Thank you very much for the development and maintenance of TPOT.

@perib
Contributor

perib commented Sep 29, 2022

If you set n_jobs to 1, reproducibility is more likely. With parallel processes, exact reproducibility is hard to achieve because the order of execution has some randomness that cannot be controlled. It is something we are thinking about.
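A minimal sketch of how the suggestion could be tried: it reuses the settings from the original report but with n_jobs=1, and adds a small check that two exported files are identical. The names TPOT_SETTINGS, export_once, and same_export are made up for this illustration and are not part of TPOT. It also assumes the exported file contains nothing run-specific, which is worth confirming on your version.

```python
import filecmp
from pathlib import Path

# Settings from the report above, with n_jobs changed from -1 to 1.
# (Hypothetical name; this dict is not part of TPOT.)
TPOT_SETTINGS = dict(
    generations=10,
    population_size=40,
    verbosity=2,
    random_state=42,  # fixed seed, as in the original report
    n_jobs=1,         # single process, so evaluation order does not vary
    cv=5,
)


def export_once(X_train, y_train, out_path):
    """Fit TPOT with the fixed settings and export the best pipeline."""
    # Imported here so that same_export() can be used without TPOT installed.
    from tpot import TPOTRegressor

    model = TPOTRegressor(**TPOT_SETTINGS)
    model.fit(X_train, y_train)
    model.export(str(out_path))
    return Path(out_path)


def same_export(path_a, path_b):
    """Return True if two exported pipeline files are byte-for-byte identical."""
    return filecmp.cmp(str(path_a), str(path_b), shallow=False)


# Example usage (requires TPOT and the X_train / y_train from the report):
#   a = export_once(X_train, y_train, "run_a.py")
#   b = export_once(X_train, y_train, "run_b.py")
#   print(same_export(a, b))
```

If the two exports still differ with n_jobs=1, that would suggest another source of randomness beyond parallel execution order.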

@Iris7788
Author

Iris7788 commented Sep 29, 2022 via email
