Add capability to provide custom seeds to GP #502
tpot/base.py
Outdated
seed_individuals = [creator.Individual.from_string(x, self._pset) for x in seeds]
self._pop = []

# Add the same set of seeds to the population until we have population_size seeds
After a short discussion with @rhiever earlier today, we think `seeds` (which may need a better name) should specify only a small part of the initial population, rather than an initial population filled with duplicated copies of the seed pipelines. If `seeds` is a small list, the pipeline diversity of GP will be limited at the start. The pipelines in the initial population that do not come from `seeds` should be randomly generated.
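The proposal above could be sketched roughly as follows. This is only an illustration: `build_initial_population` and `make_random_individual` are hypothetical names, not part of TPOT or DEAP, and individuals are plain strings here rather than DEAP objects.

```python
import random


def build_initial_population(seeds, population_size, make_random_individual):
    """Return the seeds followed by random individuals, up to population_size.

    `make_random_individual` is a hypothetical zero-argument callable standing
    in for whatever random pipeline generator the GP toolbox provides.
    """
    if len(seeds) > population_size:
        raise ValueError("more seeds than population_size")
    population = list(seeds)  # each seed appears once, with no duplication
    while len(population) < population_size:
        population.append(make_random_individual())
    return population


# Toy usage: individuals are plain strings, not DEAP individuals.
rng = random.Random(42)
operators = ["GaussianNB", "BernoulliNB", "LogisticRegression"]
seeds = ["BernoulliNB(input_matrix)"]
population = build_initial_population(
    seeds, 5, lambda: rng.choice(operators) + "(input_matrix)"
)
```

The seeds occupy only the first slots, so a short seed list no longer limits the diversity of the rest of the starting population.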
`population_seeds` should probably be the parameter name.
<p>TPOT allows for the initial population of pipelines to be seeded. This can be done either through the <code>population_seeds</code> parameter in the TPOT constructor, or through a <code>population_seeds</code> attribute in a custom config file.</p>
<pre><code class="Python">population_seeds = [
'BernoulliNB(GaussianNB(input_matrix), BernoulliNB__alpha=0.1, BernoulliNB__fit_prior=False)',
'BernoulliNB(input_matrix, BernoulliNB__alpha=0.01, BernoulliNB__fit_prior=True)'
How easy would it be to take actual sklearn pipelines as input instead of the string representations? I sense that users would find it easier to seed with sklearn pipelines rather than this (slightly) weird string representation we use.
Also, we should use better example clfs in the docs: Maybe a RandomForestClassifier and a LogisticRegression?
> How easy would it be to take actual sklearn pipelines as input instead of the string representations?
We don't have any code to go from sklearn pipelines to DEAP pipelines, only the other direction. It would be non-trivial to write, but doable given Python's complete reflection of objects.
> we should use better example clfs in the docs: Maybe a RandomForestClassifier and a LogisticRegression?
Sure.
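As a rough illustration of the reflection idea, the sketch below renders one estimator-like object in the string style shown in the docs excerpt. It is not TPOT code: `estimator_to_seed_string` is a hypothetical helper, and the `BernoulliNB` class is a small stand-in that only mimics sklearn's `get_params()` so the example runs without sklearn installed.

```python
class BernoulliNB:
    """Stand-in with an sklearn-like get_params(); not the real estimator."""

    def __init__(self, alpha=1.0, fit_prior=True):
        self.alpha = alpha
        self.fit_prior = fit_prior

    def get_params(self, deep=True):
        return {"alpha": self.alpha, "fit_prior": self.fit_prior}


def estimator_to_seed_string(estimator, inner="input_matrix"):
    """Render one estimator as 'Name(inner, Name__param=value, ...)'.

    Hypothetical helper: parameters are discovered by reflection through
    get_params() and emitted in alphabetical order.
    """
    name = type(estimator).__name__
    params = estimator.get_params(deep=False)
    args = [inner] + [f"{name}__{key}={params[key]}" for key in sorted(params)]
    return f"{name}({', '.join(args)})"


seed = estimator_to_seed_string(BernoulliNB(alpha=0.01, fit_prior=True))
print(seed)
```

A multi-step pipeline could be handled by passing the string for the earlier step as `inner`. A real implementation would also need to check each operator and parameter against the pset.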
It would be nice to have a function to go from sklearn pipelines to DEAP pipelines. :-) That will be one step closer to having DEAP work directly on sklearn pipelines themselves.
Alright. I'll start working on that.
How do you propose doing that? We need to make sure that it only takes in the operators, parameters, and parameter values that are defined in the config.
Pass in the pset that's being used and check that the pipeline is valid (within the context of the pset) as you walk through it. I'd throw a `TypeError` if a parameter is missing, or if a parameter is used that's not specified. If an operator is used that doesn't exist in the pset, I'd throw a `NameError`.
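A minimal sketch of that validation, assuming a TPOT-style config dictionary in place of the actual pset. `validate_step` is a hypothetical helper, and rejecting out-of-config values with `ValueError` is an assumption of this sketch; the comment above only names `TypeError` and `NameError`.

```python
def validate_step(operator, params, config):
    """Check one pipeline step against a TPOT-style config dictionary."""
    if operator not in config:
        raise NameError(f"operator {operator!r} is not defined in the config")
    allowed = config[operator]
    missing = set(allowed) - set(params)
    if missing:
        raise TypeError(f"{operator} is missing parameters: {sorted(missing)}")
    unknown = set(params) - set(allowed)
    if unknown:
        raise TypeError(f"{operator} got unspecified parameters: {sorted(unknown)}")
    for key, value in params.items():
        # Assumption: values outside the configured choices raise ValueError.
        if value not in allowed[key]:
            raise ValueError(f"{operator}__{key}={value!r} is not an allowed value")


config = {
    "sklearn.naive_bayes.BernoulliNB": {
        "alpha": [1e-3, 1e-2, 1e-1, 1.0],
        "fit_prior": [True, False],
    }
}

# A valid step passes silently.
validate_step(
    "sklearn.naive_bayes.BernoulliNB", {"alpha": 0.1, "fit_prior": False}, config
)
```

Walking a full pipeline would call this once per step, so the first invalid operator or parameter stops the conversion early.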
Sounds good.
@teaearlgraycold I think the function to go from sklearn pipelines to DEAP pipelines has not been added in this PR yet. I will start merging some PRs into the dev branch today to avoid a pile of conflicts between PRs, so please add this function later in another PR.
@@ -1068,7 +1081,7 @@ def _operator_count(self, individual):
 return operator_count

 def _update_pbar(self, val, resulting_score_list):
-"""Update self._pbar during pipeline evaluration
+"""Update self._pbar during pipeline evaluration.
Not your typo, but "evaluration" here should be "evaluation."
tpot_obj = TPOTRegressor(config_dict='tests/test_config.py')
n_seeds = len(tpot_obj._read_config_file('tests/test_config.py').population_seeds)

assert len(tpot_obj._pop) == n_seeds
I believe the unit tests need to be updated.
There seems to be just one that broke, and only in 2.7. I'll install conda for 2.7 so I can see what went wrong.
What does this PR do?
Adds a `seeds` parameter to `TPOTBase`.
Lets users specify `seeds` in their custom config.
Where should the reviewer start?
Customizing TPOT's starting population in the "Using" section of the docs.
In this function in `TPOTBase`:
How should this PR be tested?
There are tests added to the `tpot_tests.py` test file.
What are the relevant issues?
#59
#296