Add capability to provide custom seeds to GP #502

Merged
merged 4 commits into from
Jul 17, 2017

Conversation

danthedaniel
Contributor

What does this PR do?

  • Adds a seeds parameter to TPOTBase
  • Allows users to export a list named seeds in their custom config

Where should the reviewer start?

Customizing TPOT's starting population in the "Using" section of the docs.

In this function in TPOTBase:

def _setup_pop(self, seeds, config_path):

How should this PR be tested?

There are tests added to the tpot_tests.py test file.

What are the relevant issues?

#59
#296

@coveralls

Coverage Status

Coverage increased (+0.6%) to 87.636% when pulling 4ebda55 on teaearlgraycold:custom_seeds into 7f2b6a7 on rhiever:development.

tpot/base.py Outdated
seed_individuals = [creator.Individual.from_string(x, self._pset) for x in seeds]
self._pop = []

# Add the same set of seeds to the population until we have population_size seeds
Contributor

@weixuanfu weixuanfu Jun 22, 2017

After a brief discussion with @rhiever earlier today, we think seeds (the parameter may need a better name) should specify only a small part of the initial population, rather than filling the whole population with duplicates of the seed pipelines. If seeds is a short list, duplicating it would limit the pipeline diversity of GP at the start. The remaining pipelines in the initial population should be randomly generated.

Contributor

population_seeds should probably be the parameter name.
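
For illustration, the idea discussed above (seeds occupy only part of the initial population, and the remainder is randomly generated) could look roughly like the sketch below. This is not the code in this PR: `build_initial_population` and `random_pipeline` are hypothetical names, and the random generator is a stand-in for however TPOT actually creates random individuals.

```python
import random


def build_initial_population(seeds, population_size, make_random_individual):
    """Start the population with the seeds, then pad with random individuals.

    Hypothetical helper: seeds appear once each instead of being duplicated
    until the population is full.
    """
    if len(seeds) > population_size:
        raise ValueError("more seeds than population_size")
    population = list(seeds)
    while len(population) < population_size:
        population.append(make_random_individual())
    return population


# Stand-in for a real random pipeline generator (assumption, not TPOT code).
rng = random.Random(0)


def random_pipeline():
    return "RandomPipeline_{}".format(rng.randrange(10 ** 6))


seeds = [
    'BernoulliNB(input_matrix, BernoulliNB__alpha=0.01, BernoulliNB__fit_prior=True)',
]
pop = build_initial_population(seeds, 5, random_pipeline)
```

With this shape, a short seed list still leaves most of the population free to vary, which is the diversity concern raised in the review comment.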

@coveralls

Coverage Status

Coverage increased (+0.5%) to 87.617% when pulling 424fa81 on teaearlgraycold:custom_seeds into 7f2b6a7 on rhiever:development.

<p>TPOT allows for the initial population of pipelines to be seeded. This can be done either through the <code>population_seeds</code> parameter in the TPOT constructor, or through a <code>population_seeds</code> attribute in a custom config file.</p>
<pre><code class="Python">population_seeds = [
'BernoulliNB(GaussianNB(input_matrix), BernoulliNB__alpha=0.1, BernoulliNB__fit_prior=False)',
'BernoulliNB(input_matrix, BernoulliNB__alpha=0.01, BernoulliNB__fit_prior=True)'
Contributor

How easy would it be to take actual sklearn pipelines as input instead of the string representations? I sense that users would find it easier to seed with sklearn pipelines rather than this (slightly) weird string representation we use.

Contributor

Also, we should use better example clfs in the docs: Maybe a RandomForestClassifier and a LogisticRegression?

Contributor Author

@danthedaniel danthedaniel Jun 24, 2017

How easy would it be to take actual sklearn pipelines as input instead of the string representations?

We don't have any code to go from sklearn pipelines to deap pipelines, only the other direction. It would be non-trivial to write, but doable given Python's complete reflection of objects.

we should use better example clfs in the docs: Maybe a RandomForestClassifier and a LogisticRegression?

Sure.

Contributor

It would be nice to have a function to go from sklearn pipelines to DEAP pipelines. :-) That will be one step closer to having DEAP work directly on sklearn pipelines themselves.

Contributor Author

Alright. I'll start working on that.

Contributor

@rhiever rhiever Jun 27, 2017

How do you propose doing that? We need to make sure that it only takes in the operators, parameters, and parameter values that are defined in the config.

Contributor Author

@danthedaniel danthedaniel Jun 27, 2017

Pass in the pset that's being used and check that the pipeline is valid (within the context of the pset) as you walk through it.

I'd throw a TypeError if a parameter is missing, or a parameter is used that's not specified. If an operator is used that doesn't exist in the pset I'd throw a NameError.
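
A minimal sketch of the validation described above, under stated assumptions: the real version would walk an sklearn pipeline against the DEAP pset, whereas this one uses a plain dict as a stand-in for the operators and parameters defined in the config. `seed_to_string` and `ALLOWED` are hypothetical names. Raising ValueError for a parameter value outside the config is an extra assumption, not something proposed in this thread. The output follows the string format shown in the docs example.

```python
# Stand-in for the operators and parameter values defined in the config
# (assumption: the real check would consult the pset instead).
ALLOWED = {
    "BernoulliNB": {"alpha": [0.01, 0.1, 1.0], "fit_prior": [True, False]},
}


def seed_to_string(operator, params, allowed=ALLOWED):
    """Validate one operator and return its TPOT-style string form."""
    if operator not in allowed:
        raise NameError("operator {!r} is not defined in the config".format(operator))
    expected = allowed[operator]
    missing = sorted(set(expected) - set(params))
    if missing:
        raise TypeError("missing parameters: {}".format(", ".join(missing)))
    extra = sorted(set(params) - set(expected))
    if extra:
        raise TypeError("unspecified parameters: {}".format(", ".join(extra)))
    for name, value in params.items():
        if value not in expected[name]:
            # Assumption: values outside the config are rejected as well.
            raise ValueError("{}={!r} is not in the config".format(name, value))
    parts = ["input_matrix"]
    parts += ["{}__{}={!r}".format(operator, n, params[n]) for n in sorted(params)]
    return "{}({})".format(operator, ", ".join(parts))


seed = seed_to_string("BernoulliNB", {"alpha": 0.1, "fit_prior": False})
```

Failing early with distinct exception types keeps a bad seed from reaching the GP run and tells the user whether the operator or one of its parameters is the problem.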

Contributor

Sounds good.

Contributor

@teaearlgraycold I think the function to go from sklearn pipelines to DEAP pipelines has not been added in this PR yet. I will start merging some PRs into the dev branch today to avoid a pile of conflicts between PRs, so please add this function later in a separate PR.

@@ -1068,7 +1081,7 @@ def _operator_count(self, individual):
 return operator_count

 def _update_pbar(self, val, resulting_score_list):
-"""Update self._pbar during pipeline evaluration
+"""Update self._pbar during pipeline evaluration.
Contributor

Not your typo, but there's a typo here in "evaluation."

tpot_obj = TPOTRegressor(config_dict='tests/test_config.py')
n_seeds = len(tpot_obj._read_config_file('tests/test_config.py').population_seeds)

assert len(tpot_obj._pop) == n_seeds
Contributor

I believe the unit tests need to be updated.

Contributor Author

There seems to be just one that broke, and only in 2.7. I'll install conda for 2.7 so I can see what went wrong.

4 participants