Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Generation Rewrite (WIP) #608

Open
wants to merge 27 commits into
base: master
Choose a base branch
from
Open

Conversation

thcrock
Copy link
Contributor

@thcrock thcrock commented Feb 21, 2019

Note! This is a lot to read! Unfortunately I didn't have a good way of splitting this up into multiple PRs. My suggestion for reading this to start:

  1. Read this entire PR message for the summary of the code changes.
  2. Check out the new doc (extending-features.md) for a summary of the user changes.

Introduces the 'FeatureBlock' as the main interface for Triage to create features, instead of directly interfacing with collate.

  • Add FeatureBlock abstract base class. The abstract methods are aimed
    at providing an easy, flexible interface for feature generators to
    implement. The concrete methods are aimed at building on those abstract
    methods to provide a slightly easier, higher-level interface for
    Experiments, et al to use.

  • Splits up the bloated architect.FeatureGenerator into a couple
    different places: the code that generates is now in SpacetimeAggregations in feature_block _generators, and the code that runs the generated queries now lives in concrete methods of the FeatureBlock base class.

  • ExperimentBase now passes on config to generate_feature_blocks and
    owns a collection of resulting FeatureBlocks, instead of a
    FeatureGenerator object.

  • Instead of a FeatureDictionaryCreator object, the FeatureDictionary
    class is introduced that knows how to build itself from a collection of
    FeatureBlocks. The Experiment holds references to these as before.

  • Continuing the trend of molding collate to more closely fit how Triage
    is using it and removing unneeded flexibility, the unused Aggregation base class is folded
    into the SpacetimeAggregation so the latter can directly inherit from
    FeatureBlock. This way we get to remove/condense many tests, and sunset
    the collate integration tests because the methods being tested no longer
    exist and are tested in either SpacetimeAggregation or FeatureBlock. In
    addition, many of the arguments within SpacetimeAggregation are changed
    to their FeatureBlock equivalents to more tightly fit as a FeatureBlock
    subclass. Some of the more generic helper methods within
    SpacetimeAggregation are moved to FeatureBlock as concrete methods

  • With feature data now residing at a new config key, the experiment's
    config version is bumped to v7.

In general I think these changes will make it easier to build feature introspection functionality. It's easier for code to, based on configuration, do stuff like grab feature lists without having to introspect the database structure.

Introduces the 'FeatureBlock' as the main interface for Triage to create features, instead of directly interfacing with collate.

- Add FeatureBlock abstract base class. The abstract methods are aimed
at providing an easy, flexible interface for feature generators to
implement. The concrete methods are aimed at building on those abstract
methods to provide a slightly easier, higher-level interface for
Experiments, et al to use.

- Splits up the bloated architect.FeatureGenerator into a couple
different places: the code that generates is now in SpacetimeAggregations in feature_block _generators, and the code that runs the generated queries now lives in concrete methods of the FeatureBlock base class.

- ExperimentBase now passes on config to generate_feature_blocks and
owns a collection of resulting FeatureBlocks, instead of a
FeatureGenerator object.

- Instead of a FeatureDictionaryCreator object, the FeatureDictionary
class is introduced that knows how to build itself from a collection of
FeatureBlocks. The Experiment holds references to these as before.

- Continuing the trend of molding collate to more closely fit how Triage
is using it and removing unneeded flexibility, the unused Aggregation base class is folded
into the SpacetimeAggregation so the latter can directly inherit from
FeatureBlock. This way we get to remove/condense many tests, and sunset
the collate integration tests because the methods being tested no longer
exist and are tested in either SpacetimeAggregation or FeatureBlock. In
addition, many of the arguments within SpacetimeAggregation are changed
to their FeatureBlock equivalents to more tightly fit as a FeatureBlock
subclass. Some of the more generic helper methods within
SpacetimeAggregation are moved to FeatureBlock as concrete methods

- With feature data now residing at a new config key, the experiment's
config version is bumped to v7.
@thcrock thcrock requested a review from nanounanue February 21, 2019 19:54
@codecov-io
Copy link

codecov-io commented Feb 21, 2019

Codecov Report

Merging #608 into master will decrease coverage by 1.04%.
The diff coverage is 89.94%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #608      +/-   ##
==========================================
- Coverage   82.18%   81.14%   -1.05%     
==========================================
  Files          93       94       +1     
  Lines        6271     6136     -135     
==========================================
- Hits         5154     4979     -175     
- Misses       1117     1157      +40
Impacted Files Coverage Δ
src/triage/component/architect/builders.py 85.51% <ø> (ø) ⬆️
src/triage/experiments/__init__.py 100% <100%> (ø) ⬆️
src/triage/database_reflection.py 97.14% <100%> (+0.17%) ⬆️
src/triage/experiments/rq.py 86.04% <100%> (-1.19%) ⬇️
src/triage/experiments/base.py 95.37% <100%> (-0.02%) ⬇️
src/triage/experiments/singlethreaded.py 100% <100%> (ø) ⬆️
src/triage/experiments/multicore.py 74.44% <100%> (-1.09%) ⬇️
src/triage/component/architect/features.py 100% <100%> (ø) ⬆️
src/triage/component/collate/imputations.py 100% <100%> (ø) ⬆️
src/triage/component/architect/utils.py 41.25% <100%> (-0.31%) ⬇️
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 11ee044...fd1887c. Read the comment docs.

Copy link
Member

@jesteria jesteria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really sure how helpful I can be -- (nudge @nanounanue) -- and it wasn't obvious what code was being moved vs. newly added; so, I commented on a few things, but I didn't really see any problems.

src/triage/component/architect/feature_block.py Outdated Show resolved Hide resolved
src/triage/component/architect/feature_block.py Outdated Show resolved Hide resolved
src/triage/component/architect/feature_block.py Outdated Show resolved Hide resolved
src/triage/component/architect/feature_block.py Outdated Show resolved Hide resolved
src/triage/component/architect/feature_dictionary.py Outdated Show resolved Hide resolved
src/triage/component/architect/feature_dictionary.py Outdated Show resolved Hide resolved
src/triage/component/architect/feature_query_runners.py Outdated Show resolved Hide resolved
src/triage/util/db.py Show resolved Hide resolved
@thcrock
Copy link
Contributor Author

thcrock commented Feb 27, 2019

The FeatureQueryRunner that was present at first here is actually unused and I'll remove it. It was the first version of what eventually became the concrete methods of FeatureBlock.

I felt that it was too many layers, and that it would be advantageous to allow these new FeatureBlocks to just build themselves. I'm not totally sold on my changes: the pattern of having the Experiment own a bunch of 'verb' components, each totally responsible for doing what its concerned with, is now broken up. Instead control over generating features is now split between a list of things. I'm still unsure if this is a better or worse pattern for the Experiment.

@thcrock thcrock marked this pull request as ready for review February 28, 2019 23:19
src/triage/component/architect/feature_dictionary.py Outdated Show resolved Hide resolved
src/triage/component/architect/feature_dictionary.py Outdated Show resolved Hide resolved
src/triage/component/architect/feature_dictionary.py Outdated Show resolved Hide resolved
raise ValueError("value of FeatureDictionary objects represents a list of "
"feature names, and therefore must be iterable")
for feature_name in feature_names:
if not isinstance(feature_name, str):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you like, this check might make (more) sense in FeatureNameList

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I'm testing this out, and it's a bit of work to actually do right. The same concerns about inheriting from dict apply to inheriting from list, at least once you do what we're doing now and start to change the mutation methods to add something like this in.

https://stackoverflow.com/questions/3487434/overriding-append-method-after-inheriting-from-a-python-list

The top-rated answer makes a good suggestion; while inheriting from list, I've so far overridden insert, __setitem__, and append and still haven't found the one that hits how we actually use FeatureNameList. I assume it's one of extend, __add__, or __iadd__, but that's quite a bit to do to properly inherit from list. So implementing the suggestion in there for a typed list looks promising.

Of course, there's also the comment further below that raises the point that inheriting from MutableSequence also means that we have to figure out serialization on our own. This might not be a bad thing to do anyway: as it stands, this class doesn't serialize right and the calling code is forced to listify it. It would be nice to just implement this right in FeatureNameList.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It ended up being that both __add__ and extend are used by triage

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Again, doesn't abstractly have to be one way or the other, (nor encapsulated in one place or another). But, good to consider 👍

src/triage/component/architect/feature_dictionary.py Outdated Show resolved Hide resolved
src/triage/component/collate/spacetime.py Outdated Show resolved Hide resolved
src/triage/component/collate/spacetime.py Outdated Show resolved Hide resolved
src/triage/experiments/validate.py Show resolved Hide resolved
src/triage/experiments/validate.py Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants