Feature Generation Rewrite (WIP) #608

Open
wants to merge 27 commits into
base: master
Changes from 7 commits
Commits
680c83c
Feature Generation Rewrite
thcrock Dec 14, 2018
6bc3a29
Remove unnecessary quotes in split_table
thcrock Feb 21, 2019
ca9207e
Test for verbose task info
thcrock Feb 21, 2019
cf96a6c
Remove unused FeatureQueryRunner
thcrock Feb 27, 2019
6bff2bc
Updates to some docs, remove no-longer-useful feature YAML
thcrock Feb 28, 2019
7e6fa3e
Changes from review
thcrock Feb 28, 2019
9dd5933
Merge remote-tracking branch 'origin/master' into notice_removed_feat…
thcrock Mar 1, 2019
3a3c71f
from review
thcrock Mar 5, 2019
5a14339
WIP
thcrock Mar 6, 2019
935e673
changes from review
thcrock Mar 6, 2019
4ce5c6f
Merge remote-tracking branch 'origin/master' into notice_removed_feat…
thcrock Mar 7, 2019
65cb36a
Merge remote-tracking branch 'origin/master' into notice_removed_feat…
thcrock Mar 7, 2019
929a2e9
Update feature mock
thcrock Mar 11, 2019
0463836
Merge remote-tracking branch 'origin/master' into notice_removed_feat…
thcrock Mar 19, 2019
8733fcb
WIP
thcrock Mar 21, 2019
3e40ae0
converted one more spacetime test for now
thcrock Apr 1, 2019
cf74822
Finish converting spacetime tests
thcrock Apr 1, 2019
0fd781b
More fixes
thcrock Apr 2, 2019
daf088d
Fix behavior when no feature config is given
thcrock Apr 2, 2019
4a84431
Fix validation
thcrock Apr 2, 2019
b7c9426
Fix validate call
thcrock Apr 2, 2019
403d0f5
Use prefixes in spacetime tests
thcrock Apr 2, 2019
a628bcd
Postmodeling fixes, start to correct documentation
thcrock Apr 2, 2019
a98fae4
Update docs some more
thcrock Apr 2, 2019
acb8c45
Fix reference to experiment config file in featuretest CLI
thcrock Apr 2, 2019
a9df94b
Merge remote-tracking branch 'origin/master' into notice_removed_feat…
thcrock May 7, 2019
fd1887c
Reimplement impflag squashing changes
thcrock May 7, 2019
6 changes: 5 additions & 1 deletion docs/mkdocs.yml
@@ -19,11 +19,15 @@ pages:
- Defining an Experiment: experiments/defining.md
- Testing Feature Configuration: experiments/feature-testing.md
- Running an Experiment: experiments/running.md
- Upgrading an Experiment: experiments/upgrading.md
- Upgrading an Experiment:
to v5: experiments/upgrade-to-v5.md
to v6: experiments/upgrade-to-v6.md
to v7: experiments/upgrade-to-v7.md
- Temporal Validation Deep Dive: experiments/temporal-validation.md
- Cohort and Label Deep Dive: experiments/cohort-labels.md
- Feature Generation Recipe Book: experiments/features.md
- Experiment Algorithm: experiments/algorithm.md
- Experiment Architecture: experiments/architecture.md
- Extending Experiment Features: experiments/extending-features.md
- Audition: https://github.com/dssg/triage/tree/master/src/triage/component/audition
- Postmodeling: https://github.com/dssg/triage/tree/master/src/triage/component/postmodeling
143 changes: 143 additions & 0 deletions docs/sources/experiments/extending-features.md
@@ -0,0 +1,143 @@
# Extending Feature Generation

This document describes how to extend Triage's feature generation capabilities by writing new FeatureBlock classes and incorporating them into Experiments.

## What is a FeatureBlock?

A FeatureBlock represents a single feature table in the database and how to generate it. If you're familiar with `collate` parlance, a `SpacetimeAggregation` is similar in scope to a FeatureBlock. A `FeatureBlock` class can be instantiated with whatever arguments it needs, and from there can provide the queries that produce its output feature table. Full-size Triage experiments tend to contain multiple feature blocks, which are collected in the Experiment's `experiment.feature_blocks` property.

## What existing FeatureBlock classes can I use?

Class name | Experiment config key | Use
------------ | ------------- | ------------
triage.component.collate.SpacetimeAggregation | spacetime_aggregations | Temporal aggregations of event-based data

## Writing a new FeatureBlock class

The `FeatureBlock` base class defines a set of abstract methods that every child class must implement, along with a number of initialization arguments that it must accept and honor so that it behaves the way Triage users expect of a feature generator. Triage expects these classes to define the queries they need to run rather than to generate the tables themselves, so that Triage can scale by running those queries in parallel.

### Abstract methods

Any method here without parentheses afterwards is expected to be a property.

Method | Task | Return Type
------------ | ------------- | -------------
final_feature_table_name | The name of the final table with all features filled in (no missing values) | string
feature_columns | The list of feature columns in the final, post-imputation table. Should exclude any index columns (e.g. entity id, date) | list
preinsert_queries | Return all queries that should be run before inserting any data. The creation of your feature table should happen here, and the table is expected to have `entity_id` (integer) and `as_of_date` (timestamp) columns. | list
insert_queries | Return all inserts to populate this data. Each query in this list should be parallelizable, and should be valid after all `preinsert_queries` are run. | list
postinsert_queries | Return all queries that should be run after inserting all data | list
imputation_queries | Return all queries that should be run to fill in missing data with imputed values. | list

Any of the query list properties can be empty. For instance, if your implementation doesn't have inserts separate from table creation and is just one big query (e.g. a `CREATE TABLE AS`), you could define `preinsert_queries` to be that one mega-query and leave the other properties as empty lists.
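
The split into four query lists implies an execution order. The following is a minimal sketch of how a runner could consume them, not Triage's actual implementation: `run_feature_block` and the `execute` callable are hypothetical names, and a thread pool stands in for whatever parallelization Triage really uses.

```python
from concurrent.futures import ThreadPoolExecutor


def run_feature_block(block, execute, max_workers=4):
    """Run a FeatureBlock-like object's queries in the order the interface implies.

    `block` is any object exposing the four query-list properties described above.
    `execute` is a hypothetical callable that runs a single SQL string.
    """
    # Table creation and other setup must finish before any insert starts.
    for query in block.preinsert_queries:
        execute(query)

    # Inserts are independent of one another, so they can run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(execute, block.insert_queries))

    # Indexes and similar follow-up work run once all the data is in place.
    for query in block.postinsert_queries:
        execute(query)

    # Imputation runs last and fills in whatever is still missing.
    for query in block.imputation_queries:
        execute(query)
```

Only the insert step is parallelized here, because it is the only list whose queries the interface declares independent of each other.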

### Properties Provided by Base Class

There are several attributes/properties that can be used within subclass implementations that the base class provides. Triage experiments take care of providing this data during runtime: if you want to instantiate a FeatureBlock object on your own, you'll have to provide them in the constructor.

Name | Type | Purpose
------------ | ------------- | -------------
as_of_dates | list | Features are created "as of" specific dates. Each of these dates is expected to be populated with a row for each member of the cohort on that date.
cohort_table | string | The final shape of the feature table should at least include every entity id/date pair in this cohort table.
db_engine | sqlalchemy.engine | The engine to use to access the database. Although these instances are mostly returning queries, the engine may be useful for implementing imputation.
features_schema_name | string | The database schema where all feature tables should reside. Defaults to None, which ends up in the public schema.
feature_start_time | string/datetime | A time before which no data should be considered for features. This is generally only applicable if your FeatureBlock is doing temporal aggregations. Defaults to None, which means no data will be excluded.
features_ignore_cohort | bool | If False (the default), features are only computed for members of the cohort. If True, the cohort is ignored and the final feature table may include additional rows.


`FeatureBlock` child classes can, and in almost all cases will, accept additional configuration at initialization time that is specific to them. They will probably also define many more methods for internal use. As long as they adhere to this interface, they'll work with Triage.

### Making the new FeatureBlock available to experiments

Triage Experiments run on serializable configuration. Although it's possible to bypass this by assigning fully generated `FeatureBlock` instances directly (e.g. `experiment.feature_blocks = <my_collection_of_feature_blocks>`), it's not recommended. Instead, the last step is to pick a config key for use within the `features` key of experiment configs, add it to `triage.component.architect.feature_block_generators.FEATURE_BLOCK_GENERATOR_LOOKUP`, and point it to a function that instantiates your objects based on that config.

## Example

That's a lot of information! Let's see this in action. Let's say that we want to create a very flexible type of feature that simply runs a configured query with a parametrized as-of-date and returns its result as a feature.

```python
from triage.component.architect.feature_block import FeatureBlock


class SimpleQueryFeature(FeatureBlock):
    def __init__(self, query, *args, **kwargs):
        # `query` is a SQL template containing an `{as_of_date}` placeholder
        self.query = query
        super().__init__(*args, **kwargs)

    @property
    def final_feature_table_name(self):
        return f"{self.features_schema_name}.mytable"

    @property
    def feature_columns(self):
        return ['myfeature']

    @property
    def preinsert_queries(self):
        return [
            f"create table {self.final_feature_table_name} "
            "(entity_id bigint, as_of_date timestamp, myfeature float)"
        ]

    @property
    def insert_queries(self):
        if self.features_ignore_cohort:
            final_query = self.query
        else:
            # Wrap the configured query in a subquery and keep only cohort rows.
            # The `{as_of_date}` placeholder inside self.query survives the
            # f-string and is filled in by .format() below.
            final_query = f"""
                select * from ({self.query}) raw
                join {self.cohort_table} using (entity_id, as_of_date)
            """
        # One query per as-of-date, so each can be run independently
        return [
            final_query.format(as_of_date=date)
            for date in self.as_of_dates
        ]

    @property
    def postinsert_queries(self):
        return [f"create index on {self.final_feature_table_name} (entity_id, as_of_date)"]

    @property
    def imputation_queries(self):
        return [f"update {self.final_feature_table_name} set myfeature = 0.0 where myfeature is null"]
```

This class would allow many different uses: basically any query a user can come up with would be a feature. To instantiate this class outside of triage with a simple query, you could:

```python
feature_block = SimpleQueryFeature(
    query="select entity_id, as_of_date, quantity from source_table where date < '{as_of_date}'",
    as_of_dates=["2016-01-01"],
    cohort_table="my_cohort_table",
    db_engine=triage.create_engine(<..mydbinfo..>)
)

feature_block.run_preimputation()
feature_block.run_imputation()
```
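
To see what `insert_queries` yields for a query like the one above, the expansion of the `{as_of_date}` placeholder can be reproduced without a database. This is a small sketch that relies only on Python's `str.format`; the `expand_query` helper is a hypothetical name used for illustration.

```python
def expand_query(query_template, as_of_dates):
    """Return one concrete SQL string per as-of-date."""
    return [query_template.format(as_of_date=date) for date in as_of_dates]


template = (
    "select entity_id, as_of_date, quantity "
    "from source_table where date < '{as_of_date}'"
)
for sql in expand_query(template, ["2016-01-01", "2016-02-01"]):
    print(sql)
```

Because `str.format` treats every brace as a placeholder, a query containing literal braces (for example a Postgres array literal) would need them doubled as `{{` and `}}`.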

To use it from a Triage experiment, modify `triage.component.architect.feature_block_generators.py` and submit a pull request:

Before:

```python
FEATURE_BLOCK_GENERATOR_LOOKUP = {
    'spacetime_aggregations': generate_spacetime_aggregations
}
```

After:

```python
FEATURE_BLOCK_GENERATOR_LOOKUP = {
    'spacetime_aggregations': generate_spacetime_aggregations,
    'simple_query': SimpleQueryFeature,
}
```

At this point, you could use it in an experiment configuration like this:

```yaml

features:
    simple_query:
        - query: "select entity_id, as_of_date, quantity from source_table where date < '{as_of_date}'"
        - query: "select entity_id, as_of_date, other_quantity from other_source_table where date < '{as_of_date}'"
```
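
Since the lookup is described as pointing to a function that instantiates objects from config, a list of queries like the one above would probably be handled by a small generator function rather than by the class itself. The following is a sketch under that assumption: `generate_simple_queries` is a hypothetical name, the keyword-argument calling convention is a guess, and the stand-in class exists only so the snippet runs without Triage installed.

```python
class SimpleQueryFeature:
    """Stand-in for the class defined above, so this sketch runs without Triage."""

    def __init__(self, query, **block_kwargs):
        self.query = query
        self.block_kwargs = block_kwargs


def generate_simple_queries(simple_query_config, **block_kwargs):
    """Build one SimpleQueryFeature per entry under the `simple_query` key.

    `simple_query_config` is the parsed YAML list, e.g. [{"query": "..."}].
    `block_kwargs` stands for the shared arguments an experiment would pass
    along (as_of_dates, cohort_table, db_engine, ...). That calling
    convention is an assumption, not Triage's documented interface.
    """
    return [
        SimpleQueryFeature(query=entry["query"], **block_kwargs)
        for entry in simple_query_config
    ]


config = [
    {"query": "select entity_id, as_of_date, quantity from source_table where date < '{as_of_date}'"},
    {"query": "select entity_id, as_of_date, other_quantity from other_source_table where date < '{as_of_date}'"},
]
blocks = generate_simple_queries(
    config,
    as_of_dates=["2016-01-01"],
    cohort_table="my_cohort_table",
)
print(len(blocks))
```

Note that `SimpleQueryFeature` as written above always targets `mytable`, so two configured queries would collide; a real implementation would need a distinct table name per block.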
31 changes: 19 additions & 12 deletions docs/sources/experiments/feature-testing.md
@@ -2,26 +2,27 @@

Developing features for Triage experiments can be a daunting task. There are a lot of things to configure, a small amount of configuration can result in a ton of SQL, and it can take a long time to validate your feature configuration in the context of an Experiment being run on real data.

To speed up the process of iterating on features, you can run a list of feature aggregations, without imputation, on just one as-of-date. This functionality can be accessed through the `triage` command line tool or called directly from code (say, in a Jupyter notebook) using the `FeatureGenerator` component.
To speed up the process of iterating on features, you can run a list of feature aggregations, without imputation, on just one as-of-date. This functionality can be accessed through the `triage` command line tool or called directly from code (say, in a Jupyter notebook) using the `feature_blocks_from_config` utility.

## Using Triage CLI
![triage featuretest cli help screen](featuretest-cli.png)

The command-line interface for testing features takes in two arguments:
- A feature config file. Refer to [example_feature_config.yaml](https://github.com/dssg/triage/blob/master/example/config/feature.yaml). Essentially this is the content of the [example_experiment_config.yaml](https://github.com/dssg/triage/blob/master/example/config/experiment.yaml)'s `feature_aggregations` section. It consists of a YAML list, with one or more feature_aggregation rows present.
- An as-of-date. This should be in the format `2016-01-01`.

Example: `triage experiment featuretest example/config/feature.yaml 2016-01-01`
- An experiment config file. It should have at least a `features` section, and if a `cohort_config` section is present, it will use that to limit the number of feature rows it creates to the cohort at the given date. Other keys can be in there but are ignored. In other words, you can use your experiment config file either before or after it's fully completed.
- An as-of-date. This should be in the format `2016-01-01`.

Example: `triage experiment featuretest example/config/experiment.yaml 2016-01-01`

All given feature aggregations will be processed for the given date. You will see a bunch of queries pass by in your terminal, populating tables in the `features_test` schema which you can inspect afterwards.

![triage feature test result](featuretest-result.png)

## Using Python Code
If you'd like to call this from a notebook or from any other Python code, the arguments look similar but are a bit different. You have to supply your own sqlalchemy database engine to create a 'FeatureGenerator' object, and then call the `create_features_before_imputation` method with your feature config as a list of dictionaries, along with an as-of-date as a string. Make sure your logging level is set to INFO if you want to see all of the queries.
If you'd like to call this from a notebook or from any other Python code, the arguments look similar but are a bit different. You have to supply the same arguments plus a few others to the `feature_blocks_from_config` function to create a set of feature blocks, and then call the `run_preimputation` method on each feature block. Make sure your logging level is set to INFO if you want to see all of the queries.


```
from triage.component.architect.feature_generators import FeatureGenerator
from triage.component.architect.feature_block_generators import feature_blocks_from_config
from triage.util.db import create_engine
import logging
import yaml
@@ -32,12 +33,13 @@ logging.basicConfig(level=logging.INFO)
db_url = 'your db url here'
db_engine = create_engine(db_url)

feature_config = [{
feature_config = {'spacetime_aggregations': [{
'prefix': 'aprefix',
'aggregates': [
{
'quantity': 'quantity_one',
'metrics': ['sum', 'count'],
}
],
'categoricals': [
{
@@ -50,10 +52,15 @@ feature_config = [{
'intervals': ['all'],
'knowledge_date_column': 'knowledge_date',
'from_obj': 'data'
}]
}]}

FeatureGenerator(db_engine, 'features_test').create_features_before_imputation(
feature_aggregation_config=feature_config,
feature_dates=['2016-01-01']
feature_blocks = feature_blocks_from_config(
feature_config,
as_of_dates=['2016-01-01'],
cohort_table=None,
db_engine=db_engine,
features_schema_name="features_test",
)
for feature_block in feature_blocks:
feature_block.run_preimputation(verbose=True)
```
66 changes: 66 additions & 0 deletions docs/sources/experiments/upgrade-to-v7.md
@@ -0,0 +1,66 @@
# Upgrading your experiment configuration to v7


This document details the steps needed to update a triage v6 configuration to
v7, mimicking the old behavior.

Experiment configuration v7 includes only one change from v6: The features are given at a different key. Instead of `feature_aggregations`, to make space for non-collate features to be added in the future, there is now a more generic `features` key, under which collate features reside at `spacetime_aggregations`.


Old:

```
feature_aggregations:
    -
        prefix: 'prefix'
        from_obj: 'cool_stuff'
        knowledge_date_column: 'open_date'
        aggregates_imputation:
            all:
                type: 'constant'
                value: 0
        aggregates:
            -
                quantity: 'homeless::INT'
                metrics: ['count', 'sum']
        intervals: ['1 year', '2 year']
        groups: ['entity_id']
```

New:

```
features:
    spacetime_aggregations:
        -
            prefix: 'prefix'
            from_obj: 'cool_stuff'
            knowledge_date_column: 'open_date'
            aggregates_imputation:
                all:
                    type: 'constant'
                    value: 0
            aggregates:
                -
                    quantity: 'homeless::INT'
                    metrics: ['count', 'sum']
            intervals: ['1 year', '2 year']
            groups: ['entity_id']
```

## Upgrading the experiment config version

At this point, you should be able to bump the top-level experiment config version to v7:

Old:

```
config_version: 'v6'
```

New:

```
config_version: 'v7'
```
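
Both steps are mechanical, so they can be applied to a parsed config in a few lines. The following is a minimal sketch, assuming the config has already been loaded from YAML into a plain dictionary; `upgrade_config_v6_to_v7` is a hypothetical helper and not part of Triage.

```python
import copy


def upgrade_config_v6_to_v7(config):
    """Return a copy of a parsed v6 experiment config restructured for v7."""
    upgraded = copy.deepcopy(config)
    if "feature_aggregations" in upgraded:
        # Move the collate aggregations under features.spacetime_aggregations.
        aggregations = upgraded.pop("feature_aggregations")
        upgraded.setdefault("features", {})["spacetime_aggregations"] = aggregations
    # Bump the top-level version marker.
    upgraded["config_version"] = "v7"
    return upgraded


old_config = {
    "config_version": "v6",
    "feature_aggregations": [{"prefix": "prefix", "from_obj": "cool_stuff"}],
}
print(upgrade_config_v6_to_v7(old_config))
```

The function works on a deep copy, so the original v6 dictionary is left untouched and the two can be compared afterwards.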

5 changes: 0 additions & 5 deletions docs/sources/experiments/upgrading.md

This file was deleted.

65 changes: 33 additions & 32 deletions example/config/experiment.yaml
@@ -5,7 +5,7 @@
# old configuration files are released. Be sure to assign the config version
# that matches the triage.experiments.CONFIG_VERSION in the triage release
# you are developing against!
config_version: 'v6'
config_version: 'v7'

# EXPERIMENT METADATA
# model_comment (optional) will end up in the model_comment column of the
@@ -72,37 +72,38 @@ label_config:


# FEATURE GENERATION
# The aggregate features to generate for each train/test split
#
# Implemented by wrapping collate: https://github.com/dssg/collate
# Most terminology here is taken directly from collate
#
# Each entry describes a collate.SpacetimeAggregation object, and the
# arguments needed to create it. Generally, each of these entries controls
# the features from one source table, though in the case of multiple groups
# may result in multiple output tables
#
# Rules specifying how to handle imputation of null values must be explicitly
# defined in your config file. These can be specified in two places: either
# within each feature or overall for each type of feature (aggregates_imputation,
# categoricals_imputation, array_categoricals_imputation). In either case, a rule must be given for
# each aggregation function (e.g., sum, max, avg, etc) used, or a catch-all
# can be specified with `all`. Aggregation function-specific rules will take
# precedence over the `all` rule and feature-specific rules will take
# precedence over the higher-level rules. Several examples are provided below.
#
# Available Imputation Rules:
# * mean: The average value of the feature (for SpacetimeAggregation the
# mean is taken within-date).
# * constant: Fill with a constant value from a required `value` parameter.
# * zero: Fill with zero.
# * null_category: Only available for categorical features. Just flag null
# values with the null category column.
# * binary_mode: Only available for aggregate column types. Takes the modal
# value for a binary feature.
# * error: Raise an exception if any null values are encountered for this
# feature.
feature_aggregations:
features:
spacetime_aggregations:
# The aggregate features to generate for each train/test split
#
# Implemented by wrapping collate: https://github.com/dssg/collate
# Most terminology here is taken directly from collate
#
# Each entry describes a collate.SpacetimeAggregation object, and the
# arguments needed to create it. Generally, each of these entries controls
# the features from one source table, though in the case of multiple groups
# may result in multiple output tables
#
# Rules specifying how to handle imputation of null values must be explicitly
# defined in your config file. These can be specified in two places: either
# within each feature or overall for each type of feature (aggregates_imputation,
# categoricals_imputation, array_categoricals_imputation). In either case, a rule must be given for
# each aggregation function (e.g., sum, max, avg, etc) used, or a catch-all
# can be specified with `all`. Aggregation function-specific rules will take
# precedence over the `all` rule and feature-specific rules will take
# precedence over the higher-level rules. Several examples are provided below.
#
# Available Imputation Rules:
# * mean: The average value of the feature (for SpacetimeAggregation the
# mean is taken within-date).
# * constant: Fill with a constant value from a required `value` parameter.
# * zero: Fill with zero.
# * null_category: Only available for categorical features. Just flag null
# values with the null category column.
# * binary_mode: Only available for aggregate column types. Takes the modal
# value for a binary feature.
# * error: Raise an exception if any null values are encountered for this
# feature.
-
# prefix given to the resultant tables
prefix: 'prefix'