
Commit

Add feature engineering docs
ThomasMeissnerDS committed Aug 23, 2024
1 parent 57c9065 commit 93c9a91
Showing 4 changed files with 74 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -164,6 +164,7 @@ with BlueCast, covering:
* [EDA](docs/source/EDA.md)
* [Basic usage](docs/source/Basic%20usage.md)
* [Customize training settings](docs/source/Customize%20training%20settings.md)
* [Feature engineering](docs/source/Feature%20engineering.md)
* [Customizing configurations and objects](docs/source/Customizing%20configurations%20and%20objects.md)
* [Model evaluation](docs/source/Model%20evaluation.md)
* [Error analysis](docs/source/Error%20analysis.md)
Expand Down
Binary file modified dist/bluecast-1.6.0-py3-none-any.whl
Binary file not shown.
Binary file modified dist/bluecast-1.6.0.tar.gz
Binary file not shown.
73 changes: 73 additions & 0 deletions docs/source/Feature engineering.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
# Feature engineering

Feature engineering is a critical part of machine learning.
BlueCast's pipelines automatically execute only the feature
engineering that is strictly necessary and leave the rest to
the end user. However, BlueCast offers some tools to make
this part more approachable and faster.
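The snippets below operate on `train` and `test` pandas DataFrames. The column names and the target name `Exited` used here are illustrative placeholders for a churn-style dataset, not BlueCast requirements; a minimal toy setup for following along:

```python
import pandas as pd

# Hypothetical toy data matching the columns used in the examples below.
TARGET = "Exited"

train = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "CustomerId": [11, 12, 13, 14],
    "Geography": ["DE", "FR", "DE", "FR"],
    "Gender": ["F", "M", "F", "M"],
    "NumOfProducts": [1, 2, 1, 2],
    "Balance": [100.0, 250.0, 80.0, 0.0],
    "Age": [35, 41, 29, 52],
    TARGET: [0, 1, 0, 1],
})
# A test set typically lacks the target column.
test = train.drop(columns=[TARGET]).copy()
```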

First we import the required modules:

```python
import pandas as pd

from bluecast.preprocessing.feature_types import FeatureTypeDetector
from bluecast.preprocessing.feature_creation import AddRowLevelAggFeatures, GroupLevelAggFeatures
```

Next we can make use of `FeatureTypeDetector` to identify
numerical columns:

```python
# TARGET is the name of the target column in your dataset
ignore_cols = [TARGET, "id", "CustomerId"]

feat_type_detector = FeatureTypeDetector()
train_data = feat_type_detector.fit_transform_feature_types(train.drop(ignore_cols, axis=1))
```
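For intuition, detecting numerical columns is conceptually similar to pandas' own dtype-based selection. This is only a rough stand-in, not BlueCast's actual implementation:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [35, 41],
    "Geography": ["DE", "FR"],
    "Balance": [100.0, 250.0],
})

# Rough equivalent of what FeatureTypeDetector exposes as num_columns.
num_columns = df.select_dtypes(include="number").columns.tolist()
# → ["Age", "Balance"]
```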

Next we use `AddRowLevelAggFeatures` to create features
on row level. This usually adds a small amount of
additional performance.

```python
agg_feat_creator = AddRowLevelAggFeatures()

train_num = agg_feat_creator.add_row_level_agg_features(train.loc[:, feat_type_detector.num_columns])
test_num = agg_feat_creator.add_row_level_agg_features(test.loc[:, feat_type_detector.num_columns])

train_num = train_num.drop(agg_feat_creator.original_features, axis=1)
test_num = test_num.drop(agg_feat_creator.original_features, axis=1)

train = pd.concat([train, train_num], axis=1)
test = pd.concat([test, test_num], axis=1)
```
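Conceptually, row-level aggregation summarises each row across its numeric columns. The exact feature names and aggregation set produced by `AddRowLevelAggFeatures` may differ; this plain-pandas sketch only illustrates the idea:

```python
import pandas as pd

num_df = pd.DataFrame({"Balance": [100.0, 0.0], "Age": [35.0, 52.0]})

# One summary statistic per row, computed across the numeric columns.
row_aggs = pd.DataFrame({
    "row_mean": num_df.mean(axis=1),  # [67.5, 26.0]
    "row_min": num_df.min(axis=1),
    "row_max": num_df.max(axis=1),
    "row_sum": num_df.sum(axis=1),    # [135.0, 52.0]
})
```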

Additionally, we can add information via group-level
aggregations with `GroupLevelAggFeatures`:

```python
group_agg_creator = GroupLevelAggFeatures()

train_num = group_agg_creator.create_groupby_agg_features(
    df=train,
    groupby_columns=["Geography", "Gender", "NumOfProducts"],
    columns_to_agg=feat_type_detector.num_columns,  # None = take all columns
    target_col=TARGET,  # the target column exists in train only
    aggregations=None,  # None falls back to a default set of aggregations
)

# Merge the aggregations computed on train into both frames, so the
# test set only ever sees statistics derived from the training data.
train = train.merge(train_num, on=["Geography", "Gender", "NumOfProducts"], how="left")
test = test.merge(train_num, on=["Geography", "Gender", "NumOfProducts"], how="left")
```
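The pattern above mirrors a standard groupby-aggregate-merge in pandas: statistics are computed on the training data only and then joined onto any frame by the group keys, which keeps the test set free of leakage. A minimal sketch with illustrative column names:

```python
import pandas as pd

train = pd.DataFrame({
    "Geography": ["DE", "DE", "FR"],
    "Gender": ["F", "M", "F"],
    "Balance": [100.0, 50.0, 80.0],
})

# Statistics are computed on the training data only ...
group_stats = (
    train.groupby(["Geography", "Gender"])["Balance"]
    .agg(["mean", "max"])
    .add_prefix("Balance_")
    .reset_index()
)

# ... and merged back by the group keys; the same group_stats frame
# could be merged into a test set without touching test labels.
train_enriched = train.merge(group_stats, on=["Geography", "Gender"], how="left")
```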

Please note that this will increase the number of features
significantly.
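As a rough upper bound, each `groupby_columns` combination adds about one column per aggregated column per aggregation function. A quick back-of-the-envelope check with made-up numbers (the actual default aggregation set may differ):

```python
n_numeric_cols = 10  # columns passed via columns_to_agg
n_agg_funcs = 5      # e.g. mean, min, max, sum, std (illustrative, not the actual defaults)

added_features = n_numeric_cols * n_agg_funcs
# → 50 new columns from a single groupby_columns combination
```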
