
Commit

Add feature engineering docs
ThomasMeissnerDS committed Aug 23, 2024
1 parent 57c9065 commit 93c9a91
Showing 4 changed files with 74 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -164,6 +164,7 @@ with BlueCast, covering:
* [EDA](docs/source/EDA.md)
* [Basic usage](docs/source/Basic%20usage.md)
* [Customize training settings](docs/source/Customize%20training%20settings.md)
* [Feature engineering](docs/source/Feature%20engineering.md)
* [Customizing configurations and objects](docs/source/Customizing%20configurations%20and%20objects.md)
* [Model evaluation](docs/source/Model%20evaluation.md)
* [Error analysis](docs/source/Error%20analysis.md)
Expand Down
Binary file modified dist/bluecast-1.6.0-py3-none-any.whl
Binary file not shown.
Binary file modified dist/bluecast-1.6.0.tar.gz
Binary file not shown.
73 changes: 73 additions & 0 deletions docs/source/Feature engineering.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
# Feature engineering

Feature engineering is a critical part of machine learning.
BlueCast's pipelines automatically execute only the feature
engineering that is strictly necessary and leave the rest to
the end user. However, BlueCast offers some tools to make
this part more approachable and faster.
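The snippets below operate on `train` and `test` pandas DataFrames. The column names and the target name `Exited` used here are illustrative placeholders for a churn-style dataset, not BlueCast requirements; a minimal toy setup for following along:

```python
import pandas as pd

# Hypothetical toy data matching the columns used in the examples below.
TARGET = "Exited"

train = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "CustomerId": [11, 12, 13, 14],
    "Geography": ["DE", "FR", "DE", "FR"],
    "Gender": ["F", "M", "F", "M"],
    "NumOfProducts": [1, 2, 1, 2],
    "Balance": [100.0, 250.0, 80.0, 0.0],
    "Age": [35, 41, 29, 52],
    TARGET: [0, 1, 0, 1],
})
# A test set typically lacks the target column.
test = train.drop(columns=[TARGET]).copy()
```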

First we import the required modules:

```python
import pandas as pd

from bluecast.preprocessing.feature_types import FeatureTypeDetector
from bluecast.preprocessing.feature_creation import AddRowLevelAggFeatures, GroupLevelAggFeatures
```

Next we can make use of `FeatureTypeDetector` to identify
numerical columns:

```python
# TARGET is the name of the target column in your dataset
ignore_cols = [TARGET, "id", "CustomerId"]

feat_type_detector = FeatureTypeDetector()
train_data = feat_type_detector.fit_transform_feature_types(train.drop(ignore_cols, axis=1))
```
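For intuition, detecting numerical columns is conceptually similar to pandas' own dtype-based selection. This is only a rough stand-in, not BlueCast's actual implementation:

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [35, 41],
    "Geography": ["DE", "FR"],
    "Balance": [100.0, 250.0],
})

# Rough equivalent of what FeatureTypeDetector exposes as num_columns.
num_columns = df.select_dtypes(include="number").columns.tolist()
# → ["Age", "Balance"]
```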

Next we use `AddRowLevelAggFeatures` to create features
on row level. This usually adds a small amount of
additional performance.

```python
agg_feat_creator = AddRowLevelAggFeatures()

train_num = agg_feat_creator.add_row_level_agg_features(train.loc[:, feat_type_detector.num_columns])
test_num = agg_feat_creator.add_row_level_agg_features(test.loc[:, feat_type_detector.num_columns])

train_num = train_num.drop(agg_feat_creator.original_features, axis=1)
test_num = test_num.drop(agg_feat_creator.original_features, axis=1)

train = pd.concat([train, train_num], axis=1)
test = pd.concat([test, test_num], axis=1)
```
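Conceptually, row-level aggregation summarises each row across its numeric columns. The exact feature names and aggregation set produced by `AddRowLevelAggFeatures` may differ; this plain-pandas sketch only illustrates the idea:

```python
import pandas as pd

num_df = pd.DataFrame({"Balance": [100.0, 0.0], "Age": [35.0, 52.0]})

# One summary statistic per row, computed across the numeric columns.
row_aggs = pd.DataFrame({
    "row_mean": num_df.mean(axis=1),  # [67.5, 26.0]
    "row_min": num_df.min(axis=1),
    "row_max": num_df.max(axis=1),
    "row_sum": num_df.sum(axis=1),    # [135.0, 52.0]
})
```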

Additionally, we can add information via group-level
aggregations with `GroupLevelAggFeatures`:

```python
group_agg_creator = GroupLevelAggFeatures()

train_num = group_agg_creator.create_groupby_agg_features(
    df=train,
    groupby_columns=["Geography", "Gender", "NumOfProducts"],
    columns_to_agg=feat_type_detector.num_columns,  # None = take all columns
    target_col=TARGET,  # the target column exists in train only
    aggregations=None,  # None falls back to a default set of aggregations
)

# Merge the aggregations computed on train into both frames, so the
# test set only ever sees statistics derived from the training data.
train = train.merge(train_num, on=["Geography", "Gender", "NumOfProducts"], how="left")
test = test.merge(train_num, on=["Geography", "Gender", "NumOfProducts"], how="left")
```
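The pattern above mirrors a standard groupby-aggregate-merge in pandas: statistics are computed on the training data only and then joined onto any frame by the group keys, which keeps the test set free of leakage. A minimal sketch with illustrative column names:

```python
import pandas as pd

train = pd.DataFrame({
    "Geography": ["DE", "DE", "FR"],
    "Gender": ["F", "M", "F"],
    "Balance": [100.0, 50.0, 80.0],
})

# Statistics are computed on the training data only ...
group_stats = (
    train.groupby(["Geography", "Gender"])["Balance"]
    .agg(["mean", "max"])
    .add_prefix("Balance_")
    .reset_index()
)

# ... and merged back by the group keys; the same group_stats frame
# could be merged into a test set without touching test labels.
train_enriched = train.merge(group_stats, on=["Geography", "Gender"], how="left")
```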

Please note that this will increase the number of features
significantly.
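As a rough upper bound, each `groupby_columns` combination adds about one column per aggregated column per aggregation function. A quick back-of-the-envelope check with made-up numbers (the actual default aggregation set may differ):

```python
n_numeric_cols = 10  # columns passed via columns_to_agg
n_agg_funcs = 5      # e.g. mean, min, max, sum, std (illustrative, not the actual defaults)

added_features = n_numeric_cols * n_agg_funcs
# → 50 new columns from a single groupby_columns combination
```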
