Commit 3d32618
Merge pull request #10 from ThomasMeissnerDS/change_categorical_encoding
Change categorical encoding
thomasmeissnercrm authored Jun 30, 2023
2 parents f47d7c0 + 6c92671 commit 3d32618
Showing 15 changed files with 696 additions and 290 deletions.
67 changes: 65 additions & 2 deletions README.md
@@ -32,7 +32,9 @@ the full documentation [here](https://bluecast.readthedocs.io/en/latest/).
* [General usage](#general-usage)
* [Basic usage](#basic-usage)
* [Advanced usage](#advanced-usage)
* [Custom training configuration](#custom-training-configuration)
* [Explanatory analysis](#explanatory-analysis)
* [Enable cross-validation](#enable-cross-validation)
* [Categorical encoding](#categorical-encoding)
* [Custom preprocessing](#custom-preprocessing)
* [Custom feature selection](#custom-feature-selection)
* [Custom ML model](#custom-ml-model)
@@ -88,6 +90,66 @@ y_probs, y_classes = automl.predict(df_val)

### Advanced usage

#### Explanatory analysis

BlueCast offers a simple way to get a first overview of the data:

```python
from bluecast.eda.analyse import (
    bi_variate_plots,
    correlation_heatmap,
    correlation_to_target,
    univariate_plots,
)

from bluecast.preprocessing.feature_types import FeatureTypeDetector

# Here we automatically detect the numeric columns
feat_type_detector = FeatureTypeDetector()
train_data = feat_type_detector.fit_transform_feature_types(train_data)

# show univariate plots
univariate_plots(
    train_data.loc[
        :, feat_type_detector.num_columns  # here the target column EC1 is already included
    ],
    "EC1",
)

# show bi-variate plots
bi_variate_plots(
    train_data.loc[:, feat_type_detector.num_columns],
    "EC1",
)

# show correlation heatmap
correlation_heatmap(train_data.loc[:, feat_type_detector.num_columns])

# show correlation to target
correlation_to_target(
    train_data.loc[:, feat_type_detector.num_columns],
    "EC1",
)
```
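
#### Enable cross-validation

While the default behaviour of BlueCast is to use a simple
train-test-split, cross-validation can be enabled easily. A minimal sketch,
assuming `TrainingConfig` exposes a `hypertuning_cv_folds` field and that the
custom config can be handed to `BlueCast` via a `conf_training` argument
(both names are assumptions, mirroring attributes used elsewhere in this
commit):

```python
from bluecast.blueprints.cast import BlueCast
from bluecast.config.training_config import TrainingConfig

# Assumed field: hypertuning_cv_folds sets the number of folds used during
# hyperparameter tuning; 1 would mean a plain train-test split.
custom_config = TrainingConfig()
custom_config.hypertuning_cv_folds = 5

automl = BlueCast(
    class_problem="binary",
    target_column="target",
    conf_training=custom_config,  # assumed parameter name
)
```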

#### Categorical encoding

By default, BlueCast uses target encoding. This behaviour can be changed
in the TrainingConfig by setting `cat_encoding_via_ml_algorithm` to True.
Doing so changes the expectations of `custom_last_mile_computation` though:
if `cat_encoding_via_ml_algorithm` is set to False (the default),
`custom_last_mile_computation` will receive numerical features only, as
target encoding is applied beforehand. If `cat_encoding_via_ml_algorithm`
is set to True, `custom_last_mile_computation` will receive categorical
features as well, because Xgboost's inbuilt categorical encoding will be used.

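A minimal sketch of switching the encoding strategy, again assuming the
custom config is passed via a `conf_training` argument:

```python
from bluecast.blueprints.cast import BlueCast
from bluecast.config.training_config import TrainingConfig

custom_config = TrainingConfig()
# Default is False (target encoding); True delegates categorical handling
# to Xgboost's inbuilt categorical support.
custom_config.cat_encoding_via_ml_algorithm = True

automl = BlueCast(
    class_problem="binary",
    target_column="target",
    conf_training=custom_config,  # assumed parameter name
)
```

With this setting, `fit` casts the categorical columns to pandas' `category`
dtype and passes them to Xgboost untouched, so a provided
`custom_last_mile_computation` must expect raw categorical features.
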
#### Custom training configuration

Unlike e2eml, BlueCast allows easy customization. Users can adjust the
@@ -405,13 +467,14 @@ with the following features:
* automatic feature type detection and casting
* automatic DataFrame schema detection: checks if unseen data has new or
missing columns
* categorical feature encoding (target encoding or directly in Xgboost)
* datetime feature encoding
* automated GPU availability check and usage for Xgboost
* a fit_eval method to fit a model and evaluate it on a validation set
to mimic production environment reality
* functions to save and load a trained pipeline
* Shapley values
* warnings for potential misconfigurations
The `fit_eval` method can be used like this:
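
The call pattern below is a sketch: the exact signature is not shown here
and the argument names are assumed; `df_train`, `df_eval`, `y_eval` and
`df_val` stand for user-provided pandas objects.

```python
from bluecast.blueprints.cast import BlueCast

automl = BlueCast(class_problem="binary", target_column="target")

# Assumed signature: fit on df_train, then evaluate on the held-out
# df_eval/y_eval pair to mimic production reality.
automl.fit_eval(df_train, df_eval, y_eval, target_col="target")

y_probs, y_classes = automl.predict(df_val)
```
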
51 changes: 46 additions & 5 deletions bluecast/blueprints/cast.py
@@ -135,9 +135,32 @@ def initial_checks(self, df: pd.DataFrame) -> None:
             the dataset while feature selection is enabled. Consider reducing the minimum number of features to
             select or disabling feature selection via TrainingConfig."""
             warnings.warn(message, UserWarning, stacklevel=2)
-        if self.target_column in df.columns:
-            message = """The target column is present in the dataset. Consider removing the target column from the
-            dataset to prevent leakage."""
+        if (
+            self.conf_training.cat_encoding_via_ml_algorithm
+            and self.conf_training.calculate_shap_values
+        ):
+            self.conf_training.calculate_shap_values = False
+            message = """SHAP values cannot be calculated when categorical encoding via ML algorithm is enabled due to
+            incompatibility with the shap library. See this GitHub issue for more context:
+            https://github.com/slundberg/shap/issues/266
+            Calculation of Shap values has been changed to false.
+            Consider disabling categorical encoding via ML algorithm in the TrainingConfig if shap values are
+            required. Alternatively use Xgboost as a custom model and calculate shap values manually via
+            pred_contribs=True."""
+            warnings.warn(message, UserWarning, stacklevel=2)
+        if self.conf_training.cat_encoding_via_ml_algorithm and self.ml_model:
+            message = """Categorical encoding via ML algorithm is enabled. Make sure to handle categorical features
+            within the provided ml model or consider disabling categorical encoding via ML algorithm in the
+            TrainingConfig alternatively."""
+            warnings.warn(message, UserWarning, stacklevel=2)
+        if (
+            self.conf_training.cat_encoding_via_ml_algorithm
+            and self.custom_last_mile_computation
+        ):
+            message = """Categorical encoding via ML algorithm is enabled. Make sure to handle categorical features
+            within the provided last mile computation or consider disabling categorical encoding via ML algorithm in the
+            TrainingConfig alternatively."""
+            warnings.warn(message, UserWarning, stacklevel=2)

     def fit(self, df: pd.DataFrame, target_col: str) -> None:
@@ -198,14 +221,25 @@ def fit(self, df: pd.DataFrame, target_col: str) -> None:
         self.schema_detector.fit(x_train)
         x_test = self.schema_detector.transform(x_test)

-        if self.cat_columns is not None and self.class_problem == "binary":
+        if (
+            self.cat_columns is not None
+            and self.class_problem == "binary"
+            and not self.conf_training.cat_encoding_via_ml_algorithm
+        ):
             self.cat_encoder = BinaryClassTargetEncoder(feat_type_detector.cat_columns)
             x_train = self.cat_encoder.fit_target_encode_binary_class(x_train, y_train)
             x_test = self.cat_encoder.transform_target_encode_binary_class(x_test)
-        elif self.cat_columns is not None and self.class_problem == "multiclass":
+        elif (
+            self.cat_columns is not None
+            and self.class_problem == "multiclass"
+            and not self.conf_training.cat_encoding_via_ml_algorithm
+        ):
             self.cat_encoder = MultiClassTargetEncoder(feat_type_detector.cat_columns)
             x_train = self.cat_encoder.fit_target_encode_multiclass(x_train, y_train)
             x_test = self.cat_encoder.transform_target_encode_multiclass(x_test)
+        elif self.conf_training.cat_encoding_via_ml_algorithm:
+            x_train[self.cat_columns] = x_train[self.cat_columns].astype("category")
+            x_test[self.cat_columns] = x_test[self.cat_columns].astype("category")

         if self.custom_last_mile_computation:
             x_train, y_train = self.custom_last_mile_computation.fit_transform(
@@ -290,15 +324,19 @@ def transform_new_data(self, df: pd.DataFrame) -> pd.DataFrame:
             and self.cat_encoder
             and self.class_problem == "binary"
             and isinstance(self.cat_encoder, BinaryClassTargetEncoder)
+            and not self.conf_training.cat_encoding_via_ml_algorithm
         ):
             df = self.cat_encoder.transform_target_encode_binary_class(df)
         elif (
             self.cat_columns
             and self.cat_encoder
             and self.class_problem == "multiclass"
             and isinstance(self.cat_encoder, MultiClassTargetEncoder)
+            and not self.conf_training.cat_encoding_via_ml_algorithm
         ):
             df = self.cat_encoder.transform_target_encode_multiclass(df)
+        elif self.conf_training.cat_encoding_via_ml_algorithm:
+            df[self.cat_columns] = df[self.cat_columns].astype("category")

         if self.custom_last_mile_computation:
             df, _ = self.custom_last_mile_computation.transform(df, predicton_mode=True)
@@ -320,6 +358,9 @@ def predict(self, df: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray]:
         if not self.feat_type_detector:
             raise Exception("Feature type converter could not be found.")

+        if not self.conf_training:
+            raise ValueError("conf_training is None")
+
         check_gpu_support()
         df = self.transform_new_data(df)
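
The SHAP warning above points to a manual fallback: train Xgboost with its
native categorical support and request per-feature contributions directly.
A self-contained sketch (dataset and column names are made up):

```python
import numpy as np
import pandas as pd
import xgboost as xgb

# Hypothetical training data with a native categorical feature.
df = pd.DataFrame(
    {
        "color": pd.Categorical(["red", "green", "blue", "red"]),
        "size": [1.0, 2.0, 3.0, 4.0],
    }
)
y = np.array([0, 1, 1, 0])

# enable_categorical requires the hist tree method.
dtrain = xgb.DMatrix(df, label=y, enable_categorical=True)
booster = xgb.train({"objective": "binary:logistic", "tree_method": "hist"}, dtrain)

# pred_contribs=True returns SHAP-style per-feature contributions
# plus a final bias column.
contribs = booster.predict(dtrain, pred_contribs=True)
print(contribs.shape)  # (n_rows, n_features + 1)
```
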
19 changes: 8 additions & 11 deletions bluecast/config/training_config.py
@@ -31,18 +31,19 @@ class TrainingConfig:
     train_split_stratify: bool = True
     use_full_data_for_final_model: bool = True
     min_features_to_select: int = 5
+    cat_encoding_via_ml_algorithm: bool = False


 @dataclass
 class XgboostTuneParamsConfig:
     """Define hyperparameter tuning search space."""

     max_depth_min: int = 2
-    max_depth_max: int = 3
-    alpha_min: float = 1.0
-    alpha_max: float = 1e3
-    lambda_min: float = 1.0
-    lambda_max: float = 1e3
+    max_depth_max: int = 6
+    alpha_min: float = 0.0
+    alpha_max: float = 10.0
+    lambda_min: float = 0.0
+    lambda_max: float = 10.0
     num_leaves_min: int = 2
     num_leaves_max: int = 64
     sub_sample_min: float = 0.3
@@ -51,15 +52,11 @@ class XgboostTuneParamsConfig:
     col_sample_by_tree_max: float = 1.0
     col_sample_by_level_min: float = 0.3
     col_sample_by_level_max: float = 1.0
-    col_sample_by_node_min: float = 0.3
-    col_sample_by_node_max: float = 1.0
-    min_child_samples_min: int = 2
-    min_child_samples_max: int = 1000
     min_child_weight_min: float = 0.0
     min_child_weight_max: float = 10.0
     eta: float = 0.1
     steps_min: int = 2
     steps_max: int = 50000
     num_parallel_tree_min: int = 1
     num_parallel_tree_max: int = 3
     model_verbosity: int = 0
     model_objective: str = "multi:softprob"
     model_eval_metric: str = "mlogloss"
Empty file added bluecast/eda/__init__.py
133 changes: 133 additions & 0 deletions bluecast/eda/analyse.py
@@ -0,0 +1,133 @@
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def univariate_plots(df: pd.DataFrame, target: str) -> None:
    """
    Plots univariate plots for all the columns in the dataframe.
    Expects numeric columns only.
    """
    for col in df.columns:
        # Skip the target column in univariate analysis
        if col == target:
            continue

        plt.figure(figsize=(8, 4))

        # Histogram
        plt.subplot(1, 2, 1)
        sns.histplot(data=df, x=col, kde=True)
        plt.xlabel(col)
        plt.ylabel("Frequency")
        plt.title("Histogram")

        # Box plot
        plt.subplot(1, 2, 2)
        sns.boxplot(data=df, y=col)
        plt.ylabel(col)
        plt.title("Box Plot")

        # Adjust spacing between subplots
        plt.tight_layout()

        # Show the plots
        plt.show()


def bi_variate_plots(df: pd.DataFrame, target: str) -> None:
    """
    Plots bivariate plots for all column combinations in the dataframe.
    Expects numeric columns only.
    """
    # Get the list of column names except for the target column
    variables = [col for col in df.columns if col != target]

    # Define the grid layout based on the number of variables
    num_variables = len(variables)
    num_cols = 4  # Number of columns in the grid
    num_rows = (
        num_variables + num_cols - 1
    ) // num_cols  # Calculate the number of rows needed

    # Set the size of the figure; squeeze=False keeps axes two-dimensional
    # even when there is only a single row of subplots.
    fig, axes = plt.subplots(
        num_rows, num_cols, figsize=(12, 4 * num_rows), squeeze=False
    )

    # Generate violin plots for each variable with respect to the target
    for i, variable in enumerate(variables):
        row = i // num_cols
        col = i % num_cols
        ax = axes[row][col]

        sns.violinplot(data=df, x=target, y=variable, ax=ax)
        ax.set_xlabel(target)
        ax.set_ylabel(variable)
        ax.set_title(f"Violin Plot: {variable} vs {target}")

    # Remove any empty subplots
    if num_variables < num_rows * num_cols:
        for i in range(num_variables, num_rows * num_cols):
            fig.delaxes(axes.flatten()[i])

    # Adjust the spacing between subplots
    plt.tight_layout()

    # Show the plot
    plt.show()


def correlation_heatmap(df: pd.DataFrame) -> None:
    """
    Plots half of the heatmap showing correlations of all features.
    Expects numeric columns only.
    """
    # Calculate the correlation matrix
    corr = df.corr()

    # Generate a mask for the upper triangle
    mask = np.triu(np.ones_like(corr, dtype=bool))

    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(230, 20, as_cmap=True)

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(
        corr,
        mask=mask,
        cmap=cmap,
        vmax=0.3,
        center=0,
        square=True,
        linewidths=0.5,
        cbar_kws={"shrink": 0.5},
    )

    plt.show()


def correlation_to_target(df: pd.DataFrame, target: str) -> None:
    """
    Plots correlations for all the columns in the dataframe in relation to the target column.
    Expects numeric columns only.
    """
    # Calculate the correlation matrix
    corr = df.corr()

    # Get correlations to the target column, excluding the target itself
    corrs = corr[target].drop([target])

    # Sort correlation values in descending order
    corrs_sorted = corrs.sort_values(ascending=False)

    # Create a heatmap of the correlations with the target
    sns.set(font_scale=0.8)
    sns.set_style("white")
    sns.set_palette("PuBuGn_d")
    sns.heatmap(corrs_sorted.to_frame(), cmap="coolwarm", annot=True, fmt=".2f")
    plt.title(f"Correlation with {target}")
    plt.show()