Commit 3d32618
Merge pull request #10 from ThomasMeissnerDS/change_categorical_encoding
Change categorical encoding
thomasmeissnercrm authored Jun 30, 2023
2 parents f47d7c0 + 6c92671 commit 3d32618
Showing 15 changed files with 696 additions and 290 deletions.
67 changes: 65 additions & 2 deletions README.md
@@ -32,7 +32,9 @@ the full documentation [here](https://bluecast.readthedocs.io/en/latest/).
* [General usage](#general-usage)
* [Basic usage](#basic-usage)
* [Advanced usage](#advanced-usage)
* [Custom training configuration](#custom-training-configuration)
* [Explanatory analysis](#explanatory-analysis)
* [Enable cross-validation](#enable-cross-validation)
* [Categorical encoding](#categorical-encoding)
* [Custom preprocessing](#custom-preprocessing)
* [Custom feature selection](#custom-feature-selection)
* [Custom ML model](#custom-ml-model)
@@ -88,6 +90,66 @@ y_probs, y_classes = automl.predict(df_val)

### Advanced usage

#### Explanatory analysis

BlueCast offers a simple way to get a first overview of the data:

```python
from bluecast.eda.analyse import (
    bi_variate_plots,
    correlation_heatmap,
    correlation_to_target,
    univariate_plots,
)

from bluecast.preprocessing.feature_types import FeatureTypeDetector

# Here we automatically detect the numeric columns
feat_type_detector = FeatureTypeDetector()
train_data = feat_type_detector.fit_transform_feature_types(train_data)

# show univariate plots
univariate_plots(
    train_data.loc[
        :, feat_type_detector.num_columns  # here the target column EC1 is already included
    ],
    "EC1",
)

# show bi-variate plots
bi_variate_plots(
    train_data.loc[:, feat_type_detector.num_columns],
    "EC1",
)

# show correlation heatmap
correlation_heatmap(train_data.loc[:, feat_type_detector.num_columns])

# show correlation to target
correlation_to_target(
    train_data.loc[:, feat_type_detector.num_columns],
    "EC1",
)
```
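
#### Enable cross-validation

While the default behaviour of BlueCast is to use a simple
train-test-split, cross-validation can be enabled easily. A minimal sketch,
assuming `TrainingConfig` exposes a `hypertuning_cv_folds` field and that the
custom config can be handed to `BlueCast` via a `conf_training` argument
(both names are assumptions, mirroring attributes used elsewhere in this
commit):

```python
from bluecast.blueprints.cast import BlueCast
from bluecast.config.training_config import TrainingConfig

# Assumed field: hypertuning_cv_folds sets the number of folds used during
# hyperparameter tuning; 1 would mean a plain train-test split.
custom_config = TrainingConfig()
custom_config.hypertuning_cv_folds = 5

automl = BlueCast(
    class_problem="binary",
    target_column="target",
    conf_training=custom_config,  # assumed parameter name
)
```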

#### Categorical encoding

By default, BlueCast uses target encoding. This behaviour can be changed
in the TrainingConfig by setting `cat_encoding_via_ml_algorithm` to True.
Doing so changes the expectations of `custom_last_mile_computation` though:
if `cat_encoding_via_ml_algorithm` is set to False (the default),
`custom_last_mile_computation` will receive numerical features only, as
target encoding is applied beforehand. If `cat_encoding_via_ml_algorithm`
is set to True, `custom_last_mile_computation` will receive categorical
features as well, because Xgboost's inbuilt categorical encoding will be used.

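A minimal sketch of switching the encoding strategy, again assuming the
custom config is passed via a `conf_training` argument:

```python
from bluecast.blueprints.cast import BlueCast
from bluecast.config.training_config import TrainingConfig

custom_config = TrainingConfig()
# Default is False (target encoding); True delegates categorical handling
# to Xgboost's inbuilt categorical support.
custom_config.cat_encoding_via_ml_algorithm = True

automl = BlueCast(
    class_problem="binary",
    target_column="target",
    conf_training=custom_config,  # assumed parameter name
)
```

With this setting, `fit` casts the categorical columns to pandas' `category`
dtype and passes them to Xgboost untouched, so a provided
`custom_last_mile_computation` must expect raw categorical features.
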
#### Custom training configuration

Unlike e2eml, BlueCast allows easy customization. Users can adjust the
@@ -405,13 +467,14 @@ with the following features:
* automatic feature type detection and casting
* automatic DataFrame schema detection: checks if unseen data has new or
missing columns
* categorical feature encoding (target encoding or directly in Xgboost)
* datetime feature encoding
* automated GPU availability check and usage for Xgboost
* a fit_eval method to fit a model and evaluate it on a validation set
to mimic production environment reality
* functions to save and load a trained pipeline
* Shapley values
* warnings for potential misconfigurations
The `fit_eval` method can be used like this:
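
The call pattern below is a sketch: the exact signature is not shown here
and the argument names are assumed; `df_train`, `df_eval`, `y_eval` and
`df_val` stand for user-provided pandas objects.

```python
from bluecast.blueprints.cast import BlueCast

automl = BlueCast(class_problem="binary", target_column="target")

# Assumed signature: fit on df_train, then evaluate on the held-out
# df_eval/y_eval pair to mimic production reality.
automl.fit_eval(df_train, df_eval, y_eval, target_col="target")

y_probs, y_classes = automl.predict(df_val)
```
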
51 changes: 46 additions & 5 deletions bluecast/blueprints/cast.py
@@ -135,9 +135,32 @@ def initial_checks(self, df: pd.DataFrame) -> None:
             the dataset while feature selection is enabled. Consider reducing the minimum number of features to
             select or disabling feature selection via TrainingConfig."""
             warnings.warn(message, UserWarning, stacklevel=2)
-        if self.target_column in df.columns:
-            message = """The target column is present in the dataset. Consider removing the target column from the
-            dataset to prevent leakage."""
+        if (
+            self.conf_training.cat_encoding_via_ml_algorithm
+            and self.conf_training.calculate_shap_values
+        ):
+            self.conf_training.calculate_shap_values = False
+            message = """SHAP values cannot be calculated when categorical encoding via ML algorithm is enabled due to
+            incompatibility with the shap library. See this GitHub issue for more context:
+            https://github.com/slundberg/shap/issues/266
+            Calculation of Shap values has been changed to false.
+            Consider disabling categorical encoding via ML algorithm in the TrainingConfig if shap values are
+            required. Alternatively use Xgboost as a custom model and calculate shap values manually via
+            pred_contribs=True."""
+            warnings.warn(message, UserWarning, stacklevel=2)
+        if self.conf_training.cat_encoding_via_ml_algorithm and self.ml_model:
+            message = """Categorical encoding via ML algorithm is enabled. Make sure to handle categorical features
+            within the provided ml model or consider disabling categorical encoding via ML algorithm in the
+            TrainingConfig alternatively."""
+            warnings.warn(message, UserWarning, stacklevel=2)
+        if (
+            self.conf_training.cat_encoding_via_ml_algorithm
+            and self.custom_last_mile_computation
+        ):
+            message = """Categorical encoding via ML algorithm is enabled. Make sure to handle categorical features
+            within the provided last mile computation or consider disabling categorical encoding via ML algorithm in the
+            TrainingConfig alternatively."""
+            warnings.warn(message, UserWarning, stacklevel=2)

     def fit(self, df: pd.DataFrame, target_col: str) -> None:
@@ -198,14 +221,25 @@ def fit(self, df: pd.DataFrame, target_col: str) -> None:
         self.schema_detector.fit(x_train)
         x_test = self.schema_detector.transform(x_test)

-        if self.cat_columns is not None and self.class_problem == "binary":
+        if (
+            self.cat_columns is not None
+            and self.class_problem == "binary"
+            and not self.conf_training.cat_encoding_via_ml_algorithm
+        ):
             self.cat_encoder = BinaryClassTargetEncoder(feat_type_detector.cat_columns)
             x_train = self.cat_encoder.fit_target_encode_binary_class(x_train, y_train)
             x_test = self.cat_encoder.transform_target_encode_binary_class(x_test)
-        elif self.cat_columns is not None and self.class_problem == "multiclass":
+        elif (
+            self.cat_columns is not None
+            and self.class_problem == "multiclass"
+            and not self.conf_training.cat_encoding_via_ml_algorithm
+        ):
             self.cat_encoder = MultiClassTargetEncoder(feat_type_detector.cat_columns)
             x_train = self.cat_encoder.fit_target_encode_multiclass(x_train, y_train)
             x_test = self.cat_encoder.transform_target_encode_multiclass(x_test)
+        elif self.conf_training.cat_encoding_via_ml_algorithm:
+            x_train[self.cat_columns] = x_train[self.cat_columns].astype("category")
+            x_test[self.cat_columns] = x_test[self.cat_columns].astype("category")

         if self.custom_last_mile_computation:
             x_train, y_train = self.custom_last_mile_computation.fit_transform(
@@ -290,15 +324,19 @@ def transform_new_data(self, df: pd.DataFrame) -> pd.DataFrame:
             and self.cat_encoder
             and self.class_problem == "binary"
             and isinstance(self.cat_encoder, BinaryClassTargetEncoder)
+            and not self.conf_training.cat_encoding_via_ml_algorithm
         ):
             df = self.cat_encoder.transform_target_encode_binary_class(df)
         elif (
             self.cat_columns
             and self.cat_encoder
             and self.class_problem == "multiclass"
             and isinstance(self.cat_encoder, MultiClassTargetEncoder)
+            and not self.conf_training.cat_encoding_via_ml_algorithm
         ):
             df = self.cat_encoder.transform_target_encode_multiclass(df)
+        elif self.conf_training.cat_encoding_via_ml_algorithm:
+            df[self.cat_columns] = df[self.cat_columns].astype("category")

         if self.custom_last_mile_computation:
             df, _ = self.custom_last_mile_computation.transform(df, predicton_mode=True)
@@ -320,6 +358,9 @@ def predict(self, df: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray]:
         if not self.feat_type_detector:
             raise Exception("Feature type converter could not be found.")

+        if not self.conf_training:
+            raise ValueError("conf_training is None")
+
         check_gpu_support()
         df = self.transform_new_data(df)
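
The SHAP warning above points to a manual fallback: train Xgboost with its
native categorical support and request per-feature contributions directly.
A self-contained sketch (dataset and column names are made up):

```python
import numpy as np
import pandas as pd
import xgboost as xgb

# Hypothetical training data with a native categorical feature.
df = pd.DataFrame(
    {
        "color": pd.Categorical(["red", "green", "blue", "red"]),
        "size": [1.0, 2.0, 3.0, 4.0],
    }
)
y = np.array([0, 1, 1, 0])

# enable_categorical requires the hist tree method.
dtrain = xgb.DMatrix(df, label=y, enable_categorical=True)
booster = xgb.train({"objective": "binary:logistic", "tree_method": "hist"}, dtrain)

# pred_contribs=True returns SHAP-style per-feature contributions
# plus a final bias column.
contribs = booster.predict(dtrain, pred_contribs=True)
print(contribs.shape)  # (n_rows, n_features + 1)
```
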
19 changes: 8 additions & 11 deletions bluecast/config/training_config.py
@@ -31,18 +31,19 @@ class TrainingConfig:
     train_split_stratify: bool = True
     use_full_data_for_final_model: bool = True
     min_features_to_select: int = 5
+    cat_encoding_via_ml_algorithm: bool = False


 @dataclass
 class XgboostTuneParamsConfig:
     """Define hyperparameter tuning search space."""

     max_depth_min: int = 2
-    max_depth_max: int = 3
-    alpha_min: float = 1.0
-    alpha_max: float = 1e3
-    lambda_min: float = 1.0
-    lambda_max: float = 1e3
+    max_depth_max: int = 6
+    alpha_min: float = 0.0
+    alpha_max: float = 10.0
+    lambda_min: float = 0.0
+    lambda_max: float = 10.0
     num_leaves_min: int = 2
     num_leaves_max: int = 64
     sub_sample_min: float = 0.3
@@ -51,15 +52,11 @@ class XgboostTuneParamsConfig:
     col_sample_by_tree_max: float = 1.0
     col_sample_by_level_min: float = 0.3
     col_sample_by_level_max: float = 1.0
-    col_sample_by_node_min: float = 0.3
-    col_sample_by_node_max: float = 1.0
-    min_child_samples_min: int = 2
-    min_child_samples_max: int = 1000
     min_child_weight_min: float = 0.0
     min_child_weight_max: float = 10.0
     eta: float = 0.1
     steps_min: int = 2
     steps_max: int = 50000
     num_parallel_tree_min: int = 1
     num_parallel_tree_max: int = 3
     model_verbosity: int = 0
     model_objective: str = "multi:softprob"
     model_eval_metric: str = "mlogloss"
Empty file added bluecast/eda/__init__.py
133 changes: 133 additions & 0 deletions bluecast/eda/analyse.py
@@ -0,0 +1,133 @@
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def univariate_plots(df: pd.DataFrame, target: str) -> None:
    """
    Plots univariate plots for all the columns in the dataframe.
    Expects numeric columns only.
    """
    for col in df.columns:
        # Skip the target column in univariate analysis
        if col == target:
            continue

        plt.figure(figsize=(8, 4))

        # Histogram
        plt.subplot(1, 2, 1)
        sns.histplot(data=df, x=col, kde=True)
        plt.xlabel(col)
        plt.ylabel("Frequency")
        plt.title("Histogram")

        # Box plot
        plt.subplot(1, 2, 2)
        sns.boxplot(data=df, y=col)
        plt.ylabel(col)
        plt.title("Box Plot")

        # Adjust spacing between subplots
        plt.tight_layout()

        # Show the plots
        plt.show()


def bi_variate_plots(df: pd.DataFrame, target: str) -> None:
    """
    Plots bivariate plots for all column combinations in the dataframe.
    Expects numeric columns only.
    """
    # Get the list of column names except for the target column
    variables = [col for col in df.columns if col != target]

    # Define the grid layout based on the number of variables
    num_variables = len(variables)
    num_cols = 4  # Number of columns in the grid
    num_rows = (
        num_variables + num_cols - 1
    ) // num_cols  # Calculate the number of rows needed

    # Set the size of the figure; squeeze=False keeps axes two-dimensional
    # even when there is only a single row of subplots.
    fig, axes = plt.subplots(
        num_rows, num_cols, figsize=(12, 4 * num_rows), squeeze=False
    )

    # Generate violin plots for each variable with respect to the target
    for i, variable in enumerate(variables):
        row = i // num_cols
        col = i % num_cols
        ax = axes[row][col]

        sns.violinplot(data=df, x=target, y=variable, ax=ax)
        ax.set_xlabel(target)
        ax.set_ylabel(variable)
        ax.set_title(f"Violin Plot: {variable} vs {target}")

    # Remove any empty subplots
    if num_variables < num_rows * num_cols:
        for i in range(num_variables, num_rows * num_cols):
            fig.delaxes(axes.flatten()[i])

    # Adjust the spacing between subplots
    plt.tight_layout()

    # Show the plot
    plt.show()


def correlation_heatmap(df: pd.DataFrame) -> None:
    """
    Plots half of the heatmap showing correlations of all features.
    Expects numeric columns only.
    """
    # Calculate the correlation matrix
    corr = df.corr()

    # Generate a mask for the upper triangle
    mask = np.triu(np.ones_like(corr, dtype=bool))

    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(230, 20, as_cmap=True)

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(
        corr,
        mask=mask,
        cmap=cmap,
        vmax=0.3,
        center=0,
        square=True,
        linewidths=0.5,
        cbar_kws={"shrink": 0.5},
    )

    plt.show()


def correlation_to_target(df: pd.DataFrame, target: str) -> None:
    """
    Plots correlations for all the columns in the dataframe in relation to the target column.
    Expects numeric columns only.
    """
    # Calculate the correlation matrix
    corr = df.corr()

    # Get correlations to the target column, excluding the target itself
    corrs = corr[target].drop([target])

    # Sort correlation values in descending order
    corrs_sorted = corrs.sort_values(ascending=False)

    # Create a heatmap of the correlations with the target
    sns.set(font_scale=0.8)
    sns.set_style("white")
    sns.set_palette("PuBuGn_d")
    sns.heatmap(corrs_sorted.to_frame(), cmap="coolwarm", annot=True, fmt=".2f")
    plt.title(f"Correlation with {target}")
    plt.show()