Make normalization of features standard practice in data processing #20

khsjursen · 2024-06-25T16:10:55Z

Add feature normalization to the data processing pipeline. It will likely not affect tree-based models, but neural networks need it, so it should be standard for all models.
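To illustrate why tree models are expected to be unaffected, here is a minimal sketch on synthetic data. It uses scikit-learn's DecisionTreeRegressor as a stand-in for the project's tree model (an assumption, not the model used in this repository), and the data, seed, and variable names are all made up for the example. Min-max scaling is a monotone transform of each feature, so the tree should find the same partitions of the samples either way.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical): 4 features on an arbitrary 0-100 scale
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 100.0, size=(200, 4))
y = 0.1 * X[:, 0] + np.sin(X[:, 1] / 10.0) + rng.normal(0.0, 0.1, size=200)

# Min-max scale every feature column
X_scaled = MinMaxScaler().fit_transform(X)

# Fit the same tree configuration on raw and on scaled features
tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_scaled, y)

# Compare predictions on the training samples
pred_raw = tree_raw.predict(X)
pred_scaled = tree_scaled.predict(X_scaled)
same_predictions = np.allclose(pred_raw, pred_scaled)
print("Predictions match:", same_predictions)
```

Gradient-based models such as neural networks have no such invariance, which is the reason for making normalization the default.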

khsjursen · 2024-08-05T18:56:01Z

I trained the Norway model with normalized features, and with CustomXGBRegressor the results were the same as with unnormalized features. I used MinMaxScaler from scikit-learn as shown below. Normalization should still be implemented if we want to use other model architectures. One option is to let the user choose any scikit-learn scaler based on their feature distributions. In any case, we need to ensure that metadata is not scaled. My implementation was as follows:

# Normalize features using min-max scaling from scikit-learn
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Get arrays of features+metadata and targets for training.
# df_train_X is the training feature+metadata dataframe, df_train_y is the training target dataframe.
X_train_unnorm, y_train = df_train_X.values, df_train_y.values

# Initialize scaler
scaler = MinMaxScaler()

# Extract metadata columns (the last three columns) so that they are not normalized
metadata_columns = X_train_unnorm[:, -3:]

# Extract remaining columns to be normalized
feature_columns = X_train_unnorm[:, :-3]

# Apply MinMaxScaler to feature columns
scaled_feature_columns = scaler.fit_transform(feature_columns)

# Combine scaled columns with metadata columns
X_train = np.hstack((scaled_feature_columns, metadata_columns))

# Get arrays of features+metadata and targets for testing.
# df_test_X is the test feature+metadata dataframe, df_test_y is the test target dataframe.
X_test_unnorm, y_test = df_test_X.values, df_test_y.values

# Extract metadata columns
metadata_columns_test = X_test_unnorm[:, -3:]

# Extract feature columns
feature_columns_test = X_test_unnorm[:, :-3]

# Transform the test feature columns with the scaler fitted on the training features (do not refit)
scaled_feature_columns_test = scaler.transform(feature_columns_test)

# Combine scaled columns with metadata columns
X_test = np.hstack((scaled_feature_columns_test, metadata_columns_test))
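The idea of letting the user choose the scaler could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `scale_features`, its `n_metadata` parameter, and the toy arrays are hypothetical and not part of the existing code base, and it assumes (as in the code above) that the metadata sits in the trailing columns.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def scale_features(X_train, X_test, scaler=None, n_metadata=3):
    """Scale feature columns; pass the trailing metadata columns through unchanged.

    Hypothetical helper: `scaler` can be any scikit-learn scaler instance.
    """
    if scaler is None:
        scaler = MinMaxScaler()
    # Index where the feature columns end and the metadata columns begin
    split = X_train.shape[1] - n_metadata
    # Fit on the training features only, then reuse the fitted scaler on the test set
    train_scaled = scaler.fit_transform(X_train[:, :split])
    test_scaled = scaler.transform(X_test[:, :split])
    # Reattach the untouched metadata columns
    X_train_out = np.hstack((train_scaled, X_train[:, split:]))
    X_test_out = np.hstack((test_scaled, X_test[:, split:]))
    return X_train_out, X_test_out, scaler


# Toy data (made up): 2 feature columns followed by 3 metadata columns
X_train_demo = np.array([[0.0, 10.0, 1.0, 2.0, 3.0],
                         [5.0, 20.0, 4.0, 5.0, 6.0],
                         [10.0, 30.0, 7.0, 8.0, 9.0]])
X_test_demo = np.array([[2.5, 40.0, 10.0, 11.0, 12.0]])

# Default min-max scaling, then the same call with a different scaler
X_train_mm, X_test_mm, _ = scale_features(X_train_demo, X_test_demo)
X_train_std, X_test_std, _ = scale_features(
    X_train_demo, X_test_demo, scaler=StandardScaler()
)
```

Because the scaler is fitted on the training set only, test values outside the training range fall outside [0, 1] with MinMaxScaler. In the toy data, the test value 40.0 in the second column maps to 1.5.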

khsjursen added the enhancement label (New feature or request) on Jun 25, 2024
