Add feature normalization to the data processing pipeline. It will likely not affect tree-based models, but it is needed for neural networks, so it should be a standard step for all models.
I tried training the Norway model with normalized features, and with CustomXGBRegressor it gave the same results as with unnormalized features. I used MinMaxScaler from scikit-learn, as shown below. Normalization should still be implemented if we want to use other model architectures. One option is to let the user choose any scikit-learn scaler based on their feature distributions. In any case, we need to ensure that the metadata is not scaled. My implementation was as follows:
# Normalize features using min-max scaling from scikit-learn
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# Get arrays of features+metadata and targets for training, df_train_X is training feature+metadata dataframe and df_train_y is target training dataframe
X_train_unnorm, y_train = df_train_X.values, df_train_y.values
# Initialize scaler
scaler = MinMaxScaler()
# Extract metadata columns (the last three columns of the array) so that these are not normalized
metadata_columns = X_train_unnorm[:, -3:]
# Extract remaining columns to be normalized
feature_columns = X_train_unnorm[:, :-3]
# Apply MinMaxScaler to feature columns
scaled_feature_columns = scaler.fit_transform(feature_columns)
# Combine scaled columns with metadata columns
X_train = np.hstack((scaled_feature_columns, metadata_columns))
# Get arrays of test features+metadata and test targets, df_test_X is test feature+metadata dataframe and df_test_y is target test dataframe
X_test_unnorm, y_test = df_test_X.values, df_test_y.values
# Extract metadata columns
metadata_columns_test = X_test_unnorm[:, -3:]
# Extract feature columns
feature_columns_test = X_test_unnorm[:, :-3]
# Apply MinMaxScaler fit to training features to the feature columns in test dataset
scaled_feature_columns_test = scaler.transform(feature_columns_test)
# Combine scaled columns with metadata columns
X_test = np.hstack((scaled_feature_columns_test, metadata_columns_test))
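The suggestion above, letting the user pick any scikit-learn scaler while leaving metadata unscaled, could be sketched roughly as follows. This is only an illustration, not the pipeline's actual API: the helper name `scale_features`, the `metadata_columns` argument, and the example column names (`station_id`, `year`, `n_months`) are all hypothetical. Selecting metadata by column name rather than by position avoids relying on the metadata being the last three columns.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def scale_features(df_train, df_test, metadata_columns, scaler=None):
    """Hypothetical helper: fit a scaler on the training features only and
    apply it to both train and test. Metadata columns are left unscaled."""
    # Default to MinMaxScaler; clone a user-supplied scaler so the caller's
    # object is not modified by fitting.
    scaler = MinMaxScaler() if scaler is None else clone(scaler)
    feature_columns = [c for c in df_train.columns if c not in metadata_columns]

    df_train_scaled = df_train.copy()
    df_test_scaled = df_test.copy()
    # Fit on training data only, then reuse the fitted scaler for the test set.
    df_train_scaled[feature_columns] = scaler.fit_transform(df_train[feature_columns])
    df_test_scaled[feature_columns] = scaler.transform(df_test[feature_columns])
    return df_train_scaled, df_test_scaled, scaler


# Toy data with made-up column names, for illustration only.
metadata = ["station_id", "year", "n_months"]
df_train_X = pd.DataFrame({
    "temperature": [0.0, 5.0, 10.0],
    "precipitation": [100.0, 200.0, 300.0],
    "station_id": [1, 2, 3],
    "year": [2000, 2001, 2002],
    "n_months": [12, 12, 12],
})
df_test_X = pd.DataFrame({
    "temperature": [20.0],
    "precipitation": [150.0],
    "station_id": [4],
    "year": [2003],
    "n_months": [12],
})

# Default min-max scaling.
train_scaled, test_scaled, fitted_scaler = scale_features(df_train_X, df_test_X, metadata)

# Any other scikit-learn scaler can be passed in the same way.
train_std, test_std, _ = scale_features(df_train_X, df_test_X, metadata, scaler=StandardScaler())
```

Returning the fitted scaler alongside the data means the same transformation can later be applied to new data at prediction time. Note that test values outside the training range fall outside [0, 1] with MinMaxScaler, since the scaler is fit on the training set only.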