Make normalization of features standard practice in data processing #20

Open
khsjursen opened this issue Jun 25, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@khsjursen
Collaborator

Add feature normalization to the data processing pipeline. It will likely not affect tree models, but it is needed for neural networks, so we should make it standard for all models.
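The expectation that tree models are unaffected can be checked with a small experiment. The following is only a sketch under stated assumptions: it uses scikit-learn's DecisionTreeRegressor as a stand-in for a tree model (not the project's CustomXGBRegressor) and synthetic data with features on very different scales. All variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): three features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3)) * np.array([1.0, 10.0, 1000.0])
y = X[:, 0] + 0.1 * X[:, 1] + 0.001 * X[:, 2] + rng.normal(scale=0.1, size=300)
X_train, X_test = X[:200], X[200:]
y_train = y[:200]

# Fit the scaler on the training features only
scaler = MinMaxScaler().fit(X_train)

# Train one tree on raw features and one on min-max scaled features
tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0)
tree_raw.fit(X_train, y_train)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0)
tree_scaled.fit(scaler.transform(X_train), y_train)

# Compare predictions on the held-out rows
pred_raw = tree_raw.predict(X_test)
pred_scaled = tree_scaled.predict(scaler.transform(X_test))
predictions_match = np.allclose(pred_raw, pred_scaled)
print("Predictions match:", predictions_match)
```

Min-max scaling is a monotonic transformation of each feature, so the ordering of samples along each feature, and therefore the set of candidate splits, is preserved. Models that rely on distances or gradient-based optimization, such as neural networks, do not have this property.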

@khsjursen khsjursen added the enhancement New feature or request label Jun 25, 2024
@khsjursen
Collaborator Author

khsjursen commented Aug 5, 2024

I tried training the Norway model with normalized features using CustomXGBRegressor, and it gave the same results as with unnormalized features. I used MinMaxScaler from scikit-learn as shown below. Normalization should still be implemented if we want to use other model architectures. One option is to let the user choose any scaler from scikit-learn based on their feature distributions. In any case, we need to ensure that metadata is not scaled. My implementation was as follows:

# Normalize features using min-max scaling from scikit-learn
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Get arrays of features+metadata and targets for training, df_train_X is training feature+metadata dataframe and df_train_y is target training dataframe
X_train_unnorm, y_train = df_train_X.values, df_train_y.values

# Initialize scaler
scaler = MinMaxScaler()

# Extract metadata columns (the last three columns) so that these are not normalized
metadata_columns = X_train_unnorm[:, -3:]

# Extract the remaining columns (the features) to be normalized
feature_columns = X_train_unnorm[:, :-3]

# Apply MinMaxScaler to feature columns
scaled_feature_columns = scaler.fit_transform(feature_columns)

# Combine scaled columns with metadata columns
X_train = np.hstack((scaled_feature_columns, metadata_columns))

# Get arrays of test features+metadata and test targets, df_test_X is test feature+metadata dataframe and df_test_y is target test dataframe
X_test_unnorm, y_test = df_test_X.values, df_test_y.values

# Extract metadata columns
metadata_columns_test = X_test_unnorm[:, -3:]

# Extract feature columns
feature_columns_test = X_test_unnorm[:, :-3]

# Apply the MinMaxScaler fitted on the training features to the test feature columns (transform only, no refitting)
scaled_feature_columns_test = scaler.transform(feature_columns_test)

# Combine scaled columns with metadata columns
X_test = np.hstack((scaled_feature_columns_test, metadata_columns_test))
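The idea of letting the user choose the scaler could be sketched as a small helper. This is a sketch, not a proposed final API: the function name `scale_features` and the parameter `n_metadata` are hypothetical, and it assumes, as in the snippet above, that the metadata occupies the last columns of a numeric array.

```python
import numpy as np
from sklearn.base import clone
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def scale_features(X_train, X_test, scaler=None, n_metadata=3):
    """Scale feature columns with any scikit-learn scaler, leaving metadata as-is.

    Hypothetical helper: assumes the last `n_metadata` columns are metadata.
    """
    # Default to min-max scaling; clone so the caller's scaler object is not modified
    scaler = MinMaxScaler() if scaler is None else clone(scaler)
    split = X_train.shape[1] - n_metadata

    # Fit on the training features only, then reuse the fitted scaler for the test set
    train_scaled = scaler.fit_transform(X_train[:, :split])
    test_scaled = scaler.transform(X_test[:, :split])

    # Re-attach the untouched metadata columns
    X_train_out = np.hstack((train_scaled, X_train[:, split:]))
    X_test_out = np.hstack((test_scaled, X_test[:, split:]))
    return X_train_out, X_test_out, scaler


# Usage with toy data (assumption): 2 feature columns followed by 3 metadata columns
rng = np.random.default_rng(0)
toy_train = np.hstack((rng.normal(5, 2, size=(8, 2)), np.arange(24.0).reshape(8, 3)))
toy_test = np.hstack((rng.normal(5, 2, size=(4, 2)), np.arange(12.0).reshape(4, 3)))

toy_train_scaled, toy_test_scaled, fitted_scaler = scale_features(
    toy_train, toy_test, scaler=StandardScaler()
)
```

Returning the fitted scaler allows the same transformation to be applied later to new data. scikit-learn's ColumnTransformer with `remainder="passthrough"` might be an alternative, though it places the passed-through columns after the transformed ones, so the column order would need checking.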
